
TBC: A clustering algorithm based on prokaryotic taxonomy

The Journal of Microbiology

Abstract

High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene is widely used to characterize the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units (OTUs), which are then used to calculate statistical indices that represent the degree of species diversity in a given sample. Several algorithms have been developed to perform this clustering, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, Taxonomy-Based Clustering (TBC). The algorithm incorporates the basic concept of prokaryotic taxonomy, in which species are formed solely from comparisons to the type strain, and it omits full-scale multiple sequence alignment. The clustering quality of TBC was compared with that of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to that obtained using MOTHUR and ESPRIT-Tree, and that TBC is computationally efficient. The program was written in Java and is available from http://sw.ezbiocloud.net/tbc.
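The idea of comparing each read only to a cluster's type sequence, rather than to every other read, can be illustrated with a minimal Java sketch. This is not the published TBC implementation: the class name `TypeStrainClusteringSketch`, the methods `identity` and `cluster`, the rule that the first read of a cluster becomes its type sequence, and the position-by-position identity measure (a stand-in for real pairwise alignment) are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; NOT the published TBC implementation. */
final class TypeStrainClusteringSketch {

    /**
     * Hypothetical similarity measure: matching positions in the overlapping
     * prefix, divided by the longer length. A real tool would use pairwise
     * sequence alignment instead.
     */
    static double identity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0;
        }
        int overlap = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < overlap; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / longer;
    }

    /**
     * Greedy clustering in which each read is compared only to the type
     * sequence (element 0) of every existing cluster. Assumption: the first
     * read placed in a cluster serves as its type sequence.
     */
    static List<List<String>> cluster(List<String> reads, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String read : reads) {
            List<String> best = null;
            double bestScore = -1.0;
            for (List<String> candidate : clusters) {
                double score = identity(read, candidate.get(0));
                if (score >= threshold && score > bestScore) {
                    best = candidate;
                    bestScore = score;
                }
            }
            if (best == null) {
                // No type sequence is similar enough: the read founds a new cluster.
                best = new ArrayList<>();
                clusters.add(best);
            }
            best.add(read);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> reads = List.of(
                "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA");
        List<List<String>> clusters = cluster(reads, 0.85);
        for (List<String> c : clusters) {
            System.out.println("type=" + c.get(0) + " members=" + c);
        }
    }
}
```

Because members are never compared with one another, the number of comparisons grows with the number of reads times the number of clusters rather than with the square of the number of reads, which is the kind of saving the type-strain concept suggests. The threshold value used in `main` is arbitrary and chosen only to make the toy reads group in pairs.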



Corresponding author

Correspondence to Jongsik Chun.

Additional information

Supplemental material for this article may be found at http://www.springer.com/content/120956


Cite this article

Lee, JH., Yi, H., Jeon, YS. et al. TBC: A clustering algorithm based on prokaryotic taxonomy. J Microbiol. 50, 181–185 (2012). https://doi.org/10.1007/s12275-012-1214-6


  • DOI: https://doi.org/10.1007/s12275-012-1214-6
