Abstract
High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene has been widely used to characterize the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units (OTUs), which are then used to calculate statistical indices representing the degree of species diversity in a given sample. Several algorithms have been developed to perform this clustering, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, Taxonomy-Based Clustering (TBC). The algorithm incorporates the basic concept of prokaryotic taxonomy, in which species are formed solely through comparisons to the type strain, and it omits full-scale multiple sequence alignment. The clustering quality of the proposed method was compared with that of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to that obtained using MOTHUR and ESPRIT-Tree, and that TBC is computationally efficient. The program was written in Java and is available from http://sw.ezbiocloud.net/tbc.
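The idea of forming clusters only through comparisons to a type sequence can be illustrated with a minimal Java sketch. It is not the TBC implementation: the abstract does not specify how the type sequence is chosen or how similarity is computed. Here the most abundant unassigned read is assumed to stand in for the type strain, and similarity is assumed to be one minus the edit distance divided by the longer sequence length. The class name `TypeSequenceClusteringSketch`, the methods `similarity` and `cluster`, the toy reads, and the 0.85 threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the published TBC code.
class TypeSequenceClusteringSketch {

    // Assumed similarity measure: 1 - (edit distance / longer length).
    static double similarity(String a, String b) {
        int n = a.length(), m = b.length();
        if (n == 0 && m == 0) return 1.0;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return 1.0 - (double) prev[m] / Math.max(n, m);
    }

    // Each cluster is seeded by one "type" sequence; every other sequence is
    // compared to that type sequence only, never to other cluster members.
    static List<List<String>> cluster(List<String> reads, double threshold) {
        // Collapse identical reads and count their abundance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String r : reads) counts.merge(r, 1, Integer::sum);

        // Assumption: the most abundant sequence acts as the type sequence.
        // The sort is stable, so ties keep their input order.
        List<String> unique = new ArrayList<>(counts.keySet());
        unique.sort((x, y) -> counts.get(y) - counts.get(x));

        boolean[] assigned = new boolean[unique.size()];
        List<List<String>> clusters = new ArrayList<>();
        for (int i = 0; i < unique.size(); i++) {
            if (assigned[i]) continue;
            String type = unique.get(i);
            List<String> members = new ArrayList<>();
            members.add(type); // first element of each cluster is its type sequence
            assigned[i] = true;
            for (int j = i + 1; j < unique.size(); j++) {
                if (!assigned[j] && similarity(type, unique.get(j)) >= threshold) {
                    members.add(unique.get(j));
                    assigned[j] = true;
                }
            }
            clusters.add(members);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> reads = List.of(
                "ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC");
        for (List<String> c : cluster(reads, 0.85)) {
            System.out.println("type=" + c.get(0) + " members=" + c);
        }
    }
}
```

Because each candidate is compared only against the current type sequence, the sketch needs no multiple sequence alignment and no all-against-all distance matrix. This is the property the abstract attributes to TBC, although the real program may differ in every other detail.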
Additional information
Supplemental material for this article may be found at http://www.springer.com/content/120956
About this article
Cite this article
Lee, JH., Yi, H., Jeon, YS. et al. TBC: A clustering algorithm based on prokaryotic taxonomy. J Microbiol. 50, 181–185 (2012). https://doi.org/10.1007/s12275-012-1214-6
DOI: https://doi.org/10.1007/s12275-012-1214-6