Maximizing genetic differentiation in core collections by PCA-based clustering of molecular marker data

van Heerwaarden, Joost; Odong, T. L.; van Eeuwijk, F. A.

doi:10.1007/s00122-012-2016-2

Original Paper
Published: 21 November 2012

Volume 126, pages 763–772 (2013)
Theoretical and Applied Genetics

Abstract

Developing genetically diverse core sets is key to the effective management and use of crop genetic resources. Core selection increasingly uses molecular marker-based dissimilarity and clustering methods, under the implicit assumption that markers and genes of interest are genetically correlated. In practice, low marker densities mean that genome-wide correlations are mainly caused by genetic differentiation, rather than by physical linkage. Although of central concern, genetic differentiation per se is not specifically targeted by most commonly employed dissimilarity and clustering methods. Principal component analysis (PCA) on genotypic data is known to effectively describe the inter-locus correlations caused by differentiation, but to date there has been no evaluation of its application to core selection. Here, we explore PCA-based clustering of marker data as a basis for core selection, with the aim of demonstrating its use in capturing genetic differentiation in the data. Using simulated datasets, we show that replacing full-rank genotypic data by the subset of genetically significant PCs leads to better description of differentiation and improves assignment of genotypes to their population of origin. We test the effectiveness of differentiation as a criterion for the formation of core sets by applying a simple new PCA-based core selection method to simulated and actual data and comparing its performance to one of the best existing selection algorithms. We find that although gains in genetic diversity are generally modest, PCA-based core selection is equally effective at maximizing diversity at non-marker loci, while providing better representation of genetically differentiated groups.
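The general idea of clustering accessions on a reduced set of principal components and then sampling from each cluster can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It makes several assumptions: genotypes are simulated for two hypothetical populations with opposite allele frequencies, the number of retained PCs is fixed by hand rather than chosen by a significance test, clustering uses plain k-means, and the core takes the accession nearest each cluster centre. All function names are illustrative.

```python
import numpy as np


def simulate_genotypes(n_per_pop=10, n_markers=50, seed=0):
    """Simulate two hypothetical populations with opposite allele frequencies."""
    rng = np.random.default_rng(seed)
    freqs = np.array([[0.9] * n_markers, [0.1] * n_markers])
    labels = np.repeat([0, 1], n_per_pop)
    # Diploid genotypes coded as allele counts (0, 1 or 2) per marker.
    geno = rng.binomial(2, freqs[labels]).astype(float)
    return geno, labels


def pc_scores(geno, n_pcs):
    """Centre and scale each marker, then return scores on the leading PCs."""
    p = geno.mean(axis=0) / 2.0          # estimated allele frequency per marker
    keep = (p > 0) & (p < 1)             # drop monomorphic markers
    z = (geno[:, keep] - 2 * p[keep]) / np.sqrt(p[keep] * (1 - p[keep]))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def kmeans(x, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centres = [x[0]]
    for _ in range(1, k):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(x[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        new = np.array([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new, centres):
            break
        centres = new
    # Final assignment against the final centres.
    dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centres


def select_core(x, assign, centres):
    """Pick, from each non-empty cluster, the accession nearest its centre."""
    core = []
    for j in range(len(centres)):
        members = np.flatnonzero(assign == j)
        if members.size == 0:
            continue
        d = np.sum((x[members] - centres[j]) ** 2, axis=1)
        core.append(int(members[np.argmin(d)]))
    return sorted(core)


geno, pops = simulate_genotypes()
scores = pc_scores(geno, n_pcs=1)        # assumption: one PC retained by hand
assign, centres = kmeans(scores, k=2)
core = select_core(scores, assign, centres)
print("core accessions:", core, "populations:", pops[core].tolist())
```

With populations this strongly differentiated, the first PC alone separates them, so each cluster contributes one accession from a different population. Real marker data would need a principled choice of the number of PCs and clusters.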

Acknowledgments

The authors wish to thank Carmen de Vicente, former leader of subprogram 5 of the Generation Challenge Program (GCP), for providing financial support (GCP 4008.23) and guidance. We thank Diego Ortega Del Vecchyo for contributing software and three anonymous reviewers for comments on earlier versions of the manuscript.

Author information

Authors and Affiliations

Biometris, Wageningen UR, Wageningen, The Netherlands
Joost van Heerwaarden, T. L. Odong & F. A. van Eeuwijk

Corresponding author

Correspondence to Joost van Heerwaarden.

Additional information

Communicated by G. Bryan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 205 kb)

About this article

Cite this article

van Heerwaarden, J., Odong, T.L. & van Eeuwijk, F.A. Maximizing genetic differentiation in core collections by PCA-based clustering of molecular marker data. Theor Appl Genet 126, 763–772 (2013). https://doi.org/10.1007/s00122-012-2016-2

Received: 20 October 2011
Accepted: 05 November 2012
Published: 21 November 2012
Issue Date: March 2013
DOI: https://doi.org/10.1007/s00122-012-2016-2
