Abstract
Key message
We evaluated several methods for computing shrinkage estimates of the genomic relationship matrix and demonstrated their potential to enhance the reliability of genomic estimated breeding values of training set individuals.
Abstract
In genomic prediction in plant breeding, the training set constitutes a large fraction of the total number of genotypes assayed and is itself subject to selection. The objective of our study was to investigate whether genomic estimated breeding values (GEBVs) of individuals in the training set can be enhanced by shrinkage estimation of the genomic relationship matrix. We simulated two different population types: a diversity panel of unrelated individuals and a biparental family of doubled haploid lines. For different training set sizes (50, 100, 200), numbers of markers (50, 100, 200, 500, 2,500) and heritabilities (0.25, 0.5, 0.75), shrinkage coefficients were computed by four different methods. Two of these methods are novel and based on measures of linkage disequilibrium (LD); the other two were previously described in the literature, and we extended one of them. Our results showed that shrinkage estimation of the genomic relationship matrix can significantly improve the reliability of the GEBVs of training set individuals, especially for a low number of markers. We demonstrate that the number of markers is the primary determinant of the optimum shrinkage coefficient maximizing the reliability, and we recommend methods eligible for routine usage in practical applications.
References
Astle W, Balding DJ (2009) Population structure and cryptic relatedness in genetic association studies. Stat Sci 24(4):451–471. doi:10.1214/09-STS307. http://projecteuclid.org/euclid.ss/1271770342, arXiv:1010.4681v1
Bernardo R, Yu J (2007) Prospects for genomewide selection for quantitative traits in Maize. Crop Sci 47(3):1082. doi:10.2135/cropsci2006.11.0690. https://www.crops.org/publications/cs/abstracts/47/3/1082
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193(2):327–45. doi:10.1534/genetics.112.143313. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3567727&tool=pmcentrez&rendertype=abstract
Dekkers JCM (2007) Prediction of response to marker-assisted and genomic selection using selection index theory. J Anim Breed Genet 124(6):331–41. doi:10.1111/j.1439-0388.2007.00701.x. http://www.ncbi.nlm.nih.gov/pubmed/18076470
Endelman JB (2011) Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome J 4(3):250. doi:10.3835/plantgenome2011.08.0024. https://www.crops.org/publications/tpg/abstracts/4/3/250
Endelman JB, Jannink JL (2012) Shrinkage estimation of the realized relationship matrix. G3 2(11):1405–13. doi:10.1534/g3.112.004259. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3484671&tool=pmcentrez&rendertype=abstract
Frisch M, Melchinger AE (2007) Variance of the parental genome contribution to inbred lines derived from biparental crosses. Genetics 176(1):477–88. doi:10.1534/genetics.106.065433. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1893034&tool=pmcentrez&rendertype=abstract
Goddard ME, Wray NR, Verbyla K, Visscher PM (2009) Estimating effects and making predictions from genome-wide marker data. Stat Sci 24(4):517–529. doi:10.1214/09-STS306. http://projecteuclid.org/euclid.ss/1271770346, arXiv:1010.4710v1
Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128(6):409–21. doi:10.1111/j.1439-0388.2011.00964.x. http://www.ncbi.nlm.nih.gov/pubmed/22059574
Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177(4):2389–97. doi:10.1534/genetics.107.081190. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2219482&tool=pmcentrez&rendertype=abstract
Habier D, Fernando RL, Garrick DJ (2013) Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194(3):597–607. doi:10.1534/genetics.113.152207. http://www.ncbi.nlm.nih.gov/pubmed/23640517
Hayes B, Goddard M (2010) Genome-wide association and genomic selection in animal breeding. Genome 53(11):876–83. doi:10.1139/G10-076. http://www.ncbi.nlm.nih.gov/pubmed/21076503
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME (2009) Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci 92(2):433–43. doi:10.3168/jds.2008-1646. http://www.ncbi.nlm.nih.gov/pubmed/19164653
Henderson CR (1973) Sire evaluation and genetic trends. J Anim Sci, pp 10–41
Hill W, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38(6):226–231. http://link.springer.com/article/10.1007/BF01245622
Hill WG (2010) Understanding and using quantitative genetic variation. Philos Trans R Soc Lond Ser B Biol Sci 365(1537):73–85. doi:10.1098/rstb.2009.0203. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2842708&tool=pmcentrez&rendertype=abstract
Hill WG, Weir BS (2011) Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res 93(1):47–64. doi:10.1017/S0016672310000480. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3070763&tool=pmcentrez&rendertype=abstract
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E (2008) Efficient control of population structure in model organism association mapping. Genetics 178(3):1709–23. doi:10.1534/genetics.107.080101. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2278096&tool=pmcentrez&rendertype=abstract
Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3):743–56. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1203965&tool=pmcentrez&rendertype=abstract
Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits, 1st edn. Sinauer Associates, Sunderland
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829. http://www.genetics.org/content/157/4/1819.abstract
Montana G (2005) HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics 21(23):4309–11. doi:10.1093/bioinformatics/bti689. http://www.ncbi.nlm.nih.gov/pubmed/16188927
Powell JE, Visscher PM, Goddard ME (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet 11(11):800–5. doi:10.1038/nrg2865. http://www.ncbi.nlm.nih.gov/pubmed/20877324
R Core Team (2014) R: a language and environment for statistical computing. http://www.r-project.org/
Riedelsheimer C, Melchinger AE (2013) Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor Appl Genet 126(11):2835–48. doi:10.1007/s00122-013-2175-9. http://www.ncbi.nlm.nih.gov/pubmed/23982591
Riedelsheimer C, Technow F, Melchinger AE (2012) Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines. BMC Genomics 13(1):452. doi:10.1186/1471-2164-13-452. http://www.mendeley.com/research/comparison-of-whole-genome-prediction-models-for-traits-with-contrasting-genetic-architecture-in-a-d-1/
Searle SR, Casella G, McCulloch CE (1992) Variance components, 1st edn. Wiley-Interscience, Hoboken
Smith JSC, Hussain T, Jones ES, Graham G, Podlich D, Wall S, Williams M (2008) Use of doubled haploids in maize breeding: implications for intellectual property protection and genetic diversity in hybrid crops. Mol Breed 22(1):51–59. doi:10.1007/s11032-007-9155-1. http://link.springer.com/10.1007/s11032-007-9155-1
Technow F (2013) hypred: simulation of genomic data in applied genetics. http://cran.r-project.org/web/packages/hypred/
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–23. doi:10.3168/jds.2007-0980. http://www.ncbi.nlm.nih.gov/pubmed/18946147
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42(7):565–9. doi:10.1038/ng.608. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3232052&tool=pmcentrez&rendertype=abstract
Conflict of interest
The authors declare no conflict of interest associated with this study.
Ethical standards
The authors declare that ethical standards are met, and all the experiments comply with the current laws of the country in which they were performed.
Additional information
Communicated by Hiroyoshi Iwata.
Appendix
In this appendix, we describe Methods effLD and RM in detail.
Method effLD
In order to account for the genetic variance explained by markers beyond the ones immediately adjacent to QTL, we devised a measure for effective LD (\({\text {LD}}_{\text {eff}}\)). Because QTL genotypes are generally unobservable, we use marker loci as a proxy.
Suppose that \(M\) biallelic markers are located on a chromosomal segment, where \(p_i\) is the estimated frequency of the major allele at the \(i\)th marker. The LD between markers \(i\) and \(j\) can be computed according to Hill and Robertson (1968) as
$$\begin{aligned} r_{ij} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i p_j \left( 1 - p_i \right) \left( 1 - p_j \right) }} \end{aligned}$$
where \(p_{ij}\) is the joint probability of the major allele occurring at both marker loci \(i\) and \(j\). \({\text {LD}}_{\text {eff}}\) is then calculated as follows. For each chromosome,
1. compute \(p_{ij}\) for all marker pairs as \(p_{ij} = r_{ij} \sqrt{p_{i} p_j \left( 1 - p_{i} \right) \left( 1 - p_j \right) } + p_{i} p_j\)
2. compute the covariance matrix \({\mathbf {\Sigma }} = \left\{ \Sigma _{ij} \right\}\) by solving the equations \({\mathrm {\Phi }}\!\left( z(p_i), z(p_j) ; \Sigma _{ij} \right) = p_{ij}\) for \(\Sigma _{ij}\) for all marker pairs, where \({\mathrm {\Phi }}\) is the cumulative distribution function of the standard bivariate normal distribution with mean zero and covariance \(\Sigma _{ij}\) and \(z(p_i)\) refers to the \(p_i{\text {th}}\) quantile of the univariate standard normal distribution (Montana 2005).
3. compute the conditional variance for each locus \(i\), given all others, as \(\sigma _i = {\mathbf {\Sigma _{i,i}}} - {\mathbf {\Sigma _{i , -i}}} {\mathbf {\Sigma ^{-1}_{-i,-i}}} {\mathbf {\Sigma _{-i, i}}}.\) Here, the subscript \(\varvec{\mathrm {i}}\) denotes the \(i\)th row or column, whereas \(\varvec{\mathrm {-i}}\) denotes all but the \(i\)th row or column. Considering now the \(i\)th locus as a QTL, we imagine a hypothetical marker locus \(h\) in the proximity that would effectively lead to the same conditional variance at the \(i\)th locus.
4. compute \(p^*_{ih} = {\mathrm {\Phi }}\!\left( z(p_i), 0 ; \sqrt{1 - \sigma _i} \right)\)
5. compute the effective LD for each locus \(i\) of all loci (\(L\)) as
$$\begin{aligned} {\text {LD}}_{\text {eff}} =\sum _{i = 1}^{L} \frac{ \left( p^*_{ih} - 0.5 p_i \right) ^2 }{0.5 p_i \left( 1 - p_i \right) \left( 1 - 0.5 \right) } \end{aligned}$$ (8)
6. take the average across all loci on the same chromosome
Finally, take the average across all chromosomes. Intuitively, \({\text {LD}}_{\text {eff}}\) would be the average coefficient of LD that would be observed between a QTL and a hypothetical marker with 0.5 allele frequency that would reduce the variance of the QTL genotype from \({\mathbf {\Sigma _{i,i}}}\) to \(\sigma _i\).
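The per-chromosome steps of Method effLD can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`bvn_cdf`, `solve_latent_corr`, `eff_ld`) are hypothetical, the bivariate normal CDF is evaluated with Plackett's identity and Gauss-Legendre quadrature rather than a library routine, latent correlations are found by bisection on a range clipped to ±0.999, and the per-locus terms of Eq. 8 are averaged across loci (one reading of steps 5 and 6). The sketch also assumes that the pairwise-solved matrix \({\mathbf {\Sigma }}\) is positive definite, which is not guaranteed in general.

```python
import math
from statistics import NormalDist

import numpy as np

_STD = NormalDist()
# Gauss-Legendre nodes and weights on [-1, 1], reused for every integral.
_NODES, _WEIGHTS = np.polynomial.legendre.leggauss(64)


def bvn_cdf(a, b, rho):
    """P(X <= a, Y <= b) for a standard bivariate normal with correlation rho.

    Plackett's identity: Phi2(a, b; rho) = Phi(a) * Phi(b)
    + integral from 0 to rho of the bivariate density phi2(a, b; r) dr.
    """
    r = 0.5 * rho * (_NODES + 1.0)  # map nodes from [-1, 1] to [0, rho]
    s = 1.0 - r ** 2
    dens = np.exp(-(a * a - 2.0 * a * b * r + b * b) / (2.0 * s))
    dens /= 2.0 * math.pi * np.sqrt(s)
    return _STD.cdf(a) * _STD.cdf(b) + 0.5 * rho * float(np.dot(_WEIGHTS, dens))


def solve_latent_corr(p_i, p_j, p_ij, tol=1e-10):
    """Step 2: solve Phi2(z(p_i), z(p_j); Sigma_ij) = p_ij by bisection.

    Phi2 is increasing in the correlation, so bisection is sufficient.
    """
    a, b = _STD.inv_cdf(p_i), _STD.inv_cdf(p_j)
    lo, hi = -0.999, 0.999  # assumption: clipped search range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bvn_cdf(a, b, mid) < p_ij:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eff_ld(p, r):
    """Steps 1-6 for one chromosome.

    p: major-allele frequencies (length M); r: M x M matrix of LD coefficients r_ij.
    """
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    m = len(p)
    sigma = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            # Step 1: joint frequency of the major alleles at loci i and j.
            p_ij = r[i, j] * math.sqrt(p[i] * p[j] * (1 - p[i]) * (1 - p[j])) + p[i] * p[j]
            sigma[i, j] = sigma[j, i] = solve_latent_corr(p[i], p[j], p_ij)
    # Step 3: conditional variance of each locus given all others; the Schur
    # complement in the text equals 1 / diag(Sigma^{-1}).
    cond_var = 1.0 / np.diag(np.linalg.inv(sigma))
    terms = []
    for p_i, s_i in zip(p, cond_var):
        # Step 4: joint frequency with a hypothetical marker at frequency 0.5.
        rho_h = min(0.999, math.sqrt(max(0.0, 1.0 - s_i)))
        p_star = bvn_cdf(_STD.inv_cdf(p_i), 0.0, rho_h)
        # Step 5: per-locus term of Eq. 8.
        terms.append((p_star - 0.5 * p_i) ** 2 / (0.5 * p_i * (1 - p_i) * (1 - 0.5)))
    # Step 6 (assumed reading): average across loci on the chromosome.
    return float(np.mean(terms))
```

As a sanity check, two markers with allele frequency 0.5 and \(r_{12} = 0.5\) give `eff_ld([0.5, 0.5], [[1, 0.5], [0.5, 1]])` of approximately 0.25, i.e., \(r_{12}^2\), because each marker is then the only source of information about the other.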
Method RM
We use the model and notation of Dekkers (2007)
$$\begin{aligned} Y_i = G_i + E_i = Q_i + R_i + E_i = \widehat{Q}_i + e_i + R_i + E_i \end{aligned}$$
where the phenotypic value \(Y_i\) of the \(i\)th individual is decomposed into its genetic value \(G_i\) and an environmental deviate \(E_i\). The genetic value is further partitioned into QTL effects \(Q_i\) that are associated with markers through LD and effects \(R_i\) that are independent of markers. The effects \(Q_i\) can be further subdivided into a prediction \(\widehat{Q}_i\) and a prediction error \(e_i\), both being uncorrelated with one another.
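A selection index combining phenotypic data and GEBVs can be constructed as \({\mathbf {b}} = {\mathbf {P}}^{-1}{\mathbf {G}}\), e.g., Lande and Thompson (1990), where
$$\begin{aligned} {\mathbf {P}} = \begin{pmatrix} {\text {var}}(Y_i) & {\text {cov}}(Y_i , \widehat{Q}_i) \\ {\text {cov}}(\widehat{Q}_i , Y_i) & {\text {var}}(\widehat{Q}_i) \end{pmatrix} , \qquad {\mathbf {G}} = \begin{pmatrix} {\text {cov}}(Y_i , G_i) \\ {\text {cov}}(\widehat{Q}_i , G_i) \end{pmatrix} \end{aligned}$$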
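Without loss of generality, we assume \(\sigma _G^2 = {\text {var}}(G_i) = 1\) and \(\sigma _P^2 = {\text {var}}(Y_i) = \frac{1}{h^2}\). Also, let \(q^2 = {\text {var}}(Q_i)\) be the proportion of genetic variance contributed by QTL that are in LD with markers. Then
$$\begin{aligned} r_{\widehat{Q}_i} = \frac{{\text {cov}}(\widehat{Q}_i , Q_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i}} = \frac{{\text {cov}}(\widehat{Q}_i , \widehat{Q}_i + e_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i}} = \frac{\sigma _{\widehat{Q}_i}}{\sigma _{Q_i}} \end{aligned}$$
where the last equality follows from the uncorrelatedness of the predictor \(\widehat{Q}_i\) with the model residual \(e_i\). Thus, \(r_{\widehat{Q}_i}^2 = \frac{\sigma _{\widehat{Q}_i}^2}{\sigma _{Q_i}^2}\) is the proportion of genetic variance contributed by \(Q_i\) that is explained by the GEBV \(\widehat{Q}_i\). Assuming \(r \left( \widehat{Q}_i , R_i \right) = 0\), we obtain
$$\begin{aligned} {\text {cov}}(\widehat{Q}_i , G_i) = {\text {cov}}(\widehat{Q}_i , Q_i) + {\text {cov}}(\widehat{Q}_i , R_i) = {\text {cov}}(\widehat{Q}_i , Q_i) = \sigma _{\widehat{Q}_i}^2 \end{aligned}$$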
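With this, we obtain \({\text {cov}}(\widehat{Q}_i , G_i) = q^2 r_{\widehat{Q}_i}^2\). Since \({\text {cov}}(Y_i , G_i) = 1\), we have
$$\begin{aligned} {\mathbf {G}} = \begin{pmatrix} 1 \\ q^2 r_{\widehat{Q}_i}^2 \end{pmatrix} \end{aligned}$$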
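Further, we have \({\text {var}}(\widehat{Q}_i) = q^2 r_{\widehat{Q}_i}^2\) and \({\text {var}}(Y_i) = \frac{1}{h^2}\). Assuming that \(\widehat{Q}_i\) and \(E_i\) are uncorrelated, i.e., \(r \left( \widehat{Q}_i , E_i \right) = 0\), we have \({\text {cov}}(\widehat{Q}_i , Y_i) = {\text {cov}}(\widehat{Q}_i , G_i) = q^2 r_{\widehat{Q}_i}^2\). Hence,
$$\begin{aligned} {\mathbf {P}} = \begin{pmatrix} \frac{1}{h^2} & q^2 r_{\widehat{Q}_i}^2 \\ q^2 r_{\widehat{Q}_i}^2 & q^2 r_{\widehat{Q}_i}^2 \end{pmatrix} \end{aligned}$$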
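By multiplying \({\mathbf {P}}^{-1}\) and \({\mathbf {G}}\), we obtain
$$\begin{aligned} {\mathbf {b}} = \begin{pmatrix} b_Y \\ b_{\widehat{Q}} \end{pmatrix} = \frac{1}{1 - h^2 q^2 r_{\widehat{Q}_i}^2} \begin{pmatrix} h^2 \left( 1 - q^2 r_{\widehat{Q}_i}^2 \right) \\ 1 - h^2 \end{pmatrix} \end{aligned}$$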
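In particular, for the ratio of the index weight \(b_{\widehat{Q}}\) on the GEBV to the index weight \(b_Y\) on the phenotype, we have
$$\begin{aligned} \frac{b_{\widehat{Q}}}{b_Y} = \frac{\frac{1}{h^2} - 1}{1 - q^2 r_{\widehat{Q}_i}^2} \end{aligned}$$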
This is equivalent to Eq. 3 in Lande and Thompson (1990). The quantity \(q^2 r_{\widehat{Q}_i}^2\) is equal to \(r_{\text {MG}}^2\) in Dekkers (2007), which is the proportion of genetic variance that is explained by the GEBV. In practice, this parameter can be estimated using cross-validation as the squared predictive ability. In particular, we used fivefold cross-validation with five replications to estimate \(r_{\text {MG}}^2\) from the training set. The assumptions \(r \left( \widehat{Q}_i , R_i \right) = 0\) and \(r \left( \widehat{Q}_i , E_i \right) = 0\) are obviously not fulfilled with finite population sizes, as was validated by means of simulation.
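The index weights can also be checked numerically from the elements of \({\mathbf {P}}\) and \({\mathbf {G}}\) given in the text, namely \({\text {var}}(Y_i) = 1/h^2\), \({\text {cov}}(Y_i , G_i) = 1\) and \({\text {var}}(\widehat{Q}_i) = {\text {cov}}(\widehat{Q}_i , Y_i) = {\text {cov}}(\widehat{Q}_i , G_i) = r_{\text {MG}}^2\). The following Python sketch is illustrative only: the function names and the example values \(h^2 = 0.5\) and \(r_{\text {MG}}^2 = 0.4\) are assumptions, and the closed form is obtained here by inverting the \(2 \times 2\) matrix by hand.

```python
import numpy as np


def index_weights(h2, r2_mg):
    """Solve b = P^{-1} G for the index on (Y_i, Q_hat_i), scaled so var(G_i) = 1."""
    # P: variance-covariance matrix of the information sources (Y_i, Q_hat_i).
    P = np.array([[1.0 / h2, r2_mg],
                  [r2_mg, r2_mg]])
    # G: covariances of the information sources with the genetic value G_i.
    G = np.array([1.0, r2_mg])
    return np.linalg.solve(P, G)


def index_weights_closed_form(h2, r2_mg):
    """Same weights from the hand-inverted 2 x 2 system."""
    denom = 1.0 - h2 * r2_mg
    return np.array([h2 * (1.0 - r2_mg) / denom, (1.0 - h2) / denom])


# Illustrative values (assumptions, not taken from the study).
b_y, b_q = index_weights(h2=0.5, r2_mg=0.4)
ratio = b_q / b_y  # relative weight of the GEBV versus the phenotype
```

With these example values the weights are 0.375 on the phenotype and 0.625 on the GEBV, and their ratio equals \((1/h^2 - 1)/(1 - r_{\text {MG}}^2)\). The index shifts weight toward the GEBV as \(h^2\) decreases or as \(r_{\text {MG}}^2\) increases.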
Cite this article
Müller, D., Technow, F. & Melchinger, A.E. Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set. Theor Appl Genet 128, 693–703 (2015). https://doi.org/10.1007/s00122-015-2464-6
DOI: https://doi.org/10.1007/s00122-015-2464-6