Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set

Müller, Dominik; Technow, Frank; Melchinger, Albrecht E.

doi:10.1007/s00122-015-2464-6

Original Paper
Published: 04 March 2015

Volume 128, pages 693–703, (2015)

Theoretical and Applied Genetics

Dominik Müller (ORCID: orcid.org/0000-0001-7769-8468),
Frank Technow &
Albrecht E. Melchinger

Abstract

Key message

We evaluated several methods for computing shrinkage estimates of the genomic relationship matrix and demonstrated their potential to enhance the reliability of genomic estimated breeding values of training set individuals.

Abstract

In genomic prediction in plant breeding, the training set constitutes a large fraction of the total number of genotypes assayed and is itself subject to selection. The objective of our study was to investigate whether genomic estimated breeding values (GEBVs) of individuals in the training set can be enhanced by shrinkage estimation of the genomic relationship matrix. We simulated two different population types: a diversity panel of unrelated individuals and a biparental family of doubled haploid lines. For different training set sizes (50, 100, 200), numbers of markers (50, 100, 200, 500, 2,500) and heritabilities (0.25, 0.5, 0.75), shrinkage coefficients were computed by four different methods. Two of these methods are novel and based on measures of LD; the other two were previously described in the literature, and we extended one of them. Our results showed that shrinkage estimation of the genomic relationship matrix can significantly improve the reliability of the GEBVs of training set individuals, especially for a low number of markers. We demonstrate that the number of markers is the primary determinant of the optimum shrinkage coefficient that maximizes the reliability, and we recommend methods suitable for routine use in practical applications.

References

Astle W, Balding DJ (2009) Population structure and cryptic relatedness in genetic association studies. Stat Sci 24(4):451–471. doi:10.1214/09-STS307. http://projecteuclid.org/euclid.ss/1271770342, arXiv:1010.4681v1
Bernardo R, Yu J (2007) Prospects for genomewide selection for quantitative traits in maize. Crop Sci 47(3):1082. doi:10.2135/cropsci2006.11.0690. https://www.crops.org/publications/cs/abstracts/47/3/1082
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193(2):327–45. doi:10.1534/genetics.112.143313. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3567727&tool=pmcentrez&rendertype=abstract
Dekkers JCM (2007) Prediction of response to marker-assisted and genomic selection using selection index theory. J Anim Breed Genet 124(6):331–41. doi:10.1111/j.1439-0388.2007.00701.x. http://www.ncbi.nlm.nih.gov/pubmed/18076470
Endelman JB (2011) Ridge regression and other kernels for genomic selection with R Package rrBLUP. Plant Genome J 4(3):250. doi:10.3835/plantgenome2011.08.0024. https://www.crops.org/publications/tpg/abstracts/4/3/250
Endelman JB, Jannink JL (2012) Shrinkage estimation of the realized relationship matrix. G3 2(11):1405–13. doi:10.1534/g3.112.004259. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3484671&tool=pmcentrez&rendertype=abstract
Frisch M, Melchinger AE (2007) Variance of the parental genome contribution to inbred lines derived from biparental crosses. Genetics 176(1):477–88. doi:10.1534/genetics.106.065433. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1893034&tool=pmcentrez&rendertype=abstract
Goddard ME, Wray NR, Verbyla K, Visscher PM (2009) Estimating effects and making predictions from genome-wide marker data. Stat Sci 24(4):517–529. doi:10.1214/09-STS306. http://projecteuclid.org/euclid.ss/1271770346, arXiv:1010.4710v1
Goddard ME, Hayes BJ, Meuwissen THE (2011) Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet 128(6):409–21. doi:10.1111/j.1439-0388.2011.00964.x. http://www.ncbi.nlm.nih.gov/pubmed/22059574
Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177(4):2389–97. doi:10.1534/genetics.107.081190. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2219482&tool=pmcentrez&rendertype=abstract
Habier D, Fernando RL, Garrick DJ (2013) Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194(3):597–607. doi:10.1534/genetics.113.152207. http://www.ncbi.nlm.nih.gov/pubmed/23640517
Hayes B, Goddard M (2010) Genome-wide association and genomic selection in animal breeding. Genome 53(11):876–83. doi:10.1139/G10-076. http://www.ncbi.nlm.nih.gov/pubmed/21076503
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME (2009) Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci 92(2):433–43. doi:10.3168/jds.2008-1646. http://www.ncbi.nlm.nih.gov/pubmed/19164653
Henderson CR (1973) Sire evaluation and genetic trends. J Anim Sci, pp 10–41
Hill W, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38(6):226–231. http://link.springer.com/article/10.1007/BF01245622
Hill WG (2010) Understanding and using quantitative genetic variation. Philos Trans R Soc Lond Ser B Biol Sci 365(1537):73–85. doi:10.1098/rstb.2009.0203. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2842708&tool=pmcentrez&rendertype=abstract
Hill WG, Weir BS (2011) Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res 93(1):47–64. doi:10.1017/S0016672310000480. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3070763&tool=pmcentrez&rendertype=abstract
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E (2008) Efficient control of population structure in model organism association mapping. Genetics 178(3):1709–23. doi:10.1534/genetics.107.080101. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2278096&tool=pmcentrez&rendertype=abstract
Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3):743–56. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1203965&tool=pmcentrez&rendertype=abstract
Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits, 1st edn. Sinauer Associates, Sunderland
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4):1819–1829. http://www.genetics.org/content/157/4/1819.abstract
Montana G (2005) HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics (Oxford, England) 21(23):4309–11. doi:10.1093/bioinformatics/bti689. http://www.ncbi.nlm.nih.gov/pubmed/16188927
Powell JE, Visscher PM, Goddard ME (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet 11(11):800–5. doi:10.1038/nrg2865. http://www.ncbi.nlm.nih.gov/pubmed/20877324
R Core Team (2014) R: a language and environment for statistical computing. http://www.r-project.org/
Riedelsheimer C, Melchinger AE (2013) Optimizing the allocation of resources for genomic selection in one breeding cycle. Theor Appl Genet 126(11):2835–48. doi:10.1007/s00122-013-2175-9. http://www.ncbi.nlm.nih.gov/pubmed/23982591
Riedelsheimer C, Technow F, Melchinger AE (2012) Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines. BMC Genomics 13(1):452. doi:10.1186/1471-2164-13-452. http://www.mendeley.com/research/comparison-of-whole-genome-prediction-models-for-traits-with-contrasting-genetic-architecture-in-a-d-1/
Searle SR, Casella G, McCulloch CE (1992) Variance components, 1st edn. Wiley-Interscience, Hoboken
Smith JSC, Hussain T, Jones ES, Graham G, Podlich D, Wall S, Williams M (2008) Use of doubled haploids in maize breeding: implications for intellectual property protection and genetic diversity in hybrid crops. Mol Breed 22(1):51–59. doi:10.1007/s11032-007-9155-1. http://link.springer.com/10.1007/s11032-007-9155-1
Technow F (2013) hypred: simulation of genomic data in applied genetics. http://cran.r-project.org/web/packages/hypred/
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91(11):4414–23. doi:10.3168/jds.2007-0980. http://www.ncbi.nlm.nih.gov/pubmed/18946147
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42(7):565–9. doi:10.1038/ng.608. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3232052&tool=pmcentrez&rendertype=abstract

Conflict of interest

The authors declare no conflict of interest associated with this study.

Ethical standards

The authors declare that ethical standards are met, and all the experiments comply with the current laws of the country in which they were performed.

Author information

Frank Technow & Albrecht E. Melchinger
Present address: DuPont Pioneer, Johnston, IA, USA

Authors and Affiliations

University of Hohenheim, Stuttgart, Germany
Dominik Müller, Frank Technow & Albrecht E. Melchinger

Corresponding author

Correspondence to Dominik Müller.

Additional information

Communicated by Hiroyoshi Iwata.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2787 KB)

Appendix

In this appendix, we describe Methods effLD and RM in detail.

Method effLD

In order to account for the genetic variance explained by markers beyond the ones immediately adjacent to QTL, we devised a measure for effective LD (${\text {LD}}_{\text {eff}}$). Because QTL genotypes are generally unobservable, we use marker loci as a proxy.

Suppose that $M$ biallelic markers are located on a chromosomal segment, where $p_i$ is the estimated frequency of the major allele at the $i$th marker. The LD between markers $i$ and $j$ can be computed according to Hill and Robertson (1968) as

$$\begin{aligned} r_{ij}^2 = \frac{ \left( p_{ij} - p_{i} p_j \right) ^2 }{p_{i} p_j \left( 1 - p_{i} \right) \left( 1 - p_j \right) }, \end{aligned}$$

(7)

where $p_{ij}$ is the joint probability of the major allele occurring at both marker loci $i$ and $j$. ${\text {LD}}_{\text {eff}}$ is then calculated as follows. For each chromosome,

1. compute $p_{ij}$ for all marker pairs as $p_{ij} = r_{ij} \sqrt{p_{i} p_j \left( 1 - p_{i} \right) \left( 1 - p_j \right) } + p_{i} p_j$
2. compute the covariance matrix ${\mathbf {\Sigma }} = \left\{ \Sigma _{ij} \right\}$ by solving the equations ${\mathrm {\Phi }}\!\left( z(p_i), z(p_j) ; \Sigma _{ij} \right) = p_{ij}$ for $\Sigma _{ij}$ for all marker pairs, where ${\mathrm {\Phi }}$ is the cumulative distribution function of the standard bivariate normal distribution with mean zero and covariance $\Sigma _{ij}$ and $z(p_i)$ refers to the $p_i{\text {th}}$ quantile of the univariate standard normal distribution (Montana 2005).
3. compute the conditional variance for each locus $i$, given all others, as $\sigma _i = {\mathbf {\Sigma _{i,i}}} - {\mathbf {\Sigma _{i , -i}}} {\mathbf {\Sigma ^{-1}_{-i,-i}}} {\mathbf {\Sigma _{-i, i}}}.$ Here, the subscript $\varvec{\mathrm {i}}$ denotes the $i$th row or column, whereas $\varvec{\mathrm {-i}}$ denotes all but the $i$th row or column. Considering now the $i$th locus as a QTL, we imagine a hypothetical marker locus $h$ in the proximity that would effectively lead to the same conditional variance at the $i$th locus.
4. compute $p^*_{ih} = {\mathrm {\Phi} }\!\left( z(p_i), 0 ; \sqrt{1 - \sigma _i} \right)$
5. compute the effective LD for each locus $i$ of all loci ($L$) as
$$\begin{aligned} {\text {LD}}_{\text {eff}} =\sum _{i = 1}^{L} \frac{ \left( p^*_{ih} - 0.5 p_i \right) ^2 }{0.5 p_i \left( 1 - p_i \right) \left( 1 - 0.5 \right) } \end{aligned}$$
(8)
6. take the average across all loci on the same chromosome
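
The pieces of this procedure that rely on the bivariate normal distribution (steps 1, 2, 4 and 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bivariate normal CDF is evaluated by integrating its density over the correlation with Simpson's rule, step 2 is solved by bisection, the correlation is clamped to 0.999 to keep the integrand finite, and the summand of Eq. 8 is read as a per-locus value that step 6 then averages.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_cdf(h, k, rho, n=400):
    """P(X <= h, Y <= k) for a standard bivariate normal with correlation rho.

    Uses Phi2(h, k; rho) = Phi(h) * Phi(k) + integral_0^rho phi2(h, k; t) dt,
    evaluated with Simpson's rule (n must be even, |rho| < 1).
    """
    def density(t):
        s = 1.0 - t * t
        expo = -(h * h - 2.0 * t * h * k + k * k) / (2.0 * s)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(s))

    step = rho / n
    total = density(0.0) + density(rho)
    for m in range(1, n):
        total += (4.0 if m % 2 else 2.0) * density(m * step)
    return STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k) + total * step / 3.0


def joint_frequency(r, p_i, p_j):
    """Step 1: p_ij from the signed LD correlation r_ij and allele frequencies."""
    return r * math.sqrt(p_i * p_j * (1.0 - p_i) * (1.0 - p_j)) + p_i * p_j


def latent_covariance(p_i, p_j, p_ij, tol=1e-10):
    """Step 2: solve Phi2(z(p_i), z(p_j); s) = p_ij for s by bisection."""
    h, k = STD_NORMAL.inv_cdf(p_i), STD_NORMAL.inv_cdf(p_j)
    lo, hi = -0.999, 0.999  # assumed bounds; Phi2 is increasing in s
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bvn_cdf(h, k, mid) < p_ij:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def effective_ld(p_i, sigma_i):
    """Steps 4 and 5 for one locus with conditional variance sigma_i."""
    # Clamping is an assumption of this sketch, not part of the method.
    rho = min(math.sqrt(max(1.0 - sigma_i, 0.0)), 0.999)
    p_star = bvn_cdf(STD_NORMAL.inv_cdf(p_i), 0.0, rho)
    return (p_star - 0.5 * p_i) ** 2 / (0.5 * p_i * (1.0 - p_i) * (1.0 - 0.5))


# Example: one marker pair with r_ij = 0.4, p_i = 0.6, p_j = 0.7.
p_ij = joint_frequency(0.4, 0.6, 0.7)
print(latent_covariance(0.6, 0.7, p_ij))
print(effective_ld(0.5, 0.75))
```

Two limiting cases are useful as checks. With $\sigma_i = 1$ (no information from other loci), $p^*_{ih} = 0.5 p_i$ and the per-locus value is zero. With $p_i = 0.5$ the per-locus value reduces to $\left( 2 \arcsin (\sqrt{1 - \sigma_i}) / \pi \right)^2$, which equals $1/9$ for $\sigma_i = 0.75$.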

Finally, take the average across all chromosomes. Intuitively, ${\text {LD}}_{\text {eff}}$ is the average coefficient of LD that would be observed between a QTL and a hypothetical marker with an allele frequency of 0.5 that reduces the variance of the QTL genotype from ${\mathbf {\Sigma _{i,i}}}$ to $\sigma _i$.
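
The reduction of the variance from ${\mathbf {\Sigma _{i,i}}}$ to $\sigma _i$ is the conditional variance (Schur complement) of step 3. A minimal pure-Python sketch is given below; the helper names and the toy covariance matrix are illustrative assumptions, and a production implementation would more likely use a linear algebra library.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def conditional_variances(sigma):
    """sigma_i = S[i,i] - S[i,-i] S[-i,-i]^{-1} S[-i,i] for every locus i."""
    n = len(sigma)
    out = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        sub = [[sigma[r][c] for c in rest] for r in rest]   # S[-i,-i]
        rhs = [sigma[r][i] for r in rest]                    # S[-i,i]
        x = solve(sub, rhs)                                  # S[-i,-i]^{-1} S[-i,i]
        out.append(sigma[i][i] - sum(sigma[i][r] * v for r, v in zip(rest, x)))
    return out


# Toy example (assumed): three loci whose latent correlation decays as 0.5^|i-j|.
toy_sigma = [[1.0, 0.5, 0.25],
             [0.5, 1.0, 0.5],
             [0.25, 0.5, 1.0]]
print(conditional_variances(toy_sigma))
```

In this toy matrix the middle locus has neighbours on both sides and therefore a smaller conditional variance (0.6) than the two outer loci (0.75), so it would receive a larger per-locus effective LD.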

Method RM

We use the model and notation of Dekkers (2007)

$$\begin{aligned} {Y_i} = G_i + E_i = \widehat{Q}_i+ e_i + R_i + E_i, \end{aligned}$$

(9)

where the phenotypic value $Y_i$ of the $i$th individual is decomposed into its genetic value $G_i$ and an environmental deviate $E_i$. The genetic value is further partitioned into QTL effects $Q_i$ that are associated with markers through LD and effects $R_i$ that are independent of markers. The effects $Q_i$ can be further subdivided into a prediction $\widehat{Q}_i$ and a prediction error $e_i$, which are uncorrelated with each other.

A selection index combining phenotypic data and GEBVs can be constructed as ${\mathbf {b}} = {\mathbf {P}}^{-1}{\mathbf {G}}$, e.g., Lande and Thompson (1990), where

$${\mathbf{G}} = \begin{pmatrix} {\text{cov}}(\widehat{Q}_i , G_i) \\ {\text{cov}}(Y_i , G_i) \end{pmatrix} \quad {\text{and}} \quad {\mathbf{P}} = \begin{pmatrix} {\text{var}}(\widehat{Q}_i) & {\text{cov}}(\widehat{Q}_i , Y_i) \\ {\text{cov}}(Y_i , \widehat{Q}_i) & {\text{var}}(Y_i) \end{pmatrix}.$$

(10)

Without loss of generality, we assume $\sigma _G^2 = {\text {var}}(G_i) = 1$ and $\sigma _P^2 = {\text {var}}(Y_i) = \frac{1}{h^2}$. Also, let $q^2 = {\text {var}}(Q_i)$ be the proportion of genetic variance contributed by QTL that are in LD with markers. Then

$$\begin{aligned} r \left( \widehat{Q}_i , Q_i \right) = r_{\widehat{Q}_i} = \frac{{\text {cov}}(\widehat{Q}_i , Q_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i}} = \frac{\sigma _{\widehat{Q}_i}}{\sigma _{Q_i}}, \end{aligned}$$

(11)

where the last equality follows from the uncorrelatedness of the predictor $\widehat{Q}_i$ with the model residual $e_i$, which gives ${\text {cov}}(\widehat{Q}_i , Q_i) = {\text {cov}}(\widehat{Q}_i , \widehat{Q}_i + e_i) = \sigma _{\widehat{Q}_i}^2$. Thus, $r_{\widehat{Q}_i}^2 = \frac{\sigma _{\widehat{Q}_i}^2}{\sigma _{Q_i}^2}$ is the proportion of genetic variance contributed by $Q_i$ that is explained by the GEBV $\widehat{Q}_i$. Assuming $r \left( \widehat{Q}_i , R_i \right) = 0$, we obtain

$$\begin{aligned} r \left( \widehat{Q}_i , G_i \right) = \frac{{\text {cov}}(\widehat{Q}_i , G_i)}{\sigma _{\widehat{Q}_i} \sigma _{G_i}} = \frac{\sigma _{Q_i} {\text {cov}}(\widehat{Q}_i , G_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i} \sigma _{G_i}} = q r_{\widehat{Q}_i}. \end{aligned}$$

(12)

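With this, we obtain ${\text {cov}}(\widehat{Q}_i , G_i) = r \left( \widehat{Q}_i , G_i \right) \sigma _{\widehat{Q}_i} \sigma _{G_i} = q^2 r_{\widehat{Q}_i}^2$, because $\sigma _{\widehat{Q}_i} = q r_{\widehat{Q}_i}$ by Eq. 11 and $\sigma _{G_i} = 1$. Since ${\text {cov}}(Y_i , G_i) = \sigma _G^2 = 1$, we have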

$${\mathbf{G}} = \begin{pmatrix} {\text{cov}}(\widehat{Q}_i , G_i) \\ {\text{cov}}(Y_i , G_i) \end{pmatrix} = \begin{pmatrix} q^2 r_{\widehat{Q}_i}^2 \\ 1 \end{pmatrix}$$

(13)

Further, we have ${\text {var}}(\widehat{Q}_i) = q^2 r_{\widehat{Q}_i}^2$ and ${\text {var}}(Y_i) = \frac{1}{h^2}$. Assuming that $\widehat{Q}_i$ and $E_i$ are uncorrelated, i.e., $r \left( \widehat{Q}_i , E_i \right) = 0$, we have ${\text {cov}}(\widehat{Q}_i , Y_i) = {\text {cov}}(\widehat{Q}_i , G_i) = q^2 r_{\widehat{Q}_i}^2$. Hence,

$${\mathbf{P}} = \begin{pmatrix} {\text{var}}(\widehat{Q}_i) & {\text{cov}}(\widehat{Q}_i , Y_i) \\ {\text{cov}}(Y_i , \widehat{Q}_i) & {\text{var}}(Y_i) \end{pmatrix} = \begin{pmatrix} q^2 r_{\widehat{Q}_i}^2 & q^2 r_{\widehat{Q}_i}^2 \\ q^2 r_{\widehat{Q}_i}^2 & \frac{1}{h^2} \end{pmatrix}$$

(14)

By multiplying ${\mathbf {P}}^{-1}$ and ${\mathbf {G}}$, we obtain

$$\begin{aligned} b_1 = \frac{1 - h^2}{1 - h^2 q^2 r_{\widehat{Q}_i}^2} \quad {\text {and}}\quad b_2 = \frac{h^2 - h^2 q^2 r_{\widehat{Q}_i}^2}{1 - h^2 q^2 r_{\widehat{Q}_i}^2}. \end{aligned}$$

(15)

In particular, we have

$$\begin{aligned} \frac{b_1}{b_2} = \frac{\frac{1}{h^2} - 1}{1 - q^2 r_{\widehat{Q}_i}^2} = \frac{\frac{1}{h^2} - 1}{1 - r_{M\!G}^2}. \end{aligned}$$

(16)
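
The closed forms in Eqs. 15 and 16 can be checked numerically by solving the $2 \times 2$ system ${\mathbf {P}}{\mathbf {b}} = {\mathbf {G}}$ directly. The sketch below does this in Python; the function names and the example values $h^2 = 0.5$ and $q^2 r_{\widehat{Q}_i}^2 = 0.3$ are illustrative assumptions, not values from the study.

```python
def index_weights(h2, r2_mg):
    """Solve P b = G for the 2x2 system of Eqs. 13 and 14 (var(G) = 1).

    r2_mg stands for q^2 * r_Qhat^2 and must lie strictly between 0 and 1.
    """
    a = r2_mg
    p = [[a, a], [a, 1.0 / h2]]   # Eq. 14
    g = [a, 1.0]                  # Eq. 13
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    b1 = (p[1][1] * g[0] - p[0][1] * g[1]) / det   # weight on the GEBV
    b2 = (p[0][0] * g[1] - p[1][0] * g[0]) / det   # weight on the phenotype
    return b1, b2


def closed_form_weights(h2, r2_mg):
    """Eq. 15 written out explicitly."""
    denom = 1.0 - h2 * r2_mg
    return (1.0 - h2) / denom, (h2 - h2 * r2_mg) / denom


b1, b2 = index_weights(0.5, 0.3)
print(b1, b2, b1 / b2)
print(closed_form_weights(0.5, 0.3))
```

Both routes give the same weights, and their ratio matches $(1/h^2 - 1)/(1 - r_{M\!G}^2)$ from Eq. 16. The ratio grows as heritability decreases or as the GEBV explains more genetic variance, that is, the GEBV receives relatively more weight than the phenotype.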

Equation 16 is equivalent to Eq. 3 in Lande and Thompson (1990). The quantity $q^2 r_{\widehat{Q}_i}^2$ is equal to $r_{\text {MG}}^2$ in Dekkers (2007), which is the proportion of genetic variance that is explained by the GEBV. In practice, this parameter can be estimated using cross-validation as the squared predictive ability. In particular, we used fivefold cross-validation with five replications to estimate $r_{\text {MG}}^2$ from the training set. The assumptions $r \left( \widehat{Q}_i , R_i \right) = 0$ and $r \left( \widehat{Q}_i , E_i \right) = 0$ are not fulfilled in populations of finite size, as we confirmed by simulation.
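
A replicated fivefold cross-validation of this kind can be sketched as follows. The code is a hedged illustration only: `fit_predict` is a hypothetical callback standing in for whatever GEBV model is used, the single-covariate least-squares predictor and the simulated data are placeholders, and averaging the squared correlation over all folds and replications is one possible reading of "squared predictive ability" that the text does not spell out.

```python
import math
import random
from statistics import fmean


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = fmean(a), fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def cv_squared_predictive_ability(y, fit_predict, k=5, reps=5, seed=1):
    """Mean squared correlation of predicted and observed values over reps x k folds."""
    rng = random.Random(seed)
    squared = []
    for _ in range(reps):
        order = list(range(len(y)))
        rng.shuffle(order)                 # new random partition per replication
        for f in range(k):
            val = order[f::k]              # every k-th shuffled index forms a fold
            held_out = set(val)
            train = [i for i in order if i not in held_out]
            predicted = fit_predict(train, val)
            r = pearson(predicted, [y[i] for i in val])
            squared.append(r * r)
    return fmean(squared)


def make_single_covariate_predictor(x, y):
    """Placeholder model (assumption): least squares on one covariate."""
    def fit_predict(train, val):
        mx = fmean([x[i] for i in train])
        my = fmean([y[i] for i in train])
        sxx = sum((x[i] - mx) ** 2 for i in train)
        slope = sum((x[i] - mx) * (y[i] - my) for i in train) / sxx
        return [my + slope * (x[i] - mx) for i in val]
    return fit_predict


# Simulated placeholder data: 100 individuals, one covariate plus noise.
data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
y = [xi + data_rng.gauss(0.0, 1.0) for xi in x]
r2_mg_hat = cv_squared_predictive_ability(y, make_single_covariate_predictor(x, y))
print(r2_mg_hat)
```

Passing the model as a callback keeps the fold logic independent of the prediction method, so the same routine could wrap a genomic BLUP fit in place of the placeholder predictor.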

About this article

Cite this article

Müller, D., Technow, F. & Melchinger, A.E. Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set. Theor Appl Genet 128, 693–703 (2015). https://doi.org/10.1007/s00122-015-2464-6

Received: 17 September 2014
Accepted: 10 January 2015
Published: 04 March 2015
Issue Date: April 2015
DOI: https://doi.org/10.1007/s00122-015-2464-6
