Skip to main content
Log in

Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set

  • Original Paper
  • Published:
Theoretical and Applied Genetics Aims and scope Submit manuscript

Abstract

Key message

We evaluated several methods for computing shrinkage estimates of the genomic relationship matrix and demonstrated their potential to enhance the reliability of genomic estimated breeding values of training set individuals.

Abstract

In genomic prediction in plant breeding, the training set constitutes a large fraction of the total number of genotypes assayed and is itself subject to selection. The objective of our study was to investigate whether genomic estimated breeding values (GEBVs) of individuals in the training set can be enhanced by shrinkage estimation of the genomic relationship matrix. We simulated two different population types: a diversity panel of unrelated individuals and a biparental family of doubled haploid lines. For different training set sizes (50, 100, 200), number of markers (50, 100, 200, 500, 2,500) and heritabilities (0.25, 0.5, 0.75), shrinkage coefficients were computed by four different methods. Two of these methods are novel and based on measures of LD, the other two were previously described in the literature, one of which was extended by us. Our results showed that shrinkage estimation of the genomic relationship matrix can significantly improve the reliability of the GEBVs of training set individuals, especially for a low number of markers. We demonstrate that the number of markers is the primary determinant of the optimum shrinkage coefficient maximizing the reliability and we recommend methods eligible for routine usage in practical applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

References

Download references

Conflict of interest

The authors declare no conflict of interest associated with this study.

Ethical standards

The authors declare that ethical standards are met, and all the experiments comply with the current laws of the country in which they were performed.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dominik Müller.

Additional information

Communicated by Hiroyoshi Iwata.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2787 KB)

Appendix

Appendix

In this appendix, we describe Methods effLD and RM in detail.

Method effLD

In order to account for the genetic variance explained by markers beyond the ones immediately adjacent to QTL, we devised a measure for effective LD (\({\text {LD}}_{\text {eff}}\)). Because QTL genotypes are generally unobservable, we use marker loci as a proxy.

Suppose that \(M\) biallelic markers are located on a chromosomal segment where \(p_i\) is the estimated allele frequency (of the major allele) at the \(i\)th marker. The LD between marker \(i\) and \(j\) can be computed according to Hill and Robertson (1968) as

$$\begin{aligned} r_{ij}^2 = \frac{ \left( p_{ij} - p_{i} p_j \right) ^2 }{p_{i} p_j \left( 1 - p_{i} \right) \left( 1 - p_j \right) }, \end{aligned}$$
(7)

where \(p_{ij}\) is the joint probability of the major allele occurring at both marker loci \(i\) and \(j\). \({\text {LD}}_{\text {eff}}\) is then calculated as follows. For each chromosome,

  1. 1.

    compute \(p_{ij}\) for all marker pairs as \(p_{ij} = r_{ij} \sqrt{p_{i} p_j \left( 1 - p_{i} \right) \left( 1 - p_j \right) } + p_{i} p_j\)

  2. 2.

    compute the covariance matrix \({\mathbf {\Sigma }} = \left\{ \Sigma _{ij} \right\}\) by solving the equations \({\mathrm {\Phi }}\!\left( z(p_i), z(p_j) ; \Sigma _{ij} \right) = p_{ij}\) for \(\Sigma _{ij}\) for all marker pairs, where \({\mathrm {\Phi }}\) is the cumulative distribution function of the standard bivariate normal distribution with mean zero and covariance \(\Sigma _{ij}\) and \(z(p_i)\) refers to the \(p_i{\text {th}}\) quantile of the univariate standard normal distribution (Montana 2005).

  3. 3.

    compute the conditional variance for each locus \(i\), given all others, as \(\sigma _i = {\mathbf {\Sigma _{i,i}}} - {\mathbf {\Sigma _{i , -i}}} {\mathbf {\Sigma ^{-1}_{-i,-i}}} {\mathbf {\Sigma _{-i, i}}}.\) Here, the subscript \(\varvec{\mathrm {i}}\) denotes the \(i\)th row or column, whereas \(\varvec{\mathrm {-i}}\) denotes all but the \(i\)th row or column. Considering now the \(i\)th locus as a QTL, we imagine a hypothetical marker locus \(h\) in the proximity that would effectively lead to the same conditional variance at the \(i\)th locus.

  4. 4.

    compute \(p^*_{ih} = {\mathrm {\Phi} }\!\left( z(p_i), 0 ; \sqrt{1 - \sigma _i} \right)\)

  5. 5.

    compute the effective LD for each locus \(i\) of all loci (\(L\)) as

    $$\begin{aligned} {\text {LD}}_{\text {eff}} =\sum _{i = 1}^{L} \frac{ \left( p^*_{ih} - 0.5 p_i \right) ^2 }{0.5 p_i \left( 1 - p_i \right) \left( 1 - 0.5 \right) } \end{aligned}$$
    (8)
  6. 6.

    take the average across all loci on the same chromosome

Finally, take the average across all chromosomes. Intuitively, LDeff would be the average coefficient of LD that would be observed between a QTL and a hypothetical marker with 0.5 allele frequency that would reduce the variance of the QTL genotype from \({\mathbf {\Sigma _{i,i}}}\) to \(\sigma _i\).

Method RM

We use the model and notation of Dekkers (2007)

$$\begin{aligned} {Y_i} = G_i + E_i = \widehat{Q}_i+ e_i + R_i + E_i, \end{aligned}$$
(9)

where the phenotypic value \(Y_i\) of the \(i\)th individual is decomposed into its genetic value \(G_i\) and an environmental deviate \(E_i\). The genetic value is further partitioned into QTL effect \(Q_i\) that is associated with marker through LD and effects \(R_i\) that is independent of markers. The effects \(Q_i\) can be further subdivided into a prediction \(\widehat{Q}_i\) and a prediction error \(e_i\), both being uncorrelated with one another.

A selection index combining phenotypic data and GEBVs can be constructed as \({\mathbf {b}} = {\mathbf {P}}^{-1}{\mathbf {G}}\), e.g., Lande and Thompson (1990), where

$${\mathbf{G}} = \left( {\begin{array}{*{20}l} {{\text{cov}}(\hat{Q}_{i} ,G_{i} )} \\ {{\text{cov}}(Y_{i} ,G_{i} )} \\ \end{array} } \right)\quad {\text{and}}\quad {\mathbf{P}} = \left( {\begin{array}{*{20}c} {{\text{var}}(\hat{Q}_{i} );} & {\quad {\text{cov}}(\hat{Q}_{i} ,Y_{i} )} \\ {{\text{cov}}(Y_{i} ,\hat{Q}_{i} );} & {\quad {\text{var}}(Y_{i} )} \\ \end{array} } \right).$$
(10)

Without loss of generality, we assume \(\sigma _G^2 = {\text {var}}(G_i) = 1\) and \(\sigma _P^2 = {\text {var}}(P_i) = \frac{1}{h^2}\). Also, let \(q^2 = {\text {var}}(Q_i)\) be the proportion of variance contributed by QTL that are in LD with markers. Then

$$\begin{aligned} r \left( \widehat{Q}_i , Q_i \right) = r_{\widehat{Q}_i} = \frac{{\text {cov}}(\widehat{Q}_i , Q_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i}} = \frac{\sigma _{\widehat{Q}_i}}{\sigma _{Q_i}}, \end{aligned}$$
(11)

where the last equality follows from the uncorrelatedness of the predictor \(\widehat{Q}_i\) with the model residual \(e_i\). Thus, \(r_{\widehat{Q}_i} = \frac{\sigma _{\widehat{Q}_i}^2}{\sigma _{Q_i}^2}\) is the proportion of genetic variance contributed by \(Q_i\) that is explained by the GEBV \(\widehat{Q}_i\). Assuming \(r \left( \widehat{Q}_i , R_i \right) = 0\), we obtain

$$\begin{aligned} r \left( \widehat{Q}_i , G_i \right) = \frac{{\text {cov}}(\widehat{Q}_i , G_i)}{\sigma _{\widehat{Q}_i} \sigma _{G_i}} = \frac{\sigma _{Q_i} {\text {cov}}(\widehat{Q}_i , G_i)}{\sigma _{\widehat{Q}_i} \sigma _{Q_i} \sigma _{G_i}} = q r_{\widehat{Q}_i}. \end{aligned}$$
(12)

With this, we obtain \({\text {cov}}(\widehat{Q}_i , G_i) = q^2 r_{\widehat{Q}_i}^2\). Since \({\text {cov}}(Y_i , G_i) = 1\) , we have

$${\mathbf{G}} = \left( {\begin{array}{*{20}l} {{\text{cov}}(\hat{Q}_{i} ,G_{i} )} \\ {{\text{cov}}(Y_{i} ,G_{i} )} \\ \end{array} } \right) = \left( {\begin{array}{*{20}l} {q^{2} r_{{\hat{Q}_{i} }}^{2} } \\ 1 \\ \end{array} } \right)$$
(13)

Further, we have \({\text {var}}(\widehat{Q}_i) = q^2 r_{\widehat{Q}_i}^2\), \({\text {var}}(P_i) = \frac{1}{h^2}\). Assuming that \(\widehat{Q}_i\) and \(E_i\) are uncorrelated, i.e., \(r \left( \widehat{Q}_i , E_i \right) = 0\), we have \({\text {cov}}(\widehat{Q}_i , P_i) = {\text {cov}}(\widehat{Q}_i , G_i) = q^2 r_{\widehat{Q}_i}^2\). Hence,

$${\mathbf{P}} = \left( {\begin{array}{*{20}l} {{\text{var}}(\hat{Q}_{i} );} & {{\text{cov}}(\hat{Q}_{i} ,Y_{i} )} \\ {{\text{cov}}(Y_{i} ,\hat{Q}_{i} );} & {{\text{var}}(Y_{i} )} \\ \end{array} } \right) = \left( {\begin{array}{*{20}l} {q^{2} r_{{\hat{Q}_{i} }}^{2} ;} & {\quad q^{2} r_{{\hat{Q}_{i} }}^{2} } \\ {q^{2} r_{{\hat{Q}_{i} }}^{2} ;} & {\quad \frac{1}{{h^{2} }}} \\ \end{array} } \right)$$
(14)

By multiplying \({\mathbf {P}}^{-1}\) and \({\mathbf {G}}\), we obtain

$$\begin{aligned} b_1 = \frac{1 - h^2}{1 - h^2 q^2 r_{\widehat{Q}_i}^2} \quad {\text {and}}\quad b_2 = \frac{h^2 - h^2 q^2 r_{\widehat{Q}_i}^2}{1 - h^2 q^2 r_{\widehat{Q}_i}^2}, \end{aligned}$$
(15)

In particular, we have

$$\begin{aligned} \frac{b_1}{b_2} = \frac{\frac{1}{h^2} - 1}{1 - q^2 r_{\widehat{Q}_i}^2} = \frac{\frac{1}{h^2} - 1}{1 - r_{M\!G}^2}. \end{aligned}$$
(16)

This is equivalent to Eq. 3 in Lande and Thompson (1990). The quantity \(q^2 r_{\widehat{Q}_i}^2\) is equal to \(r_{\text {MG}}^2\) in Dekkers (2007), which is the proportion of genetic variance that is explained by the GEBV. In practice, this parameter can be estimated using cross-validation as the squared predictive ability. In particular, we used fivefold cross-validation with five replications to estimate \(r_{\text {MG}}^2\) from the training set. The assumptions \(r \left( \widehat{Q}_i , R_i \right) = 0\) and \(r \left( \widehat{Q}_i , E_i \right) = 0\) are obviously not fulfilled with finite population sizes, as was validated by means of simulation.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Müller, D., Technow, F. & Melchinger, A.E. Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set. Theor Appl Genet 128, 693–703 (2015). https://doi.org/10.1007/s00122-015-2464-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00122-015-2464-6

Keywords

Navigation