Research Communication

Genetics and Genomics

Genetic and environmental perturbations lead to regulatory decoherence

Princeton University, United States
University of California, San Francisco, United States
University of California, Los Angeles, United States
Tampere University, Finland
Tampere University, Tampere University Hospital, Finland
University of Turku, Finland
Turku University Hospital, Finland
Systems Epidemiology, Baker Heart and Diabetes Institute, Australia
University of Oulu, Finland
University of Eastern Finland, Finland
University of Bristol, United Kingdom
The Alfred Hospital, Monash University, Australia

Mar 5, 2019

Open access
Copyright information

Abstract
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Correlation among traits is a fundamental feature of biological systems that remains difficult to study. To address this problem, we developed a flexible approach that allows us to identify factors associated with inter-individual variation in correlation. We use data from three human cohorts to study the effects of genetic and environmental variation on correlations among mRNA transcripts and among NMR metabolites. We first show that environmental exposures (infection and disease) lead to a systematic loss of correlation, which we define as 'decoherence'. Using longitudinal data, we show that decoherent metabolites are better predictors of whether someone will develop metabolic syndrome than metabolites commonly used as biomarkers of this disease. Finally, we demonstrate that correlation itself is under genetic control by mapping hundreds of 'correlation quantitative trait loci (QTLs)'. Together, this work furthers our understanding of how and why coordinated biological processes break down, and points to a potential role for decoherence in disease.

Editorial note: This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed (see decision letter).

https://doi.org/10.7554/eLife.40538.001

Introduction

A major goal in evolutionary and medical genetics is to identify the coordinated regulatory processes that differ between individuals as a function of their disease state, environment, or genetic background. One powerful approach for doing so is to identify the effect of environmental or genetic perturbations on the degree to which genes are correlated at the mRNA level – a phenomenon known as ‘co-expression’. From a medical perspective, identifying factors associated with changes in co-expression between healthy and sick individuals should point toward key regulatory changes driving phenotypic differences between groups (Mentzen et al., 2009; Hudson et al., 2009; Kostka and Spang, 2004; Choi et al., 2005; de la Fuente, 2010). From an evolutionary perspective, work on decanalization (Gibson, 2009a; Gibson, 2009b; Careau et al., 2014) indicates that gene regulatory networks evolve over many generations of stabilizing selection, and new mutations or novel environments may disrupt these fine-tuned connections. Decanalization has been hypothesized to explain the recent rise of non-communicable diseases in humans (Gibson, 2009b; Hu et al., 2016), with the idea being that recent major shifts in diet and lifestyle have led to dysregulation of regulatory programs that evolved under past environmental conditions. However, testing this hypothesis remains challenging in practice.

Our ability to move forward in understanding how environmental and genetic variation affect molecular correlations – in both humans and other organisms where this topic has been extensively discussed (Wang et al., 2013; Sun et al., 2008; Southworth et al., 2009; Fukushima, 2013; Huang et al., 2015) – is limited by the available methodology (reviewed in (de la Fuente, 2010; van Dam et al., 2017; Singh et al., 2018)). In particular, most studies of differential co-expression to date have relied on two types of approaches: (i) building co-expression networks separately within each group of interest (e.g., diseased versus healthy) and contrasting their properties (Kostka and Spang, 2004; Choi et al., 2005; Steiger, 1980; Watson, 2006); or (ii) asking whether predefined sets of genes are differentially co-expressed between groups (Fukushima, 2013; Jen et al., 2006). Generally, these approaches are designed to compare population-level statistics (e.g., correlation coefficients) between two sample groups, and thus cannot test the effect of continuous predictor variables on variation in correlation. Further, these approaches are generally designed for datasets in which no individual-level covariates (e.g., age, sex, or batch) may bias co-expression estimates, which could lead to false conclusions if not accounted for (Parsana, 2017). This limited flexibility has made it difficult to identify individual-specific factors that predict co-expression in humans or other organism that are not amenable to controlled manipulations.

To address this gap, we present a new approach for interrogating sources of variance in correlation. Our approach builds on previous work (Steiger, 1980), and is based on the fact that the Pearson correlation coefficient is equal to the average element-wise product of two traits measured across individuals, after each trait is mean centered and scaled; this value reflects the relationship between two traits within a population sample. By extension, to obtain a measure of the degree of correlation between two traits for each individual in a sample, we can simply keep the vector of products (i.e., we do not perform averaging across individuals). This vector of products can be used as the outcome variable in a linear model or linear mixed model, and consequently our approach can accommodate covariates and continuous predictor variables. We call our approach ‘Correlation by Individual Level Product’ (CILP), because for each individual the product between two traits is estimated and modeled as a continuous outcome variable (Figure 1).

Figure 1 with 2 supplements see all

Download asset Open asset

Illustration of decoherence and ‘Correlation by Individual Level Product’ (CILP).

(a) Co-expression networks in two different environments. Environment one represent normal (or control/baseline) conditions where the expression levels of gene 1 and gene 2 are highly correlated and co-regulated. (b) Environment 2 represents a stressful or unhealthy condition leading to lower correlation between the expression levels of gene 1 and gene 2. At the network level, this translates into a lower network degree (i.e., lower average correlation across genes or fewer connected nodes). We call this change in the correlation structure 'decoherence'. This is what we could expect if stressful conditions lower transcriptional robustness and lead to dysregulation of gene expression. (c) Differences in correlation between the expression levels of gene 1 and gene 2 in cases versus controls, or between genotypes, which translates into an average difference in product between these groups.

https://doi.org/10.7554/eLife.40538.002

Leveraging this approach, we explore the effects of environmental and genetic variation on the degree to which two molecular traits are correlated. First, we test the hypothesis that stressful physiological conditions (here, bacterial infection or metabolic syndrome) lead to a loss of correlation among molecular traits that are correlated under normal conditions. We refer to this loss of correlation in an infected or diseased state, relative to a baseline or healthy state, as ‘decoherence’. We test for decoherence using gene expression data derived from monocytes at baseline or following stimulation with lipopolysaccharide (Fairfax et al., 2014), a component of bacterial cells walls and a potent stimulant of the innate immune response (Fairfax et al., 2014). We also test for decoherence using blood-derived NMR metabolite data from the Young Finns Study (YFS) (Raitakari et al., 2008). Strikingly, we find strong evidence in both datasets. Finally, we use CILP to map ‘correlation QTL’, defined as SNPs that affect the magnitude of the correlation between two mRNA transcripts. Using genotype and whole blood-derived gene expression data from the Netherlands Study of Depression and Anxiety (Penninx et al., 2008), we identity and replicate hundreds of correlation QTLs. Together, our new approach allows us to identify genetic variants and environmental factors that disrupt molecular co-regulation. Further, the flexible, robust approach we propose opens the door to future investigations of the causes and consequences of trait covariation in many contexts.

Results

Using 'Correlation by Individual Level Product' (CILP) to test for sources of variance in correlation

Let $y_{1}, y_{2}$ be two outcomes measured across individuals in a population sample, with means $\bar{y_{1}}, \bar{y_{2}}$ and variances $σ_{1}^{2}, σ_{2}^{2},$ respectively. We wish to associate the Pearson correlation coefficient between these two variables, $E [(y_{1} - \bar{y_{1}}) (y_{2} - \bar{y_{2}})] / \sqrt{σ_{1}^{2} σ_{2}^{2}}$ , with some predictor variable $x$ . We propose the following statistical test. First, we calculate the demeaned, element-wise product of the outcomes: $[(y_{1} - \bar{y_{1}}) (y_{2} - \bar{y_{2}})]$ , and then normalize by the square root of the product of variances: $[(y_{1} - \bar{y_{1}}) (y_{2} - \bar{y_{2}}) / \sqrt{σ_{1}^{2} σ_{2}^{2}}]$ . The resulting vector of values represents the estimate of the correlation within each individual in the sample, which can subsequently be modeled using approaches appropriate for continuous outcome variables. In practice, we use linear models or linear mixed effects models to test for associations between a given set of products and a fixed effect predictor variable, while controlling for covariates. We note that for our correlation QTL analyses, described later in the text, we use a modified version of CILP for computational efficiency and scalability. This modified version is described in the section ‘Comparison of approaches for quantifying correlation’ (see also Supplementary file 1).

Simulations reveal power to detect sources of variance in correlation across many scenarios

To confirm that our approach does not result in biased p-value distributions, and to understand the power of CILP across a range of effect sizes and sample sizes, we first performed extensive simulations. We focused our simulations on the identification of ‘correlation QTL’, defined as SNPs that affect the magnitude of the correlation between two mRNA transcripts; however, we note that the results are generalizable to other predictor variables. For each simulation, we generated 10,000 gene pairs that differed in covariance as a function of genotypic class. We did so using the multivariate normal distribution to simulate pairs of continuous distributions (but see Figure 1—figure supplement 1 for results where count data were simulated from a negative binomial distribution). Following simulation of gene expression data, we used CILP and linear models to detect differences in correlation as a function of genotype.

With an effect size of 0.3 and n = 1000, we detected 98.19% of simulated true positive correlation QTLs at a threshold of p=0.05, and 50.91% of true positives after correcting for multiple hypothesis testing (using Bonferroni correction). Under the null, where the effect size was set to zero, we detect 5.19% of correlation QTLs at a threshold of p=0.05 and no correlation QTL after correcting for multiple testing (Table 1, Figure 1—figure supplement 1). To ensure that changes in the mean, variance, or covariance of gene expression levels do not increase our false positive rate, we also assessed power to detect correlation QTLs when: (i) the focal SNP affected mean gene expression levels (e.g., through an eQTL or proportion QTL); (ii) the focal SNP affected the variance of gene expression levels (e.g., through a varQTL); (iii) the mean or variance of gene expression levels were affected by a variable unrelated to genotype (e.g., through batch, cell type, or environmental effects); and (iv) a variable unrelated to genotype affected the correlation structure of the entire transcriptome. None of these scenarios increased the false positive rate, though the presence of strong varQTLs did decrease power to detect correlation QTLs (Table 1, Figure 1—figure supplement 2). We also found that including the expression levels of both genes as covariates in our models did not improve our power to detect correlation QTLs (Figure 1—figure supplement 1).

Table 1

Power to detect simulated correlation QTL across a wide range of scenarios (n = 1000 for all simulations).

https://doi.org/10.7554/eLife.40538.005

Simulation parameters
(1) Correlation QTL effect^*	x	x	x	x	x	x	x	x
(2) Genotype predicts the mean of gene one expression levels^† (e.g., through an eQTL or proportion QTL)		x	x		x
(3) Genotype predicts the mean of gene two expression levels^†			x
(4) Genotype predicts the variance of gene one expression levels^‡ (e.g., through a varQTL)				x	x
(5) A variable that is random with respect to genotype predicts the mean of gene one expression levels^†(e.g., through technical, environmental, or cell type heterogeneity effects)						x
(6) A variable that is random with respect to genotype predicts the mean of gene two expression levels^†						x
(7) A variable that is random with respect to genotype predicts the variance of gene one expression levels^‡(e.g., through technical, environmental, or cell type heterogeneity effects)							x	x
(8) A variable that is random with respect to genotype predicts the variance of gene two expression levels^‡								x
Results
Power (proportion of true positives detected, Bonferroni-corrected p<0.05)	0.5091	0.4934	0.5055	0.0716	0.0679	0.4998	0.0884	0.0533
False positive rate (proportion of true negatives detected, nominal p<0.05)	0.0519	0.0529	0.0494	0.0532	0.0507	0.0527	0.0529	0.0522

Simulated effect sizes were as follows: correlation QTL^* = 0.3, mean effects^† = 1, variance effects ^‡ = 10. When Bonferroni-corrected p-values were used, the false positive rate was 0 across all simulation scenarios. Abbreviations: eQTL = expression QTL and vQTL = variance QTL.

Immune challenges results in decreased co-regulation of gene expression

Next, we used gene expression data collected from human monocytes (Fairfax et al., 2014), to ask whether patterns of co-expression differed between cells assayed at baseline versus following treatment with LPS for 2 or 24 hr (n = 214 samples were assayed across the three conditions). To limit our search space, we focused on 1460 genes that were differentially expressed at the 2 hr time point (FDR < 5%), and tested for differences in correlation between unexposed and LPS-exposed cells across 1,063,611 possible transcript pairs (equivalent to 1460 chose 2). We found 958 gene pairs with a significant (FDR < 5%) change in correlation between the uninfected/baseline state versus 2 hr post LPS stimulation, 461 of which replicated with similar effect sizes at the 24 hr time point (correlation between effect sizes at the two time points: R² = 0.12, p<10⁻¹⁶, binomial test for concordance of effect size direction: p<10⁻¹⁶; Figure 2).

Figure 2 with 1 supplement see all

Download asset Open asset

Bacterial infection (i.e., treatment with LPS) leads to decoherence in primary monocyte gene expression.

(a) Correlation changes have similar effect sizes after 2 hr or 24 hr of LPS stimulation. (b) Comparison of the Spearman correlation coefficient estimated in unstimulated cells (x-axis) versus those treated with LPS (y-axis). Gene pairs for which we detect significantly stronger correlations in the unstimulated (blue dots) or LPS treated (green dots) condition are highlighted. (c) An example of changes in correlation as a function of condition. Gene expression levels for OAS1 and OAS3 are negatively correlated at baseline, but become positively correlated after 2 hr or 24 hr treatment with LPS.

https://doi.org/10.7554/eLife.40538.006

We observed no relationship between the magnitude of the difference in mean gene expression levels between conditions and the magnitude of the difference in correlation between conditions, suggesting our results are not driven by statistical artifacts associated with large mean changes in transcription following LPS stimulation (Figure 2—figure supplement 1). Related to this point, we found minimal overlap (<5% for all comparisons) between genes with significant infection-associated variation in correlation and (i) genes that were differentially expressed after 2 or 24 hr of LPS treatment (FDR < 5%) or (ii) genes with eQTL that were specific to the baseline, 2 hr LPS treatment, or 24 hr LPS treatment samples (as reported in Fairfax et al. (2014)).

In total, we identified gene pairs that are more highly correlated in cells assayed post LPS stimulation versus at baseline, as well as gene pairs that lose correlation following a simulated bacterial infection. Overall, however, we found much greater support for the latter pattern: 61% (at the 2 hr time point) and 73% (at the 24 hr time point) of significant transcript pairs were more strongly correlated across individuals in uninfected versus LPS challenged cells. Importantly, this represents a statistically significant bias toward a loss of correlation (i.e. decoherence) following two hours of simulated bacterial infection (p=1.05×10⁻⁷, log₂odds = 0.503, Fisher’s exact test), with an even stronger bias toward decoherence after 24 hr of immune challenge (p<10⁻¹⁶, log₂odds = 0.941, Fisher’s exact test).

Metabolic syndrome disrupts metabolite co-regulation

Identification of disease-associated variation in metabolite correlations

We next applied our correlation test to whole blood-derived NMR metabolite data collected from a population-based, longitudinal study of young Finnish individuals (the cardiovascular risk in young Finns study, abbreviated ‘YFS’) (Raitakari et al., 2008; Nuotio et al., 2014). Our dataset included 159 metabolite measures collected across three time points that spanned a 10 year period (2001: n = 1564, 2007: n = 1498, 2011: n = 1501). We note, however, that our analyses did not focus on changes in correlation across time, but rather, changes in correlation between individuals that were healthy versus those that met the criteria for metabolic syndrome (Grundy et al., 2005).

To test the hypothesis that disease disrupts homeostasis and leads to decoherence, we asked whether specific pairs of metabolites were correlated in healthy individuals, but no longer correlated in individuals with metabolic syndrome. Across 11491 unique metabolite pairs that met our criteria for analysis (see Materials and methods), healthy individuals and those with metabolic syndrome showed broadly similar correlation structures (Figure 3). However, using CILP, we identified strong effects of health status on the degree of correlation between a subset of metabolite pairs. Specifically, we identified 1528 metabolite pairs that were more correlated in healthy individuals relative to those with metabolic syndrome, and 619 metabolite pairs that showed the opposite pattern (FDR < 5%; linear mixed model controlling for age, sex, year, and individual identity). This represents a 2.20x enrichment (p<10⁻¹⁶, Fisher’s exact test) of metabolite pairs that lose correlation following the onset of disease. Importantly, the set of 1528 metabolite pairs that lose correlation includes metabolites that are both more highly expressed (n = 93 metabolites, FDR < 5%) and more lowly expressed (n = 32) in individuals with metabolic syndrome, as well as metabolites that do not significantly differ between sample groups (n = 4).

Figure 3 with 1 supplement see all

Download asset Open asset

Metabolic syndrome leads to decoherence among particular metabolite pairs.

(a) Correlation matrices showing the magnitude of the Spearman correlation coefficient, for healthy individuals and those with metabolic syndrome. (b) Comparison of the Spearman correlation coefficient estimated in healthy individuals (x-axis) versus those with metabolic syndrome (y-axis). Overall, the magnitude of the correlation is similar between groups (R² = 0.62, p<10⁻¹⁶), though the slope of the line is already less than 1, indicating a tendency toward stronger correlations in healthy individuals (beta = 0.814). For the subset of metabolite pairs that display significantly stronger correlations in the healthy (blue dots) or metabolic syndrome class (green dots), this effect is more pronounced (beta = 0.597). (c) Categorical enrichment of metabolites that exhibit stronger pairwise correlations in the healthy or metabolic syndrome class (x-axis: log₂ odds ratio from a Fisher’s exact test; y-axis: 11 functional classes tested (annotations taken from (Nath et al., 2017)). Asterisks indicate significant enrichment. (d) A composite measure of metabolites that are strongly decoherent at the first time point (i.e., that show decreases in correlation in individuals with metabolic syndrome) can predict an individual’s future health status. Y-axis: Principal component 1 of 34 metabolites, measured at the first time point, with the strongest evidence for dysregulation. X-axis: values are stratified by whether an individual was healthy at the first (t0) and the last (t3) time point, had metabolic syndrome at t0 and t3, or developed metabolic syndrome between t0 and t3 (linear model, p=3.03×10⁻¹⁰).

https://doi.org/10.7554/eLife.40538.008

Metabolic syndrome affects co-regulation of particular metabolite classes

We found overall support for the hypothesis that metabolic syndrome disrupts the correlation structure that exists in healthy individuals. To identify the specific pathways and processes targeted by metabolic syndrome, we assigned each metabolite in our dataset to one of 11 functional classes (as in Nath et al. (2017); Supplementary file 2), and asked whether each class was enriched for metabolites that exhibited decreases in correlation. Here, we found the strongest enrichment for apolipoproteins (hypergeometric test, odds ratio = 1.94, p=2.34×10⁻¹³), measures of total cholesterol (odds ratio = 1.39, p=1.58×10⁻¹⁵), and small molecules involved in energy metabolism (odds ratio = 1.43, p=2.75×10⁻¹³). Among metabolite pairs that showed the opposite pattern (i.e., were more strongly correlated in individuals with metabolic syndrome), we found an enrichment of fatty acids (odds ratio = 1.30, p=7.10×10⁻⁶), as well as HDL (odds ratio = 1.18, p=1.53×10⁻⁵) and LDL lipoproteins (odds ratio = 1.28, p=3.58×10⁻⁶) (Figure 3; Supplementary file 3). Together, these results suggest that metabolite co-regulation is strongly perturbed by disease, and further, that particular classes of metabolites are more sensitive and prone to decoherence than others.

Finally, because our dataset included metabolite data collected from the same individuals across multiple time points, we asked whether metabolite pairs that became dysregulated (i.e., lost correlation with disease) at the first time point could be used to identify individuals that would develop metabolic syndrome at a later time point. To do so, we performed PCA on the 34 metabolites that displayed the strongest evidence for decoherence at the first time point, and used PC1 to ask whether individuals that developed metabolic syndrome at the last time point already showed diagnostic changes (see Materials and methods). Strikingly, we found that PC1, constructed only using data from the first time point, could distinguish between individuals that remained healthy across all time points and individuals that were healthy at the first time point but developed metabolic syndrome later on (R² = 0.056, p=3.03×10⁻¹⁰, AIC = −448.40; Figure 3). This effect was stronger than that observed in parallel analyses using triglyceride levels at the first time point (a classic biomarker of metabolic syndrome; R² = 0.046, p=6.32×10⁻⁹, AIC = -432.61) or an index created from the 34 metabolites that exhibited the strongest mean differences between healthy and metabolic syndrome individuals at the first time point (R² = 0.037, p=1.42×10⁻⁶, AIC = −422.08).

Genetic variation impacts co-expression of metabolism-related genes

Detection and replication of hundreds of correlation QTLs

Finally, we used CILP to identify genetic variants that control the degree of correlation between a pair of gene transcripts, a pattern we refer to as ‘correlation QTLs’. Genotypic effects on co-expression could arise through several possible mechanisms. For example, a SNP that disrupts a transcription factor (TF) binding site in a promoter would lead to low levels of co-expression between the TF and the target gene, but only for individuals carrying the disrupting variant. In these individuals, increased expression of the TF would fail to increase expression of the target gene, leading to an association between SNP genotype and variation in co-expression. This is one mechanistic scenario that has been repeatedly proposed to generate variation in co-expression (de la Fuente, 2010; Gaiteri et al., 2014), with some empirical support (Nath et al., 2017). However, the relationship between genetic variation and co-expression is largely unexplored, suggesting that alterative mechanisms may exist that have yet to be uncovered.

To map SNPs that affect the magnitude or direction of pairwise gene expression correlations, we applied CILP to genotype and whole blood-derived gene expression data from the Netherlands Study of Depression and Anxiety (abbreviated 'NESDA' (Penninx et al., 2008)). We used a filtered set of 93,197 SNP genotypes and 33,302 gene expression measurements collected for 2477 individuals in our discovery dataset and 1337 individuals in our replication dataset (see Materials and methods). Because our dataset was not well-powered to test all possible pairwise combinations of gene expression measurements against all SNPs, and because we were interested in understanding co-expression patterns among genes important to metabolic diseases, we focused on the 475 probes most strongly associated with BMI. In total, we identified 484 associations between a SNP and variation in co-expression in our NESDA discovery dataset (at a 10% FDR threshold, corresponding to p<4.6×10⁻⁹; linear model controlling for sex, age, smoking behavior, major depressive disorder, red blood cell counts, year of sample collection, study phase, and the first five principal components from a PCA on the filtered genotype call set). These 484 associations involved 247 unique probe sets and 51 genotyped SNP, each of which was involved in 1–424 (mean ±s.d.=3.91 ± 26.9) and 1–173 (mean ±s.d.=9.43 ± 34.1) correlation QTLs, respectively (Figure 4; Supplementary file 4).

Figure 4 with 6 supplements see all

Download asset Open asset

CILP approach reveals hundreds of correlation QTLs.

(a) Example of a correlation QTL, where the SNP rs10953329 controls the magnitude of the correlation between the mRNA expression levels of *POC1B* and *RIOK3.* (b) Gene pairs involved in significant (FDR < 10%) correlation QTL. Each black segment represents a gene, and each line connecting two segments represents a significant correlation QTL. Lines are colored by the identity of the SNP controlling the magnitude of the correlation between the gene pair. (c) Many correlation QTL identified in the NESDA discovery set (n = 2477) replicate in a second set of NESDA participants (n = 1337). Plot shows effect sizes for each correlation QTL, estimated in the discovery or replication cohort (effect sizes are derived from in *matrixEQTL* (Shabalin, 2012)). Points are colored to indicate whether a given correlation QTL passed Bonferroni correction in the replication dataset.

https://doi.org/10.7554/eLife.40538.010

To confirm our results, and overcome any potential biases raised by sub-selection of the data, we replicated our correlation QTL in a separate set of NESDA participants. Here, we found that 304/484 correlation QTLs replicated (at a Bonferroni corrected p=1.03×10⁻⁴, n = 1337; 428/484 correlation QTLs replicated at a 10% FDR; Figure 4). Further, in whole-blood derived gene expression data from 1414 YFS participants, we found that 47/74 testable correlation QTLs replicated at a 10% FDR (0 replicated at a Bonferroni corrected p=6.76×10⁻⁴, but the direction of the effects consistently agreed across datasets, binomial test, p<10⁻¹⁶; see Materials and methods).

General characteristics of correlation QTLs

For the list of 484 correlation QTLs we identified, we performed several follow-up analyses to gain mechanistic insight. First, we asked about the physical location of correlation QTLs relative to the genes for which they altered correlation structure (e.g., are the correlation QTLs acting in cis or trans), as well as the degree to which correlations QTLs also had mean or variance effects on gene expression (Supplementary file 5; Figure 4—figure supplement 1). Here, we found that 73.76% of correlation QTL were located on the same chromosome as one of the two correlated genes, 14.57% were located on the same chromosome as both correlated genes, and 11.57% were acting in trans on both genes. Further, 87.81% of significant correlation QTL were also cis eQTL or varQTL for one of the correlated genes, but no correlation QTL had mean or variance effects on both genes (Figure 4—figure supplement 1).

Next, we asked whether the set of genes involved in significant correlation QTLs were enriched for (i) particular biological processes and pathways (using Gene Ontology annotations (Harris et al., 2004)), (ii) known TFs, or (iii) known targets of TFs. For analysis (ii) and (iii), we drew on a public database of known TF-target gene interactions created through sentence-based text-mining followed by manual curation (Han et al., 2015). We strongly expected genes involved in correlation QTLs to be enriched for TFs or known targets of TFs, given that the mechanisms that have been proposed to generate genetic effects on variation in correlation almost universally involve genotype-dependent TF activity or disruption of TF binding sites (de la Fuente, 2010; Gaiteri et al., 2014). In support of these ideas, genes involved in significant correlation QTLs were 1.90x more likely to be TFs (hypergeometric test, p=1.7×10⁻³) and 1.52x more likely to be known targets of transcription factors (p=6.7×10⁻⁴) relative to the background set of all genes tested. Additionally, genes involved in significant correlation QTLs were enriched for biological processes involved in metabolism and metabolic disease (James et al., 2012; Rani et al., 2016) such as cellular response to oxidative stress (hypergeometric test, odds ratio = 2.21, p<10⁻¹⁶), intracellular signal transduction (log₂ odds ratio = 1.44, p<10⁻¹⁶), and mitophagy (the selective degradation of mitochondria; log₂ odds ratio = 0.82, p=0.018; Supplementary file 6).

Potential mechanisms generating correlation QTL

Many (87.81%) of the correlation QTL we uncovered involve three SNPs (rs1384673, rs333170, and rs829373), which have strong effects on the mean expression of one gene, RAP1GAP (Figure 4—figure supplement 2). The function of RAP1GAP is to convert the transcription factor RAP1 from its active GTP-bound form to its inactive GDP-bound form, and genes involved in correlation QTL with RAP1GAP tend to be closer to RAP1 binding sites (two-sided Wilcoxon-signed rank test, p=0.04) and to be bound in vivo by RAP1 (Fisher’s exact test, odds ratio = 2.14, p=0.03; ChIP-seq data are from mouse embryonic fibroblasts (Martinez et al., 2010)). We hypothesize that the RAP1GAP correlation QTL ‘hotspot’ is thus generated because genotype controls whether RAP1 and its targets are constitutively ‘on’ or ‘off’, and variation in RAP1 activity levels only modulates the expression of RAP1 target genes within the ‘on’ genotype (see schematic in Figure 4—figure supplement 3). In addition to uncovering a mechanism that can generate large-scale effects on correlation – specifically, cases where the targets of a given TF are constitutively repressed in one genotypic class, but dynamically regulated in another – this example is of interest because previous work has found evidence for converge of trans eQTL effects on RAP1GAP in whole blood (Fehrmann et al., 2011), and has also implicated RAP1 in the development of insulin resistance, inflammation, and obesity (Kaneko et al., 2016).

Though the correlation QTLs we uncovered near RAP1GAP were the only ‘hotspot’ (Figure 4), we also found a number of other intriguing examples. For SNPs that regulated the co-expression of a given gene pair in trans, we found scenarios such as: (i) regulation of co-expression between genes annotated for the same GO terms (e.g., GLRX5 and RPS29 are both annotated for ‘metabolic processes’); (ii) regulation of co-expression between a TF and its known target gene (e.g., GATA2 and RAB24); and (iii) regulation of co-expression between disease-associated genes (e.g., SLC4A1 and TPM1 are associated with renal (Alper, 2003; Yenchitsomanus et al., 2005) and cardiovascular disease (England et al., 2017), respectively). We note that in these cases the mechanism that induces a correlation QTL is unclear; nevertheless, our correlation QTL screen uncovers novel loci regulating the relationship between disease and metabolic genes that could be the focus of future follow-up work.

Discussion

Patterns of transcriptional correlation are widely considered to arise from co-regulation between genes. The analysis of co-expression has become an essential tool for the functional interpretation of transcriptional variation (de la Fuente, 2010; Gaiteri et al., 2014), with increasing relevance for medical diagnosis (van Dam et al., 2017; Vidal et al., 2011; Barabási et al., 2011) and our understanding of trait evolution (Wang et al., 2013; Sun et al., 2008; Southworth et al., 2009; Fukushima, 2013; Huang et al., 2015). However, we still have a primitive understanding of the factors that shape correlations between genes. Specifically, how do environmental perturbations alter essential patterns of co-regulation? And, to what degree are some genotypes better than others at buffering these disruptive effects? To address these questions, we developed a novel, generalizable approach to test whether any predictor variable (e.g., environment, genotype, or another variable of interest) affects the degree of correlation between a pair of measured variables. This simple approach, which relies on calculating the product between two variables after normalization, allows us to produce an individual level estimate of correlation rather than a summary statistic for a population (Figure 1). Consequently, we are able to study correlation as a bona fide trait, and leverage the flexibility of statistical linear modeling approaches to identify factors associated with variation in correlation between individuals.

We begin by investigating the influence of two environmental perturbations on molecular co-regulation. First, we use data from a controlled experiment, where human monocytes were challenged with a simulated bacterial infection (Fairfax et al., 2014). Second, we draw on a dataset of 159 metabolites collected from a population-based, longitudinal study of Finnish individuals (Raitakari et al., 2008), and contrast correlation patterns between healthy individuals versus those suffering from metabolic syndrome. In both instances, we find that stressful environmental exposures (infection and disease) lead to decoherence, manifested as a widespread decrease in the magnitude of pairwise correlation coefficients between mRNA transcripts or metabolites (Figure 3).

Although we expect the relationship between mRNA transcripts and metabolites to change following an environmental perturbation (e.g., both LPS stimulation and metabolic syndrome clearly affect mean gene expression or metabolite levels (Fairfax et al., 2014; Nath et al., 2017)), the consistent direction of effects we observe on the correlation is striking. Specifically, of all metabolite and mRNA transcript pairs that are significantly differentially correlated, there is a strong directional bias toward a decrease in correlation following the environmental perturbation: 61–74% of significant pairs follow this pattern. In other words, under stress, some genes and metabolites that are typically co-regulated no longer are. Similar biased loses in correlation have also been observed in miRNA pairs measured in the plasma of patients suffering from cognitive impairment versus healthy controls (Kayano et al., 2016), as well as in gene expression data collected from aging versus young mice (Southworth et al., 2009) and in a wide range of cancers (Anglani et al., 2014). However, as this pattern has generally not been tested for explicitly, the degree to which it commonly characterizes disease, aging, or environmental perturbations remains to be seen.

The loss of transcriptional robustness we observe is an intuitive extension of decanalization models (Gibson, 2009b). Decanalization models posit that, through many generations of stabilizing selection, biological systems evolve to maintain homeostasis under a certain range of environmental (or genetic) perturbations, and changing the environment beyond this range will result in homeostatic breakdown (Gibson, 2009a; Gibson, 2009b). This ‘breakdown’ has traditionally been thought of as an increase in variance (Geiler-Samerotte et al., 2018; Gibson and Wagner, 2000), and we note that our discussion of decanalization as a change in correlation is a departure from traditional models. Nevertheless, we believe we are potentially seeing decanalization play out at the molecular level, where formerly co-regulated molecules become dysregulated following a shift in the environment. Importantly, the longitudinal nature of the YFS dataset allowed us to track the health status of the same individuals over time and test the hypothesis that decoherence at the molecular level is associated with disease risk. In support of decanalization models, we found that metabolite pairs that lose correlation following disease onset at the first-time point could be used to distinguish between perpetually healthy individuals and those that would develop disease later on. This set of metabolites are particularly interesting from a clinical perspective, as they appear to be especially sensitive to the homeostatic breakdown associated with metabolic syndrome, and could thus potentially serve as biomarkers.

In our final set of analyses, we investigated the role of genetic variation in driving variation in co-expression. To date, work on differential co-expression has largely focused on contrasts between cases and controls (de la Fuente, 2010; van Dam et al., 2017; Rotival and Petretto, 2014), between tissues (Saha et al., 2017; Gao et al., 2016), or between individuals inhabiting different environments (de la Fuente, 2010; Hsu et al., 2015), while the genetic basis of differential co-expression has received much less attention. Our ability to map correlation QTLs directly using CILP fills this gap, and reveals that correlation between mRNA transcripts is under genetic control. Our analyses build on three recent studies that came to similar conclusions using more indirect approaches and/or much smaller sample sizes. First, Nath et al. identified correlated groups of mRNA transcripts and metabolites (termed ‘modules’), and performed genome-wide scans to associate specific SNPs with variation in each module (Nath et al., 2017). In a related analysis, Lukowski et al. demonstrated that mRNA transcripts with evidence for genetic correlations were more likely to be regulated by the same eQTL (Lukowski et al., 2017). Finally, van der Wijst recently leveraged single-cell mRNA-seq data from 45 individuals to build personalized co-expression networks, and used these data to identify genetic variants that predicted inter-individual variation in correlation structure (van der Wijst et al., 2018). Together, these studies point toward pervasive genetic control of co-expression, and CILP provides the tools to probe this form of regulation even further. In particular, mapping correlation QTLs across different environments, and compiling a more mechanistic picture of how genetic variation affects correlation dynamics (e.g., using ChIP-seq or ATAC-seq data integrated with correlation QTL scans) are major priorities for future work.

Our transcriptome analyses focused on pre-defined gene sets because of our need to limit the search space. In particular, we focused on differentially expressed or BMI-associated genes (in the Fairfax et al (Fairfax et al., 2014). and NESDA (Penninx et al., 2008) datasets, respectively), because testing for variation in correlation across all genes expressed in a given tissue would require tens of millions of tests. Applying CILP using unbiased screens will require extremely large sample sizes to overcome the multiple testing burden. Importantly, power to detect variation in correlation using CLIP does increase with sample size. This would not be the case with other methods that compare group rather than individual-level variation in correlation (de la Fuente, 2010; van Dam et al., 2017); for these methods, increasing sample size would increase the precision of the estimated correlation in a given group, but not directly the power to detect differential correlation.

Lastly, we focused on human genomics in this study because of the availability of large datasets with individual-level data. However, our study builds on extensive previous work in model systems investigating variation in co-expression in yeast (Wang et al., 2013; Sun et al., 2008), mice (Southworth et al., 2009), Arabidopsis (Fukushima, 2013; Mine et al., 2018; Zhang et al., 2017), and Drosophila (Huang et al., 2015), all of which point to the critical importance of co-regulation in complex trait variation. The framework we outline here could be particularly useful in conjunction with a growing number of multi-parent mapping populations (Long et al., 2014; Iraqi and Collaborative Cross Consortium, 2012; de Koning and McIntyre, 2017; Cubillos et al., 2013; Pascual et al., 2016), which provide unprecedented statistical power in model systems. In particular, CILP paired with controlled manipulations in model systems could allow researchers to uncover genotype x environment effects on correlation, a problem that has been notoriously difficult to study in humans. Similarly, while here we focused exclusively on molecular traits, our approach could be paired with many additional data types. Moving beyond genetics, CILP could be used to identify the drivers of community level correlations in ecological datasets, tradeoffs (that manifest as negative correlations) between different fitness components or life history traits, and much more. In essence, questions of how and why correlations between molecular or organism-level phenotypes vary is at the heart of many fields, and we anticipate that our approach will thus be widely applicable.

Share this article

Cite this article

Illustration of decoherence and ‘Correlation by Individual Level Product’ (CILP).

Power to detect simulated correlation QTL across a wide range of scenarios (n = 1000 for all simulations).

Bacterial infection (i.e., treatment with LPS) leads to decoherence in primary monocyte gene expression.

Metabolic syndrome leads to decoherence among particular metabolite pairs.

CILP approach reveals hundreds of correlation QTLs.

Author details

Amanda Lea

Contribution

Contributed equally with

Competing interests

Meena Subramaniam

Contribution

Contributed equally with

Competing interests

Arthur Ko

Contribution

Competing interests

Terho Lehtimäki

Contribution

Competing interests

Emma Raitoharju

Contribution

Competing interests

Mika Kähönen

Contribution

Competing interests

Ilkka Seppälä

Contribution

Competing interests

Nina Mononen

Contribution

Competing interests

Olli T Raitakari

Contribution

Competing interests

Mika Ala-Korpela

Contribution

Competing interests

Päivi Pajukanta

Contribution

Competing interests

Noah Zaitlen

Contribution

Contributed equally with

Competing interests

Julien F Ayroles

Contribution

Contributed equally with

For correspondence

Competing interests

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism

Further reading