Abstract
Over a quarter of drugs that enter clinical development fail because they are ineffective. Growing insight into genes that influence human disease may affect how drug targets and indications are selected. However, there is little guidance about how much weight should be given to genetic evidence in making these key decisions. To answer this question, we investigated how well the current archive of genetic evidence predicts drug mechanisms. We found that, among well-studied indications, the proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline, from 2.0% at the preclinical stage to 8.2% among mechanisms for approved drugs, and varies dramatically among disease areas. We estimate that selecting genetically supported targets could double the success rate in clinical development. Therefore, using the growing wealth of human genetic data to select the best targets and indications should have a measurable impact on the successful development of new drugs.
Main
Attrition is a major challenge in drug discovery and development, with more than half of clinical studies failing because of lack of efficacy1,2,3,4. The widespread failure of preclinical model systems to adequately predict efficacy in humans has led drug developers to look for other sources of evidence to inform decisions about which targets to pursue and for which indications (disease or reason for treatment for which a drug is approved). Since the completion of the Human Genome Project and the rise of genome-wide association studies (GWAS) and whole-genome and whole-exome sequencing studies, there has been rapid progress in identifying the genes that influence human health and disease5. These genetic insights can potentially transform the process of selecting the best drug targets and indications6, the key decisions in drug discovery. There are several examples of genes associated with disease traits that have been proven to be effective drug targets. One canonical example is the target for statins, HMGCR, which has been associated with serum cholesterol levels7. Several other examples were recently highlighted for rheumatoid arthritis8. Such examples and the rapidly growing body of human genetic data led us to ask how much weight should be given to genetic associations when choosing which drug targets to pursue for a desired indication.
Results
In this study, we go beyond previous work on drug repositioning9 to investigate how well clinically successful drug mechanisms (the protein product modulated to elicit a clinical response) are predicted by known genetic associations and how that prediction may change across the drug development pipeline, from preclinical and clinical phases to launched drugs (Drug Development Process; see URLs). An overview of the data sources, filtering and processing applied is provided in Figure 1a. To broadly capture statistically significant (P ≤ 1 × 10⁻⁸) common variant genetic associations, we used GWASdb10, which combines data from multiple sources, including the National Human Genome Research Institute (NHGRI) GWAS Catalog, the tables and supplementary materials of manuscripts archived in the NHGRI GWAS Catalog, and the database of Genotypes and Phenotypes (dbGaP), among others. To allow comparisons among all data sources, we manually mapped all traits to the most specific Medical Subject Heading (MeSH) terms applicable. Genetic variants were mapped to potential causal genes using a combination of linkage disequilibrium (LD), position, expression quantitative trait locus (eQTL) and epigenetic data (for example, see Fig. 1b). When we observed multiple possible variant-to-gene mappings, these were ranked on the overall strength of evidence. In the final data set, we had 18,566 genetic associations to 434 MeSH traits that mapped to 6,120 genes outside of the extended major histocompatibility complex (xMHC), with a total of 13,855 gene-trait combinations. Genes involved in rare, mendelian traits were derived from Online Mendelian Inheritance in Man (OMIM), providing a data set with 1,898 genes annotated to affect 2,145 traits with MeSH terms, for a total of 2,627 gene-trait combinations. The GWASdb and OMIM gene-MeSH pairs were largely non-overlapping, yielding a combined set of 16,459 gene-trait combinations (Supplementary Fig. 1 and Supplementary Table 1).
Information about drugs across the various stages of development was drawn from the commercial Informa Pharmaprojects database. Of a total of 61,104 drugs (including combination therapies; Supplementary Note), there were 22,270 drugs known to modulate 1,824 human non-xMHC drug targets for 705 indications, giving a total of 19,085 target-indication pairs (Supplementary Fig. 2 and Supplementary Tables 2,3,4). Aggregation of the drug information at the target and indication levels eliminated redundancies in drug mechanisms within the database, such as multiple formulations of the same drug or multiple drugs within the same drug class used to treat the same indications.
We found that the target genes for drugs approved in the United States or the European Union, our definition of 'successful drug mechanisms', were significantly enriched among genes associated with variation in human traits (Fig. 2). The greatest enrichment was for genes identified using OMIM (odds ratio (OR) = 7.2, P = 8.9 × 10⁻⁷⁴), where 206 of 389 (53%) target genes for approved drugs were also associated with a mendelian trait, a proportion comparable to that in a previous report11. Genes associated with traits through genome-wide associations were also significantly enriched (OR = 2.0, P = 2.9 × 10⁻¹⁰), particularly when genes were limited to the top-ranked gene for each associated variant (OR = 2.7, P = 1.3 × 10⁻¹⁴), with 98 (25%) genes in common. However, we also observed that genes considered to be classically druggable, having binding domains for small molecule drugs12 (n = 2,639), were also highly enriched among OMIM and GWASdb genes (OR = 1.9 and 1.7, respectively). To account for this relationship, we also assessed the enrichment of genetic associations within the druggable subset of the genome. In this analysis, there was decreased but still highly significant (P < 1 × 10⁻³) enrichment of the OMIM and top GWASdb genes (OR = 4.5 and 1.6, respectively). There was little added enrichment when considering the combined effects of OMIM and GWASdb. One potential explanation for the correlation between successful drug targets and evidence of genetic effects is that genes that result in notable phenotypic changes when altered are also the most responsive to drug-induced alterations. The greater enrichment among successful targets of genes that give rise to mendelian disorders in comparison to those involved in complex traits supports this explanation. Residual variance intolerance score (RVIS) was recently developed to assess the tolerance of a gene to mutational perturbation13. 
We observed a statistically significant association between genes falling within the lower quartile of the RVIS distribution (most intolerant to change) and approved drug status (OR = 2.1, P = 7.7 × 10⁻¹⁰). However, conditioning on RVIS had little impact on the effect of OMIM and GWASdb association status; hence, RVIS is an independent predictor of target success and not an explanation for the effect of genetic associations (Supplementary Note).
The analysis above did not take into account alignment between the drug indications and the associated traits. Therefore, we next investigated the percentage of approved target–indication pairs with a corresponding genetic association tied to the same gene for a similar trait. Using the structure of the MeSH hierarchy to estimate indication-trait similarity14 (Supplementary Fig. 3), we found that 239 of 395 (61%) approved drug indications had at least 1 genetic association (OMIM or GWASdb) with a similar trait (relative similarity ≥ 0.7) and that 158 (40%) approved indications had at least 5 associations reported. The approved drug indications having fewer than five genetic associations—such as anxiety, depression, headache, coronary restenosis and kidney stones—included both diseases where many studies have been done with little success and understudied areas of medical interest currently lacking substantial genetic investigation (Supplementary Table 5).
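The matching rule described above (same gene, relative trait similarity of at least 0.7) can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the study; the function name, the placeholder genes and the toy similarity values are hypothetical, and only the 0.7 cutoff comes from the text.

```python
# Minimal sketch (hypothetical names and toy data): a target-indication pair
# counts as genetically supported when the same gene has an association with
# a trait whose relative similarity to the indication is at least 0.7.

def supported_pairs(target_indications, gene_traits, similarity, cutoff=0.7):
    """Return the subset of (gene, indication) pairs with genetic support."""
    supported = set()
    for gene, indication in target_indications:
        for assoc_gene, trait in gene_traits:
            if assoc_gene != gene:
                continue
            # Identical terms are treated as similarity 1.0; unknown pairs as 0.0.
            if trait == indication:
                sim = 1.0
            else:
                sim = similarity.get(frozenset((trait, indication)), 0.0)
            if sim >= cutoff:
                supported.add((gene, indication))
                break
    return supported


# Toy inputs, not taken from the study data.
pairs = [("HMGCR", "Hypercholesterolemia"), ("GENE_A", "Osteoporosis"), ("GENE_B", "Asthma")]
assocs = [("HMGCR", "Cholesterol, LDL"), ("GENE_A", "Bone Density"), ("GENE_B", "Psoriasis")]
sims = {
    frozenset(("Cholesterol, LDL", "Hypercholesterolemia")): 0.9,
    frozenset(("Bone Density", "Osteoporosis")): 0.9,
    frozenset(("Psoriasis", "Asthma")): 0.2,
}

hits = supported_pairs(pairs, assocs, sims)
print(f"{len(hits)} of {len(pairs)} pairs supported")
```

Storing similarities under unordered pairs keeps the lookup symmetric, which matches the idea of a relative similarity between an indication and a trait.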
To assess the support that a genetic association provides to drug mechanisms, we focused on the subset of 158 approved drug indications with at least 5 genetic associations for a similar trait, taking this to signify that the indication has been reasonably well studied by genetic approaches (that is, focusing on instances where an opportunity exists for genetic data to support the target indication; Supplementary Table 5). Of 820 target-indication pairs, 67 (8.2%) were supported by one or more genetic associations when considering the combined evidence of both OMIM and GWASdb (Fig. 3a and Supplementary Table 6). Further, we found that there was significant variability among indication categories (P = 1.1 × 10⁻¹⁶; Fig. 3a), with the highest degree of genetic support for indications related to musculoskeletal, metabolic and blood categories (percent overlap of greater than 30%) and little or no genetic support for oncology, skin, eye and digestive categories. We observed that there was slightly greater support with GWASdb than with OMIM (4.5% versus 4.1%, respectively; Fig. 3b, Supplementary Figs. 4,5,6 and Supplementary Table 7), although the overlap with OMIM represented a much larger fraction of the total number of OMIM gene-trait associations in comparison to GWASdb (1.2% versus 0.27%, respectively). These results were somewhat sensitive to restricting the indications to those that had varying levels of genetic support, although a cutoff of at least five associations per indication yielded the best tradeoff between the number of indications considered and overall genetic support (Supplementary Fig. 7).
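The 8.2% figure is a binomial proportion (67 of 820), and the Methods state that exact binomial confidence intervals were used. The sketch below shows one way to compute such an exact (Clopper-Pearson) interval in plain Python by bisection on the binomial distribution function; the function names are hypothetical, the study itself used the binom package in R, and no interval values from the paper are reproduced here.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(0, k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def _root_increasing(g, tol):
    """Bisection on [0, 1] for a function g that increases from negative to positive."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(k, n, alpha=0.05, tol=1e-10):
    """Exact two-sided (1 - alpha) confidence interval for the proportion k / n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2.0, tol)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_increasing(
        lambda p: alpha / 2.0 - binom_cdf(k, n, p), tol)
    return lower, upper

low, high = clopper_pearson(67, 820)
print(f"67/820 = {67 / 820:.3f}, 95% exact interval = ({low:.3f}, {high:.3f})")
```

Bisection is used only to keep the example free of third-party dependencies; a beta-quantile routine from a statistics library would give the same bounds more directly.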
If genetic association data are predictive of successful mechanisms of action, then we would expect the percent of target-indication pairs with genetic evidence to increase the further the corresponding drug has progressed in the drug development pipeline, with approval representing a mechanism that has passed the highest evidentiary standards. This is just the pattern that we observed when considering OMIM and GWASdb together or separately (Fig. 3b), where in each instance the enrichment of genetic support for target-indication pairs was the lowest at the preclinical stage and increased in subsequent phases through drug approval. The genetic support increased from 2.0% for target-indication pairs that had only progressed as far as the preclinical stage to 8.2% for approved drugs, over a fourfold increase, suggesting that the odds of successful drug mechanisms with genetic support are many times greater than without. For new mechanisms in early development, we cannot rule out the influence that relatively recent GWAS may have had on the choice of targets and indications; however, such an influence would lead to an upward bias in the estimated overlap at that early stage and a downward bias in the increase in enrichment with progression. It is also possible that the reporting of successful drug mechanisms has influenced some gene-trait annotations that have been added to OMIM, although an informal review of several entries did not find this to be a likely contributor. The enrichment of genetic support we observe here is consistent with a recent AstraZeneca portfolio review3. Among 38 phase II programs, an OR of 3.5 (95% confidence interval (CI) = 0.73–20.6, P = 0.10) was observed in comparing the genetic support for projects that progressed to that for projects that did not.
Discussion
On one hand, there are limitations to the ability to identify the genes that are causally related to a genetic association, which, given our inclusive strategy to map all possible causal genes, could inflate our estimate of the proportion of successful drug mechanisms with genetic support. On the other hand, the information available about the functional genomic landscape is incomplete, and there will be many causal relationships left undetected or ascribed to the wrong gene, resulting in a bias of the enrichment estimates toward the null. However, the growing body of functional genomic information will continue to improve the ability to correctly ascribe a molecular pathway by which genetically associated variants influence traits. Such data can also help identify the causal mechanism underlying the association and inform what treatments could lead to a positive outcome in patients. In addition, catalogs of genetic variants that influence human traits are far from complete, which would lead to an underestimation of the proportion of drugs with genetic support. We have identified a number of therapeutic areas where there are large gaps in knowledge about the genetic factors involved, divided evenly across the pipeline (Supplementary Fig. 8). We advocate continued support for research on the genetics of these areas to aid in the development of more effective treatments. The availability of a precompetitive genetic resource similar to that produced for the purposes of this analysis that integrates all known genetic associations with measures of statistical confidence, using a common trait ontology, and integrates the most recent sources of functional genomic information to list and rank potential causal pathways would be an invaluable tool for the drug discovery process.
Another potential source of bias is that genetic associations could already be driving decisions on which drugs make it into clinical development and for which indications. Although this would have affected a small subset of the historical drug data, given that drug discovery and development timelines generally extend back well over 10 years, the impact of this bias would be to increase the proportion of drugs with genetic evidence earlier in the pipeline, leading to an underestimation of the relative benefit of genetic support. There may also be instances where known mechanisms for drugs could lead to targeted genetic research that finds supporting information, which would disproportionately affect the overlap with approved drugs. We would not expect these biases to measurably affect the GWAS-based results. However, there is greater potential for the manually curated results in OMIM to influence target selection or for drug targets to influence genetic research. We reviewed the 39 OMIM genes and traits that overlapped approved drug targets and indications (Supplementary Table 6) and found several potential instances where genetic information led to the development of therapeutics, including use of the gene product as a therapeutic, as in the case of von Willebrand disease where von Willebrand complex is used in treatment. This finding partially explains the greater overall enrichment of targets associated with traits in OMIM.
Ultimately, we want to know the probability that a therapeutic agent that properly engages the target protein at safe and efficacious doses in the relevant tissues will have the intended effects to prevent or treat disease in patients3,4. Several pieces of information required for a thorough analysis are missing from the public domain; most notably, there are relatively few data available on drugs that failed in clinical development and the reasons for these failures (Supplementary Note). However, with the historical information available on drug and, hence, target-indication progression through the clinical pipeline, we can derive estimates of the value the support of genetic information brings. Given the observations in our data, we estimated the ratio of the probability of progressing in the drug development pipeline given that the drug mechanism has the support of genetic information to the probability of the drug progressing without genetic support (Table 1 and Supplementary Note), where we considered support from GWASdb and OMIM in combination as well as separately. OMIM support yielded a slightly higher probability of success than GWASdb support. We estimated that genetic support had the largest impact on the probability of progressing from phase II to phase III (ratio = 1.5, combined), with the next largest impact for progression from phase I to phase II (ratio = 1.2, combined); the smallest apparent contribution was for progression from phase III to approved status (ratio = 1.1, combined). We also estimated the converse ratio of the probability of failure to progress in the absence of genetic support versus with support (Supplementary Note). 
As expected, we found that, overall, target-indication pairs that entered clinical development without genetic support were significantly less likely to reach drug approval (ratio = 1.3, 95% confidence interval = 1.2–1.5, combined), and the lack of genetic support had the greatest impact on progression earlier in the drug development process.
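The ratios discussed above compare the probability of progressing with genetic support to the probability of progressing without it. The sketch below illustrates that calculation from counts and then chains the three stage-wise combined ratios quoted in the text. The counts and the function name are hypothetical, the actual estimation procedure is described in the Supplementary Note and may differ, and the chaining step assumes that the stage-wise probabilities are conditional on reaching the previous stage, so that they multiply.

```python
def progression_ratio(progressed_with, entered_with, progressed_without, entered_without):
    """Ratio of P(progress | genetic support) to P(progress | no support)."""
    p_with = progressed_with / entered_with
    p_without = progressed_without / entered_without
    return p_with / p_without

# Hypothetical counts for one phase transition (not from the study).
ratio = progression_ratio(progressed_with=30, entered_with=50,
                          progressed_without=400, entered_without=1000)
print(f"relative probability of progression: {ratio:.2f}")

# Stage-wise ratios reported in the text (combined OMIM and GWASdb evidence).
stage_ratios = {"I->II": 1.2, "II->III": 1.5, "III->approval": 1.1}
overall = 1.0
for stage_ratio in stage_ratios.values():
    overall *= stage_ratio
print(f"chained phase I to approval ratio: {overall:.2f}")
```

Under that multiplicative assumption the chained value is close to 2, which is in line with an overall doubling of success from phase I to approval.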
The relatively low impact of genetic support on success in phase III is surprising, given that attrition rate estimates attribute most phase III failures to lack of efficacy2. It may be that failures in phase III are different in nature from those in earlier stages, for example, because they may reflect a failure to improve over standard of care rather than failure of the targeted biological mechanism to be causal for disease at all. Or it may be that, in phase III, study endpoints are more complex and less closely related to specific biological mechanisms, including the use of broad endpoints such as all major coronary events in cardiovascular outcome studies. In addition, we note the limitations of the available data. We rely on the latest stage to which a target-indication pair was reported to have progressed as a proxy for success and failure, although such data may be incomplete or even inaccurate in some cases. Furthermore, the interpretation of risk ratios is dependent on the absolute risk, which varies substantially by phase.
Overall, we estimate that drug mechanisms with genetic support would succeed twice as often as those without it (from phase I to approval). Therefore, increasing the proportion of discovery and development activities focused on targets with genetic support and allowing genetic data to guide selection of the most appropriate indications should lead to lower rates of failure due to lack of efficacy in clinical development.
Methods
Genetic data.
Genetic association data were drawn from the data available in GWASdb10 (version dated 21 May 2013), a manually curated database that brings together information from eight sources. We excluded all data from PharmGKB and the Genetic Association Database. Genetic associations reported from these two sources contained no supporting statistical association evidence (with most P values equal to zero) to accompany the entries, and the new associations included were largely drawn from candidate gene association studies that lacked rigorous criteria for reporting a statistical association. In particular, we found that there were a large number of candidate gene associations in PharmGKB for drug target genes, which would result in an upward bias in the number of drug targets with supposed genetic associations. We also excluded a few large metabolomic studies with numerous traits screened that had very large numbers of associations reported. Finally, we identified one study15 where a supplementary table was misinterpreted, leading to many falsely identified associations that were also excluded. For the variants, traits and P values reported, we removed any duplicate entries found across the various GWASdb data sources. For the purposes of this study, we set a P-value threshold of 1 × 10⁻⁸ to limit associations to those with relatively strong evidence. The OMIM database (accessed 3 October 2013) was used to provide additional information on the effects of genetic variants and mutations on human traits. Only entries with valid MeSH terms were included in the analyses reported here.
Genetic variant-to-gene mapping.
Variants with phenotypic associations were mapped to the genes that they could be causally affecting through a combination of approaches. First, all variants in LD having r² ≥ 0.5 with each associated variant were identified on the basis of the 1000 Genomes Project pilot sequence genotypes for the European-ancestry (CEU) population16. No effort was made to conduct LD pruning to represent independent associations as the purpose of our study was to identify all possible genes that could be responsible for the observed effect. For each variant in LD, the plausible mapping of a variant to a particular gene was performed using a combination of physical proximity to the gene, evidence for association of the variant with the expression of the gene and determination of whether the variant fell within a regulatory element predicted to affect the expression of the gene. The variant was mapped to the physical location of the gene plus or minus 5 kb on the basis of the longest gene transcript to define the gene boundaries plus 1.5 kb in UCSC-distributed RefSeq (v37.1) annotation. Gene eQTLs were drawn from eqtl.chicago.edu (accessed 21 May 2013), which includes eQTLs from several studies of several cell lines and primary tissues as well as the results from primary liver tissue17 at false discovery rate (FDR) ≤ 0.1, computed by Kruskal-Wallis test. To map variants to genes on the basis of regulatory evidence, we identified all variants that fell within a predicted transcription factor binding site located within a DNase I hypersensitive site (DHS) peak using RegulomeDB18 (accessed 7 February 2013). For variants with a RegulomeDB score ≤ 4, we determined whether the genomic location overlapped a DHS peak that was either colocated with a gene transcription start site (TSS) or a distal DHS peak that was correlated with a TSS DHS across cell lines, as described19 (data courtesy of J. Stamatoyannopoulos, University of Washington). 
Variants that affected the amino acid sequence of any gene transcripts were identified via the Ensembl Variant Effect Predictor from the European Bioinformatics Institute (EBI; accessed 27 February 2014). We restricted our analyses to genes reported in GENCODE (v17) or RefSeq (v37.1).
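The positional part of this mapping can be sketched as below: keep the associated variant and its LD proxies with r² of at least 0.5, then assign each to any gene whose boundaries, padded by 5 kb, contain the variant. This is an illustrative Python sketch only. The data structures, identifiers and coordinates are hypothetical, and the eQTL, regulatory and coding-variant evidence described above is not modelled.

```python
from collections import defaultdict

LD_R2_MIN = 0.5      # LD threshold stated in the text
WINDOW_BP = 5_000    # gene boundary padding stated in the text (plus or minus 5 kb)

def genes_for_association(lead_variant, ld_proxies, variant_pos, genes):
    """Map a lead variant to genes by position, via itself and its LD proxies.

    ld_proxies:  {lead_variant: [(proxy_variant, r2), ...]}
    variant_pos: {variant: (chrom, position)}
    genes:       {gene: (chrom, start, end)}
    """
    candidates = [(lead_variant, 1.0)]
    candidates += [(v, r2) for v, r2 in ld_proxies.get(lead_variant, [])
                   if r2 >= LD_R2_MIN]
    mapped = defaultdict(list)
    for variant, r2 in candidates:
        if variant not in variant_pos:
            continue
        chrom, pos = variant_pos[variant]
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start - WINDOW_BP <= pos <= end + WINDOW_BP:
                mapped[gene].append((variant, r2))
    return dict(mapped)

# Toy coordinates, not real annotation.
proxies = {"rs_lead": [("rs_p1", 0.8), ("rs_p2", 0.3)]}
positions = {"rs_lead": ("1", 100_000), "rs_p1": ("1", 120_000), "rs_p2": ("1", 200_000)}
gene_bounds = {
    "GENE_X": ("1", 116_000, 118_000),
    "GENE_Y": ("1", 198_000, 199_000),
    "GENE_Z": ("1", 95_500, 99_000),
}

mapping = genes_for_association("rs_lead", proxies, positions, gene_bounds)
print(sorted(mapping))
```

In this toy case GENE_Y is not reported because its only nearby variant falls below the LD threshold, which mirrors the inclusive but LD-bounded strategy described in the text.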
In many instances, a variant with a phenotypic association could be mapped to more than one gene using this combination of approaches. We devised an ad hoc scoring scheme to assess the relative weight of evidence for a causal relationship between the variant reported to be associated and each gene to which it was mapped (Supplementary Fig. 9), including the source of the association, the LD between the associated variant and the variant mapped to the gene, the nature of the mapping information and the number of times that the variant in LD had been associated with the trait. This scheme yielded a potential gene score between 0 and 11, with 11 reflecting the strongest evidence. The factors included in the gene scoring scheme were also used to rank the variant-to-gene mappings, such that the top-ranked gene for a particular variant presumably had the strongest evidence (Fig. 2). When two gene mappings had equal support, the ranking was arbitrarily decided.
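Once each variant-to-gene mapping has a score, selecting the top-ranked gene per variant is a simple reduction. The sketch below assumes the scores (0 to 11) have already been computed; the scoring components themselves are defined in Supplementary Fig. 9 and are not reproduced here. The function name and example values are hypothetical, and ties are resolved by keeping the first mapping encountered, which is one arbitrary choice among many.

```python
def top_ranked_genes(scored_mappings):
    """Pick the highest-scoring gene for each variant.

    scored_mappings: iterable of (variant, gene, score), with score in 0..11.
    Ties are broken arbitrarily; here the first mapping seen wins.
    """
    best = {}
    for variant, gene, score in scored_mappings:
        if variant not in best or score > best[variant][1]:
            best[variant] = (gene, score)
    return {variant: gene for variant, (gene, _score) in best.items()}

# Toy scores, not taken from the study.
example = [("rs1", "GENE_A", 7), ("rs1", "GENE_B", 9),
           ("rs2", "GENE_C", 4), ("rs2", "GENE_D", 4)]
top = top_ranked_genes(example)
print(top)
```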
Drug data.
Information about drugs, their gene targets, the indications for which they have been investigated and the latest stage of development to which they have progressed was derived from the commercial Informa Pharmaprojects database. Drugs were retained for analysis if (i) they were annotated to have human gene targets (on the basis of GENCODE v16), (ii) the gene did not map to the xMHC and (iii) the indication could be mapped to a MeSH term. Most analyses using Pharmaprojects were conducted using a transformation of the data into a single entry per gene target and indication with the latest phase in development to which that unique combination progressed for any drug. A target was defined as successful in treating an indication if a drug targeting that gene product was approved for the corresponding indication in the United States or the European Union, as annotated in Pharmaprojects.
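The transformation to a single entry per gene target and indication can be sketched as a group-by that keeps the most advanced phase reached by any drug. The phase labels, record layout and example rows below are hypothetical; the actual Pharmaprojects fields are not described in the text.

```python
# Hypothetical phase labels, ordered from earliest to latest.
PHASE_ORDER = ["Preclinical", "Phase I", "Phase II", "Phase III", "Approved"]
PHASE_RANK = {name: rank for rank, name in enumerate(PHASE_ORDER)}

def collapse_to_pairs(drug_records):
    """Reduce per-drug rows to one row per (target gene, indication).

    drug_records: iterable of (drug, target_gene, indication, phase).
    Returns {(target_gene, indication): latest_phase}.
    """
    latest = {}
    for _drug, gene, indication, phase in drug_records:
        key = (gene, indication)
        if key not in latest or PHASE_RANK[phase] > PHASE_RANK[latest[key]]:
            latest[key] = phase
    return latest

# Toy records: two drugs of the same class share a target and an indication.
records = [
    ("drug_1", "HMGCR", "Hypercholesterolemia", "Approved"),
    ("drug_2", "HMGCR", "Hypercholesterolemia", "Phase II"),
    ("drug_3", "GENE_A", "Asthma", "Phase I"),
    ("drug_4", "GENE_A", "Asthma", "Phase III"),
]
pairs_by_phase = collapse_to_pairs(records)
print(pairs_by_phase)
```

Collapsing in this way removes redundancy from multiple formulations or multiple drugs in one class, so each mechanism is counted once at its furthest stage.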
Medical Subject Heading term mapping and use.
We used the MeSH thesaurus to provide a common vocabulary among traits from GWASdb and OMIM and indications from Pharmaprojects. MeSH term mappings to OMIM traits were derived from Comparative Toxicogenomics Database mapping20. Mappings for GWASdb and Pharmaprojects were performed manually using the MeSH Browser by searching with each of the unique original terms listed in the respective database and identifying the overall best match. Some traits did not yield a satisfactory MeSH term. Any data entries missing MeSH terms were excluded from the primary analyses described in this study.
When comparing the overlap between traits with respect to evidence for genetic association and drug indications, we recognized that there could be many instances where the genetic evidence was for a trait very closely related to the indication but not an exact match. To allow for such near misses, we used similarity measures based on the MeSH ontology, implemented in the UMLS::Similarity Perl module14. Several measures of similarity and relationships are implemented in this package. We evaluated all of these measures on a subset of 50 randomly selected MeSH entries from our combined data set to assess how well the subsequent trait clustering reflected expert interpretation. On the basis of this evaluation, we selected two similarity measures that incorporated both path distance and information content, Resnik21 and Lin22. The measures were standardized to a measure of relative similarity between zero and one and averaged together to yield a final relative similarity measure for subsequent analysis. We noted that in some instances, because of the structure of the MeSH ontology, very closely related traits resulted in very low measures of similarity. Two examples are systolic or diastolic blood pressure with hypertension and bone mineral density with osteoporosis. To address this, we reviewed the laboratory-based MeSH terms and manually assigned relative similarity scores of 0.5, 0.7 and 0.9 on the basis of the known relationships between traits. The two examples above were assigned a relative similarity of 0.9. The manually assigned relative similarities are given in Supplementary Table 8. The relative similarity matrix used for the analyses is available in Supplementary Data Set 1. Each MeSH term was subsequently manually mapped to 1 of 20 disease categories (Supplementary Table 9).
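The two selected measures have standard definitions: Resnik similarity is the information content (IC) of the lowest common subsumer of two terms, and Lin similarity is twice that IC divided by the sum of the ICs of the two terms. The sketch below applies them to a toy single-parent hierarchy. The hierarchy, the concept probabilities and the standardization of Resnik by the maximum IC are assumptions made for illustration; the text does not specify how the measures were standardized, and the real MeSH ontology allows a term to have several parents.

```python
import math

# Toy is-a hierarchy (child -> parent) and toy concept probabilities.
PARENT = {
    "Hypertension": "Cardiovascular Diseases",
    "Heart Failure": "Cardiovascular Diseases",
    "Osteoporosis": "Bone Diseases",
    "Cardiovascular Diseases": "Diseases",
    "Bone Diseases": "Diseases",
}
PROB = {
    "Diseases": 1.0,
    "Cardiovascular Diseases": 0.4,
    "Bone Diseases": 0.2,
    "Hypertension": 0.1,
    "Heart Failure": 0.05,
    "Osteoporosis": 0.05,
}

def ancestors(term):
    """Return the term followed by its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def information_content(term):
    """IC(c) = log(1 / p(c)); the root has IC 0."""
    return math.log(1.0 / PROB[term])

def lowest_common_subsumer(a, b):
    b_ancestors = set(ancestors(b))
    for term in ancestors(a):  # walks from a toward the root
        if term in b_ancestors:
            return term
    raise ValueError("terms share no ancestor")

def resnik(a, b):
    return information_content(lowest_common_subsumer(a, b))

def lin(a, b):
    denom = information_content(a) + information_content(b)
    return 2.0 * resnik(a, b) / denom if denom > 0 else 1.0

def relative_similarity(a, b):
    """Average of Lin and Resnik, with Resnik scaled by the maximum IC (an assumption)."""
    max_ic = max(information_content(t) for t in PROB)
    return 0.5 * (resnik(a, b) / max_ic + lin(a, b))

near = relative_similarity("Hypertension", "Heart Failure")
far = relative_similarity("Hypertension", "Osteoporosis")
print(f"near = {near:.3f}, far = {far:.3f}")
```

Terms that meet only at the root receive a similarity of zero under both measures, which is the kind of structural artifact that motivated the manually assigned scores for closely related laboratory traits.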
Genetic association enrichment.
We assessed enrichment of genetic associations both without and with respect to the trait underlying the association. We assessed enrichment without respect to trait or indication as presented in Figure 2 by constructing a 2 × 2 table of genes in GENCODE (v17) or RefSeq (v37.1) and counts corresponding to the presence or absence of the gene as a target for a drug approved in the United States or the European Union versus the presence or absence of evidence for genetic association for each gene. Evidence of genetic association was further stratified by OMIM, any possible gene for each GWASdb association and the top gene (top ranked, as described above) for each GWASdb association. Enrichment for RVIS was based on published scores13, with stratification for the lowest quartile. The druggable genome was based on the description of Hopkins and Groom12. Odds ratios and 95% confidence intervals were estimated using the exact method implemented in fisher.exact in R.
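The enrichment analysis described above rests on a 2 × 2 table of genes (approved drug target or not, versus genetic association or not). The sketch below computes the sample odds ratio and a one-sided Fisher exact P value from the hypergeometric distribution. The function names and counts are hypothetical, and the sample odds ratio shown here is the simple cross-product ratio, which can differ from the conditional estimate that exact routines in R typically report.

```python
import math

def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P value: probability of a table with the same
    margins whose top-left cell is at least as large as the observed a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, col1)
    numer = 0
    for x in range(a, min(row1, col1) + 1):
        numer += math.comb(row1, x) * math.comb(n - row1, col1 - x)
    return numer / denom

# Toy table (not study data):
#                        associated   not associated
# approved target             3              1
# not approved target         1              3
or_hat = odds_ratio(3, 1, 1, 3)
p_one_sided = fisher_exact_greater(3, 1, 1, 3)
print(f"OR = {or_hat:.1f}, one-sided P = {p_one_sided:.4f}")
```

Working with exact integer binomial coefficients avoids floating-point underflow for the very small P values that arise with genome-scale tables.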
The overlap between genetic evidence and drug targets presented in Figure 3, taking traits and indications into account, was based on the direct overlap of gene and target names with a relative trait-indication similarity of at least 0.7. The confidence intervals presented were computed using the Clopper-Pearson (also written Pearson-Klopper) exact method implemented in the binom package in R. A permutation test (Supplementary Fig. 10) was performed to assess the significance of the observed overlap given the high degree of correlation among genes and traits in the data. In the permutation test, the null distribution was simulated by breaking the relationships between traits and genes in the genetic association data. This was done in a manner to maintain the relationships among genes associated with the same trait by permuting the traits and replacing all associations for the observed trait with the same permuted trait (for example, by replacing all genes originally associated with alopecia with those associated with type 2 diabetes in permutation 1, with those associated with Kawasaki disease in permutation 2, etc.). We conducted 10,000 replicates.
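The permutation scheme described above keeps each trait's gene set intact and shuffles which trait label the set is attached to. A minimal Python sketch follows. It uses exact trait-indication matches instead of the similarity cutoff, a small number of replicates, and an add-one correction to the empirical P value; these simplifications, the function names and the toy data are all assumptions for illustration.

```python
import random

def overlap_count(trait_to_genes, target_indications):
    """Number of (gene, indication) pairs whose gene is associated with that indication."""
    return sum(1 for gene, indication in target_indications
               if gene in trait_to_genes.get(indication, ()))

def permutation_p_value(trait_to_genes, target_indications, n_perm=1000, seed=0):
    """Empirical P value for the observed overlap under trait-label permutation."""
    rng = random.Random(seed)
    observed = overlap_count(trait_to_genes, target_indications)
    traits = list(trait_to_genes)
    gene_sets = [trait_to_genes[t] for t in traits]
    at_least_as_large = 0
    for _ in range(n_perm):
        shuffled = gene_sets[:]
        rng.shuffle(shuffled)              # whole gene sets move together
        permuted = dict(zip(traits, shuffled))
        if overlap_count(permuted, target_indications) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Toy data, not from the study.
genetics = {"T1": {"G1", "G2"}, "T2": {"G3"}, "T3": {"G4"}, "T4": {"G5"}, "T5": {"G6"}}
drug_pairs = [("G1", "T1"), ("G3", "T2"), ("G4", "T3"), ("G9", "T4")]

observed, p_value = permutation_p_value(genetics, drug_pairs)
print(f"observed overlap = {observed}, empirical P = {p_value:.4f}")
```

Shuffling whole gene sets, rather than individual gene-trait pairs, preserves the correlation among genes associated with the same trait, which is the property the test is designed to respect.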
All statistical analyses were conducted using R version 3.1.0 (ref. 23). Most figures were created using the R package ggplot2 (ref. 24).
Code availability.
The R scripts and Sweave files used to process the data and conduct the analyses described herein are available from the authors by request. All key analyses can be reproduced from Supplementary Data Sets 1,2,3,4 and the supplementary tables included.
URLs.
Drug Development Process, http://www.fda.gov/downloads/Drugs/ResourcesForYou/Consumers/UCM284393.pdf; GWASdb, http://jjwanglab.org/gwasdb; Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/; MeSH browser, https://www.nlm.nih.gov/mesh/MBrowser.html; UMLS::Similarity, http://www.d.umn.edu/~tpederse/umls-similarity.html; PharmGKB, https://www.pharmgkb.org/; Genetic Association Database, http://geneticassociationdb.nih.gov/; Informa Pharmaprojects database, http://www.citeline.com/; MeSH thesaurus, http://www.nlm.nih.gov/mesh.
References
1. DiMasi, J.A., Feldman, L., Seckler, A. & Wilson, A. Trends in risks associated with new drug development: success rates for investigational drugs. Clin. Pharmacol. Ther. 87, 272–277 (2010).
2. Arrowsmith, J. & Miller, P. Trial watch: phase II and phase III attrition rates 2011–2012. Nat. Rev. Drug Discov. 12, 569 (2013).
3. Cook, D. et al. Lessons learned from the fate of AstraZeneca's drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov. 13, 419–431 (2014).
4. Morgan, P. et al. Can the flow of medicines be improved? Fundamental pharmacokinetic and pharmacological principles toward improving Phase II survival. Drug Discov. Today 17, 419–424 (2012).
5. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).
6. Plenge, R.M., Scolnick, E.M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
7. Kathiresan, S. et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 41, 56–65 (2009).
8. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
9. Sanseau, P. et al. Use of genome-wide association studies for drug repositioning. Nat. Biotechnol. 30, 317–320 (2012).
10. Li, M.J. et al. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res. 40, D1047–D1054 (2012).
11. Wang, Z.Y. & Zhang, H.Y. Rational drug repositioning by medical genetics. Nat. Biotechnol. 31, 1080–1082 (2013).
12. Hopkins, A.L. & Groom, C.R. The druggable genome. Nat. Rev. Drug Discov. 1, 727–730 (2002).
13. Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
14. McInnes, B.T., Pedersen, T. & Pakhomov, S.V. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. AMIA Annu. Symp. Proc. 2009, 431–435 (2009).
15. Patsopoulos, N.A. et al. Genome-wide meta-analysis identifies novel multiple sclerosis susceptibility loci. Ann. Neurol. 70, 897–912 (2011).
16. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
17. Schadt, E.E. et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 6, e107 (2008).
18. Boyle, A.P. et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797 (2012).
19. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
20. Davis, A.P. et al. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res. 41, D1104–D1114 (2013).
21. Resnik, P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. 11, 95–130 (1999).
22. Lin, D. in Proc. Int. Conf. Machine Learning 296–304 (Morgan Kaufmann Publishers, 1998).
23. R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2014).
24. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
Acknowledgements
We would like to thank N. Srivastava for much of the manual mapping of GWASdb traits and Pharmaprojects indications to MeSH terms and P. Agarwal for many helpful conversations. GWASdb-related work was supported by the Research Grants Council, Hong Kong SAR, China (781511M, 17121414M) and the National Natural Science Foundation of China (91229105).
Author information
Contributions
This work was conceived by M.R.N., H.T., L.R.C., J.C.W. and P.S. The primary analyses were designed and conducted by M.R.N. Supporting analyses were provided by J.L.P. and J.S. The mapping of variants to genes was conducted by M.R.N., P.N., Y.S. and A.F. The GWASdb data were created and provided by P.C.S., M.J.L. and J.W. The manuscript was written by M.R.N. with contributions from H.T., J.C.W. and P.S.
Ethics declarations
Competing interests
M.R.N., H.T., J.L.P., J.S., L.R.C., J.C.W. and P.S. are employees of GlaxoSmithKline, a global healthcare company that may conceivably benefit financially through this publication.
Integrated supplementary information
Supplementary Figure 1 Summary of genetic association data and their traits and gene mappings.
Distribution of the (a) number of publications or sources and (b) reported associations for each unique MeSH term. (c) Distribution of the number of genes mapped for each MeSH term. (d) Distribution of the number of genes mapped to each SNP (excluding SNPs with no genes mapped; n = 5,272). (e) Distribution of P values for all unique associations. These summaries are limited to publications and sources with at least one association with a P value ≤1 × 10⁻⁸. Panels b and c were truncated at 50, panel d was truncated at 30 and panel e was truncated at 100. All values over those thresholds are shown at the maximum value. (f) Distribution of the number of genes for each unique MeSH term in OMIM.
Supplementary Figure 2 Summary of the drug data and their target gene and indications.
(a) Distribution of the number of drugs observed for each target gene in the analysis data set (truncated at 50). (b) Distribution of the number of target genes for each drug (i.e., multiple drug targets or combinations of therapeutic agents). (c) Distribution of the number of MeSH terms (i.e., unique indications) for each drug (truncated at 15). (d) Distribution of the number of drugs listed for each MeSH term (truncated at 100). (e) Distribution of the number of target genes for each MeSH term (truncated at 100). (f) Distribution of the number of MeSH terms for each target gene (truncated at 50).
Supplementary Figure 3 Illustrated use of the MeSH ontology to estimate relative similarity.
The methods used in this study (lin and resnik, implemented in UMLS::Similarity) combined both path length and information content. See the Online Methods for additional details.
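As a rough illustration of how information-content measures such as Resnik and Lin behave, the sketch below computes both on a toy taxonomy. The hierarchy, the term counts and the function names are hypothetical and are not taken from MeSH or from UMLS::Similarity; the normalization into a relative similarity described in the Online Methods is not reproduced here.

```python
import math

# Hypothetical toy taxonomy (child -> parent); this is not the real MeSH tree.
PARENT = {
    "disease": None,
    "autoimmune": "disease",
    "metabolic": "disease",
    "rheumatoid_arthritis": "autoimmune",
    "multiple_sclerosis": "autoimmune",
    "type_2_diabetes": "metabolic",
}

# Hypothetical usage counts for each term itself (descendants not included).
COUNTS = {
    "disease": 0, "autoimmune": 2, "metabolic": 2,
    "rheumatoid_arthritis": 4, "multiple_sclerosis": 4, "type_2_diabetes": 4,
}


def ancestors(term):
    """Return the term followed by all of its ancestors, nearest first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def information_content(term):
    """IC(c) = -log p(c), where p(c) covers the term and all its descendants."""
    total = sum(COUNTS.values())
    subtree = sum(n for t, n in COUNTS.items() if term in ancestors(t))
    return -math.log(subtree / total)


def lowest_common_subsumer(a, b):
    """Most specific term that is an ancestor of (or equal to) both a and b."""
    anc_b = set(ancestors(b))
    return next(t for t in ancestors(a) if t in anc_b)


def resnik(a, b):
    """Resnik similarity: information content of the lowest common subsumer."""
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    """Lin similarity: shared information scaled by the terms' own information."""
    denom = information_content(a) + information_content(b)
    return 2 * resnik(a, b) / denom if denom > 0 else 0.0
```

In this toy example, two terms under the same parent share some information, whereas terms that meet only at the root have a Resnik similarity of zero.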
Supplementary Figure 4 Overlap of drug targets with genetic associations by disease category and latest development phase.
(a) Overlap between drug targets and their indications with genetic associations for similar traits. The percentage of target-indication pairs overlapping with gene-trait combinations from GWASdb or OMIM for the latest development phase each pair achieved as recorded in Pharmaprojects. The number of unique target-indication pairs for each category at each phase is shown to the right of each plot. Exact 95% confidence intervals are shown. (b) Distribution of the number of target-indication pairs at each phase by category.
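For readers unfamiliar with exact binomial intervals, the following is a minimal sketch of one common construction (Clopper-Pearson), assuming that is what "exact" refers to here; the legends do not name the method. The function names are hypothetical, and the implementation is pure standard-library Python intended for small counts only.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); suitable for small n only."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, level=0.95):
    """Clopper-Pearson interval for x successes out of n trials."""
    alpha = 1 - level
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper
```

For example, `exact_ci(5, 10)` returns an interval that is symmetric about 0.5, and the interval narrows as the number of target-indication pairs in a category grows.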
Supplementary Figure 5 Overlap between drug targets and their indications with genetic associations for similar traits with genetic associations restricted to GWASdb only.
Overlap for (a) drugs approved in the United States or European Union and (b) the furthest development phase to which each target-indication pair progressed. Exact 95% confidence intervals are shown.
Supplementary Figure 6 Overlap between drug targets and their indications with genetic associations for similar traits with genetic associations restricted to OMIM only.
Overlap for (a) drugs approved in the United States or European Union and (b) the furthest development phase to which each target-indication pair progressed. Exact 95% confidence intervals are shown.
Supplementary Figure 7 Tradeoff between the number of indications studied and overall genetic support.
The tradeoff between the number of indications studied and overall genetic support when setting a lower bound on the number of independent genes associated with a trait related to each indication (relative similarity ≥ 0.7), restricted to drugs approved in the United States or European Union. The percentage of target-indication pairs with genetic support increases as indications are restricted to those with the most genetic information available, although at the cost of considering far fewer indications. The analyses reported in Figure 3 and Supplementary Figures 4–6 selected five as the threshold, where the first enrichment plateau is observed.
Supplementary Figure 8 Distribution of the number of genes associated with traits similar (≥0.7) to the indications included in the analysis of overlap with genetic associations.
The few indications with very large numbers of genes associated were truncated at 50. (The full range is available in Supplementary Table 5.) The box corresponds to the interquartile range, the center line corresponds to the median, the whiskers correspond to the maximum or 1.5 times the interquartile range (whichever is largest) and the points identify further outliers. The numbers given on the y axis are the number of unique indications observed in each phase. There is no statistically significant variability among phases (P = 0.18; analysis of the rank of the number of associations with phase as an ordered variable) or linear trend across phases (P = 0.37; analysis of the rank of the number of associations with phase as numeric, with 1 = preclinical and 5 = approved in the United States or European Union).
Supplementary Figure 9 System for scoring the strength of evidence tying a variant with a phenotypic association to a gene.
For variant function, “DHS Rdb 3” indicates that the variant has a RegulomeDB score of 3 and falls within a proximal or distal DHS site, “eQTL or DHS (2)” indicates that the variant was either identified as an eQTL in the University of Chicago eQTL database or had a RegulomeDB score of 2 and “eQTL & DHS” indicates that the variant was both identified as an eQTL and fell within a DHS site with a RegulomeDB score of 2 or less. LD is in the form of r².
Supplementary Figure 10 Permutation test of overlap between approved drug target–indications and genetic evidence (GWASdb or OMIM).
(a) The permutation scheme to simulate the null distribution. (b) The distribution of the percent of gene-trait and target-indication pairs that overlap over 10,000 permutations and the overlap observed in the original data (red downward arrow). (c) The overlap observed in the original data overall and by disease category (red points) and the median percent overlap over 10,000 permutations (red ×).
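The general idea of such a permutation test can be sketched as follows. This is an illustrative simplification, not the scheme shown in panel a: it assumes exact matching between gene-trait and target-indication pairs (the study instead matches on relative similarity) and shuffles indications across targets to break any real link between them. All names and data in the sketch are hypothetical.

```python
import random


def percent_overlap(pairs, genetic):
    """Percent of (gene, indication) pairs that also appear in `genetic`."""
    return 100 * sum(pair in genetic for pair in pairs) / len(pairs)


def permutation_test(pairs, genetic, n_perm=10_000, seed=0):
    """Compare observed overlap with overlaps after shuffling indications."""
    rng = random.Random(seed)
    observed = percent_overlap(pairs, genetic)
    genes = [gene for gene, _ in pairs]
    indications = [indication for _, indication in pairs]  # copy; input untouched
    null = []
    for _ in range(n_perm):
        rng.shuffle(indications)  # reassign indications to targets at random
        null.append(percent_overlap(list(zip(genes, indications)), genetic))
    # One-sided empirical P value with the usual +1 correction.
    p_value = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, null, p_value
```

The returned `null` list corresponds to the kind of distribution plotted in panel b, and `observed` to the value marked by the arrow.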
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10 and Supplementary Note. (PDF 2117 kb)
Supplementary Table 1: Count of publications, associations and genes corresponding to each MeSH term.
MeSH: unique MeSH terms mapped to GWASdb traits. Publications: the number of publications or unique data sources reporting associations with the MeSH term. Associations: the number of unique SNPs reported to be associated with the MeSH term. Genes: the number of unique genes to which the SNPs associated with the MeSH term are mapped. (XLSX 26 kb)
Supplementary Table 2: Count of drugs and genes mapping to each MeSH term.
MeSH: unique MeSH terms mapped to indications. Drugs: the number of drugs in Pharmaprojects with the MeSH term as an indication. Genes: the number of genes reported to be targets for the drugs with the MeSH term as an indication. (XLSX 30 kb)
Supplementary Table 3: Count of drugs and MeSH terms corresponding to each drug target gene.
Gene: unique drug target genes. Drugs: the number of drugs reported to target the gene product. MeSH: the number of MeSH terms for which the drugs targeting the gene product are indicated. (XLSX 51 kb)
Supplementary Table 4: Count of target genes and MeSH terms corresponding to each drug.
Drug: unique drugs. Genes: the number of genes reported to be targets for the drug. MeSH: the number of MeSH terms indicated for the drug. (XLSX 599 kb)
Supplementary Table 5: All indications for drugs with reported human drug targets with the number of genes associated with them.
MSH.Ind: drug indication (mapped to the best MeSH term). N.Traits: the number of indications or traits with a relative similarity ≥0.7 to the indication. N.Assns: the number of independent associations reported for the indication or a similar trait. Traits: list of traits with a relative similarity ≥0.7 to the indication; may be other indications without a corresponding genetic trait. lApprovedUS.EU: logical indication of whether the indication is for a drug approved in the United States or European Union. (XLSX 45 kb)
Supplementary Table 6: Drug targets with a genetic association (GWASdb or OMIM) mapped to the same gene for a trait with relative similarity ≥0.7 to the corresponding indication.
Gene: drug target gene. MSH.Ind: drug indication (mapped to the best MeSH term). Category: indication disease category. MSH.Trt: MeSH term for the trait with a genetic association to the drug target gene. pvalue: the P value for the genetic association. If multiple associations mapped the same gene to the same trait, the smallest P value with the largest gene score was selected. P values of zero indicate that the association is from OMIM. eCat: extension of the RegulomeDB category. If the variant mapping the association to the listed gene was not due to a DHS-correlated enhancer, a value of "0" was assigned to amino acid–changing variants, "2" was assigned for eQTLs and "9" was assigned otherwise. Associations from OMIM were not assigned a value. Rank: the rank of the mapping of the given gene for the reported association. RelSim: relative similarity between the indication and trait. LatestPhase: the latest phase that each unique gene-indication pair achieved in Pharmaprojects. (XLSX 2587 kb)
Supplementary Table 7: Counts of target-indication pairs included in the analyses presented in Figure 3b and used to estimate the enrichment of genetic associations.
Phase.Latest: latest phase in the development pipeline to which each target-indication combination has progressed. Num.Targets: total number of target-indication combinations progressing to that phase. N.Overlap: number of target-indication combinations that overlap with a gene-trait association. Percent: percent of target-indication combinations that overlap with a gene-trait association. Genetic.Evidence: the source of the genetic evidence. (XLSX 11 kb)
Supplementary Table 8: Manually scored MeSH term similarities to values of 0.9, 0.75 and 0.5 reflecting a subjective measure of similarity not captured via the ontological relationships.
MSH1: MeSH term 1. MSH2: MeSH term 2. ManSim: manually assigned similarity. RelSim: relative similarity averaging Resnik and Lin relative similarity measures. RelSim.Res: Resnik relative similarity. RelSim.Lin: Lin relative similarity. (XLSX 28 kb)
Supplementary Table 9: Manual assignment of traits or drug indications (MeSH terms) to disease categories.
MSH: MeSH term. MSH.Top: MeSH term for the top level of the MeSH hierarchy. Disease: the original disease trait. Indication: the original indication. Category: manually assigned trait or indication category. (XLSX 151 kb)
Supplementary Data Set 1: GWASdb entries with MeSH terms mapped for each trait and genes annotated as described in the Online Methods.
Disease: the name of the trait for the corresponding genetic association as provided by GWASdb, which is generally taken directly from the GWAS Catalog or the source from which the association was derived. snp_id: the identifier (generally a dbSNP rs ID) of the SNP reported to be associated with disease. Link: the reference for the association. Most references are a PubMed ID for the published paper. pvalue: the P value reported for the association of snp_id with disease. Source: the origin of the association information. It may be the following: GWAS:A/B: results listed in the NHGRI GWAS Catalog. The associations of GWAS:B are drawn from published tables and supplementary information. Omim: from Online Mendelian Inheritance in Man. All P values are zero. GWASCentral: from GWAS Central. dbGaP: associations from dbGaP. SNP.Trait.Cnt: the number of associations of the same snp_id with the same disease in the original data set. These have been reduced to a single row in this data set, and the minimum P value was selected. MSH: Medical Subject Heading for disease. Manually mapped by Computational Biology. MSH.Top: the MeSH term for the top level of the branch to which the trait is mapped. In most instances, there are many branches to which a single MSH may be mapped. When this occurs, the most common top-level term in GWASdb is selected. snp.ld: a SNP in linkage disequilibrium (LD) with snp_id that provides a plausible connection to a gene. Gene: a gene that snp.ld lies within 5 kb of, is an eQTL for, sits in a DNase I hypersensitivity site correlated with, or lies within the transcription start site of. r2: LD between snp_id and snp.ld. eqtl: indicates whether snp.ld is an eQTL for a gene. The eQTL data are drawn from eqtl.uchicago.edu. rdb: indicates whether snp.ld-gene mapping is the result of a DHS correlation (from Maurano et al. (2012), provided by J. Stamatoyannopoulos). Cat.rdb: RegulomeDB category of the SNP (if rdb is "yes"). Lower values indicate more lines of converging functional evidence. eCat: a derivative of Cat.rdb, filling in values where rdb is "no." If eqtl is "yes" but rdb is "no," then it gets a value of 2. If snp.ld is a missense variant (amino acid change), the value is 0. The value is 9 otherwise. AAEffect: amino acid effect of snp.ld. AAScore: Condel score from VEP for nonsynonymous variants. GeneScore: an overall assessment of the evidence that the associated variant has a causal effect on the gene in question, ranging from values of zero to eight. Higher scores imply higher weight of causal evidence. The contributions to GeneScore are summarized on a separate GeneScore Wiki page. Rank: the rank for the given gene for its strength of connection to snp_id. This takes LD and functional evidence into account. (TXT 12043 kb)
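The eCat rule described above can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the "yes"/"no" encoding follows the column descriptions, and the precedence of missense over eQTL when both apply is inferred from the order given in the Supplementary Table 6 legend rather than stated explicitly.

```python
def ecat(rdb, cat_rdb, eqtl, missense):
    """Extended RegulomeDB category for one snp.ld-gene mapping.

    rdb, eqtl: "yes" or "no", as in the data set columns.
    cat_rdb:   RegulomeDB category, used only when rdb is "yes".
    missense:  True if snp.ld changes an amino acid (hypothetical flag).
    """
    if rdb == "yes":
        return cat_rdb  # DHS-correlated mapping: keep the RegulomeDB category
    if missense:
        return 0        # amino acid-changing variant (assumed to take precedence)
    if eqtl == "yes":
        return 2        # eQTL without a DHS correlation
    return 9            # no supporting functional evidence recorded
```

As with Cat.rdb, lower values returned by this function indicate stronger functional evidence.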
Supplementary Data Set 2: Reduction of the genetic association data to a single row per gene and trait as used for most analyses described.
Variable names are as given for Supplementary Data Set 1. (TXT 2903 kb)
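A reduction of this kind might look like the sketch below, which keeps one row per gene and trait. It assumes rows are dictionaries keyed by the column names of Supplementary Data Set 1 and interprets "the smallest P value with the largest gene score" (Supplementary Table 6 legend) as ordering by P value first and using GeneScore to break ties; that ordering is an assumption, and the function name is hypothetical.

```python
def reduce_to_gene_trait(rows):
    """Keep one row per (Gene, MSH): smallest pvalue, then largest GeneScore."""
    best = {}
    for row in rows:
        key = (row["Gene"], row["MSH"])
        # Lower tuple is better: small P value first, then high GeneScore.
        rank = (row["pvalue"], -row["GeneScore"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]
```

Because OMIM entries carry P values of zero, they would always be preferred over GWAS-derived rows for the same gene and trait under this ordering.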
Supplementary Data Set 3: Data set of unique target-indication combinations in Pharmaprojects.
Gene: drug target gene. MSH: MeSH term for drug indication. MSH.Top: top-level MeSH term. Phase.Latest: latest phase to which the target-indication pair progressed through the development pipeline. lApprovedUS.EU: indicator of whether a drug for the target-indication pair has been approved in the United States or European Union. (TXT 1485 kb)
Supplementary Data Set 4: Relative similarity matrix of MeSH terms.
Row and column names correspond to each MeSH term for which a relative similarity could be computed. (TXT 58752 kb)
Cite this article
Nelson, M., Tipney, H., Painter, J. et al. The support of human genetic evidence for approved drug indications. Nat Genet 47, 856–860 (2015). https://doi.org/10.1038/ng.3314