BRIEFINGS IN BIOINFORMATICS. VOL 16. NO 1. 16–23
Advance Access published on 13 January 2014
doi:10.1093/bib/bbt091

The genomic and functional characteristics of disease genes

Andrew Collins

Submitted: 10th October 2013; Received (in revised form): 2nd December 2013

Abstract

Increasing evidence indicates that genes containing disease causal variation have distinct functional and genomic properties. The importance of understanding these properties is highlighted by efforts to filter lists of variants from next-generation sequencing studies, where the number of potentially deleterious variants, which are in fact unrelated to disease, may be large. Available evidence indicates that the majority of disease genes are ‘non-essential’ and their products occupy functionally peripheral positions in protein networks. They tend to be intermediate between genes that have core biological functions, particularly low mutation rates and low haplotype diversity, and genes for which high haplotype diversity and high mutation rates are advantageous (such as those involved in sensory perception and some immune system functions). Evidence presented here supports these conclusions through analysis of integrated data sets incorporating the latest mutational profiles, linkage disequilibrium structure and other genomic properties of individual genes. The analysis highlights the contrasting functions of genes predicted as least and most likely to contain disease variation and provides a basis for filtering gene variant lists to exclude the least plausible disease candidates.

Keywords: disease genes; next-generation sequencing; eliminating false-positive variants; mutational heterogeneity; linkage disequilibrium

INTRODUCTION

Accumulating evidence indicates that genes containing disease causal variation (‘disease genes’) have distinct genomic and functional properties.
Understanding these characteristics is important because researchers are faced with voluminous next-generation sequencing (NGS) data, which may identify hundreds or thousands of potentially deleterious variants per sample. Many of these variants are implausible disease candidates, and a number of analyses have considered strategies to identify false positives and filter gene and variant lists to generate a much smaller, and better supported, set of candidates. Fuentes Fajardo et al. [1] suggested excluding variants located in genomic regions known to be highly polymorphic, as well as variants that reflect sequence read alignment errors or the minor-allele status of the reference genome used for alignment. Prominent in their list of highly polymorphic genes unlikely to contain pathogenic variation are genes from olfactory and taste receptor families.

Lawrence et al. [2] considered the difficulty of identifying true cancer driver mutations in tumour-normal paired samples and devised strategies to account for the wide variation in mutation frequencies across genes. Specifically, they developed a tool, MutSigCV, which models mutational heterogeneity and provides a robust filter to reduce the number of false positives in candidate variant lists. Their initial analyses in lung squamous cell carcinoma samples revealed many signals from genes unlikely to be involved in cancer (e.g. 25% were from olfactory receptor genes). By accounting for mutational heterogeneity in individual genes, they eliminated many false positives and determined a better supported set of candidates. Based on extensive NGS and other data, they characterized genomic features of 17 667 genes including mutation frequency, expression levels and DNA replication timing.

Corresponding author. Andrew Collins, Genetic Epidemiology and Genomic Informatics Group, Human Genetics, Faculty of Medicine, University of Southampton, Duthie Building (808), Southampton General Hospital, Tremona Road, Southampton SO16 6YD, UK. Tel.: 02380796939; Fax: 02380794264; E-mail: [email protected]
Andrew Collins is the head of the Genetic Epidemiology and Bioinformatics Research Group at the University of Southampton and is involved in next-generation sequencing studies of a number of diseases.
© The Author 2014. Published by Oxford University Press. For Permissions, please email: [email protected]

Petrovski et al. [3] considered how well individual genes tolerate functional genetic variation. Using data from 6503 exome sequences, they assessed the extent to which individual genes have relatively more, or less, functional genetic variation than expected, given the total amount of neutral variation in the gene. They presented the Residual Variation Intolerance Score (RVIS) for 16 956 genes: when the score is 0, the gene has an average number of common functional variants, given its total mutational burden; when the score is <0, the gene has less common functional variation than might be expected; and when the score is >0, the gene has more common functional variation than expected. Negative scores are therefore suggestive of purifying selection, and positive scores of balanced or positive selection, or both. They found highly significant evidence that genes containing mutations underlying Online Mendelian Inheritance in Man (OMIM)-classified Mendelian diseases have lower RVIS values than non-disease genes.

Similar earlier analyses considered the ratio of non-synonymous to synonymous substitutions (Dn/Ds) as a metric for understanding disease gene properties [4]. The ratio represents the proportion of amino acid changes in a gene that reach fixation and are therefore not strongly deleterious.
Conflicting results were obtained when using this metric to understand the impact of selection on disease variation in other studies [5, 6]. A particular difficulty arises because the set of non-disease genes used in all comparisons is likely to contain a proportion of genes with essential functions that are under intense purifying selection and therefore do not contribute to disease. This point is highlighted by Gibson et al. [7], who considered patterns of linkage disequilibrium (LD) for individual genes. Specifically, they established that the known disease genes are under-represented among genes with the strongest LD: these are genes with the lowest haplotype diversity, suggesting they are least tolerant to mutation. They found that many genes in this class are involved in essential biological functions such as phosphorylation, cell division, cellular transport and metabolic processes. In contrast, disease genes were found to be over-represented in the class of genes that have average levels of LD. The class of genes with the weakest LD was found to be enriched for genes with functions for which high haplotype diversity is advantageous (e.g. sensory perception and immune response genes).

Human disease genes have been classified into essential and non-essential genes based on comparative mouse data [8–10]. The classification considers essential genes as those that are embryonically lethal when inactivated. Although the validity of comparisons with the mouse genome has been questioned, these analyses concluded that the majority (60–75%) of disease genes are non-essential. It has also been shown that disease genes are less likely to operate as highly connected protein nodes or ‘hubs’ [11] in gene interaction networks.

The interaction network properties of disease genes have been extensively studied [8, 9, 12]. Goh et al.
[8] found the expression patterns of disease genes indicate that their products are localized in the ‘functional periphery’ of protein networks. Dancik et al. [13] described the proteins encoded by disease genes as having ‘intermediate’ connectivity in biological networks. Cai et al. [14] demonstrated that Mendelian and complex disease genes have distinct protein–protein interaction properties. Specifically, disease genes tend to be highly connected to other genes, but these genes are often not well connected among themselves. Their analysis aligned with other work [10], which recognized that Mendelian disease genes tend to be evolutionarily old and that this duration has provided time for gene products to develop significant protein–protein connections. They conclude that disease genes occupy topologically important positions as ‘brokers’ in these networks, connecting proteins that do not otherwise interact with each other. Their placement in vulnerable positions in protein interaction networks means their disruption contributes to disease phenotypes.

A wealth of published evidence supports the suggestion that disease genes have distinct genomic and functional properties, and characterizing these properties is potentially useful in filtering NGS variant lists to establish true disease candidates. Integration and analysis of published data sets that quantify some of the genomic properties of individual genes is presented here. The integrated data set forms a basis for further development of multivariate models for ranking variant lists. Gene Ontology enrichment analyses that consider the functional characteristics of genes predicted as least and most likely to contain disease variation are also presented.

MATERIALS AND METHODS

To investigate the genomic and functional properties of disease genes, four data sets were integrated before analysis:

(i) Lawrence et al. [2] (Supplementary Table S5) list properties of 17 667 genes including average expression level across 91 cell lines in the Cancer Cell Line Encyclopedia (CCLE, http://www.broadinstitute.org/ccle/home); DNA replication time [15] on a scale of 100 (early) to 1500 (late); average non-coding mutation frequency (mutations per base pair) from a panel of 126 whole-genome sequenced cancer samples; and the local GC content of the gene (on a 100-kb scale), local gene density and a HiC-derived metric [16]. The latter considers the 3D architecture of genomes and, specifically, identifies genomic regions (compartments) that contain loosely packed chromatin (which is ‘open’, accessible and active) versus densely packed regions (with ‘closed’, inaccessible and inactive chromatin). The HiC data are encoded such that negative values correspond to genes from the closed chromatin compartment ‘B’ and positive values to genes from the open compartment ‘A’.

(ii) Gibson et al. [7] present LD maps of 10 414 genes based on genome-wide LD maps built from exome sequences. Maps are in LD units (LDUs; [17]), which are analogous to centimorgans in linkage maps, where one LDU is the physical (kilobase) distance over which LD declines to ‘background’ levels. Genes were allocated a category of 1–5, with 1 representing ‘strong’ LD, where LDU/kb = 0.0; 2, where 0 < LDU/kb ≤ 0.01; 3, where 0.01 < LDU/kb ≤ 0.02; 4, where 0.02 < LDU/kb ≤ 0.05; and 5 representing ‘weak’ LD, where LDU/kb > 0.05. Gene lengths in kilobases and the total number of (exonic) single-nucleotide polymorphisms (SNPs) identified in these genes from the exome samples were also derived from this data set.

(iii) Petrovski et al. [3] computed RVIS for 16 956 genes, made available in their data set S2. The RVIS provides a measure of the departure from the average number of common functional mutations found in genes with a similar mutational burden.

(iv) The Galaxy server at http://main.g2.bx.psu.edu/library lists 9133 unique variants described by OMIM as associated with human disease, of which at least 7296 (80%) are associated with Mendelian disease [18], with the remainder associated with complex diseases. Gibson et al. [7] found these variants are located within 1086 (10.4%) of the 10 414 genes with LDU maps in their analysis.

For this analysis, the genes containing disease variation were classified as ‘disease genes’, and all remaining genes were classified as ‘non-disease’. The integration of data sets (i)–(iv) produces complete data for 9131 genes including 1033 disease genes (11.3%) with comprehensively described genomic properties (totals given in Supplementary Table S1). The list of 1033 genes containing known disease variation is given in Supplementary Table S2.

Logistic regression with the dependent variable coded as non-disease gene (0) or disease gene (1) was undertaken. The analysis considered univariate models (Table 1) and a multivariate model in which variables were selected stepwise (Table 2). To investigate the stability of the multivariate model, two non-overlapping subsets of the data, containing 4565 and 4566 genes (data sets S1 and S2), were produced. Results from the final multivariate model evaluated on each subset are given in Supplementary Table S3.

Using scores predicted under the multivariate model (Table 3), Gene Ontology enrichment analysis was undertaken for the 100 genes with the lowest predicted scores (Table 4) and the 100 genes with the highest predicted scores (Table 5) using DAVID against a ‘whole genome’ background (http://david.abcc.ncifcrf.gov/). This analysis establishes function enrichment among subsets of genes (clustered into ‘gene groups’) predicted to be least and most likely to contain disease variation under this model.
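As a rough illustration of the data preparation described above, the following Python sketch allocates an LD category from LDU/kb and merges per-gene records, keeping only genes with complete data. The function names, the dictionary layout and the example values are hypothetical and are not taken from the published data sets; the category boundaries follow the thresholds quoted above, with the upper bound of each interval assumed to be inclusive.

```python
def ld_category(ldu_per_kb):
    """Map LDU/kb to a category from 1 (strong LD) to 5 (weak LD)."""
    if ldu_per_kb < 0:
        raise ValueError("LDU/kb cannot be negative")
    if ldu_per_kb == 0.0:
        return 1
    if ldu_per_kb <= 0.01:   # upper bounds assumed inclusive
        return 2
    if ldu_per_kb <= 0.02:
        return 3
    if ldu_per_kb <= 0.05:
        return 4
    return 5


def integrate(genomic, ld, rvis, disease_genes):
    """Inner-join per-gene dictionaries keyed by gene symbol.

    Only genes present in all three property tables are kept; each one
    receives a 0/1 disease label from membership of `disease_genes`.
    """
    complete = set(genomic) & set(ld) & set(rvis)
    merged = {}
    for gene in sorted(complete):
        row = dict(genomic[gene])              # copy, do not mutate the input
        row["ld_category"] = ld_category(ld[gene])
        row["rvis"] = rvis[gene]
        row["disease"] = 1 if gene in disease_genes else 0
        merged[gene] = row
    return merged


# Made-up values for three placeholder genes (illustration only).
genomic = {"GENE_A": {"gc": 0.45}, "GENE_B": {"gc": 0.41}, "GENE_C": {"gc": 0.50}}
ld = {"GENE_A": 0.015, "GENE_B": 0.0}          # GENE_C has no LD map
rvis = {"GENE_A": -0.8, "GENE_B": 0.3, "GENE_C": 1.2}
merged = integrate(genomic, ld, rvis, disease_genes={"GENE_A"})
print(merged)
```

An inner join is used here because the text reports results only for genes with complete data; GENE_C is dropped in the toy example because it lacks an LD value.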
To test for consistency, corresponding analyses were carried out for the two subsets of the data using predicted scores computed from a multivariate model established for each data set using the same predictor variables identified in Table 2.

RESULTS

Univariate analyses (Table 1) show that the average expression level of genes in the CCLE is not significantly related to disease gene classification in these data (P = 0.548). Mutation rates are lower in genes highly expressed in the germline [2, 19], and a relationship with disease gene status might have been expected given the significant relationship with mutation frequency (P = 0.004). Although these data are based on expression in cancer cell lines, the authors note that matched normal tissue shows similar expression patterns. The replication time of a DNA region during the cell cycle is associated with high mutation rates for late replicating regions, perhaps due to depletion of the pool of free nucleotides [20]. The data are based on replication timing from HeLa cells but with similar results for blood cell lines [15, 21]. There is no evidence for an association between disease gene classification and replication timing in these data (P = 0.160).

Table 1: Univariate logistic regression analyses for disease versus non-disease genes

| Variable | Odds ratio | Standard error | z-score | 95% Confidence interval | Pseudo R-squared | P-value |
|---|---|---|---|---|---|---|
| Number of SNPs in gene | 1.0101 | 0.0017 | 5.99 | 1.0068–1.0134 | 0.0056 | <0.001* |
| Size of gene (kilobase) | 1.0009 | 0.0003 | 2.99 | 1.0003–1.0016 | 0.0013 | 0.003* |
| Strength of LD (from 1 = strong, to 5 = weak) | 1.0819 | 0.0286 | 2.98 | 1.0273–1.1394 | 0.0014 | 0.003* |
| Somatic mutation frequency (log10 mutations/Mb) | 0.4969 | 0.1210 | 2.87 | 0.3082–0.8010 | 0.0013 | 0.004* |
| DNA replication time (early = 100 to late = 1500) | 0.9998 | 0.0001 | 1.40 | 0.9995–1.0000 | 0.0003 | 0.160 |
| Mean GC content of reads covering gene | 3.9330 | 2.2962 | 2.35 | 1.2525–12.3504 | 0.0008 | 0.019* |
| HiC compartment (negative = closed, gene poor; positive = open, gene rich) | 8.6198 | 11.8822 | 1.56 | 0.5783–128.4846 | 0.0004 | 0.118 |
| Average expression level in the CCLE | 1 | 0.0000 | 0.60 | 0.9999–1.0000 | 0.0001 | 0.548 |
| Local gene density/Mb | 0.9934 | 0.0030 | 2.20 | 0.9876–0.9993 | 0.0008 | 0.028* |
| RVIS | 0.8443 | 0.0262 | 5.45 | 0.7945–0.8973 | 0.0048 | <0.001* |

Note: Significant P-values are marked with an asterisk.

Table 2: Multivariate logistic regression model for disease versus non-disease genes

| Variable | Odds ratio | Standard error | z-score | 95% Confidence interval | P-value (a) |
|---|---|---|---|---|---|
| Number of SNPs in gene | 1.0106 | 0.0018 | 5.92 | 1.0071–1.0141 | <0.001* |
| Strength of LD (from 1 = strong, to 5 = weak) | 1.0732 | 0.0294 | 2.58 | 1.0171–1.1324 | 0.010* |
| Somatic non-coding mutation frequency (log10 mutations/Mb) | 0.4672 | 0.1208 | 2.94 | 0.2815–0.7754 | 0.003* |
| Mean GC content of reads covering the gene | 6.8580 | 5.0631 | 2.61 | 1.6135–29.1490 | 0.009* |
| Local gene density (1 Mb) | 0.9886 | 0.0037 | 3.09 | 0.9814–0.9958 | 0.002* |
| RVIS | 0.8717 | 0.0255 | 4.70 | 0.8231–0.9231 | <0.001* |

Note: (a) Pseudo R-squared for the model = 0.0152. Significant P-values are marked with an asterisk.

Table 3: Means for variables by predicted score from the multivariate model

| Predicted score from multivariate model | Number of genes | Proportion of disease genes | Number of SNPs in gene | LD category (1 = strong LD, 5 = weak LD) | Log10 mutations/Mb (× 1 000 000) | Mean GC content (frequency) | Local gene density/Mb | RVIS |
|---|---|---|---|---|---|---|---|---|
| <0.09 | 1818 | 0.073 | 12.28 | 2.18 | 3.97 | 0.42 | 20.00 | 0.86 |
| 0.09–<0.10 | 1394 | 0.093 | 12.23 | 2.30 | 3.16 | 0.44 | 15.04 | 0.22 |
| 0.10–<0.11 | 1574 | 0.103 | 13.73 | 2.47 | 2.88 | 0.44 | 13.54 | 0.03 |
| 0.11–<0.12 | 1390 | 0.108 | 15.12 | 2.80 | 2.75 | 0.45 | 13.22 | 0.14 |
| 0.12–<0.13 | 1021 | 0.140 | 17.36 | 2.99 | 2.62 | 0.46 | 12.93 | 0.33 |
| ≥0.13 | 1934 | 0.163 | 30.55 | 3.32 | 2.65 | 0.47 | 11.56 | 0.85 |
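To make the quantities reported in Tables 1 and 2 concrete, the sketch below fits a single-predictor logistic regression by Newton-Raphson and converts the coefficient into an odds ratio, z-score and 95% confidence interval. It is a minimal pure-Python illustration, not the software used for the published analysis; the function names and the toy data are hypothetical. The tabulated standard errors appear consistent with being expressed on the odds-ratio scale (odds ratio multiplied by the standard error of the coefficient), so that quantity is returned as well; this reading is an inference from the tabulated values rather than something stated in the text.

```python
import math


def fit_univariate_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, se_b1); se_b1 is taken from the inverse of the
    information matrix evaluated at the last iteration.
    """
    b0 = b1 = 0.0
    se_b1 = float("nan")
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p               # score (gradient) for the intercept
            g1 += (yi - p) * xi        # score for the slope
            h00 += w                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        se_b1 = math.sqrt(h00 / det)   # sqrt of the [1][1] entry of the inverse
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1, se_b1


def odds_ratio_summary(b1, se_b1, z_crit=1.96):
    """Express a logistic coefficient as odds ratio, z-score and 95% CI."""
    odds_ratio = math.exp(b1)
    return {
        "odds_ratio": odds_ratio,
        "se_odds_ratio": odds_ratio * se_b1,   # delta-method standard error
        "z": b1 / se_b1,
        "ci95": (math.exp(b1 - z_crit * se_b1), math.exp(b1 + z_crit * se_b1)),
    }


# Toy data (hypothetical): a binary predictor and disease status for 80 genes.
x = [0] * 40 + [1] * 40
y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20
b0, b1, se_b1 = fit_univariate_logistic(x, y)
summary = odds_ratio_summary(b1, se_b1)
print(summary)
```

For a binary predictor the fitted odds ratio equals the cross-product ratio of the 2 × 2 table, which gives a simple check on the fit. A multivariate model follows the same logic with a coefficient vector and a full information matrix.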
Table 4: Enriched GO terms (P < 0.01, biological process) for gene cluster 1 (enrichment score 39.6) from the 100 genes with the lowest predicted score under the multivariate model

| GO term | Gene count | Fold enrichment |
|---|---|---|
| Sensory perception of smell | 55 | 29 |
| Sensory perception of chemical stimulus | 55 | 26 |
| Sensory perception | 55 | 15 |
| Cognition | 55 | 14 |
| Neurological system process | 56 | 10 |
| G-protein coupled receptor protein signalling pathway | 55 | 11 |
| Cell surface receptor linked signal transduction | 55 | 6.7 |

Table 5: Enriched GO terms (P < 0.01, biological process) for gene cluster 1 (enrichment score 5.6) and gene cluster 2 (enrichment score 5.4) from the 100 genes with the highest predicted score under the multivariate model

| GO term | Gene count | Fold enrichment |
|---|---|---|
| GG1: Cell adhesion | 6 | 12 |
| GG1: Biological adhesion | 6 | 12 |
| GG2: Extracellular matrix organization | 3 | 43 |
| GG2: Extracellular structure organization | 3 | 28 |
| GG2: Epidermis development | 3 | 25 |
| GG2: Ectoderm development | 3 | 23 |

The data for the HiC chromatin compartment [16], which quantifies chromosome regions with open or closed chromatin, are also not related to the disease status in these data (P = 0.118).

Disease genes are significantly longer overall [odds ratio (OR) = 1.0009, P = 0.003] and contain more SNPs (OR = 1.0101, P < 0.001) compared with non-disease genes (Table 1). Other authors have found disease genes to be longer than reference genes [10], with an average of 747 amino acids versus 478 for reference genes. Gibson et al. [7] found significantly more disease genes in the gene category with intermediate strengths of LD and significantly fewer among genes with the strongest LD. Association with the less strong LD categories is reflected here (OR = 1.082, P = 0.003).
The LD profile of disease genes appears intermediate between genes with the strongest LD, which have ‘essential’ biological functions (including phosphorylation, cell division and metabolic processes), and genes with the weakest LD, which are associated with sensory perception and immune response, roles for which high haplotype diversity is advantageous.

Univariate analyses indicate that disease genes are associated with relatively low non-coding region mutation frequencies (OR = 0.497, P = 0.004), high GC content (OR = 3.93, P = 0.019) and low local gene density (OR = 0.99, P = 0.028). Low mutation rates align with evidence from the RVIS metric, which is significantly lower for disease genes (OR = 0.84, P < 0.001), consistent with other findings [3] and reflecting the impact of purifying selection on disease genes.

Multivariate analyses (Table 2) show similar relationships for the retained model terms (number of SNPs in gene, strength of LD, mutation frequency, GC content, local gene density and RVIS metric). Scores from this model (Table 3), divided into six categories, show that genes in the most disease gene-enriched category have 2.5 times as many SNPs, two-thirds the mutation rate and just over half the local gene density of genes in the most disease gene-poor category. The strongly negative RVIS, higher GC content and reduced LD trends for disease genes are also shown here. Tests of the multivariate model on the data subsets (Supplementary Table S3) show that the components of the model have consistent effect directions, although there is loss of significance, reflecting reduced power, particularly for data set S2.

Gene functional enrichment analysis by DAVID against the ‘whole genome’ background for the 100 genes with the lowest predicted score shows high enrichment of functions related to sensory perception (perception of smell, chemical stimuli, cognition, etc.; Table 4).
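The fold enrichment values reported for GO terms can be read as the proportion of annotated genes in the list divided by the proportion in the background. The sketch below shows that calculation together with a plain hypergeometric upper-tail probability. It is an illustration only: DAVID reports its own EASE score, a modified Fisher exact test, which this sketch does not reproduce, and the background counts used in the example are hypothetical.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Proportion annotated in the list (k/n) over the background (K/N)."""
    return (k / n) / (K / N)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for n genes drawn without replacement from N genes,
    K of which carry the annotation (plain hypergeometric test)."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical example: 55 of 100 listed genes carry a term that annotates
# 400 of 20 000 background genes (both background counts are made up).
fold = fold_enrichment(55, 100, 400, 20000)
p_value = hypergeom_upper_tail(55, 100, 400, 20000)
print(fold, p_value)
```

Exact integer binomial coefficients are used so that the tail probability does not lose precision for large backgrounds.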
Genes with these functions are prominent as a source of false positives in NGS gene lists [1, 2]. Tests on the data subsets (S1, S2) using the models in Supplementary Table S3 established that the 100 genes with the lowest predicted score in each set show the same gene group enrichment as in Table 4, with enrichment scores of 28.4 and 18.8, respectively, for the most enriched gene group.

The 100 highest scoring genes from the multivariate model show enrichment of functions related to cell adhesion, epidermis development and similar functions (Table 5), and this cluster includes the disease genes COL5A1 (mutations cause Ehlers–Danlos syndrome), COL6A3 (mutations cause Bethlem myopathy and Ullrich congenital muscular dystrophy) and COL7A1 (mutations cause epidermolysis bullosa dystrophica and related diseases). The reduced enrichment scores for these gene groups, compared with the 100 lowest scoring genes under the model, reflect, in part, fewer genes in each cluster, which may in turn reflect the diversity of pathways containing disease variation. Tests on the 100 highest scoring genes from each of the S1 and S2 data subsets show that the gene group with the highest enrichment score for S1 (score = 2.44) comprises functions related to ion transport and includes the RYR1 gene, which is involved in central core disease of muscle and other conditions. For the S2 sample, the gene group with the highest enrichment score (score = 4.99) includes regulatory functions of the cytoskeleton and organelles and includes the SYNE1 gene, which is involved in spinocerebellar ataxia.

DISCUSSION

The data support evidence that genes containing disease variants have distinct genomic and functional characteristics, and this favours the further development of models that predict disease candidates to facilitate filtering of NGS variant lists.
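One way such filtering might look in practice is sketched below: variants in genes whose predicted score falls below a cut-off are dropped and the remainder are ranked by gene score. This is a hypothetical illustration, not a procedure described in the paper. The function name, the variant and gene identifiers and the scores are invented, and the default cut-off of 0.09 is chosen only because it bounds the lowest score category in Table 3.

```python
def filter_and_rank(variants, gene_scores, min_score=0.09):
    """Drop variants in low-scoring genes and rank the rest, best first.

    variants: list of (variant_id, gene) tuples.
    gene_scores: predicted disease-gene score per gene.
    Variants in genes without a score are kept and listed last, because
    the absence of a score is not evidence against the gene.
    """
    kept, unscored = [], []
    for variant_id, gene in variants:
        score = gene_scores.get(gene)
        if score is None:
            unscored.append((variant_id, gene, None))
        elif score >= min_score:
            kept.append((variant_id, gene, score))
    kept.sort(key=lambda item: item[2], reverse=True)
    return kept + unscored


# Invented identifiers and scores, for illustration only.
variants = [("var1", "GENE_A"), ("var2", "GENE_B"),
            ("var3", "GENE_C"), ("var4", "GENE_D")]
gene_scores = {"GENE_A": 0.16, "GENE_B": 0.07, "GENE_C": 0.11}
ranked = filter_and_rank(variants, gene_scores)
print(ranked)
```

Keeping unscored genes rather than discarding them is a deliberate choice in this sketch, given that the integrated data set covers only a subset of genes.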
The multivariate model (Table 2) demonstrates that genes known to contain disease variants have low RVIS (P < 0.001) and low overall mutation rates (P = 0.003), consistent with the impact of purifying selection. However, they also tend to be GC-rich (P = 0.009), which is a feature associated with high mutability. Analyses of non-coding DNA have shown that CpG dinucleotides (adjacent C and G nucleotides) have a 10-fold greater mutation rate than other sites [22]. Therefore, genes containing disease variants may be relatively susceptible to mutation, but subject to elevated purifying selection. Evidence from other studies demonstrates that disease genes have functionally peripheral roles in gene networks [8], rather than ‘hub’ functions encoded by genes subject to more intense selection through in utero lethality. Evidence for moderately elevated selection is supported by the finding [7] that disease genes have levels of LD intermediate between essential genes that have low haplotype diversity and genes involved in sensory perception and some immune system functions that have weak LD and high haplotype diversity. Gradually declining intensity in selective pressure might underlie this transition from strong to weak LD, reflecting the different functional roles of the genes involved.

Disease genes are significantly longer (P = 0.003, Table 1) and contain more (exonic) SNPs (P < 0.001; Table 2) than non-disease genes. Several studies have found disease genes are long and, in particular, have longer protein-coding sequence [6]. Gene length is related to the finding that disease genes tend to be old [5, 6, 10], implying that although interaction network and transcriptional analyses suggest disease genes are not concentrated in hubs, their ancient origin suggests roles in old biological processes. Disease mutations have been found to occur at conserved locations in proteins [23, 24].
Thus, there is the puzzle that processes with an ancient evolutionary origin are still vulnerable to disease mutations, given the length of time for fitness-reducing mutations to have been eliminated by selection. The evidence here suggests that disease genes may be exposed to moderate purifying selection operating against a relatively mutable genomic background, which contributes to the persistence of mutations.

Untangling the dominant processes underlying disease or non-disease status of a gene is challenging. LD structure, for example, is particularly complex, reflecting the interaction of recombination, selection, population history and also mutation. The independently derived non-coding mutation rate data, the mutation intolerance score data (RVIS) and measures of LD contribute significantly to the model (Table 2). Therefore, evidence from LD structure unrelated to mutation is contributing to the model. The component of LD structure that reflects selection may be dominant, but independently derived measures of selection at the individual gene level would presumably enhance the resolution of the model. Furthermore, because many disease genes remain undiscovered (the ‘missing heritability’ [25]), the resolution of the model is reduced because a proportion of genes classified as non-disease do in fact contain disease variation.

The data presented here represent known disease variants of which at least 80% are Mendelian [18]. The extent to which the findings can be related to genes underlying more complex phenotypes is uncertain. Cai et al. [14] found a difference between complex disease genes identified through genome-wide association study (GWAS) methods and those found through other methods. They noted that Mendelian and non-GWAS complex trait genes showed distinct and remarkably consistent protein network properties, whereas GWAS-identified complex trait genes deviated only slightly from non-disease genes in their network properties.
They attribute the weakness of this signal partly to the small sample size of GWAS genes and partly to the possibility that GWAS genes underlie etiologically different diseases and more polygenic phenotypes. However, the observation that non-GWAS-derived complex trait genes show distinct network properties might suggest that the functional roles of many GWAS-derived (regulatory) genes are not well understood and the gene(s) they are impacting may be uncertain. Models that identify and rank disease gene candidates might therefore be valuable in establishing the target genes impacted by variants identified by GWAS.

The data presented here represent some of the genomic properties of genes (mutation rates, LD structure, gene size, GC content, etc.), and the available evidence suggests many genes associated with disease show significant differences from non-disease genes. The predictive ability of a gene classifier is reduced by unobserved disease genes in the non-disease category and incomplete understanding of patterns of selection operating on genes with essential functions. It seems likely that the utility of a predictive model will be enhanced through integration of data describing roles of genes in protein networks and functional roles in pathways. The utility of analyses that integrate functional and genomic profiles of individual genes is likely to increase as more genomes are sequenced and gene functions are better understood.

Key points

- NGS identifies many variants that appear potentially deleterious but are implausible disease candidates. Many of these variants are found in genes with particularly high mutation rates and with biological functions for which high haplotype diversity is advantageous (e.g. sensory perception).
- By integrating published data sets and contrasting genes containing known disease variation with all other genes, disease genes are shown to have low mutation rates and to be relatively intolerant to functional genetic variation. They also show average levels of LD, are relatively GC-rich and are longer than non-disease genes.
- The findings are consistent with other evidence that suggests disease genes are subject to moderate purifying selection, are relatively old in evolutionary terms and occupy functionally peripheral roles in gene networks.
- The integration of genomic and functional properties of disease and non-disease genes establishes a classifier, which is useful for ranking candidate disease genes in NGS data and for highlighting those variants that are least likely to be causal.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/

FUNDING

This work was funded through a research grant provided by the Newlife Foundation for disabled children.

References

1. Fuentes Fajardo KV, Adams D, Mason CE, et al. Detecting false-positive signals in exome sequencing. Hum Mutat 2012;33(4):609–13.
2. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013;499(7457):214–18.
3. Petrovski S, Wang Q, Heinzen EL, et al. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet 2013;9(8):e1003709.
4. Blekhman R, Man O, Herrmann L, et al. Natural selection on genes that underlie human disease susceptibility. Curr Biol 2008;18(12):883–9.
5. Kondrashov FA, Ogurtsov AY, Kondrashov AS. Bioinformatical assay of human gene morbidity. Nucleic Acids Res 2004;32(5):1731–7.
6. Smith NGC, Eyre-Walker A. Human disease genes: patterns and predictions. Gene 2003;318:169–75.
7. Gibson J, Tapper W, Ennis S, Collins A. Exome-based linkage disequilibrium maps of individual genes: functional clustering and relationship to disease. Hum Genet 2013;132(2):233–43.
8. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.
9. Feldman I, Rzhetsky A, Vitkup D. Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci USA 2008;105(11):4323–8.
10. Domazet-Lošo T, Tautz D. An ancient evolutionary origin of genes associated with human genetic diseases. Mol Biol Evol 2008;25(12):2699–707.
11. He X, Zhang J. Why do hubs tend to be essential in protein networks? PLoS Genet 2006;2(6):e88.
12. Jiang X, Liu B, Jiang J, et al. Modularity in the genetic disease-phenotype network. FEBS Lett 2008;582(17):2549–54.
13. Dančík V, Petri Seiler K, Young DW, et al. Distinct biological network properties between the targets of natural products and disease genes. J Am Chem Soc 2010;132(27):9259–61.
14. Cai JJ, Borenstein E, Petrov DA. Broker genes in human disease. Genome Biol Evol 2010;2:815–25.
15. Chen CL, Rappailles A, Duquenne L, et al. Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Res 2010;20(4):447–57.
16. Lieberman-Aiden E, van Berkum NL, Williams L, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326(5950):289–93.
17. Tapper W, Collins A, Gibson J, et al. A map of the human genome in linkage disequilibrium units. Proc Natl Acad Sci USA 2005;102(33):11835–9.
18. Li MX, Gui HS, Kwan JSH, et al. A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Res 2012;40(7):e53.
19. Pleasance ED, Stephens PJ, O’Meara S, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 2009;463(7278):184–90.
20. Stamatoyannopoulos JA, Adzhubei I, Thurman RE, et al. Human mutation rate associated with DNA replication timing. Nat Genet 2009;41(4):393–5.
21. Koren A, Polak P, Nemesh J, et al. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am J Hum Genet 2012;91(6):1033–44.
22. Hodgkinson A, Eyre-Walker A. Variation in the mutation rate across mammalian genomes. Nat Rev Genet 2011;12(11):756–66.
23. Miller MP, Kumar S. Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet 2001;10(21):2319–28.
24. Mooney SD, Klein TE. The functional importance of disease-associated mutation. BMC Bioinformatics 2002;3(1):24.
25. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature 2009;461(7265):747–53.