Hannah Morgan
12/8/16

If The Genes Fit, Wear Them: Analyzing Gene-Disease Associations With a Focus on Scoliosis

A girl of eleven years is flagged as a possible scoliosis patient in a standard medical screening and is later diagnosed with a severe case of this common skeletal disorder. Her case is described as adolescent idiopathic scoliosis, indicating that the cause of the disorder is unknown. However, her parents have mild scoliosis, her sister has a postural deformity called (hyper)kyphosis, and her niece has been diagnosed with the same severe scoliosis and given a similar orthotic treatment, all of which supports the hypothesis that the disorder may be related to genetics. If so, how much of this disorder can be attributed to genetics? Can this “how much” even be quantified?

The Comparative Toxicogenomics Database curated a dataset that reflects noted relationships between genes and diseases. This dataset will never be able to prove causation between genes and diseases; however, analyzing the dataset for patterns can shed light on how genes may be influencing the appearance of diseases.

The purpose of this project is twofold. Firstly, it will gauge overall patterns in the set between genes and diseases. Some questions of the study include: what is the most central gene? And do connectivity patterns differ between known genetic diseases (sickle-cell anemia) and diseases that are not genetically caused but possibly genetically influenced (drug-induced liver injury)? Secondly, the project will investigate how scoliosis relates to the rest of the dataset. Some questions include: is the degree of scoliosis unusual? What is its centrality? Is the relationship of scoliosis with its neighborhood stronger or weaker than would be randomly expected? And are the second-degree neighbors skeletal disorders, too? Ultimately, this project seeks to determine whether scoliosis is more or less genetically associated than the dataset as a whole.
The Data

To investigate questions regarding the relationship between genes and health conditions, this project uses a dataset entitled CTD Gene-Disease Associations (footnote 1), an extensive set provided by the Comparative Toxicogenomics Database. The data consists of over 17,200 genes/proteins and 5,200 diseases/disorders, stored as an edge list. When the data frame is processed into an undirected graph, the graph has an order of 22,481 and a size of 888,519. Each edge indicates a gene-disease association, recorded because the gene and disease appeared simultaneously in a test subject. Because every edge joins a gene to a disease, the graph is bipartite. It is also not connected; the largest cluster is of order 21,102, which is still too large to be used as a subgraph, so the data will need to be subdivided in another way for ease of analysis.

Before addressing what the data identifies about scoliosis, the data should be analyzed for patterns in gene-disease connectivity. As mentioned previously, this data set of gene-disease associations cannot be used to prove causation. What diseases are inherently genetic? The data cannot answer this question. Nevertheless, the associations represented by the data can illuminate unusual relationships between genes and diseases. One way to do this is by comparing the gene-disease data with randomly generated graphs that share a few of the same invariants; other invariants can then be analyzed to determine whether the gene-disease data aligns with a random model, or whether it displays other patterns of association.

First, let’s establish some of the invariants of the gene-disease data. Of the 22481 nodes, 5218 are diseases and 17263 are genes. The degree distribution for the full data set is portrayed in a log-log graph (Fig. 1; the black line is the median degree, and the red line represents scoliosis). Its general linearity indicates that most nodes have a low degree, while a few nodes have a massive degree.
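As a rough illustration of how the invariants above (order, size, degree, largest cluster) can be obtained from an edge list, here is a minimal Python sketch. The toy gene and disease labels and the helper names build_graph and components are hypothetical, and this is not the tooling used for the analysis itself; it only assumes the data arrives as (gene, disease) pairs.

```python
from collections import defaultdict, deque

# Toy (gene, disease) pairs. Hypothetical labels, NOT taken from the CTD data.
edges = [
    ("G1", "D1"), ("G1", "D2"), ("G2", "D2"),
    ("G3", "D2"), ("G3", "D3"), ("G4", "D4"),
]

def build_graph(pairs):
    """Return (adjacency dict, gene set, disease set) for an undirected bipartite graph."""
    adj = defaultdict(set)
    genes, diseases = set(), set()
    for gene, disease in pairs:
        adj[gene].add(disease)
        adj[disease].add(gene)
        genes.add(gene)
        diseases.add(disease)
    return adj, genes, diseases

def components(adj):
    """Connected components found by breadth-first search, largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)

adj, genes, diseases = build_graph(edges)
order = len(adj)                                   # number of nodes
size = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
degree = {node: len(nbrs) for node, nbrs in adj.items()}
comps = components(adj)

print("order:", order, "size:", size)
print("largest cluster order:", len(comps[0]))
print("highest-degree node:", max(degree, key=degree.get))
```

On the real edge list the same three quantities would correspond to the order, size, and largest-cluster order reported above.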
The node of highest degree is part of the disease category: Drug-Induced Liver Injury, with a degree of 13899. The cause of the disease is in the name (drug-induced), suggesting that the disease is not genetic, though there may be genetic factors that influence appearance of the disease. The massive degree of the disease indicates that it is associated with a huge group of genes, roughly 80.5 percent of all the genes/proteins. The node with the next highest degree is Necrosis at 13552, connected with roughly 78.5 percent of the genes/proteins. By contrast, 3190 of the disease nodes (61 percent of the diseases) have only one gene connection. The diseases with thousands of gene associations are likely not affected by most of those genes, but merely appear simultaneously with them.

Footnote 1: Dataset can be found here: http://amp.pharm.mssm.edu/Harmonizome/dataset/CTD+Gene-Disease+Associations

Let’s next consider a random bipartite graph that was generated to have the same number of type-a nodes (5218), type-b nodes (17263), and edges between the nodes (888519). The random graph was analyzed for its degree distribution (Fig. 2). When a graph is randomly generated from order and size parameters alone, edges are placed independently: the placement of one edge has no influence on another edge. Therefore, the degree distribution is two separate bell curves, one for type-a nodes and one for type-b nodes; most nodes have moderate degrees, not high or low. There are no naturally occurring outliers.

What, then, do we make of the gene-disease distribution, which has a non-random gradation from low-degree nodes to very high-degree nodes? For starters, it confirms that the associations between genes and diseases are not random. When disease node degree is low, there is some discrimination; the disease is associated with a small, specific set of genes, possibly indicating close gene-disease association. When disease node degree is very high, it demonstrates that the disease associates with genes indiscriminately.
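The random comparison graph described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the generator used for Fig. 2: it assumes the random model picks a fixed number of distinct edges uniformly from all possible (type-a, type-b) pairs, the function names are hypothetical, and the demonstration runs on a scaled-down graph so it finishes quickly. The last two lines compute the mean degrees implied by the full-size counts, which is where the two bell curves would be centered under this model.

```python
import random
from collections import Counter

def random_bipartite(n_a, n_b, m, seed=0):
    """Sample m distinct edges uniformly from the n_a * n_b possible (a, b) pairs."""
    rng = random.Random(seed)
    picks = rng.sample(range(n_a * n_b), m)
    # Decode each flat index back into an (a, b) pair.
    return [(idx // n_b, idx % n_b) for idx in picks]

def degree_lists(edge_list, n_a, n_b):
    """Degrees of every type-a and type-b node, including nodes of degree zero."""
    deg_a, deg_b = [0] * n_a, [0] * n_b
    for a, b in edge_list:
        deg_a[a] += 1
        deg_b[b] += 1
    return deg_a, deg_b

# Scaled-down stand-in (assumption: roughly 1/100 of the node counts in the paper).
n_a, n_b, m = 52, 173, 890
rand_edges = random_bipartite(n_a, n_b, m)
deg_a, deg_b = degree_lists(rand_edges, n_a, n_b)

# Histogram: degree -> number of type-a nodes with that degree.
hist_a = Counter(deg_a)
for d in sorted(hist_a):
    print(d, "#" * hist_a[d])

# Mean degrees implied by the full-size counts reported in the paper.
full_mean_disease = 888519 / 5218    # average degree of a disease node
full_mean_gene = 888519 / 17263      # average degree of a gene node
print(round(full_mean_disease, 2), round(full_mean_gene, 2))
```

In this model every degree clusters around the mean for its side, so a disease with a single gene, or one with 13899 genes, would be extremely unlikely.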
Does this pattern indicate that genetic diseases are of low degree and non-genetic diseases are of high degree? Perhaps sometimes, but not always. It’s possible that a genetic disease could be associated with many genes that combine to cause the disease. It’s also possible that a non-genetic disease appears infrequently, and therefore there is less data and fewer associations for that disease.

Consider Fig. 3, the degree distribution for diseases only. The blue lines indicate diseases that are likely non-genetic conditions: drug-induced liver injury, poisoning (both high-degree), water-electrolyte imbalance, brain injuries, papillomavirus infections, cytomegalovirus infections, and respiratory syncytial virus infections. The green lines indicate diseases known to be genetic: Down Syndrome, Hemophilia, Marfan Syndrome, Sickle Cell Disease, beta-Thalassemia, and Holoprosencephaly. The remaining lines indicate our diseases of interest: Scoliosis (red), Sleep Disorders (yellow), and skeletal disorders including three types of arthritis and osteomalacia (orange). From such a tiny, non-random (biased) sample of diseases, conclusions cannot be drawn about degree distribution as it relates to whether a disease is genetic. (It would require more resources than available to comprehensively categorize the disease data by known-genetic, known-non-genetic, and unknown-genetic.) Still, the distribution is consistent with the idea that huge degrees belong mostly to non-genetic diseases.

Extension: Scoliosis

The discussion above treated the CTD gene-disease dataset as a whole: it outlined some data analysis that verified the non-randomness of the associations, and it addressed how genetic and non-genetic diseases appear in the degree distribution. This section extends the methods used previously to dig deeper into how scoliosis relates to the data set. First, consider the centrality of Scoliosis, especially as compared with other disease nodes.
The top of Fig. 4 depicts the closeness distribution for disease nodes. (Because the dataset is so large, estimate_closeness() was used with cutoff=2. It would not have been beneficial to use a subgraph at this point, since that would affect the closeness values.) Scoliosis is much more central to the dataset than the average disease. Notice, however, that the closeness value of Scoliosis, 2.036e-09, is much smaller than the average for the random graph (Fig. 4 bottom). It makes sense that the random graph would have a more uniform dispersion of closeness than the actual data; the CTD data has a wider range of closeness values.

The most central disease per the closeness value is once again Drug-Induced Liver Injury, with a value of 1.0299e-08. Meanwhile, the most central gene is TNF, which is also the gene of the highest degree. This concept of “centrality” may not mean much in this dataset. It seems that Drug-Induced Liver Injury is associated with a huge number of genes simply because it is non-genetic and could appear with any gene. Likewise, TNF seems to be generally associated with numerous diseases without being a genetic contributor to them. Thus, we will need to dig deeper than centrality to learn anything about the patterns around Scoliosis.

What might the second-degree neighbors of Scoliosis indicate about Scoliosis? An ego graph for Scoliosis was made to reflect the neighborhood: the first ring of associated genes, and the second ring, made up of diseases that have a gene in common with Scoliosis. This much smaller subgraph can be plotted as nodes and edges (Fig. 5). Its order is 634, only 2.8 percent of the total number of nodes. But when the next ring of associated genes is included, the graph becomes unwieldy: the third-degree neighborhood has an order of 16802, which is 75 percent of the total nodes. The sixth-degree neighborhood of Scoliosis contains all 21102 nodes in its cluster.
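The ring-by-ring neighborhoods and the cutoff idea behind the closeness estimate can be illustrated with a small breadth-first search in Python. This is a sketch on a hypothetical toy graph, and the closeness formula used here (the reciprocal of the summed distances to nodes reached within the cutoff) is an assumption chosen for illustration; estimate_closeness() may treat unreached nodes and normalization differently, so these values are not comparable to the ones reported above.

```python
from collections import deque

# Hypothetical toy bipartite graph: "S" plays the role of the focal disease,
# lowercase g* nodes are genes, D* nodes are other diseases.
adj = {
    "S": {"g1", "g2"},
    "g1": {"S", "D1"},
    "g2": {"S", "D1", "D2"},
    "D1": {"g1", "g2", "g3"},
    "D2": {"g2"},
    "g3": {"D1", "D3"},
    "D3": {"g3"},
}

def bfs_distances(adj, source, cutoff=None):
    """Shortest-path lengths from source, exploring no farther than cutoff steps."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if cutoff is not None and dist[node] == cutoff:
            continue  # do not expand past the cutoff ring
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def neighborhood(adj, source, radius):
    """Nodes within `radius` steps of source, including source itself."""
    return set(bfs_distances(adj, source, cutoff=radius))

def cutoff_closeness(adj, source, cutoff):
    """Assumed definition: 1 / (sum of distances to nodes within the cutoff)."""
    total = sum(bfs_distances(adj, source, cutoff).values())
    return 1 / total if total else 0.0

for radius in (1, 2, 3, 4):
    print(radius, len(neighborhood(adj, "S", radius)))
print(cutoff_closeness(adj, "S", 2))
```

Because the graph is bipartite, odd rings contain only genes and even rings contain only diseases, which is why the second-degree neighborhood is the smallest one that brings in other diseases.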
For manageability, the second-degree neighborhood was the focus. Within this neighborhood, there are 597 diseases and 37 genes (Scoliosis has degree 37). The gene with both the highest closeness value and the highest degree is CYCS, which is connected to 453 of the 597 diseases (75.9 percent) in this neighborhood. The degree distribution of diseases for the neighborhood (Fig. 6) resembles the original distribution of the total dataset; it is linear, with small degrees occurring most frequently. (This makes sense, because the edges for numerous diseases in the second ring were removed in making the subgraph.)

What are the second-degree neighbors like, the diseases that share at least one gene with Scoliosis? Four diseases share every gene with Scoliosis: Drug-Induced Abnormalities (7934), Hyperplasia (11806), Necrosis (13552), and Prenatal Exposure Delayed Effects (12642). Drug-Induced Liver Injury (13899) shares 36 genes, and Fatty Liver (11464) and Learning Disorders (9705) both share 35 genes. Included in parentheses are the degrees of these diseases within the full dataset. Because each of these diseases has a huge degree originally, it is not very impressive that they share so many genes with Scoliosis.

Which diseases in the neighborhood of Scoliosis keep the largest share of their full-graph degree within the subgraph, that is, have the highest ratio of subgraph degree to full-graph degree? An algorithm was used to find the highest ratios, and the table below lists these nodes.
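One way such a ratio search could work is sketched below in Python. This is not the algorithm actually used, whose details are not given here; it assumes each disease is stored with its set of associated genes, and the disease names, gene labels, and the function name shared_gene_table are all hypothetical. Within the second-degree ego subgraph, a disease's degree equals the number of genes it shares with the focal disease, so the ratio is shared genes divided by full-graph degree.

```python
# Hypothetical toy data: disease -> set of associated genes in the full graph.
disease_genes = {
    "FOCAL": {"g1", "g2", "g3"},
    "BROAD": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "PARTIAL": {"g2", "g7"},
    "SPECIFIC": {"g3"},
    "UNRELATED": {"g8"},
}

def shared_gene_table(disease_genes, focal):
    """Rows of (disease, full degree, subgraph degree, ratio), highest ratio first."""
    focal_genes = disease_genes[focal]
    rows = []
    for disease, genes in disease_genes.items():
        if disease == focal:
            continue
        shared = len(genes & focal_genes)  # degree inside the ego subgraph
        if shared == 0:
            continue  # not a second-degree neighbor of the focal disease
        rows.append((disease, len(genes), shared, shared / len(genes)))
    # Sort by ratio (descending), breaking ties alphabetically by name.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows

rows = shared_gene_table(disease_genes, "FOCAL")
for name, full_deg, sub_deg, ratio in rows:
    print(f"{name:10s} full={full_deg} subgraph={sub_deg} ratio={ratio:.2f}")
```

In the toy data, BROAD shares every gene with the focal disease but ranks below SPECIFIC, because only half of its own genes fall inside the neighborhood; this mirrors the contrast between the high-degree diseases listed above and the low-degree diseases that a ratio of 1 singles out.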
Diseases                                                                    Degree in full graph   Degree in subgraph
CATARACT 23                                                                 1                      1
Cataract, Congenital Nuclear, Autosomal Recessive 3                         1                      1
Craniolenticulosutural Dysplasia                                            1                      1
DEAFNESS, AUTOSOMAL RECESSIVE 74                                            1                      1
Dementia, familial British                                                  1                      1
Dementia, familial Danish                                                   1                      1
Familial Glucocorticoid Deficiency 1                                        1                      1
Immune dysfunction with T-cell inactivation due to calcium entry defect 2   1                      1
Porphyria, Erythropoietic                                                   1                      1
Pyruvate Dehydrogenase E3-Binding Protein Deficiency                        1                      1
SPONDYLOCOSTAL DYSOSTOSIS 1, AUTOSOMAL RECESSIVE                            1                      1
Thrombocytopenia 4                                                          1                      1

Though the table may not seem very interesting, it is insightful: every disease in it shares all of its associated genes with Scoliosis, which here means its single associated gene. I, as the researcher, do not have the knowledge to analyze these diseases and find patterns in their relationship to Scoliosis. In future research, that may be a fruitful exploration.

Conclusion

Despite the immensity of this dataset, it seemed to barely scratch the surface of all we have to gain from studying genes and how they are manifested in human health. There are certainly ways this data could be improved. In addition to documenting the gene-disease associations, it could be beneficial to track the proportion of subjects in which each association actually appears. Are Scoliosis and CYCS frequently seen together, or only in rare instances? Tracking frequency of association could help determine how strong the associations are and be another indicator for the genetics of a disease. Additionally, the data could be better categorized: diseases can be sorted into categories based on anatomical regions (skeletal, muscular, mental), and the aggregate genes/proteins data could be split into separate groups for genes and proteins.
As for improvement of the research, there are opportunities for more thoroughly exploring the patterns between known-genetic diseases and known-non-genetic diseases, which may provide insight into categorizing unknown diseases as genetic or non-genetic. Another way to expand the research would be to choose different focal diseases and run the algorithm that generated the above table, thereby honing networks of diseases by finding those that share a greater proportion of genes. As discovered, Drug-Induced Liver Injury may not have strong ties to any single subgraph of genes and diseases, while Thrombocytopenia 4 has a clear connection to Scoliosis through a single shared gene, its only associated gene. It would be interesting to examine this partial set of diseases that are closely related to Scoliosis to see if there are patterns in what kind of diseases they are.

The CTD dataset was an interesting study. Analyzing the graph made it clear that the graph was non-random; though it cannot be concluded that any set of genes leads to the appearance of a disease or disorder, there are clearly strong relationships between genes and diseases. Some diseases of high degree and centrality are less interesting because they seem to associate with genes indiscriminately. Meanwhile, the diseases of low degree allow us to examine why they appear with a particular gene. The more focused study of Scoliosis did not yield anything conclusive in regard to how it relates to its second-degree neighborhood. Nevertheless, the research was able to identify that Scoliosis is more central than average. It was also beneficial progress to filter the neighborhood for those diseases that share the greatest proportion of their genes with Scoliosis, though it will require further research to make sense of these results.

Bibliography

These sources were referenced when brainstorming which “random” diseases to use as “genetic” and “non-genetic” examples.

"Specific Genetic Disorders." National Human Genome Research Institute (NHGRI). N.p., n.d. Web. 08 Dec. 2016.
"Category:Skeletal Disorders." Wikipedia. Wikimedia Foundation, n.d. Web. 08 Dec. 2016.