Gene-Disease Analysis Paper

Hannah Morgan 1
12/8/16
If The Genes Fit, Wear Them:
Analyzing Gene-Disease Associations With a Focus on Scoliosis
A girl of eleven years is flagged as a possible scoliosis patient in a standard medical screening and is later
diagnosed with a severe case of this common skeletal disorder. Her case is described as
adolescent idiopathic scoliosis, indicating that the cause of the disorder is unknown. However, her
parents have mild scoliosis, her sister has a postural deformity called (hyper)kyphosis, and her niece has
been diagnosed with the same severe scoliosis and received a similar orthotic treatment, all of which supports the
hypothesis that the disorder may be related to genetics. If so, how much of this disorder can be attributed
to genetics? Can this “how much” even be quantified?
The Comparative Toxicogenomics Database curated a dataset that reflects noted relationships
between genes and diseases. This dataset will never be able to prove causation between genes and
diseases; however, analyzing the dataset for patterns can shed light on how genes may be influencing the
appearance of diseases. The main purpose of this project is twofold. First, it will gauge overall patterns
in the set between genes and diseases. Questions include: What is the most central
gene? Do connectivity patterns differ between known genetic diseases (e.g., sickle-cell anemia) and
diseases that are not genetically caused but possibly genetically influenced (e.g., drug-induced liver injury)?
Second, the project will investigate how scoliosis relates to the rest of the dataset. Questions
include: Is the degree of scoliosis unusual? What is its centrality? Is the relationship of scoliosis with its
neighborhood stronger or weaker than would be randomly expected? Are the second-degree neighbors
skeletal disorders, too? Ultimately, this project seeks to determine whether scoliosis is more or less
genetically associated than is typical for diseases in the dataset.
The Data
To investigate questions regarding the relationship of genes and health conditions, this project
utilizes a dataset entitled CTD Gene-Disease Associations (footnote 1), an extensive set provided by the Comparative
Toxicogenomics Database. The data consists of over 17,200 genes/proteins and 5,200 diseases/disorders,
stored as an edge list. When the data frame is processed into an undirected graph, the graph has an order of
22,481 and a size of 888,519. Each edge indicates a gene-disease association, meaning that the gene and disease
appeared together in a test subject. Because every edge joins a gene to a disease, the graph is bipartite. It is also not connected; the largest
cluster has order 21,102, which is still too large to be used conveniently as a subgraph. The data will need to
be subdivided in another way for ease of analysis.
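The processing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this project: the tiny edge list and the names build_graph and largest_component_order are assumptions made for the example.

```python
from collections import defaultdict, deque

# Toy edge list of (gene, disease) pairs; hypothetical stand-ins for the CTD rows.
EDGES = [
    ("G1", "D1"), ("G1", "D2"), ("G2", "D2"),
    ("G3", "D2"), ("G4", "D3"),
]

def build_graph(edges):
    """Return an undirected adjacency map {node: set of neighbors}."""
    adj = defaultdict(set)
    for gene, disease in edges:
        adj[gene].add(disease)
        adj[disease].add(gene)
    return dict(adj)

def largest_component_order(adj):
    """Return the number of nodes in the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, count = deque([start]), 0
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            count += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, count)
    return best

graph = build_graph(EDGES)
order = len(graph)                                       # number of nodes
size = sum(len(nbrs) for nbrs in graph.values()) // 2    # each edge is stored twice
print(order, size, largest_component_order(graph))
```

On the real edge list, the same three quantities would correspond to the order, the size, and the order of the largest cluster reported above.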
Before addressing what the data identifies about scoliosis, the data should be analyzed for
patterns in gene-disease connectivity. As mentioned previously, this data set of gene-disease associations
cannot be used to prove causation. What diseases are inherently genetic? The data cannot answer this
question. Nevertheless, the associations represented by the data can illuminate unusual relationships
between genes and diseases. One way to do this is by comparing the gene-disease data with randomly
generated graphs that share a few of the same invariants; other invariants can then be analyzed to
determine whether the gene-disease data aligns with a random model, or whether it displays other patterns
of association.
1 Dataset can be found here: http://amp.pharm.mssm.edu/Harmonizome/dataset/CTD+Gene-Disease+Associations
First, let’s establish some of the invariants of the gene-disease data. Of the 22481 nodes, 5218
are diseases and 17263 are genes. The degree distribution for the full data set is portrayed in the log-log
graph below (Fig. 1; the black line is the median degree, and the red line represents scoliosis). Its general linearity
indicates that most nodes have a low degree, though a few nodes have a massive degree. The node of
highest degree is a disease, Drug-Induced Liver Injury, with a degree of 13899. The
cause of the disease is in the name (drug-induced), suggesting that the disease is not genetic, though
there may be genetic factors that influence its appearance. The massive degree of the disease
indicates that it is associated with a huge group of genes, roughly 80.5 percent of all the genes/proteins.
The node with the next highest degree is Necrosis at 13552, connected with roughly 78.5 percent of the
genes/proteins. However, 3190 of the disease nodes (61 percent of the diseases) have only one
gene connection. By contrast, diseases with thousands of gene associations likely are not affected by most of
them, but merely appear simultaneously with the genes.
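The degree figures quoted above follow from simple counting over the edge list. The sketch below shows one way to do that counting in Python; the toy edge list and the helper names disease_degrees and percent are assumptions for illustration. The last line repeats the arithmetic on the counts reported in the text.

```python
from collections import Counter

# Toy (gene, disease) pairs; hypothetical stand-ins for the CTD rows,
# assumed to contain no duplicate pairs.
EDGES = [
    ("G1", "D1"), ("G2", "D1"), ("G3", "D1"),
    ("G1", "D2"), ("G2", "D3"),
]

def disease_degrees(edges):
    """Count how many genes each disease is linked to."""
    return Counter(disease for _gene, disease in edges)

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

degrees = disease_degrees(EDGES)
n_genes = len({gene for gene, _disease in EDGES})
top_disease, top_degree = degrees.most_common(1)[0]
degree_one = sum(1 for d in degrees.values() if d == 1)

# Share of all genes linked to the highest-degree disease.
print(top_disease, percent(top_degree, n_genes))
# Share of diseases that have exactly one gene connection.
print(percent(degree_one, len(degrees)))

# The same arithmetic applied to the counts reported in the text.
print(percent(13899, 17263), percent(13552, 17263), percent(3190, 5218))
```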
Let’s next consider a random bipartite
graph that was generated to have the same
number of type-a nodes (5218), type-b nodes
(17263), and edges between the nodes (888519).
The random graph was analyzed for its degree
distribution (Fig 2). It is clear that when a graph
is randomly generated with order and size
parameters, edges are placed randomly – the
placement of one edge has no influence on
another edge. Therefore, the degree distribution
is two separate bell curves for type-a and type-b
nodes; most nodes have moderate degrees, not
high or low. There are no naturally occurring
outliers.
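One way to generate such a random bipartite graph is to draw a fixed number of distinct type-a/type-b pairs uniformly at random. The Python sketch below does this on a small scale. It is an assumption about how the random model could be built, not a record of the generator actually used for Fig. 2, and the function names and toy sizes are illustrative.

```python
import random
from collections import Counter

def random_bipartite_edges(n_a, n_b, m, seed=None):
    """Pick m distinct (a, b) pairs uniformly from the n_a * n_b possible pairs."""
    rng = random.Random(seed)
    picks = rng.sample(range(n_a * n_b), m)   # requires m <= n_a * n_b
    # Each integer index encodes one pair: row = type-a node, column = type-b node.
    return [(index // n_b, index % n_b) for index in picks]

def side_degrees(edges, n_a, n_b):
    """Return the degree lists for the type-a and type-b nodes."""
    a_count = Counter(a for a, _b in edges)
    b_count = Counter(b for _a, b in edges)
    return [a_count[i] for i in range(n_a)], [b_count[j] for j in range(n_b)]

# Toy sizes; the text uses 5218 type-a nodes, 17263 type-b nodes, and 888519 edges.
N_A, N_B, M = 50, 150, 2000
edges = random_bipartite_edges(N_A, N_B, M, seed=1)
deg_a, deg_b = side_degrees(edges, N_A, N_B)

print("type-a mean degree:", sum(deg_a) / N_A, "range:", min(deg_a), max(deg_a))
print("type-b mean degree:", sum(deg_b) / N_B, "range:", min(deg_b), max(deg_b))
```

Because every pair is equally likely, the degrees on each side cluster around the mean for that side. With the counts in the text, those means are about 170 for type-a nodes (888519 / 5218) and about 51 for type-b nodes (888519 / 17263), which is consistent with two separate bell curves.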
Figure 1
What, then, do we make of the gene-disease distribution, which has non-random
gradation from low-degree nodes to very high-degree nodes? For starters, it simply confirms
association between genes and diseases. When
disease node degree is low, there is some
discrimination; the disease is associated with a
finite set of genes, possibly indicating close
gene-disease association. When disease node
degree is very high, it demonstrates the disease
associates with genes indiscriminately.
Therefore, does this indicate that genetic diseases
are of low degree and non-genetic diseases are of
high degree? Perhaps sometimes, but not
always. It’s possible that a genetic disease could
be associated with many genes that combine to
cause the disease. It’s also possible that a non-genetic disease appears infrequently, and
therefore there is less data and fewer associations
for that disease.
Figure 2
Consider Fig. 3, the degree distribution for
diseases only. The blue lines indicate diseases that are
likely non-genetic conditions: drug-induced liver
injury, poisoning (both high-degree), water-electrolyte
imbalance, brain injuries, papillomavirus infections,
cytomegalovirus infections, and respiratory syncytial
virus infections. The green lines indicate diseases
known to be genetic: Down Syndrome, Hemophilia,
Marfan Syndrome, Sickle Cell Disease, beta-Thalassemia, and Holoprosencephaly. The remaining lines
indicate our diseases of interest: Scoliosis (red), Sleep
Disorders (yellow), and skeletal disorders including
three types of arthritis and osteomalacia (orange).
From such a tiny, non-random (biased) sample of
diseases, conclusions cannot be drawn about how degree
distribution relates to whether a disease is genetic.
(It would require more resources than are available to
comprehensively categorize the disease data as known-genetic, known-non-genetic, and unknown-genetic.)
Still, the distribution seems to suggest that huge degrees
belong mostly to non-genetic diseases.
Figure 4
Extension: Scoliosis
The discussion above treated the CTD gene-disease dataset as a whole: it outlined data analysis that
verified the non-randomness of the associations and
addressed how genetic and non-genetic diseases appear in the degree
distribution. This section extends those methods to
dig deeper into how scoliosis relates to the data set.
First, consider the centrality of Scoliosis, especially as
compared with other disease nodes. The top of Fig. 4 depicts the
closeness distribution for disease nodes. (Because the dataset is so
large, estimate_closeness() was used with cutoff=2. It would not
have been beneficial to use a subgraph at this point, since that would
affect the closeness values.) Scoliosis is much more central to the
dataset than the average disease. Notice, however, that the
closeness value of 2.036e-09 for Scoliosis is much smaller than the
average for the random graph (Fig. 4 bottom). It makes sense that
the random graph would have a more uniform dispersion of
closeness than the actual data; the CTD data has a wider range of
closeness values. The most central disease per the closeness value is
once again Drug-Induced Liver Injury, with a value of 1.0299e-08.
Meanwhile, the most central gene is TNF, which is also the gene of
the highest degree. This concept of “centrality” may not mean much
in this dataset. It seems that Drug-Induced Liver Injury is associated
with a huge number of genes simply because it is non-genetic and
could appear with any gene. Likewise, TNF seems to be generally
associated with numerous diseases without being a genetic contributor to the disease. Thus, we will need
to dig deeper than centrality to learn anything about the patterns around Scoliosis.
Figure 3
What might Scoliosis’ second-degree neighbors indicate
about Scoliosis? An ego graph for Scoliosis was made to reflect
the neighborhood: the first ring of associated genes, and the
second ring, made up of diseases that have a gene in
common with Scoliosis. This much smaller subgraph can
be plotted as nodes and edges (Fig. 5). Its order is 634, only 2.8
percent of the total number of nodes. But when the next ring of
associated genes is included, the graph is unwieldy: the third-degree
neighborhood has an order of 16802, which is 75 percent
of the total nodes. The sixth-degree neighborhood of Scoliosis
contains all the nodes in its cluster, which has order 21102.
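The growth of the neighborhood with each added ring can be measured with a breadth-first search that stops at a given radius. The Python sketch below is a minimal illustration on a toy graph; the graph and the function name neighborhood are assumptions, not the project's code.

```python
from collections import deque

# Toy bipartite graph (D = disease, G = gene); hypothetical data.
ADJ = {
    "D1": {"G1", "G2"},
    "G1": {"D1", "D2"},
    "G2": {"D1"},
    "D2": {"G1", "G3"},
    "G3": {"D2", "D3"},
    "D3": {"G3"},
}

def neighborhood(adj, source, radius):
    """Return the set of nodes within `radius` steps of `source` (source included)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:          # outermost ring: do not expand further
            continue
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)

# Order of the ego graph of D1 as the radius grows.
for radius in range(1, 5):
    print(radius, len(neighborhood(ADJ, "D1", radius)))
```

In a bipartite gene-disease graph, odd rings around a disease contain genes and even rings contain diseases, so radius 2 yields the focal disease, its genes, and the diseases that share a gene with it.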
For manageability, the second-degree neighborhood was
the focus. Within this neighborhood, there are 597 diseases and
37 genes (Scoliosis has degree 37). The gene with both the highest
closeness value and degree is CYCS, which is connected to 453
of the 597 diseases (75.9 percent) in this neighborhood. The degree
distribution of diseases for the neighborhood (Fig. 6) resembles
the original distribution of the total dataset; it is linear, with the
highest frequency at small degrees. (This makes sense, because
the edges for numerous diseases in the second ring were
removed in making the subgraph.)
Figure 5
Figure 6
What are the second-degree neighbors like, the diseases
that share at least one gene with Scoliosis? There are four
diseases that share every gene with Scoliosis: Drug-Induced
Abnormalities (7934), Hyperplasia (11806), Necrosis (13552),
and Prenatal Exposure Delayed Effects (12642). Drug-Induced
Liver Injury (13899) shares 36 genes, and Fatty Liver (11464)
and Learning Disorders (9705) both share 35 genes. Included in
parentheses are the degrees of these diseases within the full
dataset. Because each of these diseases has a huge degree
originally, it is not very impressive that they share so many genes with Scoliosis. Which diseases in the
neighborhood of Scoliosis have the highest ratio of their degree in the subgraph to their degree in
the full graph? An algorithm was used to find the highest ratios, and the following table contains these
nodes.
Diseases                                                                   Degree in full graph  Degree in subgraph
CATARACT 23                                                                1                     1
Cataract, Congenital Nuclear, Autosomal Recessive 3                        1                     1
Craniolenticulosutural Dysplasia                                           1                     1
DEAFNESS, AUTOSOMAL RECESSIVE 74                                           1                     1
Dementia, familial British                                                 1                     1
Dementia, familial Danish                                                  1                     1
Familial Glucocorticoid Deficiency 1                                       1                     1
Immune dysfunction with T-cell inactivation due to calcium entry defect 2  1                     1
Porphyria, Erythropoietic                                                  1                     1
Pyruvate Dehydrogenase E3-Binding Protein Deficiency                       1                     1
SPONDYLOCOSTAL DYSOSTOSIS 1, AUTOSOMAL RECESSIVE                           1                     1
Thrombocytopenia 4                                                         1                     1
Though the table may not seem very interesting, it is insightful: it shows the diseases that
share all of their associated genes with Scoliosis. (Each has a single associated gene, and that gene is also
associated with Scoliosis.) I, as the researcher, do not have the knowledge to analyze
these diseases and find patterns in their relationship to Scoliosis. In future research, that may be a fruitful
exploration.
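The ranking behind the table can be sketched as follows. For each disease that shares a gene with the focal disease, the sketch compares its degree in the second-degree neighborhood (the number of genes it shares with the focal disease) with its degree in the full graph. This is a hedged reconstruction of the idea in Python, not the algorithm actually run; the toy edge list, the disease labels, and the function name shared_gene_ratios are assumptions.

```python
from collections import defaultdict

# Toy (gene, disease) pairs; hypothetical data with "Focal" playing the role of Scoliosis.
EDGES = [
    ("G1", "Focal"), ("G2", "Focal"),
    ("G1", "Broad"), ("G2", "Broad"), ("G3", "Broad"), ("G4", "Broad"),
    ("G1", "Narrow"),
    ("G3", "Unrelated"),
]

def shared_gene_ratios(edges, focal):
    """Rank diseases by (degree in subgraph) / (degree in full graph).

    Returns rows of (disease, full_degree, subgraph_degree, ratio), highest
    ratio first. Diseases sharing no gene with `focal` are left out.
    """
    genes_of = defaultdict(set)
    for gene, disease in edges:
        genes_of[disease].add(gene)
    focal_genes = set(genes_of[focal])
    rows = []
    for disease, genes in genes_of.items():
        if disease == focal:
            continue
        shared = len(genes & focal_genes)   # degree inside the neighborhood
        if shared == 0:
            continue                        # not a second-degree neighbor
        rows.append((disease, len(genes), shared, shared / len(genes)))
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows

for row in shared_gene_ratios(EDGES, "Focal"):
    print(row)
```

In this toy example the disease with one gene, shared with the focal disease, ranks above the high-degree disease that shares both focal genes, mirroring how the single-gene diseases in the table outrank Drug-Induced Liver Injury and Necrosis.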
Conclusion
Despite the immensity of this dataset, it seemed to barely scratch the surface of all we have to
gain from studying genes and how they are manifested in human health. There are certainly ways this
data could be improved. In addition to documenting the gene-disease associations, it could be beneficial
to track how often each association is actually observed in subjects. Are Scoliosis and
CYCS frequently seen together, or only in rare instances? Tracking frequency of association could help
determine how strong the associations are and be another indicator for the genetics of a disease.
Additionally, the data could be better categorized: diseases can be sorted into categories based on
anatomical regions (skeletal, muscular, mental), and the genes/protein aggregate data could be split into
separate groups for gene and protein.
As for improvement of the research, there are opportunities for more thoroughly exploring the
patterns between known-genetic diseases and known-non-genetic diseases, which may provide insight
into categorizing unknown diseases into genetic or non-genetic. Another way to expand the research
would be to choose different focal diseases and run the algorithm to generate the above table, thereby
honing networks of diseases by finding those that share a greater proportion of genes. As discovered,
Drug-Induced Liver Injury may not have strong ties to any single subgraph of genes and diseases, while
Thrombocytopenia 4 has a clear connection to Scoliosis through a single shared gene, its only associated
gene. It would be interesting to examine this partial set of diseases that are closely related to Scoliosis to
see if there are patterns in what kind of diseases they are.
The CTD dataset was an interesting study. Analyzing the graph made it clear that the graph is
non-random. Though it cannot be concluded that any set of genes leads to the appearance of a
disease/disorder, there are clearly strong relationships between genes and diseases. Some diseases of high
degree and centrality are less interesting because they seem to associate with genes indiscriminately.
Meanwhile, the diseases of low degree allow us to examine why they appear with a particular gene. The
more focused study of Scoliosis did not yield anything conclusive in regard to how it relates to its second-degree
neighborhood. Nevertheless, the research was able to identify that Scoliosis is more central than
the average disease. It was also useful progress to filter the neighborhood for those diseases which have the greatest
share of their genes in common with Scoliosis, though it will require further research to make sense of these results.
Bibliography
These sources were referenced when brainstorming which “random” diseases to use as “genetic”
and “non-genetic” examples:
"Specific Genetic Disorders." National Human Genome Research Institute (NHGRI). N.p., n.d. Web. 08
Dec. 2016.
"Category:Skeletal Disorders." Wikipedia. Wikimedia Foundation, n.d. Web. 08 Dec. 2016.