
Prediction of symptoms for genetic disease
Machine Learning Foundations, fall 2013
Final Project
Itay Dangoor
Introduction
Genetic disorders account for a significant part of human disease and attract considerable
research attention in the biological and medical fields. In recent years, information
connecting human phenotypic anomalies to gene defects has been progressively
gathered into large data sets. Having such comprehensive references for genetic
disorders and their effects makes it possible to carry out large-scale research involving
both cellular activity and phenotypes.
Constraint-based metabolic modeling, abbreviated CBM, is a widely used
framework for computationally examining cell metabolic behavior under different states.
CBM models describe cell metabolic processes by defining a set of metabolites
(biochemical compounds) and a set of convex constraints enforced on them. Usually a
solution that satisfies all the constraints is obtained using Linear Programming
methods, allowing instant access to a description of all the cellular processes at once.
Thus CBM models enable systematic, genome-wide computational research of cellular
metabolic activity and drive novel characterization of cellular behavior. In 2007, CBM
models for a generic human cell were presented [Duarte et al 2007; Ma et al 2007],
allowing the modeling community to extend its research focus from microorganism
engineering towards major human medical issues.
Construction of a map from the physiological cellular state to physical
symptoms is an open challenge that computational biology faces today. In this work I
will try to construct such a map by building predictors dedicated to phenotypes of
human disease. Based on gene defect descriptions computed using CBM methods
and the human model [Duarte et al 2007], and on observed phenotype-to-gene
associations, I will render predictions for human phenotype emergence as a function
of gene defect patterns.
Data
Human Phenotype Ontology
The Human Phenotype Ontology, or HPO, is an ordered reference of human
genetic diseases and their associated observed phenotypes [www.human-phenotype-ontology.org]. A set of phenotypes was extracted from this data set, together with the
corresponding set of causative genetic diseases. Then, for each disease, the set of known
gene defects causing it was extracted. This process yields a two-layer map,
from phenotypes to diseases and from diseases to genes. Linking every
gene to all the phenotypes that map to a disease associated with that gene results in a
many-to-many phenotype-to-causative-gene mapping. The samples in this work are
gene defects, and the classes, or labels, are the phenotypes.
A phenotype's causative gene defects are considered positive samples, while
the rest of the genes are considered negative samples. In order to have enough training
data and render good predictions, phenotypes with fewer than 10 causative genes were
filtered out, leaving 120 phenotypes with between 10 and 45 positive samples each
(Figure 1).
Figure 1: histogram of phenotype counts as a function of the number of positive samples
associated with them.
Obtaining descriptors for gene defects
CBM models contain a set of metabolites (biochemical compounds), a set of
biochemical reactions formed as convex linear constraints on the metabolites, and a
map between reactions and the genes regulating them. The cellular state with a defective
gene is simulated using the human CBM model [Duarte et al 2007]. First, in order to enforce a
gene malfunction on the model, all the reactions related to this gene are constrained to
carry zero flux. A solution that satisfies all the constraints can be found using Linear
Programming, but usually the constraints yield more than one feasible solution. To
resolve this issue, FBA (Flux Balance Analysis) is used [Varma and Palsson 1994].
FBA is a common CBM optimization method which adds a biologically meaningful
objective function to the LP problem. The most prevalent objective simulates
maximization of the cellular growth rate, and the resulting problem is again solved using
Linear Programming. Samples are vectors of flux rates through the reactions, obtained
from the linear program solution.
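As an illustration only, the following is a minimal sketch of one knockout simulation, assuming the model is given as a stoichiometric matrix S (metabolites by reactions), flux bound vectors lb and ub, a gene-to-reaction index map, and the index of the biomass reaction; all of these names are placeholders rather than the original implementation.

    import numpy as np
    from scipy.optimize import linprog

    def knockout_fluxes(S, lb, ub, biomass_idx, gene_to_reactions, gene):
        # Enforce the gene malfunction: every reaction regulated by the gene
        # is constrained to carry zero flux.
        lb, ub = lb.copy(), ub.copy()
        for r in gene_to_reactions[gene]:
            lb[r] = ub[r] = 0.0
        # FBA objective: maximize flux through the biomass (growth) reaction,
        # subject to the steady-state constraint S v = 0 and the flux bounds.
        c = np.zeros(S.shape[1])
        c[biomass_idx] = -1.0                       # linprog minimizes, so negate
        res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                      bounds=list(zip(lb, ub)), method="highs")
        return res.x                                # flux rates = the sample descriptor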
Filtering non-informative dimensions
The cellular state, as defined by the solution of the LP problem, is a vector in
which each entry corresponds to the flux rate through one biochemical reaction in the
model. The dimension of these vectors is 3788, as 3788 reactions are defined in the
human metabolic model. Due to limitations of the model and the FBA method, some
of the reactions always carry constant flux in my simulations, and hence contain no
information. In order to increase prediction accuracy and speed, the dimensions
with zero variance across all samples were filtered out, leaving 1207 dimensions.
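A short sketch of this filter, assuming the flux vectors are stacked into a samples-by-reactions matrix X (a placeholder name); on the data used here it reduces 3788 columns to 1207.

    import numpy as np

    def drop_constant_columns(X):
        # Keep only reactions whose flux varies across the gene-defect samples.
        keep = X.var(axis=0) > 0
        return X[:, keep], keep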
Learning
Learning process summary
The goal of the learning in this work is to build a predictor per phenotype, so as to
be able to decide which phenotypes will be expressed for each gene defect. For
that purpose, two classification methods are used as predictors, and five predictors are trained
for each phenotype: the K-Nearest Neighbor algorithm with k = 1, 3, 5, and the
Support Vector Machine algorithm with a linear kernel and with a radial kernel. The training
and testing groups are built using an equal number of positive and negative
samples. To test the prediction accuracy, a 5-fold cross-validation process is used.
The error rates are measured per phenotype, on the testing group only. The error
is defined as the percentage of positive samples predicted negative plus the percentage
of negative samples predicted positive.
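One way to read this definition, consistent with the accuracy measure used later (both percentages are taken out of the whole balanced test group, so their sum equals the overall misclassification rate), is sketched below; y_true and y_pred are hypothetical 0/1 label arrays.

    import numpy as np

    def phenotype_error(y_true, y_pred):
        # Both terms are taken out of the full (balanced) test group, so their
        # sum equals the overall misclassification rate on that group.
        n = len(y_true)
        pos_as_neg = np.sum((y_true == 1) & (y_pred == 0)) / n
        neg_as_pos = np.sum((y_true == 0) & (y_pred == 1)) / n
        return pos_as_neg + neg_as_pos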
K-Nearest-Neighbor
The implementation of K-NN is based on lecture 7 of the course, and is the same
as the one handed in for exercise 3. The distance between
samples is measured using the Euclidean metric. Since the data is of very high
dimension (1207), and the topology of the problem is not exactly known, 3
configurations of the K-NN classifier are used. The configurations differ from
each other in the classification decision rule (a sketch follows the list):
1. NN - return the class of the single nearest neighbor.
2. 3NN - return the majority class among the 3 nearest neighbors.
3. 5NN - return the majority class among the 5 nearest neighbors.
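A minimal sketch of the three configurations, using scikit-learn's KNeighborsClassifier as a stand-in for the hand-written exercise-3 implementation (an assumption made purely for illustration):

    from sklearn.neighbors import KNeighborsClassifier

    # One classifier per configuration; Euclidean distance is the default metric.
    knn_predictors = {f"{k}NN": KNeighborsClassifier(n_neighbors=k) for k in (1, 3, 5)}

    # Usage on one balanced fold:
    # knn_predictors["3NN"].fit(X_train, y_train)
    # y_pred = knn_predictors["3NN"].predict(X_test)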
Support vector machine
For the SVM implementation, the libSVM package [Chang and Lin 2011]
is used. Here too, since the data is of high dimension (1207), and the topology is not
known, 2 configurations of the SVM kernel are used (a sketch follows the list):
1. Linear kernel – the product of two vectors is simply their dot
product.
2. Radial kernel – the product of two vectors is defined as e^(-G*N), where N is the
squared 2-norm of the difference between the vectors and G is a normalization factor,
set to one divided by the number of dimensions.
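A brief sketch of the two configurations; scikit-learn's SVC, which wraps libSVM, is used here for illustration only (the original work calls the libSVM package directly), with gamma set explicitly to one divided by the number of dimensions to match the radial kernel described above.

    from sklearn.svm import SVC

    def make_svm_predictors(n_dims):
        # sklearn's SVC wraps libSVM; gamma is set to 1 / #dimensions to match
        # the radial kernel normalization described above.
        return {"SVM-linear": SVC(kernel="linear"),
                "SVM-radial": SVC(kernel="rbf", gamma=1.0 / n_dims)}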
Cross validation
In order to create a set of test samples, a 5-fold cross-validation process is
used. Both the positive and the negative data are divided into 5 equal parts, and over 5 runs
each part serves in turn as the test data.
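A sketch of this split, assuming the positive and negative samples of a phenotype are held in two arrays X_pos and X_neg of equal size (placeholder names); scikit-learn's KFold is used for illustration.

    from sklearn.model_selection import KFold

    def balanced_folds(X_pos, X_neg, k=5):
        # Split positives and negatives separately so every fold stays balanced.
        kf = KFold(n_splits=k, shuffle=True)
        for (tr_p, te_p), (tr_n, te_n) in zip(kf.split(X_pos), kf.split(X_neg)):
            yield (X_pos[tr_p], X_neg[tr_n]), (X_pos[te_p], X_neg[te_n])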
Data selection and multiple runs
Since every class has more negative samples than positive samples,
and in order to maintain balanced classifiers and produce balanced error
measurements, the number of negative samples is reduced to match the number of
positive samples. The negative samples are chosen at random, so
in every run the selected negative training samples are different, and the results
may differ from one run to another. To overcome this issue and make the results more
repeatable and significant, the entire 5-fold prediction process is repeated 50 times.
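The outer loop might look like the sketch below, where X_neg_all holds all negative samples for a phenotype and run_5fold is a hypothetical helper wrapping the 5-fold process above; both names are placeholders.

    import numpy as np

    def repeated_errors(X_pos, X_neg_all, run_5fold, n_runs=50):
        # run_5fold is assumed to return the mean 5-fold error for one balanced set.
        rng = np.random.default_rng()
        errors = []
        for _ in range(n_runs):
            # Draw a fresh random negative subset, the same size as the positive set.
            idx = rng.choice(len(X_neg_all), size=len(X_pos), replace=False)
            errors.append(run_5fold(X_pos, X_neg_all[idx]))
        return np.array(errors)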
Accuracy Measures
As each predictor targets a single class, the error rate of the prediction process is
calculated for each class separately. The measure used for the error rate is the percentage of
test samples that were falsely classified out of the total number of test samples.
Since there are 50 runs of the prediction, each producing a different score, the mean
error is taken as the final value for the prediction accuracy. In addition, the standard
deviation of the 50 values is calculated, and a p-value for the error being lower than
50% is obtained using a one-sided t-test. As there are multiple classes, the Bonferroni
correction is applied, lowering the significance threshold from 0.01 to 0.01/120 ≈
0.000083.
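A sketch of the per-class significance test, using scipy's one-sample t-test as an assumed implementation; run_errors stands for the 50 per-run error rates of one phenotype.

    from scipy import stats

    def is_significant(run_errors, n_classes=120, alpha=0.01):
        # One-sided, one-sample t-test of the 50 per-run errors against 50%.
        t, p_two_sided = stats.ttest_1samp(run_errors, 0.5)
        p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
        return p_one_sided < alpha / n_classes, p_one_sided   # Bonferroni threshold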
Experimental Results
The following figures describe the error rate results of the
prediction process described in the Learning chapter. As stated before, all of the
figures show an average over 50 runs.
Figure 2: mean error rate over all phenotypes.
Figure 3: count of significant predictions out of 120 classes (p-val < 0.000083).
Figure 4: distribution of the error rates for the various phenotype predictors over the 5
classification methods.
Figure 5: error rate distribution of the 5 most significant phenotypes for 1-NN.
Figure 6: error rates of the 5 most significant phenotypes for 3-NN.
Figure 7: error rates of the 5 most significant phenotypes for 5-NN.
Figure 8: error rates of the 5 most significant phenotypes for SVM with linear kernel.
Figure 9: error rates of the 5 most significant phenotypes for SVM with radial kernel.
Discussion
The human cellular metabolic model has many limitations and lacks a great deal of
information in comparison to the grand complexity of the living human cell.
Moreover, the phenotypes predicted in this work are complex phenomena which are
not necessarily related to simple metabolic cell behavior. Yet the results indicate
that the prediction of most of the phenotypes is successful and significant (Figure 3).
Both K-NN and SVM could handle the classification task presented in this
work, but with a high error rate, never lower than 15% and sometimes as high as
45% (Figure 4). Also not to be ignored are the many cases, about 16%, in which all the
classifiers failed to render a significant prediction (Figure 3).
It is observed that out of the five methods implemented in this work, the radial
kernel SVM is the best fit for the presented task, as it shows the lowest mean
error (Figure 2) and the highest number of significant predictions (Figure 3). The
next best classifier for the task is the linear kernel SVM. Although the methods
show different success rates, there are phenotypes that appear to be better predicted
by a less successful method (Figure 4).
Out of the 3 K-NN methods, the first nearest neighbor came out as the most fit for
the current problem (Figure 2, Figure 3). This fact could shed some light on the data,
implying that the samples are not scattered in space in a very clustered way. Another
explanation for the first nearest neighbor's better performance might be the
curse of dimensionality expressed in this work: the samples are of dimension
1207, making the sample space very sparse under the Euclidean metric.
References
1. The Human Phenotype Ontology, http://www.human-phenotype-ontology.org/
2. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R & Palsson
BØ (2007) Global reconstruction of the human metabolic network based on genomic
and bibliomic data. Proceedings of the National Academy of Sciences of the United
States of America 104: 1777–82.
3. Ma H, Sorokin A, Mazein A, Selkov A, Selkov E, Demin O & Goryanin I (2007) The
Edinburgh human metabolic network reconstruction and its functional analysis. Mol
Syst Biol 3.
4. Varma A & Palsson BO (1994) Metabolic flux balancing: basic concepts, scientific and
practical use. Bio/Technology 12: 994–998.
5. Chang CC & Lin CJ (2011) LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2: 27:1–27:27.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm