Computational Biology (2015-2016)
Exercises
Problem 1
Consider the following expression matrix where the expression levels of 2 genes (G1 and G2)
were analyzed in 7 healthy/infected tissues (conditions C1 to C7). Consider also the problem of
grouping tissues given the expression profiles of the genes using clustering algorithms.
      C1    C2    C3    C4    C5    C6    C7
G1    0.0   3.0   4.0   3.0   4.0   1.0   1.0
G2    2.0   1.0   2.0   1.0   3.0   1.0   4.0
a. Determine the dendrogram produced by a hierarchical clustering algorithm (HCA) using
a bottom-up (agglomerative) approach, the Euclidean distance to compute the distance
between conditions, and the single-link distance to compute the distance between groups
(intercluster distance). Justify the decision taken at each step of the HCA.
How would you use the dendrogram to split the tissues into 2 groups (clusters), and
which clusters would result?
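
The procedure described in part (a) can be sketched in a few lines of Python, which may help when checking a hand-drawn dendrogram. This is only a minimal sketch, not a reference solution: the names `points`, `single_link` and `agglomerate` are illustrative, and ties between equally close pairs are broken by taking the first pair encountered, which is an assumption the exercise does not specify.

```python
import math
from itertools import combinations

# Conditions as (G1, G2) points, copied from the expression matrix.
points = {
    "C1": (0.0, 2.0), "C2": (3.0, 1.0), "C3": (4.0, 2.0), "C4": (3.0, 1.0),
    "C5": (4.0, 3.0), "C6": (1.0, 1.0), "C7": (1.0, 4.0),
}

def single_link(a, b):
    """Smallest Euclidean distance between a member of a and a member of b."""
    return min(math.dist(points[p], points[q]) for p in a for q in b)

def agglomerate(names):
    """Bottom-up clustering; returns merges as (distance, cluster_a, cluster_b)."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters; ties go to the first pair encountered (assumption).
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        merges.append((single_link(a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

merges = agglomerate(sorted(points))
for dist, a, b in merges:
    print(f"{dist:.3f}  {sorted(a)} + {sorted(b)}")

# Cutting the dendrogram just below its last (highest) merge leaves 2 clusters.
two_clusters = [sorted(merges[-1][1]), sorted(merges[-1][2])]
print(two_clusters)
```

Each printed line corresponds to one merge of the dendrogram, with the merge height in the first column; the justification of each step still has to be written out by hand.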
b. Determine the groups found by the K-means algorithm (K=2) when the centroids are
initialized with C5 = (4,3) and C6 = (1,1).
For each iteration of the algorithm, present the centroids and the conditions in each
group (cluster).
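
The iterations asked for in part (b) can be traced with a short script such as the one below. It is a sketch under stated assumptions: the function name `kmeans`, the `history` structure, and the stopping rule (stop when the assignments no longer change) are illustrative choices, and an empty cluster simply keeps its previous centroid.

```python
import math

# Conditions as (G1, G2) points, copied from the expression matrix.
points = {
    "C1": (0.0, 2.0), "C2": (3.0, 1.0), "C3": (4.0, 2.0), "C4": (3.0, 1.0),
    "C5": (4.0, 3.0), "C6": (1.0, 1.0), "C7": (1.0, 4.0),
}

def kmeans(points, centroids, max_iter=100):
    """Return one (centroids, groups) pair per iteration until assignments repeat."""
    history = []
    previous = None
    for _ in range(max_iter):
        # Assignment step: each condition joins its nearest centroid.
        groups = [[] for _ in centroids]
        for name, p in points.items():
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            groups[nearest].append(name)
        history.append((list(centroids), groups))
        if groups == previous:
            break  # assignments unchanged, so the centroids would not move
        previous = groups
        # Update step: each centroid moves to the mean of its group.
        centroids = [
            tuple(sum(points[n][d] for n in g) / len(g) for d in range(2)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return history

history = kmeans(points, [points["C5"], points["C6"]])
for i, (centroids, groups) in enumerate(history, start=1):
    print(f"iteration {i}: centroids={centroids} groups={groups}")
```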
Problem 2
Consider the following set of examples (individuals) describing stroke susceptibility (Yes, No)
based on three attributes: Blood Pressure (Low, Normal, High), Obesity (Yes, No) and Sex (Male,
Female).
Build a decision-tree classifier using the ID3 algorithm.
Justify every choice the algorithm makes while building the decision tree.
       Blood Pressure   Obesity   Sex      Stroke?
I1     High             Yes       Male     Yes
I2     Low              Yes       Male     No
I3     High             No        Female   Yes
I4     Normal           Yes       Male     Yes
I5     High             Yes       Female   Yes
I6     Low              No        Female   No
I7     Low              No        Male     No
I8     Normal           Yes       Female   Yes
I9     Normal           No        Male     No
I10    Normal           No        Female   No
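
The quantities ID3 relies on (entropy and information gain) and the recursive splitting can be sketched as follows. This is a minimal illustration, not a reference solution: `entropy`, `information_gain` and `id3` are illustrative names, the tree is returned as nested dictionaries, and the majority-label fallback when no attributes remain is a common convention assumed here.

```python
import math
from collections import Counter

ATTRIBUTES = ["Blood Pressure", "Obesity", "Sex"]
# Rows I1..I10 from the table: (Blood Pressure, Obesity, Sex, Stroke?).
ROWS = [
    ("High", "Yes", "Male", "Yes"), ("Low", "Yes", "Male", "No"),
    ("High", "No", "Female", "Yes"), ("Normal", "Yes", "Male", "Yes"),
    ("High", "Yes", "Female", "Yes"), ("Low", "No", "Female", "No"),
    ("Low", "No", "Male", "No"), ("Normal", "Yes", "Female", "Yes"),
    ("Normal", "No", "Male", "No"), ("Normal", "No", "Female", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the labels minus the weighted entropy after splitting on col."""
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

def id3(rows, cols):
    """Return a label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]
    if not cols:
        return Counter(labels).most_common(1)[0][0]  # majority label (assumption)
    best = max(cols, key=lambda c: information_gain(rows, c))
    rest = [c for c in cols if c != best]
    return {ATTRIBUTES[best]: {
        v: id3([r for r in rows if r[best] == v], rest)
        for v in sorted(set(r[best] for r in rows))
    }}

gains = {ATTRIBUTES[c]: information_gain(ROWS, c) for c in range(3)}
tree = id3(ROWS, [0, 1, 2])
print(gains)
print(tree)
```

Printing `gains` at the root shows the numbers that justify the first split; the same function can be called on each subset to justify the later ones.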
Problem 3
Consider a medical diagnosis task. We know that 0.8% of the entire population has
cancer.
There exists a (binary) laboratory test that represents an imperfect indicator of this disease. That
test returns a correct positive result in 98% of the cases in which the disease is present, and a
correct negative result in 97% of the cases where the disease is not present.
(a) Suppose we observe a patient for whom the laboratory test returns a positive result. Calculate
the a posteriori probability that this patient truly suffers from cancer.
(b) Knowing that the lab test is an imperfect one, a second test (which is assumed to be
independent of the former one) is conducted. Calculate the a posteriori probabilities for cancer
and ¬cancer given that the second test has returned a positive result as well.
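
Both parts are applications of Bayes' rule, and the arithmetic can be checked with a sketch like the one below. The function name `posterior` and its parameter names are illustrative. For part (b) the sketch assumes the two tests are conditionally independent given the disease status, so the posterior after the first test can be reused as the prior for the second.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) by Bayes' rule."""
    true_pos = sensitivity * prior                   # P(+ | disease) * P(disease)
    false_pos = (1.0 - specificity) * (1.0 - prior)  # P(+ | no disease) * P(no disease)
    return true_pos / (true_pos + false_pos)

prior = 0.008        # 0.8% of the population has cancer
sensitivity = 0.98   # correct positive rate when the disease is present
specificity = 0.97   # correct negative rate when the disease is absent

# (a) one positive test
after_first = posterior(prior, sensitivity, specificity)
# (b) second positive test, reusing the first posterior as the new prior
after_second = posterior(after_first, sensitivity, specificity)

print(f"P(cancer | +)     = {after_first:.4f}")
print(f"P(cancer | +, +)  = {after_second:.4f}")
print(f"P(¬cancer | +, +) = {1.0 - after_second:.4f}")
```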
Problem 4
Consider the task of deciding whether a person is ill. Each individual is described by
four features: “running nose” (N), “coughing” (C), “reddened skin” (R), and “fever” (F),
each of which can take the value true (‘+’) or false (‘–’); see Table 1.
(a) Given the data set in Table 1, determine all probabilities required to apply the naive Bayes
classifier for predicting whether a new person is ill or not.
(b) Apply the naive Bayes classifier to the test patterns corresponding to the following subjects: a
person who is coughing and has fever, a person whose nose is running and who suffers from
fever, and a person with a running nose and reddened skin:
d7 = (¬N, C, ¬R, F), d8 = (N, ¬C, ¬R, F), and d9 = (N, ¬C, R, ¬F).
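
The mechanics of the naive Bayes classifier can be sketched as below. Important: the rows in `TRAINING` are HYPOTHETICAL placeholders, not the contents of Table 1, and must be replaced with the actual table before the output means anything for this problem. The names `fit`, `scores` and `predict` are illustrative, and probabilities are estimated by plain relative frequency without smoothing, which is an assumption.

```python
from collections import Counter

FEATURES = ["N", "C", "R", "F"]  # running nose, coughing, reddened skin, fever

# HYPOTHETICAL placeholder rows, NOT Table 1: ((N, C, R, F), ill?)
TRAINING = [
    ((True, True, False, True), True),
    ((True, False, False, True), True),
    ((False, True, True, False), True),
    ((False, False, False, False), False),
    ((True, False, False, False), False),
    ((False, False, True, False), False),
]

def fit(rows):
    """Estimate P(class) and P(feature is true | class) by relative frequency."""
    class_counts = Counter(label for _, label in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    likelihood = {
        c: [sum(x[i] for x, label in rows if label == c) / class_counts[c]
            for i in range(len(FEATURES))]
        for c in class_counts
    }
    return priors, likelihood

def scores(x, priors, likelihood):
    """Unnormalised posterior P(class) * product of P(x_i | class) per class."""
    result = {}
    for c, prior in priors.items():
        score = prior
        for value, p_true in zip(x, likelihood[c]):
            score *= p_true if value else 1.0 - p_true
        result[c] = score
    return result

def predict(x, priors, likelihood):
    """Class with the largest unnormalised posterior."""
    s = scores(x, priors, likelihood)
    return max(s, key=s.get)

priors, likelihood = fit(TRAINING)
d7 = (False, True, False, True)  # (¬N, C, ¬R, F)
print(scores(d7, priors, likelihood), predict(d7, priors, likelihood))
```

With relative-frequency estimates, a feature value never seen for a class drives that class's score to zero; whether to smooth the estimates is a design choice the exercise leaves open.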