Computational Biology (2015-2016) Exercises

Problem 1

Consider the following expression matrix, where the expression levels of 2 genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7). Consider also the problem of grouping the tissues, given the expression profiles of the genes, using clustering algorithms.

        C1    C2    C3    C4    C5    C6    C7
G1      0.0   3.0   4.0   3.0   4.0   1.0   1.0
G2      2.0   1.0   2.0   1.0   3.0   1.0   4.0

a. Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (intercluster distance). Justify the decisions taken at each step of the HCA. How would you use the dendrogram to group the tissues into 2 groups (clusters), and which clusters would result?

b. Determine the groups found by the K-means (K=2) algorithm when the centroids are initialized with C5 = (4,3) and C6 = (1,1). For each iteration of the algorithm, present the centroids and the conditions in each group (cluster).

Problem 2

Consider the following set of examples (individuals) describing stroke susceptibility (Yes, No) based on three attributes: Blood Pressure (Low, Normal, High), Obesity (Yes, No) and Sex (Male, Female). Compute a classifier based on decision trees using the ID3 algorithm. Justify all the options taken by the algorithm while computing the decision tree.

        Blood Pressure   Obesity   Sex      Stroke?
I1      High             Yes       Male     Yes
I2      Low              Yes       Male     No
I3      High             No        Female   Yes
I4      Normal           Yes       Male     Yes
I5      High             Yes       Female   Yes
I6      Low              No        Female   No
I7      Low              No        Male     No
I8      Normal           Yes       Female   Yes
I9      Normal           No        Male     No
I10     Normal           No        Female   No

Problem 3

Consider a medical diagnosis task. We know that 0.8% of the entire population has cancer. There exists a (binary) laboratory test that is an imperfect indicator of this disease.
That test returns a correct positive result in 98% of the cases in which the disease is present, and a correct negative result in 97% of the cases where the disease is not present.

(a) Suppose we observe a patient for whom the laboratory test returns a positive result. Calculate the a posteriori probability that this patient truly suffers from cancer.

(b) Knowing that the lab test is an imperfect one, a second test (which is assumed to be independent of the former one) is conducted. Calculate the a posteriori probabilities for cancer and ¬cancer given that the second test has returned a positive result as well.

Problem 4

Consider the problem where the task is to describe whether a person is ill. We use a representation based on three features per subject to describe an individual person. These features are “running nose”, “coughing”, and “reddened skin”, each of which can take the value true (‘+’) or false (‘–’), see Table 1.

(a) Given the data set in Table 1, determine all probabilities required to apply the naive Bayes classifier for predicting whether a new person is ill or not.

(b) Apply the naive Bayes classifier to the test patterns corresponding to the following subjects: a person who is coughing and has fever, a person whose nose is running and who suffers from fever, and a person with a running nose and reddened skin (d7 = (¬N, C, ¬R, F), d8 = (N, ¬C, ¬R, F), and d9 = (N, ¬C, R, ¬F)).
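
The Bayesian update asked for in Problem 3 can be sketched in a few lines of Python. This is a minimal illustration rather than an official solution: the function name posterior and the variable names are hypothetical, and part (b) is handled by assuming the two tests are conditionally independent given the true disease status, so that the posterior after the first test serves as the prior for the second.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' rule.

    prior       -- P(disease) before seeing the test result
    sensitivity -- P(positive | disease)
    specificity -- P(negative | no disease)
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Values taken from the problem statement.
prior = 0.008        # 0.8% of the population has cancer
sensitivity = 0.98   # correct positive rate
specificity = 0.97   # correct negative rate

# (a) Posterior after one positive test.
after_first = posterior(prior, sensitivity, specificity)

# (b) Assumption: the second test is conditionally independent of the first
# given the disease status, so the first posterior becomes the new prior.
after_second = posterior(after_first, sensitivity, specificity)

print(f"P(cancer | +)     = {after_first:.4f}")
print(f"P(cancer | +, +)  = {after_second:.4f}")
print(f"P(¬cancer | +, +) = {1.0 - after_second:.4f}")
```

Under these assumptions, a single positive result still leaves cancer the less likely hypothesis because the prior is so low, while a second positive result shifts the posterior strongly toward cancer.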