Multivariate Statistics
Thomas Asendorf, Steffen Unkel
Study sheet 4
Summer term 2017
Exercise 1:
Consider the following distance matrix that has been used in the lecture to illustrate single
linkage in the context of agglomerative hierarchical clustering:


D =
      0.0
      2.0   0.0
      6.0   5.0   0.0
     10.0   9.0   4.0   0.0
      9.0   8.0   5.0   3.0   0.0 .
(a) Perform complete linkage and average linkage, draw the corresponding dendrograms,
and compare the results to each other and to the ones obtained in the lecture (do not
use a computer!).
(b) For both complete linkage and average linkage, compute the cophenetic correlation
between the corresponding cophenetic matrix and D.
(c) Compute the pairwise cophenetic correlations between the single linkage, complete
linkage and average linkage solutions.
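Once the pencil-and-paper work for (a) is done, parts (b) and (c) can be checked with SciPy's hierarchical-clustering utilities. This is a minimal sketch, assuming SciPy is available; it is only a verification aid, not a substitute for the hand computations the exercise asks for:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Full symmetric form of the distance matrix D from Exercise 1.
D = np.array([
    [ 0.0,  2.0,  6.0, 10.0,  9.0],
    [ 2.0,  0.0,  5.0,  9.0,  8.0],
    [ 6.0,  5.0,  0.0,  4.0,  5.0],
    [10.0,  9.0,  4.0,  0.0,  3.0],
    [ 9.0,  8.0,  5.0,  3.0,  0.0],
])
d = squareform(D)  # condensed form expected by linkage()

# (b) cophenetic correlation between each cophenetic matrix and D
results = {}
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    c, coph_d = cophenet(Z, d)  # correlation and cophenetic distances
    results[method] = (c, squareform(coph_d))
    print(method, round(c, 4))

# (c) pairwise cophenetic correlations between the three solutions
for a, b in (("single", "complete"), ("single", "average"),
             ("complete", "average")):
    r = np.corrcoef(squareform(results[a][1]),
                    squareform(results[b][1]))[0, 1]
    print(a, b, round(r, 4))
```

`cophenet(Z, d)` returns both the cophenetic correlation with the original distances and the condensed cophenetic distance vector, so the same call covers (b), while (c) only needs ordinary Pearson correlations between the condensed cophenetic vectors.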
Exercise 2:
Suppose we would like to compare the similarity of two text documents by comparing the
words that appear in each document. To do so, we can use the generalized Jaccard
similarity. The Jaccard similarity of sets T and V is defined as
s_TV = |T ∩ V| / |T ∪ V| ,
where |A| denotes the cardinality of the set A, that is, the number of its elements.
(a) Calculate the Jaccard similarity of T = {draw, complete, cophenetic, documents, each}
and V = {compare, cophenetic, linkage, draw, words}.
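The computation in (a) is short enough to verify with a few lines of Python; the sets are exactly those given in the exercise:

```python
def jaccard(T, V):
    # s_TV = |T ∩ V| / |T ∪ V|
    return len(T & V) / len(T | V)

T = {"draw", "complete", "cophenetic", "documents", "each"}
V = {"compare", "cophenetic", "linkage", "draw", "words"}

# The intersection is {draw, cophenetic}; the union contains 8 words.
print(jaccard(T, V))  # → 0.25
```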
(b) Suppose we have a dictionary U , a set of n words. Two subsets T and V are chosen at
random from U with m ≤ n elements each. What is the expected value of the Jaccard
similarity of T and V ?
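Part (b) asks for an exact expected value; a Monte Carlo simulation is a convenient way to check a closed-form answer. This is a sketch with illustrative choices of the dictionary size n and subset size m (both my own, not from the exercise):

```python
import random

def jaccard(T, V):
    # s_TV = |T ∩ V| / |T ∪ V|
    return len(T & V) / len(T | V)

def expected_jaccard_mc(n, m, reps=50_000, seed=1):
    # Estimate E[s_TV] when T and V are drawn independently and
    # uniformly as m-element subsets of an n-word dictionary U.
    rng = random.Random(seed)
    U = range(n)
    return sum(jaccard(set(rng.sample(U, m)), set(rng.sample(U, m)))
               for _ in range(reps)) / reps

print(expected_jaccard_mc(n=10, m=3))
```

A sanity check on the simulation: for m = n both subsets equal U, so every draw has Jaccard similarity 1 and the estimate must be exactly 1.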
Exercise 3:
Let X = (xij ) ∈ Rn×p denote our data matrix. We would like to use K-means clustering
to find K clusters, C1 , . . . , CK , which minimize the within-group dispersion. Therefore, the
optimization problem is given as follows:
    min_{C1,...,CK}  Σ_{k=1}^{K}  Σ_{i∈Ck}  Σ_{j=1}^{p}  (xij − x̄kj)² ,

where x̄kj = (1/|Ck|) Σ_{i∈Ck} xij. Consider the following algorithm to solve K-means clustering:
1. Randomly assign a number, from 1 to K, to each of the observations.
2. Iterate until the cluster assignments stop changing:
(i) For each of the K clusters, compute the centroid.
(ii) Assign each observation to the cluster whose centroid is closest (where closest is
defined using Euclidean distances).
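The two steps above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation; in particular, the handling of empty clusters (keeping a random observation as the centroid) is an assumption the sheet does not discuss. Tracking the objective in `history` makes the claim of the exercise visible: the within-group dispersion never increases from one iteration to the next.

```python
import numpy as np

def wcss(X, labels, centroids):
    # Objective of Exercise 3: Σ_k Σ_{i∈Ck} Σ_j (xij − x̄kj)²
    return sum(((X[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: randomly assign a cluster number (here 0..K-1) to each observation.
    labels = rng.integers(K, size=n)
    history = []
    for _ in range(max_iter):
        # Step 2(i): compute the centroid of each cluster. An empty
        # cluster (a case the sheet's algorithm leaves open) keeps a
        # randomly chosen observation as its centroid.
        centroids = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k) else X[rng.integers(n)]
                              for k in range(K)])
        history.append(wcss(X, labels, centroids))
        # Step 2(ii): assign each observation to the closest centroid
        # (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, history

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
labels, history = kmeans(X, K=3)
print(history)  # non-increasing: each iteration can only lower the objective
```

Both sub-steps can only lower the objective: recentering minimizes the within-cluster sum of squares for fixed assignments, and reassignment minimizes it for fixed centroids, which is exactly the argument the exercise asks for.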
Show that the algorithm converges to a local optimum, i.e., that the objective function of
the optimization problem decreases (or at least does not increase) in every iteration.
Exercise 4:
Consider the data set NCI60 from the R package ISLR. The data set contains gene expression
data for cell lines of different cancer types. We would like to find out whether different
cancer types arise from similar cell lines and to cluster the cell lines accordingly. In all of the
following cluster analyses, we use the Euclidean distance and weight the gene expressions equally.
(a) Perform hierarchical clustering on the data set using complete, average and single
linkage. Plot and compare the three dendrograms. Which method yields the most
appealing clusters?
(b) How would individuals be categorized into 4 clusters using complete linkage?
(c) Compare your result from (b) to K-means clustering with K = 4. Do both methods
lead to the same results?
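Since NCI60 ships with the R package ISLR, the natural workflow is in R (dist, hclust, cutree, kmeans). As a hedged, language-agnostic sketch of the same pipeline, the following Python code runs on a stand-in matrix of the NCI60 shape (64 cell lines; the gene count is shrunk to 50 random columns purely for illustration, so the cluster output itself is meaningless):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Stand-in for the NCI60 expression matrix (in R: NCI60$data from ISLR).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 50))

# (a) Hierarchical clustering with the three linkage methods; the
# dendrograms themselves would be drawn with
# scipy.cluster.hierarchy.dendrogram.
trees = {m: linkage(X, method=m, metric="euclidean")
         for m in ("complete", "average", "single")}

# (b) Cut the complete-linkage tree into 4 clusters.
hc4 = fcluster(trees["complete"], t=4, criterion="maxclust")

# (c) K-means with K = 4, cross-tabulated against the cut from (b).
np.random.seed(0)
_, km4 = kmeans2(X, 4, minit="points")
table = np.zeros((4, 4), dtype=int)
for h, k in zip(hc4, km4):
    table[h - 1, k] += 1
print(table)  # rows: hierarchical clusters, columns: K-means clusters
```

The cross-tabulation is the usual way to answer (c): the two solutions agree exactly only if every row and every column of the table has a single nonzero entry.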
Date: 19 May 2017