
Computational Biology
Clustering
Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar
Lecture Slides Week 9
Clustering
• A clustering is a set of clusters
• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such
that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
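To make the distinction concrete, here is a minimal sketch running a partitional and a hierarchical clusterer side by side. It assumes scikit-learn and NumPy are available; the blob data, k = 3, and the random seeds are illustrative choices, not part of the slides.

```python
# Minimal sketch: partitional vs. hierarchical clustering (illustrative data).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Three made-up blobs in the plane.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

# Partitional: every point lands in exactly one of k flat clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: nested clusters; cutting the tree at k = 3 gives flat labels.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(kmeans_labels[:10], agglo_labels[:10])
```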
Partitional Clustering
[Figure: original points and a partitional clustering of the same points]
Hierarchical Clustering
[Figure: traditional hierarchical clustering of points p1–p4 and the corresponding traditional dendrogram]
[Figure: non-traditional hierarchical clustering of points p1–p4 and the corresponding non-traditional dendrogram]
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same points interpreted as two, four, or six clusters]
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated to)
the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
10 Clusters Example
[Figure: K-means result (Iteration 4) on the ten-cluster data]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: K-means Iterations 1–4 on the ten-cluster data]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: K-means result (Iteration 4) on the ten-cluster data]
Starting with some pairs of clusters having three initial centroids, while others have only one.
10 Clusters Example
[Figure: K-means Iterations 1–4 on the ten-cluster data]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means result (3 clusters)]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means result (3 clusters)]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means result (2 clusters)]
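The non-globular case is easy to reproduce. Below is a hedged sketch (assuming scikit-learn; the half-moon dataset and all parameters are illustrative choices) showing K-means cutting across the two moons instead of following their shape.

```python
# Sketch: K-means assumes roughly spherical clusters, so it splits
# two interleaved half-moons incorrectly.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points where K-means agrees with the moon structure
# (up to label permutation); typically well below 1.0.
agree = max(np.mean(pred == true_labels), np.mean(pred != true_labels))
print(f"agreement with true moons: {agree:.2f}")
```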
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or
splits
[Figure: six nested clusters and the corresponding dendrogram with merge heights]
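One way to reproduce such a dendrogram is SciPy's hierarchy module; the six 2-D points below are made up for illustration, and single linkage is just one possible merge criterion.

```python
# Sketch: build a hierarchy and draw the dendrogram that records its merges.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

Z = linkage(points, method='single')  # sequence of merges with their distances
dendrogram(Z, labels=[f"p{i+1}" for i in range(len(points))])
plt.ylabel("merge distance")
plt.show()
```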
Starting Situation
• Start with clusters of individual points and a proximity
matrix
[Figure: points p1–p12, each starting as its own cluster, with a proximity matrix holding one row and one column per point]
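A minimal sketch of building that starting proximity matrix, assuming SciPy and using Euclidean distance as the proximity measure (any metric could be substituted); the five points are random stand-ins.

```python
# Sketch: one row/column per point, zeros on the diagonal.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.random.default_rng(1).random((5, 2))  # p1..p5, illustrative
prox = squareform(pdist(points, metric='euclidean'))
print(np.round(prox, 2))  # 5x5 symmetric proximity matrix
```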
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: five clusters C1–C5 and the corresponding C1…C5 proximity matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the proximity matrix.
[Figure: clusters C1–C5, with C2 and C5 marked as the closest pair in the proximity matrix]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: proximity matrix after the merge; the entries relating C2 U C5 to C1, C3, and C4 are marked "?"]
How to Define Inter-Cluster Similarity
[Figure: two clusters and the p1…p5 proximity matrix, with the inter-cluster similarity to be defined]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
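For orientation, SciPy's linkage methods cover these definitions: MIN is 'single', MAX is 'complete', Group Average is 'average', distance between centroids is 'centroid', and Ward's squared-error method is 'ward'. A hedged sketch on made-up points:

```python
# Sketch: the same points clustered under each inter-cluster definition.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).random((10, 2))  # illustrative points

for method in ('single', 'complete', 'average', 'centroid', 'ward'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut tree into 3 clusters
    print(method, labels)
```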
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar
(closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph.
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

[Figure: proximity graph over points 1–5]
Hierarchical Clustering: MIN
[Figure: nested single-link clusters of six points and the corresponding dendrogram]
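The similarity table above can be fed straight into single-link clustering. The sketch below assumes SciPy and converts similarity to distance via d = 1 - s, which is one common convention rather than the only one.

```python
# Sketch: single-link (MIN) clustering of the I1..I5 similarity table.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

D = 1.0 - S               # distance = 1 - similarity (one convention)
Z = linkage(squareform(D), method='single')  # condensed form expected here
print(Z)                  # first merge: I1 and I2 (similarity 0.90)
```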
Measuring Cluster Validity Via Correlation
• Two matrices
– Proximity matrix
– "Incidence" matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the same cluster
are close to each other.
• Not a good measure for some density or contiguity based clusters.
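A minimal sketch of the procedure, assuming scikit-learn and SciPy and using random illustrative data. Note that with a distance matrix playing the role of proximity, a good clustering gives a strongly negative correlation (same-cluster pairs have small distances), which matches the negative Corr values on the next slide.

```python
# Sketch: correlate the n(n-1)/2 upper-triangle entries of the
# proximity and incidence matrices.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

X = np.random.default_rng(3).random((30, 2))       # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

prox = squareform(pdist(X))                        # distances as proximity
incidence = (labels[:, None] == labels[None, :]).astype(float)

iu = np.triu_indices(len(X), k=1)                  # symmetric: upper triangle
corr = np.corrcoef(prox[iu], incidence[iu])[0, 1]
print(f"corr = {corr:.4f}")                        # more negative = better here
```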
Measuring Cluster Validity Via Correlation
• Correlation of incidence and proximity matrices for the K-means
clusterings of the following two data sets.
[Figure: two scatter plots of clustered points; Corr = -0.9235 (left), Corr = -0.5810 (right)]
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and
inspect visually.
[Figure: scatter plot of three clusters and the corresponding similarity matrix sorted by cluster label, displayed as a heat map]
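A hedged sketch of this visual check (NumPy, Matplotlib, and scikit-learn assumed; the data, k, and the distance-to-similarity rescaling are illustrative). Sharp blocks along the diagonal suggest well-separated clusters.

```python
# Sketch: reorder the similarity matrix by cluster label and inspect it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

X = np.random.default_rng(4).random((100, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

dist = squareform(pdist(X))
sim = 1.0 - dist / dist.max()      # one simple distance-to-similarity rescaling
order = np.argsort(labels)         # group rows/columns by cluster label
plt.imshow(sim[order][:, order], cmap='viridis')
plt.colorbar(label='similarity')
plt.show()
```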
End Theory I
• 5 min mindmapping
• 10 min break
Practice I
Clustering Dataset
• We will use the same datasets as last week
• Have fun clustering with Orange
– Try K-means clustering, hierarchical clustering, MDS
– Analyze differences in results
K?
• What is the k in k-means again?
• How many clusters are in my dataset?
• Solution: iterate over a reasonable number of ks
• Do that and try to find out how many clusters there are in
your data
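A minimal sketch of that iteration using within-cluster SSE (the elbow heuristic), assuming scikit-learn; substitute your own dataset for the random stand-in.

```python
# Sketch: run K-means for a range of ks and watch where the SSE flattens.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).random((200, 2))  # stand-in for last week's data

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))  # look for the "elbow" in these values
```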
End Practice I
• 15 min break
Theory II
Microarrays
• Gene Expression:
– We see differences between cells because of differential gene
expression,
– A gene is expressed by transcribing DNA into single-stranded
mRNA,
– mRNA is later translated into a protein,
– Microarrays measure the level of mRNA expression
Microarrays
• Gene Expression:
– mRNA expression represents dynamic aspects of the cell,
– mRNA is isolated and labeled using a fluorescent material,
– mRNA is hybridized to the target; level of hybridization
corresponds to light emission which is measured with a laser
Microarrays
[Figure: microarray images]
Processing Microarray Data
• Differentiating gene expression:
– R = G → not differentiated
– R > G → up-regulated
– R < G → down-regulated
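As a small illustration of these rules, the sketch below calls expression status from the two channel intensities using the log2 ratio; the 2-fold cutoff and the intensity values are illustrative conventions, not part of the slides.

```python
# Sketch: classify genes from red (experiment) and green (reference) channels.
import numpy as np

R = np.array([1200.0, 400.0, 800.0])   # red channel, made-up intensities
G = np.array([300.0, 1600.0, 790.0])   # green channel, made-up intensities

log_ratio = np.log2(R / G)             # 0 means R = G
status = np.select([log_ratio > 1, log_ratio < -1],
                   ["up-regulated", "down-regulated"],
                   default="not differentiated")
print(list(zip(np.round(log_ratio, 2), status)))
```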
Processing Microarray Data
• Problems:
– Extract data from microarrays,
– Analyze the meaning of the multiple arrays.
Processing Microarray Data
• Microarray data:
[Figure: example microarray data]
Processing Microarray Data
• Clustering:
– Find classes in the data,
– Identify new classes,
– Identify gene correlations,
– Methods:
• K-means clustering,
• Hierarchical clustering,
• Self-Organizing Maps (SOM)
Processing Microarray Data
• Distance Measures:
– Euclidean Distance: d(x, y) = sqrt( Σ_i (x_i - y_i)² )
– Manhattan Distance: d(x, y) = Σ_i |x_i - y_i|
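A two-line illustration of both measures on made-up expression profiles (each entry standing for one condition/array):

```python
# Sketch: Euclidean and Manhattan distance between two expression profiles.
import numpy as np

x = np.array([2.1, -0.5, 1.3, 0.0])
y = np.array([1.9, 0.4, 1.0, -0.2])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt of summed squared differences
manhattan = np.sum(np.abs(x - y))           # summed absolute differences
print(euclidean, manhattan)
```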
Processing Microarray Data
• K-means Clustering:
– Break the data into K clusters,
– Start with random partitioning,
– Improve it by iterating.
Processing Microarray Data
• Agglomerative Hierarchical Clustering:
[Figure: agglomerative hierarchical clustering example]
Processing Microarray Data
• Self-Organizing Feature Maps:
– by Teuvo Kohonen,
– a data visualization technique that helps us understand high-dimensional
data by reducing its dimensions to a map.
Processing Microarray Data
• Self-Organizing Feature Maps:
– humans simply cannot visualize high-dimensional data as is,
– SOMs help us understand this high-dimensional data.
Processing Microarray Data
• Self-Organizing Feature Maps:
– Based on competitive learning,
– SOMs help us by producing a map of usually 1 or 2 dimensions,
– SOMs plot the similarities of the data by grouping similar data
items together.
Processing Microarray Data
• Self-Organizing Feature Maps:
Processing Microarray Data
• Self-Organizing Feature Maps:
– Input vector and synaptic weight vectors:
x = [x_1, x_2, …, x_m]^T
w_j = [w_j1, w_j2, …, w_jm]^T, j = 1, 2, …, l
– Best matching (winning) neuron:
i(x) = arg min_j ||x - w_j||, j = 1, 2, …, l
– The weights are then updated toward x; in the standard Kohonen rule,
w_j ← w_j + η(t) h_{j,i(x)}(t) (x - w_j).
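A minimal NumPy sketch of the competitive-learning step just described. It updates only the winning neuron (a full SOM would also pull its grid neighbors via the neighborhood function h); the grid size, learning rate, and random inputs are illustrative simplifications.

```python
# Sketch: find the best-matching neuron i(x) and move its weights toward x.
import numpy as np

rng = np.random.default_rng(6)
m, l = 4, 9                      # input dimension m, number of neurons l
W = rng.random((l, m))           # synaptic weight vectors w_j
eta = 0.1                        # learning rate (illustrative)

for _ in range(100):             # present random illustrative inputs
    x = rng.random(m)
    i = np.argmin(np.linalg.norm(x - W, axis=1))  # winning neuron i(x)
    W[i] += eta * (x - W[i])                      # update winner's weights
```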
Figure 2. Output map containing the distributions of genes from the alpha30 database.
Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of
Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233
Figure 5. Color-coded output maps representing the final weight of neurons from the samples of
the alpha30 database.
Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of
Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233
End Theory II
• 5 min mindmapping
• 10 min break
Practice II
Microarray
• http://www.biolab.si/supp/bivisprog/dicty/dictyExample.htm
• Use the data as described on the page