
COSC 526 Class 14
Clustering – Part III: Spectral Clustering
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Jure Leskovec (http://www.mmds.org)
Last Class
• DBSCAN:
– Practical clustering algorithm that works even with
difficult datasets
– Scales to large datasets
– Works well with noise
Spectral Clustering
Graph Partitioning
• Undirected graph G(V, E)
[Figure: example undirected graph on six nodes]
• Bi-partitioning task:
– Divide vertices into two disjoint groups
[Figure: the example graph divided into two disjoint groups A and B]
• Questions:
– How can we define a “good” partition of G?
– How can we efficiently identify such a partition?
Graph Partitioning
• What makes a good partition?
– Maximize the number of within-group
connections
– Minimize the number of between-group connections
[Figure: a good partition of the example graph, with many edges inside groups A and B and few edges between them]
Graph Cuts
• Express partitioning objectives as a function of the “edge cut” of the partition
• Cut: set of edges with exactly one endpoint in each group:
  cut(A, B) = Σ_{i∈A, j∈B} w_ij
[Figure: example partition with cut(A, B) = 2]
Graph Cut Criterion
• Criterion: Minimum-cut
– Minimize weight of connections between groups
arg min_{A,B} cut(A, B)
• Degenerate case:
[Figure: a degenerate minimum cut vs. the desired “optimal” cut]
• Problem:
– Only considers external cluster connections
– Does not consider internal cluster connectivity
Graph Cut Criteria
• Criterion: Normalized cut [Shi-Malik 1997]
– Connectivity between groups relative to the density of
each group
– vol(A): Total weight of the edges with at least one end
point in A:
  vol(A) = Σ_{i∈A} k_i, where k_i is the (weighted) degree of node i
  ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B)
• Intuition for this criterion:
– Produces more balanced partitions
• How do we efficiently find a good partition?
– Problem: computing optimal cut is NP hard
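To make the two criteria concrete, here is a minimal sketch (not from the slides, NumPy assumed) that computes cut(A, B) and the normalized cut for the six-node example graph used throughout this lecture, with A = {1, 2, 3} and B = {4, 5, 6}.

import numpy as np

# Adjacency matrix of the example graph (nodes 1..6 -> indices 0..5).
A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

group_A = [0, 1, 2]          # nodes {1, 2, 3}
group_B = [3, 4, 5]          # nodes {4, 5, 6}

# cut(A, B): total weight of edges with one endpoint in each group.
cut = A[np.ix_(group_A, group_B)].sum()

# vol(S): total degree of the nodes in S.
deg = A.sum(axis=1)
vol_A, vol_B = deg[group_A].sum(), deg[group_B].sum()

ncut = cut / vol_A + cut / vol_B
print(cut, ncut)             # cut = 2, ncut = 2/8 + 2/8 = 0.5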
Spectral Graph Partitioning
• A: adjacency matrix of an undirected graph G
– A_ij = 1 if (i, j) is an edge, else 0
• x is a vector in ℝⁿ with components (x_1, …, x_n)
– Think of it as a label/value assigned to each node of G
• What is the meaning of y = A·x?
– Entry y_i is the sum of the labels x_j of the neighbors of i
What is the meaning of Ax?
• The j-th coordinate of A·x:
– Sum of the x-values of the neighbors of j
– Make this the new value at node j
• Spectral Graph Theory:
– Analyze the “spectrum” of the matrix representing G
– Spectrum: the eigenvectors of the graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i
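A tiny sketch of this "propagation" view, reusing the example graph's adjacency matrix (the label vector x below is made up purely for illustration):

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # arbitrary label per node

y = A @ x
# Node 1 (index 0) is adjacent to nodes 2, 3 and 5, so y[0] = 2 + 3 + 5 = 10.
print(y)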
Matrix Representations
• Adjacency matrix (A):
– n × n matrix
– A = [a_ij], a_ij = 1 if there is an edge between nodes i and j
[Figure: the example graph on nodes 1–6]
      1  2  3  4  5  6
  1   0  1  1  0  1  0
  2   1  0  1  0  0  0
  3   1  1  0  1  0  0
  4   0  0  1  0  1  1
  5   1  0  0  1  0  1
  6   0  0  0  1  1  0
• Important properties:
– Symmetric matrix
– Eigenvectors are real and orthogonal
Matrix Representations
• Degree matrix (D):
– n × n diagonal matrix
– D=[dii], dii = degree of node i
[Figure: the example graph on nodes 1–6]
      1  2  3  4  5  6
  1   3  0  0  0  0  0
  2   0  2  0  0  0  0
  3   0  0  3  0  0  0
  4   0  0  0  3  0  0
  5   0  0  0  0  3  0
  6   0  0  0  0  0  2
Matrix Representations
• Laplacian matrix (L):
– n × n symmetric matrix
– L = D − A
[Figure: the example graph on nodes 1–6]
      1  2  3  4  5  6
  1   3 -1 -1  0 -1  0
  2  -1  2 -1  0  0  0
  3  -1 -1  3 -1  0  0
  4   0  0 -1  3 -1 -1
  5  -1  0  0 -1  3 -1
  6   0  0  0 -1 -1  2
• What is the trivial eigenpair? (x = (1, …, 1) with λ = 0, since every row of L sums to 0)
• Important properties:
– Eigenvalues are non-negative real numbers
– Eigenvectors are real and orthogonal
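A quick numerical check of these properties for the example graph (a sketch, NumPy assumed); it also confirms the trivial eigenpair noted above.

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

eigvals, eigvecs = np.linalg.eigh(L)   # eigh: symmetric matrix, ascending eigenvalues
print(eigvals)          # all >= 0 (up to floating point); the smallest is 0
print(eigvecs[:, 0])    # eigenvector for lambda = 0 is a multiple of the all-ones vector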
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
[Figure: data points labeled x, y and z plotted in the space of the eigenvectors e1 and e2; points with the same label fall close together]
[Shi & Meila, 2002]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
If W is connected but roughly
block diagonal with k blocks then
• the top eigenvector is a
constant vector
• the next k eigenvectors are
roughly piecewise constant
with “pieces” corresponding to
blocks
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
Spectral clustering:
• Find the top k+1 eigenvectors v1, …, v(k+1)
• Discard the “top” one
• Replace every node a with the k-dimensional vector x_a = <v2(a), …, v(k+1)(a)>
• Cluster with k-means
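A compact sketch of this recipe follows. One assumption to flag: the slide leaves the weight matrix W unspecified, so the sketch uses the symmetrically normalized adjacency matrix D^(-1/2) A D^(-1/2), a common concrete choice whose top eigenvector plays the role of the roughly constant vector that gets discarded; scikit-learn's KMeans handles the final step. On the six-node example graph, spectral_clustering(A, 2) should recover the {1, 2, 3} / {4, 5, 6} split.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster the nodes of a graph with (weighted) adjacency matrix A into k groups."""
    deg = A.sum(axis=1)                       # assumes every node has at least one edge
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    W = D_inv_sqrt @ A @ D_inv_sqrt           # normalized, symmetric weight matrix

    # eigh returns eigenvalues in ascending order; the "top" eigenvectors are the last columns.
    eigvals, eigvecs = np.linalg.eigh(W)
    top_k_plus_1 = eigvecs[:, -(k + 1):]      # top k+1 eigenvectors (ascending order)
    X = top_k_plus_1[:, :-1]                  # discard the very top (roughly constant) one

    # Each node a is now the k-dimensional point x_a = <v2(a), ..., v(k+1)(a)>.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)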
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W
Suppose each y(i)=+1 or -1:
• Then y is a cluster indicator that splits the nodes into two
• what is yT(D-A)y ?
yᵀ(D − A)y = yᵀD y − yᵀA y = Σ_i d_i y_i² − Σ_{i,j} a_ij y_i y_j
           = ½ ( 2 Σ_i d_i y_i² − 2 Σ_{i,j} a_ij y_i y_j )
           = ½ ( Σ_i (Σ_j a_ij) y_i² + Σ_j (Σ_i a_ij) y_j² − 2 Σ_{i,j} a_ij y_i y_j )
           = ½ ( Σ_{i,j} a_ij y_i² + Σ_{i,j} a_ij y_j² − 2 Σ_{i,j} a_ij y_i y_j )
           = ½ Σ_{i,j} a_ij (y_i − y_j)²
           = size of CUT(y), up to a constant factor: each cut edge contributes (±2)² = 4, once in each direction

yᵀ(I − W)y ≈ size of NCUT(y)
NCUT: roughly minimize ratio of transitions between
classes vs transitions within classes
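The identity above is easy to verify numerically. The sketch below does so for the example graph and the indicator y = (+1, +1, +1, −1, −1, −1), showing that yᵀ(D − A)y equals ½ Σ a_ij (y_i − y_j)², which is 4·cut(y) here.

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)   # split {1,2,3} vs {4,5,6}

lhs = y @ (D - A) @ y
rhs = 0.5 * sum(A[i, j] * (y[i] - y[j]) ** 2 for i in range(6) for j in range(6))
print(lhs, rhs)    # both 8.0 = 4 * cut(y), since cut(y) = 2 for this split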
So far…
• How to define a “good” partition of a graph?
– Minimize a given graph cut criterion
• How to efficiently identify such a partition?
– Approximate using information provided by the eigenvalues and eigenvectors of a graph
• Spectral Clustering
Spectral Clustering Algorithms
• Three basic stages:
– 1) Pre-processing
• Construct a matrix representation of the graph
– 2) Decomposition
• Compute eigenvalues and eigenvectors of the matrix
• Map each point to a lower-dimensional representation based
on one or more eigenvectors
– 3) Grouping
• Assign points to two or more clusters, based on the new
representation
Spectral Partitioning Algorithm
• 1) Pre-processing:
– Build the Laplacian matrix L of the graph:
      1  2  3  4  5  6
  1   3 -1 -1  0 -1  0
  2  -1  2 -1  0  0  0
  3  -1 -1  3 -1  0  0
  4   0  0 -1  3 -1 -1
  5  -1  0  0 -1  3 -1
  6   0  0  0 -1 -1  2
• 2) Decomposition:
– Find the eigenvalues λ and eigenvectors x of L:
  Λ = diag(0.0, 1.0, 3.0, 3.0, 4.0, 5.0)
  X =  0.4  0.3 -0.5 -0.2 -0.4 -0.5
       0.4  0.6  0.4 -0.4  0.4  0.0
       0.4  0.3  0.1  0.6 -0.4  0.5
       0.4 -0.3  0.1  0.6  0.4 -0.5
       0.4 -0.3 -0.5 -0.2  0.4  0.5
       0.4 -0.6  0.4 -0.4 -0.4  0.0
  (the columns of X are the eigenvectors, ordered by eigenvalue)
– Map each vertex to the corresponding component of x2, the eigenvector of the second-smallest eigenvalue λ2:
  1:  0.3   2:  0.6   3:  0.3   4: -0.3   5: -0.3   6: -0.6
How do we now find the clusters?
Spectral Partitioning
• 3) Grouping:
– Sort components of reduced 1-dimensional vector
– Identify clusters by splitting the sorted vector in two
• How to choose a splitting point?
– Naïve approaches:
• Split at 0 or median value
– More expensive approaches:
• Attempt to minimize normalized cut in 1-dimension
(sweep over ordering of nodes induced by the eigenvector)
Split at 0:
Cluster A: Positive points
Cluster B: Negative points
Components of x2:      1: 0.3   2: 0.6   3: 0.3   4: -0.3   5: -0.3   6: -0.6
Cluster A (positive):  1: 0.3   2: 0.6   3: 0.3
Cluster B (negative):  4: -0.3  5: -0.3  6: -0.6
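Putting stages 1–3 together for the example graph, a minimal sketch of the whole bi-partitioning procedure (build L, take the eigenvector of the second-smallest eigenvalue, split at 0) looks like this:

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues
x2 = eigvecs[:, 1]                        # second-smallest eigenvector (Fiedler vector)

cluster_A = np.where(x2 > 0)[0] + 1       # node ids with positive component
cluster_B = np.where(x2 <= 0)[0] + 1
print(cluster_A, cluster_B)               # one side is {1, 2, 3}, the other {4, 5, 6}
                                          # (which side is which depends on the sign of x2)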
Example: Spectral Partitioning
[Figure: value of x2 vs. rank in x2]
Example: Spectral Partitioning
[Figure: components of x2; value of x2 vs. rank in x2]
Example: Spectral Partitioning
[Figure: components of x1 and components of x3]
k-Way Spectral Clustering
• How do we partition a graph into k clusters?
• Two basic approaches:
– Recursive bi-partitioning [Hagen et al., ’92]
• Recursively apply bi-partitioning algorithm in a hierarchical divisive
manner
• Disadvantages: Inefficient, unstable
– Cluster multiple eigenvectors [Shi-Malik, ’00]
• Build a reduced space from multiple eigenvectors
• Commonly used in recent papers
• A preferable approach…
Why use multiple eigenvectors?
• Approximates the optimal cut [Shi-Malik, ’00]
– Can be used to approximate optimal k-way normalized cut
• Emphasizes cohesive clusters
– Increases the unevenness in the distribution of the data
– Associations between similar points are amplified,
associations between dissimilar points are attenuated
– The data begins to “approximate a clustering”
• Well-separated space
– Transforms data to a new “embedded space”,
consisting of k orthogonal basis vectors
• Multiple eigenvectors prevent instability due to
information loss
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W
Suppose each y(i)=+1 or -1:
• Then y is a cluster indicator that cuts the nodes into two
• what is yT(D-A)y ? The cost of the graph cut defined by y
• what is yT(I-W)y ? Also a cost of a graph cut defined by y
• How to minimize it?
• Turns out: to minimize yT X y / (yTy) find smallest eigenvector of X
• But: this will not be +1/-1, so it’s a “relaxed” solution
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
[Figure: eigenvalues of W in decreasing order (λ1, λ2, λ3, λ4, then λ5, λ6, λ7, …), with a large “eigengap” separating the first few eigenvalues and their eigenvectors e1, e2, e3 from the rest]
[Shi & Meila, 2002]
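A small sketch of the eigengap idea, stated here in terms of the Laplacian's smallest eigenvalues (the mirror image of the figure's largest eigenvalues of W): the number of eigenvalues before the first large gap suggests the number of clusters k.

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eigvals = np.linalg.eigvalsh(L)          # ascending: roughly 0, 1, 3, 3, 4, 5
gaps = np.diff(eigvals)
k = int(np.argmax(gaps)) + 1             # number of eigenvalues before the largest gap
print(eigvals, k)                        # the largest gap follows the second eigenvalue, so k = 2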
Some more terms
• If A is an adjacency matrix (maybe weighted) and D
is a (diagonal) matrix giving the degree of each node
– Then D-A is the (unnormalized) Laplacian
– W = AD⁻¹ is a probabilistic adjacency matrix
– I-W is the (normalized or random-walk) Laplacian
– More definitions are possible!
• The largest eigenvectors of W correspond to the
smallest eigenvectors of I-W
– So sometimes people talk about “bottom eigenvectors of
the Laplacian”
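A small sketch of these definitions, reusing the example graph (NumPy assumed): the unnormalized Laplacian D − A, the probabilistic adjacency matrix W = AD⁻¹, and the random-walk Laplacian I − W.

import numpy as np

A = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

L_unnorm = D - A
W = A @ np.linalg.inv(D)                 # = A D^{-1}; each column sums to 1
L_rw = np.eye(len(A)) - W                # random-walk (normalized) Laplacian

# If W v = mu v, then (I - W) v = (1 - mu) v, so the largest eigenvectors of W
# are exactly the smallest ("bottom") eigenvectors of I - W.
print(np.allclose(W.sum(axis=0), 1.0))   # True: W is column-stochastic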
[Figure: adjacency matrix A and weight matrix W for a k-nn graph (the easy case) and for a fully connected graph weighted by distance]
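When starting from data points rather than a ready-made graph, the two constructions above are easy to build with scikit-learn; a hedged sketch (the dataset is synthetic and the parameter values are arbitrary):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 points in 2-D

# (1) k-nearest-neighbour graph: sparse 0/1 adjacency with k = 10.
A_knn = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
A_knn = 0.5 * (A_knn + A_knn.T)          # k-nn is not symmetric, so symmetrize it

# (2) Fully connected graph weighted by distance: Gaussian (RBF) similarity
#     w_ij = exp(-gamma * ||x_i - x_j||^2).
A_full = rbf_kernel(X, gamma=1.0)

print(A_knn.shape, A_full.shape)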
Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Works quite well when relations are approximately
transitive (like similarity)
• Very noisy datasets cause problems
– “Informative” eigenvectors need not be in top few
– Performance can drop suddenly from good to terrible
• Expensive for very large datasets
– Computing eigenvectors is the bottleneck
Use cases and runtimes
• K-Means
– FAST
– “Embarrassingly parallel”
– Not very useful on anisotropic
data
• Spectral clustering
– Excellent quality under many
different data forms
– Much slower than K-Means
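The trade-off is easy to see on a standard anisotropic dataset; a sketch using scikit-learn (parameter choices are illustrative, not tuned): k-means tends to cut the two half-moons in half, while spectral clustering typically recovers the two shapes.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               n_neighbors=10, random_state=0).fit_predict(X)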
Further Reading
• Spectral Clustering Tutorial:
http://www.informatik.uni-hamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf
How to validate clustering approaches?
Cluster validity
• For supervised learning:
– we had a class label,
– so we could measure how good our training and testing errors were
– Metrics: accuracy, precision, recall
• For clustering:
– How do we measure the “goodness” of the resulting
clusters?
Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some
Different aspects of validating clusters
• Determine the clustering tendency of a set of data, i.e.,
whether non-random structure actually exists in the data
(e.g., to avoid overfitting)
• External Validation: Compare the results of a cluster
analysis to externally known class labels (ground truth).
• Internal Validation: Evaluating how well the results of a
cluster analysis fit the data without reference to external
information.
• Compare clusterings to determine which is better.
• Determining the ‘correct’ number of clusters.
Measures of cluster validity
• External Index: Used to measure the extent to
which cluster labels match externally supplied
class labels.
– Entropy, Purity, Rand Index
• Internal Index: Used to measure the goodness
of a clustering structure without respect to
external information.
– Sum of Squared Error (SSE), Silhouette coefficient
• Relative Index: Used to compare two different
clusterings or clusters.
– Often an external or internal index is used for this
function, e.g., SSE or entropy
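As a concrete sketch, scikit-learn ships implementations of several such indices; the example below (synthetic data, arbitrary parameters) computes one external index (adjusted Rand, which needs ground-truth labels) and one internal index (silhouette, which uses only the data and the clustering):

from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(adjusted_rand_score(y_true, y_pred))   # external index: compares to class labels
print(silhouette_score(X, y_pred))           # internal index: no external information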
Measuring Cluster Validation with Correlation
• Proximity matrix vs. incidence matrix:
– Incidence matrix K = [K_ij], with K_ij = 1 if points i and j belong to the same cluster, 0 otherwise
• Compute the correlation between the two
matrices:
– Only n(n-1)/2 values to be computed
– High values indicate similarity between points in the
same cluster
• Not suited for density-based clustering
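A sketch of this correlation check (synthetic data; the RBF similarity is one possible choice of proximity matrix): build the incidence matrix, then correlate its n(n-1)/2 unique off-diagonal entries with the corresponding similarity values.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

S = rbf_kernel(X)                                         # proximity (similarity) matrix
K = (labels[:, None] == labels[None, :]).astype(float)    # incidence matrix

iu = np.triu_indices(len(X), k=1)                         # only the n(n-1)/2 unique pairs
corr = np.corrcoef(S[iu], K[iu])[0, 1]
print(corr)        # high correlation: points in the same cluster are the similar ones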
Another approach: use the similarity matrix for cluster validation
• Sort the rows and columns of the similarity matrix by cluster label; a good clustering shows up as well-separated blocks along the diagonal
Internal Measures: SSE
• SSE is also a good measure of how good the clustering is
– Lower SSE → better clustering
• Can be used to estimate the number of clusters
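A sketch of the usual "elbow" procedure based on SSE (scikit-learn reports the SSE of a k-means solution as inertia_; the dataset is synthetic):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, sse)    # SSE always decreases with k; the "elbow" suggests the cluster count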
More on Clustering a little later…
• We will discuss other forms of clustering in the
following classes
• Next class:
– please bring your brief write-up on the two papers
– We will discuss frequent itemset mining and a few
other aspects of clustering
– Move on to Dimensionality Reduction
Summary
• We saw spectral clustering techniques:
– only a broad overview
– more details next class
• Speeding up Spectral clustering techniques can
be challenging
Clustering Big Data
Categorizing Big Data clustering algorithms:
• Partitioning-based: k-means
• Hierarchy-based: CURE, ROCK
• Density-based: DBSCAN
• Grid-based: WaveCluster
• Model-based: EM, COBWEB