COSC 526 Class 14
Clustering – Part III: Spectral Clustering

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU) and Jure Leskovec; many slides follow J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Last Class
• DBSCAN:
– A practical clustering algorithm that works even with difficult datasets
– Scales to large datasets
– Works well with noise

Spectral Clustering

Graph Partitioning
• Undirected graph G(V, E)
• Bi-partitioning task:
– Divide the vertices into two disjoint groups A and B
• Questions:
– How can we define a "good" partition of G?
– How can we efficiently identify such a partition?

Graph Partitioning
• What makes a good partition?
– Maximize the number of within-group connections
– Minimize the number of between-group connections

Graph Cuts
• Express partitioning objectives as a function of the "edge cut" of the partition
• Cut: the set of edges with exactly one endpoint in each group:
  cut(A,B) = Σ_{i∈A, j∈B} w_{ij}
• In the running 6-node example, cut(A,B) = 2

Graph Cut Criterion
• Criterion: minimum cut
– Minimize the weight of connections between groups: arg min_{A,B} cut(A,B)
• Degenerate case: the minimum cut can simply slice off a single weakly connected node, which need not be the "optimal" cut
• Problem:
– Only considers external cluster connections
– Does not consider internal cluster connectivity

Graph Cut Criteria [Shi-Malik]
• Criterion: normalized cut [Shi-Malik 1997]
– Connectivity between groups relative to the density of each group:
  ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)
– vol(A): total weight of the edges with at least one endpoint in A: vol(A) = Σ_{i∈A} k_i
• Intuition for this criterion:
– Produces more balanced partitions
• How do we efficiently find a good partition?
– Problem: computing the optimal cut is NP-hard

Spectral Graph Partitioning
• A: adjacency matrix of an undirected graph G
– A_{ij} = 1 if (i, j) is an edge, else 0
• x is a vector in ℝⁿ with components (x₁, …, xₙ)
– Think of it as a label/value on each node of G
• What is the meaning of A·x?
  y = A·x, with entries y_i = Σ_j A_{ij} x_j
• Entry y_i is the sum of the labels x_j of the neighbors of i

What is the meaning of Ax?
• The j-th coordinate of A·x:
– The sum of the x-values of the neighbors of j
– Make this the new value at node j
• Spectral graph theory:
– Analyze the "spectrum" of a matrix representing G
– Spectrum: the eigenvectors x_i of the graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i

Matrix Representations
• Adjacency matrix (A):
– n×n matrix
– A = [a_{ij}], a_{ij} = 1 if there is an edge between nodes i and j
• For the running example graph on nodes 1–6:

      1  2  3  4  5  6
  1   0  1  1  0  1  0
  2   1  0  1  0  0  0
  3   1  1  0  1  0  0
  4   0  0  1  0  1  1
  5   1  0  0  1  0  1
  6   0  0  0  1  1  0

• Important properties:
– Symmetric matrix
– Eigenvectors are real and orthogonal

Matrix Representations
• Degree matrix (D):
– n×n diagonal matrix
– D = [d_{ii}], d_{ii} = degree of node i
• For the running example graph:

      1  2  3  4  5  6
  1   3  0  0  0  0  0
  2   0  2  0  0  0  0
  3   0  0  3  0  0  0
  4   0  0  0  3  0  0
  5   0  0  0  0  3  0
  6   0  0  0  0  0  2

A short sketch constructing A and D for this example graph follows below.
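To make the matrix representations concrete, here is a minimal NumPy sketch (our illustration, not from the original slides) that builds the adjacency and degree matrices for the 6-node example graph; the edge list is read off the adjacency matrix above.

```python
import numpy as np

# Edges of the 6-node example graph (node labels are 1-indexed as on the slides)
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

# Adjacency matrix A: symmetric, a_ij = 1 iff nodes i and j share an edge
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1

# Degree matrix D: diagonal, d_ii = degree of node i = row sum of A
D = np.diag(A.sum(axis=1))

assert (A == A.T).all()                        # A is symmetric
assert list(np.diag(D)) == [3, 2, 3, 3, 3, 2]  # degrees match the slide
```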
Matrix Representations
• Laplacian matrix (L):
– n×n symmetric matrix
– L = D − A
• For the running example graph:

      1  2  3  4  5  6
  1   3 -1 -1  0 -1  0
  2  -1  2 -1  0  0  0
  3  -1 -1  3 -1  0  0
  4   0  0 -1  3 -1 -1
  5  -1  0  0 -1  3 -1
  6   0  0  0 -1 -1  2

• What is the trivial eigenpair? x = (1, …, 1) with λ = 0, since every row of L sums to zero
• Important properties:
– Eigenvalues are non-negative real numbers
– Eigenvectors are real and orthogonal

Spectral Clustering: Graph = Matrix
• W·v₁ = v₂ "propagates weights from neighbors"
• [Figure: data points plotted by their components along eigenvectors e1 and e2; the x, y, and z groups separate cleanly] [Shi & Meila, 2002]

Spectral Clustering: Graph = Matrix
• W·v = λ·v: v is an eigenvector with eigenvalue λ
• If W is connected but roughly block diagonal with k blocks, then:
– the "top" eigenvector is a constant vector
– the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to the blocks
• Spectral clustering:
– Find the top k+1 eigenvectors v₁, …, v_{k+1}
– Discard the "top" one
– Replace every node a with the k-dimensional vector x_a = ⟨v₂(a), …, v_{k+1}(a)⟩
– Cluster with k-means

Spectral Clustering: Graph = Matrix
• W·v = λ·v: v is an eigenvector with eigenvalue λ
• The smallest eigenvectors of D−A are the largest eigenvectors of A
• The smallest eigenvectors of I−W are the largest eigenvectors of W
• Suppose each y(i) = +1 or −1:
– Then y is a cluster indicator that splits the nodes into two groups
– What is yᵀ(D−A)y?

  yᵀ(D−A)y = yᵀDy − yᵀAy
           = Σ_i d_i y_i² − Σ_{i,j} a_{ij} y_i y_j
           = ½ (Σ_{i,j} a_{ij} y_i² + Σ_{i,j} a_{ij} y_j² − 2 Σ_{i,j} a_{ij} y_i y_j)
           = ½ Σ_{i,j} a_{ij} (y_i − y_j)²
           = size of CUT(y)   (up to a constant factor for ±1 labels)

  Similarly, yᵀ(I−W)y = size of NCUT(y)
• NCUT: roughly minimize the ratio of transitions between classes vs. transitions within classes

So far…
• How to define a "good" partition of a graph?
– Minimize a given graph cut criterion
• How to efficiently identify such a partition?
– Approximate using information provided by the eigenvalues and eigenvectors of the graph
• This is spectral clustering

Spectral Clustering Algorithms
• Three basic stages (an end-to-end sketch follows below):
– 1) Pre-processing
  • Construct a matrix representation of the graph
– 2) Decomposition
  • Compute the eigenvalues and eigenvectors of the matrix
  • Map each point to a lower-dimensional representation based on one or more eigenvectors
– 3) Grouping
  • Assign points to two or more clusters based on the new representation
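The following minimal sketch (ours, assuming NumPy) runs all three stages on the 6-node example graph: build the Laplacian, take the second-smallest eigenvector (the Fiedler vector), and split on its sign.

```python
import numpy as np

edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
n = 6

# 1) Pre-processing: matrix representation L = D - A
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0
L = np.diag(A.sum(axis=1)) - A

# 2) Decomposition: eigh returns eigenvalues of the symmetric L in ascending
# order; column 0 is the trivial constant eigenvector (lambda = 0) and
# column 1 is the Fiedler vector x2
eigvals, eigvecs = np.linalg.eigh(L)
x2 = eigvecs[:, 1]

# 3) Grouping: split at 0 on the components of x2 (an eigenvector's overall
# sign is arbitrary, so clusters A and B may come out swapped)
cluster_A = [v + 1 for v in range(n) if x2[v] >= 0]
cluster_B = [v + 1 for v in range(n) if x2[v] < 0]

print(eigvals.round(1))          # [0. 1. 3. 3. 4. 5.] for this graph
print(cluster_A, cluster_B)      # {1, 2, 3} vs. {4, 5, 6}
```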
Spectral Partitioning Algorithm
• 1) Pre-processing:
– Build the Laplacian matrix L of the graph (shown above)
• 2) Decomposition:
– Find the eigenvalues λ and eigenvectors x of the matrix L. For the example graph, Λ = (0.0, 1.0, 3.0, 3.0, 4.0, 5.0), and the eigenvectors (one per column, ordered by eigenvalue) are:

       0.4   0.3  -0.5  -0.2  -0.4  -0.5
       0.4   0.6   0.4  -0.4   0.4   0.0
  X =  0.4   0.3   0.1   0.6  -0.4   0.5
       0.4  -0.3   0.1   0.6   0.4  -0.5
       0.4  -0.3  -0.5  -0.2   0.4   0.5
       0.4  -0.6   0.4  -0.4  -0.4   0.0

– Map the vertices to the corresponding components of the second eigenvector x₂:

  node: 1    2    3    4    5    6
  x₂:   0.3  0.6  0.3 -0.3 -0.3 -0.6

• How do we now find the clusters?

Spectral Partitioning
• 3) Grouping:
– Sort the components of the reduced 1-dimensional vector
– Identify clusters by splitting the sorted vector in two
• How to choose a splitting point?
– Naïve approaches:
  • Split at 0 or at the median value
– More expensive approaches:
  • Attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of the nodes induced by the eigenvector)
• Split at 0:
– Cluster A, positive components: nodes 1 (0.3), 2 (0.6), 3 (0.3)
– Cluster B, negative components: nodes 4 (−0.3), 5 (−0.3), 6 (−0.6)

Example: Spectral Partitioning
• [Figure: value of x₂ vs. rank in x₂ for a larger example graph]

Example: Spectral Partitioning
• [Figure: components of x₂ vs. rank in x₂]

Example: Spectral Partitioning
• [Figures: components of x₁ and components of x₃]

k-Way Spectral Clustering
• How do we partition a graph into k clusters?
• Two basic approaches:
– Recursive bi-partitioning [Hagen et al., '92]
  • Recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner
  • Disadvantages: inefficient, unstable
– Cluster multiple eigenvectors [Shi-Malik, '00]
  • Build a reduced space from multiple eigenvectors
  • Commonly used in recent papers
  • A preferable approach (a sketch follows at the end of this slide group)

Why use multiple eigenvectors?
• Approximates the optimal cut [Shi-Malik, '00]
– Can be used to approximate the optimal k-way normalized cut
• Emphasizes cohesive clusters
– Increases the unevenness in the distribution of the data
– Associations between similar points are amplified; associations between dissimilar points are attenuated
– The data begins to "approximate a clustering"
• Well-separated space
– Transforms the data to a new "embedded space" consisting of k orthogonal basis vectors
• Multiple eigenvectors prevent instability due to information loss

Spectral Clustering: Graph = Matrix
• W·v = λ·v: v is an eigenvector with eigenvalue λ
• The smallest eigenvectors of D−A are the largest eigenvectors of A
• The smallest eigenvectors of I−W are the largest eigenvectors of W
• Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
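Here is a compact sketch of the multiple-eigenvector (k-way) approach (ours, assuming NumPy and scikit-learn; `W` is any symmetric similarity or adjacency matrix you supply). It embeds each node into the k non-trivial bottom eigenvectors of L = D − W, which is the slide recipe "discard the top eigenvector, keep the next k" rephrased for the Laplacian, and then clusters with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def k_way_spectral(W, k):
    """k-way spectral clustering sketch.

    W: (n, n) symmetric similarity/adjacency matrix; k: number of clusters.
    Returns an array of cluster labels, one per node.
    """
    L = np.diag(W.sum(axis=1)) - W
    # eigh sorts eigenvalues ascending; skip column 0 (the trivial constant
    # eigenvector) and keep the next k columns as the node embedding,
    # i.e., node a -> <v2(a), ..., v_{k+1}(a)>
    _, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, 1:k + 1]
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```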
Spectral Clustering: Graph = Matrix
• W·v = λ·v: v is an eigenvector with eigenvalue λ
• The smallest eigenvectors of D−A are the largest eigenvectors of A
• The smallest eigenvectors of I−W are the largest eigenvectors of W
• Suppose each y(i) = +1 or −1, so y is a cluster indicator that cuts the nodes into two groups:
– What is yᵀ(D−A)y? The cost of the graph cut defined by y
– What is yᵀ(I−W)y? Also the cost of a graph cut defined by y
– How do we minimize it?
  • It turns out that to minimize yᵀXy / (yᵀy), we take the smallest eigenvector of X
  • But this will not be ±1-valued, so it is a "relaxed" solution

Spectral Clustering: Graph = Matrix
• [Figure: the sorted eigenvalue spectrum λ₁, λ₂, λ₃, λ₄, λ₅,… with eigenvectors e₁, e₂, e₃; a large "eigengap" separates the few informative eigenvalues from the rest] [Shi & Meila, 2002]

Some more terms
• If A is an adjacency matrix (possibly weighted) and D is the (diagonal) matrix giving the degree of each node, then:
– D − A is the (unnormalized) Laplacian
– W = AD⁻¹ is a probabilistic adjacency matrix
– I − W is the (normalized, or random-walk) Laplacian
– More definitions are possible!
• The largest eigenvectors of W correspond to the smallest eigenvectors of I − W
– So sometimes people talk about the "bottom eigenvectors of the Laplacian"

[Figures: constructing the similarity matrix W from the data — a k-nn graph (easy) vs. a fully connected graph weighted by distance]

Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Works quite well when relations are approximately transitive (like similarity)
• Very noisy datasets cause problems:
– "Informative" eigenvectors need not be among the top few
– Performance can drop suddenly from good to terrible
• Expensive for very large datasets:
– Computing eigenvectors is the bottleneck

Use cases and runtimes
• K-means:
– FAST, "embarrassingly parallel"
– Not very useful on anisotropic data
• Spectral clustering:
– Excellent quality under many different data forms
– Much slower than k-means

Further Reading
• Spectral clustering tutorial: http://www.informatik.uni-hamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf

How to validate clustering approaches?

Cluster validity
• For supervised learning:
– We had a class label,
– which meant we could identify how good our training and testing errors were
– Metrics: accuracy, precision, recall
• For clustering:
– How do we measure the "goodness" of the resulting clusters?

Clustering random data (overfitting)
• If you ask a clustering algorithm to find clusters, it will find some (see the sketch below)

Different aspects of validating clusters
• Determine the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
• External validation: compare the results of a cluster analysis to externally known class labels (ground truth)
• Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information
• Compare clusterings to determine which is better
• Determine the "correct" number of clusters
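To illustrate the "clustering random data" point above, a small sketch (ours, assuming scikit-learn): k-means dutifully reports clusters even when the data is uniform noise, which is exactly why validation is needed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Uniform random points: no real cluster structure at all
rng = np.random.default_rng(0)
random_points = rng.uniform(size=(300, 2))

# k-means still partitions the data into three populated "clusters"
km = KMeans(n_clusters=3, n_init=10).fit(random_points)
print(km.inertia_)              # a finite SSE is reported regardless
print(np.bincount(km.labels_))  # all three clusters get substantial membership
```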
Measures of cluster validity
• External index: used to measure the extent to which cluster labels match externally supplied class labels
– Entropy, purity, Rand index
• Internal index: used to measure the goodness of a clustering structure without respect to external information
– Sum of squared error (SSE), silhouette coefficient
• Relative index: used to compare two different clusterings or clusters
– Often an external or internal index is used for this function, e.g., SSE or entropy

Measuring cluster validity with correlation
• Proximity matrix vs. incidence matrix:
– The incidence matrix K has K_{ij} = 1 if points i and j belong to the same cluster, 0 otherwise
• Compute the correlation between the two matrices:
– Since both matrices are symmetric, only n(n−1)/2 entries need to be computed
– High values indicate similarity between points in the same cluster
• Not suited for density-based clustering

Another approach: use the similarity matrix for cluster validation

Internal measures: SSE
• SSE is also a good measure for understanding how good the clustering is
– Lower SSE → better clustering
• Can be used to estimate the number of clusters
• A short validation sketch appears at the end of this section

More on clustering a little later…
• We will discuss other forms of clustering in the following classes
• Next class:
– Please bring your brief write-up on the two papers
– We will discuss frequent itemset mining and a few other aspects of clustering
– Then move on to dimensionality reduction

Summary
• We saw spectral clustering techniques:
– Only a broad overview
– More details next class
• Speeding up spectral clustering techniques can be challenging

Clustering Big Data
• Taxonomy of clustering approaches:
– Partitioning: k-means
– Hierarchy-based: CURE, ROCK
– Density-based: DBSCAN
– Grid-based: WaveCluster
– Model-based: EM, COBWEB

Categorizing Big Data Clustering Algorithms
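Finally, the promised validation sketch (ours, assuming scikit-learn; the blob data is synthetic). It reports SSE and the silhouette coefficient as internal indices, and the adjusted Rand index (a chance-corrected variant of the Rand index above) as an external index against the known labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with known ground-truth labels y_true
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10).fit(X)

print("SSE:", km.inertia_)                             # internal: lower is better
print("silhouette:", silhouette_score(X, km.labels_))  # internal: in [-1, 1]
print("adjusted Rand:", adjusted_rand_score(y_true, km.labels_))  # external
```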