On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

Hariharan Narayanan
Department of Computer Science, University of Chicago, Chicago IL 60637
[email protected]

Mikhail Belkin
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
[email protected]

Partha Niyogi
Department of Computer Science, University of Chicago, Chicago IL 60637
[email protected]

Abstract

One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that the sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary.

Keywords: Clustering, Semi-Supervised Learning

1 Introduction

Consider the probability distribution with density p(x) depicted in Fig. 1, where darker color denotes higher probability density. Asked to cluster this probability distribution, we would probably separate it into two roughly Gaussian bumps, as shown in the left panel. The same intuition applies to semi-supervised learning: asked to point out likely groups of data of the same type, we would be inclined to believe that these two bumps contain data points with the same labels. On the other hand, the class boundary shown in the right panel seems rather less likely. One way to state this basic intuition is the Low Density Separation assumption [5], which says that the class/cluster boundary tends to pass through regions of low density.

In this paper we propose a formal measure of the complexity of the boundary, which intuitively corresponds to the Low Density Separation assumption. We will show that, given a class boundary, this measure can be computed from a finite sample from the probability distribution. Moreover, we show that this is done by computing the size of a cut for a partition of a certain standard adjacency graph defined on that sample, and point out some interesting connections to spectral clustering.

Figure 1: A likely cut and a less likely cut.

To fix our intuitions, let us consider the question of what makes the cut in the left panel more intuitively acceptable than the cut in the right. Two features of the left cut make it more pleasing: it is shorter in length, and it passes through a low-density area. Note that a very jagged cut through a low-density area or a short cut through the middle of a high-density bump would both be unsatisfactory. It therefore appears reasonable to take the length of the cut as a measure of its complexity, but to weight it by the density of the probability distribution p through which it passes. In other words, we propose the weighted length of the boundary, represented by the contour integral $\int_S p(s)\,ds$ along the boundary cut $S$, as a measure of the complexity of a cut. It is clear that the boundary in the left panel of Fig. 1 has a considerably lower weighted length than the boundary in the right panel.
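To make this comparison concrete, the following numerical sketch (our illustration, not part of the paper) approximates the weighted length $\int_S p(s)\,ds$ for two straight vertical cuts through an equal mixture of two Gaussian bumps; the mixture parameters and the cut locations are arbitrary choices made for the example:

```python
import numpy as np

def mixture_density(x, y):
    """Equal mixture of two isotropic unit-variance Gaussians
    centered at (-2, 0) and (2, 0) -- two 'bumps' as in Fig. 1."""
    g1 = np.exp(-((x + 2.0) ** 2 + y ** 2) / 2.0) / (2.0 * np.pi)
    g2 = np.exp(-((x - 2.0) ** 2 + y ** 2) / 2.0) / (2.0 * np.pi)
    return 0.5 * (g1 + g2)

def weighted_cut_length(x0, ys):
    """Trapezoidal approximation of the contour integral of p
    along the vertical cut S = {(x0, y)}."""
    p = mixture_density(x0, ys)
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(ys)))

ys = np.linspace(-6.0, 6.0, 2001)
print(weighted_cut_length(0.0, ys))  # cut between the bumps: ~0.054
print(weighted_cut_length(2.0, ys))  # cut through a bump:    ~0.200
```

Both cuts are straight lines of the same Euclidean length, so the difference in their scores is entirely due to the density weighting.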
To formalize this notion further, consider a (marginal) probability distribution with density p(x) supported on some domain or manifold M. This domain is partitioned into two disjoint clusters/parts. Assuming that the boundary S between them is a smooth hypersurface, we define the weighted volume of the cut to be $\int_S p(s)\,ds$. Note that, just as in the example above, the integral is taken over the surface of the boundary. We will show how this quantity can be approximated given empirical data, and establish connections with some popular graph-based methods.

2 Connections and related work

2.1 Spectral Clustering

Over the last two decades there has been considerable interest in various spectral clustering techniques (see, e.g., [6] for an overview). The idea of spectral clustering can be expressed very simply. Given a graph, we would often like to construct a balanced partitioning of the vertex set, i.e., a partitioning which minimizes the number (or total weight) of edges across the cut. This is generally an NP-hard optimization problem. It turns out, however, that a simple real-valued relaxation can be used to reduce it to standard linear algebra, typically to finding eigenvectors of a certain graph Laplacian. We note that the quality of the partition is usually measured in terms of the corresponding cut size.

A critical question, when this notion is applied to general-purpose clustering in the context of machine learning, is how to construct the graph given data points. A typical choice here is Gaussian weights (e.g., [11]). To summarize, a graph is obtained from a point cloud, using Gaussian or other weights, and partitioned using spectral clustering or a different algorithm which attempts to approximate the smallest (balanced) cut. This paper provides some further evidence that choosing Gaussian weights has a strong theoretical justification. We note that while the intuition is that spectral clustering approximates the minimum cut, existing results on the convergence of spectral clustering [12] do not provide an interpretation of the limiting partitions or connect them to the size of the cut.
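For readers unfamiliar with the relaxation, here is a minimal sketch (ours, not from the paper) of the pipeline just described: Gaussian weights on a point cloud, the normalized Laplacian, and a bipartition read off from the signs of the eigenvector with the second-smallest eigenvalue.

```python
import numpy as np

def spectral_bipartition(X, t):
    """Sketch of spectral bipartitioning with Gaussian weights.
    X: (N, d) array of data points; t: kernel bandwidth parameter.
    The constant (4*pi*t)^{-d/2} of the heat kernel is omitted,
    since it cancels in the normalization D^{-1/2} W D^{-1/2}."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (4.0 * t))
    np.fill_diagonal(W, 0.0)                       # no self-edges
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)           # ascending eigenvalues
    return eigvecs[:, 1] >= 0.0                    # boolean cluster labels

# Usage: two Gaussian bumps are separated for a reasonable range of t.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
labels = spectral_bipartition(X, t=0.5)
```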
Figure 2: Curves of small and large condition number, respectively.

2.2 Graph-based semi-supervised learning

Similarly to spectral clustering, graph-based semi-supervised learning constructs a graph from the data. In contrast to clustering, however, some of the data is labeled. The problem is typically either to label the unlabeled points (transduction) or, more generally, to build a classifier defined on the whole space. This may be done by finding the minimum cut which respects the labels of the data directly ([3]), or by using the graph Laplacian as a penalty functional (e.g., [13, 1]). One of the important intuitions of semi-supervised learning is the cluster assumption (e.g., [4]) or, more specifically, the low density separation assumption suggested in [5], which states that the class boundary passes through a low-density region. We argue that this intuition needs to be slightly modified: cutting through a high-density region may be acceptable as long as the length of the cut is very short. For example, imagine two high-density round clusters connected by a very thin high-density thread. Cutting the thread is appropriate as long as the width of the thread is much smaller than the radii of the clusters.

2.3 Convergence of Manifold Laplacians

Another closely related line of research concerns the connections between point-cloud graph Laplacians and Laplace-Beltrami operators on manifolds, which have been explored recently in [9, 2, 10]. A typical result in that setting shows that for a fixed function f and points sampled from a probability distribution on a manifold or a domain, the graph Laplacian applied to f converges to the manifold Laplace-Beltrami operator $\Delta_M f$. We note that the results of those papers cannot be directly applied in our situation, as for us f is the indicator function of a subset and is not differentiable. Even more importantly, this paper establishes an explicit connection between the point-cloud Laplacian applied to such characteristic functions (weighted graph cuts) and a geometric quantity, namely the weighted volume of the cut boundary. This geometric connection does not easily follow from the results of those papers, and the techniques used in the proof of our Theorem 3 are significantly different.

3 Summary of the Main Results

Let p be a probability density function on a domain $M \subseteq \mathbb{R}^d$. Let S be a smooth hypersurface that separates M into two parts, $S_1$ and $S_2$. The smoothness of S will be quantified by a condition number $1/\tau$, where $\tau$ is the radius of the largest ball that can be placed tangent to the hypersurface at any point while intersecting it at only that point; $1/\tau$ bounds the curvature of the hypersurface.

Definition 1 Let $K_t(x,y)$ be the heat kernel in $\mathbb{R}^d$, given by
$$K_t(x,y) := \frac{1}{(4\pi t)^{d/2}}\, e^{-\|x-y\|^2/4t},$$
and let
$$M_t := K_t(x,x) = \frac{1}{(4\pi t)^{d/2}}.$$
Let $X := \{x_1, \ldots, x_N\}$ be a set of N points chosen independently at random from p. Consider the complete graph whose vertices are associated with the points in X, and where the weight of the edge between $x_i$ and $x_j$, $i \neq j$, is given by
$$W_{ij} = K_t(x_i, x_j).$$
Let W be the weight matrix. Let $X_1 = X \cap S_1$ and $X_2 = X \cap S_2$ be the data points which land in $S_1$ and $S_2$, respectively. Let D be the diagonal matrix whose entries are the row sums of W (the degrees of the corresponding vertices):
$$D_{ii} = \sum_j W_{ij}.$$
The normalized Laplacian associated to the data X (and parameter t) is the matrix
$$L(t, X) := I - D^{-1/2} W D^{-1/2}.$$
Let $f = (f_1, \ldots, f_N)$ be the indicator vector for $X_1$:
$$f_i = \begin{cases} 1 & \text{if } x_i \in X_1, \\ 0 & \text{otherwise.} \end{cases}$$

There are two quantities of interest:
1. $\int_S p(s)\,ds$, which measures the quality of the partition S in accordance with the weighted volume of the boundary.
2. $f^T L f$, which measures the quality of the empirical partition in terms of its cut size.

Our main theorem shows that, after an appropriate scaling, the empirical cut size converges to the weighted volume of the boundary.

Theorem 1 Let the number of points $|X| = N$ tend to infinity, and let $\{t_N\}_{N=0}^{\infty}$ be a sequence of values of t tending to zero such that $t_N > \frac{1}{N^{1/(2d+2)}}$. Then, with probability 1,
$$\lim_{N \to \infty} \frac{\sqrt{\pi}}{N \sqrt{t_N}}\, f^T L(t_N, X) f = \int_S p(s)\,ds.$$
Further, for any $\delta \in (0,1)$ and any $\epsilon \in (0, 1/2)$, there exist a positive constant C and an integer $N_0$ (depending on $\epsilon$, $\delta$ and certain generic invariants of p and S) such that with probability $1 - \delta$, for all $N > N_0$,
$$\left| \frac{\sqrt{\pi}}{N \sqrt{t_N}}\, f^T L(t_N, X) f - \int_S p(s)\,ds \right| < C\, t_N^{\epsilon}.$$

This theorem is proved by first relating the empirical quantity $\frac{\sqrt{\pi}}{N\sqrt{t_N}} f^T L(t_N, X) f$ to a heat flow across the relevant cut (on the continuous domain), and then relating the heat flow to the measure of the cut. In order to state these results, we need the following notation.

Definition 2 Let
$$\psi_t(x) = \frac{p(x)}{\sqrt{\int_M K_t(x,z)\, p(z)\,dz}},$$
$$\beta(t, X) := \frac{\sqrt{\pi}}{N \sqrt{t}}\, f^T L(t, X) f,$$
and
$$\alpha(t) := \sqrt{\frac{\pi}{t}} \int_{S_1} \int_{S_2} K_t(x,y)\, \psi_t(x)\, \psi_t(y)\,dx\,dy.$$
Where t and X are clear from context, we shall abbreviate $\beta(t, X)$ to $\beta$ and $\alpha(t)$ to $\alpha$.

Figure 3: Heat flow: $\alpha(t)$ tends to $\int_S p(s)\,ds$ as $t \to 0$.
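The following sketch (our illustration; the uniform density, bandwidth, and sample size are arbitrary choices, not the schedule required by Theorem 1) computes the rescaled empirical cut size $\beta(t, X)$ for a sample from the uniform density on the unit square with the cut $S = \{x_1 = 1/2\}$, for which $\int_S p(s)\,ds = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, t, d = 2000, 0.002, 2
X = rng.uniform(0.0, 1.0, size=(N, d))   # uniform density: p = 1 on [0,1]^2
f = (X[:, 0] > 0.5).astype(float)        # indicator of X_1 = X intersected with S_1

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (4.0 * t))        # kernel constant cancels in L
np.fill_diagonal(W, 0.0)                 # i != j only, as in Definition 1
deg = W.sum(axis=1)

# f^T L f with L = I - D^{-1/2} W D^{-1/2}, without forming L explicitly
g = f / np.sqrt(deg)
fLf = f @ f - g @ (W @ g)
beta = np.sqrt(np.pi) / (N * np.sqrt(t)) * fLf

print(beta)  # roughly 1 = \int_S p(s) ds, up to sampling and edge effects
```

Effects of the domain boundary (points near the edges of the square have smaller degrees) and the finite sample make the agreement only approximate; the theorem's guarantee is asymptotic.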
In Theorem 2 we show that, for a fixed t, as the number of points $|X| = N$ tends to infinity, $\beta(t, X)$ tends to $\alpha(t)$ with probability 1. In Theorem 3 we show that $\alpha(t)$ can be made arbitrarily close to the weighted volume of the boundary by letting t tend to 0.

Theorem 2 Let $0 < \mu < 1$ and let $u := 1/\sqrt{t^{2d+1} N^{1-\mu}}$. Then there exist positive constants $C_1, C_2$, depending only on p and S, such that with probability greater than $1 - \exp(-C_1 N^{\mu})$,
$$|\beta(t, X) - \alpha(t)| < C_2 \left(1 + t^{\frac{d+1}{2}}\right) u\, \alpha(t). \tag{1}$$

Theorem 3 For any $\epsilon \in (0, \frac{1}{2})$, there exists a constant C such that for all t with $0 < t < \tau\, (2d)^{-\frac{e}{e-1}}$,
$$\left| \sqrt{\frac{\pi}{t}} \int_{S_1} \int_{S_2} K_t(x,y)\, \psi_t(x)\, \psi_t(y)\,dx\,dy - \int_S p(s)\,ds \right| < C\, t^{\epsilon}. \tag{2}$$

By letting $N \to \infty$ and $t_N \to 0$ at suitable rates and putting together Theorems 2 and 3, we obtain the following theorem.

Theorem 4 Let the number of random data points $N \to \infty$ and $t_N \to 0$ at rates such that $u := 1/\sqrt{t_N^{2d+1} N^{1-\mu}} \to 0$. Then, for any $\epsilon \in (0, 1/2)$, there exist positive constants $C_1, C_2$, depending only on p and S, such that for any $N > 1$, with probability greater than $1 - \exp(-C_1 N^{\mu})$,
$$\left| \beta(t_N, X) - \int_S p(s)\,ds \right| < C_2 \left(t_N^{\epsilon} + u\right). \tag{3}$$

4 Outline of Proofs

Theorem 1 is a corollary of Theorem 4, obtained by setting u to be $t^{\epsilon}$ and setting $\mu$ to $\frac{1-2\epsilon}{2d+2}$. $N_0$ is chosen to be a large enough integer so that an application of the union bound over all $N > N_0$ still gives a probability $\sum_{N > N_0} \exp(-C_1 N^{\mu}) < \delta$ of the rate of convergence being worse than stated in Theorem 1. Theorem 4 is a direct consequence of Theorem 2 and Theorem 3.

Theorem 2: We prove Theorem 2 using a generalization of McDiarmid's inequality from [7, 8]. McDiarmid's inequality asserts that a function of a large number of independent random variables that is not strongly influenced by the value of any single one of them takes a value close to its mean. In the generalization that we use, the function is permitted to be highly influenced by some of the random variables on a bad set of small probability mass. In our setting, it can be shown that our measure of a cut, $f^T L f$, is such a function of the independent random points in X, and so the result is applicable. There is another step involved, since the mean of $f^T L f$ is not $\alpha$, the quantity to which we wish to prove convergence. Therefore we need to prove that the mean $E\left[\frac{\sqrt{\pi}}{N\sqrt{t}} f^T L(t, X) f\right]$ tends to $\alpha(t)$ as N tends to infinity. Now,
$$\frac{\sqrt{\pi}}{N\sqrt{t}}\, f^T L(t, X) f = \frac{1}{N}\sqrt{\frac{\pi}{t}} \sum_{x \in X_1} \sum_{y \in X_2} \frac{K_t(x,y)}{\left\{\left(\sum_{z \neq x} K_t(x,z)\right)\left(\sum_{z \neq y} K_t(y,z)\right)\right\}^{1/2}}.$$
If, instead, we had in the denominator of the right-hand side
$$\sqrt{\int_M p(z) K_t(x,z)\,dz \int_M p(z) K_t(y,z)\,dz},$$
then, using the linearity of expectation,
$$E\left[\frac{1}{N(N-1)}\sqrt{\frac{\pi}{t}} \sum_{x \in X_1} \sum_{y \in X_2} \frac{K_t(x,y)}{\sqrt{\left(\int_M p(z) K_t(x,z)\,dz\right)\left(\int_M p(z) K_t(y,z)\,dz\right)}}\right] = \alpha(t).$$
Using Chernoff bounds, we can show that, with high probability, for all $x \in X$,
$$\frac{\sum_{z \neq x} K_t(x,z)}{N-1} \approx \int_M p(z) K_t(x,z)\,dz.$$
Putting the last two facts together and using the generalization of McDiarmid's inequality from [7, 8], the result follows. Since the exact details require fairly technical calculations, we leave them to the Appendix.

Theorem 3: The quantity
$$\alpha := \sqrt{\frac{\pi}{t}} \int_{S_1} \int_{S_2} K_t(x,y)\, \psi_t(x)\, \psi_t(y)\,dx\,dy$$
is similar to the heat that would flow, in time t, from one part to the other if the first part were heated proportionally to p. Intuitively, the heat that would flow from one part to the other in a small time interval ought to be related to the volume of the boundary between these two parts, which in our setting is $\int_S p(s)\,ds$.
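To see where the $\sqrt{\pi/t}$ normalization comes from, here is a worked special case (our addition, not from the paper): dimension $d = 1$, a constant density p on the real line, and the cut $S = \{0\}$, so that $\int_S p(s)\,ds = p$. Since $\int K_t(x,z)\, p\,dz = p$, we have $\psi_t \equiv \sqrt{p}$, and:

```latex
\begin{align*}
\alpha(t)
  &= \sqrt{\tfrac{\pi}{t}} \int_{-\infty}^{0}\!\int_{0}^{\infty}
     \frac{e^{-(x-y)^2/4t}}{\sqrt{4\pi t}}\; p \;dy\,dx
     && \text{set } u = y - x > 0 \\
  &= \sqrt{\tfrac{\pi}{t}}\; p \int_{0}^{\infty}
     \frac{u\, e^{-u^2/4t}}{\sqrt{4\pi t}}\,du
     && \text{(pairs at gap $u$ have measure $u$)} \\
  &= \sqrt{\tfrac{\pi}{t}}\; p \cdot \frac{2t}{\sqrt{4\pi t}}
   \;=\; p \;=\; \int_S p(s)\,ds .
\end{align*}
```

In this flat, constant-density case the heat flow recovers the weighted boundary volume exactly for every $t > 0$; Theorem 3 says that curvature (through $\tau$) and a varying density introduce only an error of order $t^{\epsilon}$.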
To prove this relationship, we bound $\alpha$ both above and below in terms of the weighted volume and the condition number of the boundary. These bounds are obtained by making comparisons with the "worst case" for a given condition number $1/\tau$, which is when S is a sphere of radius $\tau$. In order to obtain a lower bound on $\alpha$, we observe that if $B_2$ is the nearest ball of radius $\tau$ contained in $S_1$ to a point P in $S_2$ that is within $\tau$ of $S_1$, then, since $B_2 \subseteq S_1$,
$$\int_{S_1} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx \;\geq\; \int_{B_2} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx,$$
as in Figure 4. Similarly, to obtain an upper bound on $\alpha$, we observe that if $B_1$ is a ball of radius $\tau$ in $S_2$, tangent to $B_2$ at the point of S nearest to P, then, since $S_1 \subseteq B_1^c$,
$$\int_{S_1} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx \;\leq\; \int_{B_1^c} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx.$$
Figure 4: The density of heat diffusing to point P from $S_1$ (left panel) is less than or equal to the density of heat diffusing to P from $B_2$ (right panel).
Figure 5: The density of heat received by point P from $B_2$ (left panel) can be approximated by the density of heat received by P from the halfspace $H_2$ (right panel).

We now indicate how a lower bound is obtained for
$$\int_{B_2} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx.$$
A key observation is that for $R = \sqrt{2dt \ln(1/t)}$,
$$\int_{\|x - P\| > R} K_t(x, P)\,dx \ll 1.$$
For this reason, only the portions of $B_2$ near P contribute to the integral $\int_{B_2} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx$. It turns out that a good lower bound can be obtained by considering the integral over $H_2$ instead, where $H_2$ is a halfspace whose boundary is at a distance $\tau - \frac{R^2}{2\tau}$ from the center of $B_2$, as in Figure 5. An upper bound for
$$\int_{B_1^c} K_t(x, P)\, \psi_t(x)\, \psi_t(P)\,dx$$
is obtained along similar lines. The details of this proof are in the Appendix.
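As a quick check of the tail estimate above (our computation, not the paper's Appendix argument): $K_t(\cdot, P)$ is the density of a $N(P, 2t I_d)$ Gaussian, so with Z a standard Gaussian vector in $\mathbb{R}^d$,

```latex
\begin{align*}
\int_{\|x-P\|>R} K_t(x,P)\,dx
  &= \Pr\!\left[\|Z\|^2 > \tfrac{R^2}{2t} = d\ln(1/t)\right] \\
  &\le \inf_{0<s<1/2}\,(1-2s)^{-d/2}\, e^{-s\, d \ln(1/t)}
     && \text{(Chernoff bound for } \chi^2_d\text{)} \\
  &\le 2^{d/2}\, t^{d/4}
     && \text{(taking } s = \tfrac{1}{4}\text{)},
\end{align*}
```

which tends to 0 as $t \to 0$, so the heat kernel mass farther than R from P is indeed negligible.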
5 Conclusion

In this paper we take a step towards a probabilistic analysis of graph-based methods for clustering. The nodes of the graph are identified with data points drawn at random from an underlying probability distribution on a continuous domain. For a fixed partition, we show that the cut size of the graph partition converges to the weighted volume of the boundary separating the two regions of the domain, and we analyze the rates of this convergence. If one were able to generalize our result uniformly over all partitions, this would allow us to relate ideas around graph-based partitioning to ideas surrounding Low Density Separation. The most important future direction is to achieve similar results uniformly over balanced partitions.

References

[1] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56 (Special Issue on Clustering): 209–239, 2004.
[2] M. Belkin and P. Niyogi. Toward a theoretical foundation for Laplacian-based manifold methods. COLT 2005.
[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. ICML 2001.
[4] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. NIPS 2002.
[5] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. AISTATS 2005.
[6] C. Ding. Spectral clustering. ICML 2004 tutorial.
[7] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. UAI 2002, 275–282.
[8] S. Kutin. Extensions to McDiarmid's inequality when differences are bounded with high probability. Technical Report TR-2002-04, Department of Computer Science, University of Chicago, 2002.
[9] S. Lafon. Diffusion Maps and Geodesic Harmonics. Ph.D. thesis, Yale University, 2004.
[10] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds – weak and strong pointwise consistency of graph Laplacians. COLT 2005.
[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[12] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report TR-134, Max Planck Institute for Biological Cybernetics, 2004.
[13] X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.