Clustering Spatial Data Using Random Walks Author : David Harel Yehuda Koren Graduate : Chien-Ming Hsiao • • • • • • Outline Motivation Objective Introduction Basic Notions Modeling The Data Clustering Using Random Walks – Separators and separating operators – Clustering by separation – Clustering spatial points • Integration with Agglomerative Clustering • Examples • Conclusion • Opinion Motivation • The characteristics of spatial data pose several difficulties for clustering algorithms • The clusters may have arbitrary shapes and nonuniform sizes – Different cluster may have different densities • The existence of noise may interfere the clustering process Objective • Present a new approach to clustering spatial data • Seeking efficient clustering algorithms. • Overcoming noise and outliers Introduction • The heart of the method is in what we shall be calling separating operators. • Their effect is to sharpen the distinction between the weights of inter-cluster edges and intra-cluster edges – By decreasing the former and increasing the latter • It can be used on their own or can be embedded in a classical agglomerative clustering framework. BASIC NOTIONS • graph-theoretic notions Let G V , E , w be a weighted graph w : weighing function (A higher value means more similar) Let S V . V k S : the set of nodes that are connected to some node of S by a path with at most k edges degG : the degree of G i, j : the edge between i and j BASIC NOTIONS • The probability of a transition from node i to node j wi, j p ij di di i ,k i, k • The probability that a random walk originating at s will reach t before returning to s Pescape s, t s ,i psi i s 0, t 1 and i pij j i , j for i s, i t MODELINE THE DATA • Delaunay triangulation (DT) – Many O(n log n) time and O(n) space algorithms exist for computing the DT of a planar point set. • K-mutual neighborhood – The k-nearest neighbors of each point can be O(n log n) time O(n) space for any fixed arbitrary dimension. • The weight of the edge (a,b) is d a, b 2 exp ave2 – d(a,b) is the Euclidean distance between a and b. – ave is the average Euclidean distance between two adjacent points. CLUSTERING USING RANDOM WALKS • To identifying natural clusters in a graph is to find ways to compute an intimacy relation between the nodes incident to each of the graph’s edges. • Identifying separators is to use an iterative process of separation. – This is a kind of sharpening pass NS : Separation by neighborhood similarity Definition : Let GV,E,w be a weighted graph and k be some small constant. The separation of G by neighborho od similarity , denoted by NS(G), is defined to be : dfn NS G Gs V , E , k k v , Pvisit u where v, u E , s u , v sim k Pvisit sim k x , y is some similarity of the vectors x and y Can be computed in time E n and space 1 CE : Separation by circular escape Definition : Let GV , E , be a weighted graph, and let k be some small consant. the separation of G by circular escape, denoted by CE, is defined todfnbe : CE G Gs V , E , ws where v, u E ws u , v CE k v, u dfn k k v, u Pescape u, v CE v, u Pescape k Can be computed in time E n and space 1 Clustering spatial points Integration with Agglomerative Clustering • The separation operators can be used as a preprocessing before activating agglomerative clustering on the graph • Can effectively prevent bad local merging opposing the graph structure. • It is equivalent to a “single link” algorithm preceded by a separation operation Examples Conclusion • It is robust in the presence of noise and outliers, and is flexible in handling data of different densities. • The CE operator yields better results than the NS operator • The time complexity of our algorithm applied to n data points is O(n log n) Opinion • Since the algorithm does not rely on spatial knowledge, we can to try it on other types of data. END
© Copyright 2026 Paperzz