
Clustering Spatial Data Using
Random Walks
Authors: David Harel, Yehuda Koren
Graduate: Chien-Ming Hsiao
Outline
• Motivation
• Objective
• Introduction
• Basic Notions
• Modeling the Data
• Clustering Using Random Walks
  – Separators and separating operators
  – Clustering by separation
  – Clustering spatial points
• Integration with Agglomerative Clustering
• Examples
• Conclusion
• Opinion
Motivation
• The characteristics of spatial data pose several difficulties for clustering algorithms
• The clusters may have arbitrary shapes and non-uniform sizes
  – Different clusters may have different densities
• The existence of noise may interfere with the clustering process
Objective
• Present a new approach to clustering spatial data
• Seek efficient clustering algorithms
• Overcome noise and outliers
Introduction
• The heart of the method is in what we shall be calling separating operators
• Their effect is to sharpen the distinction between the weights of inter-cluster edges and intra-cluster edges
  – By decreasing the former and increasing the latter
• They can be used on their own or can be embedded in a classical agglomerative clustering framework
BASIC NOTIONS
• Graph-theoretic notions
  – Let G(V, E, w) be a weighted graph
  – w: weighting function (a higher value means more similar)
  – Let S ⊆ V
  – V^k(S): the set of nodes that are connected to some node of S by a path with at most k edges
  – deg(G): the degree of G
  – ⟨i, j⟩: the edge between i and j
BASIC NOTIONS
• The probability of a transition from node i to node j:

    p_ij = w(i, j) / d(i),   where d(i) = Σ_{⟨i,k⟩∈E} w(i, k)

• The probability that a random walk originating at s will reach t before returning to s:

    P_escape(s, t) = Σ_{⟨s,i⟩∈E} p_si · ρ_i

  where ρ_s = 0, ρ_t = 1, and ρ_i = Σ_{⟨i,j⟩∈E} p_ij · ρ_j for i ≠ s, i ≠ t
MODELING THE DATA
• Delaunay triangulation (DT)
  – Many O(n log n)-time and O(n)-space algorithms exist for computing the DT of a planar point set
• k-mutual neighborhood
  – The k-nearest neighbors of each point can be computed in O(n log n) time and O(n) space for any fixed arbitrary dimension
• The weight of the edge (a, b) is

    w(a, b) = exp( −d(a, b)² / ave² )

  – d(a, b) is the Euclidean distance between a and b
  – ave is the average Euclidean distance between two adjacent points
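The weighting step can be sketched as follows; the points and edge list are hypothetical example data, and the edge set stands in for a DT or k-mutual-neighborhood graph:

```python
import math

# Hypothetical planar points; three close together plus one far outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # stand-in for DT / k-mutual edges

def edge_weights(points, edges):
    """Gaussian similarity weights: w(a, b) = exp(-d(a, b)^2 / ave^2)."""
    d = lambda a, b: math.dist(points[a], points[b])
    ave = sum(d(a, b) for a, b in edges) / len(edges)
    return {(a, b): math.exp(-d(a, b) ** 2 / ave ** 2) for a, b in edges}

w = edge_weights(points, edges)
```

Normalizing by `ave` makes the weights scale-free: short (intra-cluster) edges get weights near 1, while the long edge to the outlier gets a weight near 0.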
CLUSTERING USING RANDOM WALKS
• To identify natural clusters in a graph, we find ways to compute an intimacy relation between the nodes incident to each of the graph's edges
• Separators are identified using an iterative process of separation
  – This is a kind of sharpening pass
NS: Separation by neighborhood similarity

Definition:
Let G(V, E, w) be a weighted graph and k be some small constant. The separation of G by neighborhood similarity, denoted by NS(G), is defined to be:

    NS(G) ≝ G_s(V, E, w_s)

    where for ⟨v, u⟩ ∈ E,  w_s(u, v) = sim_k( P^k_visit(v), P^k_visit(u) )

    sim_k(x, y) is some similarity measure of the vectors x and y

Can be computed in O(|E| · n) time and O(n) space
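A sketch of one NS pass, assuming cosine similarity for sim_k (the definition leaves the similarity measure open) and dense matrix powers for the visit vectors; the two-triangle graph is a hypothetical example:

```python
import numpy as np

def ns_operator(W, k=3):
    """One NS sharpening pass (sketch; cosine is one choice of sim_k).

    Replaces each edge weight w(u, v) with sim_k(P^k_visit(u), P^k_visit(v)),
    where P^k_visit(v) accumulates the walk distributions of lengths 1..k.
    """
    P = W / W.sum(axis=1, keepdims=True)
    visit = np.zeros_like(P)
    step = np.eye(W.shape[0])
    for _ in range(k):
        step = step @ P          # distribution after one more step
        visit += step            # row v approximates P^k_visit(v)
    Ws = np.zeros_like(W)
    for u, v in zip(*np.nonzero(W)):
        x, y = visit[u], visit[v]
        Ws[u, v] = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return Ws

# Two dense triangles joined by one weak bridge edge (2, 3).
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.2
Ws = ns_operator(W)
```

Endpoints inside a triangle see nearly the same neighborhood, so intra-cluster edges keep high weight, while the bridge endpoints' visit vectors point at different clusters and the bridge weight drops — exactly the sharpening effect the operator is after.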
CE: Separation by circular escape

Definition:
Let G(V, E, w) be a weighted graph, and let k be some small constant. The separation of G by circular escape, denoted by CE(G), is defined to be:

    CE(G) ≝ G_s(V, E, w_s)

    where for ⟨v, u⟩ ∈ E,  w_s(u, v) = CE^k(v, u)

    CE^k(v, u) ≝ P^k_escape(v, u) · P^k_escape(u, v)

Can be computed in O(|E| · n) time and O(n) space
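A sketch of one CE pass. The k-step-bounded escape probability is approximated here by k rounds of value iteration on the harmonic system — one plausible reading of P^k_escape, not necessarily the paper's exact procedure; the example graph is the same hypothetical two-triangle shape:

```python
import numpy as np

def escape_k(P, s, t, k):
    """P^k_escape(s, t): walk from s reaches t before returning to s,
    within at most k steps (sketch via k rounds of value iteration)."""
    rho = np.zeros(P.shape[0])
    rho[t] = 1.0
    for _ in range(k - 1):
        rho = P @ rho
        rho[s], rho[t] = 0.0, 1.0    # boundary conditions rho_s=0, rho_t=1
    return P[s] @ rho

def ce_operator(W, k=3):
    """One CE sharpening pass: w_s(u, v) = P^k_escape(v, u) * P^k_escape(u, v)."""
    P = W / W.sum(axis=1, keepdims=True)
    Ws = np.zeros_like(W)
    for u, v in zip(*np.nonzero(W)):
        Ws[u, v] = escape_k(P, v, u, k) * escape_k(P, u, v, k)
    return Ws

# Two dense triangles joined by one weak bridge edge (2, 3).
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.2
Ws = ce_operator(W)
```

The product of the two escape directions is what makes the operator "circular": an edge stays strong only if short round trips across it are likely in both directions, which is true inside a cluster and false across the bridge.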
Clustering spatial points
Integration with Agglomerative Clustering
• The separation operators can be used as preprocessing before activating agglomerative clustering on the graph
• This can effectively prevent bad local merges that oppose the graph structure
• It is equivalent to a "single link" algorithm preceded by a separation operation
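The "separate, then single link" pipeline can be sketched as thresholding the sharpened weights and taking connected components (a union-find pass; the weight values below are hypothetical post-separation weights, not output of the paper's experiments):

```python
def single_link_clusters(n, weights, threshold):
    """Single-link grouping after separation (sketch): keep edges whose
    sharpened weight exceeds `threshold`, return connected components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w > threshold:
            parent[find(a)] = find(b)       # union the two components

    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())

# Hypothetical sharpened weights: the inter-cluster edge (2, 3) has decayed.
weights = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.05, (3, 4): 0.85}
clusters = single_link_clusters(5, weights, threshold=0.5)
```

Because separation has already driven the inter-cluster edge toward zero, the naive single-link merge order no longer makes the bad local merge across it.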
Examples
Conclusion
• It is robust in the presence of noise and
outliers, and is flexible in handling data of
different densities.
• The CE operator yields better results than the
NS operator
• The time complexity of our algorithm applied
to n data points is O(n log n)
Opinion
• Since the algorithm does not rely on spatial knowledge, we can try it on other types of data
END