Efficient Algorithms for
Non-parametric Clustering
With Clutter
Weng-Keen Wong
Andrew Moore
(In partial fulfillment of the speaking requirement)
1
Problems From the Physical Sciences
Earthquake faults
Minefield detection
(Byers and Raftery 1998)
(Dasgupta and Raftery 1998)
2
Problems From the Physical Sciences
(Pereira 2002)
(Sloan Digital Sky Survey 2000)
3
A Simplified Example
4
Clustering with Single Linkage Clustering
Single Linkage Clustering MST
Clusters
5
Clustering with Mixture Models
Mixture of Gaussians with a
Uniform Background Component
Resulting Clusters
6
Clustering with CFF
Original Dataset
Cuevas-Febrero-Fraiman
7
Related Work
(Dasgupta and Raftery 1998)
Mixture model approach – mixture of
Gaussians for features, Poisson process for
clutter
(Byers and Raftery 1998)
K-nearest neighbour distances for all points
modeled as a mixture of two gamma
distributions, one for clutter and one for the
features
Classify each data point based on which
component it was most likely generated from
8
Outline
1. Introduction: Clustering and Clutter
2. The Cuevas-Febrero-Fraiman Algorithm
3. Optimizing Step One of CFF
4. Optimizing Step Two of CFF
5. Results
9
The CFF Algorithm Step One
Find the high
density datapoints
10
The CFF Algorithm Step Two
Cluster the high density points using Single Linkage Clustering
Stop when link length > ε
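A minimal sketch of this step in Python, assuming the high density points arrive as a list of coordinate tuples and using a brute-force union-find over all pairs within ε (written eps); this is the naive version that the EMST-based machinery later in the talk speeds up, and the name epsilon_cluster is illustrative only.

    import math

    def epsilon_cluster(points, eps):
        # Union-find over the high density points.
        parent = list(range(len(points)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        # Single linkage clustering stopped at link length eps joins
        # exactly the pairs of points that are within eps of each other.
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                if math.dist(points[i], points[j]) <= eps:
                    parent[find(i)] = find(j)

        # Collect the connected components as the clusters.
        clusters = {}
        for i in range(len(points)):
            clusters.setdefault(find(i), []).append(points[i])
        return list(clusters.values())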
11
The CFF Algorithm
Originally intended to estimate the
number of clusters
Can also be used to find clusters against a
noisy background
12
Step One: Density Estimators
Finding high density points requires a
density estimator
Want to make as few assumptions about
underlying density as possible
Use a non-parametric density estimator
13
A Simple Non-Parametric Density
Estimator
A datapoint is a high
density datapoint if:
The number of
datapoints within a
hypersphere of radius
h is > threshold c
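A minimal sketch of this test, using a brute-force scan over the dataset; the kd-tree and dual-tree ideas on the next slide are what replace this inner loop in practice, and the name is_high_density is illustrative only.

    import math

    def is_high_density(query, points, h, c):
        # Count the datapoints that fall inside the hypersphere of
        # radius h centred on the query point.
        inside = sum(1 for p in points if math.dist(query, p) <= h)
        # High density datapoint if the count exceeds the threshold c.
        return inside > c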
14
Speeding up the Non-Parametric
Density Estimator
Addressed in a separate paper (Gray and
Moore 2001)
Two basic ideas:
1. Use a dual tree algorithm (Gray and Moore
2000)
2. Cut search off early without computing exact
densities (Moore 2000)
15
Step Two: Euclidean Minimum
Spanning Trees (EMSTs)
Traditional MST algorithms assume you are
given all the distances
Implies O(N²) memory usage
Want to use a Euclidean Minimum Spanning
Tree algorithm
16
Optimizing Clustering Step
Exploit recent results in computational geometry
for efficient EMSTs
Involves modification to GeoMST2 algorithm by
(Narasimhan et al 2000)
GeoMST2 is based on Well-Separated Pairwise
Decompositions (WSPDs) (Callahan 1995)
Our optimizations gain an order of magnitude
speedup, especially in higher dimensions
17
Outline for Optimizing Step Two
1. High level overview of GeoMST2
2. Properties of a WSPD
3. How to create a WSPD
4. More detailed description of GeoMST2
5. Our optimizations
18
Intuition behind GeoMST2
19
Intuition behind GeoMST2
20
High Level Overview of GeoMST2
Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
21
High Level Overview of GeoMST2
Well-Separated Pairwise Decomposition:
Each pair (Ai,Bi) represents a possible edge in the MST
(A1,B1), (A2,B2), …, (Am,Bm)
22
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
2. Take the pair (Ai,Bi) that corresponds to the shortest edge.
3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.
23
A Well-Separated Pair
(Callahan 1995)
Let A and B be point sets in R^d
Let RA and RB be their respective bounding hyper-rectangles
Define MargDistance(A,B) to be the minimum distance
between RA and RB
24
A Well-Separated Pair (Cont)
The point sets A and B are considered to be
well-separated if:
MargDistance(A,B) ≥ max{Diam(RA), Diam(RB)}
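A small Python sketch of this definition, assuming points are tuples and a bounding hyper-rectangle is stored as a (lower corner, upper corner) pair; the helper names (bounding_rect, diameter, marg_distance, is_well_separated) are illustrative, not from the paper.

    import math

    def bounding_rect(points):
        # Axis-aligned bounding hyper-rectangle as (lower corner, upper corner).
        dims = range(len(points[0]))
        lo = tuple(min(p[d] for p in points) for d in dims)
        hi = tuple(max(p[d] for p in points) for d in dims)
        return lo, hi

    def diameter(rect):
        # Diameter of a hyper-rectangle: the length of its main diagonal.
        lo, hi = rect
        return math.dist(lo, hi)

    def marg_distance(rect_a, rect_b):
        # Minimum distance between two hyper-rectangles: per-dimension gap,
        # zero in any dimension where the rectangles overlap.
        (lo_a, hi_a), (lo_b, hi_b) = rect_a, rect_b
        gaps = [max(lo_b[d] - hi_a[d], lo_a[d] - hi_b[d], 0.0)
                for d in range(len(lo_a))]
        return math.hypot(*gaps)

    def is_well_separated(rect_a, rect_b):
        # A and B are well-separated if their rectangles are at least as far
        # apart as the larger of the two diameters.
        return marg_distance(rect_a, rect_b) >= max(diameter(rect_a),
                                                    diameter(rect_b))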
25
Interaction Product
The interaction product between two point
sets A and B is defined as:
A ⊗ B = {{p,p’} | p ∈ A, p’ ∈ B, p ≠ p’}
26
Interaction Product
The interaction product between two point
sets A and B is defined as:
A ⊗ B = {{p,p’} | p ∈ A, p’ ∈ B, p ≠ p’}
This is the set of all distinct pairs with one element
in the pair from A and the other element from B
27
Interaction Product Definition
The interaction product between two point
sets A and B is defined as:
A ⊗ B = {{p,p’} | p ∈ A, p’ ∈ B, p ≠ p’}
For Example:
A = {1,2,3} B = {4,5}
A ⊗ B = {{1,4}, {1,5}, {2,4}, {2,5}, {3,4}, {3,5}}
28
Interaction Product
Now let A and B be the same point set, i.e.
A = {0,1,2,3,4}
B = {0,1,2,3,4}
A ⊗ B = {{0,1}, {0,2}, {0,3}, {0,4},
{1,2}, {1,3}, {1,4},
{2,3}, {2,4},
{3,4}}
29
Interaction Product
Now let A and B be the same point set, i.e.
A = {0,1,2,3,4}
B = {0,1,2,3,4}
A ⊗ B = {{0,1}, {0,2}, {0,3}, {0,4},
{1,2}, {1,3}, {1,4},
{2,3}, {2,4},
{3,4}}
Think of this as all possible edges in a complete,
undirected graph with {0,1,2,3,4} as the vertices
30
A Well-Separated Pairwise
Decomposition
Pair #1: ([0],[1])
Pair #2: ([0,1],[2])
Pair #3: ([0,1,2],[3,4])
Pair #4: ([3],[4])
Claim:
The set of pairs {([0],[1]), ([0,1],[2]), ([0,1,2],[3,4]), ([3],[4])} forms a Well-Separated Pairwise Decomposition.
31
Interaction Product Properties
If P is a point set in R^d, then a WSPD of P is a set of pairs (A1,B1),…,(Ak,Bk) with the following properties:
1. Ai ⊆ P and Bi ⊆ P for all i = 1,…,k
2. Ai ∩ Bi = ∅ for all i = 1,…,k
A = {0,1,2,3,4}
B = {0,1,2,3,4}
{([0],[1]), ([0,1], [2]), ([0,1,2],[3,4]), ([3], [4])}
clearly satisfies Properties 1 and 2
32
Interaction Product Property 3
3. (Ai ⊗ Bi) ∩ (Aj ⊗ Bj) = ∅ for all i,j such that i ≠ j
From {([0],[1]), ([0,1], [2]), ([0,1,2],[3,4]), ([3], [4])}
we get the following interaction products:
A1 ⊗ B1 = {{0,1}}
A2 ⊗ B2 = {{0,2},{1,2}}
A3 ⊗ B3 = {{0,3},{1,3},{2,3},{0,4},{1,4},{2,4}}
A4 ⊗ B4 = {{3,4}}
These Interaction Products are all disjoint
33
Interaction Product Property 4
4. P ⊗ P = (A1 ⊗ B1) ∪ (A2 ⊗ B2) ∪ … ∪ (Ak ⊗ Bk)
P ⊗ P = {{0,1}, {0,2}, {0,3}, {0,4}, {1,2}, {1,3}, {1,4},
{2,3}, {2,4}, {3,4}}
A1 ⊗ B1 = {{0,1}}
A2 ⊗ B2 = {{0,2},{1,2}}
A3 ⊗ B3 = {{0,3},{1,3},{2,3},{0,4},{1,4},{2,4}}
A4 ⊗ B4 = {{3,4}}
The union of the above interaction products gives back P ⊗ P
34
Interaction Product Property 5
5. Ai and Bi are
well-separated for
all i=1,…,k
35
Two Points to Note about WSPDs
Two distinct points are considered to be
well-separated
For any data set of size n, there is a trivial
WSPD of size (n choose 2)
36
A Well-Separated Pairwise
Decomposition (Continued)
If there are n points in P, a
WSPD of P can be constructed
in O(n log n) time with O(n)
elements using a fair split tree
(Callahan 1995)
37
A Fair Split Tree
38
Creating a WSPD
Are the nodes outlined in yellow well-separated? No.
39
Creating a WSPD
Recurse on children of node with widest dimension
40
Creating a WSPD
Recurse on children of node with widest dimension
41
Creating a WSPD
Recurse on children of node with widest dimension
42
Creating a WSPD
And so on…
43
Base Case
Eventually you will find a well-separated pair of nodes.
Add this pair to the WSPD.
44
Another Example of the Base Case
45
Creating a WSPD
FindWSPD(W, NodeA, NodeB)
    if( IsWellSeparated(NodeA, NodeB) )
        AddPair(W, NodeA, NodeB)
    else
        if( MaxHrectDimLength(NodeA) < MaxHrectDimLength(NodeB) )
            Swap(NodeA, NodeB)
        FindWSPD(W, NodeA->Left, NodeB)
        FindWSPD(W, NodeA->Right, NodeB)
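A Python rendering of this recursion, reusing the is_well_separated and bounding-rectangle helpers sketched earlier; it assumes each fair split tree node carries its bounding rectangle in node.rect and its children in node.left / node.right (None at a leaf), and it adds a driver build_wspd that seeds the recursion from sibling subtrees, a detail the slide leaves implicit. All of these names are illustrative.

    def max_dim_length(rect):
        # Length of the widest dimension of a bounding hyper-rectangle.
        lo, hi = rect
        return max(h - l for l, h in zip(lo, hi))

    def find_wspd(wspd, node_a, node_b):
        # Well-separated nodes contribute one pair to the decomposition.
        if is_well_separated(node_a.rect, node_b.rect):
            wspd.append((node_a, node_b))
            return
        # Otherwise recurse on the children of the node whose bounding
        # rectangle has the widest dimension.
        if max_dim_length(node_a.rect) < max_dim_length(node_b.rect):
            node_a, node_b = node_b, node_a
        find_wspd(wspd, node_a.left, node_b)
        find_wspd(wspd, node_a.right, node_b)

    def build_wspd(root):
        # Assumed driver: pairs are seeded from the two children of every
        # internal node of the fair split tree.
        wspd = []
        def descend(node):
            if node.left is None:      # leaf: a single point
                return
            find_wspd(wspd, node.left, node.right)
            descend(node.left)
            descend(node.right)
        descend(root)
        return wspd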
46
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
2. Take the pair (Ai,Bi) that corresponds to the shortest edge.
3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.
47
Bichromatic Closest Pair Distance
Given two sets (Ai,Bi), the Bichromatic
Closest Pair Distance is the closest distance
from a point in Ai to a point in Bi
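Stated as code (brute force, just to pin down the definition; GeoMST2 itself avoids computing most of these distances exactly):

    import math

    def bcp_distance(points_a, points_b):
        # Bichromatic Closest Pair distance: the smallest distance from a
        # point in A to a point in B.
        return min(math.dist(p, q) for p in points_a for q in points_b)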
48
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
2. Take the pair
(Ai,Bi) with the
shortest BCP
distance
3. If Ai and Bi are not
already connected,
add the edge to the
MST. Repeat Step 2.
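A simplified sketch of this loop, assuming the points are indexed 0..n-1, each WSPD pair is given as two lists of point indices, and a helper bcp_pair returns a pair's BCP distance together with the indices that realise it; the real GeoMST2 evaluates these distances lazily, whereas here they are computed up front purely for clarity.

    import heapq
    import math

    def bcp_pair(points, idx_a, idx_b):
        # Brute-force BCP distance plus the endpoints that realise it.
        return min((math.dist(points[i], points[j]), i, j)
                   for i in idx_a for j in idx_b)

    def geo_mst2(points, wspd_pairs):
        parent = list(range(len(points)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Step 2: order the candidate edges (one per WSPD pair) by BCP distance.
        heap = [bcp_pair(points, a, b) for a, b in wspd_pairs]
        heapq.heapify(heap)

        mst = []
        while heap and len(mst) < len(points) - 1:
            d, i, j = heapq.heappop(heap)
            # Step 3: keep the edge only if its endpoints are not already
            # in the same connected component.
            if find(i) != find(j):
                parent[find(i)] = find(j)
                mst.append((i, j, d))
        return mst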
49
GeoMST2 Example Start
Current MST
50
GeoMST2 Example Iteration 1
Current MST
51
GeoMST2 Example Iteration 2
Current MST
52
GeoMST2 Example Iteration 3
Current MST
53
GeoMST2 Example Iteration 4
Current MST
54
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
2. Take the pair (Ai,Bi) with the shortest BCP distance
Modification for CFF: if BCP distance > ε, terminate
3. If Ai and Bi are not
already connected,
add the edge to the
MST. Repeat Step 2.
55
Optimizations
We don’t need the EMST
We just need to cluster all points that are within distance ε of each other
Allows two optimizations to GeoMST2
code
56
High Level Overview of GeoMST2
1. Create the Well-Separated Pairwise Decomposition:
(A1,B1), (A2,B2), …, (Am,Bm)
Optimizations take place
in Step 1
2. Take the pair
(Ai,Bi) with the
shortest BCP
distance
3. If Ai and Bi are not
already connected,
add the edge to the
MST. Repeat Step 2.
57
Recall: How to Create the WSPD
58
Optimization 1 Illustration
59
Optimization 1
Ignore all links that are > ε
Every pair (Ai, Bi) in the WSPD becomes
an edge unless it joins two already
connected components
If MargDistance(Ai,Bi) > ε, then an edge of length ≤ ε cannot exist between a point in Ai and a point in Bi
Don’t include such a pair in the WSPD
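As a one-line test during WSPD construction (reusing marg_distance from the earlier sketch; eps stands for ε, and the name prune_pair is illustrative):

    def prune_pair(rect_a, rect_b, eps):
        # Optimization 1: if even the closest points of the two bounding
        # hyper-rectangles are more than eps apart, the pair can never
        # produce a link of length <= eps, so it is dropped from the WSPD.
        return marg_distance(rect_a, rect_b) > eps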
60
Optimization 2 Illustration
61
Optimization 2
Join all elements that are within distance ε of each other
If the max distance separating the bounding hyper-rectangles of Ai and Bi is ≤ ε, then join all the points in Ai and Bi if they are not already connected
Do not add such a pair (Ai,Bi) to the
WSPD
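A sketch of how both optimizations might be folded into the FindWSPD recursion sketched earlier, assuming a max_distance helper for the farthest distance between two bounding hyper-rectangles, that each node also lists the indices of its points in node.points, and a union_all callback that connects two index lists in the clustering's union-find; all of these names are illustrative, not from the paper.

    import math

    def max_distance(rect_a, rect_b):
        # Farthest distance between a point of one hyper-rectangle and a
        # point of the other.
        (lo_a, hi_a), (lo_b, hi_b) = rect_a, rect_b
        spans = [max(hi_b[d] - lo_a[d], hi_a[d] - lo_b[d])
                 for d in range(len(lo_a))]
        return math.hypot(*spans)

    def find_wspd_cff(wspd, node_a, node_b, eps, union_all):
        # Optimization 1: no link of length <= eps can join A and B.
        if marg_distance(node_a.rect, node_b.rect) > eps:
            return
        # Optimization 2: every point of A is within eps of every point of
        # B, so connect them now and keep the pair out of the WSPD.
        if max_distance(node_a.rect, node_b.rect) <= eps:
            union_all(node_a.points, node_b.points)
            return
        if is_well_separated(node_a.rect, node_b.rect):
            wspd.append((node_a, node_b))
            return
        if max_dim_length(node_a.rect) < max_dim_length(node_b.rect):
            node_a, node_b = node_b, node_a
        find_wspd_cff(wspd, node_a.left, node_b, eps, union_all)
        find_wspd_cff(wspd, node_a.right, node_b, eps, union_all)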
62
Implications of the Optimizations
Reduce the amount of time spent in
creating the WSPD
Reduce the number of pairs in the WSPD, thereby speeding up the GeoMST2 algorithm by reducing the size of its priority queue
63
Results
Ran step two algorithms on subsets of
the Sloan Digital Sky Survey
7 attributes – 4 colors, 2 sky
coordinates, 1 redshift value
Compared Kruskal, GeoMST2, and ε-clustering
64
Results (GeoMST2 vs ε-Clustering vs Kruskal in 4D)
65
Results (GeoMST2 vs ε-Clustering in 3D)
66
Results (GeoMST2 vs ε-Clustering in 4D)
67
Results (Change in Time as ε changes for 4D data)
68
Results (Increasing Dimensions vs Time)
69
Future Work
More accurate, faster non-parametric
density estimator
Use ball trees instead of fair split tree
Optimize the algorithm if we keep h constant but vary c and ε
70
Conclusions
ε-clustering outperforms GeoMST2 by
nearly an order of magnitude in higher
dimensions
Combining the optimizations in both
steps will yield an efficient algorithm for
clustering against clutter on massive
data sets
71