WINP: A Window-based Incremental and Parallel Clustering Algorithm for Very Large Databases
Zhang Qiang, Zhao Zheng, Sun Zhiwei, Edward Daley
IEEE 2005
Date: 2006/04/28
Speaker: Liu Yu-Jiun
Introduction
- The explosive growth of data makes it increasingly difficult to find useful information.
Existing Clustering Algorithms
- Partitioning clustering method: CLARANS
- Hierarchical clustering method: BIRCH
- Grid clustering method: STING
- Density-based clustering method: DBSCAN

How to cluster accurately and efficiently?
Objects
- Big object: less work but low accuracy.
- Small object: more work but high accuracy.
Basic WINP Algorithm
1. Divide the data space into rectangular cells.
2. Generate a detecting window as a handling object.
3. WINP treats cells as handling objects and searches for clusters accurately around the window's location.
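
A minimal Java sketch of step 1, assuming points are d-dimensional double arrays and a uniform cell side length (the class and parameter names are illustrative, not from the paper):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch of WINP step 1: hash each point into a rectangular cell of side cellLen. */
public class GridBuilder {
    /** Returns a map from cell coordinates to mass(g), the number of points in cell g. */
    public static Map<String, Integer> buildGrid(double[][] points, double cellLen) {
        Map<String, Integer> mass = new HashMap<>();
        for (double[] p : points) {
            int[] cell = new int[p.length];
            for (int d = 0; d < p.length; d++) {
                cell[d] = (int) Math.floor(p[d] / cellLen); // grid coordinate per dimension
            }
            mass.merge(Arrays.toString(cell), 1, Integer::sum); // count the point into its cell
        }
        return mass;
    }
}
```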
Definitions
- Definition 1:
  The mass of a cell g: $\mathrm{mass}(g) = n$,
  where n is the number of points located in cell g.
  The mass of a collection of cells: $\mathrm{mass}(\{g_i\}) = \sum_i \mathrm{mass}(g_i)$
- Definition 2:
  The ε-neighborhood of g: $N_\varepsilon(g) = \{g_i \mid |p, p_i| \le \varepsilon\}$,
  where p is the center of g, $p_i$ is the center of $g_i$, and $g \in N_\varepsilon(g)$.
Definitions cont.
- Definition 3:
  A nonempty cell g is a core cell: $\mathrm{core}(g, \varepsilon, \xi) = \text{true}$
- Definition 4:
  Cells $g_i$ and $g_j$ are density reachable, $g_i \leftrightarrow g_j$,
  if $\mathrm{core}(g_i, \varepsilon, \xi) = \text{true}$, $\mathrm{core}(g_j, \varepsilon, \xi) = \text{true}$, and
  there exists a sequence $\{g_k \mid \mathrm{core}(g_k, \varepsilon, \xi) = \text{true}\}$ with
  $g_1 \in N_\varepsilon(g_i),\ g_2 \in N_\varepsilon(g_1),\ \ldots,\ g_k \in N_\varepsilon(g_{k-1}),\ g_j \in N_\varepsilon(g_k)$
- A frame F is a nonempty set composed of cells, every two of which are density reachable.
Definitions cont.
- Definition 5:
  A cluster C with respect to frame F:
  $C = \{g \mid \mathrm{mass}(g) > 0,\ \exists\, p \in F \text{ such that } g \in N_\varepsilon(p)\}$
Detecting Window
- Find out the expecting core cells.
- Window length: $L_w = \left\lceil \frac{2\varepsilon}{L_c} \right\rceil \cdot L_c$, where $L_c$ is the cell length.
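
In other words, the window spans the smallest whole number of cells covering a diameter of 2ε. A worked example with illustrative values ε = 2.5 and $L_c = 1.0$:

$L_w = \left\lceil \frac{2 \times 2.5}{1.0} \right\rceil \times 1.0 = \lceil 5 \rceil \times 1.0 = 5$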
Finding accurate clusters
- Search only around the expecting core cells.
- Algorithm 1 (used for clustering in a low-dimensional space):
  1. If core = null then terminate, else remove a cell g from it.
  2. If g is marked, then go back to 1.
  3. Test whether g is a core cell.
  4. If g is not, then mark it with NOISE and go back to 1.
  5. If g is a core cell, then put all nonempty cells in g's ε-neighborhood area into a set E.
  6. Generate a new cluster C, mark g as a member of the frame of C, and mark all the cells in E as members of C.
  7. ...
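
A hedged Java sketch of steps 1-6, reusing the Cell and CoreTest helpers sketched above. Step 7 onward is elided on the slide, so the expansion of E is only marked by a comment here:

```java
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Algo1Sketch {
    static final String NOISE = "NOISE";

    /** Steps 1-6 of Algorithm 1; `core` holds the expecting core cells found by the window. */
    static Map<Cell, String> cluster(Deque<Cell> core, List<Cell> nonempty, double eps, int xi) {
        Map<Cell, String> mark = new HashMap<>();
        int clusterId = 0;
        while (!core.isEmpty()) {                          // step 1: stop when core is empty
            Cell g = core.pop();
            if (mark.containsKey(g)) continue;             // step 2: g is already marked
            if (!CoreTest.isCore(g, nonempty, eps, xi)) {  // step 3: core-cell test
                mark.put(g, NOISE);                        // step 4: not core -> NOISE
                continue;
            }
            String c = "C" + (++clusterId);                // step 6: new cluster C
            mark.put(g, c);                                // g joins the frame of C
            for (Cell e : neighbors(g, nonempty, eps)) {   // step 5: nonempty eps-neighborhood E
                mark.put(e, c);                            // members of C
            }
            // step 7 onward (expanding from the cells in E) is elided on the slide
        }
        return mark;
    }

    static List<Cell> neighbors(Cell g, List<Cell> cells, double eps) {
        return cells.stream().filter(gi -> gi.mass() > 0 && dist(g, gi) <= eps).toList();
    }

    static double dist(Cell a, Cell b) {
        double s = 0;
        for (int d = 0; d < a.center().length; d++) {
            double diff = a.center()[d] - b.center()[d];
            s += diff * diff;
        }
        return Math.sqrt(s);
    }
}
```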
A compromise between accuracy and efficiency in high-dimensional space
- In a high-dimensional grid space, the ε-neighborhood area becomes large.
- Two parameters, COREMAX and CLUSTERMIN, are used to control the trade-off between efficiency and accuracy.
Algorithm 2
1. If $\frac{\text{NumberOfExpectingCoreCells}}{\text{NumberOfNonemptyCells}} \ge \text{COREMAX}$, then jump to 4.
2. Use Algorithm 1 to process.
3. Terminate.
4. If Core = null, then jump to 11, else remove a cell g from Core.
5. If g has been marked, then go back to 4.
6. Generate a new cluster C, and mark g as a member of the frame of C.
Steps 7 to 10 are similar to the steps of Algorithm 1.
11. If $\frac{\text{NumberOfExpectingCoreCellsInCluster}_i}{\text{NumberOfNonemptyCells}} \ge \text{CLUSTERMIN}$, then use Algorithm 1 to process cluster i.
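
A short sketch of the two gates; the comparison directions (≥) are reconstructed from context, since the relational operators were lost on the slide:

```java
class Algo2Gates {
    /** Step 1: if expecting core cells are too dense, skip Algorithm 1 and jump to step 4. */
    static boolean useFastPath(int expectingCoreCells, int nonemptyCells, double coreMax) {
        return (double) expectingCoreCells / nonemptyCells >= coreMax;
    }

    /** Step 11: refine cluster i with Algorithm 1 when it holds enough expecting core cells. */
    static boolean refineWithAlgo1(int expectingCoreCellsInClusterI, int nonemptyCells,
                                   double clusterMin) {
        return (double) expectingCoreCellsInClusterI / nonemptyCells >= clusterMin;
    }
}
```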
The selection of parameters ε and ξ
- ξ = 3 * dimension for a very large database with high noise.
- ξ-distance (σ) = min(Δ)
- Algorithm 3:
  1. Randomly select about 1% of the total nonempty cells.
  2. Calculate σ for these cells.
  3. Sort these σ in descending order.
  4. Choose the σ of the first valley point as ε.
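
A Java sketch of Algorithm 3. The slide does not spell out how σ (the ξ-distance) or Δ is computed, so sigmaOf is left as an assumed helper, and "first valley point" is interpreted here as the first place where the descending σ curve flattens, analogous to the elbow of DBSCAN's sorted k-dist graph:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class EpsSelection {
    /** Sketch of Algorithm 3: sample ~1% of nonempty cells, sort sigma descending, pick eps. */
    static double chooseEps(List<Cell> nonempty, double flatness, Random rnd) {
        List<Double> sigmas = new ArrayList<>();
        for (Cell g : nonempty) {
            if (rnd.nextDouble() < 0.01) {            // step 1: sample about 1% of nonempty cells
                sigmas.add(sigmaOf(g));                // step 2: sigma of each sampled cell
            }
        }
        sigmas.sort(Collections.reverseOrder());       // step 3: sort sigma in descending order
        for (int i = 0; i + 1 < sigmas.size(); i++) {  // step 4: first valley point -> eps
            if (sigmas.get(i) - sigmas.get(i + 1) < flatness) {
                return sigmas.get(i);
            }
        }
        return sigmas.isEmpty() ? 0.0 : sigmas.get(sigmas.size() - 1);
    }

    /** Assumed helper: the xi-distance sigma of cell g (definition elided on the slide). */
    static double sigmaOf(Cell g) {
        throw new UnsupportedOperationException("sigma = min(delta); not specified on the slide");
    }
}
```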
Time Complexity of WINP
- Grid creating: O(N)
- Window detecting: O(M)
- Accurate cluster finding: O(F * E)
- Sum of the above: O(N + M + F*E) ≈ O(N), since N >> M and N >> F*E
where
- N: the number of all points
- M: the number of nonempty cells in the grid space
- F: the number of cells occupied by clusters
- E: the average number of nonempty cells in an ε-neighborhood area
Incremental clustering algorithm
- Different policies for large and small insertions.
- Rerun Algorithm 2 when
  - ΔN > 5% of N, or
  - the updated cells outnumber the original core cells.
- Otherwise, reuse the grid space and clusters.
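
The decision itself is a simple test; a compact sketch (the parameter names are illustrative):

```java
class IncrementalPolicy {
    /** Rerun Algorithm 2 for large insertions; otherwise reuse the grid space and clusters. */
    static boolean shouldRecluster(long deltaN, long n, long updatedCells, long originalCoreCells) {
        return deltaN > 0.05 * n || updatedCells > originalCoreCells;
    }
}
```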
Influence
- Four possible results of updated cells:
  a) no influence
  b) a new cluster
  c) growth of a cluster
  d) merging of clusters
The distributed clustering algorithm
- Algorithm 5 (parallel clustering):
  1. P_i reads data block N_i to generate grid blocks.
  2. Exchange grid blocks among the processors, so that P_i receives the grid blocks belonging to its own region.
  3. P_i puts the received grid blocks together and gets its own grid space $\tilde{G}_i$ by summing up the masses of identical cells.
  4. P_i uses Algorithm 2 to do the clustering and gets its clusters.
  5. P_i finds the processors sharing a grid boundary with it and merges the clusters.
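
A sketch of step 3, merging the exchanged grid blocks by summing the masses of identical cells; the block representation (cell key -> mass) follows the grid sketch above, and the message passing itself is elided:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GridMerge {
    /** Step 3 of Algorithm 5: sum the masses of identical cells across received blocks. */
    static Map<String, Integer> mergeBlocks(List<Map<String, Integer>> receivedBlocks) {
        Map<String, Integer> gridSpace = new HashMap<>();
        for (Map<String, Integer> block : receivedBlocks) {
            block.forEach((cell, m) -> gridSpace.merge(cell, m, Integer::sum));
        }
        return gridSpace; // the processor's own grid space G~_i
    }
}
```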
Clusters merging for parallel clustering
- Algorithm 6 (merging):
  1. Find out all core cells within ε distance of X.
  2. Get the set $\{E_{i,1}\}$.
  3. Check each cell g in $\{E_{j,r}\}$: if there are core cells of P_i's clusters in g's ε-neighborhood area, merge them into one big cluster.
[Figure: the shared grid boundary between processors P_i and P_j; example annotation: S_i = 5, S_j = 2]
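
A hedged sketch of the merge test in step 3: for each boundary cell of P_j, any core cell of P_i's clusters within ε triggers a union of the two cluster labels (the label maps and the reuse of Algo1Sketch.dist are assumptions of this sketch):

```java
import java.util.List;
import java.util.Map;

class BoundaryMerge {
    /** Record which cluster pairs across the P_i / P_j boundary form one big cluster. */
    static void merge(List<Cell> boundaryCellsOfPj, List<Cell> coreCellsOfPi,
                      Map<Cell, String> labelJ, Map<Cell, String> labelI,
                      Map<String, String> union, double eps) {
        for (Cell g : boundaryCellsOfPj) {
            for (Cell c : coreCellsOfPi) {
                if (Algo1Sketch.dist(g, c) <= eps) {
                    union.put(labelJ.get(g), labelI.get(c)); // g's cluster joins c's cluster
                }
            }
        }
    }
}
```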
Experiment
- Accuracy test
※ Grid density: 100*130
※ Grid density: 100*170
Experiment
- Efficiency test on a single workstation
※ The number of clusters in the data set is 6.
※ The noise level is 20%.
※ Workstation: HASEE 715D.
※ Grid density: 1000*1000
※ Implemented in Java.
Experiment
- Incremental clustering test
※ 1% increase vs. 5.5% increase
※ Grid density: 4000*4000
※ Speed-up factor $= \frac{T_{\text{reclustering}}}{T_{\text{incremental}}}$
Experiment
- Distributed clustering test
※ N: 1.2 * 10^8
※ 4 workstations (1.2 GHz, 256 MB each)
※ 7 clusters, 20% noise, and 5-dimensional data