WINP: A Window-based
Incremental and Parallel
Clustering Algorithm for Very
Large Databases
Zhang Qiang, Zhao Zheng, Sun Zhiwei, Edward Daley
IEEE 2005
Date: 2006/04/28
Speaker: Liu Yu-Jiun
Introduction
The rapid growth of data makes it increasingly difficult to extract useful information.
Existing Clustering Algorithm
Partitioning clustering method: CLARANS
Hierarchical clustering method: BIRCH
Grid clustering method: STING
Density-based clustering method: DBSCAN
How to cluster accurately and efficiently?
Objects
Big object: less work but low accuracy.
Small object: more work but high accuracy.
Basic WINP Algorithm
1. Divide the data space into rectangular cells.
2. Generate a detecting window as a handling object.
3. WINP treats cells as handling objects and searches for clusters accurately around the window's location.
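Step 1 can be sketched as a simple sparse-grid build (a minimal Python sketch; `build_grid` and its names are illustrative, not from the paper):

```python
from collections import defaultdict

def build_grid(points, cell_len):
    """Map each d-dimensional point to a rectangular cell and count its mass."""
    grid = defaultdict(int)  # cell index tuple -> mass (number of points in the cell)
    for p in points:
        cell = tuple(int(coord // cell_len) for coord in p)
        grid[cell] += 1
    return grid

points = [(0.1, 0.2), (0.3, 0.4), (2.5, 2.5)]
grid = build_grid(points, cell_len=1.0)
# cells (0, 0) and (2, 2) are nonempty; mass((0, 0)) == 2
```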
Definitions
Definition 1:
The mass of a cell g: mass(g) = n,
where n is the number of points located in cell g.
The mass of a collection of cells: mass({g_i}) = Σ_i mass(g_i).
Definition 2:
The ε-neighborhood of g: N_ε(g) = {g_i : |p, p_i| ≤ ε},
where p is the center of g and p_i is the center of g_i; note that g ∈ N_ε(g).
Definitions cont.
Definition 3:
A nonempty cell g is a core cell when core(g, ε, ξ) = true,
i.e. when mass(N_ε(g)) ≥ ξ.
Definition 4:
Cells g_i and g_j are density reachable, written g_i ↔ g_j, if
core(g_i, ε, ξ) = true, core(g_j, ε, ξ) = true, and there is a
sequence {g_k | core(g_k, ε, ξ) = true} with
g_1 ∈ N_ε(g_i), g_2 ∈ N_ε(g_1), ..., g_k ∈ N_ε(g_{k-1}), g_j ∈ N_ε(g_k).
A frame F is a nonempty set composed of cells, every two of which are density reachable.
Definitions cont.
Definition 5:
A cluster C with respect to frame F:
C = {g | mass(g) > 0 and ∃ p ∈ F such that g ∈ N_ε(p)}
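Definitions 1–3 can be sketched as follows, assuming (as in DBSCAN-style density tests) that a cell is a core cell when the mass of its ε-neighborhood reaches ξ; all function names are illustrative:

```python
def cell_center(cell, cell_len):
    """Center point of a rectangular cell given its integer index tuple."""
    return tuple((i + 0.5) * cell_len for i in cell)

def eps_neighborhood(g, grid, eps, cell_len):
    """N_eps(g): nonempty cells whose centers lie within eps of g's center (g included)."""
    p = cell_center(g, cell_len)
    return [gi for gi in grid
            if sum((a - b) ** 2
                   for a, b in zip(p, cell_center(gi, cell_len))) <= eps ** 2]

def is_core(g, grid, eps, xi, cell_len):
    """core(g, eps, xi): the mass of g's eps-neighborhood is at least xi (assumed test)."""
    return sum(grid[gi] for gi in eps_neighborhood(g, grid, eps, cell_len)) >= xi
```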
Detecting Window
Find the expecting core cells.
Window length: L_w = ⌈2ε / L_c⌉ · L_c,
where L_c is the side length of a cell.
Finding accurate clusters
Searching only around expecting core cells.
Algorithm 1 (used in clustering in a low-dimensional space)
1. If Core is empty, terminate; otherwise remove a cell g from it.
2. If g is marked, go back to 1.
3. Test whether g is a core cell.
4. If it is not, mark it with NOISE and go back to 1.
5. If g is a core cell, put all nonempty cells in g's ε-neighborhood area into a set E.
6. Generate a new cluster C, mark g as a member of the frame of C, and mark all the cells in E as members of C.
7. …………
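A runnable sketch of Algorithm 1 on a sparse grid; the elided steps after 7 are assumed to expand the cluster DBSCAN-style from the seed set E, and all names are illustrative:

```python
from math import dist  # Python 3.8+

def algo1(grid, core_candidates, eps, xi, cell_len):
    """Sketch of Algorithm 1: grow clusters cell-by-cell from expecting core cells.

    grid: {cell index tuple: mass}. Names are illustrative, not from the paper.
    """
    center = lambda g: tuple((i + 0.5) * cell_len for i in g)
    neigh = lambda g: [c for c in grid if dist(center(g), center(c)) <= eps]
    core = lambda g: sum(grid[c] for c in neigh(g)) >= xi

    labels, cid = {}, 0
    stack = list(core_candidates)
    while stack:                           # step 1: take a cell from Core
        g = stack.pop()
        if g in labels:                    # step 2: already marked
            continue
        if not core(g):                    # steps 3-4: not a core cell -> NOISE
            labels[g] = 'NOISE'
            continue
        cid += 1                           # step 6: generate a new cluster C
        frontier = [g]                     # step 5: seed set E
        while frontier:                    # assumed steps 7+: DBSCAN-style growth
            c = frontier.pop()
            if labels.get(c) == cid:
                continue
            labels[c] = cid                # NOISE cells reached here get absorbed
            if core(c):
                frontier.extend(neigh(c))
    return labels
```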
A compromise between accuracy and
efficiency in high-dimensional space
In a high-dimensional grid space, the ε-neighborhood area becomes large.
Two parameters, COREMAX and CLUSTERMIN, trade off efficiency against accuracy.
Algorithm 2
1. If NumberOfExpectingCoreCells / NumberOfNonemptyCells ≥ COREMAX, jump to 4.
2. Use Algorithm 1 to process.
3. Terminate.
4. If Core is empty, jump to 11; otherwise remove a cell g from Core.
5. If g has been marked, go back to 4.
6. Generate a new cluster C and mark g as a member of the frame of C.
Steps 7 to 10 are similar to the steps of Algorithm 1.
11. If NumberOfExpectingCoreCellsInCluster_i / NumberOfNonemptyCells ≥ CLUSTERMIN, use Algorithm 1 to reprocess cluster i.
The selection of parameters ε and ξ
ξ = 3 × dimension for a very large database with high noise.
The ξ-distance σ of a cell g: σ = min{Δ | mass(N_Δ(g)) ≥ ξ}.
Algorithm 3
1. Randomly select about 1% of the total nonempty cells.
2. Calculate σ for these cells.
3. Sort the σ values in descending order.
4. Choose the σ of the first valley point as ε.
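Algorithm 3 can be sketched as below; the "first valley point" test is an assumed heuristic (the sorted curve is taken to have flattened once a gap falls well below the largest gap), since the slide does not spell it out, and all names are illustrative:

```python
import random

def choose_eps(sigmas, flat_ratio=0.1):
    """Pick the sigma at the first 'valley' of the descending curve:
    the first point where the gap to the next value shrinks below a
    fraction of the largest gap (assumed heuristic)."""
    s = sorted(sigmas, reverse=True)               # step 3: descending order
    drops = [s[i] - s[i + 1] for i in range(len(s) - 1)]
    big = max(drops)
    for i, d in enumerate(drops):
        if d <= big * flat_ratio:                  # curve has flattened
            return s[i]                            # step 4: first valley point
    return s[-1]

def algo3(cell_sigmas, sample_frac=0.01, seed=0):
    """Steps 1-4: sample ~1% of the nonempty cells' xi-distances, pick eps."""
    random.seed(seed)
    k = max(1, int(len(cell_sigmas) * sample_frac))
    sample = random.sample(cell_sigmas, k)         # step 1: ~1% random sample
    return choose_eps(sample)                      # steps 2-4
```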
Time Complexity of WINP
Grid creating: O(N)
Window detecting: O(M)
Accurate cluster finding: O(F × E)
Total: O(N + M + F × E) ≈ O(N), since N >> M and N >> F × E.
N: the total number of points
M: the number of nonempty cells in the grid space
F: the number of cells occupied by clusters
E: the average number of nonempty cells in an ε-neighborhood area
Incremental clustering algorithm
Different policies for large and small insertions:
Rerun Algorithm 2 when ΔN > 5% of N, or when the updated cells outnumber the original core cells.
Otherwise, reuse the existing grid space and clusters.
Influence
Four possible results of updated cells:
a) no influence
b) a new cluster
c) growth of a cluster
d) merging of clusters
The distributed clustering algorithm
Algorithm 5 (parallel clustering)
1. P_i reads data block N_i to generate grid blocks.
2. Exchange grid blocks among processors so that P_i receives the grid blocks belonging to its region.
3. P_i puts these blocks together and gets its own grid space G̃_i by summing up the masses of identical cells.
4. P_i uses Algorithm 2 to do the clustering and gets its local clusters.
5. P_i finds the processors sharing a grid boundary with it and merges the clusters.
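Steps 2–3 (summing the masses of identical cells across received grid blocks) can be sketched with a plain dictionary merge; the actual message passing between processors is omitted, and all names are illustrative:

```python
from collections import Counter

def merge_grid_blocks(blocks):
    """Step 3 sketch: sum the masses of identical cells across the grid
    blocks a processor receives (each block is a {cell: mass} dict)."""
    merged = Counter()
    for block in blocks:
        merged.update(block)  # Counter.update adds per-key counts
    return dict(merged)

# two processors contribute partial masses for the same region
block_from_p1 = {(0, 0): 3, (0, 1): 2}
block_from_p2 = {(0, 0): 4, (5, 5): 1}
grid_i = merge_grid_blocks([block_from_p1, block_from_p2])
```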
Clusters merging for parallel clustering
Algorithm 6 (merging)
1. Find all core cells within ε distance of the shared boundary X, giving the sets {E_i,l} on P_i and {E_j,r} on P_j.
2. Check each cell g in {E_j,r}: if there are core cells of P_i's clusters in g's ε-neighborhood area, merge the two clusters into one big cluster.
(Figure omitted: boundary example between P_i and P_j with S_i = 5 and S_j = 2.)
Experiment
Accuracy test
※ grid density 100*130
※ grid density 100*170
Experiment
Efficiency test of single workstation
※The number of clusters in the data set is 6.
※The noise level is 20%.
※Workstation: HASEE 715D.
※Grid density: 1000*1000
※JAVA
Experiment
Incremental clustering test
※ 1% increase vs. 5.5% increase
※Grid density: 4000*4000
※ The speed-up factor: T_reclustering / T_incremental
Experiment
Distributed clustering test
※N: 1.2 * 10^8
※ 4 workstations (1.2 GHz, 256 MB)
※ 7 clusters, 20% noise, 5-dimensional data