Initializing K-means Batch Clustering:
A Critical Evaluation of Several Techniques
Authors: Douglas Steinley, Michael J. Brusco
Presented by: Maria Hyun
Date: 03/16/15
Initialization Strategy 1 - Astrahan
1. Define a distance d1 (average pairwise Euclidean distance):
   $d_1 = \frac{1}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d(x_i, x_j)$
2. For each data point xi, compute the number of other data points that are within d1 of xi (its density). The point with the highest density becomes the first cluster seed.
3. The remaining K − 1 seeds are chosen in order of decreasing density, provided each new seed is at least d2 (another prespecified distance) away from all the seeds that have already been chosen.
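The three steps above can be sketched in a few lines of Python. This is only an illustrative reading of the slide, not the authors' implementation: the function name `astrahan_seeds`, the tie-breaking by index, and the toy data are assumptions, and d2 has to be supplied by the caller.

```python
import math

def astrahan_seeds(points, k, d2):
    """Pick up to k seeds by decreasing density, keeping seeds at least d2 apart."""
    n = len(points)
    # d1 as written on the slide: distances over pairs i < j, scaled by 1 / (n (n - 1)).
    total = sum(math.dist(points[i], points[j])
                for i in range(n - 1) for j in range(i + 1, n))
    d1 = total / (n * (n - 1))
    # Density of x_i = number of other points within d1 of x_i.
    density = [sum(1 for j in range(n)
                   if j != i and math.dist(points[i], points[j]) <= d1)
               for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index, an assumption).
    order = sorted(range(n), key=lambda i: (-density[i], i))
    seeds = []
    for i in order:
        if all(math.dist(points[i], points[s]) >= d2 for s in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    # Fewer than k seeds are returned if d2 is too large for the data.
    return d1, [points[i] for i in seeds]

# Hypothetical toy data: a group of four points and a group of three points.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
d1, seeds = astrahan_seeds(points, k=2, d2=1.0)
```

On this toy data the densest point comes from the four-point group, and the second seed is the first sufficiently distant point in the other group.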
Initialization Strategy 3 - Faber
• Initial seeds are actual data points
  ◦ Random sample of K data points from the whole dataset
  ◦ Denser areas are more likely to be represented by starting seeds
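As a minimal sketch, the random draw of K data points can be written with the standard library. The function name `faber_seeds` and the optional `rng` argument are assumptions added for reproducibility.

```python
import random

def faber_seeds(points, k, rng=None):
    """Draw k distinct data points uniformly at random to serve as seeds."""
    rng = rng or random.Random()
    # Sampling without replacement: dense regions hold more points,
    # so they are more likely to contribute a seed.
    return rng.sample(points, k)

# Hypothetical toy data.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = faber_seeds(points, 3, random.Random(0))
```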
Initialization Strategy 6 - Milligan
• Use the solution from Ward's (1963) hierarchical cluster analysis as the starting partition
[Figure: hierarchical clustering example with six points labeled 1 to 6]
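To make the idea concrete, here is a naive sketch of Ward-style agglomeration that stops at K clusters and returns the cluster centroids as starting seeds. It assumes the usual Ward merge cost (the increase in SSE caused by a merge); the function name `ward_partition` is hypothetical, and the brute-force search is far slower than a library implementation.

```python
def ward_partition(points, k):
    """Agglomerate with Ward's criterion until k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # member indices per cluster
    centers = [list(p) for p in points]            # centroid per cluster

    def merge_cost(a, b):
        # Increase in SSE caused by merging clusters a and b.
        na, nb = len(clusters[a]), len(clusters[b])
        sq = sum((u - v) ** 2 for u, v in zip(centers[a], centers[b]))
        return na * nb / (na + nb) * sq

    while len(clusters) > k:
        # Brute-force search for the cheapest pair to merge.
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: merge_cost(*ab))
        na, nb = len(clusters[a]), len(clusters[b])
        centers[a] = [(na * u + nb * v) / (na + nb)
                      for u, v in zip(centers[a], centers[b])]
        clusters[a] += clusters[b]
        del clusters[b], centers[b]
    return clusters, centers

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centers = ward_partition(points, 2)
```

The returned centroids would then be passed to K-means as its initial seeds.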
Initialization Strategy 4 – Hand & Krzanowski
1. Run K-means clustering → partition P (SSE)
2. Set i = 1, α = .3, and β = .95
3. Perturb the solution from Step 1 by moving each object to a different cluster with probability α. Repeat the K-means clustering → Pi (SSEi)
4. If SSEi < SSE, then set SSE = SSEi and P = Pi. Set i = i + 1 and replace α by αβ. Stop if i = 100; otherwise return to Step 3.
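The perturbation loop can be sketched as follows. This is an illustration under stated assumptions rather than the published procedure: the helper names are hypothetical, the initial K-means run starts from a random partition, the perturbation is applied to the current best partition P, and K is assumed to be at least 2.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_means(points, labels, k):
    """Mean of each cluster; None marks an empty cluster."""
    means = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        means.append([sum(col) / len(members) for col in zip(*members)]
                     if members else None)
    return means

def kmeans(points, labels, k, max_iter=100):
    """Lloyd iterations started from a partition; returns (labels, sse)."""
    for _ in range(max_iter):
        means = cluster_means(points, labels, k)
        live = [c for c in range(k) if means[c] is not None]
        new = [min(live, key=lambda c: sq_dist(p, means[c])) for p in points]
        if new == labels:
            break
        labels = new
    means = cluster_means(points, labels, k)
    return labels, sum(sq_dist(p, means[l]) for p, l in zip(points, labels))

def hand_krzanowski(points, k, alpha=0.3, beta=0.95, iters=100, seed=0):
    rng = random.Random(seed)
    start = [rng.randrange(k) for _ in points]
    best, best_sse = kmeans(points, start, k)            # Step 1
    for _ in range(iters):
        # Step 3: move each object to a different cluster with probability alpha.
        perturbed = [rng.choice([c for c in range(k) if c != l])
                     if rng.random() < alpha else l for l in best]
        labels, sse = kmeans(points, perturbed, k)
        if sse < best_sse:                               # Step 4: keep improvements
            best, best_sse = labels, sse
        alpha *= beta                                    # shrink the perturbation
    return best, best_sse

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best, best_sse = hand_krzanowski(points, 2)
```

Because α shrinks by the factor β each round, later perturbations move fewer objects and act more like a local refinement of the best partition found so far.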
Initialization Strategy 9 – Su and Dy
1. Start with a single cluster and divide it into two sub-clusters
2. The sub-cluster with the largest within-cluster variance is the next cluster to be divided (call it Ck)
3. For Ck, project each xi ∈ Ck onto the first principal direction; the first principal component of xi is $y_i$
4. Divide Ck into two sub-clusters $C_k^{(1)}$ and $C_k^{(2)}$: if $y_i \le \bar{y}$, assign xi to $C_k^{(1)}$; otherwise, assign xi to $C_k^{(2)}$
5. Repeat 2 ~ 4 until K clusters are found
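The divisive procedure above can be sketched in plain Python. Several details are assumptions made for illustration: the first principal direction is approximated by power iteration on the scatter matrix, "largest within-cluster variance" is measured by the cluster's SSE, the split point is the mean projection, and all function names are hypothetical.

```python
def first_direction(rows, iters=200):
    """Approximate leading eigenvector of the scatter matrix by power iteration."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) for b in range(dim)]
           for a in range(dim)]
    v = [1.0 + j for j in range(dim)]   # arbitrary start; fails if orthogonal to the answer
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:                   # degenerate cluster, keep the current vector
            break
        v = [x / norm for x in w]
    return mean, v

def within_sse(rows):
    if not rows:
        return 0.0
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return sum((r[j] - mean[j]) ** 2 for r in rows for j in range(dim))

def pca_split(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        # Step 2: pick the cluster with the largest within-cluster SSE.
        target = max(clusters, key=within_sse)
        clusters.remove(target)
        mean, v = first_direction(target)
        # Step 3: y_i is the projection of x_i on the first principal direction.
        y = [sum((p[j] - mean[j]) * v[j] for j in range(len(v))) for p in target]
        y_bar = sum(y) / len(y)
        # Step 4: split at the mean projection.
        clusters.append([p for p, yi in zip(target, y) if yi <= y_bar])
        clusters.append([p for p, yi in zip(target, y) if yi > y_bar])
    return clusters

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = pca_split(points, 2)
```

The centroids of the resulting K clusters would serve as the initial seeds for K-means.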
Initialization Strategy 12 – Steinley
1. Randomly divide the data into K clusters, where each observation xi is assigned to each cluster with equal probability
2. Compute the initial cluster seeds based on the random division in Step 1
3. Repeat multiple times, choosing the partition corresponding to the minimum value of (5)
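A minimal sketch of this multiple-restart scheme follows. The selection rule is an assumption: restarts are compared by their final SSE, in line with the "Minimize SSE" remark in the conclusion. The helper names, the number of restarts, and the toy data are hypothetical.

```python
import random

def centroids(points, labels, k):
    """Centroid of each cluster; None marks an empty cluster."""
    out = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        out.append([sum(col) / len(members) for col in zip(*members)]
                   if members else None)
    return out

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd iterations from given seeds; returns (labels, sse)."""
    labels = None
    for _ in range(max_iter):
        live = [c for c, s in enumerate(seeds) if s is not None]
        new = [min(live, key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(p, seeds[c])))
               for p in points]
        if new == labels:
            break
        labels = new
        seeds = centroids(points, labels, len(seeds))
    sse = sum((a - b) ** 2
              for p, l in zip(points, labels) for a, b in zip(p, seeds[l]))
    return labels, sse

def steinley_restarts(points, k, restarts=50, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        labels = [rng.randrange(k) for _ in points]   # Step 1: random division
        seeds = centroids(points, labels, k)          # Step 2: seeds from the division
        result = kmeans_from_seeds(points, seeds)
        if best is None or result[1] < best[1]:       # Step 3: keep the best restart
            best = result
    return best

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, sse = steinley_restarts(points, 2)
```

Since every observation is assigned at random, the initial seeds all lie near the overall centroid, and the restarts differ only through small random displacements around it.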
Equation (5)
$\frac{n_k}{n_k - 1}\, d^2(x_i, \bar{x}^{(k)}) > \frac{n_{k'}}{n_{k'} - 1}\, d^2(x_i, \bar{x}^{(k')})$
Simulation 1
• SSE (Sum of Squared Errors)
  $SSE = \sum_{j=1}^{P} \sum_{k=1}^{K} \sum_{i \in C_k} (x_{ij} - \bar{x}_j^{(k)})^2$
• ARI (Adjusted Rand Index)
  Evaluates the agreement between two partitions of a dataset
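Both measures are short to compute. The sketch below follows the SSE formula above and assumes the Hubert and Arabie form of the adjusted Rand index; the function names are hypothetical, and at least two observations are assumed.

```python
from collections import Counter
from math import comb

def sse(points, labels):
    """Sum over clusters k, variables j, and members i of (x_ij - mean_j^(k))^2."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        for j in range(len(members[0])):
            mean_j = sum(p[j] for p in members) / len(members)
            total += sum((p[j] - mean_j) ** 2 for p in members)
    return total

def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two label vectors of equal length."""
    n = len(a)
    # Pairs of objects placed together in both partitions.
    pairs = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:      # e.g. both partitions consist of a single cluster
        return 1.0
    return (pairs - expected) / (max_index - expected)

# Hypothetical toy values.
example_sse = sse([(0, 0), (0, 2), (4, 4), (6, 4)], [0, 0, 1, 1])
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # relabeled but identical
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The ARI depends only on which objects are grouped together, so relabeling the clusters does not change it.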
Simulation 1
• Number of clusters (K = 4, 6, 8)
• Number of variables (P = 4, 6, 8, 10)
• Distribution of variables
• Number of observations
• Relative cluster density
• Type of multidimensional overlap between clusters
• Probability that clusters will overlap (0, 0.1, 0.2, 0.3, 0.4)
Simulation 1 - Result
• I6 performs best overall (lowest average ranking)
• Software packages (I5 SPSS, I10 SAS) → worst performance

Ranking of Initialization Methods
Method      Average Ranking   SSE Ranking   ARI Ranking (Mean ARI)
I6          4.69              3.93          5.45 (.6678)
I12         4.74              3.78          5.71 (.6606)
I4          5.26              4.28          6.23 (.6317)
I11         5.26              4.91          5.62 (.6585)
I7          5.70              5.15          6.26 (.6409)
I2          6.54              7.27          5.81 (.6541)
I1          6.93              6.71          7.16 (.5934)
I3          7.21              6.99          7.43 (.5729)
I9          7.45              8.60          6.31 (.6313)
I8          7.52              8.54          6.49 (.6274)
I5 (SPSS)   8.01              9.32          6.71 (.6317)
I10 (SAS)   8.68              8.52          8.83 (.5214)
Simulation 1 - Result
• Between-dataset effects
  ◦ Marginal overlap has little influence when clusters do not overlap on all dimensions
  ◦ Number of clusters and number of variables → little effect on the initialization of clusters
• Within-dataset effects
  ◦ When nothing is known about the dataset → I6, I12
Simulation 1 - Result
η²      Effect size
0.01    Small
0.06    Medium
0.14    Large
• η² shows how sensitive a method is to different levels of the respective factor
• Low η² → unaffected by a change in a particular factor level
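As a reminder of what the statistic measures, η² is the share of the total sum of squares that is explained by a factor. The sketch below is a one-way simplification (the study itself varies several factors at once), and the function name and the numbers are made up for illustration.

```python
def eta_squared(groups):
    """One-way eta squared: between-level sum of squares over total sum of squares."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical outcome values observed at two levels of one factor.
eta = eta_squared([[1, 2, 3], [2, 3, 4]])
```

Here η² = 1.5 / 5.5 ≈ .27, which the thresholds above would classify as a large effect.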
η² Factor Levels by Methods
Factor                     I1    I2    I3    I4    I5    I6    I7    I8    I9    I10   I11   I12
Type of Overlap            .14   .19   .11   .13   .07   .11   .09   .11   .13   .07   .14   .14
Distribution               .09   .14   .07   .12   .07   .20   .20   .13   .08   .05   .16   .15
Probability of Overlap     .10   .11   .06   .13   .02   .14   .16   .13   .02   .15   .16   .14
Type of Overlap            .05   .07   .06   .07   .11   .10   .07   .10   .09   .04   .10   .10
Relative Cluster Density   .15   ***   .26   .14   .06   .03   .03   .04   .01   .19   .06   .07
Number of Clusters         .01   .02   .03   .02   .01   .02   .02   .04   .02   .02   .02   .02
Number of Variables        ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***   ***
*** = < .01
Simulation 1 - Result
• When clusters are oddly shaped
  ◦ I12 performs well
  ◦ The risk in using I9 is amplified: low pairwise ARI with I6 and I12

Pairwise ARI Between Methods
       I6    I12   I4    I11   I7    I2    I1    I3    I9    I8    I5    I10
I6     1.00
I12    .85   1.00
I4     .80   .85   1.00
I11    .84   .84   .80   1.00
I7     .83   .79   .76   .80   1.00
I2     .78   .81   .49   .78   .74   1.00
I1     .73   .76   .76   .74   .70   .74   1.00
I3     .70   .72   .72   .71   .67   .69   .70   1.00
I9     .68   .69   .67   .68   .66   .72   .64   .64   1.00
I8     .71   .72   .70   .70   .68   .70   .66   .62   .62   1.00
I5     .67   .66   .63   .66   .65   .69   .60   .61   .79   .59   1.00
I10    .65   .66   .67   .66   .65   .64   .68   .66   .57   .59   .52   1.00
Simulation 2
• Number of clusters (K = 5, 10, 20)
• Number of variables (P = 25, 50, 120)
• Sample size (N = 200, 1000, 5000)
• Type of multidimensional overlap between clusters
• Probability that clusters will overlap
Simulation 2 - Result
• I6 is extraordinary
  ◦ Perfect recovery (86.4%)
  ◦ Min ARI

Cluster recovery
Simulation 1                         Simulation 2
Method   ARI Ranking (Mean ARI)      Method   Mean ARI
I6       5.45 (.6678)                I6       .9958
I12      5.71 (.6606)                I7       .9754
I4       6.23 (.6317)                I11      .8883
I11      5.62 (.6585)                I4       .8247
I7       6.26 (.6409)                I3       .7941
I2       5.81 (.6541)                I8       .7476
I1       7.16 (.5934)                I1       .6875
I3       7.43 (.5729)                I12      .6753
I9       6.31 (.6313)                I2       .6404
I8       6.49 (.6274)                I10      .5543
I5       6.71 (.6317)                I5       .5030
I10      8.83 (.5214)                I9       .4650
Computation Time - Result
• Number of clusters ↑ → average time ↑
• Number of variables ↑ → average time ↑
• Sample size ↑ → average time ↑

Computation time for Random Initialization of K-means
            N=200 (Clusters)             N=1000 (Clusters)               N=2000 (Clusters)
Variables   4        8        20         4         8         20          4         8          20
4           .0073    .2301    .8932      .0297     1.6942    4.9744      .3740     6.2391     53.6229
6           .0081    .2492    1.0383     .0333     3.5610    6.7985      .4818     7.8315     85.8936
8           .0121    .2422    1.2015     .0753     4.7215    11.0251     .7320     9.2635     95.5127
10          .0127    .2945    1.6995     .1671     4.7239    13.5869     .9853     13.9428    164.6304
15          .0143    .7928    4.1263     .4907     6.1390    37.3077     1.9994    13.9134    210.5844
125         .0806    4.0058   34.0057    1.0649    34.2305   493.9347    11.0465   465.8757   2241.9232
250         .2286    6.0753   68.1105    3.4705    85.3312   813.3204    20.3711   581.1952   3980.2358
Conclusion
• Multiple initialization is useful (I12, Steinley)
• Distribution of locally optimal solutions → Ward's method (I6)
• N × N matrix infeasible → use global K-means (I11)
• Minimize SSE → I12, Steinley

Any questions?