Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques
Authors: Douglas Steinley, Michael J. Brusco
Presented by: Maria Hyun
Date: 03/16/15

Initialization Strategy 1 - Astrahan
1. Define a distance d1, the average pairwise Euclidean distance:
   $$d_1 = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d(x_i, x_j)$$
2. For each data point x_i, count the number of other data points that lie within d1 of x_i. The point with the highest density becomes the first cluster seed.
3. The remaining K - 1 seeds are chosen in decreasing order of density, subject to each new seed being at least d2 (another prespecified distance) from all seeds already chosen.
Densest point → first seed. (A NumPy sketch of this seeding appears after the strategy slides below.)

Initialization Strategy 3 - Faber
- Initial seeds are actual data points: a random sample of K data points from the whole dataset.
- Denser areas are more likely to be represented among the starting seeds.

Initialization Strategy 6 - Milligan
- Uses the solution from Ward's (1963) hierarchical cluster analysis as the starting partition.
[Figure: example dendrogram from Ward's hierarchical clustering]

Initialization Strategy 4 - Hand & Krzanowski
1. Run K-means clustering to obtain a partition P with error SSE.
2. Set i = 1, α = .3, and β = .95.
3. Perturb the current solution by moving each object to a different cluster with probability α, then rerun K-means to obtain partition P_i with error SSE_i.
4. Set i = i + 1 and replace α by αβ. If SSE_i < SSE, then set SSE = SSE_i and P = P_i. Stop when i = 100; otherwise return to Step 3.

Initialization Strategy 9
(Principal direction: in differential geometry, one of the directions of principal curvature.)
1. Start with a single cluster and divide it into two sub-clusters.
2. The sub-cluster with the largest within-cluster variance becomes the next cluster to be divided.
3. For cluster C_k, project each x_i in C_k onto the cluster's first principal direction; the first principal component score of x_i is y_i.
4. Divide C_k into two sub-clusters: if y_i ≤ ȳ, assign x_i to C_k^(1); otherwise assign x_i to C_k^(2).
5. Repeat Steps 2-4 until K clusters are found.
(See the principal-direction splitting sketch after the strategy slides.)

Initialization Strategy 12 - Steinley
1. Randomly divide the data into K clusters, assigning each observation x_i to a cluster with equal probability.
2. Compute the initial cluster seeds from the random division in Step 1.
3. Repeat multiple times and choose the partition corresponding to the minimum value of Equation (5).
(See the multiple-random-starts sketch after the strategy slides.)
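To make the density-based procedure of Strategy 1 concrete, here is a minimal NumPy sketch of Astrahan-style seeding. It is an illustration under assumptions rather than the authors' implementation: the function name astrahan_seeds and the d2_factor used to derive d2 from d1 are mine; the slides only say that d2 is another prespecified distance.

```python
import numpy as np

def astrahan_seeds(X, K, d2_factor=1.0):
    """Density-based seed selection in the spirit of Strategy 1 (Astrahan).

    d1 is the average pairwise Euclidean distance; each point's density is
    the number of other points within d1 of it.  Seeds are taken in order
    of decreasing density, skipping candidates closer than d2 to any seed
    already chosen.  d2 = d2_factor * d1 is an assumption made here.
    """
    n = X.shape[0]
    # Full pairwise Euclidean distance matrix (fine for a small sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d1 = dist.sum() / (n * (n - 1))          # average pairwise distance
    d2 = d2_factor * d1
    density = (dist <= d1).sum(axis=1) - 1   # neighbours within d1, excluding self
    order = np.argsort(-density)             # candidates by decreasing density
    seeds = [order[0]]                       # densest point -> first seed
    for idx in order[1:]:
        if len(seeds) == K:
            break
        # Accept the candidate only if it is at least d2 from every chosen seed.
        if np.all(dist[idx, seeds] >= d2):
            seeds.append(idx)
    return X[np.array(seeds)]                # may hold fewer than K seeds if d2 is too large
```

The returned seeds can then be handed to any batch K-means routine as its starting centroids.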
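Strategy 9's divisive, principal-direction splitting can be sketched with plain NumPy as well. The function name pca_split_seeds and the choice to return the final cluster centroids as the K-means seeds are illustrative assumptions; the splitting rule itself follows the steps above.

```python
import numpy as np

def pca_split_seeds(X, K):
    """Divisive seeding sketch in the spirit of Strategy 9.

    Repeatedly take the cluster with the largest within-cluster variance,
    project its points onto their first principal direction, and split at
    the mean projection, until K clusters exist.  Assumes non-degenerate
    data (each cluster has at least two distinct points).
    """
    clusters = [X]
    while len(clusters) < K:
        # Pick the cluster with the largest total within-cluster variance.
        variances = [c.var(axis=0).sum() for c in clusters]
        C = clusters.pop(int(np.argmax(variances)))
        # First principal direction of the centered cluster via SVD.
        centered = C - C.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        y = centered @ Vt[0]                 # first principal component scores y_i
        clusters.append(C[y <= y.mean()])    # C_k^(1): y_i <= y-bar
        clusters.append(C[y > y.mean()])     # C_k^(2): otherwise
    return np.vstack([c.mean(axis=0) for c in clusters])
```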
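Finally, the multiple-random-starts idea behind Strategy 12 is easy to prototype. This sketch uses the plain SSE of each random partition as the selection criterion, a stand-in for the criterion in Equation (5); the function name steinley_random_starts and the n_starts default are likewise illustrative.

```python
import numpy as np

def steinley_random_starts(X, K, n_starts=20, rng=None):
    """Multiple random starts in the spirit of Strategy 12 (Steinley).

    Each start assigns every observation to one of K clusters with equal
    probability and computes the cluster means as candidate seeds; the
    start with the smallest SSE around those seeds is kept.  Plain SSE is
    used here as a stand-in for the criterion in Equation (5).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    best_seeds, best_sse = None, np.inf
    for _ in range(n_starts):
        labels = rng.integers(0, K, size=n)       # equal-probability random partition
        if len(np.unique(labels)) < K:            # skip draws with empty clusters
            continue
        seeds = np.vstack([X[labels == k].mean(axis=0) for k in range(K)])
        sse = sum(((X[labels == k] - seeds[k]) ** 2).sum() for k in range(K))
        if sse < best_sse:
            best_sse, best_seeds = sse, seeds
    return best_seeds                             # None only if every draw was degenerate
```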
Equation (5)

$$\frac{n_k \, d^2\!\left(x_i, \bar{x}^{(k)}\right)}{n_k - 1} > \frac{n_{k'} \, d^2\!\left(x_i, \bar{x}^{(k')}\right)}{n_{k'} - 1}$$

Simulation 1 - Evaluation Criteria
- SSE (sum of squared errors):
  $$\mathrm{SSE} = \sum_{j=1}^{P} \sum_{k=1}^{K} \sum_{i \in C_k} \left( x_{ij} - \bar{x}_j^{(k)} \right)^2$$
- ARI (Adjusted Rand Index): evaluates the consistency between two partitions of a dataset.
- Equation (5), as given above.
(A small computational sketch of SSE and the ARI follows the simulation results below.)

Simulation 1 - Design Factors
- Number of clusters (K = 4, 6, 8)
- Number of variables (P = 4, 6, 8, 10)
- Distribution of variables
- Number of observations
- Relative cluster density
- Type of multidimensional overlap between clusters
- Probability that clusters will overlap (0, .1, .2, .3, .4)

Simulation 1 - Results
- I6 consistently performs best.
- The software-package defaults, I5 (SPSS) and I10 (SAS), perform worst.

Ranking of Initialization Methods

Method      Average Ranking   SSE Ranking   ARI Ranking (Mean ARI)
I6          4.69              3.93          5.45 (.6678)
I12         4.74              3.78          5.71 (.6606)
I4          5.26              4.28          6.23 (.6317)
I11         5.26              4.91          5.62 (.6585)
I7          5.70              5.15          6.26 (.6409)
I2          6.54              7.27          5.81 (.6541)
I1          6.93              6.71          7.16 (.5934)
I3          7.21              6.99          7.43 (.5729)
I9          7.45              8.60          6.31 (.6313)
I8          7.52              8.54          6.49 (.6274)
I5 (SPSS)   8.01              9.32          6.71 (.6317)
I10 (SAS)   8.68              8.52          8.83 (.5214)

Simulation 1 - Results (Dataset Effects)
- Between-dataset effects: marginal overlap has little influence when clusters do not overlap on all dimensions; the number of clusters and the number of variables have little effect on initialization.
- Within-dataset effects: when the structure of the dataset is unknown, I6 and I12 remain effective.

Simulation 1 - Results (ω² Sensitivity)
- ω² measures how sensitive a method is to different levels of the respective factor: 0.01 is a small effect, 0.06 medium, 0.14 large.
- A low ω² means the method is largely unaffected by changes in that factor's level.

ω² by Factor and Method (*** indicates ω² < .01)

Method   Type of   Distribution   Probability   Type of   Relative Cluster   Number of   Number of
         Overlap                  of Overlap    Overlap   Density            Clusters    Variables
I1       .14       .09            .10           .05       .15                .01         ***
I2       .19       .14            .11           .07       ***                .02         ***
I3       .11       .07            .06           .06       .26                .03         ***
I4       .13       .12            .13           .07       .14                .02         ***
I5       .07       .07            .02           .11       .06                .01         ***
I6       .11       .20            .14           .10       .03                .02         ***
I7       .09       .20            .16           .07       .03                .02         ***
I8       .11       .13            .13           .10       .04                .04         ***
I9       .13       .08            .02           .09       .01                .02         ***
I10      .07       .05            .15           .04       .19                .02         ***
I11      .14       .16            .16           .10       .06                .02         ***
I12      .14       .15            .14           .10       .07                .02         ***

Simulation 1 - Results (Agreement Between Methods)
- When clusters are oddly shaped, I12 performs well.
- The risk of using I9 is amplified: its pairwise ARI with I6 and I12 is low.

Pairwise ARI Between Methods

       I6    I12   I4    I11   I7    I2    I1    I3    I9    I8    I5    I10
I6     1.00
I12    .85   1.00
I4     .80   .85   1.00
I11    .84   .84   .80   1.00
I7     .83   .79   .76   .80   1.00
I2     .78   .81   .49   .78   .74   1.00
I1     .73   .76   .76   .74   .70   .74   1.00
I3     .70   .72   .72   .71   .67   .69   .70   1.00
I9     .68   .69   .67   .68   .66   .72   .64   .64   1.00
I8     .71   .72   .70   .70   .68   .70   .66   .62   .62   1.00
I5     .67   .66   .63   .66   .65   .69   .60   .61   .79   .59   1.00
I10    .65   .66   .67   .66   .65   .64   .68   .66   .57   .59   .52   1.00

Simulation 2 - Design Factors
- Number of clusters (K = 5, 10, 20)
- Number of variables (P = 25, 50, 120)
- Sample size (N = 200, 1000, 5000)
- Type of multidimensional overlap between clusters
- Probability that clusters will overlap

Simulation 2 - Results
- I6 performs extraordinarily well: perfect recovery in 86.4% of cases and the best cluster recovery, including by minimum ARI.

Cluster Recovery: Simulation 1 vs. Simulation 2 (ordered by Simulation 2 mean ARI)

Method   Sim. 1 ARI Ranking (Mean ARI)   Sim. 2 Mean ARI
I6       5.45 (.6678)                    .9958
I7       6.26 (.6409)                    .9754
I11      5.62 (.6585)                    .8883
I4       6.23 (.6317)                    .8247
I3       7.43 (.5729)                    .7941
I8       6.49 (.6274)                    .7476
I1       7.16 (.5934)                    .6875
I12      5.71 (.6606)                    .6753
I2       5.81 (.6541)                    .6404
I10      8.83 (.5214)                    .5543
I5       6.71 (.6317)                    .5030
I9       6.31 (.6313)                    .4650
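Below is a minimal sketch of how the two evaluation criteria used in the simulations can be computed, assuming scikit-learn is available for the Adjusted Rand Index; the toy data, random seeds, and variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def sse(X, labels, K):
    """SSE: squared deviations from each cluster's mean, summed over
    variables, clusters, and cluster members (the criterion defined above)."""
    return float(sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                     for k in range(K)))

# Toy check on two well-separated groups of 50 points in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
true_labels = np.repeat([0, 1], 50)
found_labels = true_labels.copy()            # pretend a method recovered the partition
print("SSE:", sse(X, true_labels, K=2))
print("ARI:", adjusted_rand_score(true_labels, found_labels))   # 1.0 = perfect recovery
```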
Computation Time - Results
- More clusters → longer average time.
- More variables → longer average time.
- Larger samples → longer average time.

Computation Time for Random Initialization of K-means

                     N = 200                       N = 1000                       N = 2000
Variables   K=4      K=8      K=20       K=4       K=8       K=20       K=4       K=8        K=20
4           .0073    .2301    .8932      .0297     1.6942    4.9744     .3740     6.2391     53.6229
6           .0081    .2492    1.0383     .0333     3.5610    6.7985     .4818     7.8315     85.8936
8           .0121    .2422    1.2015     .0753     4.7215    11.0251    .7320     9.2635     95.5127
10          .0127    .2945    1.6995     .1671     4.7239    13.5869    .9853     13.9428    164.6304
15          .0143    .7928    4.1263     .4907     6.1390    37.3077    1.9994    13.9134    210.5844
125         .0806    4.0058   34.0057    1.0649    34.2305   493.9347   11.0465   465.8757   2241.9232
250         .2286    6.0753   68.1105    3.4705    85.3312   813.3204   20.3711   581.1952   3980.2358

Conclusion
- Multiple initialization is useful (I12, Steinley).
- Distribution of locally optimal solutions → Ward's method (I6).
- N × N matrix infeasible → use global K-means (I11).
- Minimize SSE → I12 (Steinley).

Any questions?