Sampling Within k-Means Algorithm to Cluster Large Datasets

Team Members: Jeremy Bejarano (1), Koushiki Bose (2), Tyler Brannan (3), Anita Thomas (4)
Faculty Mentors: Kofi Adragni (5) and Nagaraj K. Neerchal (5)
Client: George Ostrouchov (6)

(1) Department of Mathematics, Brigham Young University
(2) Department of Applied Mathematics, Brown University
(3) Department of Mathematics, North Carolina State University
(4) Department of Applied Mathematics, Illinois Institute of Technology
(5) Department of Mathematics and Statistics, University of Maryland, Baltimore County
(6) Oak Ridge National Laboratory

Technical Report HPCF–2011–12, www.umbc.edu/hpcf > Publications

Abstract

Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm with comparable accuracy.

Key words. k-means, clustering large datasets, sample size, tolerance and confidence intervals.

AMS subject classifications (2010). 65C20, 62H30, 62D05, 62F25.

1 Introduction

Scientific researchers are currently collecting vast amounts of data that need analysis. For example, NASA's Earth Science Data and Information System Project collects data from seven satellites called the Afternoon Constellation. Each satellite simultaneously measures atmospheric, oceanic, and chemical readings from different parts of the world. In total, they transmit three Terabytes (1 Terabyte = 1024 Gigabytes) of data per day to Earth. Datasets such as these are difficult to move or analyze.

Analysts often use clustering algorithms to group data points with similar attributes. Though various clustering algorithms exist, our research focuses on Lloyd's k-means algorithm, which iterates until it converges. It must calculate N · k distances per iteration, where N is the total number of data points to be assigned to k clusters. These distance calculations are time-intensive, especially with multi-dimensional data. Dr. George Ostrouchov suggested that sampling could ease this burden. The key challenge was to find a sample size that was large enough to yield accurate results but small enough to outperform the standard k-means' runtime.

To counter this problem, we introduced a sampling function within the k-means algorithm. The sample size n is calculated after every iteration. A random sample of size n is then used in the next iteration of the algorithm. Once the algorithm converges on our sample, the entire dataset is classified into clusters, and the k centers of the clusters are calculated. We perform a simulation study to compare the efficiency and accuracy of our algorithm, henceforth referred to as the Sampler, to the results obtained by simply running the standard k-means, henceforth called the Standard, on the entire dataset.
Both best and average accuracy on a scale from 0 to 100 were examined for each algorithm after each simulation's twenty trials. Table 1.1 lists the absolute difference between the two algorithms' accuracy values for each number of dimensions d. The algorithms have almost identical accuracies.

    d          1        2        3        4
    Best       0.0000   0.0000   0.0001   0.0000
    Average    0.0004   0.0001   1.2570   0.0062

Table 1.1: Absolute Difference Between the Standard and Sampler Accuracy Values (in Percentages).

Let t1 be the Standard's total runtime and t2 be the Sampler's total runtime. Figure 1.1 plots the ratio t1/t2 versus the number of dimensions; the vertical coordinate thus indicates that the Sampler was t1/t2 times faster than the Standard. Therefore, we conclude that the Sampler algorithm matches the accuracy of k-means while significantly decreasing runtime.

Figure 1.1: Ratio of Total Times of Twenty Trials (t1/t2).

The report is organized as follows. Section 2 introduces this problem in the context of both k-means algorithms and hypothesizes sampling as a solution. Section 3 discusses the implementation of sampling within the k-means algorithm and how the sample size is determined. Section 4 presents the results of the simulation study for datasets having different cluster patterns and dimensions, examining the accuracy and runtimes of our methods in comparison to the standard k-means algorithm found in R [4]. Section 5 offers conclusions.

2 The Algorithm and Incorporation of Sampling

2.1 The k-Means Algorithm

The k-means algorithm [1] is a heuristic method of cluster analysis: the grouping of data points to illustrate underlying similarities. Given a set of observations X = (x_1, x_2, ..., x_N) and a set of k means, X̄ = (x̄_1, x̄_2, ..., x̄_k), this algorithm partitions the N observations into k clusters, S_1, S_2, ..., S_k, with k ≤ N. Clusters are determined based on each observation's Euclidean distance from each mean. The distance of every observation in X from each mean in X̄ is computed, and each observation is assigned to the cluster S_j whose center is closest to it, thereby minimizing the within-cluster sum of squares. Forming clusters in this manner is known as the assignment step. The update step involves computing new centers by finding the means of all the observations in a particular cluster. The assignment and update steps are performed iteratively until convergence occurs. Convergence is said to occur when, in a particular iteration, no data point is reclassified to a new cluster. Typically, k-means is run multiple times and the "best classification" is recorded, since a poor choice of initial centers can result in incorrect classification and may even prevent the algorithm from converging. The "best classification" is determined by comparing the within-cluster sum of squares for each run; the run having the minimum within-cluster sum of squares is identified as the "best classification."

2.2 Sampling Within k-Means

Our algorithm, the Sampler, is a modification of Lloyd's k-means algorithm [2] found in the software R [4]. Initially, the assignment step runs on only a minute sample of the dataset, as discussed in Section 3.3. Once these points are classified, the algorithm proceeds to the update step. Next, the sample is updated using a function named block_sample, whose purpose is to determine the size of the sample needed for the next iteration. Section 3.1 discusses the block_sample function in detail.
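To make this iteration concrete, the following R sketch shows one possible shape of the Sampler's main loop. It is a minimal illustration written for this report, not the implementation used in the study: the argument next_size is a hypothetical stand-in for the block_sample rule of Section 3.1, and its default simply keeps the sample size fixed.

    # Minimal sketch of the Sampler's main loop (illustrative only).
    # data:      an N x d matrix, assumed centered, sphered, and randomly permuted
    # centers:   a k x d matrix of initial cluster centers
    # next_size: stand-in for block_sample (Section 3.1); the default keeps n fixed
    sampler_kmeans_sketch <- function(data, centers, n0 = 1000, max_iter = 250,
                                      next_size = function(sample_pts, assign, n) n) {
      N <- nrow(data); k <- nrow(centers)
      n <- min(n0, N)
      old_assign <- NULL
      for (iter in seq_len(max_iter)) {
        sample_pts <- data[seq_len(n), , drop = FALSE]   # first n rows of the permuted data
        # Assignment step: squared Euclidean distance from each sample point to each center
        d2 <- sapply(seq_len(k), function(j) rowSums(sweep(sample_pts, 2, centers[j, ])^2))
        assign <- max.col(-d2)                           # index of the nearest center
        # Update step: each center becomes the mean of the points assigned to it
        for (j in seq_len(k)) {
          if (any(assign == j)) centers[j, ] <- colMeans(sample_pts[assign == j, , drop = FALSE])
        }
        # Convergence check (sketch): assignments unchanged on the points shared with the previous sample
        m <- min(length(assign), length(old_assign))
        if (m > 0 && all(assign[seq_len(m)] == old_assign[seq_len(m)])) break
        old_assign <- assign
        n <- min(N, next_size(sample_pts, assign, n))    # sample size for the next iteration
      }
      # After convergence: classify all N points and recompute the k centers one last time
      # (done in one pass here for simplicity)
      d2 <- sapply(seq_len(k), function(j) rowSums(sweep(data, 2, centers[j, ])^2))
      cluster <- max.col(-d2)
      for (j in seq_len(k)) {
        if (any(cluster == j)) centers[j, ] <- colMeans(data[cluster == j, , drop = FALSE])
      }
      list(centers = centers, cluster = cluster)
    }

The key point of the design is visible in the sketch: the loop only ever touches the leading n rows of the permuted data, and the full dataset is read once at the end.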
There are two difficulties in randomly picking a sample from the dataset in each iteration: choosing random points from a large dataset is time-consuming, and the selection must be made without replacement. These difficulties are especially pronounced because the target datasets are extremely large, which aggravates what are referred to as NUMA (Non-Uniform Memory Access) issues: if we analyze only a small sample randomly scattered across the whole dataset, the Sampler loses considerable time efficiency as it jumps around to read these data points.

To minimize runtime, we estimate the maximum sample size n* that the algorithm will need. Additionally, memory is saved since a new matrix containing only the first n* data points of a random permutation of the original dataset is stored, rather than all N points. We estimate the maximum size as

    n^* = k \left[ \frac{1}{N} + \left( \frac{w}{2 z^*} \right)^2 \right]^{-1},    (2.1)

where the values of w and z* depend on the desired width of a confidence interval for the cluster means. In constructing this confidence interval, we take the standard deviation to be one, which is the maximum since the data have been centered and sphered. Additionally, to keep our calculations consistent with Section 3.1, we apply a finite population correction, taking N as our population cluster size, so as to overestimate rather than underestimate the maximum sample size. Thus, only n* points are randomly chosen from our data and stored in a new dataset, where n* ≪ N. When a sample of size n is needed, the first n points of this new dataset are used. This prevents repeated NUMA issues because the part of the dataset that the algorithm uses is stored contiguously in memory; we only need to access consecutive points, which reduces runtime.

When the algorithm converges, i.e., no points change their classification in a particular iteration, we have a subset of the population that has been clustered into k clusters. On the basis of the centers obtained, all N points in the dataset are then classified, and k new centers are computed one last time from these clusters. These k centers and k clusters are the final output of our sampling algorithm.

3 Calculating Sample Size

3.1 Determination of Sample Size within Each Iteration

To cluster all the observations in our population, we need to find the centers (means) of each cluster using k-means. To do this, we construct a confidence interval of a chosen width around the mean of each cluster in the sample and estimate the sample size needed for that confidence interval. For each of the k clusters, we use the sample mean x̄_j (for j = 1, 2, ..., k) as an estimate of the corresponding population cluster mean. For large samples, the central limit theorem implies that each sample mean follows an approximate normal distribution. Therefore, given a confidence level C, the confidence interval has the form

    \bar{x}_j \pm z^* \frac{\sigma_j}{\sqrt{n_j}}, \qquad j = 1, \ldots, k,

where σ_j is the population standard deviation for cluster j, n_j is the size of the sample needed from cluster j, and z* is the upper (1 − C)/2 critical value of the standard normal distribution. The value of z* is determined within the algorithm from the confidence level input by the user.
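As a small numerical check of these quantities, z* and the maximum sample size n* of Equation (2.1) can be evaluated directly. The snippet below is an illustration using the settings of our study (C = 0.95, w = 0.01, k = 4, N = 119,603,200), not code from the Sampler itself.

    # Critical value z* and maximum sample size n* of Eq. (2.1), taking sigma = 1
    # because the data are centered and sphered.  Settings match the study in Section 4.
    C <- 0.95; w <- 0.01; k <- 4; N <- 119603200
    z_star <- qnorm(1 - (1 - C) / 2)                    # upper (1 - C)/2 critical value, about 1.96
    n_star <- k * (1 / N + (w / (2 * z_star))^2)^(-1)   # Eq. (2.1)
    n_star                                              # approximately 613,845, the value reported in Section 4.3

At roughly 0.5% of N, this bound is what allows the permuted subset described in Section 2.2 to be kept small.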
In our problem, we are drawing samples without replacement; that is, in every iteration, the additional points we choose do not belong to our existing sample set. We also consider each sample point as coming from a cluster in the population. However, we do not know the sizes of the clusters in the population. Because of this, our sample size may not remain small in comparison with the population size, and we therefore make use of a finite population correction (fpc) [3]. Since we are applying a confidence interval to the mean of every cluster, we need to know the population size corresponding to every cluster. We call these population cluster sizes N_j. Since we do not know the exact clusters in the population, we estimate N_j from the quantity n_{c_j}/n, the proportion of sample points that were classified into cluster j in the previous sample; here n_{c_j} is the number of points of the previous sample classified into cluster j by the previous iteration, and n is the previous sample size. The estimated population cluster sizes are then

    N_j = \frac{n_{c_j}}{n} N.    (3.1)

Hence, our confidence interval takes the form

    \bar{x}_j \pm z^* \frac{\sigma_j}{\sqrt{n_j}} \sqrt{\frac{N_j - n_j}{N_j}},

which can be simplified to

    \bar{x}_j \pm z^* \sigma_j \sqrt{\frac{1}{n_j} - \frac{1}{N_j}}.

If the chosen width of the above confidence interval is to be w, we get

    w = 2 \left( z^* \sigma_j \sqrt{\frac{1}{n_j} - \frac{1}{N_j}} \right).    (3.2)

Solving this expression for n_j gives

    n_j = \left[ \frac{1}{N_j} + \left( \frac{w}{2 z^* \sigma_j} \right)^2 \right]^{-1}.    (3.3)

Once the sample size for each cluster is determined, the sample size n to be used in the next iteration is

    n = \sum_{j=1}^{k} n_j.    (3.4)

3.2 Estimation of Variance

Within each iteration of our Sampler algorithm, we recalculate the sample size for the desired confidence interval. As shown in Equation (3.3), this calculation requires an estimate of σ_j for each cluster S_j, j = 1, 2, ..., k. Appealing to the central limit theorem, we treat the points within each cluster, x ∈ S_j with x ∈ R^d, as normally distributed along each dimension, centered at the true cluster center μ_j ∈ R^d with covariance Σ_j ∈ R^{d×d}. We also assume that the dimensions are independent, so that Σ_j = diag(σ_{1,j}^2, ..., σ_{d,j}^2). Therefore,

    S_j^i \sim N\left( \mu_j^i, \sigma_{i,j}^2 \right).    (3.5)

Note that throughout this section we use superscripts to denote a projection onto the i-th component dimension: thus x ∈ R^d but x^i ∈ R for i = 1, ..., d, and S_j^i is the set (here the j-th cluster) whose elements have been projected onto the i-th dimension. In our algorithm we chose to estimate σ_{i,j} using the rough estimator

    6 \hat{\sigma}_{i,j} = \max_{x \in S_j} x^i - \min_{x \in S_j} x^i.    (3.6)

This leaves us with d·k estimates of standard deviations. Each σ̂_{i,j}, for j = 1, 2, ..., k and i = 1, ..., d, could potentially be used to calculate a sample size in Equation (3.3); however, we wish to produce only one estimate σ̂_j for each cluster. We thus define the estimate of standard deviation (only in the sense that we use it to determine sample size) for each cluster as the largest estimate among the dimensions:

    \hat{\sigma}_j := \max_i \hat{\sigma}_{i,j}.    (3.7)
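Combining Sections 3.1 and 3.2, the sample size for the next iteration can be computed from the previous sample alone. The sketch below is one illustrative reading of this calculation, following Equations (3.1), (3.3), (3.4), (3.6), and (3.7); the function name and interface are ours, not the actual block_sample code.

    # Illustrative sample-size rule following Eqs. (3.1), (3.3), (3.4), (3.6), and (3.7).
    # sample_pts: previous sample (n x d matrix); assign: its cluster labels in 1..k;
    # N: population size; w, C: CI width and confidence level (study settings as defaults).
    block_sample_size_sketch <- function(sample_pts, assign, k, N, w = 0.01, C = 0.95) {
      n <- nrow(sample_pts)
      z_star <- qnorm(1 - (1 - C) / 2)
      n_next <- 0
      for (j in seq_len(k)) {
        pts <- sample_pts[assign == j, , drop = FALSE]
        if (nrow(pts) < 2) next                          # skip empty/degenerate clusters in this sketch
        # Eqs. (3.6)-(3.7): range-based estimate per dimension, then the largest across dimensions
        sigma_hat <- max(apply(pts, 2, function(x) (max(x) - min(x)) / 6))
        if (sigma_hat == 0) next                         # avoid division by zero for a constant cluster
        N_j <- (nrow(pts) / n) * N                       # Eq. (3.1): estimated population cluster size
        n_j <- (1 / N_j + (w / (2 * z_star * sigma_hat))^2)^(-1)   # Eq. (3.3), with fpc
        n_next <- n_next + ceiling(n_j)                  # Eq. (3.4): total over the k clusters
      }
      min(n_next, N)
    }

Wrapped so that k, N, w, and C are fixed, such a rule could play the role of next_size in the loop sketch of Section 2.2.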
3.3 Initial Sample Size

In the first iteration of our algorithm, the Sampler clusters a minute sample from the dataset. Our block_sample function has yet to run; thus we need an initial sample size. That size matters because block_sample uses information from the previous iteration. Specifically, if p_1, p_2, ..., p_k are the true proportions of the population data points in the k clusters, then we want our initial sample to contain approximately the proportions p_1, p_2, ..., p_k of points from the corresponding clusters. To obtain a representative first sample, we construct a confidence interval for the proportions [5]. Choosing a confidence level C and a width w, the sample size is determined as follows. The value z* in Equations (3.8) and (3.9) is the upper (1 − C)/2 critical value of the standard normal distribution, and we use the worst-case scenario p = 0.5 to determine the variance, so that we obtain a conservative estimate of the required sample size:

    w = 2 \left( z^* \sqrt{\frac{p(1-p)}{n}} \right),    (3.8)

    n = \left( \frac{2 z^* \sqrt{p(1-p)}}{w} \right)^2.    (3.9)

Solving for n with w = 0.06 and C = 95%, we get n ≈ 1000. Our code takes this number as given. Note that if the size of the dataset is less than 1000, the Sampler simply uses the entire dataset, no differently than the Standard.

4 Simulation Study

4.1 Accuracy and Efficiency of the k-Means Sampler

In examining the accuracy and efficiency of the two algorithms, we perform studies to compare the classifications, cluster centers, and runtimes. Accuracy (Acc) is measured as the percentage of points correctly classified. The Euclidean norm between the true centers μ_j of our generated dataset and the centers x̄_j determined by the algorithms is used as a measure of error:

    \mathrm{Err} = \sum_{j=1}^{k} \lVert \bar{x}_j - \mu_j \rVert.    (4.1)

The time taken is simply the wall clock time observed for each of the two algorithms. Note that for the Sampler, the time taken to randomize the data, as discussed in Section 2.2, is included. Randomization is essential for this algorithm in case the data have underlying patterns, so this time cannot be ignored when making comparisons. Additionally, the time taken to center and sphere the data is included for both algorithms.

4.2 Data Generation

For our study we generate test data as follows. The points within each cluster, x ∈ S_j with x ∈ R^d, are drawn from normal distributions: along each dimension they are normally distributed, centered at the true cluster center μ_j ∈ R^d with covariance Σ_j ∈ R^{d×d} (using the same notation as in Section 3.2). Each dimension is generated independently, so Σ_j = diag(σ_{1,j}^2, ..., σ_{d,j}^2). Again,

    S_j^i \sim N\left( \mu_j^i, \sigma_{i,j}^2 \right).    (4.2)

All points generated by the parameters (μ_j^i, σ_{i,j}) are said to have a true classification of j. A short generation sketch is given after Table 4.1 below.

4.3 Results

Our first study examined four datasets, of one, two, three, and four dimensions. We used a 95% confidence level and a width of 0.01 in all our studies. The cluster sizes were all equal in this study. Results for N = 119,603,200 and k = 4, with each algorithm run 20 times on the same dataset, are given in Tables 4.1 and 4.2. The maximum sample size n* calculated for all our datasets is 613,845. Notice that this is only 0.5132% of the entire dataset, which assists in the speedup of our algorithm. For the 20 trials, Table 4.1 gives the best results obtained from each of the two algorithms for each dataset, and Table 4.2 lists the averages and standard errors (s.e.) of the three statistics reported. In each of the 20 trials, both algorithms were provided with identical initial centers, with different centers across trials. "Best results" refer to the particular run of the algorithm that gave the minimum within-cluster sum of squares.

                             Acc       Err       Time (s)
    (a) One Dimension
        k-Means              99.0679   0.0235     1113
        k-Means Sampler      99.0679   0.02277     419
    (b) Two Dimensions
        k-Means              98.7610   0.0331     1619
        k-Means Sampler      98.7610   0.0331      770
    (c) Three Dimensions
        k-Means              99.8563   0.0044    11787
        k-Means Sampler      99.8562   0.0044     1151
    (d) Four Dimensions
        k-Means              99.9719   0.0039    15729
        k-Means Sampler      99.9719   0.0039     1830

Table 4.1: Best Results Obtained from Generated Data: Comparison with True Clusters.
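As promised above, here is one way to generate test data in the style of Section 4.2. The cluster sizes, means, and standard deviations below are invented for illustration and are far smaller than the datasets of the study; this is a sketch, not the generation code we used.

    # Illustrative generator in the style of Section 4.2: within cluster j, dimension i
    # is drawn independently from N(mu[j, i], sigma[j, i]^2).  Parameter values are made up.
    generate_clusters <- function(sizes, mu, sigma) {
      k <- length(sizes)
      d <- ncol(mu)
      x <- do.call(rbind, lapply(seq_len(k), function(j) {
        sapply(seq_len(d), function(i) rnorm(sizes[j], mean = mu[j, i], sd = sigma[j, i]))
      }))
      list(x = x, truth = rep(seq_len(k), times = sizes))   # points and their true classifications
    }

    set.seed(1)
    mu    <- rbind(c(-3, -3), c(-3, 3), c(3, -3), c(3, 3))  # four well-separated 2-d centers
    sigma <- matrix(1, nrow = 4, ncol = 2)
    toy   <- generate_clusters(sizes = rep(2500, 4), mu = mu, sigma = sigma)

With the true labels in hand, Acc is the percentage of points whose assigned cluster matches the truth, and Err follows Equation (4.1).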
The third column in Table 4.1 gives the total time taken for all 20 runs. This is significant since it allows us to compare the times taken by both algorithms in achieving similar accuracies and errors. The two algorithms have nearly identical errors and accuracies, but the Sampler significantly and consistently outperforms the Standard in terms of time.

We notice a marked difference in results as we proceed to higher dimensions. In the case of three-dimensional data, both algorithms perform less well on average, as seen in Table 4.2 (c). However, if we examine the best results, both algorithms achieve 99.86% accuracy and Err = 0.0044. Thus, eventually, both algorithms give good results but show a large difference in time: the Sampler takes only 1,151 seconds for 20 trials, compared to 11,787 seconds for the Standard, as seen in Table 4.1 (c). Therefore, for three dimensions, it is considerably more efficient to use the Sampler. Similar conclusions can be drawn from the four-dimensional data in Tables 4.1 (d) and 4.2 (d).

Our second study was on datasets with different types of clusters. These datasets also had 119,603,200 points. Although the data were generated in a way similar to the first study, the cluster sizes were not equal, and there were other minor changes in parameters. Results for these datasets, given in Tables 4.3 and 4.4, are very similar to the ones observed in Tables 4.1 and 4.2. Once again, best results, accuracies, errors, means, and standard errors are all very similar for both algorithms. Likewise, we again notice a significant decrease in time when the Sampler is used.

Figure 4.1 summarizes the time results for the data used in the first study. It plots the times taken for all 20 trials for both algorithms against the number of dimensions. Notice that although the runtime of both algorithms increases with dimension, this increase is minimal for the k-means Sampler, which grows approximately linearly.

Results for two of the 20 trials from the three-dimensional data are given in Table 4.5. Table 4.5 (a) gives the results for a particular run in which both algorithms performed well; Table 4.5 (b) gives results for a particular run in which both performed poorly.

                             Acc       s.e.(Acc)   Err       s.e.(Err)   Time (s)   s.e.(Time)
    (a) One Dimension
        k-Means              99.0679   0.0000      0.0235    0.0000       55.6679     2.2271
        k-Means Sampler      99.0675   0.0001      0.0259    0.0008       20.9617     0.1414
    (b) Two Dimensions
        k-Means              98.7610   0.0000      0.0331    0.0000       80.9680     1.4953
        k-Means Sampler      98.7611   0.0000      0.0332    0.0000       38.5109     0.1475
    (c) Three Dimensions
        k-Means              76.1653   5.7822     10.0762    2.3218      589.3905   112.7390
        k-Means Sampler      77.4223   5.4655      9.9661    2.3000       57.5484     0.2734
    (d) Four Dimensions
        k-Means              69.9848   6.9862     12.9450    2.9797      786.4350   148.6328
        k-Means Sampler      69.9910   6.9855     12.9360    2.9776       91.4829     2.5904

Table 4.2: Summary of Results Obtained from Generated Data: Comparison with True Clusters.

                             Acc       Err       Time (s)
    (a) Two Dimensions
        k-Means              98.7328   0.1371     1873
        k-Means Sampler      98.7324   0.1376      778
    (b) Three Dimensions
        k-Means              99.9092   0.0275     8455
        k-Means Sampler      99.9092   0.0276     1161

Table 4.3: Best Results Obtained from Second Version of Generated Data: Comparison with True Clusters.
                             Acc       s.e.(Acc)   Err       s.e.(Err)   Time (s)   s.e.(Time)
    (a) Two Dimensions
        k-Means              98.7328   0.0000      0.1371    0.0000       93.6284     3.0807
        k-Means Sampler      98.7329   0.0004      0.1369    0.0004       38.8944     0.0805
    (b) Three Dimensions
        k-Means              88.8786   4.4288     13.9884    4.6640      422.7588   102.2657
        k-Means Sampler      88.8703   4.4272     13.9645    4.6713       58.0578     0.3901

Table 4.4: Summary of Results Obtained from Second Version of Generated Data: Comparison with True Clusters.

                             Acc       Err        Time (s)
    (a) Correct Clusters
        k-Means              99.8564   0.0044       95.0131
        k-Means Sampler      99.8564   0.0044       56.6900
    (b) Incorrect Clusters
        k-Means              37.5989  21.3402     1090.0576
        k-Means Sampler      37.7884  21.4123       57.1936

Table 4.5: Example Differences in Runtime.

Figure 4.1: Variation in Time for Best Results Across Dimensions.

These results illustrate two vital points. First, in all 120 trials over our six datasets, both algorithms either work extremely well or quite poorly; in none of our trials did one algorithm do well while the other did poorly. When we ran our studies, we noticed that the poor performance can be attributed to poor random initial centers, e.g., starting with three centers from the same true cluster. Since both algorithms begin with identical centers in every trial (but different centers between trials), if our algorithm chooses a sample that is representative of the entire dataset, it follows that both algorithms should perform similarly. Thus, the fact that both algorithms either succeed or fail together in each trial gives further credence to the assertion that our algorithm can match k-means in terms of accuracy.

Second, Table 4.5 alerts the reader to why the Sampler has drastically better times in three and four dimensions. When the algorithms fail, they both usually reach the maximum number of iterations, 250 in our studies, or a number close to it. However, the Sampler runs those iterations on no more than n* = 613,845 points, while the Standard uses all 119,603,200 points. Therefore, the Standard takes significantly more time to arrive at a wrong answer. As mentioned in Section 2.1, k-means is usually run multiple times because of the possibility of poor random initial centers. This time issue is therefore quite important: an analyst may run k-means five times to ensure the best results, and if the Standard fails even once, the runtime will be significantly longer than for the Sampler.

5 Conclusion

Further work on this project might include a more comprehensive study, both on more varied test datasets and on real weather datasets. This is especially important considering that this preliminary study was performed on rather tame datasets. These studies should also analyze the performance of the algorithm for varied values of k. Lastly, this paper showed that the algorithm was accurate for relatively low sample sizes. We would like to analyze this further to see how accurate the algorithm is for even lower sample sizes: by manipulating the width and confidence level, we could find the lowest sample sizes for which the algorithm remains acceptably accurate.

For our algorithm to be a success, it needs to meet two benchmarks: match the accuracy of the standard k-means algorithm and significantly reduce runtime. Both goals are accomplished for all six datasets analyzed. However, on datasets of three and four dimensions, as the data become more difficult to cluster, both algorithms fail to obtain the correct classifications on some trials.
Nevertheless, our algorithm consistently matches the performance of the standard algorithm while being remarkably more efficient in time. Therefore, we conclude that analysts can use our algorithm, expecting accurate results in considerably less time.

Acknowledgments

This research was conducted during Summer 2011 in the REU Site: Interdisciplinary Program in High Performance Computing (www.umbc.edu/hpcreu) in the UMBC Department of Mathematics and Statistics, funded by the National Science Foundation (grant no. DMS–0851749). This program is also supported by UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). The computational hardware in HPCF (www.umbc.edu/hpcf) is partially funded by the National Science Foundation through the MRI program (grant no. CNS–0821258) and the SCREMS program (grant no. DMS–0821311), with additional substantial support from UMBC.

References

[1] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. Le Cam and J. Neyman, eds., vol. 1, University of California Press, Berkeley, CA, 1967.

[2] S. Lloyd, Least Squares Quantization in PCM, IEEE Transactions on Information Theory, 28 (1982), pp. 128–137.

[3] N. K. Neerchal and S. P. Millard, Environmental Statistics, CRC Press, 2001.

[4] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

[5] D. D. Wackerly, W. Mendenhall, and R. L. Scheaffer, Mathematical Statistics with Applications, seventh ed., Cengage Learning, 2008.