A Clustering Method with Efficient Number of Clusters
Selected Automatically Based on Shortest Path
Makki Akasha, Ibrahim Musa Ishag, Dong Gyu Lee, Keun Ho Ryu
Database/Bioinformatics Laboratory
Chungbuk National University
Cheongju, Korea
{Makki, Ibrahim, dglee, khryu}@dblab.chungbuk.ac.kr
Abstract—The proposed method finds the optimal number of clusters in large datasets efficiently, without any user intervention, based on the relationships among the data objects. The method is divided into two main steps. The first is a filtering step, which uses the shortest path between data objects. The second is a clustering step, which uses the mean distance along the optimal route to obtain the number of clusters. The main advantage of this algorithm is its ability to detect the typical number of clusters among the objects in a dataset. Theoretical analysis and empirical evidence show that our method can self-generate the cluster groups more efficiently than other methods. We expect these results to be of interest to researchers and practitioners because they suggest a simple but elegant and effective alternative for clustering large datasets.
Keywords—clustering; data mining; shortest path
Introduction
The problem of clustering datasets has become very important. Clustering algorithms divide a dataset into subsets or classes. They have been used in many applications such as knowledge discovery, compression, and medical applications. Data objects with many attributes, i.e., high dimensionality, can be represented in a multidimensional vector space.
Figure 1. Representation of data in two-dimensional space
The main objective of clustering is to find a rational and valid organization of the data based on the relationships among the data objects. Objects within one cluster are more similar to each other than to objects belonging to different clusters or classes. Traditional clustering algorithms can be divided into two types: hierarchical (agglomerative) clustering and divisive clustering [1]. In agglomerative clustering, the number of clusters does not need to be specified manually; only local neighbors are considered at each step. Divisive clustering comes in two types: crisp clustering, where each object belongs to exactly one cluster, and fuzzy clustering, where each object belongs to every cluster to a certain degree. The disadvantages of the divisive approach are the difficulty of determining the number of clusters and its sensitivity to noise and outliers [2].
Figure 2. How a genetic algorithm works to find an optimal solution
Proceedings of the 1st International Conference on Emerging Databases (EDB2009)
Genetic algorithms are general methods for searching for solutions to a problem in a large space of candidate solutions, as shown in Figure 2 [3]. They apply genetic operators such as selection, crossover, and mutation to solve the problem. Every solution has a fitness value that depends on the problem definition.
Figure 3. Searching space.
For example, in Figure 3 the fitness function at point x = 809 has a small value. Solutions are used to produce the next generation by reproduction, and solutions with higher fitness values have a greater chance to reproduce. A solution, or chromosome, can be represented with non-binary numbers of integer or floating-point type.
Proposed Method
This section explains our new method, a clustering method that selects the number of clusters efficiently based on the shortest path. The method tries to find the optimal number of clusters automatically from the relationships among the data objects, and it is divided into two main steps. The first is a filtering step, which finds the strongest relationships among the objects in the dataset using the shortest path. It begins by reading the data objects and computing the relationships among them using Euclidean distance. A traveling salesman formulation generates sample solutions over the known relationships shown in Figures 1 and 6(a). The genetic algorithm computes a fitness value for every solution. Two solutions are selected, producing a newborn solution that replaces one of its parents. This process continues until the genetic algorithm finds the best solution, i.e., the strongest relationship, which satisfies equation (1) and is shown in Figure 8.
Route_min = min( Σ_{i=1}^{n-1} d(x_i, x_{i+1}) + d(x_1, x_n) )
(1)
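Equation (1) minimizes the total tour length over candidate routes. As an illustrative sketch (not the authors' MATLAB code), the Euclidean relationships and the route cost can be computed as follows:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two data objects (feature vectors).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def route_length(points, order):
    # Total length of a closed route visiting the points in the given order,
    # i.e. the quantity minimized in equation (1); the wrap-around edge
    # corresponds to the d(x_1, x_n) term.
    total = 0.0
    for i in range(len(order)):
        a = points[order[i]]
        b = points[order[(i + 1) % len(order)]]
        total += euclidean(a, b)
    return total
```

The genetic algorithm then searches over permutations `order` for the one minimizing `route_length`.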
Second, after the filtering step, the clustering step finds the clusters that can be detected in the dataset. Our approach uses the mean edge distance of the shortest path, given by the following equation:

AVG = DP_min / n
(2)

where DP_min is the total length of the shortest path and n is the number of objects. The clustering step begins by calculating this mean. It then searches for edges longer than the mean and removes them from the path, so the shortest path splits into sub-paths after one or two iterations, as shown in Figure 7(c and d). The process then iterates on every sub-path. If the next iteration would produce more than three sub-paths containing a single object, the process stops. Figure 4 shows the steps of our proposed algorithm for obtaining the clusters from a dataset.
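One pass of the edge-removal rule above can be sketched as follows. This is a reconstruction for illustration, not the authors' code; the iteration over sub-paths and the stopping rule are applied on top of this single pass.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cut_long_edges(path):
    # Remove every edge of the path that is longer than the mean
    # AVG = DP_min / n of equation (2), and return the resulting sub-paths.
    if len(path) < 2:
        return [path]
    edges = [dist(path[i], path[i + 1]) for i in range(len(path) - 1)]
    mean = sum(edges) / len(path)  # divide by n objects, per equation (2)
    subpaths, current = [], [path[0]]
    for i, e in enumerate(edges):
        if e > mean:
            subpaths.append(current)   # cut here: start a new sub-path
            current = [path[i + 1]]
        else:
            current.append(path[i + 1])
    subpaths.append(current)
    return subpaths
```

Calling `cut_long_edges` again on each returned sub-path reproduces the iterative splitting of Figure 7.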
Agglomerative and fuzzy clustering algorithms such as k-means and c-means give many results for different choices of the number of clusters, which must then be compared to find the best one; these algorithms take more time and require user intervention [4]. We therefore propose this method to find the number of clusters automatically, based on the relationships among the objects. The details of the proposed approach are given in the following sections.
Figure 4. Proposed method steps
Genetic Components
The genetic algorithm starts by randomly selecting an initial population. Successive generations are derived by applying the selection, crossover, and mutation operators to the previous tour population. Before these operations are performed, the fitness function f_i is evaluated [1]. The method employed here first computes DP_i, the total Euclidean distance of each path, and then computes f_i using the following equation:
f_i = DP_max - DP_i
(3)
where DP_max is the longest Euclidean path length over all solutions in the population [1].
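Equation (3) can be computed directly once the path lengths are known. A minimal sketch, assuming the DP_i values have already been computed from the tours:

```python
def fitness_values(path_lengths):
    # f_i = DP_max - DP_i (equation (3)): the shortest path in the
    # population receives the highest fitness.
    dp_max = max(path_lengths)
    return [dp_max - dp for dp in path_lengths]
```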
Selection Operation
The selection operator chooses two members from the solutions available in the population to participate in the subsequent crossover and mutation operations. There are two popular selection methods. The first, called roulette selection, uses a probability based on the fitness of each solution, computed by the following equation:
P_i = f_i / Σ_j f_j
(4)
The second method, called deterministic sampling, assigns a value S_i to each solution (path), evaluated by the following equation:
S_i = TRUNC(P_i × NS)
(5)
where TRUNC truncates the real value to its integer part and NS is the number of solutions (paths). The selection operator ensures that each solution (path) participates as a parent exactly S_i times.
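Both selection schemes can be sketched as follows. This is an illustration, not the authors' code, and it assumes the total fitness is strictly positive (with equation (3) only the worst solution has zero fitness).

```python
import random

def roulette_select(fitness):
    # Roulette selection: solution i is chosen with probability
    # P_i = f_i / sum_j f_j (equation (4)).
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for i, f in enumerate(fitness):
        acc += f
        if r <= acc:
            return i
    return len(fitness) - 1

def deterministic_counts(fitness):
    # Deterministic sampling: S_i = trunc(P_i * NS) (equation (5)) gives
    # how many times each solution participates as a parent.
    total = sum(fitness)
    ns = len(fitness)
    return [int((f / total) * ns) for f in fitness]
```

In practice deterministic sampling is usually completed by filling the remaining parent slots from the largest fractional remainders, a detail the paper does not specify.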
Crossover Operation
After the selection step, the solutions pass through the crossover operation. Many crossover procedures have been proposed [1]. Figure 5 shows our proposed method.
Figure 5. Proposed method
Our proposed method is shown in Figure 5. It sometimes does not require solving any hard sub-problems, yet it can give nearly optimal solutions for data clustering.
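The paper's crossover is given only in Figure 5, which is not reproduced here. As a stand-in, order crossover (OX), a standard operator for TSP tours, illustrates how two parent tours combine into a valid child tour; it is not necessarily the authors' exact procedure.

```python
import random

def order_crossover(parent1, parent2):
    # Order crossover (OX): copy a random slice from parent1 into the child,
    # then fill the remaining positions with the missing cities in the
    # order they appear in parent2. The child is always a valid permutation.
    n = len(parent1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent1[i:j + 1]
    fill = [c for c in parent2 if c not in child[i:j + 1]]
    k = 0
    for idx in range(n):
        if child[idx] is None:
            child[idx] = fill[k]
            k += 1
    return child
```

Operators like OX are preferred for tours because naive one-point crossover would duplicate and drop cities, producing invalid routes.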
Figure 6. An example of proposed method
Figure 6 shows how our proposed method works. The filtering step begins with the initial relationships, as in Figure 6(a), and uses the genetic algorithm to find the optimal route, as in Figure 6(b). The clustering step then divides the optimal path into sub-paths, as in Figure 6(c), and continues dividing until the terminal condition becomes true, as in Figure 6(d). Finally, each sub-path is considered a cluster, as in Figure 6(e). The method stops when the terminal condition becomes true, since one more iteration would reach the leaf level.
Figure 7. Clustering process
Figure 7 shows the clusters of the dataset as they appear at different stages. During the clustering step, the proposed method divides the shortest path into many sub-paths. This process is applied to each sub-path independently until the terminal condition becomes true. Every sub-path with more than one object is then considered a cluster; the remaining sub-paths are treated as outliers.
The running time of this algorithm depends on the size N of the dataset. The TSP formulation generates a sample of routes; assuming K relationships among the objects, the time complexity is O(NK). The genetic algorithm finds the strongest relationship after M iterations, so the total running time of the filtering step is O(MNK). The clustering step is much cheaper: it depends on the number of stages, as in Figure 7. Assuming the clustering step takes L iterations to find the clusters, the total complexity of our algorithm is O(MNKL).
Experimental Results
Our experimental setup was a Pentium 4 computer with 1 GB of memory and a 2.8 GHz CPU, running Windows XP Professional. We wrote the program in MATLAB and used the Iris dataset with different sizes [5]. Table 1 shows the results of our experiments.
TABLE 1. Results of our method on the Iris dataset with different sizes (columns 3 and 5), compared with results of the k-means algorithm (columns 2 and 4)
Figure 8 shows the first test. It clusters the dataset into 7 clusters, according to the first row of Table 1. The first part of Figure 8 shows how the proposed method finds the shortest path among the objects; since there are few objects, the filtering step finds the shortest path very quickly. The second part of Figure 8 shows how the proposed method finds the typical clusters.
Figure 8. The output of our proposed method (50 tuples)
Figure 9 shows the second test. It groups the dataset into two clusters, according to the second row of Table 1. The first part of Figure 9 shows how the proposed method finds the shortest path; since there are many objects, this takes more time than the clustering step. The second part of Figure 9 shows how the clustering step finds the typical clusters.
Figure 9. The output of our proposed method (166 tuples)
Figure 10 shows the third test. It groups the dataset into 3 clusters and 12 outliers, according to the third row of Table 1. The first part of Figure 10 shows the filtering step, which finds the shortest path; this step takes more time when the dataset is very large. The second part of Figure 10 shows the clustering step, which depends on the filtering step.
Figure 10. The output of our proposed method (768 tuples)
Conclusion
In this paper, we proposed a novel method for clustering objects based on their relationships. Our method has two main steps: a filtering step, which finds the optimal route, or strongest relationship, among the data using the shortest path; and a clustering step, which divides that optimal route into a number of sub-routes. The advantage of this method is that it builds clusters automatically, without any user intervention. We plan to examine further extensions of the proposed method on larger datasets.
Acknowledgment
This work was supported by the grant of the Korean Ministry of Education, Science and Technology (the Regional Core Research Program / Chungbuk BIT Research-Oriented University Consortium) and the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. R11-2008-014-02002-0).
References
1. M. J. Li, M. K. Ng, and Y. M. Cheung, "Agglomerative Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, November 2008.
2. C. F. Tsai, H. C. Wu, and C. W. Tsai, "A New Data Clustering Approach for Data Mining in Large Databases," Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '02), p. 315, 2002.
3. H. L. R. Encarnación, S. M. B. Suárez, W. H. Rivera, V. C. Vázquez, M. A. S. Figueroa, and A. R. Toro, "Genetic Algorithm Approach for Reorder Cycle Time Determination in Multi-Stage Systems," University of Puerto Rico, 2003.
4. B. F. A. Dulaimi and H. A. Ali, "Enhanced Traveling Salesman Problem Solving by Genetic Algorithm Technique (TSPGA)," PWASET, vol. 28, April 2008, ISSN 1307-6884.
5. http://neural.cs.nthu.edu.tw/