Pattern Recognition 39 (2006) 5–21
www.elsevier.com/locate/patcog

Unsupervised possibilistic clustering

Miin-Shen Yang a,∗, Kuo-Lung Wu b

a Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan, ROC
b Department of Information Management, Kun Shan University of Technology, Tainan 71023, Taiwan, ROC

Received 22 March 2004; received in revised form 4 April 2005; accepted 15 July 2005

∗ Corresponding author. Tel.: +886 3 265 3100; fax: +886 3 265 3199. E-mail address: [email protected] (M.-S. Yang).

Abstract

In fuzzy clustering, the fuzzy c-means (FCM) clustering algorithm is the best-known and most widely used method. Since the FCM memberships do not always explain the degrees of belonging of the data well, Krishnapuram and Keller proposed a possibilistic approach to clustering to correct this weakness of FCM. However, the performance of Krishnapuram and Keller's approach depends heavily on its parameters. In this paper, we propose another possibilistic clustering algorithm (PCA), based on the FCM objective function together with the partition coefficient (PC) and partition entropy (PE) validity indexes. The resulting membership becomes an exponential function, so that it is robust to noise and outliers, and the parameters in PCA can be handled easily. Moreover, the PCA objective function can be viewed as a potential function or a mountain function, so that the prototypes of PCA correspond to the peaks of the estimated function. To validate the clustering results obtained by PCA, we generalize the validity indexes of FCM. This generalization makes each validity index workable in both fuzzy and possibilistic clustering models. By combining these generalized validity indexes with PCA, an unsupervised possibilistic clustering method is proposed. Numerical examples and implementations on real data based on the proposed PCA and the generalized validity indexes demonstrate their effectiveness and accuracy.
© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Fuzzy clustering; Possibilistic clustering; Fuzzy c-means; Validity indexes; Fuzzy c-partitions; Possibilistic c-memberships; Robustness

1. Introduction

Cluster analysis is a method for grouping a data set into clusters of similar individuals. It is an approach to unsupervised learning as well as one of the major techniques in pattern recognition. Conventional (hard) clustering methods restrict each point of the data set to exactly one cluster. Since Zadeh [1] proposed fuzzy sets, which introduced the idea of partial membership described by a membership function, fuzzy clustering has been widely studied and applied in a variety of key areas (see Refs. [2–4]). In the fuzzy clustering literature, the fuzzy c-means (FCM) clustering algorithm, proposed by Dunn [5] and extended by Bezdek [2], is the best-known and most widely used method.

Although FCM is a very useful clustering method, its memberships do not always correspond well to the degrees of belonging of the data, and it may be inaccurate in a noisy environment [6]. To improve this weakness of FCM, and to produce memberships that give a good explanation of the degrees of belonging of the data, Krishnapuram and Keller [6] created a possibilistic approach to clustering that uses a possibilistic type of membership function to describe the degree of belonging. They showed that algorithms with possibilistic memberships are more robust to noise and outliers than FCM.
The possibilistic clustering approach has also been applied to shell clustering, boundary detection, and surface and function approximation (see Refs. [7–9]).

It is necessary to pre-assume the number c of clusters for these hard, fuzzy and possibilistic clustering algorithms (PCAs). In general, the cluster number c is unknown. The problem of finding an optimal c is usually called cluster validity. Once a partition is obtained by a clustering method, a validity function can help us to validate whether it accurately presents the structure of the data set or not. The first cluster validity functions associated with FCM were the partition coefficient (PC) and partition entropy (PE) [2,10,11]. These indexes use only the membership functions and have the disadvantage of lacking a connection to the geometrical structure of the data. Validity indexes that explicitly take the geometrical properties of the data into account include the FS index proposed by Fukuyama and Sugeno [12], the XB index proposed by Xie and Beni [13], the SC index proposed by Zahid et al. [14], and the FHV (fuzzy hyper-volume) and PD (partition density) indexes proposed by Gath and Geva [15]. By combining a validity function with a fuzzy clustering algorithm, such as FCM or the alternative FCM [16], one obtains an unsupervised fuzzy clustering algorithm.

In real data analysis, noise and outliers are unavoidable. In this situation, one may wish to process a PCA. However, the existing validity indexes lose their efficiency in a possibilistic clustering environment (i.e. when the membership functions are of a possibilistic type). In this paper we discuss the problem of how to validate the clustering results obtained by a PCA. Since possibilistic memberships relax the constraint imposed on the fuzzy memberships obtained by FCM, the existing validity indexes cannot be expected to work in possibilistic clustering models. We use a normalization technique to generalize these existing indexes. This generalization makes each validity index workable in both fuzzy and possibilistic clustering models. We also propose a new PCA whose possibilistic memberships are exponential functions and which is therefore robust to noise and outliers. An unsupervised possibilistic clustering method can then be created by combining these generalized validity indexes with the PCA.

This paper is organized as follows. In Section 2, we review the FCM clustering algorithm and discuss the effects of the parameters m (fuzzifier) and c (cluster number). We also review four existing validity indexes that are the most indicative in fuzzy clustering validity analysis. In Section 3.1, we propose a PCA whose objective function extends the FCM objective function by combining it with the PC and PE validity indexes. We also discuss the effects of the parameters m and c in PCA. In Section 3.2, we give the robust properties of PCA based on the influence function. In Section 3.3, we propose a normalization technique to generalize the existing indexes. This generalization makes each validity index workable for validating both fuzzy and possibilistic clusters.
In Section 4, we use three real data sets to test the efficiency of the original and generalized validity indexes by validating the fuzzy and possibilistic clusters obtained by FCM and PCA, respectively. We also analyze the effect of the parameter m on these three real data sets and give a suggestion for choosing m in both FCM and PCA. Finally, the discussion and conclusions are presented in Section 5.

2. Unsupervised fuzzy clustering

Since Zadeh [1] introduced the concept of fuzzy sets, a great deal of research on fuzzy clustering has been conducted. Let X = {x_1, ..., x_n} be a data set in an s-dimensional Euclidean space R^s with the ordinary Euclidean norm ‖·‖, and let c be a positive integer larger than one. A partition of X into c clusters can be represented by mutually disjoint sets X_1, ..., X_c such that X_1 ∪ ··· ∪ X_c = X, or equivalently by indicator functions μ_1, ..., μ_c such that μ_i(x) = 1 if x ∈ X_i and μ_i(x) = 0 if x ∉ X_i for all i = 1, ..., c. The set of indicator functions

$$\{\mu_1,\ldots,\mu_c\}_H=\{\{\mu_1,\ldots,\mu_c\}\mid \mu_i(x)\in\{0,1\}\} \qquad (1)$$

is called a hard c-partition, which clusters X into c clusters. Consider an extension that allows the μ_i(x) to be membership functions of fuzzy sets μ_i on X taking values in the interval [0, 1] such that Σ_{i=1}^{c} μ_i(x) = 1 for all x in X. In this case,

$$\{\mu_1,\ldots,\mu_c\}_F=\Big\{\{\mu_1,\ldots,\mu_c\}\ \Big|\ \mu_i(x)\in[0,1],\ \sum_{i=1}^{c}\mu_i(x)=1\Big\} \qquad (2)$$

is called a fuzzy c-partition of X.

2.1. The FCM clustering algorithm

In the unsupervised learning literature, FCM is the best-known fuzzy clustering method. FCM is an iterative algorithm that uses the necessary conditions for a minimizer of the FCM objective function J_FCM, with

$$J_{FCM}(\mu,a)=\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\,\|x_j-a_i\|^{2},\quad m>1, \qquad (3)$$

where μ = {μ_1, ..., μ_c}, with the membership function μ_i defined by μ_ij = μ_i(x_j), is a fuzzy c-partition and a = {a_1, ..., a_c} is the set of c cluster centers. The necessary conditions for a minimizer (μ, a) of J_FCM are the following update equations:

$$\mu_{ij}=\left[\sum_{k=1}^{c}\left(\frac{\|x_j-a_i\|}{\|x_j-a_k\|}\right)^{2/(m-1)}\right]^{-1},\quad i=1,\ldots,c,\ j=1,\ldots,n, \qquad (4)$$

and

$$a_i=\frac{\sum_{j=1}^{n}\mu_{ij}^{m}x_j}{\sum_{j=1}^{n}\mu_{ij}^{m}},\quad i=1,\ldots,c. \qquad (5)$$

The weighting exponent m is called the fuzzifier and can influence the clustering performance of FCM [17]. The influence of m on the membership function μ_i of Eq. (4) is shown in Fig. 1. The figure is produced by assuming that there are only two clusters with centers 0 and 2; the curves for the different m values are the membership functions of belonging to the cluster with center 0.

Fig. 1. The membership functions of FCM with different weighting exponents m (m = 1.1, 1.5, 2, 3, 10).

When m = 1, FCM reduces to the traditional hard c-means. When m tends to infinity, μ_ij = 1/c for all i, j and the sample mean becomes the unique optimizer of J_FCM. In fact, this situation may occur for any specified value of m, and Yu et al. [18] proposed a theoretical upper bound for m that can prevent the sample mean from being the unique optimizer of J_FCM.

Another parameter that influences μ_ij is the cluster number c. In general, we do not consider it to be a parameter of the membership function; in fact, however, the shapes of the membership functions change when the cluster number c changes. The influence of c on μ_ij is shown in Fig. 2. Each curve is the membership function of belonging to the cluster with center 0. When c = 2, the curve is obtained by adding a cluster with center 2; that is, the set of cluster centers for c = 2 is {0, 2}. When c = 3, the curve is obtained by adding a third cluster with center 2, so the set of cluster centers for c = 3 is {0, 2, 2}. The set of cluster centers for c = 4 is {0, 2, 2, 2}, and so on. The shapes of the FCM membership functions become steeper as c increases. This is reasonable because it can help FCM find more clusters when the cluster number is large. This analysis is important, and we should involve this concept and property in any fuzzy or possibilistic clustering method.

Fig. 2. The membership functions of FCM with different cluster numbers c (c = 2, 3, 4, 5, 10).

The FCM clustering algorithm is then summarized as follows:

Fuzzy c-means clustering algorithm
Initialize a_i^(0), i = 1, ..., c, and set ε > 0; set the iteration counter t = 0.
Step 1. Compute μ_ij^(t+1) using Eq. (4).
Step 2. Compute a_i^(t+1) using Eq. (5).
Increment t; repeat until max_i ‖a_i^(t+1) − a_i^(t)‖ < ε.
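As an illustration only (not part of the original algorithm description), the update equations (4) and (5) translate directly into a short alternating loop. The following is a minimal NumPy sketch of the FCM iteration summarized above; the function name fcm, the random initialization and the tolerance eps are our own choices.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: alternate the membership update of Eq. (4)
    and the center update of Eq. (5) until the centers stop moving."""
    rng = np.random.default_rng(seed)
    a = X[rng.choice(len(X), size=c, replace=False)]               # initial cluster centers
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)    # c x n squared distances
        d2 = np.fmax(d2, 1e-12)                                    # guard against zero distances
        u = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1))).sum(axis=1)  # Eq. (4)
        a_new = (u ** m) @ X / (u ** m).sum(axis=1, keepdims=True)                    # Eq. (5)
        if np.abs(a_new - a).max() < eps:                          # stopping rule of the algorithm box
            return u, a_new
        a = a_new
    return u, a
```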
2.2. Cluster validity for fuzzy clustering

After a fuzzy c-partition is provided by a fuzzy clustering algorithm such as FCM, we may ask whether it accurately represents the structure of the data set or not. This is the cluster validity problem. Since most fuzzy clustering methods need to pre-assume the number c of clusters, a validity criterion for finding an optimal c that can completely describe the data structure has become the most studied topic in cluster validity. For a given range of cluster numbers, the validity measure is evaluated for each cluster number, and an optimal number is then chosen according to these validity measures. We briefly review four existing indexes here.

(a) The first validity index associated with FCM is the partition coefficient [2], defined by

$$PC(c)=\frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{2}, \qquad (6)$$

where 1/c ≤ PC(c) ≤ 1. In general, we find an optimal cluster number c* by solving max_{2≤c≤n−1} PC(c) to produce the best clustering performance for the data set X.

(b) The partition entropy (PE) [10,11] is defined by

$$PE(c)=-\frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}\log_{2}\mu_{ij}, \qquad (7)$$

where 0 ≤ PE(c) ≤ log_2 c. In general, we find an optimal c* by solving min_{2≤c≤n−1} PE(c) to produce the best clustering performance for the data set X. When μ_ij = 1/c for all i, j (i.e. the sample mean is the unique optimizer), PC(c) = 1/c and PE(c) = log_2 c. This may be caused by a failure of the algorithm, by unsuitable parameter choices, or by a lack of structure in the data. Yu et al. [18] showed that if m is larger than a theoretical upper bound, then the above situation occurs.

The above indexes use only the membership functions and may show a monotone tendency with the cluster number c. This may be due to the lack of connection to the geometrical structure of the data. The following indexes take both the membership functions and the structure of the data into account.

(c) A validity function proposed by Fukuyama and Sugeno [12] is defined by

$$FS(c)=\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\|x_j-a_i\|^{2}-\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\|a_i-\bar a\|^{2}=J_{FCM}(\mu,a)+K(\mu,a), \qquad (8)$$

where $\bar a=\sum_{i=1}^{c}a_i/c$. J_FCM(μ, a) is the FCM objective function, which measures compactness, and K(μ, a) measures separation. In general, an optimal c* is found by solving min_{2≤c≤n−1} FS(c) to produce the best clustering performance for the data set X.

(d) A validity function proposed by Xie and Beni [13] with m = 2, and then generalized by Pal and Bezdek [17], is defined by

$$XB(c)=\frac{\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\|x_j-a_i\|^{2}/n}{\min_{i\neq k}\|a_i-a_k\|^{2}}=\frac{J_{FCM}(\mu,a)/n}{\mathrm{Sep}(a)}. \qquad (9)$$

J_FCM(μ, a) is a compactness measure, and Sep(a) is a separation measure. In general, an optimal c* is found by solving min_{2≤c≤n−1} XB(c) to produce the best clustering performance for the data set X.
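All four indexes are simple functions of a membership matrix and the cluster centers. The following NumPy sketch of PC, PE, FS and XB as defined in Eqs. (6)–(9) is our own illustration; the function name and the convention that u is a c × n array are assumptions, not part of the original text.

```python
import numpy as np

def validity_indexes(X, u, a, m=2.0):
    """Sketch of the PC (6), PE (7), FS (8) and XB (9) indexes for a fuzzy
    c-partition: u is a c x n membership array, a is the c x s center array."""
    n = u.shape[1]
    pc = (u ** 2).sum() / n                                     # Eq. (6): larger is better
    pe = -(u * np.log2(np.fmax(u, 1e-12))).sum() / n            # Eq. (7): smaller is better
    d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)     # c x n squared distances
    jfcm = ((u ** m) * d2).sum()                                # compactness term
    a_bar = a.mean(axis=0)
    k_term = ((u ** m) * ((a - a_bar) ** 2).sum(axis=1)[:, None]).sum()
    fs = jfcm - k_term                                          # Eq. (8): smaller is better
    sep = min(((a[i] - a[k]) ** 2).sum()
              for i in range(len(a)) for k in range(len(a)) if i != k)
    xb = (jfcm / n) / sep                                       # Eq. (9): smaller is better
    return pc, pe, fs, xb
```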
These four indexes are the most cited validity indexes for fuzzy clustering. They share the common objective of finding an optimal c for which each of the c clusters is compact and well separated from the other clusters. By combining such validity functions with FCM, FCM becomes a completely unsupervised fuzzy clustering algorithm. Note that, since no single validity index is the best, a better way of using validity indexes to solve the cluster validity problem is to consider the information provided by all selected indexes and then make an optimal decision. However, in some situations, such as noisy and outlier-contaminated environments, one may wish to process a possibilistic clustering approach; when the membership functions are of a possibilistic type, these existing validity indexes cannot be expected to work. Before solving this problem, we discuss the possibilistic clustering approaches below.

3. Unsupervised possibilistic clustering

Although FCM is a very useful clustering method, its memberships do not always correspond well to the degrees of belonging of the data, and it may be inaccurate in a noisy environment [6]. To improve this weakness of FCM, and to produce memberships that give a good explanation of the degree of belonging of the data, Krishnapuram and Keller [6] relaxed the constraint Σ_{i=1}^{c} μ_i(x) = 1 of the fuzzy c-partition {μ_1, ..., μ_c}_F in FCM to obtain a possibilistic type of membership function with

$$\{\mu_1,\ldots,\mu_c\}_P=\Big\{\{\mu_1,\ldots,\mu_c\}\ \Big|\ \mu_i(x)\in[0,1],\ \max_i\mu_i(x)>0\ \text{for all}\ x\Big\}. \qquad (10)$$

We may call the memberships μ_1, ..., μ_c in Eq. (10) possibilistic c-memberships. To avoid trivial solutions, Krishnapuram and Keller [6] added a constraining term to FCM and proposed the following possibilistic clustering objective function:

$$J_1(\mu,a)=\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\|x_j-a_i\|^{2}+\sum_{i=1}^{c}\eta_i\sum_{j=1}^{n}(1-\mu_{ij})^{m}. \qquad (11)$$

They then created a possibilistic approach to clustering that uses the possibilistic c-memberships of Eq. (10) to describe the degree of belonging on the basis of the objective function (11). Afterward, Krishnapuram and Keller [19] gave an alternative objective function for possibilistic clustering:

$$J_2(\mu,a)=\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}\|x_j-a_i\|^{2}+\sum_{i=1}^{c}\eta_i\sum_{j=1}^{n}(\mu_{ij}\log\mu_{ij}-\mu_{ij}). \qquad (12)$$

Note that both (1 − μ_ij)^m in Eq. (11) and μ_ij log μ_ij − μ_ij in Eq. (12) are monotone decreasing functions of μ_ij. This forces μ_ij to be as large as possible and avoids the trivial solution. However, the parameter η_i has a major influence on the clustering results, so the determination of the normalization parameter η_i is quite important. Krishnapuram and Keller [6,19] recommended selecting η_i as

$$\eta_i=K\,\frac{\sum_{j=1}^{n}\mu_{ij}^{m}\|x_j-a_i\|^{2}}{\sum_{j=1}^{n}\mu_{ij}^{m}}\quad\text{or}\quad \eta_i=\frac{\sum_{x_j\in(\Pi_i)_\alpha}\|x_j-a_i\|^{2}}{|(\Pi_i)_\alpha|}, \qquad (13)$$

where K ∈ (0, ∞) is typically chosen to be one and (Π_i)_α denotes the set of points x_j whose memberships μ_ij, j = 1, ..., n, exceed a threshold α. Memberships obtained by minimizing J_1 or J_2 are of a possibilistic type. The clustering performance of both J_1 and J_2 depends heavily on the chosen parameters η_i (see Refs. [19,20]); on the other hand, it is also difficult to handle the parameters η_i in real applications. In this section we propose a PCA whose performance can be easily controlled and whose objective function can be properly analyzed.
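For the first choice in Eq. (13), η_i can be computed directly from a (typically FCM) membership matrix and the current centers. A minimal sketch follows, assuming u is a c × n array; the function name is our own and not part of the original method description.

```python
import numpy as np

def eta_from_partition(X, u, a, m=2.0, K=1.0):
    """First choice of Eq. (13): eta_i as the membership-weighted mean squared
    distance of the points to center a_i, with u a c x n membership array."""
    d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)
    return K * ((u ** m) * d2).sum(axis=1) / (u ** m).sum(axis=1)
```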
3.1. A new possibilistic clustering algorithm

The objective function of our PCA is

$$J_{PCA}(\mu,a)=\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}\|x_j-a_i\|^{2}+\frac{\beta}{m^{2}\sqrt{c}}\sum_{i=1}^{c}\sum_{j=1}^{n}\big(\mu_{ij}^{m}\log\mu_{ij}^{m}-\mu_{ij}^{m}\big), \qquad (14)$$

where β, m and c are all positive. The first term is equivalent to the FCM objective function, which requires the distances from the feature vectors to the cluster centers to be as small as possible. The second term is constructed from an analog of the PE validity index (μ_ij^m log μ_ij^m) and an analog of the PC validity index (μ_ij^m). This constraining term forces the μ_ij to be as large as possible. The objective function (14) thus extends the FCM objective function by combining it with the PC and PE validity indexes. The term β/(m²√c) will be discussed later.

Theorem 1. The necessary conditions for a minimizer (μ, a) of the objective function (14) are the following update equations:

$$\mu_{ij}=\exp\left(-\frac{m\sqrt{c}\,\|x_j-a_i\|^{2}}{\beta}\right),\quad i=1,\ldots,c,\ j=1,\ldots,n, \qquad (15)$$

and

$$a_i=\frac{\sum_{j=1}^{n}\mu_{ij}^{m}x_j}{\sum_{j=1}^{n}\mu_{ij}^{m}},\quad i=1,\ldots,c. \qquad (16)$$

Proof. Since no constraints are imposed on μ_ij, minimizing J_PCA with respect to μ_ij is equivalent to minimizing

$$\mu_{ij}^{m}\|x_j-a_i\|^{2}+\frac{\beta}{m^{2}\sqrt{c}}\big(\mu_{ij}^{m}\log\mu_{ij}^{m}-\mu_{ij}^{m}\big) \qquad (17)$$

with respect to μ_ij. Differentiating Eq. (17) with respect to μ_ij and setting the derivative to zero gives Eq. (15). Since the second term of J_PCA is independent of a_i, minimizing J_PCA with respect to a_i is equivalent to minimizing the first term of J_PCA, which leads to Eq. (16).

Insight into the objective function J_PCA can be obtained by solving the membership function (15) for ‖x_j − a_i‖² in terms of μ_ij. We have

$$\|x_j-a_i\|^{2}=-\frac{\beta\ln\mu_{ij}}{m\sqrt{c}}=-\frac{\beta\ln\mu_{ij}^{m}}{m^{2}\sqrt{c}}. \qquad (18)$$

Eliminating ‖x_j − a_i‖² from the objective function J_PCA using Eq. (18), we have

$$\tilde J_{PCA}(\mu)=-\frac{\beta}{m^{2}\sqrt{c}}\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}. \qquad (19)$$

Thus, minimizing J_PCA (or $\tilde J_{PCA}$) is equivalent to maximizing $\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}$. Therefore, the objective of J_PCA is to find prototypes such that the sum of the membership functions is maximized. Furthermore, in the special case m = 2, the PCA ends up producing the results that maximize the PC validity index. On the other hand, if we insert the membership functions (15) into the term $\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}$, we have

$$\sum_{i=1}^{c}\sum_{j=1}^{n}\mu_{ij}^{m}=\sum_{i=1}^{c}\sum_{j=1}^{n}\exp\left(-\frac{m\sqrt{c}\,\|x_j-a_i\|^{2}}{\beta}\right). \qquad (20)$$

The right-hand side of Eq. (20) is the potential function [21] and the mountain function [22]. Thus, minimizing J_PCA (or $\tilde J_{PCA}$) is equivalent to maximizing the total potential function or mountain function, and the cluster centers obtained by PCA therefore correspond to the peaks of the potential or mountain function. Since the potential or mountain function can be regarded as a density estimate, the prototypes of PCA are equivalent to the modes of the estimated density.

It is obvious that the exponential function in Eq. (15) obtained by minimizing J_PCA is of a possibilistic type. The parameter β is a normalization term that measures the degree of dispersion of the data set, and it is reasonable to define β as the sample variance. That is,

$$\beta=\frac{\sum_{j=1}^{n}\|x_j-\bar x\|^{2}}{n}\quad\text{with}\quad \bar x=\frac{\sum_{j=1}^{n}x_j}{n}. \qquad (21)$$
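Eqs. (15), (16) and (21) translate directly into an alternating iteration. The following minimal NumPy sketch of the proposed PCA update loop is illustrative only; the function name pca_cluster, the random initialization and the tolerance eps are our own choices and not prescribed by the paper.

```python
import numpy as np

def pca_cluster(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal sketch of the proposed PCA: beta is fixed by Eq. (21), and
    Eqs. (15) and (16) are alternated until the prototypes stop moving."""
    rng = np.random.default_rng(seed)
    beta = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()       # Eq. (21): sample variance
    a = X[rng.choice(len(X), size=c, replace=False)]            # initial prototypes
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)
        u = np.exp(-m * np.sqrt(c) * d2 / beta)                 # Eq. (15): exponential memberships
        a_new = (u ** m) @ X / (u ** m).sum(axis=1, keepdims=True)   # Eq. (16)
        if np.abs(a_new - a).max() < eps:
            return u, a_new
        a = a_new
    return u, a
```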
The role of the parameter m in Eq. (15) corresponds to that of the fuzzifier m in FCM. If m in J_PCA tends to zero, then lim_{m→0} μ_ij = 1 and lim_{m→0} a_i = x̄ for all i, j. In this case, the sample mean is the unique optimizer of J_PCA and no clusters will be found; we say that the c clusters coincide in one cluster. If m tends to infinity, most data points will have very small membership values even if they are very close to one of the c cluster centers; in this case, the membership function no longer gives a good explanation of the degree of belonging. The influence of the weighting exponent m on the PCA membership functions μ_ij is shown in Fig. 3, where the curves for different m values are the membership functions of belonging to the cluster with center 0.

Fig. 3. The membership functions of PCA with different weighting exponents m (m = 1.1, 1.5, 2, 3, 10).

We also incorporate the cluster number c into the exponential membership function. The influence of c on the PCA membership functions μ_ij is shown in Fig. 4. When c increases, the shape of the membership function becomes steeper, which allows us to find more possible clusters when c is large. Although the cluster number c is not directly considered as a parameter in FCM, it still influences the memberships, as shown in Fig. 2. Therefore, we include the cluster number c in our exponential membership function. The above discussion is the reason why the second term of J_PCA is multiplied by β/(m²√c). The parameter β can always be fixed as the sample variance. The parameter m can be specified by the users according to their requirements; in general, we take m = 2. The parameter √c is used to control the steepness of the membership functions. The main purpose of involving the cluster number c in J_PCA is to make the algorithm more powerful for various data sets, especially when solving the cluster validity problem. The PCA is then summarized as follows:

Fig. 4. The membership functions of PCA with different cluster numbers c (c = 2, 3, 4, 5, 10).

Possibilistic clustering algorithm (PCA)
Initialize a_i^(0), i = 1, ..., c, and set ε > 0; set the iteration counter t = 0.
Step 1. Compute μ_ij^(t+1) using Eq. (15).
Step 2. Compute a_i^(t+1) using Eq. (16).
Increment t; repeat until max_i ‖a_i^(t+1) − a_i^(t)‖ < ε.

Similar to the possibilistic clustering approach proposed by Krishnapuram and Keller [6], the proposed PCA gives a more reasonable explanation of the degree of belonging than FCM. In Fig. 5, there are two clusters with a bridge point A and an outlying point B. FCM gives both A and B memberships of 0.5 in the two clusters, although they should have different degrees of belonging; this is shown in Table 1. In PCA with the weighting exponent m = 2, the bridge point A and the outlying point B are assigned different degrees of belonging, 0.058 and 0.001, respectively.

Fig. 5. Two clusters' data set with a bridge point A and an outlying point B.

Table 1. Data set and membership values of FCM and PCA in Fig. 5

                       FCM              PCA
     X     Y      mu1     mu2      mu1     mu2
    60   150    0.998   0.002    1.000   0.000
    65   150    0.998   0.002    0.957   0.000
    70   150    0.988   0.012    0.838   0.000
    55   150    0.991   0.009    0.955   0.000
    50   150    0.979   0.021    0.835   0.000
    60   145    0.991   0.009    0.956   0.000
    60   155    0.997   0.003    0.956   0.000
   140   150    0.002   0.998    0.000   1.000
   145   150    0.009   0.991    0.000   0.956
   150   150    0.021   0.979    0.000   0.835
   135   150    0.002   0.998    0.000   0.957
   130   150    0.012   0.988    0.000   0.838
   140   145    0.009   0.991    0.000   0.956
   140   155    0.003   0.997    0.000   0.956
 A 100   150    0.500   0.500    0.058   0.058
 B 100   200    0.500   0.500    0.001   0.001

Note that the memberships obtained by PCA are of a possibilistic type, as shown in Table 1. According to these possibilistic membership values, PCA gives us more information about the data locations than FCM.
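As a quick check of the memberships in Eq. (15) for this example, the following self-contained snippet evaluates the exponential memberships of the bridge point A and the outlier B, with the two prototypes placed at the visual cluster locations of Fig. 5 (an assumption on our part) and β computed from Eq. (21); the printed values should come out close to the PCA entries of Table 1.

```python
import numpy as np

# The 14 cluster points plus the bridge point A and the outlier B of Fig. 5 / Table 1.
X = np.array([[60, 150], [65, 150], [70, 150], [55, 150], [50, 150], [60, 145], [60, 155],
              [140, 150], [145, 150], [150, 150], [135, 150], [130, 150], [140, 145], [140, 155],
              [100, 150],    # bridge point A
              [100, 200]],   # outlying point B
             dtype=float)
m, c = 2.0, 2
beta = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()          # Eq. (21)
a = np.array([[60.0, 150.0], [140.0, 150.0]])                  # assumed prototypes at the two groups
d2 = ((X[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)        # n x c squared distances
u = np.exp(-m * np.sqrt(c) * d2 / beta)                        # Eq. (15)
print(np.round(u[-2], 3), np.round(u[-1], 3))   # A: moderate memberships (about 0.06 each);
                                                # B: nearly zero (about 0.001)
```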
In fact, the parameters β, m and c in the proposed PCA play similar roles in that they all affect the degree of belonging and the shape of the PCA membership functions of Eq. (15). Figs. 3 and 4 demonstrate the effects of the parameters m and c on the PCA membership functions. The parameter β in PCA is used to reduce the influence of the data scale, so it is set to the sample variance of Eq. (21). In Fig. 2, we illustrated that the FCM membership functions become steeper when the cluster number c increases. The parameter c in PCA is used to adjust the membership functions to different cluster numbers, as shown in Fig. 4, so that a PCA with the actual cluster number c yields clustering results that fit the structure of the data well. This adjustment with c makes PCA more suitable for various data sets. Thus, we only need to specify the parameter m in PCA, and we can therefore focus the discussion on the parameter m. We now use an example to show the influence of m in PCA, with a comparison to the influence of m in FCM on a classification problem.

Example 1. We consider the Normal-4 data set proposed by Pal and Bezdek [17]. Normal-4 is a four-dimensional data set with sample size n = 800, consisting of 200 points from each of four clusters. The population mean vectors are (3, 0, 0, 0), (0, 3, 0, 0), (0, 0, 3, 0) and (0, 0, 0, 4), and the variance–covariance matrices are the identity matrix I_4 for all four clusters. Both FCM and PCA are implemented to obtain four clusters for the Normal-4 data set, and each data point is classified according to the nearest cluster center. Fig. 6 shows the classification error rate curves of FCM (solid circles) for m = 1.5–4 and of PCA (open circles) for m = 0.01–4. We find that both algorithms work well when the parameter m is between 1.5 and 3, but FCM does not work well when m is between 3 and 4, where PCA still works well for the Normal-4 data set. In fact, Pal and Bezdek [17] suggested that m in the interval [1.5, 2.5] is generally recommended for FCM. More discussion of the parameter m, including the robust properties and the clustering of a data set with an unknown cluster number, is given in the next two subsections.

Fig. 6. The error rate curves of FCM and PCA for the Normal-4 data set with respect to different weighting exponents m.

3.2. The robust properties of PCA

A good clustering method should be robust enough to tolerate noise and outliers. In this subsection, we give the robust properties of the proposed PCA. We use the influence function (see Ref. [23]) to show that the PCA cluster center update equations (15) and (16) are robust to noise and outliers.

Let {x_1, ..., x_n} be an observed data set of real numbers and let θ be an unknown parameter to be estimated. An M-estimator [23] is generated by minimizing the form

$$\sum_{j=1}^{n}\rho(x_j;\theta), \qquad (22)$$

where ρ is an arbitrary function that measures the loss between x_j and θ. Here, we are interested in a location estimate that minimizes

$$\sum_{j=1}^{n}\rho(x_j-\theta), \qquad (23)$$

and the M-estimator is generated by solving the equation

$$\sum_{j=1}^{n}\psi(x_j-\theta)=0, \qquad (24)$$

where ψ(x_j − θ) = (∂/∂θ)ρ(x_j − θ). The influence function or influence curve (IC) helps us to assess the relative influence of an individual observation on the value of an estimate; for an M-estimator, the influence function is proportional to its ψ function. In the location problem, the influence function of an M-estimator is

$$IC(x;F,\theta)=\frac{\psi(x-\theta)}{\int\psi'(x-\theta)\,dF_X(x)}, \qquad (25)$$

where F_X(x) denotes the distribution function of X. If the influence function of an estimator is unbounded, noise or outliers may cause trouble; similarly, if the ψ function of an estimator is unbounded, noise and outliers may also cause trouble.

We have shown that minimizing J_PCA is equivalent to maximizing the total potential function (20). Now let the loss between the data point x_j and the ith cluster center a_i be

$$\rho(x_j-a_i)=1-\exp\left(-\frac{m\sqrt{c}\,\|x_j-a_i\|^{2}}{\beta}\right) \qquad (26)$$

and

$$\psi(x_j-a_i)=\frac{m\sqrt{c}}{\beta}\exp\left(-\frac{m\sqrt{c}\,\|x_j-a_i\|^{2}}{\beta}\right)(x_j-a_i). \qquad (27)$$
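The shape of this loss and its ψ function is easy to inspect numerically. The following self-contained sketch (our own illustration, with β set to 1 for simplicity) evaluates Eqs. (26) and (27) for a scalar residual and shows the redescending behaviour discussed below.

```python
import numpy as np

def rho(r, m=2.0, c=2, beta=1.0):
    """Loss of Eq. (26) as a function of the scalar residual r = x_j - a_i."""
    return 1.0 - np.exp(-m * np.sqrt(c) * r ** 2 / beta)

def psi(r, m=2.0, c=2, beta=1.0):
    """psi function of Eq. (27): it redescends, so the influence of a point
    far from the center vanishes instead of growing without bound."""
    return (m * np.sqrt(c) / beta) * np.exp(-m * np.sqrt(c) * r ** 2 / beta) * r

r = np.linspace(-5.0, 5.0, 11)
print(np.round(psi(r), 4))   # large |r| values contribute essentially nothing
```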
Maximizing the total potential function (20) is equivalent to minimizing Σ_{j=1}^{n} ρ(x_j − a_i), and solving Σ_{j=1}^{n} ψ(x_j − a_i) = 0 gives the result shown in Eq. (16) with the membership function (15). Thus, the PCA cluster center estimate is an M-estimator with the loss function (26) and the ψ function (27). The ψ function can be used to assess the relative influence of an individual observation on the value of a cluster center estimate; the ψ function of PCA is shown in Fig. 7.

Fig. 7. The ψ function of the PCA clustering algorithm (m = 1, 2, 5).

We see that the influence of a single noisy point (far away from 50) is a monotone decreasing function of the weighting exponent m, and the largest influence occurs as m tends to zero. In PCA, however, lim_{m→0} μ_ij = 1 and lim_{m→0} a_i = x̄ for all i, j. This means that the sample mean is the unique optimizer of the PCA objective function J_PCA for small m values. This is reasonable because the sample mean is not robust to noise or outliers, so adding an individual outlier will largely shift the sample mean toward the outlier. Fig. 8 shows a two-cluster data set with an outlier at (50, 0). Fig. 8(a) shows this phenomenon: when m is small, such as m = 1.5, the two cluster center estimates (solid circle points) are very close to the sample mean. In Figs. 8(b) and (c), with m = 2 and 2.5, the cluster center estimates are not influenced by the outlying point and present the cluster centers of the data set well.

Fig. 8. The locations of cluster center estimates (solid circle points) obtained by PCA with m = 1.5, 2, 2.5, 20 (panels (a)–(d)) and by FCM with m = 1.05, 2 (panels (e)–(f)), where the outlier (50, 0) is added to the data set.

In Fig. 7, if the added individual point is far away from the cluster center estimate, it will have no influence when m is small. In contrast, if the added individual point is close to the cluster center estimate, it will have a large influence when m is large, as also shown in Fig. 7. This can be explained by writing $\hat\mu=\max\{\mu_{i1},\ldots,\mu_{in}\}$ and $\tilde\mu_{ij}=\mu_{ij}/\hat\mu$, j = 1, ..., n, so that

$$\lim_{m\to\infty}a_i=\lim_{m\to\infty}\frac{\sum_{j=1}^{n}\mu_{ij}^{m}x_j}{\sum_{j=1}^{n}\mu_{ij}^{m}}=\lim_{m\to\infty}\frac{\sum_{j=1}^{n}\tilde\mu_{ij}^{m}x_j}{\sum_{j=1}^{n}\tilde\mu_{ij}^{m}}=\frac{\sum_{\tilde\mu_{ij}=1}x_j}{\sum_{\tilde\mu_{ij}=1}1}. \qquad (28)$$

When m becomes large, the cluster center estimate is the data point that has the largest membership value; in other words, the cluster center estimate is the data point closest to the initial value. This phenomenon is shown in Fig. 8(d) with m = 20, where the solid circle points are cluster center estimates close to the initial values. We also show the FCM clustering results with m = 1.05 and 2 in Figs. 8(e) and (f), respectively; the right-hand cluster center estimate is always influenced by the noise and cannot present the location of the cluster center well.

In the next subsection, we discuss the validity problem in a possibilistic clustering environment, so that we can combine the PCA with cluster validity indexes to obtain an unsupervised PCA.
3.3. The generalized cluster validity indexes

In general, the existing validity indexes are constructed for solving validity problems in a fuzzy clustering environment. The simplest examples are the PC and PE indexes. Since FCM imposes the constraint Σ_{i=1}^{c} μ_i(x) = 1, it is impossible for one data point to have simultaneously high memberships in more than one cluster. Thus, a large value of PC and a small value of PE correspond to a well-validated data structure, and maximizing PC or minimizing PE gives a good estimate of the cluster number. However, the data points in PCA may have simultaneously high memberships in different clusters. For example, if the data set has only one compact cluster, PCA will find two coincident clusters when c = 2, and each data point will have a large and equal membership value in both clusters. From this property it can be seen that the values of PC and PE will increase as c increases in a possibilistic clustering environment. The other existing indexes also show undesirable tendencies, as will be shown later.

In order to solve the validity problem when using a possibilistic clustering method, we use a normalization technique to generalize the existing validity indexes. The possibilistic clustering method is created by relaxing the condition Σ_{i=1}^{c} μ_ij = 1; hence it is more robust to noise and outliers and has a better explanation of the degree of belonging than FCM. However, this relaxation makes the fuzzy validity indexes lose their efficiency in a possibilistic clustering environment. Therefore, we normalize the possibilistic c-memberships {μ_1, ..., μ_c}_P to {μ̄_1, ..., μ̄_c}_F so that the condition Σ_{i=1}^{c} μ̄_ij = 1 is satisfied, and we generalize the fuzzy cluster validity indexes by replacing μ_ij with μ̄_ij as follows. Suppose we have a set of possibilistic c-memberships {μ_1, ..., μ_c}_P; the normalized possibilistic c-memberships are then defined by {μ̄_1, ..., μ̄_c}_F with

$$\bar\mu_i(x)=\frac{\mu_i(x)}{\sum_{k=1}^{c}\mu_k(x)},\quad i=1,\ldots,c. \qquad (29)$$

The generalized PC, PE, FS and XB are then defined by replacing μ_ij with

$$\bar\mu_{ij}=\bar\mu_i(x_j)=\frac{\mu_i(x_j)}{\sum_{k=1}^{c}\mu_k(x_j)},\quad j=1,\ldots,n,\ i=1,\ldots,c, \qquad (30)$$

and are denoted by GPC, GPE, GFS and GXB, respectively. When these generalized validity indexes are used in a fuzzy clustering environment (i.e. Σ_{i=1}^{c} μ_ij = 1), it is easy to show that μ̄_ij = μ_ij in Eq. (30), and hence each generalized validity index is equivalent to its original form. This is why we call these indexes generalized validity indexes. Note that any existing validity index can be generalized in the same way to treat the cluster validity problem in a possibilistic clustering environment.
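The normalization of Eq. (30) is a one-line operation on a membership matrix. The following sketch is our own illustration; the function name is an assumption.

```python
import numpy as np

def normalize_memberships(u):
    """Eq. (30): rescale a possibilistic c x n membership array so that each column
    sums to one; the result is plugged into PC, PE, FS, XB to give GPC, GPE, GFS, GXB."""
    return u / np.fmax(u.sum(axis=0, keepdims=True), 1e-12)
```

For a fuzzy c-partition, whose columns already sum to one, this returns its input unchanged, which is why each generalized index reduces to its original form in the fuzzy case.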
Next, we use two examples to demonstrate the performance of the generalized validity indexes GPC, GPE, GFS and GXB.

Example 2. This is a 16-group data set, shown in Fig. 9(a). The data points in each group are generated uniformly from a rectangle; we call this the Uniform-16 data set. Uniform-16 is a two-dimensional data set with sample size n = 800, consisting of 50 points from each of 16 clusters. In order to compare the performance of the generalized validity indexes GPC, GPE, GFS and GXB on the Uniform-16 data set, we implement FCM and PCA from c = 2 to 25, so that the clustering results cover all situations for this 16-group data set. According to the data structure of the Uniform-16 data set, for each c we assign the c initial values using the first c of the 25 locations, in the order shown in Fig. 10. For example, if c = 4, the initial cluster centers are located at the four corners, and if c = 16, the initial cluster centers are located uniformly on the 16 group locations, following the orders shown in Fig. 10.

Fig. 9. (a) The Uniform-16 data set. (b) The Uniform-16 data set with 100 noisy points.

Fig. 10. The locations and orders of the given initial values for the Uniform-16 data set.

The results of the validity indexes PC, PE, FS and XB obtained by processing FCM are shown in the left portion of Table 2. The PC index gives a result that matches the data structure well. According to Fig. 10, the result with c = 4 may be another reasonable choice for the optimal c, as the FS and XB indexes indicate. We now process the PCA; the results of the original and the generalized validity indexes are shown in the center and the right portion of Table 2, respectively. As expected, the original validity indexes show undesirable tendencies: PC, PE and FS show a monotone tendency with c, while the XB index gives an acceptable result. However, all the generalized validity indexes give reasonable results: GPC, GPE and GFS show that c = 4 is optimal, and the GXB index gives c = 16, which matches the data structure well. Note that both FCM and PCA are processed with the same initial values and the same m = 2.
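The overall unsupervised procedure used in this example — run PCA over a range of cluster numbers and score each run with the generalized indexes — can be sketched as follows. This is a self-contained illustration with our own function names; it uses random initialization rather than the grid initialization of Fig. 10, and only GPC and GXB are computed.

```python
import numpy as np

def pca_run(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """One PCA run: beta from Eq. (21), alternating Eqs. (15) and (16)."""
    rng = np.random.default_rng(seed)
    beta = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    a = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)
        u = np.exp(-m * np.sqrt(c) * d2 / beta)
        a_new = (u ** m) @ X / (u ** m).sum(axis=1, keepdims=True)
        if np.abs(a_new - a).max() < eps:
            return u, a_new
        a = a_new
    return u, a

def unsupervised_pca(X, c_range=range(2, 26), m=2.0):
    """Score each cluster number with the generalized indexes GPC (maximize)
    and GXB (minimize), computed from the normalized memberships of Eq. (30)."""
    n = len(X)
    scores = {}
    for c in c_range:
        u, a = pca_run(X, c, m=m)
        ubar = u / np.fmax(u.sum(axis=0, keepdims=True), 1e-12)        # Eq. (30)
        gpc = (ubar ** 2).sum() / n
        d2 = ((X[None, :, :] - a[:, None, :]) ** 2).sum(axis=2)
        sep = min(((a[i] - a[k]) ** 2).sum() for i in range(c) for k in range(c) if i != k)
        gxb = ((ubar ** m) * d2).sum() / n / max(sep, 1e-12)
        scores[c] = {"GPC": gpc, "GXB": gxb}
    return scores
```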
Table 2 Cluster validity for the Uniform-16 data set without noise c 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 FCM PCA PC PE FS XB PC PE FS XB GPC GPE GFS GXB 0.683 0.614 0.595 0.533 0.523 0.518 0.533 0.536 0.556 0.581 0.587 0.634 0.658 0.675 0.700 0.685 0.669 0.658 0.643 0.630 0.620 0.614 0.607 0.597 0.483 0.687 0.796 0.948 1.003 1.055 1.061 1.091 1.080 1.044 1.053 0.958 0.913 0.878 0.824 0.864 0.906 0.946 0.980 1.002 1.028 1.043 1.060 1.086 −1102 −2728 −3792 −3281 651 283 6022 15034 7430 10923 8667 49442 16237 6679 8160 10207 14768 13626 26246 30790 31237 789816 49237 43091 0.068 0.021 0.020 0.117 0.988 0.979 2.973 7.007 4.501 5.657 5.012 18.235 7.224 4.134 4.549 57.6 99.4 74.4 177.9 204.4 210.6 4584.3 318.6 277.6 0.178 0.211 0.258 0.308 0.356 0.405 0.450 0.496 0.542 0.588 0.633 0.678 0.722 0.764 0.807 0.848 0.890 0.932 0.974 1.014 1.055 1.096 1.136 1.174 0.293 0.224 0.249 0.293 0.357 0.409 0.433 0.454 0.490 0.520 0.530 0.540 0.549 0.560 0.568 0.586 0.603 0.618 0.624 0.637 0.650 0.649 0.661 0.661 −188 − 2051 − 3162 − 3512 − 3488 − 3569 − 3979 − 4373 − 4409 − 4496 − 4932 − 5271 − 5563 − 5980 − 6376 − 6396 − 6420 − 6437 − 6776 − 6779 − 6802 − 7462 − 7484 − 8145 2.42E−02 7.46E−03 5.76E−03 4.59E−02 4.36E−02 4.23E−02 4.15E−02 4.17E−02 4.15E−02 4.11E−02 4.07E−02 4.02E−02 4.02E−02 4.07E−02 4.11E−02 2.77E+06 3.71E+06 1.13E+07 1.65E+06 1.80E+08 2.22E+08 2.75E+09 3.66E+10 5.22E+12 0.862 0.919 0.957 0.854 0.790 0.725 0.700 0.681 0.667 0.652 0.659 0.677 0.693 0.706 0.723 0.709 0.697 0.694 0.678 0.676 0.676 0.654 0.639 0.616 0.221 0.146 0.097 0.251 0.367 0.485 0.547 0.589 0.622 0.660 0.659 0.641 0.622 0.611 0.588 0.616 0.639 0.657 0.680 0.692 0.705 0.732 0.754 0.784 3381 − 5506 − 9637 − 9165 − 7760 − 6637 − 6370 − 6135 − 5663 − 5211 − 5203 − 5466 − 5661 − 5831 − 5976 − 6012 − 6043 − 6102 − 5999 − 6081 − 6125 − 5791 − 5809 − 5393 5.79E−01 2.11E−01 1.06E−01 7.50E−01 5.33E−01 3.88E−01 3.45E−01 3.09E−01 2.34E−01 1.79E−01 1.57E−01 1.19E−01 9.84E−02 6.06E−02 3.82E−02 2.41E+06 3.02E+06 8.82E+06 1.21E+06 1.27E+08 1.51E+08 1.76E+09 2.21E+10 2.94E+12 M.-S. Yang, K.-L. 
Wu / Pattern Recognition 39 (2006) 5 – 21 15 Table 3 Cluster validity for the Uniform-16 data set with noise c 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 FCM PCA PC PE 0.686 0.621 0.592 0.531 0.521 0.524 0.518 0.525 0.537 0.565 0.583 0.592 0.617 0.645 0.666 0.647 0.629 0.616 0.605 0.594 0.587 0.576 0.568 0.561 0.478 0.676 0.798 0.951 1.010 1.041 1.092 1.111 1.116 1.077 1.060 1.055 1.008 0.949 0.906 0.952 0.995 1.034 1.06 1.099 1.109 1.137 1.157 1.173 FS − 1137 − 3104 −4095 −3801 − 26 17689 1456 14401 8376 13843 7005 10218 12857 10564 11596 10030 18611 13819 15389 25766 28204 21264 22978 44734 XB PC PE FS XB GPC GPE GFS GXB 0.079 0.023 0.026 0.089 0.844 4.161 1.661 6.342 4.585 5.565 4.092 5.116 6.045 4.923 5.134 60.7 108.8 77.5 100.4 153.6 152.1 126.8 134.5 225.5 0.198 0.217 0.248 0.304 0.355 0.405 0.451 0.489 0.527 0.567 0.605 0.651 0.700 0.736 0.772 0.818 0.861 0.903 0.947 0.987 1.027 1.062 1.096 1.129 0.274 0.265 0.260 0.307 0.377 0.434 0.459 0.478 0.509 0.536 0.545 0.564 0.581 0.587 0.591 0.605 0.628 0.650 0.662 0.682 0.701 0.698 0.707 0.706 −198 −1848 −3305 −3699 −3663 −3760 −4250 −4641 −4682 −4774 −5221 −5575 −5909 −6354 −6780 −7155 −7181 −7197 −7543 −7526 −7551 −8207 −8225 −8933 2.89E−02 1.69E−02 6.57E−03 6.39E−02 5.88E−02 5.60E−02 5.26E−02 4.96E−02 4.79E−02 4.65E−02 4.53E−02 4.56E−02 4.62E−02 4.59E−02 4.56E−02 3.31E+06 3.09E+07 6.02E+08 8.39E+08 9.38E+08 1.02E+09 1.06E+09 1.09E+09 1.36E+14 0.854 0.869 0.949 0.836 0.773 0.709 0.685 0.670 0.660 0.648 0.654 0.666 0.679 0.692 0.709 0.691 0.673 0.666 0.660 0.652 0.650 0.632 0.617 0.597 0.238 0.218 0.108 0.278 0.395 0.512 0.570 0.605 0.631 0.664 0.664 0.657 0.643 0.630 0.607 0.638 0.673 0.699 0.721 0.736 0.753 0.776 0.797 0.822 5423 − 4426 −10475 − 9832 − 8339 − 7050 − 6766 − 6568 − 6119 − 5698 − 5705 − 5940 − 6106 − 6325 − 6525 − 6392 − 6368 − 6394 − 6395 − 6470 − 6494 − 6217 − 6252 − 5826 7.75E−01 3.62E−01 1.07E−01 9.15E−01 6.28E−01 4.39E−01 3.65E−01 3.14E−01 2.42E−01 1.89E−01 1.68E−01 1.27E−01 1.02E−01 6.72E−02 4.66−02E 3.12E+06 2.69E+07 4.97E+08 6.57E+08 6.92E+08 7.23E+08 7.27E+08 7.18E+08 8.55E+13 Table 4 Initial values for Normal-4 data set and their orders Order X1 X2 X3 X4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 3 0 0 0 3 0 0 1 3 0 1 0 3 1 0 0 0 3 0 0 1 3 0 0 0 3 0 1 0 3 1 0 0 0 3 0 0 1 3 0 1 0 3 0 0 0 3 1 0 0 0 3 0 0 1 3 0 1 0 3 1 0 0 3 We now add 100 uniformly random noisy points to the Uniform-16 data set as shown in Fig. 9(b). The x-coordinate values of these noisy points are uniformly distributed over the interval [−0.5, 3.5] and the y-coordinate values of these noisy points are uniformly distributed over the interval [−0.5, 7.5]. These noisy points have an influence on FCM, and the validity indexes are influenced correspondingly as shown on the left portion of Table 3. However, as the right portion of Table 3 shows, the results of the generalized validity indexes by processing PCA are not influenced by these noisy points, and the GXB index still presents the best result which matches the structure of the data. The results of the original validity indexes by processing PCA are shown in the center of Table 3. In this example, PCA shows its robust property for noise, not only in the clustering results but also in the validity problems. Example 3. We implement the Normal-4 data set that was used in Example 1. The initial values and their orders are shown in Table 4. If c = 3, the initial cluster centers are the observations of orders 1, 2 and 3. 
If c = 4, the initial cluster centers are the observations of orders 1, 2, 3 and 4, etc. Both FCM and PCA are processed with the same initial values and m = 2. The validity results by processing FCM are shown in the left portion of Table 5. Only the FS index presents the result that matches the data structure. The users should be careful about the fact that the FS index often ended up in unreasonable results in the investigations 16 M.-S. Yang, K.-L. Wu / Pattern Recognition 39 (2006) 5 – 21 Table 5 Cluster validity for the Normal-4 data set c 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 FCM PCA PC PE FS XB PC PE FS XB GPC GPE GFS GXB 0.556 0.488 0.503 0.407 0.351 0.309 0.279 0.251 0.234 0.219 0.205 0.195 0.185 0.178 0.168 0.635 0.878 0.958 1.196 1.369 1.508 1.615 1.735 1.828 1.911 1.985 2.057 2.126 2.184 2.244 − 334 − 1145 − 2157 −1785 −1550 −1376 −1084 −1083 −1009 − 857 − 810 − 378 − 422 − 330 − 350 0.0281 0.0217 0.0224 0.0478 0.1178 0.2397 0.6435 0.4062 0.3469 0.5551 0.5960 1.1157 1.1008 1.4691 1.6172 0.123 0.145 0.167 0.185 0.194 0.204 0.216 0.226 0.231 0.238 0.247 0.254 0.257 0.262 0.270 0.237 0.290 0.346 0.395 0.443 0.483 0.526 0.563 0.595 0.622 0.655 0.691 0.719 0.741 0.768 −188 −467 −669 −709 −781 −878 −960 −995 −1074 −1151 −1200 −1158 −1154 −1190 −1208 1.62E −02 1.57E −02 1.64E −02 4.62E+05 3.83E+09 1.64E+11 4.89E+13 1.14E+18 2.93E+21 2.15E+22 2.16E+23 1.33E+13 1.49E+11 9.05E+09 8.83E+10 0.819 0.839 0.895 0.786 0.681 0.578 0.468 0.429 0.393 0.356 0.317 0.310 0.292 0.276 0.258 0.289 0.286 0.199 0.359 0.507 0.653 0.811 0.909 0.998 1.089 1.189 1.237 1.293 1.352 1.419 3939 350 − 2008 − 2229 − 2022 − 1578 − 1125 − 1150 − 1086 − 956 − 830 − 783 − 688 − 627 − 581 5.12E−01 3.20E−01 2.07E−01 4.96E+06 3.56E+10 1.32E+12 3.21E+14 6.85E+18 1.65E+22 1.13E+23 1.02E+24 5.98E+13 6.44E+11 3.76E+10 3.45E+11 Table 6 Cluster validity for the Normal-4 data set with uniform noise c 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 FCM PCA PC PE FS XB PC PE FS XB GPC GPE GFS GXB 0.524 0.428 0.444 0.461 0.377 0.324 0.289 0.257 0.231 0.214 0.199 0.189 0.190 0.179 0.169 0.669 0.964 1.057 1.084 1.302 1.472 1.597 1.710 1.829 1.922 2.006 2.078 2.085 2.161 2.225 − 133 − 690 − 1862 − 4684 − 3691 - 3068 − 2621 − 2322 − 2066 − 1898 − 1604 − 1530 − 2084 − 1933 − 1818 0.0453 0.0352 0.0166 0.0200 0.0420 0.0809 0.2113 0.2853 0.3719 0.4112 0.5435 0.4864 0.4489 0.4317 0.6594 0.219 0.252 0.295 0.330 0.357 0.380 0.408 0.432 0.449 0.465 0.485 0.504 0.516 0.528 0.545 0.427 0.494 0.555 0.614 0.673 0.723 0.778 0.829 0.881 0.927 0.977 1.024 1.072 1.114 1.160 373 − 289 − 777 − 965 − 1161 − 1396 − 1603 − 1712 − 1830 − 1985 − 2121 − 2201 − 2289 − 2408 − 2510 1.05E−01 6.30E−02 5.68E−02 5.83E+06 9.25E+06 2.83E+07 2.11E+07 3.91E+07 4.32E+08 5.62E+08 7.10E+08 2.75E+09 3.79E+09 6.81E+09 1.36E+10 0.659 0.664 0.710 0.641 0.570 0.494 0.416 0.385 0.355 0.323 0.292 0.275 0.259 0.242 0.225 0.512 0.594 0.560 0.678 0.790 0.901 1.016 1.096 1.174 1.252 1.333 1.394 1.453 1.513 1.575 11058 7536 4856 3935 3242 2550 2300 2101 1887 1648 1556 1471 1365 1244 1193 1.65E+00 9.13E−01 6.92E−01 6.40E+07 8.75E+07 2.18E+08 1.37E+08 2.36E+08 2.38E+09 2.74E+09 3.13E+09 1.15E+10 1.48E+10 2.46E+10 4.60E+10 of many researchers (see Refs. [14,17]). The results of the original and generalized validity indexes by processing PCA are shown in the center, and on the right portion in Table 5 respectively. As we expect, the original validity indexes lose their efficiency in the possibilistic clustering environment. 
However, the GPC, GPE and GXB indexes give the best results, matching the data structure.

We now add 100 uniform noisy points, each coordinate of which is generated from a uniform distribution over the interval [0, 10]. The values of the validity indexes obtained by processing FCM and PCA are shown in Table 6. In this noisy environment, the XB index with FCM gives a good result. However, it does not always work well in the noise-free environment, as Table 5 shows; this may be because the data set is randomly generated. Thus, the results of the XB index obtained by processing FCM are still influenced by the noise. Both GPC and GXB give good results that match the data structure and are not influenced by the noise. More examples, including real data, are presented in the next section.

4. Examples with real data

In this section, the Iris real data set (see Refs. [24,25]) and two other real data sets from [26] are implemented. The first is the Iris data set, which has n = 150 points in an s = 4-dimensional space and represents three clusters, each with 50 points (see Refs. [24,25]). Two clusters have substantial overlap in Iris, so one can argue for c = 2 or 3 for Iris. The second real data set is Glass [26], which has n = 214 points in an s = 9-dimensional space and presents six clusters. The third real data set is Vowel [26], which has n = 990 points in an s = 10-dimensional space and presents eleven clusters.

Most clustering problems are solved by minimizing constructed dispersion measures. For a data set in an s-dimensional space, each dimension presents one characteristic of the data, and the degrees of dispersion of the characteristics are usually different. Thus, minimizing the total dispersion measure discards the effects of some characteristics, especially those with a small degree of dispersion. This situation occurs frequently, especially in high-dimensional data sets. To make full use of the information in all characteristics, we normalize the data set. Suppose we have a data set X = {x_1, ..., x_n} in an s-dimensional space with each x_j = (x_{j1}, ..., x_{js}); we normalize the data by replacing x_{jk} with

$$x'_{jk}=\frac{x_{jk}-\sum_{l=1}^{n}x_{lk}/n}{\sqrt{\sum_{l=1}^{n}\big(x_{lk}-\sum_{l=1}^{n}x_{lk}/n\big)^{2}/(n-1)}},\quad k=1,\ldots,s,\ j=1,\ldots,n. \qquad (31)$$

After normalization, each characteristic of the data set has a common sample mean and dispersion measure. We normalize the three real data sets before analyzing them. We analyze the cluster validity problem using the original validity indexes (PC and XB) and the generalized validity indexes (GPC and GXB) by processing FCM and PCA, respectively. We also examine the influence of the parameter m on the validity problem in both FCM and PCA. The chosen m values for FCM are 1.1, 1.5, 2, 2.5 and 3. The chosen

Table 7. The theoretical upper bound of m for the FCM clustering algorithm (original data from Yu et al. [18])

  Data set    n     s     c     Upper bound of m
  Iris        150   4     3     infinity
  Glass       214   9     6     3.1726
  Vowel       990   10    11    1.7787
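The feature standardization of Eq. (31) is the usual per-feature z-score. A minimal sketch, with our own function name, follows.

```python
import numpy as np

def standardize(X):
    """Eq. (31): center every feature and divide by its sample standard deviation
    (ddof=1), so all characteristics share a common mean and dispersion."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```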
Table 8 Cluster validity for the normalized Iris data c FCM m = 1.1 m = 1.5 PC XB 2 3 4 5 6 7 8 9 0.999 0.981 0.976 0.976 0.976 0.980 0.976 0.983 5.71E 2.32E 1.25E 1.57E 2.27E 4.21E 6.33E 5.19E c PCA m= 1 GPC GXB m = 1.5 GPC 0.880 0.637 0.481 0.381 0.329 0.274 0.242 0.221 2.61E − 01 1.49E + 07 2.68E + 09 5.31E + 09 7.76E + 09 4.47E + 11 1.86E + 11 2.92E + 11 0.971 0.660 0.495 0.387 0.332 0.277 0.244 0.222 2 3 4 5 6 7 8 9 PC + 17 + 17 + 16 + 16 + 16 + 15 + 15 + 15 0.952 0.887 0.848 0.801 0.780 0.787 0.761 0.767 m= 2 XB m = 2.5 m= 3 PC XB PC XB PC 0.832 0.707 0.635 0.570 0.513 0.498 0.486 0.474 0.251 1.222 2.098 2.845 8.095 9.988 11.375 8.862 0.729 0.573 0.481 0.416 0.366 0.341 0.312 0.290 0.115 0.492 0.708 0.952 2.173 2.573 4.270 6.389 0.660 0.490 0.392 0.330 0.283 0.257 0.230 0.207 0.073 0.270 0.328 0.400 0.762 0.944 1.817 17.271 GXB m= 2 GPC GXB m = 2.5 GPC GXB m= 3 GPC GXB 1.44E − 01 6.15E + 05 3.49E + 09 3.67E + 09 2.14E + 09 8.58E + 10 6.24E + 10 2.79E + 10 0.988 0.665 0.498 0.388 0.333 0.278 0.244 0.222 1.53E − 01 8.74E + 04 8.76E + 08 2.72E + 09 1.12E + 10 1.66E + 12 4.00 E + 13 3.71E + 15 0.995 0.666 0.500 0.389 0.486 0.439 0.468 0.459 1.69E − 01 1.69E + 04 1.82E + 17 1.66E + 29 2.52E + 09 3.61E + 11 4.23E + 04 1.29E + 11 0.998 0.887 0.801 0.645 0.593 0.546 0.519 0.501 1.92E − 01 4.30E − 01 1.67E + 00 2.53E + 04 1.91E + 04 4.13E + 30 6.97E + 25 1.42E + 27 10.3 28.7 50.1 54.8 81.6 148.3 243.6 200.4 XB 18 M.-S. Yang, K.-L. Wu / Pattern Recognition 39 (2006) 5 – 21 Table 9 Cluster validity for the normalized Glass data c FCM m = 1.1 m = 1.5 PC XB 2 3 4 5 6 7 8 9 10 11 12 0.981 0.951 0.945 0.954 0.962 0.961 0.944 0.971 0.960 0.960 0.959 3.91E 3.95E 2.72E 1.21E 1.20 E 1.22E 3.67E 1.96E 3.32E 6.90 E 7.55E c PCA m= 1 GPC 2 3 4 5 6 7 8 9 10 11 12 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100 0.131 0.122 PC + 00 + 03 + 08 + 14 + 14 + 14 + 14 + 14 + 14 + 14 + 14 0.830 0.773 0.735 0.743 0.748 0.685 0.678 0.675 0.620 0.614 0.618 m= 2 XB 0.060 0.149 1.013 1.053 0.988 3.001 4.892 7.819 38.012 46.579 70.241 m = 1.5 GXB 6.12E 2.30 E 9.27E 4.74E 1.03E 2.44E 1.66E 2.28E 8.36E 1.08E 1.00 E + 08 + 08 + 09 + 09 + 10 + 11 + 12 + 14 + 21 + 12 + 12 m = 2.5 XB PC XB PC XB 0.617 0.540 0.487 0.447 0.370 0.373 0.332 0.318 0.288 0.283 0.288 0.133 0.194 0.288 0.247 1.650 2.114 3.292 3.287 2.953 3.480 3.676 0.500 0.409 0.335 0.261 0.218 0.188 0.170 0.172 0.155 0.141 0.115 2.13E + 03 1.88E − 01 1.92E − 01 1.43E + 02 1.17E + 02 8.26E + 06 4.38E + 06 6.35E + 04 2.49E + 04 6.08E + 04 1.31E + 05 0.500 0.340 0.278 0.207 0.173 0.148 0.139 0.123 0.111 0.108 0.092 4.52E+02 6.14E+01 9.96E+03 2.58E+03 3.85E+04 5.18E+06 1.65E+03 2.71E+03 1.43E+05 2.34E+12 2.02E+07 GPC GXB GPC GXB 0.500 0.333 0.356 0.386 0.387 0.317 0.275 0.346 0.445 0.420 0.457 2.90 E 1.16E 8.81E 7.88E 1.77E 5.69E 1.75E Inf 4.02E 1.78E 2.77E 0.500 0.333 0.550 0.587 0.595 0.498 0.535 0.531 0.563 0.569 0.592 8.59E 3.13E 1.40 E 5.71E 6.36E 5.56E 5.96E 4.16E 7.46E 1.17E 1.39E m= 2 GPC GXB 0.500 0.333 0.353 0.275 0.228 0.196 0.174 0.211 0.283 0.270 0.338 2.68E 1.45E 4.24E 1.29E 9.80 E 2.47E 4.65E 1.60 E 1.48E 1.88E 2.26E + 08 + 09 + 08 + 09 + 11 + 13 + 22 + 14 + 15 + 15 + 15 m= 3 PC m = 2.5 + 08 + 08 + 08 + 08 + 08 + 16 + 26 + 19 + 18 + 19 m= 3 + 08 + 10 + 12 + 06 + 06 + 31 + 13 + 11 + 11 + 12 + 12 GPC GXB 0.836 0.543 0.821 0.849 0.866 0.640 0.556 0.552 0.581 0.584 0.577 1.06E+00 5.99E+06 1.09E+00 9.47E−01 9.02E−01 3.55E+11 9.52E+11 4.55E+12 3.27E+16 1.69E+15 2.30E+06 m values for PCA are 1, 1.5, 2, 2.5 and 3. 
In FCM, the most suggested m values range within [1.5, 2.5]. Moreover, Yu et al. [18] gave a theoretical upper boundary of m for FCM. They showed that an m value which is larger than the theoretical upper boundary will have only one optimizer (i.e. the sample mean) of the FCM objective function, no matter what value c has. The theoretical upper boundaries for three real data are shown in Table 7 according to Yu et al. [18]. We shall also consider this theoretical result in our simulations. sample mean being a unique optimizer when m is larger. This phenomenon of sample mean being a unique optimizer may occur in PCA when m is too small. However, the theoretical upper boundary of m for the normalized Iris data is infinity as shown in Table 7. Therefore, the sample mean will not be a unique optimizer of JF CM and the validity index PC will not be equal to 1/c for all c. Thus, both original and generalized validity indexes give good results in the Iris data set. Example 4 (Iris data set). The Iris data set (see Refs. [24,25]) has n = 150 points in an s = 4-dimensional space. It consists of three clusters. Two clusters have substantial overlapping. Table 8 shows the cluster validity results for the normalized Iris data set implemented by FCM and PCA. The clustering algorithms with the selected parameter m and the validity indexes PC, XB, GPC and GXB are combined for different cluster numbers c. In FCM, both PC and XB work well when m = 1.5, 2, 2.5 and 3. In PCA, both GPC and GXB work well for all specified m values. In general, we should use FCM carefully to avoid the situation of the Example 5 (Glass data set). The Glass data set from Blake and Merz [26] has n = 214 points in an s = 9-dimensional space. It consists of six clusters. Table 9 shows the cluster validity for the normalized Glass data set. For given m values from 1.1 to 3, PC and XB always show that c = 2 or 3 is optimal.Note that, when m = 2.5 and 3, P C(2) = 1/2. We suspect that the sample mean is a unique optimizer when c = 2 and we do not adopt the result provided by the PC and XB when c = 2, or we say that c = 2 is not suitable for this data set. Generally speaking, we will not consider the results in the cases of P C(c) = 1/c. The M.-S. Yang, K.-L. 
Wu / Pattern Recognition 39 (2006) 5 – 21 19 Table 10 Cluster validity for the normalized Vowel data c FCM m = 1.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 c 2 3 4 5 6 7 8 9 10 11 12 13 14 15 m = 1.5 m= 2 m = 2.5 PC XB PC XB PC XB 0.958 0.933 0.942 0.923 0.917 0.924 0.910 0.920 0.912 0.915 0.921 0.918 0.919 0.924 0.0002 0.0001 0.0002 0.0136 0.1738 0.1207 0.0145 0.0260 0.1746 0.0069 0.0096 0.0126 0.0172 0.0261 0.608 0.508 0.463 0.416 0.377 0.335 0.340 0.319 0.310 0.290 0.306 0.293 0.294 0.296 0.008 0.011 0.010 0.020 0.026 0.078 0.024 0.165 0.094 0.261 0.105 0.550 230 3.524 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100 0.091 0.083 0.077 0.071 0.067 3.89E 1.63E 9.38E 9.49E 1.42E 6.77E 5.56E 2.16E 7.96E 3.57E 1.02E 1.82E 6.41E 5.70 E PCA m= 1 m =2 m = 1.5 GPC GXB 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100 0.091 0.083 0.077 0.071 0.067 2.43E 1.48E 1.37E 1.60 E 7.80 E 1.11E 5.21E 2.55E 4.20 E 2.33E 1.35E 5.90 E 7.46E 4.00 E + + + + + + + + + + + + + + 09 09 09 09 08 09 09 09 09 09 10 09 09 09 + 02 + 03 + 03 + 04 + 06 + 05 + 05 + 07 + 07 + 07 + 08 + 11 + 09 + 08 GPC GXB 0.500 0.333 0.250 0.200 0.167 0.143 0.187 0.170 0.157 0.147 0.205 0.254 0.243 0.212 7.71E 1.67E 2.85E 2.04E 1.08E 7.48E 4.43E 9.62E 6.54E 3.50 E 1.56E 1.22E 2.93E Inf + + + + + + + + + + + + + 08 08 08 10 11 12 11 12 18 24 21 26 27 GXB 0.500 0.688 0.685 0.434 0.402 0.403 0.422 0.410 0.404 0.397 0.381 0.375 0.365 0.360 1.90 E 8.09E 3.58E 8.84E 5.00 E 8.51E 8.63E 2.87E 3.69E 1.07E 8.35E 4.45E 6.20 E 8.07E Example 6 (Vowel data set). The Vowel data set from Blake and Merz [26] has n = 990 points in an s = 10-dimensional space that has 11 clusters. Table 10 shows the cluster validity PC XB PC XB 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100 0.091 0.083 0.077 0.071 0.067 3761 27772 20427 152493 516614 6261341 5671117 2705403 1863116 2507218 3085955 2512509 1664479 1440508 0.500 0.333 0.250 0.200 0.167 0.143 0.125 0.111 0.100 0.091 0.083 0.077 0.071 0.067 12651 46664 32445 99395 435968 836560 685560 425636 580590 331552 314035 357623 318837 263237 m = 2.5 GPC same phenomenon can also be found in the GPC index. If GP C(c) = 1/c, then these c clusters are coincidental (may or may not be the sample mean). As m = 1 in PCA, GP C(c) = 1/c for all c, and we discard this result. When m = 1.5, 2 and 2.5, GP C(c) = 1/c as c = 2 and 3. Therefore, we only consider the case in c > 3. As m increases, the results of the GPC and GXB give the correct cluster number (c = 6) of the data. In the normalized Glass data, the algorithm PCA with the generalized validity indexes works better than the algorithm FCM with the original validity indexes. m= 3 + 07 + 10 + 11 + 21 + 14 + 09 + 08 + 10 + 11 + 09 + 11 + 20 + 31 + 31 m= 3 GPC GXB 0.500 0.493 0.579 0.574 0.431 0.467 0.484 0.456 0.463 0.487 0.452 0.437 0.422 0.412 9.45E 3.87E 1.23E 4.15E 2.48E 1.57E 2.50 E 1.13E 6.04E 8.03E 1.97E 2.74E 5.58E 2.30 E + 05 + 06 + 07 + 07 + 09 + 10 + 10 + 11 + 12 + 14 + 16 + 17 + 30 + 31 GPC GXB 0.500 0.553 0.619 0.644 0.610 0.524 0.567 0.545 0.547 0.488 0.450 0.435 0.424 0.385 3.18E 2.63E 7.83E 2.54E 2.04E 7.88E 5.04E 2.02E 2.45E 3.03E 2.69E 4.83E 3.82E 6.85E + + + + + + + + + + + + + + 06 07 08 00 00 14 11 10 10 10 16 19 28 30 for the normalized Vowel data set. When m = 2, 2.5 and 3, the sample mean is a unique optimizer of JF CM and hence P C(c) = 1/c for all c. Thus, we do not consider the results obtained by FCM with m=2, 2.5 and 3. These results exactly match the theoretical upper boundary 1.7787 for m which are shown in Table 7. 
5. Conclusions

We proposed a new PCA and solved the problem of validating the clusters obtained by a PCA. The proposed generalized validity indexes work well in both fuzzy and possibilistic clustering environments, and the proposed PCA objective function combines the FCM objective function with the validity indexes PC and PE so that exponential membership functions are used to describe the degrees of belonging. We also embedded the cluster number c into the membership functions to make PCA more powerful for various data sets, especially for solving cluster validity problems. The cluster centers obtained by PCA maximize the sum of the membership functions and correspond to the peaks of the potential function, or mountain function. By combining the generalized validity indexes with PCA, an unsupervised PCA is obtained. The numerical examples showed that the generalized validity indexes give more accurate validity results than the original validity indexes in a noisy environment, because possibilistic clustering is more robust than fuzzy clustering when noisy points are present. Moreover, we presented the robustness properties of the proposed PCA based on the influence function. In the analysis of the three real data sets, the generalized validity indexes with PCA worked better than the original validity indexes with FCM. According to the simulation results, we recommend the fuzzifier m = 1.5 in FCM for large, high-dimensional data, and m = 2 in PCA. Note that the exponential membership function of PCA with a too small or too large m value cannot give a good explanation of the degrees of belonging. Finally, the results of PCA depend on the initialization, as do those of any clustering algorithm. A general initialization technique for FCM and PCA will be a topic of our further research.

Acknowledgements

The authors are grateful to the anonymous reviewers for their comments, which improved the presentation of the paper. This work was supported in part by the National Science Council of Taiwan under Grant NSC-93-2118-M-033-001.

References

[1] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[3] M.S. Yang, A survey of fuzzy clustering, Math. Comput. Modell. 18 (1993) 1–16.
[4] F. Höppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, Wiley, New York, 1999.
[5] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters, J. Cybern. 3 (1974) 32–57.
[6] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (1993) 98–110.
[7] R. Krishnapuram, H. Frigui, O.
Nasraoui, Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation, IEEE Trans. Fuzzy Syst. 3 (1995) 29–60.
[8] T.A. Runkler, J.C. Bezdek, Function approximation with polynomial membership functions and alternating cluster estimation, Fuzzy Sets Syst. 101 (1999) 207–218.
[9] T.A. Runkler, J.C. Bezdek, Alternating cluster estimation: a new tool for clustering and function approximation, IEEE Trans. Fuzzy Syst. 7 (1999) 377–393.
[10] J.C. Bezdek, Numerical taxonomy with fuzzy sets, J. Math. Biol. 1 (1974) 57–71.
[11] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. 3 (1974) 58–73.
[12] Y. Fukuyama, M. Sugeno, A new method of choosing the number of clusters for the fuzzy c-means method, Proceedings of the Fifth Fuzzy Systems Symposium, 1989, pp. 247–250.
[13] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841–847.
[14] N. Zahid, M. Limouri, A. Essaid, A new cluster-validity for fuzzy clustering, Pattern Recognition 32 (1999) 1089–1097.
[15] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 773–781.
[16] K.L. Wu, M.S. Yang, Alternative c-means clustering algorithms, Pattern Recognition 35 (2002) 2267–2278.
[17] N.R. Pal, J.C. Bezdek, On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst. 3 (1995) 370–379.
[18] J. Yu, Q. Cheng, H. Huang, Analysis of the weighting exponent in the FCM, IEEE Trans. Syst. Man Cybern. Part B 34 (2004) 634–638.
[19] R. Krishnapuram, J.M. Keller, The possibilistic c-means algorithm: insights and recommendations, IEEE Trans. Fuzzy Syst. 4 (1996) 385–393.
[20] M. Barni, V. Cappellini, A. Mecocci, Comments on "A possibilistic approach to clustering", IEEE Trans. Fuzzy Syst. 4 (1996) 393–396.
[21] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[22] R.R. Yager, D.P. Filev, Approximate clustering via the mountain method, IEEE Trans. Syst. Man Cybern. 24 (1994) 1279–1284.
[23] P.J. Huber, Robust Statistics, Wiley, New York, 1991.
[24] E. Anderson, The irises of the Gaspé Peninsula, Bull. Am. Iris Soc. 59 (1935) 2–5.
[25] J.C. Bezdek, J.M. Keller, R. Krishnapuram, L.I. Kuncheva, N.R. Pal, Will the real Iris data please stand up?, IEEE Trans. Fuzzy Syst. 7 (1999) 368–369.
[26] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998. Available from: http://www.ics.uci.edu/~mlearn/MLRepository.html.

About the Author—MIIN-SHEN YANG received his BS degree in mathematics from Chung Yuan Christian University, Chungli, Taiwan, in 1977, his MS degree in applied mathematics from National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and his Ph.D. degree in statistics from the University of South Carolina, Columbia, USA, in 1989. In 1989, he joined the faculty of the Department of Applied Mathematics at Chung Yuan Christian University as an Associate Professor, where he has been a Professor since 1994. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle. His current research interests include applications of statistics, fuzzy clustering, pattern recognition, and neural fuzzy systems. Dr. Yang is an Associate Editor of the IEEE Transactions on Fuzzy Systems.
About the Author—KUO-LUNG WU received his BS degree in mathematics in 1997, and his MS and Ph.D. degrees in applied mathematics in 2000 and 2003, respectively, all from Chung Yuan Christian University, Chungli, Taiwan. Since 2003, he has been an Assistant Professor in the Department of Information Management at Kun Shan University of Technology, Tainan, Taiwan. He is a member of the Phi Tau Phi Scholastic Honor Society of Taiwan. His research interests include fuzzy theory, cluster analysis, pattern recognition, and neural networks.