Fundamenta Informaticae 111 (2011) 81–90
DOI 10.3233/FI-2011-555
IOS Press

A Novel Multimodal Probability Model for Cluster Analysis

Jian Yu∗†, Dept. of Computer Science, Beijing Jiaotong University, Beijing, 100044, China, [email protected]
Miin-Shen Yang, Dept. of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan, [email protected]
Pengwei Hao, Center of Information Science, Peking University, Beijing, 100871, China, [email protected]

Abstract. Cluster analysis is a tool for data analysis. It is a method for partitioning a data set into clusters so that points in the same group are as similar as possible and points in different groups are as dissimilar as possible. In general, there are two ways to use probability models for cluster analysis: mixture distributions and the classification maximum likelihood method. However, the probability distributions corresponding to most clustering algorithms, such as fuzzy c-means, possibilistic c-means and mode-seeking methods, have not yet been found. In this paper, we construct a multimodal probability distribution model and then present the relationships between many clustering algorithms and the proposed model via maximum likelihood estimation. Moreover, we also give the theoretical properties of the proposed multimodal probability distribution.

Keywords: Cluster analysis, probability density function

∗ Address for correspondence: Dept. of Computer Science, Beijing Jiaotong University, Beijing, 100044, China
† This work is partially supported by NSFC Grant 6087503, 90820013, 61033013; 973 Program Grant 2007CB311002; Beijing Natural Science Foundation (Grant No. 4112046); and the Fundamental Research Funds for the Central Universities (Grant No. 2009JBZ006-1).

1. Introduction

Cluster analysis is an important tool for data analysis. It is a branch of statistical multivariate analysis and also an unsupervised learning method in pattern analysis and machine intelligence. Nowadays, cluster analysis has been widely applied in many areas, such as image processing, data mining, biology, medicine, economics and marketing [5]. In general, mixture distribution models are popular likelihood-based approaches to using a probability model for cluster analysis [9]. In finite mixtures, the data are assumed to conform to a mixture of probability distributions, so the likelihood function for clustering can be defined based on the mixture distributions. Generally, the Gaussian mixture is widely used for continuous data, and the multivariate multinomial mixture (or latent class model) for categorical data [3][9]. The classification maximum likelihood (CML) method is another notable likelihood-based approach, where the cluster prototypes are the parameters of the distribution [2][10]. In the CML method, the data set is supposed to be composed of c different clusters $B_1, B_2, \ldots, B_c$, and the probability distribution of a point x from the i-th subpopulation is $h_i(x|\theta_i)$ with parameters $\theta_i$, $i = 1, 2, \ldots, c$. In fact, the CML method can induce the well-known C-means (or K-means) clustering algorithm [7], and it has been extended to the fuzzy CML [12]. In the literature, mixture distribution and CML models are the two popular probability models for cluster analysis. It is known that partitional clustering plays an essential role in cluster analysis [5]. Most partitional clustering algorithms are based on their own objective functions rather than on a probability model.
In the literature, numerous partitional clustering algorithms have been developed based on minimizing the mean square error or its variants, such as C-means, FCM and GCM [1][7][8][14]. On the other hand, when the clustering method is a mode-seeking approach, the cluster prototypes are taken to be the peaks of the density function of the data points; in this case, the objective function is the data density function. Examples of such mode-seeking methods are the mean shift [4], the mountain method [11], possibilistic C-means (PCM) [6], and the similarity-based clustering method (SCM) [13]. There have been some efforts to find an appropriate probability distribution model that can be associated with most clustering methods. For example, Windham designed a complex parametric family of probability distribution functions by considering the objective function and its derivative with respect to the cluster prototypes. However, it is very difficult to find a closed form associated with a particular clustering method, as Windham himself noted in [15]. The probability distributions corresponding to most clustering algorithms have not yet been found. In this paper, we propose a multimodal probability distribution model. The relationships between many clustering methods and the proposed probability distribution model can then be established via maximum likelihood estimation.

The remainder of the paper is organized as follows. In Section 2, we review Windham's model and then propose a novel multimodal probability distribution model, which takes finite mixtures as special cases. Under some condition, we prove that the proposed probability model leads to GCM via the maximum likelihood method. In this way, the probability distributions for many partitional clustering algorithms, such as C-means and fuzzy C-means, are found. Moreover, we also study the theoretical properties of the proposed PDF. In Section 3, we study the properties of the proposed probability model and interpret the properties of C-means and fuzzy C-means by the proposed model. In Section 4, we make conclusions and discuss future research on the proposed probability model and its applications.

2. The Proposed Multimodal Probability Model

As we know, there are two ideas for using a probability model in cluster analysis. One assumes that the c clusters are generated from c different probability distributions; a typical example is the classification maximum likelihood (CML) [10][2], and some generalized results for CML can be found in [12][16]. The other assumes that all the points are independently drawn from a single probability distribution. In general, finite mixture distributions are the most used probability models for cluster analysis [9], for which the expectation-maximization (EM) algorithm is usually used for parameter estimation. In [15], Windham made a first attempt to find a probability distribution associated with cluster analysis other than CML and finite mixture models. The probability clustering model proposed by Windham [15] can be stated as follows. Suppose that $X = \{x_1, x_2, \ldots, x_n\}$ is a data set in $\mathbb{R}^s$ and that X is divided into c clusters represented by the cluster prototypes $v = \{v_1, v_2, \ldots, v_c\} \in \mathbb{R}^{c\times s}$. Then $v = \{v_1, v_2, \ldots, v_c\}$ can be estimated by minimizing the objective function $\sum_{k=1}^{n} \xi(x_k, v)$; that is, the estimate $\hat{v}$ can be obtained by solving the equation $\sum_{k=1}^{n} D(x_k, v) = 0$, where $D(x, v)$ is the derivative of $\xi(x, v)$. In this paper, we call this the Windham model.
To obtain a family of probability distribution functions (pdf) parameterized by $v = \{v_1, v_2, \ldots, v_c\}$ that is close to $\exp(-\xi(x, v))$ and satisfies $E(D(x, v)) = 0$, Windham [15] gave the following theorem.

Theorem 2.1. (Windham [15]) For fixed $v \in \Theta$, suppose that $\phi: \mathbb{R}^{c\times s} \to \mathbb{R}$, defined by $\phi(\lambda) = \int \exp(-\xi(x, v) - \lambda^T D(x, v))\,dx$, exists on an open set in $\mathbb{R}^{c\times s}$ and attains a global minimum at $\lambda_0$ satisfying $\int D(x, v)\exp(-\xi(x, v) - \lambda_0^T D(x, v))\,dx = 0$. Then there is a unique pdf of the form $p^*(x, v) = \exp(-\xi(x, v) - \lambda_0^T D(x, v))/\phi(\lambda_0)$ that minimizes $\delta(p) = E_p(\xi + \log p)$, where p is any pdf for which $E_p(D(x, v)) = 0$ and $D(x, v)$ is the derivative of $\xi(x, v)$.

As Windham [15] mentioned, estimating $\lambda_0$ is difficult because it requires numerical integration in high dimension. Except for $\lambda_0 = 0$, such a probability distribution does not seem to have a close relation to the ML method, since the pdf $p^*(x, v) = \exp(-\xi(x, v) - \lambda_0^T D(x, v))/\phi(\lambda_0)$ contains a new parameter $\lambda_0$ that does not appear in the function $\xi(x, v)$. Therefore, it is not easy to use such a probability distribution to analyze the properties of a corresponding clustering algorithm. Nevertheless, this effort provided the motivation for a further study.

Suppose that all the points in the data set $X = \{x_1, x_2, \ldots, x_n\}$ are independently drawn from a pdf $h(x, v)$, where the prototypes $v = \{v_1, v_2, \ldots, v_c\}$ are part of the parameters of the distribution $h(x, v)$. Then the log-likelihood function is
$$\ln P(X, v) = \sum_{k=1}^{n} \ln h(x_k, v). \qquad (1)$$
Thus, an ML clustering based on the pdf $h(x, v)$ can be created by maximizing the log-likelihood function (1) with respect to the prototypes $v = \{v_1, v_2, \ldots, v_c\}$. If a specific structure of $h(x, v)$ is given, then the ML estimates of the cluster prototypes can be obtained, and a clustering algorithm is thereby constructed. Conversely, when a clustering algorithm is given, it may offer some hints about its assumed probability distribution.

Recently, a unified framework for partitional clustering (for continuous variables), called the general C-means (GCM), was proposed in [14]. It has been shown that the GCM clustering model not only includes many partitional clustering algorithms (see [14]), but also has many useful characteristics. First, let us recall the GCM objective function:
$$R = \frac{1}{n}\sum_{k=1}^{n} a(x_k)\, f\!\left(\sum_{i=1}^{c} \alpha_i\, g(\rho_i(x_k, v_i))\right) \qquad (2)$$
where $f(g(t)) = t$ for all $t \ge 0$, $\sum_{i=1}^{c}\alpha_i = 1$, and $\alpha_i > 0$ for all i. The function f is called the GCM generator. The GCM clustering algorithm is derived by minimizing the GCM objective function (2). Although the GCM was created for continuous variables in [14], it can also be applied to categorical variables when $\rho_i(x, v_i)$ is well defined for categorical data.
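As an illustration only, the following Python sketch evaluates the GCM objective (2). It assumes $a(x) \equiv 1$ and squared Euclidean distances $\rho_i(x, v_i) = \|x - v_i\|^2$, and it uses the generator pair $f(t) = -\beta^{-1}\ln t$, $g(t) = \exp(-\beta t)$ as one admissible choice satisfying $f(g(t)) = t$; the function names and toy data are ours, not the paper's.

```python
import numpy as np

def gcm_objective(X, V, alpha, f, g):
    """Evaluate the GCM objective R in (2), assuming a(x) = 1 and
    rho_i(x, v_i) = ||x - v_i||^2 (an illustrative choice)."""
    # rho[k, i] = squared Euclidean distance between point k and prototype i
    rho = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    inner = (alpha[None, :] * g(rho)).sum(axis=1)   # sum_i alpha_i g(rho_i(x_k, v_i))
    return f(inner).mean()                          # (1/n) sum_k f(...)

# One admissible generator pair: f(t) = -ln(t)/beta, g(t) = exp(-beta t),
# so that f(g(t)) = t for every t >= 0.
beta = 1.0
f = lambda t: -np.log(t) / beta
g = lambda t: np.exp(-beta * t)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (50, 1)), rng.normal(4, 1, (50, 1))])
V = np.array([[-4.0], [4.0]])
alpha = np.array([0.5, 0.5])
print(gcm_objective(X, V, alpha, f, g))
```

Any other generator pair (f, g) with $f(g(t)) = t$ could be substituted without changing the rest of the sketch.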
Comparing (1) with (2), we find that minimizing (2) is equivalent to maximizing (1) if we set
$$\ln h(x, v) = -a(x)\, f\!\left(\sum_{i=1}^{c}\alpha_i\, g(\rho_i(x, v_i))\right).$$
Thus, $h(x, v)$ should have the form $\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$. To construct a pdf $h(x, v)$ with this form, we need the following theorem.

Theorem 2.2. If $f(g(t)) = t$ for all $t \in \mathbb{R}$, $a(x) \ge 0$ for all $x \in \mathbb{R}^s$, $A = \int_{\mathbb{R}^s} \exp(-\tau\, a(x) \min_{1\le i\le c}\rho_i(x, v_i))\,dx < +\infty$, and $B = \int_{\mathbb{R}^s} \exp(-\tau\, a(x) \max_{1\le i\le c}\rho_i(x, v_i))\,dx < +\infty$, then
$$0 < B \le N(v) = \int_{\mathbb{R}^s} \exp\!\left(-a(x)\, f\!\left(\sum_{i=1}^{c}\alpha_i\, g(\rho_i(x, v_i))\right)\right) dx \le A.$$
That is, the term $(N(v))^{-1}$ exists, so that the function $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$ constitutes a pdf.

Proof: We first claim that $\min_{1\le i\le c}\rho_i(x, v_i) \le f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))) \le \max_{1\le i\le c}\rho_i(x, v_i)$. Since $f(g(t)) = f(f^{-1}(t)) = t$, we have $1 = \frac{d}{dt} t = \frac{d}{dt} f(g(t)) = f'(g(t))\, g'(t)$. Thus only two cases are possible: 1) $g'(t) > 0$ and $f'(t) > 0$; 2) $g'(t) < 0$ and $f'(t) < 0$. Without loss of generality, suppose $g'(t) > 0$ and $f'(t) > 0$. Because $\max_{1\le i\le c}\rho_i(x, v_i) \ge \rho_i(x, v_i) \ge \min_{1\le i\le c}\rho_i(x, v_i)$, we have $g(\min_{1\le i\le c}\rho_i(x, v_i)) \le \sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i)) \le g(\max_{1\le i\le c}\rho_i(x, v_i))$. Since $g = f^{-1}$ and $f' > 0$, applying f to these inequalities yields the claim. Thus $0 < B \le N(v) = \int_{\mathbb{R}^s}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))\,dx \le A$. □

Theorem 2.2 tells us that the proposed probability model $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$ is a pdf. Since the proposed pdf $h(x, v)$ can present multimodal shapes, we call it a multimodal probability model, where $\{v_1, \ldots, v_c\}$ are the prototypes and $\{\alpha_1, \ldots, \alpha_c\}$ are the mixing proportions. In the next section, we will demonstrate that $h(x, v)$ is actually a multimodal pdf when $c > 1$, in which case it can represent a probability model for clustering a data set with c modes.

Substituting the proposed pdf into the log-likelihood function (1), we obtain a new type of log-likelihood function:
$$l(X, v) = \sum_{k=1}^{n}\ln h(x_k, v) = -\sum_{k=1}^{n}\left(a(x_k)\, f\!\left(\sum_{i=1}^{c}\alpha_i\, g(\rho_i(x_k, v_i))\right) + \ln N(v)\right) = -nR - n\ln N(v).$$
We see that $l(X, v)$ is a log-likelihood function of the proposed pdf $h(x, v)$ for the data set X. Let $\bar{R} = R + \ln N(v)$; then $l(X, v) = -n\bar{R}$. Thus, the ML clustering obtained by maximizing $l(X, v)$ with respect to (w.r.t.) $\{v_1, \ldots, v_c\}$ is equivalent to minimizing $\bar{R}$ w.r.t. $\{v_1, \ldots, v_c\}$. Since $\bar{R}$ is the GCM objective function R plus a penalized constraint term $\ln N(v)$, we may investigate the relations between $\bar{R}$ and R. Unless explicitly stated otherwise, we always assume that $X = \{x_1, x_2, \ldots, x_n\}$ is an s-dimensional data set, the cluster prototypes satisfy $v_i \in \mathbb{R}^s$, and $\rho_i(x_k, v_i) = (x_k - v_i)^T A^{-1}(x_k - v_i) \equiv (x_k, v_i)_A$, where A is positive definite. Thus, the necessary condition for minimizing R w.r.t. $\{v_1, \ldots, v_c\}$ is
$$\frac{\partial R}{\partial v_i} = \frac{1}{n}\sum_{k=1}^{n}\alpha_i J_v(x_k, v_i) = 0, \qquad (3)$$
where $J_v(x, v_i) = a(x)\, f'\!\left(\sum_{j=1}^{c}\alpha_j g((x, v_j)_A)\right) g'((x, v_i)_A)\, A^{-1}(x - v_i)$. Similarly, the necessary condition for minimizing $\bar{R}$ w.r.t. $\{v_1, \ldots, v_c\}$ is
$$\frac{\partial \bar{R}}{\partial v_i} = \frac{\partial R}{\partial v_i} + \frac{1}{N(v)}\frac{\partial N(v)}{\partial v_i} = 0. \qquad (4)$$
If $N(v)$ is a constant independent of v, then we can ignore the term $\ln N(v)$ when minimizing $\bar{R}$ w.r.t. $\{v_1, \ldots, v_c\}$, so that (3) is equivalent to (4). In this case, GCM becomes an ML clustering via the proposed probability model $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$. When $N(v)$ is not a constant of v, we can construct clustering algorithms more general than GCM by keeping the regularization term $\ln N(v)$. In fact, $N(v)$ is usually called the partition function.
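Purely as a numerical illustration of Theorem 2.2 and of the role of the partition function, the sketch below computes $N(v)$ by one-dimensional quadrature for the generator $f(t) = -\beta^{-1}\ln t$ (for which the unnormalized density reduces to $(\sum_i \alpha_i e^{-\beta(x - v_i)^2})^{1/\beta}$), checks that the normalized $h(x, v)$ integrates to one, and evaluates the log-likelihood $l(X, v) = -nR - n\ln N(v)$. It assumes $a(x) \equiv 1$, squared distances, and a one-dimensional grid; all names and the toy data are ours.

```python
import numpy as np

def h_unnormalized(x, V, alpha, beta):
    """exp(-f(sum_i alpha_i g(rho_i))) for f(t) = -ln(t)/beta, g(t) = exp(-beta t),
    a(x) = 1 and rho_i(x, v_i) = (x - v_i)^2 in one dimension.  This simplifies to
    (sum_i alpha_i exp(-beta (x - v_i)^2))^(1/beta)."""
    rho = (x[:, None] - V[None, :]) ** 2
    return (alpha[None, :] * np.exp(-beta * rho)).sum(axis=1) ** (1.0 / beta)

V = np.array([-4.0, 0.0, 4.0])
alpha = np.array([1 / 3, 1 / 3, 1 / 3])
beta = 3.0

# Partition function N(v) by simple quadrature on a wide grid (1-D only).
grid = np.linspace(-30, 30, 200001)
N_v = np.trapz(h_unnormalized(grid, V, alpha, beta), grid)

# Check that h(x, v) = h_unnormalized / N(v) integrates to one.
print("integral of h:", np.trapz(h_unnormalized(grid, V, alpha, beta) / N_v, grid))

# Log-likelihood l(X, v) = sum_k ln h(x_k, v) = -n R - n ln N(v).
X = np.array([-4.2, -3.8, 0.1, 0.3, 4.1, 3.7])
loglik = np.log(h_unnormalized(X, V, alpha, beta) / N_v).sum()
print("l(X, v) =", loglik)
```

In higher dimensions such direct quadrature is impractical, which is exactly the difficulty with estimating $N(v)$ noted in the conclusions; a Monte Carlo estimate would be one possible substitute.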
Moreover, we can give a connection between the GCM and the Windham model through the following corollary.

Corollary 1. Let $\bar{\Omega} = \{v = \{v_1, v_2, \ldots, v_c\} \mid \forall i,\ \partial N(v)/\partial v_i = 0\}$. Set $\xi(x, v) = a(x) f(\sum_{i=1}^{c}\alpha_i g((x, v_i)_A))$ and $D(x, v) = [\alpha_1 J_v(x, v_1), \alpha_2 J_v(x, v_2), \ldots, \alpha_c J_v(x, v_c)]$. If $v \in \bar{\Omega}$ is predefined, then $\phi(\lambda)$ attains a global minimum at $\lambda = 0$, and the pdf $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g((x, v_i)_A)))$ is the global minimum of $\delta(p) = E(\xi + \log p)$, where p is any pdf for which $E(D(x, v)) = 0$.

Proof: Since $v \in \bar{\Omega}$ implies $E(D(x, v)) = 0$, the conclusion can be proved in the same way as Theorem 2.1 is proved in [15]. □

Moreover, we can easily prove the following properties of the pdf related to GCM.

Theorem 2.3. If $f''(t) \le 0$ for all t and $a(x) \ge 0$ for all $x \in \mathbb{R}^s$, then $\sum_{k=1}^{n} f\!\left(\sum_{i=1}^{c}\alpha_i g(\|x_k - v_i\|^2)\right) \ge \sum_{k=1}^{n}\|x_k - \bar{x}\|^2$, where $\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k$.

Theorem 2.4. If $f''(t) \le 0$ for all t and $a(x) \ge 0$ for all $x \in \mathbb{R}^s$, then $\exp\!\left(-a(x) f\!\left(\sum_{i=1}^{c}\alpha_i g(\|x - v_i\|^2)\right)\right) \le \exp(-a(x)\|x - \bar{v}\|^2)$, where $\bar{v} = \frac{1}{c}\sum_{i=1}^{c} v_i$.

In summary, Corollary 1 tells us that the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$ can be viewed as a natural consequence of Theorem 2.2, and the proposed pdf is actually less complex and less intractable than the one in Theorem 2.1. If $v \in \bar{\Omega}$, the ML clustering via the proposed pdf induces the GCM. If $v \notin \bar{\Omega}$, the ML clustering via the proposed pdf leads to clustering frameworks other than the GCM. Moreover, Theorems 2.3 and 2.4 clearly show that, in theory, the pdf corresponding to GCM has one peak when $f''(t) \le 0$ for all t.

3. Existing Clustering Models and the Proposed PDF

In this section, we study the relations between some existing clustering models and the proposed probability model, so that those clustering algorithms can be constructed as an ML clustering via the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$. Before giving these relations, we briefly analyze the proposed pdf. According to Occam's razor, the simplest hypothesis that fits the data is preferred. Based on this principle, Yu [14] gave two preferred generators of the GCM, $f(x) = -\beta^{-1}\ln x$ and $f(x) = x^{1-m}$. To demonstrate the variety of multimodal shapes of the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-a(x) f(\sum_{i=1}^{c}\alpha_i g(\rho_i(x, v_i))))$, we analyze the pdf with different parameters. For simplicity, we set $\rho_i(x, v_i) = \|x - v_i\|^2$ (and take $a(x) \equiv 1$), and use the two generators $f(x) = -\beta^{-1}\ln x$ and $f(x) = x^{1-m}$. Since $g = f^{-1}$ gives $g(t) = \exp(-\beta t)$ for the first generator and $g(t) = t^{1/(1-m)}$ for the second, the following special pdfs are obtained:
$$h(x, v) = (N(v))^{-1}\left(\sum_{i=1}^{c}\alpha_i \exp(-\beta\|x - v_i\|^2)\right)^{1/\beta} \quad \text{for } f(x) = -\beta^{-1}\ln x,$$
$$h(x, v) = (N(v))^{-1}\exp\!\left(-\left(\sum_{i=1}^{c}\alpha_i \|x - v_i\|^{2/(1-m)}\right)^{1-m}\right) \quad \text{for } f(x) = x^{1-m}.$$
It is easy to see that the parameters β, m and α have a great impact on the proposed pdf. To demonstrate this clearly, we plot the proposed pdf with respect to β, m and α for the predefined cluster centers $v_1 = -4$, $v_2 = 0$ and $v_3 = 4$. Figure 1 shows that when $\beta \to +\infty$, the proposed pdf $h(x, v) = (N(v))^{-1}(\sum_{i=1}^{c}\alpha_i \exp(-\beta\|x - v_i\|^2))^{1/\beta}$ becomes c essentially identical but translated components that are independent of the mixing proportions $\alpha_i$. Similarly, Figure 2 indicates that when $m \to 1^+$, the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-(\sum_{i=1}^{c}\alpha_i\|x - v_i\|^{2/(1-m)})^{1-m})$ also becomes three totally coincident components (for the three centers used here) that are independent of the mixing proportions $\alpha_i$.
Figure 1. The proposed pdf $h(x, v) = (N(v))^{-1}(\sum_{i=1}^{c}\alpha_i \exp(-\beta\|x - v_i\|^2))^{1/\beta}$ for $v_1 = -4$, $v_2 = 0$, $v_3 = 4$, plotted for $\beta \in \{300, 3, 1, 0.3, -0.3\}$ and mixing proportions $\alpha = (1/3, 1/3, 1/3)$, $(0.1, 0.7, 0.2)$, $(0.2, 0.2, 0.6)$.

Figure 2. The proposed pdf $h(x, v) = (N(v))^{-1}\exp(-(\sum_{i=1}^{c}\alpha_i\|x - v_i\|^{2/(1-m)})^{1-m})$ for $v_1 = -4$, $v_2 = 0$, $v_3 = 4$, plotted for $m \in \{10, 2, 0.5, -0.5, 1.02, -1, -10, -80, -100\}$ and the same three choices of mixing proportions.
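The curves in Figures 1 and 2 can be reproduced, at least qualitatively, with a short numerical sketch such as the one below. It is only illustrative: it assumes $a(x) \equiv 1$, one-dimensional data, squared distances, and normalizes on a finite grid rather than over all of $\mathbb{R}$; the function names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

V = np.array([-4.0, 0.0, 4.0])        # the centers v_1, v_2, v_3 used in the figures
x = np.linspace(-10, 10, 2000)        # grid chosen so that no point coincides exactly with a center
rho = (x[:, None] - V[None, :]) ** 2  # rho_i(x, v_i) = (x - v_i)^2

def pdf_log_generator(alpha, beta):
    """h(x, v) for f(t) = -ln(t)/beta: (sum_i alpha_i exp(-beta rho_i))^(1/beta), grid-normalized."""
    u = (alpha[None, :] * np.exp(-beta * rho)).sum(axis=1) ** (1.0 / beta)
    return u / np.trapz(u, x)

def pdf_power_generator(alpha, m):
    """h(x, v) for f(t) = t^(1-m): exp(-(sum_i alpha_i rho_i^(1/(1-m)))^(1-m)), grid-normalized."""
    u = np.exp(-((alpha[None, :] * rho ** (1.0 / (1.0 - m))).sum(axis=1)) ** (1.0 - m))
    return u / np.trapz(u, x)

alpha = np.array([0.2, 0.2, 0.6])
# Large beta gives well-separated peaks; a sufficiently small positive beta gives a unimodal curve.
for beta in (300.0, 3.0, 0.3):
    plt.plot(x, pdf_log_generator(alpha, beta), label=f"beta={beta}")
plt.plot(x, pdf_power_generator(alpha, 2.0), "--", label="m=2")
plt.legend()
plt.show()
```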
However, when $m > 1$ (not very close to 1) or $\beta > 0$ (neither approaching $+\infty$ nor 0) is fixed, the mixing proportions $\alpha_i$ control the shapes of the c components in the proposed pdf. For $h(x, v) = (N(v))^{-1}(\sum_{i=1}^{c}\alpha_i \exp(-\beta\|x - v_i\|^2))^{1/\beta}$, the mixing proportion $\alpha_i$ actually controls the height of the c components: roughly speaking, the larger $\alpha_i$, the higher the corresponding component. For the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-(\sum_{i=1}^{c}\alpha_i\|x - v_i\|^{2/(1-m)})^{1-m})$, the mixing proportion $\alpha_i$ controls not the height but the width of the c components: the larger $\alpha_i$, the wider the corresponding component. Moreover, Figure 1 illustrates that for $\beta < 0$, or $\beta > 0$ but small enough, the proposed pdf $h(x, v) = (N(v))^{-1}(\sum_{i=1}^{c}\alpha_i \exp(-\beta\|x - v_i\|^2))^{1/\beta}$ becomes unimodal, consistent with Theorem 2.3. The proposed pdf $h(x, v) = (N(v))^{-1}\exp(-(\sum_{i=1}^{c}\alpha_i\|x - v_i\|^{2/(1-m)})^{1-m})$ is also unimodal when $0 < m < 1$, which is again consistent with Theorem 2.3. Figure 2 shows that when $m < 0$, the proposed pdf $h(x|v)$ may be unimodal or multimodal depending on m: if $m < 0$ and $-m$ is not large enough, the pdf is unimodal; if $m < 0$ and $-m$ is large enough, the pdf is multimodal with very sharp components. Therefore, a clustering algorithm corresponding to the proposed pdf with $\beta < 0$ or $m < 1$ is unlikely to output c different cluster centers. The above observations also support the following results presented in [14]: the conditions $m > 1$ and $\beta > 0$ (not too small) should hold for the GCM with generators $f = x^{1-m}$ and $f = -\beta^{-1}\ln x$, respectively. From Figure 2, it is also easy to see that the proposed pdf $h(x, v) = (N(v))^{-1}\exp(-(\sum_{i=1}^{c}\alpha_i\|x - v_i\|^{2/(1-m)})^{1-m})$ has c components with the same shape. From a statistical point of view, this implies that the corresponding clustering algorithm (i.e. FCM) outputs clusters of equal size if the data are well separated. This offers a reasonable explanation of why FCM has such a tendency to output clusters of equal size, as mentioned in [1]. In fact, the relations between many other well-known clustering methods, such as FCM or EM-type clustering algorithms, and the proposed pdf can easily be discussed with the method of this paper.

4. Conclusions

In this paper, we have investigated partitional clustering algorithms from a statistical point of view. By comparing the ML method and the GCM clustering model, we have proposed a novel multimodal probability density function that includes the finite mixture model and the exponential family as special cases.
Under some constraint condition, we have proved that such a PDF indeed leads to GCM. Considering that the proposed PDF plays an important role in algorithm design and performance evaluation for cluster analysis, we have studied its theoretical properties in this paper. Clearly, the partition function $N(v)$ of the proposed PDF is worth further study. In [14], it has been proved that many existing partitional clustering algorithms are special cases of the GCM clustering model. Therefore, the distributions associated with many partitional clustering algorithms are also discovered in some sense, and the relations between the proposed PDF and various partitional clustering methods can be easily discussed, such as the classification maximum likelihood, the finite mixture model, latent class analysis, C-means clustering and fuzzy clustering methods. In particular, we find the probability density function associated with the fuzzy C-means clustering algorithms, which might be useful for a better understanding of the properties of many fuzzy clustering algorithms. For example, we also interpret why C-means or FCM has a tendency to output clusters of equal size based on the proposed PDF. Furthermore, the proposed PDF might offer a theoretical way to choose the optimal clustering algorithm for a given data set. In theory, the proposed PDF translates this problem into the following question: how to judge whether a given data set obeys a probability distribution? This is a classical question in statistics. More concretely, a novel method might be developed to evaluate the performance of GCM based on the proposed PDF in the future. The idea is theoretically simple: among the outputs v of GCM, choose as the optimal clustering result the v that maximizes $l(X, v)$. In practice, if such a method is indeed used, one challenging problem is to accurately estimate $N(v)$ in high dimensions, which is beyond the scope of this paper.

References

[1] Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[2] Bryant, P.G., Williamson, J.A.: Asymptotic behavior of classification maximum likelihood estimates. Biometrika, Vol. 65, 273–281, 1978.
[3] Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. Journal of Classification, Vol. 8, 157–176, 1991.
[4] Fukunaga, K., Hostetler, L.D.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21, 32–40, 1975.
[5] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
[6] Krishnapuram, R., Keller, J.M.: A possibilistic approach to clustering. IEEE Trans. Fuzzy Systems, 1, 98–110, 1993.
[7] Lloyd, S.: Least squares quantization in PCM. Bell Telephone Laboratories Paper, Murray Hill, 1957.
[8] MacQueen, J.: Some methods for classification and analysis of multivariate observations. Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 281–297, Berkeley: University of California Press, 1967.
[9] McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[10] Scott, A.J., Symons, M.J.: Clustering methods based on likelihood ratio criteria. Biometrics, Vol. 27, 387–397, 1971.
[11] Bock, H.H.: Probability models and hypothesis testing in partitioning cluster analysis. In: Arabie, P., Hubert, L.J., De Soete, G. (eds.): Clustering and Classification, 377–453, World Scientific, River Edge, NJ, 1996.
[12] Yang, M.S.: On a class of fuzzy classification maximum likelihood procedures. Fuzzy Sets and Systems, 57, 365–375, 1993.
[13] Yang, M.S., Wu, K.L.: A similarity-based robust clustering method. IEEE Trans. Pattern Anal. Machine Intelligence, 26, 434–448, 2004.
[14] Yu, J.: General C-means clustering model. IEEE Trans. Pattern Anal. Machine Intelligence, 27(8), 1197–1211, Aug. 2005.
[15] Windham, M.P.: Statistical models for cluster analysis. In: Diday, E., Lechevallier, Y. (eds.): Symbolic-Numeric Data Analysis and Learning, 17–26, Nova Science, Commack, New York, 1991.
[16] Govaert, G.: Clustering model and metric with continuous data. In: Diday, E. (ed.): Learning Symbolic and Numeric Knowledge, 95–102, Nova Science, Commack, New York, 1989.