On parameter estimation for normal mixtures based on fuzzy
clustering algorithms
Miin-Shen Yang*, Chen-Feng Su
Department of Mathematics, Chung Yuan Christian University, Chung Li, Taiwan 32023, ROC
Received July 1993; revised March 1994
Abstract
Described here are three approaches to estimating the parameters of a mixture of normal distributions. One approach is based on a modification of the expectation-maximization algorithm to compute maximum likelihood estimates. Another makes use of the fuzzy c-means clustering algorithms, and the third is based on the penalized fuzzy c-means clustering algorithms. The accuracy and computational efficiency of these three types of algorithms for estimating the parameters of normal mixtures are compared using samples drawn from univariate normal mixtures of two classes.
Keywords: Clustering; EM algorithm; FCM algorithms; PFCM algorithms; Normal mixtures; Parameter estimation;
Accuracy; Computational efficiency
1. Introduction
Mixtures of normal distributions are used extensively as models in diverse important practical situations
in which data are viewed to arise from two or more subpopulations mixed in varying proportions. There are
numerous examples in which finite mixtures of normal distributions are applied to analysis of data in the
sciences (see [11]). Parameter estimation for a normal mixture is the first step toward understanding the mixture model. The maximum likelihood principle is the most commonly applied method of parameter estimation, in which the expectation-maximization (EM) algorithm has been commonly used to compute the maximum likelihood estimates in finite mixtures of normal distributions (see [16, 13]).
Cluster analysis, a major technique in pattern recognition, is a method to cluster a data set into groups of
similar individuals. The conventional (hard) clustering methods restrict that each point of the data set
belongs to exactly one cluster. The fuzzy set in [20] gave an idea of uncertainty of belongingness that is
described by a membership function. The use of a fuzzy set provides imprecise class membership information.
It is natural and useful to apply the idea of a fuzzy set in cluster analysis. Fuzzy clustering is widely applied in
diverse substantive areas. See, for example, [2, 12].
* Corresponding author.
In the literature of fuzzy clustering, the fuzzy c-means (FCM) clustering algorithms defined by Dunn [7]
and generated by Bezdek [2] are well-known and powerful methods in cluster analysis. There are also many
generalizations and variations of these FCM, such as by Bezdek et al. [2, 3], Dave [4], Trauwaert et al. [15], Jajuga [10], Yang [19], etc. These FCM algorithms can also be used for parameter estimation in finite mixtures of normal distributions. These FCM are obviously nonparametric approaches, yet they can yield results superior to those from the EM algorithm, which is a parametric approach. Such numerical comparisons were made by Bezdek et al. [3], Davenport et al. [5] and Gath and Geva [8].
Yang [18] proposed a class of fuzzy classification maximum likelihood procedures. He added a penalty term to these FCM and thereby extended the FCM to the so-called penalized FCM (PFCM). In this work, we investigate three approaches to estimating the parameters of normal mixtures, based on the EM, FCM and PFCM algorithms. We give numerical comparisons for these three approaches according to the criteria of accuracy and computational efficiency. The results show that the PFCM are impressive. In Section 2 we describe these algorithms. Section 3 gives the numerical examples and their results.
2. EM, FCM and PFCM algorithms
To describe briefly the model of finite mixtures of distributions, let the population be assumed to consist of c subpopulations, or classes, with c a positive integer greater than unity; the density of an observation x from class i is $g_i(x;\theta_i)$, $i = 1, 2, \ldots, c$, for some parameter $\theta_i$. Let $\alpha_i$ be the proportion of class i in the population with $\alpha_i \in (0,1)$ and $\sum_{i=1}^{c}\alpha_i = 1$. Let $\alpha' = (\alpha_1, \alpha_2, \ldots, \alpha_c)$, $\theta' = (\theta_1', \theta_2', \ldots, \theta_c')$, $f(x;\gamma) = \sum_{i=1}^{c}\alpha_i g_i(x;\theta_i)$ and $\gamma' = (\alpha', \theta')$, in which the convex combination $f(x;\gamma)$ is a mixture of the $g_i$ parametrized by $\gamma$.
Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a random sample of size n from $f(x;\gamma)$, unlabeled by class of origin. It is assumed throughout that the number c of components is known. The problem of deciding about c itself falls into the category of exploratory data analysis called cluster validity and is not addressed here. More specifically, for $i = 1, 2, \ldots, c$, let $g_i(x;\theta_i)$ be the multivariate normal with mean $a_i$ and covariance matrix $\Sigma_i$; i.e. $g_i(x;\theta_i) = N(a_i, \Sigma_i)$.
We present the notation used in this section as follows:
$$\Omega = \{\theta = (a, \Sigma):\ a \in \mathbb{R}^p,\ \Sigma = LL',\ L \in \mathbb{R}^{p \times p},\ \det(L) \neq 0\};$$
$$\Omega^c = \Omega \times \Omega \times \cdots \times \Omega \quad (c \text{ times});$$
$$A^c = \Big\{\alpha' \in \mathbb{R}^c:\ \sum_{i=1}^{c}\alpha_i = 1,\ 0 \leq \alpha_i \leq 1\ \forall i\Big\};$$
$$\Gamma = \{\gamma' = (\alpha', \theta'):\ \alpha \in A^c,\ \theta \in \Omega^c\};$$
and
$$g_i(x;\theta_i) = K \exp\{-\tfrac{1}{2}(x - a_i)'\Sigma_i^{-1}(x - a_i)\}$$
with $K = (2\pi)^{-p/2}(\det \Sigma_i)^{-1/2}$.
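To fix ideas, the following minimal Python sketch (the helper names are ours, not part of the paper) evaluates the univariate mixture density $f(x;\gamma) = \sum_{i=1}^{c}\alpha_i g_i(x;\theta_i)$ of the kind used in the numerical examples of Section 3.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate normal density g_i(x; theta_i), theta_i = (mean, variance)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_pdf(x, alphas, means, variances):
    """Mixture density f(x; gamma) = sum_i alpha_i * g_i(x; theta_i)."""
    return sum(a * normal_pdf(x, m, v)
               for a, m, v in zip(alphas, means, variances))

# Example: the two-class mixture 0.5 N(0,1) + 0.5 N(1,1) studied later.
print(mixture_pdf(0.5, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]))
```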
Next we describe three approaches to estimating the parameters of normal mixtures: the Wolfe EM, FCM and PFCM algorithms.
2.1. The Wolfe EM algorithm
One choice for estimating parameters is to use the maximum likelihood (ML) principle. Applying the ML principle to the parameters of a mixture of normal distributions leads to an iterative optimization procedure well known as the EM algorithm, explored by Dempster et al. [6] when the data are viewed as a set of incomplete data. Wolfe [16] first proposed an iterative algorithm of this kind, which we call the Wolfe EM (WEM) algorithm.
Let X be a subset of a p-dimensional Euclidean space $\mathbb{R}^p$ with its ordinary Euclidean norm $\|\cdot\|$ and let c be a positive integer greater than unity. A partition of X into c clusters can be represented by mutually disjoint sets $B_1, B_2, \ldots, B_c$ such that $B_1 \cup B_2 \cup \cdots \cup B_c = X$, or equivalently by the indicator functions $z_1, z_2, \ldots, z_c$ such that $z_i(x) = 1$ if $x \in B_i$ and $z_i(x) = 0$ if $x \notin B_i$, for all $x \in X$ and all $i = 1, 2, \ldots, c$. In this case X has a hard c-partition according to the indicator functions $z = (z_1, z_2, \ldots, z_c)$.
In accordance with the mixture density $f(x;\gamma)$, the log likelihood for the complete data is given by
$$L_c(\gamma; X) = \sum_{i=1}^{c}\sum_{j=1}^{n} z_i(x_j)\{\ln \alpha_i + \ln g_i(x_j;\theta_i)\}.$$
The EM algorithm is applied to the mixture distributions by treating z as missing data. The algorithm is easy to program in two steps, E (for expectation) and M (for maximization). Given the initial value of $\gamma$, say $\gamma^{(0)}$, the E step requires the calculation of $Q(\gamma, \gamma^{(0)}) = E(L_c(\gamma;X) \mid X;\ \gamma^{(0)})$, the expectation of the log likelihood $L_c(\gamma;X)$ of the complete data, conditional on the observed data and the initial value $\gamma^{(0)}$ for $\gamma$. The M step then chooses the value of $\gamma$, say $\gamma^{(1)}$, that maximizes $Q(\gamma, \gamma^{(0)})$.
Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a random sample of size n from $f(x;\gamma)$, in which $f(x;\gamma)$ is a mixture of normal distributions as the notation describes. Setting the first derivative of $L_c(\gamma)$ with respect to $\gamma$ equal to zero, Wolfe [16] gave the necessary conditions for a maximizer of $L_c(\gamma)$ as follows:
$$\alpha_i = \sum_{j=1}^{n} u_i(x_j)/n, \quad i = 1, 2, \ldots, c, \tag{1a}$$
$$a_i = \sum_{j=1}^{n} u_i(x_j)x_j / (n\alpha_i), \quad i = 1, 2, \ldots, c, \tag{1b}$$
$$\Sigma_i = \sum_{j=1}^{n} u_i(x_j)(x_j - a_i)(x_j - a_i)' / (n\alpha_i), \quad i = 1, 2, \ldots, c, \tag{1c}$$
in which
$$u_i(x_j) = \frac{\alpha_i g_i(x_j;\theta_i)}{\sum_{k=1}^{c}\alpha_k g_k(x_j;\theta_k)}, \quad i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n. \tag{1d}$$
Eq. (1d) shows that $u_i(x_j)$ may be interpreted via Bayes' rule as the posterior probability that, given $x_j$, it is drawn from class i, i.e.
$$u_i(x_j) = \Pr(\text{class } i \mid x_j;\ \theta_i). \tag{1e}$$
A fuzzy set, proposed by Zadeh [20], is an extension that allows an indicator function $z_i(x)$ to become a function $u_i(x)$ (called a membership function) assuming values in the interval [0, 1]. Ruspini [14], who first applied fuzzy sets in cluster analysis, introduced a fuzzy c-partition $u = (u_1, u_2, \ldots, u_c)$ by this extension, with $u_1(x) + u_2(x) + \cdots + u_c(x) = 1$. The collection $u = (u_1, u_2, \ldots, u_c)$ with $u_i(x_j)$ as in (1d) constitutes just such a fuzzy c-partition.
The E step can thus be replaced by computing $E(z_i(x_j) \mid x_j;\ \gamma^{(0)})$, and the M step estimates the parameter $\gamma$ by means of (1a)-(1c). We alternate the E step and the M step repeatedly, stage k being computed from stage k − 1. If $Q(\gamma, \gamma^{(0)})$ is continuous in both $\gamma$ and $\gamma^{(0)}$, then the sequence $L_c(\gamma^{(k)}; X)$ converges to a local maximum of $L_c$ (see [6]).
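Concretely, since $L_c(\gamma; X)$ is linear in the indicators $z_i(x_j)$, the E step amounts to replacing each $z_i(x_j)$ by its posterior expectation (a standard identity for mixture EM, stated here for completeness):
$$Q(\gamma, \gamma^{(0)}) = \sum_{i=1}^{c}\sum_{j=1}^{n} E\big[z_i(x_j) \mid x_j;\ \gamma^{(0)}\big]\,\{\ln \alpha_i + \ln g_i(x_j;\theta_i)\},$$
where the conditional expectation is exactly the posterior probability $u_i(x_j)$ of (1d) evaluated at $\gamma^{(0)}$.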
Based on the necessary conditions (1a)-(1d), we give the Wolfe EM (WEM) algorithm as follows.
WEM algorithm
Step 1: Fix 2 ≤ c ≤ n and fix any ε > 0. Given an initial fuzzy c-partition $u^{(0)}$.
Step 2: Compute $\gamma^{(k)} = (\alpha^{(k)}, \theta^{(k)})$ with $u^{(k-1)}$ by (1a)-(1c).
Step 3: Update to $u^{(k)}$ with $\gamma^{(k)} = (\alpha^{(k)}, \theta^{(k)})$ by (1d).
Step 4: Compare $u^{(k)}$ to $u^{(k-1)}$ in a convenient matrix norm.
IF $\|u^{(k)} - u^{(k-1)}\| < \varepsilon$, STOP;
ELSE set k = k + 1 and return to Step 2.
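The following is a minimal Python sketch of the WEM iteration for a univariate c-component mixture (function and variable names are ours; the multivariate case would replace the scalar variances by covariance matrices):

```python
import numpy as np

def wem(x, c=2, eps=1e-4, max_iter=1000, rng=None):
    """Wolfe EM for a univariate c-component normal mixture (a sketch).
    x: 1-D data array. Returns (alphas, means, variances, memberships u)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    u = rng.random((c, n))
    u /= u.sum(axis=0)              # Step 1: initial fuzzy c-partition u^(0)
    for _ in range(max_iter):
        # Step 2 (M step): update gamma^(k) from u^(k-1) by (1a)-(1c).
        alphas = u.sum(axis=1) / n                                # (1a)
        means = (u @ x) / (n * alphas)                            # (1b)
        variances = np.array([u[i] @ (x - means[i]) ** 2
                              for i in range(c)]) / (n * alphas)  # (1c)
        variances = np.maximum(variances, 1e-12)  # guard against singularity
        # Step 3 (E step): update u^(k) by the posterior probabilities (1d).
        dens = np.array([alphas[i]
                         * np.exp(-0.5 * (x - means[i]) ** 2 / variances[i])
                         / np.sqrt(2.0 * np.pi * variances[i])
                         for i in range(c)])
        u_new = dens / dens.sum(axis=0)
        # Step 4: stop when the partition changes little in a matrix norm.
        done = np.max(np.abs(u_new - u)) < eps
        u = u_new
        if done:
            break
    return alphas, means, variances, u
```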
Although the convergence properties of the EM algorithm were established by Wu [17], it may be painfully slow and may yield an undesirable estimate, such as a "wrong" local maximizer or even a singularity. The EM algorithm performs worse the poorer its initial value is. In order to stabilize the initial values and to avoid a "wrong" local maximizer or a singularity, we may use a two-step method called FCM-WEM: we first use FCM to estimate the parameters, take these as initial estimates for WEM, and then run WEM. A two-step algorithm of this kind was proposed by Bezdek et al. [3], Davenport et al. [5] and Gath and Geva [8], who showed that FCM-WEM gives results superior to those of WEM. We therefore consider FCM-WEM in the numerical examples of Section 3.
2.2. FCM algorithms
Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a finite data set and let $u = (u_1, u_2, \ldots, u_c)$ be a fuzzy c-partition. Dunn [7] gave a fuzzy generalization of the conventional (hard) c-means (more popularly known as k-means) by combining it with the idea of Ruspini's fuzzy c-partitions. He gave the objective function $J_D(u, a)$ with
$$J_D(u, a) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^2(x_j)\,\|x_j - a_i\|^2,$$
in which $u = (u_1, u_2, \ldots, u_c)$ is a fuzzy c-partition and $a = (a_1, a_2, \ldots, a_c) \in (\mathbb{R}^p)^c$ are the cluster centers. Bezdek [1] generalized $J_D(u, a)$ to $J_{FCM}(u, a)$ with
$$J_{FCM}(u, a) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j)\,\|x_j - a_i\|^2,$$
for which $m \in [1, \infty)$ represents the degree of fuzziness. Then the necessary conditions for a minimizer $(u, a)$ of $J_{FCM}(u, a)$ are
$$a_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)x_j}{\sum_{j=1}^{n} u_i^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{2a}$$
and
$$u_i(x_j) = \left[\sum_{k=1}^{c}\left(\frac{\|x_j - a_i\|^2}{\|x_j - a_k\|^2}\right)^{1/(m-1)}\right]^{-1}, \quad i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n. \tag{2b}$$
To add the effect of various cluster shapes, we consider a sample covariance matrix $\Sigma_i$ (also called the scatter matrix) as follows:
$$\Sigma_i = \frac{\sum_{j=1}^{n} u_i(x_j)(x_j - a_i)(x_j - a_i)'}{\sum_{j=1}^{n} u_i(x_j)}, \quad i = 1, 2, \ldots, c.$$
We replace $\|x_j - a_i\|$ by $\|x_j - a_i\|_{\Sigma_i^{-1}}$ with $\|x_j - a_i\|^2_{\Sigma_i^{-1}} = (x_j - a_i)'\Sigma_i^{-1}(x_j - a_i)$, and replace conditions (2a) and (2b) by the following conditions (3a)-(3c):
$$a_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)x_j}{\sum_{j=1}^{n} u_i^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{3a}$$
$$\Sigma_i = \frac{\sum_{j=1}^{n} u_i(x_j)(x_j - a_i)(x_j - a_i)'}{\sum_{j=1}^{n} u_i(x_j)}, \quad i = 1, 2, \ldots, c, \tag{3b}$$
and
$$u_i(x_j) = \left[\sum_{k=1}^{c}\left(\frac{\|x_j - a_i\|^2_{\Sigma_i^{-1}}}{\|x_j - a_k\|^2_{\Sigma_k^{-1}}}\right)^{1/(m-1)}\right]^{-1}, \quad i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n. \tag{3c}$$
The iterative algorithms that compute minimizers of $J_{FCM}(u, a)$ through the necessary conditions (2a) and (2b) are called the fuzzy c-means (FCM) clustering algorithms. These FCM clustering procedures, defined by Dunn [7] and generalized by Bezdek [2], are widely applied in diverse substantive areas. The convergence properties of FCM algorithms were established in Bezdek [1] and Yang [19]. In order to add the effect of various cluster shapes, we consider a modification of the FCM algorithms that iterates through conditions (3a)-(3c). We mention that Gustafson and Kessel [9] and Trauwaert et al. [15] proposed a modification of FCM accounting for various cluster shapes by replacing $\|x_j - a_i\|$ with $\|x_j - a_i\|_{\Sigma_i^{-1}}$. They claimed that this modification should perform better than FCM when clusters have various shapes. Here we still call this modification of FCM an FCM algorithm and present it as follows.
FCM algorithm
Step 1: Fix $m \in [1, \infty)$ for $J_{FCM}$, fix 2 ≤ c ≤ n and fix any ε > 0. Given an initial fuzzy c-partition $u^{(0)}$.
Step 2: Compute $\theta^{(k)}$ with $u^{(k-1)}$ by (3a) and (3b).
Step 3: Update to $u^{(k)}$ with $\theta^{(k)}$ by (3c).
Step 4: Compare $u^{(k)}$ to $u^{(k-1)}$ in a convenient matrix norm.
IF $\|u^{(k)} - u^{(k-1)}\| < \varepsilon$, STOP;
ELSE set k = k + 1 and return to Step 2.
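A univariate Python sketch of this covariance-adapted FCM iteration follows (the names are ours; in one dimension the scatter matrix $\Sigma_i$ reduces to a scalar):

```python
import numpy as np

def fcm(x, c=2, m=2.0, eps=1e-4, max_iter=1000, rng=None):
    """Covariance-adapted FCM, conditions (3a)-(3c), univariate data (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    u = rng.random((c, n))
    u /= u.sum(axis=0)              # fuzzy c-partition: columns sum to 1
    for _ in range(max_iter):
        um = u ** m
        means = (um @ x) / um.sum(axis=1)                         # (3a)
        scatters = np.array([u[i] @ (x - means[i]) ** 2 / u[i].sum()
                             for i in range(c)])                  # (3b)
        scatters = np.maximum(scatters, 1e-12)
        # squared distances ||x_j - a_i||^2 in the Sigma_i^{-1} metric
        d2 = np.array([(x - means[i]) ** 2 / scatters[i] for i in range(c)])
        d2 = np.maximum(d2, 1e-12)  # guard: a point exactly at a center
        u_new = d2 ** (-1.0 / (m - 1.0))
        u_new /= u_new.sum(axis=0)                                # (3c)
        done = np.max(np.abs(u_new - u)) < eps
        u = u_new
        if done:
            break
    return means, scatters, u
```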
2.3. PFCM algorithms
The classification maximum likelihood (CML) procedure is a remarkable mixture maximum likelihood approach to clustering (see [11]). Let $X = \{x_1, x_2, \ldots, x_n\}$ be a random sample of size n from the population $f(x;\gamma)$, a mixture of the densities $g_i(x;\theta_i)$ with proportion $\alpha_i$ for class i, $i = 1, 2, \ldots, c$. The CML procedure is the optimization problem over a hard c-partition u, a proportion $\alpha$ and an estimate $\theta$ that maximizes the log likelihood $B_1(u, \alpha', \theta')$ of X, in which
$$B_1(u, \alpha', \theta') = \sum_{i=1}^{c}\sum_{j \in B_i} \ln \alpha_i g_i(x_j;\theta_i) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_i(x_j) \ln \alpha_i g_i(x_j;\theta_i),$$
with $u_i(x) \in \{0, 1\}$ and $u_1(x) + u_2(x) + \cdots + u_c(x) = 1$ for all $x \in X$. Yang [18] gave the objective function $B_{m,w}(u, \alpha', \theta')$ in which
$$B_{m,w}(u, \alpha', \theta') = \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j) \ln g_i(x_j;\theta_i) + w \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j) \ln \alpha_i,$$
with $u_i(x) \in [0, 1]$ and $u_1(x) + u_2(x) + \cdots + u_c(x) = 1$ for all $x \in X$, and the fixed constants $m \in [1, \infty)$ and $w \geq 0$. Thus $B_{m,w}(u, \alpha', \theta')$, called the fuzzy CML objective function, is a fuzzy extension of $B_1(u, \alpha', \theta')$. Based on $B_{m,w}(u, \alpha', \theta')$ various hard and fuzzy clustering algorithms can be derived. Yang [18] gave the objective function $J_{PFCM}$ with
$$J_{PFCM}(u, \alpha', a) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j)\,\|x_j - a_i\|^2 - w \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j) \ln \alpha_i = J_{FCM}(u, a) - w \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j) \ln \alpha_i.$$
As $J_{PFCM}$ adds the penalty term $-w \sum_{i=1}^{c}\sum_{j=1}^{n} u_i^m(x_j)\ln \alpha_i$ to $J_{FCM}(u, a)$, $J_{PFCM}(u, \alpha', a)$ is called the penalized FCM (PFCM) objective function. Yang gave the necessary conditions for a minimizer $(u, \alpha', a)$ of $J_{PFCM}(u, \alpha', a)$ as follows:
$$\alpha_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)}{\sum_{k=1}^{c}\sum_{j=1}^{n} u_k^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{4a}$$
$$a_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)x_j}{\sum_{j=1}^{n} u_i^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{4b}$$
$$u_i(x_j) = \left[\sum_{k=1}^{c}\left(\frac{\|x_j - a_i\|^2 - w \ln \alpha_i}{\|x_j - a_k\|^2 - w \ln \alpha_k}\right)^{1/(m-1)}\right]^{-1}, \quad i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n. \tag{4c}$$
Yang [18] also gave the convergence properties of the PFCM algorithms.
Thus the PFCM clustering algorithms that compute a minimizer of $J_{PFCM}(u, \alpha', a)$ iterate through the necessary conditions (4a)-(4c). Comparing PFCM with FCM, the difference is the penalty term. In order to consider the effect of various cluster shapes, we estimate the subpopulation covariance matrix by the scatter matrix $\Sigma_i$:
$$\Sigma_i = \frac{\sum_{j=1}^{n} u_i(x_j)(x_j - a_i)(x_j - a_i)'}{\sum_{j=1}^{n} u_i(x_j)}, \quad i = 1, 2, \ldots, c,$$
and replace the Euclidean norm $\|x_j - a_i\|$ by the Mahalanobis metric $\|x_j - a_i\|_{\Sigma_i^{-1}}$ with $\|x_j - a_i\|^2_{\Sigma_i^{-1}} = (x_j - a_i)'\Sigma_i^{-1}(x_j - a_i)$. Thus the conditions become
$$\alpha_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)}{\sum_{k=1}^{c}\sum_{j=1}^{n} u_k^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{5a}$$
$$a_i = \frac{\sum_{j=1}^{n} u_i^m(x_j)x_j}{\sum_{j=1}^{n} u_i^m(x_j)}, \quad i = 1, 2, \ldots, c, \tag{5b}$$
$$\Sigma_i = \frac{\sum_{j=1}^{n} u_i(x_j)(x_j - a_i)(x_j - a_i)'}{\sum_{j=1}^{n} u_i(x_j)}, \quad i = 1, 2, \ldots, c, \tag{5c}$$
and
$$u_i(x_j) = \left[\sum_{k=1}^{c}\left(\frac{\|x_j - a_i\|^2_{\Sigma_i^{-1}} - w \ln \alpha_i}{\|x_j - a_k\|^2_{\Sigma_k^{-1}} - w \ln \alpha_k}\right)^{1/(m-1)}\right]^{-1}, \quad i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n. \tag{5d}$$
Based on the necessary conditions (5a)-(5d), we construct a PFCM algorithm. We generate a random sample of size n = 500 from a mixture of two normals, 0.5N(0, 1) + 0.5N(1, 1), by the pseudo-random variable method described in Section 3, and we implement the PFCM algorithm on this random sample.
Fig. 1. Estimated means vs. w in PFCM algorithms with m = 2 in Eq. (5a).
We observe the variation of the estimated means for the mixture of two normal distributions 0.5N(0, 1) + 0.5N(1, 1) as the value of w is altered from 0 to 5. Figs. 1 and 2 show the estimated means vs. w for m = 2 in (5a) and for m = 1 in (5a), respectively. In Fig. 2 we find that the estimated means move symmetrically toward 0.5 as w increases, but the estimated means in Fig. 1 do not. That is, the accuracy of the parameter estimation for the normal mixture shown in Fig. 2 is superior to that in Fig. 1. Thus m = 1 in (5a) has a property superior to that of m = 2 in (5a); choosing m = 1 in (5a) is invariably superior to choosing m > 1 in (5a). Therefore we replace (5a) by (5a'):
$$\alpha_i = \frac{\sum_{j=1}^{n} u_i(x_j)}{\sum_{k=1}^{c}\sum_{j=1}^{n} u_k(x_j)} = \frac{1}{n}\sum_{j=1}^{n} u_i(x_j), \quad i = 1, 2, \ldots, c. \tag{5a'}$$
The $\alpha_i$ and $\Sigma_i$ in (5a') and (5c) are then the same as (1a) and (1c) in the WEM algorithm.
We now consider the modification of the PFCM algorithm that iterates through conditions (5a') and (5b)-(5d). According to Figs. 1 and 2, this modification of the PFCM algorithm is better than the original PFCM algorithm at estimating the parameters of normal mixtures. We do not know the convergence properties of this modified PFCM algorithm; we conjecture that it has convergence properties similar to those of the PFCM shown in [18].
Fig. 2. Estimated means vs. w in PFCM algorithms with m = 1 in Eq. (5a).
In this paper we use this modified PFCM algorithm in all numerical comparisons and also call it a PFCM algorithm. We present it as follows.
PFCM algorithm
Step 1: Fix $m \in [1, \infty)$ for $J_{PFCM}$, fix 2 ≤ c ≤ n and fix any ε > 0. Given an initial fuzzy c-partition $u^{(0)}$.
Step 2: Compute $\gamma^{(k)} = (\alpha^{(k)}, \theta^{(k)})$ with $u^{(k-1)}$ by (5a'), (5b) and (5c).
Step 3: Update to $u^{(k)}$ with $\gamma^{(k)} = (\alpha^{(k)}, \theta^{(k)})$ by (5d).
Step 4: Compare $u^{(k)}$ to $u^{(k-1)}$ in a convenient matrix norm.
IF $\|u^{(k)} - u^{(k-1)}\| < \varepsilon$, STOP;
ELSE set k = k + 1 and return to Step 2.
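A univariate Python sketch of this modified PFCM (our names again), using (5a') for the proportions and the penalized memberships (5d):

```python
import numpy as np

def pfcm(x, c=2, m=2.0, w=1.0, eps=1e-4, max_iter=1000, rng=None):
    """Modified PFCM iterating (5a'), (5b), (5c), (5d) on univariate data (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    u = rng.random((c, n))
    u /= u.sum(axis=0)
    for _ in range(max_iter):
        um = u ** m
        alphas = u.sum(axis=1) / n                                # (5a'): m = 1 in (5a)
        means = (um @ x) / um.sum(axis=1)                         # (5b)
        scatters = np.array([u[i] @ (x - means[i]) ** 2 / u[i].sum()
                             for i in range(c)])                  # (5c)
        scatters = np.maximum(scatters, 1e-12)
        # penalized distances ||x_j - a_i||^2_{Sigma_i^{-1}} - w ln(alpha_i);
        # note -w ln(alpha_i) > 0 since alpha_i < 1, so these stay positive.
        d2 = np.array([(x - means[i]) ** 2 / scatters[i] - w * np.log(alphas[i])
                       for i in range(c)])
        u_new = d2 ** (-1.0 / (m - 1.0))
        u_new /= u_new.sum(axis=0)                                # (5d)
        done = np.max(np.abs(u_new - u)) < eps
        u = u_new
        if done:
            break
    return alphas, means, scatters, u
```

Setting w = 0 recovers the covariance-adapted FCM of Section 2.2.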
According to our numerical experiments, the optimal value w* of w in PFCM algorithms under the criterion of accuracy is almost independent of the prior probabilities and variances, but depends on the distance between the subpopulation means. The PFCM with w = 1 has better accuracy than the PFCM with w = 0 (i.e. FCM). This behavior can be observed in Tables 2-4: the optimal values w* in test A and test B are all about 2.2 in Tables 2 and 3, but the optimal values w* in test C vary from 1.2 to 2.8 in Table 4. A functional relation between the optimal value w* and the distance between subpopulation means is shown in Fig. 4. This figure provides a way to find an optimal value w* of w; most of the curve is close to w* = 1. We propose a two-step PFCM method (called two-PFCM) as follows: (1) first we compute the estimates of the means by PFCM with w = 1; (2) based on these estimates of the means, we find an optimal value of w in accordance with Fig. 4 and then run PFCM again with this optimal value of w. A two-PFCM of this kind is recommended after we make numerical comparisons; a code sketch is given below.
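A sketch of the two-PFCM driver, built on the pfcm() function sketched in Section 2.3. The lookup of w* from the estimated distance between means (the curve of Fig. 4) is not reproduced here, so optimal_w_from_distance is a hypothetical stand-in:

```python
def optimal_w_from_distance(d):
    # Placeholder for reading w* off the curve of Fig. 4. Section 3 only tells
    # us that w* stays below 3 and decreases toward 1 as d grows; this linear
    # ramp is illustrative, not the paper's curve.
    return max(1.0, 3.0 - d)

def two_pfcm(x, c=2, m=2.0):
    # Step (1): estimate the means by PFCM with w = 1.
    _, means, _, _ = pfcm(x, c=c, m=m, w=1.0)
    # Step (2): choose w* from the estimated distance between means and rerun.
    w_star = optimal_w_from_distance(abs(means[1] - means[0]))
    return pfcm(x, c=c, m=m, w=w_star)
```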
3. Numerical examples and results
We make a numerical comparison of the three types of algorithm for estimating the parameters of normal mixtures, using samples drawn from univariate normal mixtures of two classes, under the criteria of accuracy and computational efficiency. We define these criteria as follows: (1) Accuracy: the accuracy of an algorithm is measured by the mean squared error (MSE), that is, the average squared error between the true parameters and the parameter estimates over trials. (2) Computational efficiency: the computational efficiency of an algorithm is measured by the number of iterations over trials.
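As a small illustration of the accuracy criterion (our formulation; the paper does not spell out the exact averaging), the MSE over repeated trials could be computed as:

```python
import numpy as np

def mse(estimates, true_params):
    """Average squared error between estimates and true parameters over trials.
    estimates: array of shape (n_trials, n_params); true_params: (n_params,)."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - np.asarray(true_params)) ** 2))
```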
We study a mixture of two univariate normals in all numerical examples. As the separation between subpopulations is determined by varying the parameters of the subpopulations, without loss of generality we take one subpopulation normal to be N(0, 1) and the other to be $N(a_2, \sigma_2^2)$. That is, $f(x;\gamma) = \alpha_1 N(0, 1) + \alpha_2 N(a_2, \sigma_2^2)$ with $\alpha_2 = 1 - \alpha_1$. We consider random samples of data drawn from $f(x;\gamma) = \alpha_1 N(0, 1) + \alpha_2 N(a_2, \sigma_2^2)$ generated by the pseudo-random variable method, well known in the literature, as follows.
Definition (pseudo-random variable). The pseudo-random variable method is a way to generate a random sample, that is, random variables generated through the use of computer algorithms.
Because uniform variables are more easily generated than normal ones, we use the following two transformations that generate normal variables from uniforms.
Define
$$X_1 = \cos(2\pi U_1)\sqrt{-2 \ln U_2}, \qquad X_2 = \sin(2\pi U_1)\sqrt{-2 \ln U_2},$$
in which $U_1$ and $U_2$ are independent uniform (0, 1) random variables. Then $X_1$ and $X_2$ are independent N(0, 1) random variables. Furthermore, we let
$$Y_1 = a_1 + \sigma_1 X_1, \qquad Y_2 = a_2 + \sigma_2 X_2.$$
Then $Y_1$ and $Y_2$ are independent $N(a_1, \sigma_1^2)$ and $N(a_2, \sigma_2^2)$, respectively.
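A Python sketch of this sampler (the names are ours): Box-Muller variates are assembled into a mixture sample by drawing each point's class with probability $\alpha_1$, one standard way to realize the mixture.

```python
import numpy as np

def box_muller(n, rng):
    """n independent N(0,1) variates via the Box-Muller transform."""
    k = (n + 1) // 2
    u1 = rng.random(k)
    u2 = 1.0 - rng.random(k)            # in (0, 1], so log(u2) is finite
    r = np.sqrt(-2.0 * np.log(u2))
    z = np.concatenate([np.cos(2 * np.pi * u1) * r,
                        np.sin(2 * np.pi * u1) * r])
    return z[:n]

def mixture_sample(n, alpha1, a1, var1, a2, var2, seed=0):
    """Draw n points from alpha1*N(a1, var1) + (1 - alpha1)*N(a2, var2)."""
    rng = np.random.default_rng(seed)
    z = box_muller(n, rng)
    labels = rng.random(n) < alpha1     # class 1 with probability alpha1
    return np.where(labels, a1 + np.sqrt(var1) * z, a2 + np.sqrt(var2) * z)

# Example: the n = 500 sample from 0.5 N(0,1) + 0.5 N(1,1) used in Section 2.3.
x = mixture_sample(500, 0.5, 0.0, 1.0, 1.0, 1.0)
```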
We take c = 2, m = 2, p = 1, n = 500 and ε = 0.0001 in all the following numerical examples.
Example 1. The mixtures 0.5N(0, 1) + 0.5N(a, 1) have the same prior probabilities and subpopulation variances but different means. We investigate the relation between the MSE and the distance between the two subpopulation means (i.e. a) for parameter estimation of the normal mixtures 0.5N(0, 1) + 0.5N(a, 1) based on the FCM clustering algorithms. We find that the distance between the two subpopulation means is an important factor in the MSE. Their relation is shown in Fig. 3; the MSE is clearly a decreasing function of the distance between the two subpopulation means.
Example 2. A curve of the optimal value w* vs. the distance between subpopulation means, under the criterion of MSE, based on the PFCM algorithms for the normal mixtures 0.5N(0, 1) + 0.5N(a, 1), is given in Fig. 4.
Fig. 3. MSE vs. the distance of two population means in FCM algorithms.
Accordingly, w* is less than 3, so we consider only values of w less than 3 in the PFCM algorithms; w* decreases to about unity as the distance between subpopulation means increases. This figure gives us the point of view from which to propose the two-step PFCM method described at the end of Section 2.
Example 3. We examine the behavior of the estimates of the subpopulation means, the MSE and the number of iterations for the PFCM as the value of w is altered from 0 to 5. We consider the normal mixture 0.5N(0, 1) + 0.5N(1, 1) and describe these behaviors in Figs. 2, 5 and 6, respectively. In Fig. 2, the estimates of the subpopulation means are outside the interval [0, 1] when w = 0 (i.e. FCM), and move symmetrically toward 0.5 as w increases. Thus the estimates of the subpopulation means are near the true means when we choose a suitable value of w. Fig. 5 shows that a minimizer of the MSE is obtained at w = 2.2; in this way we obtain the optimal parameter estimation from the PFCM clustering algorithms when w is chosen as 2.2. Fig. 6 shows that the number of iterations first increases and then decreases as w varies from 0 to 5, but remains below 400, and below 150 whenever w ≤ 2.5. The number of iterations of PFCM is invariably less than that of the FCM-WEM algorithm.
Example 4. We consider normal mixtures of four types and run the FCM-WEM, FCM and PFCM algorithms on them. Although the PFCM varies with the value of w, we consider only three main cases: (1) w = 1; (2) the optimal value w* of w; and (3) two-PFCM. We compare these under the criteria of accuracy and computational efficiency.
Fig. 4. Optimal w* vs. the distance between subpopulation means in PFCM algorithms.
Table 1
Mixture distributions for the numerical tests in Example 4

Test A: αN(0, 1) + (1 − α)N(1, 1)
    A1: α = 0.1;   A2: α = 0.2;   A3: α = 0.3;   A4: α = 0.4;   A5: α = 0.5

Test B: 0.5N(0, 1) + 0.5N(1, σ²)
    B1: σ² = 0.2;  B2: σ² = 0.5;  B3: σ² = 1;   B4: σ² = 1.5;  B5: σ² = 3

Test C: 0.5N(0, 1) + 0.5N(μ, 1)
    C1: μ = 0.2;   C2: μ = 0.5;   C3: μ = 1;    C4: μ = 1.5;   C5: μ = 3

Test D: αN(0, 1) + (1 − α)N(μ, σ²)
    D1: α = 0.1, μ = 1, σ² = 1.5
    D2: α = 0.2, μ = 0.2, σ² = 1
    D3: α = 0.3, μ = 1.5, σ² = 3
    D4: α = 0.4, μ = 0.5, σ² = 0.2
    D5: α = 0.5, μ = 3, σ² = 0.5
Test A: In test A five underlying normal mixtures vary in prior probability. The specifications of the normal mixtures (A1, A2, A3, A4, A5) are described in Table 1.
Fig. 5. MSE vs. w in PFCM algorithms.
Table 2
Accuracy and computational efficiency of test A

       FCM          FCM-WEM       PFCM (w = 1)  Two-PFCM      PFCM (w*)
Test   MSE     NI   MSE     NI    MSE     NI    MSE     NI    MSE     NI    w*
A1     0.5875  17   0.3665  1065  0.4890  21    0.3585  50    0.3175  39    2.10
A2     0.4817  17   0.3053   803  0.3725  21    0.2349  49    0.1665  44    2.15
A3     0.4245  17   0.2871  1149  0.3113  21    0.1618  49    0.0789  46    2.20
A4     0.3860  16   0.2007   862  0.2714  22    0.1165  50    0.0259  52    2.20
A5     0.3582  16   0.1121   462  0.2447  22    0.0934  50    0.0077  49    2.20

MSE: mean squared error; NI: number of iterations; w*: optimal w in PFCM.
Table 2 shows that the MSEs of all three kinds of PFCM are invariably smaller than those of FCM. The MSEs of FCM-WEM are smaller than those of PFCM (w = 1), but the MSEs of two-PFCM and PFCM (optimal w*) are invariably smaller than those of FCM-WEM.
Table 2 also shows that the numbers of iterations of FCM and PFCM are invariably less than 55, while those of FCM-WEM are large.
Fig. 6. Number of iterations vs. w in PFCM algorithms.
Table 3
Accuracy and computational efficiency of test B

       FCM          FCM-WEM      PFCM (w = 1)  Two-PFCM      PFCM (w*)
Test   MSE     NI   MSE     NI   MSE     NI    MSE     NI    MSE     NI    w*
B1     0.1250  22   0.0465  143  0.0937  21    0.0148  51    0.0079  37    2.05
B2     0.2085  19   0.1401  584  0.1355  19    0.0265  44    0.0055  34    2.05
B3     0.3582  16   0.1121  462  0.2447  22    0.0934  50    0.0077  49    2.20
B4     0.5119  19   0.2201  587  0.3625  20    0.1748  41    0.0115  43    2.25
B5     0.9585  22   0.1966  337  0.6958  19    0.4750  41    0.0334  49    2.25

MSE: mean squared error; NI: number of iterations; w*: optimal w in PFCM.
Test B: In test B five underlying normal mixtures vary in subpopulation variance. The specifications of the normal mixtures (B1, B2, B3, B4, B5) are described in Table 1.
Table 3 shows that the MSEs of all three kinds of PFCM are invariably smaller than those of FCM. The MSEs of FCM-WEM are smaller than those of PFCM (w = 1) except for B2, but the MSEs of two-PFCM and PFCM (w*) are invariably smaller than those of FCM-WEM, except for the two-PFCM of B5. Table 3 also shows that the numbers of iterations for FCM and PFCM are invariably at most 51, while those of FCM-WEM are large.
Table 4
Accuracy and computational efficiency of test C

       FCM          FCM-WEM       PFCM (w = 1)  Two-PFCM      PFCM (w*)
Test   MSE     NI   MSE     NI    MSE     NI    MSE     NI    MSE     NI    w*
C1     1.0538  16   0.2509  1234  0.8673  21    0.5078  51    0.0148  147   2.80
C2     0.7140  17   0.1228  1008  0.5580  22    0.2979  50    0.0111  81    2.80
C3     0.3582  16   0.1121   462  0.2447  22    0.0934  50    0.0077  49    2.20
C4     0.1780  15   0.2665   673  0.0988  20    0.0287  44    0.0082  30    1.80
C5     0.0383  15   0.0204    86  0.0149  22    0.0123  44    0.0110  24    1.20

MSE: mean squared error; NI: number of iterations; w*: optimal w in PFCM.
Table 5
Accuracy and computational efficiency of test D

       FCM          FCM-WEM       PFCM (w = 1)  Two-PFCM      PFCM (w*)
Test   MSE     NI   MSE     NI    MSE     NI    MSE     NI    MSE     NI    w*
D1     0.9008  16   0.5990  1022  0.7383  21    0.5529  46    0.3280  52    2.30
D2     1.0780  16   0.4064  1033  0.8869  22    0.5297  51    0.0164  126   2.75
D3     1.1277  19   0.5315   436  0.9047  17    0.7789  36    0.2133  47    2.30
D4     0.2140  23   0.1802   133  0.1384  24    0.0085  69    0.0074  41    2.05
D5     0.0265  20   0.0048    53  0.0185  23    0.0170  47    0.0165  25    1.20

MSE: mean squared error; NI: number of iterations; w*: optimal w in PFCM.
Test C: In test C five underlying normal mixtures vary in the distance between subpopulation means. The specifications of the normal mixtures (C1, C2, C3, C4, C5) are described in Table 1.
Table 4 shows that the MSEs of all three kinds of PFCM are invariably smaller than those of FCM. The MSEs of FCM-WEM are smaller than those of PFCM (w = 1) except for C4 and C5, but the MSEs of two-PFCM and PFCM (w*) are invariably smaller than those of FCM-WEM, except for the two-PFCM of C1 and C2. Hence the FCM-WEM algorithm provides greater accuracy when the distance between subpopulation means is smaller.
Table 4 also shows that the numbers of iterations for FCM and PFCM are invariably less than 82, except for PFCM (w*) of C1, while those of FCM-WEM are large except for C5.
Test D: In test D five underlying normal mixtures vary in all parameters (i.e. proportions, subpopulation means and variances). The specifications of the normal mixtures (D1, D2, D3, D4, D5) are described in Table 1.
Table 5 shows that the MSEs of all three kinds of PFCM are invariably smaller than those of FCM. The MSEs of FCM-WEM are smaller than those of PFCM (w = 1) except for D4, but the MSEs of two-PFCM are smaller than those of FCM-WEM for D1 and D4.
Table 5 also shows that the numbers of iterations for FCM and PFCM are invariably less than 70, except for PFCM (w*) of D2, while the numbers of iterations of FCM-WEM are large except for D5.
4. Summary and conclusions
The WEM algorithm is a well-known parametric approach to estimating the parameters of normal mixtures. The fuzzy clustering algorithms FCM can also be used for parameter estimation of normal mixtures. Although the FCM are nonparametric approaches, they have properties superior to those of WEM, such as fewer iterations and lower sensitivity to initial values. Combining accuracy, computational efficiency and sensitivity to initial values, a two-step algorithm called FCM-WEM was proposed by Bezdek et al. [3], Davenport et al. [5] and Gath and Geva [8]. According to Example 4 of Section 3, FCM-WEM has better accuracy than FCM.
Yang [18] developed a fuzzy clustering algorithm of a new type, called PFCM, which generalizes FCM through a penalty term controlled by the value of w. In Section 3 we gave numerical examples to investigate the effect of the value of w and proposed a two-step algorithm of a new type called two-PFCM. In Example 4 we made numerical comparisons under the criteria of accuracy and computational efficiency for the test sets A, B, C and D based on the FCM-WEM, FCM and PFCM algorithms. The FCM-WEM and two-PFCM algorithms invariably have the smaller MSEs. For this reason FCM-WEM was suggested for parameter estimation of normal mixtures by authors such as Bezdek et al. [3], Davenport et al. [5] and Gath and Geva [8]; likewise we propose two-PFCM as a new approach to estimating the parameters of normal mixtures. For computational efficiency, we find that two-PFCM is superior to FCM-WEM. Hence two-PFCM is more favorable for use than FCM-WEM.
Finally, we conclude that fuzzy clustering algorithms can be successfully used to estimate the parameters of finite normal mixtures. We have provided numerical examples under the criteria of accuracy and computational efficiency based on FCM, FCM-WEM and PFCM. FCM-WEM and two-PFCM are suggested for estimating the parameters of finite normal mixtures; on computational efficiency, two-PFCM is recommended. Comparing conditions (1a)-(1d) of WEM with (5a') and (5b)-(5d) of PFCM, the conditions are the same in estimating $\alpha_i$ and $\Sigma_i$ but differ in estimating $u_i$: the $u_i$ of PFCM in (5d) is simpler than the $u_i$ of WEM in (1d). Hence two-PFCM is more efficient than FCM-WEM.
Acknowledgement
We are grateful to the referees for their valuable suggestions and comments. We thank the National
Science Council of the Republic of China (Grant NSC 82-0208-M-033-031) for support of this research.
References
[1] J.C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Trans. Pattern Anal. Machine Intell. 2 (1980) 1-8.
[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum, New York, 1981).
[3] J.C. Bezdek, R.J. Hathaway and V.J. Huggins, Parametric estimation for normal mixtures, Pattern Recognition Lett. 3 (1985) 79-84.
[4] R.N. Dave, Generalized fuzzy c-shells clustering and detection of circular and elliptical boundaries, Pattern Recognition 25 (1992) 713-721.
[5] J.W. Davenport, J.C. Bezdek and R.J. Hathaway, Parameter estimation for finite mixture distributions, Comput. Math. Appl. 15 (1988) 819-828.
[6] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. Ser. B 39 (1977) 1-38.
[7] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters, J. Cybernet. 3 (1974) 32-57.
[8] I. Gath and A.B. Geva, Fuzzy clustering for the estimation of the parameters of the components of mixtures of normal distributions, Pattern Recognition Lett. 9 (1989) 77-86.
[9] D.E. Gustafson and W. Kessel, Fuzzy clustering with a fuzzy covariance matrix, Proc. IEEE-CDC, San Diego (1978) 761-766.
[10] K. Jajuga, L1-norm based fuzzy clustering, Fuzzy Sets and Systems 39 (1991) 43-50.
[11] G.J. McLachlan and K.E. Basford, Mixture Models: Inference and Applications to Clustering (Marcel Dekker, New York, 1988).
[12] S.K. Pal and D.D. Majumdar, Fuzzy Mathematical Approach to Pattern Recognition (Wiley (Halsted), New York, 1986).
[13] R.A. Redner and H.F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26 (1984) 195-239.
[14] E.H. Ruspini, A new approach to clustering, Inform. and Control 15 (1969) 22-32.
[15] E. Trauwaert, L. Kaufman and P. Rousseeuw, Fuzzy clustering algorithms based on the maximum likelihood principle, Fuzzy Sets and Systems 42 (1991) 213-227.
[16] J.H. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res. 5 (1970) 329-350.
[17] C.F.J. Wu, On the convergence properties of the EM algorithm, Ann. Statist. 11 (1983) 95-103.
[18] M.S. Yang, On a class of fuzzy classification maximum likelihood procedures, Fuzzy Sets and Systems 57 (1993) 365-375.
[19] M.S. Yang, Convergence properties of the generalized fuzzy c-means clustering algorithms, Comput. Math. Appl. 25 (12) (1993) 3-11.
[20] L.A. Zadeh, Fuzzy sets, Inform. and Control 8 (1965) 338-353.