International Journal of Fuzzy Systems, Vol. 13, No. 2, June 2011
Different Objective Functions in Fuzzy c-Means Algorithms and
Kernel-Based Clustering
Sadaaki Miyamoto
Abstract
An overview of fuzzy c-means clustering algorithms is given, where we focus on different objective functions: they use a regularized dissimilarity, an entropy-based function, and a function for possibilistic clustering. Classification functions for the objective functions and their properties are studied. Fuzzy c-means algorithms using kernel functions are also discussed, together with kernelized cluster validity measures and numerical experiments. New kernel functions derived from the classification functions are moreover studied.
Keywords: cluster validity measure, fuzzy c-means
clustering, kernel functions, possibilistic clustering.
Corresponding Author: Sadaaki Miyamoto is with the Department of Risk Engineering, the University of Tsukuba, Ibaraki 305-8573, Japan.
E-mail: [email protected]
Manuscript received June 2010; revised Nov. 2010; accepted Dec. 2010.

1. Introduction

Fuzzy clustering is well known not only in the fuzzy community but also in the related fields of data analysis, neural networks, and other areas of computational intelligence. Among various techniques of clustering using fuzzy concepts [16, 23, 30, 37], the term fuzzy clustering mostly refers to fuzzy c-means clustering by Dunn and Bezdek [1, 2, 6, 7, 8, 13]. This paper gives an overview of this method. Nevertheless, we adopt a non-standard formulation. That is, we begin from three different objective functions, and none of them is exactly the same as the one by Dunn and Bezdek.

Comparing different objective functions and their solutions, we find theoretical properties of fuzzy c-means clustering: different fuzzy classifiers are derived from different solutions. Moreover, a generalization including a "cluster size" variable and a "covariance" variable is developed. This generalization is shown to be closely related to mixture distributions.

Kernel-based fuzzy c-means clustering is moreover studied with associated cluster validity measures. Many numerical simulations are used to evaluate whether or not the kernelized measures are adequate for ordinary ball-shaped clusters.

Finally, a new class of kernel functions is proposed; they are derived from fuzzy c-means solutions. Illustrative examples are given.

2. Fuzzy c-Means Clustering

We first give three objective functions. Possibilistic clustering [18] is included as a variation of fuzzy c-means clustering.

A. Preliminary consideration

Let the objects for clustering be points in the p-dimensional Euclidean space. They are denoted by x_k = (x_k^1, \dots, x_k^p) \in R^p (k = 1, \dots, N). A generic point x = (x^1, \dots, x^p) implies a variable in R^p. We assume c clusters; the cluster centers are denoted by v_i (i = 1, \dots, c). We write V = (v_1, \dots, v_c) as the collection of all cluster centers.
The dissimilarity between an object and a cluster center is the squared Euclidean distance:

D(x_k, v_i) = \| x_k - v_i \|^2.   (1)

We sometimes write D_{ki} = D(x_k, v_i) for simplicity. Moreover, D(x, v_i) means that the variable x is substituted for the object x_k.
U = (u_{ki}) is the membership matrix: u_{ki} means the degree of belongingness of x_k to cluster i.
Crisp and fuzzy c-means clustering are based on the minimization of objective functions. Crisp c-means clustering [21] uses the following:
J_H(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} D(x_k, v_i).   (2)
Alternate minimization with respect to one of (U, V), while the other variable is fixed, is repeated until convergence [1]. Minimization with respect to U uses the following constraint:
M = \{ U = (u_{ki}) : \sum_{i=1}^{c} u_{ki} = 1; \ u_{kj} \ge 0, \ \forall k, j \}.   (3)
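For concreteness, alternate minimization of J_H under constraint (3) is the classical crisp c-means (k-means) iteration. The following sketch is ours, not part of the original text; the function name, defaults, and the optional explicit initialization are illustrative choices.

```python
import numpy as np

def crisp_c_means(X, c, V_init=None, n_iter=100, seed=0):
    """Alternate minimization of J_H under constraint (3); illustrative sketch."""
    rng = np.random.default_rng(seed)
    V = (np.array(V_init, dtype=float) if V_init is not None
         else X[rng.choice(len(X), size=c, replace=False)])
    for _ in range(n_iter):
        # step over U: u_ki = 1 for the nearest center, 0 otherwise
        D = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # D(x_k, v_i)
        labels = D.argmin(axis=1)
        # step over V: each v_i is the mean of the objects assigned to it
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else V[i] for i in range(c)])
        if np.allclose(V, V_new):
            break
        V = V_new
    return labels, V
```

The crisp memberships make each sub-step closed-form, which is the same alternation pattern the fuzzy objective functions below follow.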
We consider three objective functions:

J_B(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m \{ \varepsilon + D(x_k, v_i) \}, \quad (m > 1, \ \varepsilon \ge 0),   (4)

J_E(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} \{ u_{ki} D(x_k, v_i) + \lambda^{-1} u_{ki} (\log u_{ki} - 1) \}, \quad (\lambda > 0),   (5)

J_P(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} \{ (u_{ki})^m D(x_k, v_i) + \zeta (1 - u_{ki})^m \}, \quad (\zeta > 0).   (6)

All of the above are different from the original function proposed by Dunn [7, 8] and Bezdek [1, 2]. J_B(U, V) has a nonnegative parameter \varepsilon proposed by Ichihashi [28]. When \varepsilon = 0, J_B(U, V) is the original objective function. J_E(U, V) has an additional term of entropy. The use of entropy in fuzzy c-means clustering has been proposed by a number of researchers, e.g., [19, 20, 24]. J_P(U, V) has been proposed by Krishnapuram and Keller [18] for possibilistic clustering. This function can also be used for fuzzy c-means with constraint (3) when m = 2.

We use the alternate minimization procedure FCM in the following, where J(U, V) is either J_B(U, V), J_E(U, V), or J_P(U, V). Minimization with respect to U is with constraint (3).

FCM Algorithm of Alternate Optimization.
FCM1: Put an initial value V randomly.
FCM2: Minimize J(U, V) with respect to U. Let the optimal solution be U.
FCM3: Minimize J(U, V) with respect to V. Let the optimal solution be V.
FCM4: If (U, V) is convergent, stop. Otherwise go to FCM2.
End FCM.

We show the solutions of FCM2 and FCM3 for each objective function, where the derivations are omitted.

Solution for J_B:

u_{ki} = \frac{ (\varepsilon + D(x_k, v_i))^{-1/(m-1)} }{ \sum_{j=1}^{c} (\varepsilon + D(x_k, v_j))^{-1/(m-1)} },   (7)

v_i = \frac{ \sum_{k=1}^{N} (u_{ki})^m x_k }{ \sum_{k=1}^{N} (u_{ki})^m }.   (8)

Solution for J_E:

u_{ki} = \frac{ \exp(-\lambda D(x_k, v_i)) }{ \sum_{j=1}^{c} \exp(-\lambda D(x_k, v_j)) },   (9)

v_i = \frac{ \sum_{k=1}^{N} u_{ki} x_k }{ \sum_{k=1}^{N} u_{ki} }.   (10)

Solution for J_P, where m = 2:

u_{ki} = \frac{ (1 + \zeta D(x_k, v_i))^{-1/(m-1)} }{ \sum_{j=1}^{c} (1 + \zeta D(x_k, v_j))^{-1/(m-1)} },   (11)

v_i = \frac{ \sum_{k=1}^{N} (u_{ki})^m x_k }{ \sum_{k=1}^{N} (u_{ki})^m }.   (12)

B. Basic Functions

We introduce what we call basic functions in this paper:

g_B(x, y) = (\varepsilon + D(x, y))^{-1/(m-1)},   (13)

g_E(x, y) = \exp(-\lambda D(x, y)),   (14)

g_P(x, y) = (1 + \zeta D(x, y))^{-1/(m-1)}.   (15)

We also assume that g(x, y) is either g_B(x, y), g_E(x, y), or g_P(x, y).
A unified representation is now obtained for the optimal u_{ki}:

u_{ki} = \frac{ g(x_k, v_i) }{ \sum_{j=1}^{c} g(x_k, v_j) }   (16)

for all three objective functions, since g(x, y) represents either g_B(x, y), g_E(x, y), or g_P(x, y).
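The alternation FCM1-FCM4 with the unified update (16) can be sketched as follows. This is our own illustrative code: the default parameter values for \varepsilon, \lambda, \zeta, and m are arbitrary, and `weight_exp` stands for the exponent on u_{ki} in the center update (m for J_B and J_P per (8) and (12), 1 for J_E per (10)).

```python
import numpy as np

def sqdist(X, V):
    """Squared Euclidean distances D(x_k, v_i) as an (N, c) array."""
    return ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

# the three basic functions (13)-(15); parameter defaults are illustrative
def g_B(D, eps=0.1, m=2.0):   return (eps + D) ** (-1.0 / (m - 1.0))
def g_E(D, lam=1.0):          return np.exp(-lam * D)
def g_P(D, zeta=1.0, m=2.0):  return (1.0 + zeta * D) ** (-1.0 / (m - 1.0))

def fcm(X, c, g, weight_exp=1.0, V_init=None, n_iter=100, seed=0):
    """Alternate optimization FCM1-FCM4 using the unified update (16)."""
    rng = np.random.default_rng(seed)
    V = (np.array(V_init, dtype=float) if V_init is not None
         else X[rng.choice(len(X), size=c, replace=False)])
    for _ in range(n_iter):
        G = g(sqdist(X, V))
        U = G / G.sum(axis=1, keepdims=True)        # eq. (16)
        W = U ** weight_exp                         # m for J_B/J_P, 1 for J_E
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]  # eqs. (8)/(10)/(12)
        if np.allclose(V, V_new, atol=1e-9):
            break
        V = V_new
    return U, V
```

For example, `fcm(X, c, g_E, weight_exp=1.0)` realizes the entropy method (9)-(10), while `fcm(X, c, g_B, weight_exp=2.0)` realizes (7)-(8) with m = 2.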
C. Possibilistic Clustering

Possibilistic clustering [18] uses J_P(U, V) but with a different constraint:

M = \{ U = (u_{ki}) : u_{kj} > 0, \ \forall k, j \}.

Note that J_P(U, V) and M in this paper are simpler than the original formulation [18], but the essential discussion is the same.
We cannot use J_B(U, V), which leads to a trivial solution in possibilistic clustering, but J_E(U, V) can be used [4].
We have the solution of possibilistic clustering for J_E(U, V):

u_{ki} = g_E(x_k, v_i)   (17)

using the basic function g_E with v_i given by (10), while the solution for J_P(U, V) is the following:

u_{ki} = g_P(x_k, v_i)   (18)

using the basic function g_P with v_i given by (12). Note that m = 2 is not assumed for possibilistic clustering.

D. Fuzzy Classifiers

There have been many discussions on fuzzy classifiers derived from fuzzy clustering, but we show a standard classifier that is naturally derived from the optimal solutions.
Note that u_{ki} is given only on the objects x_k, while what we need is fuzzy classification rules whereby the solutions are provided.
To understand classification rules clearly, let us consider crisp c-means, where we use the nearest-prototype allocation rule: when the set of cluster prototypes is determined, we allocate an object to its nearest prototype, i.e.,

u_{ki} = \begin{cases} 1 & (i = \arg\min_{1 \le j \le c} D(x_k, v_j)), \\ 0 & (\text{otherwise}). \end{cases}

Note that the objective function is J_H.
This allocation rule is applied to all points in the space, and the result is the Voronoi regions [17] with the centers at the cluster prototypes. Specifically, we define

S_i(V) = \{ x \in R^p : \| x - v_i \| < \| x - v_j \|, \ \forall j \ne i \}

as a Voronoi region for a given set of cluster prototypes V. We then have

\bigcup_{i=1}^{c} \bar{S}_i(V) = R^p, \quad S_i(V) \cap S_j(V) = \emptyset \ (i \ne j),

where \bar{S}_i(V) is the closure of S_i(V). The nearest allocation rule then is as follows:

if x \in S_i(V) then x \to cluster i.

When we consider fuzzy rules, a function U_i(x; V) that interpolates u_{ki} is used. We define the following function using the basic function:

U_i(x; V) = \frac{ g(x, v_i) }{ \sum_{j=1}^{c} g(x, v_j) }, \quad x \in R^p,   (19)

where g(x, y) is either g_B(x, y), g_E(x, y), or g_P(x, y).
Fuzzy rules are simpler in possibilistic clustering:

U_i(x; v_i) = g(x, v_i), \quad x \in R^p,   (20)

where g(x, y) is either g_E(x, y) or g_P(x, y). The rule is thus the same as the basic functions in possibilistic clustering.
We show a number of theoretical properties of the fuzzy rules defined by the above functions. The proofs are given in [25, 28] and omitted here.

Proposition 1: Let U_i(x; V) be with function g_B; in other words, J_B is used. Suppose \varepsilon \to 0. Then the maximizer of U_i(x; V) approaches x = v_i:

\arg\max_{x \in R^p} U_i(x; V) \to v_i, \quad \text{as } \varepsilon \to 0.

Moreover, for all \varepsilon \ge 0, we have

\lim_{\|x\| \to \infty} U_i(x; V) = \frac{1}{c}.

Proposition 2: Let U_i(x; V) be with function g_P; in other words, J_P is used with m = 2. Suppose \zeta \to +\infty. Then the maximizer of U_i(x; V) approaches x = v_i:

\arg\max_{x \in R^p} U_i(x; V) \to v_i, \quad \text{as } \zeta \to +\infty.

Moreover, for all \zeta \ge 0, we have

\lim_{\|x\| \to \infty} U_i(x; V) = \frac{1}{c}.

Hence the functions of the fuzzy rules for J_B and J_P behave similarly when the point x goes far, while the maximum point approaches the cluster center as the respective parameters tend to their limits. In contrast, the fuzzy rule U_i(x; V) for J_E has a quite different property. To describe this, we should discuss Voronoi regions again.
In many cases, fuzzy clusters are made crisp by the maximum membership rule:

if i = \arg\max_{1 \le j \le c} U_j(x; V) then x \to cluster i.
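The classification function (19) and the maximum membership rule can be sketched as follows; the function names and the parameter values inside g_B are illustrative, not from the text. Since each basic function is monotone decreasing in D, the maximum-membership label always coincides with the nearest-prototype label.

```python
import numpy as np

def g_B(D, eps=0.1, m=2.0):
    """Basic function (13) with illustrative eps and m."""
    return (eps + D) ** (-1.0 / (m - 1.0))

def classify(x, V, g):
    """Fuzzy classification function U_i(x;V) of eq. (19) plus the
    maximum membership rule; g is one of the basic functions."""
    D = ((x - V) ** 2).sum(axis=1)     # D(x, v_i) for all centers
    G = g(D)
    U = G / G.sum()                    # U_i(x;V), sums to 1 over i
    return U, int(U.argmax())          # fuzzy degrees and crisp label
```

Because g is decreasing in D, the crisp regions T_i(V) produced by `classify` are exactly the Voronoi regions S_i(V), which is the content of Proposition 3 below.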
Accordingly we can define the set of points that belongs to cluster i:

T_i(V) = \{ x \in R^p : i = \arg\max_{1 \le j \le c} U_j(x; V) \}.

We then have the next proposition.

Proposition 3: For all choices of g = g_B, g = g_E, and g = g_P,

T_i(V) = \bar{S}_i(V).

Thus T_i(V) is the closure of the Voronoi region with center v_i, and T_i(V) is the same for all three objective functions J_B, J_E, and J_P.
Let us now consider U_i(x; V) for J_E.

Proposition 4: Let U_i(x; V) be with function g_E; in other words, J_E is used. Assume the v_i are in general position in the sense that no three of them are on a line. If a Voronoi region T_i(V) is bounded, then

\lim_{\|x\| \to \infty} U_i(x; V) = 0.

If a Voronoi region T_i(V) is unbounded and x moves inside T_i(V), then

\lim_{\|x\| \to \infty} U_i(x; V) = 1.

In both cases, 0 < U_i(x; V) < 1 for all x \in R^p.
The proof is given in [25] and omitted here.

Possibilistic clustering
As fuzzy rules in possibilistic clustering are bell-shaped functions, we have the same property:

\arg\max_{x \in R^p} U_i(x; v_i) = v_i, \qquad \lim_{\|x\| \to \infty} U_i(x; v_i) = 0

for both g_E and g_P.
If possibilistic clusters should be made crisp, we define

T'_i(V) = \{ x \in R^p : i = \arg\max_{1 \le j \le c} U_j(x; v_j) \}.

We have the next proposition:

Proposition 5: For both g = g_E and g = g_P,

T'_i(V) = \bar{S}_i(V).

The Voronoi regions are thus derived again.

3. Size and Covariance of a Cluster

We frequently need to recognize a prolonged cluster, but the original fuzzy c-means cannot do this, as the Voronoi region cannot separate such a prolonged region. To solve such a problem, cluster covariances in fuzzy c-means have been considered by Gustafson and Kessel [11]. However, there is another problem: to separate a dense cluster and a sparse cluster, "density" or "cluster size" has to be considered.
To solve both problems, a generalized objective function with a Kullback-Leibler information term has been proposed by Ichihashi and his colleagues [15, 28]. That is, the following function is used for this purpose:

J_{KL}(U, V, A, S) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} D(x_k, v_i; S_i) + \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} \left\{ \nu \log \frac{u_{ki}}{\alpha_i} + \frac{1}{2} \log |S_i| \right\},   (23)

where the variable A = (\alpha_1, \dots, \alpha_c) controls cluster sizes with the constraint

A = \{ A : \sum_{i=1}^{c} \alpha_i = 1, \ \alpha_j \ge 0, \ j = 1, \dots, c \}.   (24)

Another variable is S = (S_1, \dots, S_c); each S_i (i = 1, \dots, c) is a p \times p positive-definite matrix with determinant |S_i|. In addition,

D(x, v_i; S_i) = (x - v_i)^T S_i^{-1} (x - v_i)   (25)

is the squared Mahalanobis distance for cluster i.
Since this objective function has four variables, alternate optimization means minimization with respect to one variable while the other three are fixed: after giving initial values for V, A, S, we repeat

U = \arg\min_U J_{KL}(U, V, A, S),
V = \arg\min_V J_{KL}(U, V, A, S),
A = \arg\min_A J_{KL}(U, V, A, S),
S = \arg\min_S J_{KL}(U, V, A, S),

until convergence. The solutions are as follows [28].

Solutions for J_{KL}:

u_{ki} = \frac{ \dfrac{\alpha_i}{|S_i|^{1/2}} \exp\!\left( -\dfrac{D(x_k, v_i; S_i)}{\nu} \right) }{ \sum_{j=1}^{c} \dfrac{\alpha_j}{|S_j|^{1/2}} \exp\!\left( -\dfrac{D(x_k, v_j; S_j)}{\nu} \right) },   (26)

v_i = \frac{ \sum_{k=1}^{N} u_{ki} x_k }{ \sum_{k=1}^{N} u_{ki} },   (27)

\alpha_i = \frac{1}{N} \sum_{k=1}^{N} u_{ki},   (28)

S_i = \frac{ \sum_{k=1}^{N} u_{ki} (x_k - v_i)(x_k - v_i)^T }{ \sum_{k=1}^{N} u_{ki} }.   (29)

Note that the above solutions are similar to those obtained by the EM algorithm [5] for Gaussian mixture distributions [22, 29]. This model for fuzzy c-means clustering thus has a close relationship with the statistical model of mixture distributions.
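One sweep of the four-variable alternation (26)-(29) can be sketched as follows. This is our own illustrative code: the value of \nu is arbitrary, and we assume the S_i stay well-conditioned so that inverses and determinants exist.

```python
import numpy as np

def kl_fcm_sweep(X, V, A, S, nu=1.0):
    """One alternate-optimization sweep for J_KL, eqs. (26)-(29); a sketch."""
    N, p = X.shape
    c = len(V)
    U = np.empty((N, c))
    for i in range(c):
        diff = X - V[i]
        Sinv = np.linalg.inv(S[i])
        D = np.einsum('kp,pq,kq->k', diff, Sinv, diff)   # Mahalanobis (25)
        U[:, i] = A[i] / np.sqrt(np.linalg.det(S[i])) * np.exp(-D / nu)
    U /= U.sum(axis=1, keepdims=True)                    # eq. (26)
    V = (U.T @ X) / U.sum(axis=0)[:, None]               # eq. (27)
    A = U.sum(axis=0) / N                                # eq. (28)
    S = np.array([(U[:, i, None] * (X - V[i])).T @ (X - V[i]) / U[:, i].sum()
                  for i in range(c)])                    # eq. (29)
    return U, V, A, S
```

The structural similarity to one E-step plus one M-step of the EM algorithm for Gaussian mixtures is visible directly: (26) plays the role of posterior responsibilities, and (27)-(29) are weighted means, mixing weights, and covariances.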
4. Kernel Functions in Fuzzy Clustering
Many studies on kernel functions have been done [31].
The algorithms of kernel-based fuzzy c-means (e.g., [10,
26, 27]) have also been developed. We review the clustering algorithms and also discuss kernelized cluster validity measures. We moreover study a class of new kernel functions.
A. Kernel-based algorithms
Linear cluster boundaries between Voronoi regions
are obtained by fuzzy c-means clustering. When we use
the KL-information method, we have a curved boundary
described by quadratic functions. In contrast, more general nonlinear boundaries can be obtained using kernel
functions, as discussed in support vector machines [34].
A high-dimensional feature space H is assumed, while the original space R^p is called the data space. H is an inner product space; assume that the inner product is \langle \cdot, \cdot \rangle. The norm of H for g \in H is given by \|g\|_H^2 = \langle g, g \rangle.
A transformation \Phi : R^p \to H is used whereby x_k is mapped into \Phi(x_k). An explicit representation of \Phi(x) is unknown in general, but the inner product \langle \Phi(x), \Phi(y) \rangle is assumed to be represented by a kernel function:

K(x, y) = \langle \Phi(x), \Phi(y) \rangle.   (30)

A well-known kernel function is the Gaussian kernel:

K(x, y) = \exp\{ -C \|x - y\|^2 \}, \quad (C > 0).   (31)

Note that K(x, y) = g_E(x, y) holds when C = \lambda.
The objective functions J_B, J_E, and J_P are used, but the dissimilarity is changed as follows:

D_{ki} = \| \Phi(x_k) - v_i \|_H^2,   (32)

where v_i \in H.
Note: There is another formulation using D_{ki} = \| \Phi(x_k) - \Phi(v_i) \|_H^2 instead of (32), which is omitted here (see, e.g., [35]).
When we derive a kernel-based fuzzy c-means algorithm, we should consider two problems: one is the basic function and the other is the updating scheme.
The basic functions are changed as follows:

g_B(x, y) = (\varepsilon + K(x, x) + K(y, y) - 2K(x, y))^{-1/(m-1)},   (33)

g_E(x, y) = \exp(-\lambda (K(x, x) + K(y, y) - 2K(x, y))),   (34)

g_P(x, y) = (1 + \zeta (K(x, x) + K(y, y) - 2K(x, y)))^{-1/(m-1)},   (35)

whereby the solution u_{ki} is given by (16), with the function g = g_B, g = g_E, or g = g_P changed as above.
The cluster prototype is given by

v_i = \frac{ \sum_{k=1}^{N} (u_{ki})^m \Phi(x_k) }{ \sum_{k=1}^{N} (u_{ki})^m },

but the function \Phi(x_k) is generally unknown. Hence we cannot use the FCM algorithm directly. Instead, we update the dissimilarity measure D_{ki}:

D_{ki} = K_{kk} - \frac{2}{\sum_{j=1}^{N} (u_{ji})^m} \sum_{j=1}^{N} (u_{ji})^m K_{jk} + \frac{1}{\left( \sum_{j=1}^{N} (u_{ji})^m \right)^2} \sum_{j=1}^{N} \sum_{l=1}^{N} (u_{ji})^m (u_{li})^m K_{jl},   (36)

where K_{jk} = K(x_j, x_k). Note that m = 1 in (36) when J_E is considered.
We thus repeat (16) and (36) until convergence.
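The alternation of (16) and (36) can be sketched for the entropy objective J_E (so m = 1 in (36)). This is illustrative code of ours; the value of \lambda, the Dirichlet initialization of memberships, and the fixed iteration count are assumptions, not prescriptions of the text.

```python
import numpy as np

def kernel_fcm(K, c, lam=2.0, n_iter=50, seed=0):
    """Kernel-based fuzzy c-means with the entropy objective J_E:
    alternates the membership update (16) with g_E and the distance
    update (36) with m = 1. K is a precomputed N x N kernel matrix."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    U = rng.dirichlet(np.ones(c), size=N)     # random initial memberships
    for _ in range(n_iter):
        W = U / U.sum(axis=0, keepdims=True)  # normalized weights per cluster
        # eq. (36): D_ki = K_kk - 2 sum_j w_ji K_jk + sum_{j,l} w_ji w_li K_jl
        D = (np.diag(K)[:, None] - 2.0 * K @ W
             + np.einsum('ji,jl,li->i', W, K, W)[None, :])
        G = np.exp(-lam * D)                  # basic function g_E, eq. (34)
        U = G / G.sum(axis=1, keepdims=True)  # eq. (16)
    return U
```

Note that only the Gram matrix K appears: the feature map \Phi is never evaluated, which is exactly why the update (36) replaces the explicit prototype computation.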
B. Kernelized Cluster Validity Measures

Various cluster validity measures have been proposed [1, 6] in order to determine the appropriate number of clusters. They are divided into two classes: one class uses the membership values alone. A typical example is the entropy

E(U, c) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ki} \log u_{ki},

whereby the number c that maximizes E(U, c) is selected.
Another class takes geometrical characteristics into account. A typical method uses the fuzzy covariance matrix for cluster i:

F_i = \frac{ \sum_{k=1}^{N} (u_{ki})^m (x_k - v_i)(x_k - v_i)^T }{ \sum_{k=1}^{N} (u_{ki})^m }.   (37)
Gath and Geva [9] use the sum of the square roots of the determinants of F_i:

W_{det} = \sum_{i=1}^{c} \sqrt{\det F_i}.   (38)

We also consider the sum of the traces of F_i:

W_{tr} = \sum_{i=1}^{c} \mathrm{tr}\, F_i.   (39)
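The two measures (38) and (39) are direct to compute from a fuzzy partition; the following sketch is ours, with m = 2 as an illustrative default.

```python
import numpy as np

def validity_measures(X, U, V, m=2.0):
    """Fuzzy covariance F_i (37) and the measures W_det (38), W_tr (39)."""
    Um = U ** m
    W_det = W_tr = 0.0
    for i in range(V.shape[0]):
        diff = X - V[i]
        Fi = (Um[:, i, None] * diff).T @ diff / Um[:, i].sum()  # eq. (37)
        W_det += np.sqrt(np.linalg.det(Fi))
        W_tr += np.trace(Fi)
    return W_det, W_tr
```

Both measures shrink as the clusters become more compact, which is why the number of clusters minimizing them is selected.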
Hashimoto et al. [12] showed that the trace works as well as the determinant by randomly generating many simulation examples and testing different validity measures.
When we use kernel-based clustering, we should also have kernel-based validity measures. Let us consider kernelized versions of (38) and (39) for this purpose. The kernel-based fuzzy covariance matrix is the following:

KF_i = \frac{ \sum_{k=1}^{N} (u_{ki})^m (\Phi(x_k) - v_i)(\Phi(x_k) - v_i)^T }{ \sum_{k=1}^{N} (u_{ki})^m },   (40)

where v_i is not explicitly given.
Note that the determinant of the kernelized fuzzy covariance is inappropriate, since the next relation holds:

\det KF_i \to 0, \quad \text{as } N \to \infty.

The proof of this relation is simple, because the monotone decreasing sequence \lambda_1, \lambda_2, \dots of the eigenvalues of KF_i converges to zero as N \to \infty (see, e.g., [31]). Hence we have

\log \det KF_i = \sum_{i=1}^{N} \log \lambda_i \to -\infty, \quad \text{as } N \to \infty.
In contrast, the trace of KF_i is useful. After some calculation, we have

\mathrm{tr}\, KF_i = \frac{1}{\sum_{k=1}^{N} (u_{ki})^m} \sum_{k=1}^{N} (u_{ki})^m \| \Phi(x_k) - v_i \|^2 = \frac{1}{\sum_{k=1}^{N} (u_{ki})^m} \sum_{k=1}^{N} (u_{ki})^m D_{ki},   (41)

where D_{ki} is given by (36). We hence define

KW_{tr} = \sum_{i=1}^{c} \mathrm{tr}\, KF_i.   (42)
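Since (41) only needs the kernelized distances D_ki of (36), KW_tr can be computed without the feature map. A sketch of ours, with m = 1 as the default (the entropy method):

```python
import numpy as np

def kw_tr(K, U, m=1.0):
    """Kernelized trace measure KW_tr of eqs. (41)-(42).
    K: N x N kernel matrix; U: N x c memberships; m = 1 for the entropy method."""
    Um = U ** m
    W = Um / Um.sum(axis=0, keepdims=True)
    # D_ki from eq. (36), expressed with normalized weights w_ji
    D = (np.diag(K)[:, None] - 2.0 * K @ W
         + np.einsum('ji,jl,li->i', W, K, W)[None, :])
    # tr KF_i = sum_k (u_ki)^m D_ki / sum_k (u_ki)^m, eq. (41)
    tr = (Um * D).sum(axis=0) / Um.sum(axis=0)
    return tr.sum()   # eq. (42)
```

With the linear kernel K(x, y) = x^T y this reduces to the ordinary within-cluster variance, so the kernelized measure is a strict generalization of W_tr.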
Numerical experiments
We now have a question: although kernel-based clustering works well for some typical clusters with nonlinear boundaries (such as those in Fig. 1), does it also work well for ordinary ball-shaped clusters?
To answer this question, we compared the above measures using randomly generated data with artificial clusters and evaluated the numbers of clusters. The conditions of the random data are shown below.
The basic condition is shown as No.1 in Table 1. Then the diameter of each cluster was changed (No.2). Next, the number of data points in each cluster was randomly changed (No.3). Finally, both the diameter and the number of members were changed (No.4). The details of these conditions are shown in Table 1. Note that the randomly generated clusters are ball-shaped, and we tested whether the kernel-based measure can judge the correct number of clusters as well as the non-kernelized measures.
Table 1. Conditions for random generation of clusters.

Condition                        No.1       No.2          No.3       No.4
Total number of clusters         4          4             4          4
Total number of data points      400        400           400        400
Dimensions of data set           2, 3       2, 3          2, 3       2, 3
Range of cluster centers         0.0~1.0    0.0~1.0       0.0~1.0    0.0~1.0
Number of data in each cluster   100        100           50~150     50~150
Diameter of each cluster         0.1        0.05~0.193    0.1        0.05~0.193
The process of evaluating the numbers of clusters is as follows:
(1) Generate a data set under one of the conditions No.1-4.
(2) Perform clustering 100 times with random initial values, and then use the resulting clusters having the minimum value of the objective function for the evaluation.
(3) Evaluate the above clusters by each validity measure.
(4) Give the label "correct" if the selected number of clusters is 4; otherwise give the label "wrong."
(5) Repeat steps (1)-(4) 1000 times.
(6) Calculate the percentage of the label "correct" for each validity measure.
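The evaluation protocol above can be sketched as a generic loop. All callables here (`make_data`, `cluster`, `measure`) are hypothetical stand-ins for the procedures in the text, not functions from any library.

```python
def evaluate_measure(make_data, cluster, measure, best_c,
                     c_range=(2, 3, 4, 5, 6), n_trials=100):
    """Fraction of trials in which a validity measure selects the true
    number of clusters; a sketch of steps (1)-(6) with hypothetical callables."""
    correct = 0
    for trial in range(n_trials):
        X = make_data(seed=trial)                                     # step (1)
        results = {c: cluster(X, c, seed=trial) for c in c_range}     # step (2)
        chosen = min(c_range, key=lambda c: measure(X, *results[c]))  # step (3)
        correct += (chosen == best_c)                                 # step (4)
    return correct / n_trials                                         # step (6)
```

Plugging in a data generator for one condition of Table 1, a clustering routine, and one of W_det, W_tr, or KW_tr reproduces one cell of Table 2.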
We observe that the kernelized measure KWtr is as effective as the non-kernelized measures. Moreover, it has been shown that KWtr can judge the correct number of clusters for sets of points like the one in Fig. 1 (see, e.g., [28]).
C. Positive definite kernels derived from fuzzy c-means
As the last topic in this paper, we consider kernel functions again. We note that the Gaussian kernel is the same as the basic function g_E. Note that the other basic functions g_B and g_P are also bell-shaped, but with "longer tails."
Table 2. The ratio of accurate numbers of clusters for each condition using J_B (m = 2) with dimension p = 2, 3.

Condition   Measure   p = 2    p = 3
No.1        Wtr       0.958    0.994
            Wdet      0.943    0.995
            KWtr      0.952    0.994
No.2        Wtr       0.770    0.982
            Wdet      0.931    0.992
            KWtr      0.779    0.980
No.3        Wtr       0.953    0.993
            Wdet      0.931    0.993
            KWtr      0.949    0.990
No.4        Wtr       0.710    0.981
            Wdet      0.897    0.982
            KWtr      0.728    0.972
Here is a question: are g_B and g_P also positive-definite kernel functions? We also have a second question: are they as useful in kernel-based clustering as the Gaussian kernel?
The answer to the first question is given in the next proposition.

Proposition 6: The functions

g_B(x, y) = (\varepsilon + D(x, y))^{-1/(m-1)}, \quad \text{with } \varepsilon > 0,

and

g_P(x, y) = (1 + \zeta D(x, y))^{-1/(m-1)}

are positive-definite kernels.

The proof is based on a theorem by Schönberg [32], which states that a class of positive-definite kernels can be derived from completely monotone functions. We proved that g_B and g_P are derived from completely monotone functions [14]. The details are given in [14] and omitted here. Note that g_B is not positive-definite if \varepsilon = 0, i.e., the original objective function does not give a kernel function. The regularization parameter \varepsilon > 0 is thus necessary.
Accordingly, we can use these two functions in the kernel-based fuzzy c-means algorithms instead of the Gaussian kernel.

Illustrative examples
Let us consider the two sets of points shown in Figs. 1 and 2. The former figure is typical in discussing the effect of kernel functions. Crisp and fuzzy c-means cannot divide the set of points into the outer circle and the inner ball, since they produce linear cluster boundaries. In contrast, the Gaussian kernel is known to successfully divide the two. As expected, g_B and g_P also perfectly succeed in dividing the outer circle and the inner ball [14].

Figure 1. First data set: a ball inside a circle.

The second figure shows the "two crescents," which are similar to examples used in semi-supervised learning [3]. It is more difficult to divide these two sets of points.

Figure 2. Second data set: two crescents.
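The positive definiteness asserted in Proposition 6 can be spot-checked numerically: the Gram matrix of a positive-definite kernel over any finite point set has nonnegative eigenvalues. The parameter values below (\varepsilon = 0.5, \zeta = 1.0, m = 2) are illustrative choices of ours.

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix of a kernel over points X, built from D(x, y)."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return kernel(D)

# g_B and g_P from Proposition 6 with illustrative parameter values
def g_B(D, eps=0.5, m=2.0):  return (eps + D) ** (-1.0 / (m - 1.0))
def g_P(D, zeta=1.0, m=2.0): return (1.0 + zeta * D) ** (-1.0 / (m - 1.0))

# spot-check: all eigenvalues of the Gram matrix are (numerically) nonnegative
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
for k in (g_B, g_P):
    eig = np.linalg.eigvalsh(gram(k, X))
    assert eig.min() > -1e-10
```

Such a check is of course no proof, but it is a quick sanity test before plugging a candidate kernel into the kernel-based algorithms of Section 4.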
We summarize the results of the classifications in Table 3, where misclassifications are fewer for g_B and g_P than for the Gaussian kernel.
We thus have the answer to the second question: g_B and g_P are useful in these examples with nonlinear cluster boundaries.

Table 3. Summary of misclassifications by fuzzy c-means with the three kernel functions applied to the two-crescents data. Calculations were repeated 50 times for each kernel with different initial random values. Numbers in parentheses are from the entropy fuzzy c-means. The parameters are m = 2, \varepsilon = 1.0, \zeta = 1.0, and \lambda = 1.0.

Percentage of misclassifications   g_E (Gaussian)   g_B       g_P
0 ~ 15                             14 (7)           1 (0)     11 (10)
16 ~ 30                            10 (12)          3 (5)     13 (13)
31 ~ 45                            8 (13)           8 (8)     12 (15)
46 ~                               18 (18)          38 (37)   14 (12)
5. Conclusions

An overview of fuzzy c-means clustering with three different objective functions has been given, with the focus on fuzzy classifiers, a generalization including variables of cluster size and covariance, and kernel functions. The two discussions on kernel functions are kernelized validity measures and new kernels derived from the basic functions of fuzzy c-means. The kernel functions g_B and g_P are useful for the examples given here. We expect that they are useful in support vector machines as well, but many more experiments using real numerical data are necessary.
In spite of their importance, the topics herein are relatively unknown to the fuzzy community interested in clustering. They provide, however, many future research opportunities both in theory and applications. For example, applications to semi-supervised clustering [3, 35] and a variety of new fuzzy clustering algorithms [33, 36] will be promising.
References

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[2] J. C. Bezdek, J. Keller, R. Krishnapuram, and N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, Boston, 1999.
[3] O. Chapelle, B. Schölkopf, and A. Zien, eds., Semi-Supervised Learning, MIT Press, Cambridge, Massachusetts, 2006.
[4] R. N. Davé and R. Krishnapuram, "Robust Clustering Methods: a Unified View," IEEE Trans. on Fuzzy Systems, vol. 5, pp. 270-293, 1997.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. R. Stat. Soc., vol. B39, pp. 1-38, 1977.
[6] D. Dumitrescu, B. Lazzerini, and L. C. Jain, Fuzzy Sets and Their Application to Clustering and Training, CRC Press, Boca Raton, Florida, 2000.
[7] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," J. of Cybernetics, vol. 3, pp. 32-57, 1974.
[8] J. C. Dunn, "Well-Separated Clusters and Optimal Fuzzy Partitions," J. of Cybernetics, vol. 4, pp. 95-104, 1974.
[9] I. Gath and A. B. Geva, "Unsupervised Optimal Fuzzy Clustering," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 773-781, 1989.
[10] M. Girolami, "Mercer Kernel Based Clustering in Feature Space," IEEE Trans. on Neural Networks, vol. 13, no. 3, pp. 780-784, 2002.
[11] E. E. Gustafson and W. C. Kessel, "Fuzzy Clustering with a Fuzzy Covariance Matrix," IEEE CDC, San Diego, California, pp. 761-766, 1979.
[12] W. Hashimoto, T. Nakamura, and S. Miyamoto, "Comparison and Evaluation of Different Cluster Validity Measures Including Their Kernelization," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 13, no. 3, pp. 204-209, 2009.
[13] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis, Wiley, Chichester, 1999.
[14] J. S. Hwang and S. Miyamoto, "Kernel Functions Derived from Fuzzy Clustering and Their Application to Kernel Fuzzy c-Means," Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 15, pp. 90-94, 2011.
[15] H. Ichihashi, K. Honda, and N. Tani, "Gaussian Mixture PDF Approximation and Fuzzy c-Means Clustering with Entropy Regularization," Proc. of Fourth Asian Fuzzy Systems Symposium, vol. 1, pp. 217-221, 2000.
[16] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[17] T. Kohonen, Self-Organizing Maps, 2nd Ed., Springer, Berlin, 1997.
[18] R. Krishnapuram and J. M. Keller, "A Possibilistic Approach to Clustering," IEEE Trans. on Fuzzy Systems, vol. 1, pp. 98-110, 1993.
[19] R. P. Li and M. Mukaidono, “A Maximum Entropy
Approach to Fuzzy Clustering,” Proc. of the 4th
IEEE Intern. Conf. on Fuzzy Systems
(FUZZ-IEEE/IFES'95), Yokohama, Japan, pp.
2227-2232, March 20-24, 1995.
[20] R. P. Li and M. Mukaidono, “Gaussian Clustering
Method Based on Maximum-Fuzzy-Entropy Interpretation,” Fuzzy Sets and Systems, vol. 102, pp.
253-258, 1999.
[21] J. B. MacQueen, “Some Methods of Classification
and Analysis of Multivariate Observations,” Proc.
of 5th Berkeley Symposium on Math. Stat. and
Prob., pp. 281-297, 1967.
[22] G. McLachlan and D. Peel, Finite Mixture Models,
Wiley, New York, 2000.
[23] S. Miyamoto, Fuzzy Sets in Information Retrieval
and Cluster Analysis, Kluwer, Dordrecht, 1990.
[24] S. Miyamoto and M. Mukaidono, “Fuzzy
c -Means as a Regularization and Maximum Entropy Approach,” Proc. of the 7th International
Fuzzy Systems Association World Congress
(IFSA'97), Prague, Czech, vol. II, pp. 86-92, June
25-30, 1997.
[25] S. Miyamoto, Introduction to Cluster Analysis,
Morikita-Shuppan, Tokyo, 1999 (in Japanese).
[26] S. Miyamoto and Y. Nakayama, “Algorithms of
Hard c -Means Clustering Using Kernel Functions
in Support Vector Machines,” Journal of Advanced
Computational Intelligence and Intelligent Informatics, vol. 7, no. 1, pp. 19-24, 2003.
[27] S. Miyamoto and D. Suizu, “Fuzzy c -Means
Clustering Using Kernel Functions in Support
Vector Machines,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol.
7, no. 1, pp. 25-30, 2003.
[28] S. Miyamoto, H. Ichihashi, and K. Honda, Algorithms for Fuzzy Clustering, Springer, Berlin, 2008.
[29] R. A. Redner and H. F. Walker, “Mixture Densities,
Maximum Likelihood and the EM Algorithm,”
SIAM Review, vol. 26, no. 2, pp. 195-239, 1984.
[30] E. H. Ruspini, “A New Approach to Clustering,”
Information and Control, vol. 15, pp. 22-32, 1969.
[31] B. Schölkopf and A. Smola. Learning with Kernels,
MIT Press, Cambridge, Massachusetts, 2002.
[32] I. J. Schönberg, “Metric Spaces and Completely
Monotone Functions,” Annals of Mathematics, vol.
39, no. 4, pp. 811-841, 1938.
[33] C.-C. Tsai, C.-C. Chen, C.-K. Chan, and Y.-Y. Li,
“Behavior-Based Navigation Using Heuristic
Fuzzy Kohonen Clustering Network for Mobile
Service Robots,” International Journal of Fuzzy
Systems, vol. 12, no. 1, pp. 25-32, 2010.
[34] V. N. Vapnik, Statistical Learning Theory, Wiley,
New York, 1998.
[35] N. Wang, X. Li, and X. Luo, “Semi-supervised kernel-based fuzzy c -means with pairwise constraints,” WCCI 2008 Proceedings, Hong Kong,
China, pp.1099-1103, June 1-6, 2008.
[36] F. Yu, J. Tang, and R. Cai, “Partially Horizontal
Collaborative Fuzzy C-Means,” International
Journal of Fuzzy Systems, vol. 9, no. 4, pp.
198-204, 2007.
[37] L. A. Zadeh, “Similarity Relations and Fuzzy Orderings,” Information Sciences, vol. 3, pp. 177-200,
1971.
Dr. Miyamoto was born in Osaka, Japan, in 1950. He received the B.S., M.S., and Dr. Eng. degrees in Applied Mathematics and Physics Engineering from Kyoto University, Japan, in 1973, 1975, and 1978, respectively. He is now a Professor at the Department of Risk Engineering, the University of Tsukuba, Japan. His current research interests include methodology for uncertainty modeling, data clustering algorithms, multisets, and methods for text mining. He is a member of the Society of Instrumentation and Control Engineers of Japan, the Information Processing Society of Japan, the Japan Society of Fuzzy Theory and Systems, and IEEE. He is a fellow of the International Fuzzy Systems Association.