Probabilistic Aspects in Classification
Hans H. Bock
Institute of Statistics, Technical University of Aachen,
Wüllnerstr. 3, D-52056 Aachen, Germany

In: C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock, Y. Baba (eds.): Data Science, Classification, and Related Methods. Springer, Tokyo, 1998, 3-21.
Summary: This paper surveys various ways in which probabilistic approaches can be
useful in partitional (’non-hierarchical’) cluster analysis. Four basic distribution models
for ’clustering structures’ are described in order to derive suitable clustering strategies.
They are exemplified for various special distribution cases, including dissimilarity data
and random similarity relations. A special section describes statistical tests for checking
the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and
comparing it to standard clustering situations (comparative assessment of classifications,
CAC).
1. Introduction
Consider a finite set O = {1, ..., n} of objects whose properties are characterized by
some observed or recorded data (a table, a data matrix, verbal descriptions) such
that ’similarities’ or ’dissimilarities’ which may exist among these objects can be
determined by these data. Cluster analysis provides formal algorithms for subdividing
the set O into a suitable number of homogeneous subsets, called clusters (classes,
groups etc.) such that all objects of the same cluster show approximately the same
class-specific properties while objects belonging to different classes behave differently
in terms of the underlying data (separation of clusters).
A range of clustering algorithms is based on a probabilistic point of view: Data
are considered as realizations of random variables, thus influenced by random errors
and natural fluctuations (variations), and even the finite set of objects O may be
considered as a random sample from an infinite universe (super-population Π). Then
any class or classification of objects must necessarily be defined in terms of probability
distributions for the data. This paper presents a survey on this probability-based part
of cluster analysis. While we can point only briefly to various topics, a more detailed
presentation with numerous references may be found in Bock (1974, 1977, 1985, 1987,
1989a, 1994, 1996a,b,c,d), Jain and Dubes (1988) and Milligan (1996); also see various
other papers of this volume, e.g., by Gordon, Hardy, Lapointe and Rasson.
Let us first specify the notation: The set of objects O = {1, ..., n} shall be partitioned
into a suitable number m of disjoint classes C1 , ..., Cm ⊂ O resulting in an m-partition
C = (C1, ..., Cm) of O. In fact, we focus on partitional clusterings here, thus neglecting
hierarchical or overlapping classifications. We will consider three types of data:
1. A data matrix X = (xkj)n×p = (x1, ..., xn)′ where, for each object k ∈ O, p
(quantitative or qualitative) variables have been sampled and compiled into the
observation vector xk = (xk1, ..., xkp)′ (′ denotes the transposition of a matrix or vector).
2. A dissimilarity matrix D = (dkl )n×n where dkl quantifies the dissimilarity existing between two objects k and l (e.g., dkl = ||xk − xl ||2 ); typically, we have
0 = dkk ≤ dkl = dlk < ∞ for all k, l ∈ O.
3. A similarity relation S = (skl )n×n where skl = 1 or 0 if the objects k and l are
considered to be ’similar’ or ’dissimilar’, respectively.
In our probabilistic context, the observed xk , dkl and skl will be realizations of suitable
random variables Xk , Dkl and Skl , respectively, whose probability distributions describe the type and extent of the underlying clustering (or non-clustering) structure.
In contrast to deterministic (e.g., algorithmic or exploratory) approaches to cluster
analysis, this probabilistic framework can be helpful for the following purposes:
(1) Modeling clustering structures allowing for various shapes of clusters.
(2) Providing clustering criteria under more or less specified clustering assumptions
(to be distinguished from clustering algorithms that optimize these criteria)
(3) Describing and quantifying the performance or optimality of clustering methods
(e.g., in a decision-theoretic framework; see Remark 2.2).
(4) Investigating the asymptotic behaviour of clustering methods if, e.g., n approaches
∞ under a mixture model or under a ’randomness’ hypothesis.
(5) Testing for the ’homogeneity’ or ’randomness’ of the data, either versus a general hypothesis of ’non-randomness’ or versus special clustering alternatives,
thus checking for the existence of a hidden clustering of the objects.
(6) Testing for the relevance of a calculated classification C ∗ that has been obtained
by a special clustering algorithm.
(7) Determining the ’true’ or an appropriate number m of classes (see Remark 2.1).
(8) Assessing the relevance of a special cluster Ci∗ of objects that has been obtained
from a clustering algorithm (Gordon 1994).
In the following we will survey some of these topics in detail and consider several
special clustering models.
2. Some probabilistic clustering models
A major and basic problem of classification theory arises from the difficulty of defining the concept of a ’class’, which may be approached by philosophical, mathematical,
probabilistic or heuristic tools. Probabilistic approaches provide a flexible way of
describing the relationship between objects and classes, not just by building each
class from the set of objects that share the same descriptor values or feature combination (as is common, e.g., in concept theory and monothetic classification), but
by allowing some variation between the objects of the same class whose properties
may deviate, to some limited and random degree, from the typical class profile. In
technical terms this is realized by characterizing each class (classification, clustering
structure) by suitable probability distributions for the sampled data.
There are four basic probabilistic models that are often used in this context. They
will be formally described for the important case when the data are provided by n
independent random p-dimensional feature vectors X1, ..., Xn (for dissimilarity and
similarity data see Sections 2.1.4 and 2.1.5).
2.1 The fixed-partition clustering model Hm
A fixed-partition model Hm assumes that there exist a fixed, but unknown partition
C = (C1 , ..., Cm ) of the set O into m non-empty classes C1 , ..., Cm ⊂ O and a system θ = (ϑ1 , ..., ϑm ) of (unknown) parameter values ϑ1 , ...., ϑm ∈ Rs describing the
properties of these classes such that
Xk ∼ f(·; ϑi)   for all k ∈ Ci and i = 1, ..., m,   (2.1)
where f (x; ϑ) originates from a given parametric family of distribution densities. In
this context, ’clustering’ consists in estimating the unknown parameters, i.e., the
number of classes m ≥ 1, the m-partition C = (C1 , ..., Cm ) and the parameter system
θ = (ϑ1 , ..., ϑm ). Note that for n data vectors we have (at least) ms + n unknown
parameters, viz. ϑ1 , ..., ϑm and the class indicators I1 , ..., In where Ik = i iff k ∈ Ci .
Classical statistics provides various estimation strategies from which we consider the
maximum likelihood method here (see also Remark 2.2). Assuming m to be known,
maximizing the joint likelihood of the observed data vectors x1 , ..., xn is equivalent
to:
g(C, θ) := Σ_{i=1}^m Σ_{k∈Ci} (− log f(xk; ϑi)) → min_{C,θ}   (2.2)
(maximum-likelihood classification approach). This is a combined combinatorial and
analytic optimization problem whose solution(s) can be approximated by the well-known
iterative k-means algorithm which partially minimizes g with respect to θ and
C in turn. In fact, partial minimization with respect to the parameter θ provides the
maximum likelihood (m.l.) estimates θ(C) := (ϑ̂1, ..., ϑ̂m) for a given C, thus reducing
(2.2) to the combinatorial problem
gm(C) := Σ_{i=1}^m Σ_{k∈Ci} (− log f(xk; ϑ̂i)) → min_C.   (2.3)
On the other hand, minimizing g with respect to C for a given parameter vector θ
yields the maximum-probability-assignment partition C(θ) := Cˆ := (Ĉ1 , ..., Ĉm ) with
classes
Ĉi := {k ∈ O | f(xk; ϑi) = max_{ν=1,...,m} f(xk; ϑν)},   i = 1, ..., m   (2.4)
(with suitable rules for avoiding ties and empty classes). C(θ) can often be interpreted
as a minimum-distance partition and will be termed in this way here. Using (2.4),
the optimization problem (2.2) reduces to
γ(θ) := Σ_{k=1}^n min_{ν=1,...,m} {− log f(xk; ϑν)} → min_θ.   (2.5)
Thus, fixed-partition models provide, simultaneously, three equivalent clustering criteria (2.2), (2.3) and (2.5) and a comfortable (and fast converging) optimization
strategy (k-means algorithm). We must note, however, that in this formulation
any resulting (optimal) class Ci* is not necessarily a most ’homogeneous’ one, described by good internal properties or even well separated from other classes, but
only an element of an m-partition C* that optimizes the overall criterion (2.3), i.e.,
some average homogeneity of classes. This explains why small classes are typically neglected in this approach (except for those far away from the rest of the data,
e.g., ’outlier classes’). Empirical practice as well as theoretical investigations (Bock
1968, 1974) show that criteria of this type have the tendency to produce equally-sized
classes.
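As an illustration of this alternating scheme, the following Python sketch implements the two partial minimization steps (2.3) and (2.4) for a generic density family. The function names, the random initialization and the re-seeding rule for empty classes are our own illustrative choices, not part of the original formulation; with the plug-ins shown at the end (spherical normal classes of Section 2.1.1), the routine reduces to the classical k-means algorithm for the variance criterion (2.6)-(2.7).

```python
import numpy as np

def ml_classification(X, m, fit_class, neg_loglik, n_iter=100, rng=None):
    """Alternating minimization of the m.l. clustering criterion (2.2):
    estimate the class parameters by m.l. for the current partition (cf. (2.3)),
    then reassign each object to the class with the smallest -log f(x_k; theta_i),
    i.e. build the maximum-probability-assignment partition (2.4)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    labels = rng.integers(m, size=n)                    # random initial m-partition
    for _ in range(n_iter):
        theta = []
        for i in range(m):
            members = X[labels == i]
            if len(members) == 0:                       # simple rule against empty classes
                members = X[rng.choice(n, size=1)]
            theta.append(fit_class(members))
        costs = np.column_stack([neg_loglik(X, th) for th in theta])   # n x m matrix
        new_labels = costs.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    g = costs[np.arange(n), labels].sum()               # value of the criterion (2.3)
    return labels, theta, g

# Spherical normal classes (Section 2.1.1): the criterion reduces to the SSQ criterion.
labels, theta, g = ml_classification(
    np.random.default_rng(0).normal(size=(200, 2)), m=3,
    fit_class=lambda Z: Z.mean(axis=0),
    neg_loglik=lambda X, mu: ((X - mu) ** 2).sum(axis=1))
```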
We illustrate the flexibility of the fixed-partition approach by five special distribution
models (see also Bock 1974, 1987, 1996a,b,c,d).
Remark 2.1: The determination of the unknown (or: a suitable) number m of classes
is a conceptually and technically difficult problem that is intensively discussed, e.g.,
in Bock (1968, 1974, 1996a), Gordon (1997b), Milligan (1981, 1996) and Milligan and
Cooper (1985), but will not be investigated here.
2.1.1 The normal distribution case and the variance criterion
Considering the case of quantitative data vectors X1 , ..., Xn ∈ Rp , let each class Ci be
described by a p-dimensional spherical normal distribution N(µi, σ²Ep) with class-specific expectations µ1, ..., µm ∈ Rp, σ² > 0 (known or unknown), and Ep the p × p
unit matrix. Then the criteria (2.2) and (2.3) reduce to the equivalent well-known
SSQ or variance criteria:
g(C, θ) := (1/n) Σ_{i=1}^m Σ_{k∈Ci} ||xk − µi||² → min_{C,µ} =: g*_mn   (2.6)

gm(C) := (1/n) Σ_{i=1}^m Σ_{k∈Ci} ||xk − x̄Ci||² → min_C = g*_mn,   (2.7)

where x̄Ci denotes the centroid (mean vector) of the data points in the class Ci.
Similar normal distribution models assume an Np (µi , Σ) or Np (µi , Σi ) distribution in
Ci allowing for (possibly class-specific) dependencies among the p components (Bock
1974). More general approaches characterize each class Ci by a hyperplane Hi in Rp
(instead by a single point µi only): principal component clustering (Bock 1969, 1974;
Diday 1973: analyse factorielle typologique) and regression clustering (Bock 1969),
or constrain the class centers µ1 , ..., µm to belong to the same (unknown) hyperplane
H ⊂ Rp of a small dimension s, say, such that the clustering structure is essentially
low-dimensional (projection pursuit clustering; Bock 1987).
Remark 2.2: Clustering has been investigated in a decision-theoretic framework as
well, thereby looking for a clustering criterion that minimizes the expected loss incurred by missing a hidden clustering structure (Binder 1978; Bock 1968, 1972, 1974;
Hayashi 1974, 1993; Bernardo 1994). For example, Bock derived various Bayesian
methods for normal distribution clustering models under suitable prior assumptions
on the underlying parameters and showed, e.g., that (2.7) is asymptotically optimum
in some cases. Similarly, Hayashi investigated the minimaxity of clustering criteria.
2.1.2 A semi-parametric convex cluster model
There are instances (e.g., in pattern recognition and image exploration) where clusters
can be characterized by uniform distributions U (D) concentrated on some convex set
D ⊂ Rp . A corresponding fixed-partition model for X1 , ..., Xn involves an m-partition
C and m unknown non-overlapping convex domains (’parameters’) D1 , ..., Dm ⊂ Rp
such that for all k ∈ Ci , Xk ∼ U (Di ). The resulting m.l. clustering criterion (2.3) is
given by
gm(C) := Σ_{i=1}^m |Ci| · log volp(H(Ci)) → min_C   (2.8)
where the m.l. estimate of Di is just the convex hull D̂i = H(Ci ) := conv{xk |k ∈ Ci }
of the data points belonging to the class Ci , and minimization is over all m-partitions
with non-overlapping H(C1 ), ..., H(Cm ) with positive volumes (Bock 1997; see also
section 3.2.3). Note that such a model is applicable only if the presumptive clusters
are clearly separated by some empty space.
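To make the criterion concrete, the following sketch evaluates (2.8) for a given m-partition using the convex hull routine of SciPy; the check that the class hulls do not overlap and the combinatorial search over partitions are omitted here. In R², ConvexHull.volume is the area of the hull.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_cluster_criterion(X, labels):
    """Hull-volume criterion (2.8): sum_i |C_i| * log vol_p(H(C_i)).
    Each class needs enough points in general position to span a hull
    with positive volume (at least p+1 points in R^p)."""
    value = 0.0
    for i in np.unique(labels):
        members = X[labels == i]
        value += len(members) * np.log(ConvexHull(members).volume)
    return value
```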
2.1.3 Qualitative data, contingency tables and entropy criteria
In the case of qualitative data where the j-th component Xkj of Xk takes its values
in a finite set Xj of alternatives (e.g., Xj = {0, 1} in the binary case) the observed
vectors x1, ..., xn belong to the Cartesian product X := Π_{j=1}^p Xj which corresponds
to the cells of the p-dimensional contingency table N = (nξ)ξ∈X = (nξ1,...,ξp) that
contains as its entries the number of objects k ∈ O with the same data vector
xk = ξ = (ξ1, ..., ξp)′ ∈ X. Thus any clustering C of O corresponds to a decomposition of N = N1 + ··· + Nm into m class-specific sub-tables N1, ..., Nm.
Loglinear models provide a convenient tool for describing the distribution of multivariate qualitative data. A loglinear model for a vector X involves various parameters
(stacked into a vector ϑ) which are distinguished into main effects (roughly describing the size of marginal frequencies) and interaction parameters (of various orders)
that describe the association or dependencies that might exist among the p components of X. Then the distribution density (probability function) of X takes the form
f(ξ; ϑ) = P(X = ξ) = c(ϑ) · exp{z(ξ)′ϑ} where z(ξ) is a binary dummy vector that
picks from ϑ the interaction parameters which correspond to the cell ξ of the contingency table (c(ϑ) is a norming factor). – Assuming a fixed-partition model with
class-specific interaction vectors ϑ1 , ..., ϑm for the m classes, the m.l. method yields
the entropy clustering criterion
gm(C) := Σ_{i=1}^m |Ci| · H(X, ϑ̂i(C)) → min_C   (2.9)
where H(X, ϑi) := − Σ_{ξ∈X} f(ξ; ϑi) · log f(ξ; ϑi) ≥ 0 is Shannon’s entropy for the
distribution density f(·; ϑi) and ϑ̂i = ϑ̂i(C) the m.l. estimate for ϑi in the class Ci. –
This and similar entropy criteria (such as logistic regression clustering) are considered
in Bock (1986, 1993, 1994, 1996a,d) and reformulated in Celeux and Govaert (1991).
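A simple special case is the saturated loglinear model, where the class-wise m.l. estimate puts the empirical cell frequency on every cell, so that (2.9) becomes the sum of |Ci| times the empirical Shannon entropy of the cell counts within Ci. The following sketch computes this value for a given partition; the general case with restricted interaction structures would require an iterative loglinear fit within each class and is not shown.

```python
import numpy as np
from collections import Counter

def entropy_criterion(rows, labels):
    """Entropy criterion (2.9) for the saturated loglinear model:
    sum_i |C_i| * H_i with H_i the empirical entropy of the cell counts in C_i.
    `rows` is a sequence of tuples, each tuple one qualitative data vector x_k."""
    value = 0.0
    for i in set(labels):
        counts = Counter(r for r, lab in zip(rows, labels) if lab == i)
        n_i = sum(counts.values())
        probs = np.array(list(counts.values())) / n_i
        value += n_i * float(-(probs * np.log(probs)).sum())
    return value
```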
2.1.4 Clustering models for dissimilarity data
The fixed-partition approach can also be used in the case of dissimilarity-based clustering where the data is a matrix D = (Dkl)n×n of random dissimilarities Dkl. Recall
that a basic model for describing a ’no-clustering’ or ’randomness’ situation assumes
that all n(n − 1)/2 variables Dkl with k < l, whilst being independent, have the same distribution density f(d), like a suitable generic variable D* ≥ 0, say (e.g., an exponential
distribution Exp(1)). In this context, a clustering structure involving a partition
C = (C1, ..., Cm) will intuitively result if we shrink the dissimilarities between objects
in the same class Ci by a factor ϑii > 0, and stretch the dissimilarities between objects belonging to different classes Ci, Cj, i ≠ j, by another factor ϑij > 0 (typically
ϑii < ϑij for all i, j). The resulting dissimilarity clustering model reads as follows:
Dkl ∼ ϑij · D*   for all k ∈ Ci, l ∈ Cj.   (2.10)
The corresponding m.l. clustering criterion is given by
g(C, θ) := Σ_{1≤i≤j≤m} { Σ_{k∈Ci, l∈Cj} [− log f(dkl/ϑij)] + nij · log ϑij } → min_{C,θ}   (2.11)

where for i = j the inner sum is over k < l only, and nij = |Ci| · |Cj| and nii = |Ci|(|Ci| − 1)/2
is the number of terms in the inner sum for i ≠ j and i = j, respectively. For
exponentially distributed dissimilarities with f(d) = e^(−d) for d > 0, this reduces to:
g(C, θ) := Σ_{1≤i≤j≤m} nij [DCi,Cj/ϑij + log ϑij] → min_{C,θ}   (2.12)
where DCi,Cj = nij^(−1) Σ_{k∈Ci, l∈Cj} dkl is the average dissimilarity between two classes Ci
and Cj, and DCi,Ci = nii^(−1) Σ_{k,l∈Ci, k<l} dkl measures the heterogeneity of Ci. – Note
that the unconstrained m.l. estimate for ϑij is given by ϑ̂ij = DCi,Cj such that (2.12)
reduces to the log-distance clustering criterion

gm(C) := g(C, θ̂) − n(n − 1)/2 = Σ_{1≤i≤j≤m} nij log DCi,Cj → min_C.   (2.13)
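A minimal sketch of the log-distance criterion (2.13) for a given partition of a dissimilarity matrix; the search over partitions is again omitted, and class pairs without any dissimilarity value (e.g., singleton classes in the within-class term) are skipped.

```python
import numpy as np

def log_distance_criterion(D, labels):
    """Log-distance criterion (2.13): sum over class pairs i <= j of
    n_ij * log(average dissimilarity D_{C_i,C_j}), where the within-class
    average uses each pair k < l once."""
    classes = np.unique(labels)
    value = 0.0
    for a, i in enumerate(classes):
        for j in classes[a:]:
            block = D[np.ix_(labels == i, labels == j)]
            vals = block[np.triu_indices_from(block, k=1)] if i == j else block.ravel()
            if len(vals) > 0:
                value += len(vals) * np.log(vals.mean())
    return value
```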
2.1.5 Clustering models for random similarity relations
In this last example, we consider binary similarity data where two objects k and
l are considered either as being ’similar’ (skl = 1, e.g. for k = l) or ’dissimilar’
(skl = 0). (Note that these data can be interpreted as a similarity or association graph
with n vertices and M := Σ_{k,l: k≠l} skl links or edges.) Intuitively, a corresponding
clustering model should be such that links are more likely inside the clusters than
between different clusters. This can be modeled by introducing linking probabilities
pij between the classes Ci and Cj of C (typically pii > pij for i ≠ j). The resulting
fixed-partition model for the n(n − 1)/2 random and independent similarity indicators Skl, k < l
(with Skk = 1 and Skl = Slk) reads as follows:
P(Skl = 1) = pij   for all k ∈ Ci, l ∈ Cj   (2.14)
and leads to the m.l. clustering criterion:
g(C, p) := − Σ_{1≤i≤j≤m} [Nij log pij + (nij − Nij) log(1 − pij)] → min_{C,p}   (2.15)
where, for i < j, Nij (Nii ) is the number of observed links skl = 1 between Ci and
Cj (inside Ci ). If side constraints are neglected, p̂ij := Nij /nij is the m.l. estimate
for pij and can be substituted into (2.15). – The model (2.14) has been described by
Bock (1989b, 1996a,c), related models are known under the heading ’block models’
(see, e.g., Snijders and Nowicki 1996).
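A sketch of the criterion (2.15) with the unconstrained plug-in estimates p̂ij = Nij/nij already substituted; blocks whose p̂ij equals 0 or 1 contribute zero, and the optimization over partitions is not shown.

```python
import numpy as np

def block_model_criterion(S, labels):
    """Criterion (2.15) with p_ij replaced by N_ij / n_ij, where N_ij is the
    number of observed links (s_kl = 1) between classes C_i and C_j
    (within C_i for i = j, each unordered pair counted once)."""
    classes = np.unique(labels)
    value = 0.0
    for a, i in enumerate(classes):
        for j in classes[a:]:
            block = S[np.ix_(labels == i, labels == j)]
            links = block[np.triu_indices_from(block, k=1)] if i == j else block.ravel()
            n_ij, N_ij = len(links), links.sum()
            p_hat = N_ij / n_ij if n_ij > 0 else 0.0
            if 0.0 < p_hat < 1.0:
                value -= N_ij * np.log(p_hat) + (n_ij - N_ij) * np.log(1.0 - p_hat)
    return value
```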
2.2 Random-partition clustering models and the mixture model Hm^mix
Clustered populations are often described by mixture models: The random vectors
X1 , ..., Xn are assumed to be independent, all with the same mixture density
f(x) = f(x; θ, π) := Σ_{i=1}^m πi f(x; ϑi)   (2.16)
which involves m class-specific parameters ϑ1, ..., ϑm and m unknown probabilities
π1, ..., πm (with Σ_{i=1}^m πi = 1). Whilst this model incorporates no explicit clustering
of objects, it is well known that it is obtained by a two-stage process where, in a
first step, the objects of O are sampled independently from a superpopulation Π that
is decomposed into m subpopulations Π1 , ..., Πm with relative sizes πi and described
by the parameters ϑi . This first step results in a non-observable random partition
C = (C1 , ..., Cm ) of O that is characterized by the (random) class indicators I1 , ..., In
according to Ci = {k ∈ O|Ik = i}, i = 1, ..., m. Conditionally on Ik = i (i.e., k ∈ Ci ),
the vector Xk is distributed with the density f (·; ϑi ).
Classical mixture analysis concentrates on the estimation of the unknown parameters
π1 , ..., πm , ϑ1 , ..., ϑm for a suitable number m of components, typically by maximizing
the log likelihood
L(θ, π) := Σ_{k=1}^n log( Σ_{i=1}^m πi · f(xk; ϑi) ) → max_{θ,π}   (2.17)
(McLachlan and Basford 1988, Titterington et al. 1985). Whilst this criterion involves no classification of objects, such a classification Ĉ = (Ĉ1, ..., Ĉm) can be constructed from the estimated parameters ϑ̂i, π̂i by using, in an additional stage, a
plug-in Bayesian rule that yields the classes

Ĉi := {k ∈ O | π̂i f(xk; ϑ̂i) = max_{ν=1,...,m} {π̂ν f(xk; ϑ̂ν)}},   i = 1, ..., m.   (2.18)

Ĉ = (Ĉ1, ..., Ĉm) is an ’estimate’ for the random partition C.
A more appropriate approach, which incorporates a partition of objects from the outset, is based on the random-partition model which comprises the n independent
pairs (I1, X1), ..., (In, Xn) with the joint ’density’ πi f(x; ϑi) (for x ∈ Rp, i = 1, ..., m)
and maximizes the joint likelihood l(θ, π, I1, ..., In; x1, ..., xn) = Π_{k=1}^n πIk f(xk; ϑIk) =
Π_{i=1}^m Π_{k∈Ci} πi f(xk; ϑi) of these data with respect to θ, π = (π1, ..., πm) and the ’missing’ values I1, ..., In (or C, equivalently). Using the fact that partial maximization
with respect to π yields the estimates π̂i = |Ci|/n, we obtain the clustering criterion

g(C, θ) := Σ_{i=1}^m Σ_{k∈Ci} (− log f(xk; ϑi)) − n · Σ_{i=1}^m (|Ci|/n) · log(|Ci|/n) → min_{C,θ}   (2.19)
(Fahrmeir, Kaufmann and Pape 1980, Symons 1981, Anderson 1985). Obviously, this
adds an entropy term to the previous fixed-partition criterion (2.2). It can be shown
that the classes of an optimum m-partition C ∗ are generated by the Bayesian rule
(2.18) (after replacing ϑ̂i by the optimum values ϑ∗i ). Computationally, the likelihood
l(θ, π, I1, ..., In; x1, ..., xn) can be successively increased by using a modified k-means
algorithm where, in the t-th iteration step, a new partition C^(t) is obtained by applying the Bayesian rule (2.18) with the previous parameter estimates ϑi^(t−1) and
πi^(t−1) = |Ci^(t−1)|/n (obtained from the previous partition C^(t−1)), in analogy to the
maximum-probability-assignment partition (2.4) (Fahrmeir et al. 1980, Celeux and
Diebolt 1985, Bock 1996a, Fahrmeir 1996).
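The modified k-means step can be sketched as follows for spherical normal classes with a known common variance σ² (a simplifying assumption made only for this illustration): each iteration re-estimates the class means and proportions from the current partition and then reassigns every object by the plug-in Bayesian rule (2.18).

```python
import numpy as np

def classification_kmeans_with_proportions(X, m, sigma2=1.0, n_iter=100, rng=None):
    """Modified k-means for criterion (2.19), assuming N(mu_i, sigma2 * I) classes:
    alternate m.l. estimation of (mu_i, pi_i) with the Bayesian assignment rule (2.18)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    labels = rng.integers(m, size=n)                    # random initial partition
    for _ in range(n_iter):
        mus, pis = [], []
        for i in range(m):
            members = X[labels == i]
            if len(members) == 0:                       # re-seed empty classes
                members = X[rng.choice(n, size=1)]
            mus.append(members.mean(axis=0))
            pis.append(len(members) / n)
        mus, pis = np.array(mus), np.array(pis)
        # -log(pi_i * f(x_k; mu_i)) up to additive constants
        cost = ((X[:, None, :] - mus[None]) ** 2).sum(-1) / (2 * sigma2) - np.log(pis)[None]
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, mus, pis
```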
2.3 Modal clusters and density-contour clusters
Another group of clustering models, designed primarily for data points x1 , ..., xn in
Rp , looks for those regions of Rp where these data points are locally concentrated,
or, alternatively, for the regions in which the density of points exceeds some given
threshold: These regions (or the corresponding clouds of points) can be used and
interpreted as ’classes’ or ’clusters’, especially in the context of pattern recognition
and image processing.
More specifically, let f(x) be the common (smooth) distribution density of X1, ..., Xn
and define, for a threshold c > 0, by B(c) := {x ∈ Rp | f(x) ≥ c} the level-c region of f. Then the connected components B1(c), B2(c), ... of B(c) are termed high-density clusters (Bock 1974, 1996a) or density-contour clusters of f at the level c.
For increasing values of c, these clusters split, but also disappear, and show insofar a pseudo-hierarchical structure. The unknown density f can be approximated,
e.g., by a kernel estimate f̂n(x) obtained from the data x1, ..., xn, and corresponding
estimates B̂1(c), B̂2(c), ... ⊂ Rp are found from f̂n. From these estimated regions,
a (non-exhaustive) clustering of objects or data points is obtained by defining the
clusters Ci(c) := {k ∈ O | xk ∈ B̂i(c)}, i.e., the objects whose data points belong to
B̂i(c) ∩ {x1, ..., xn}, i = 1, 2, ... . Note that
a cluster Ci(c) can show a very general (even ramified) shape in Rp, and will be
particularly useful if, for a fixed sufficiently large c, it is separated by broad ’density
valleys’ from the rest of the data, and, for varying c, if it is constant over a wide
range of values of c.
Except for the two-dimensional case, the geometrical description of high-density clusters is difficult. Therefore many ’discretized’ or modified versions of this clustering
strategy have been proposed (often using a weaker or discretized version of connectivity in Rp ; see Bock 1996a). From a theoretical point of view, Hartigan (1981) showed
that single linkage clustering fails in detecting high-density clusters for all dimensions
p ≥ 2.
A related clustering approach focusses on local density peaks in Rp, i.e., on the points
ξ1, ξ2, ... ∈ Rp where the underlying (smooth) density f (or its estimate f̂n) has its
local maxima (modes): Clusters are formed by successively relocating each data point
xk into a region with a larger value of f̂n (by hill-climbing algorithms, steepest ascent
etc.) and then collecting into the same cluster Ci, termed mode cluster, all data points
which finally reach the same mode of f̂n. Even if this approach can be criticized
for the instability of the cluster concept (small local variations of f or f̂n can generate
an arbitrarily large number of modes or clusters), it is often used in image analysis and
pattern recognition, and there exist many algorithmic variations of this approach (cf.
Bock 1996a).
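The following sketch uses one of the discretized substitutes for connectivity mentioned above: points whose kernel density estimate lies below the level c remain unclassified, and the retained points are split into the connected components of a graph that links points closer than a chosen radius. The kernel estimator, the linking radius and the label −1 for unclassified points are illustrative choices, not prescriptions of the original text.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

def density_contour_clusters(X, c, link_radius):
    """Discretized density-contour clustering at level c: keep points with
    estimated density >= c and form clusters as connected components of the
    graph joining retained points at distance <= link_radius."""
    kde = gaussian_kde(X.T)              # gaussian_kde expects an array of shape (p, n)
    keep = kde(X.T) >= c
    labels = np.full(len(X), -1)
    if keep.any():
        graph = radius_neighbors_graph(X[keep], radius=link_radius)
        _, comps = connected_components(graph, directed=False)
        labels[keep] = comps
    return labels
```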
2.4 Spatial clustering and clumping models
Motivated by biological and physical applications, spatial statistics provides various
other models for describing a clustering tendency of points in the space Rp. A first
non-parametric approach considers the data x1 , ..., xn as a realization of a Poisson process (restricted to a finite window G ⊂ Rp ), either a homogeneous one with a constant
intensity λ (= the average number of data points per unit square) for describing a
’homogeneous’ or ’non-clustered’ sample, or with a location-dependent intensity λ(x)
in the case of a clustering structure: Here the modes and contour regions of λ(x) can
characterize clusters similarly as in section 2.3 (when using a distribution density f ),
and will be determined by suitable non-parametric estimates of λ(x) (Ripley 1981,
Cressie 1991).
Another model is motivated by the spread of plants in a plane or the growth of
crystals around nuclei: The Neyman-Scott process builds clusters in three separate
steps: (1) by placing random ’seed points’ ξ1, ξ2, ... into Rp according to a homogeneous Poisson process, (2) by choosing, for each ξi, a random integer Ni with a Poisson distribution P(λ), and (3) by surrounding each ’parent’ point ξi by Ni ’daughter’
points Xi1, ..., XiNi that are independently distributed according to h((x − ξi)/σ)/σ
(conditionally on the result of (1) and (2)) where h(x) is a spherically symmetric
density (typically, h ∼ N(0, 1) or h ∼ U(K(0, 1)), the uniform distribution in the
unit ball K(0, 1)). The data are then identified with the set of all daughter points
Xik inside a suitable window G. – There exist statistical methods for estimating the
unknown parameters λ, σ etc. from these data, but the problem of reconstructing
the ’clusters’ (families) from the data is largely unsolved. In this respect the model is representative of a range of models (including Cox processes, Poisson cluster processes
etc.) that focus more on the clustering tendency of the data than on the underlying
clustering of objects.
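For illustration, a Neyman-Scott sample in the unit square can be simulated directly from the three steps above; the parameter names and the normal choice for h are illustrative assumptions, and daughter points falling outside the window are simply discarded.

```python
import numpy as np

def simulate_neyman_scott(parent_intensity, lam, sigma, p=2, rng=None):
    """Neyman-Scott process on the window [0, 1]^p: Poisson(parent_intensity)
    parents, Poisson(lam) daughters per parent, N(0, sigma^2 I) displacements."""
    rng = np.random.default_rng(rng)
    n_parents = rng.poisson(parent_intensity)
    parents = rng.uniform(size=(n_parents, p))
    points, family = [], []
    for i, xi in enumerate(parents):
        offspring = xi + sigma * rng.standard_normal((rng.poisson(lam), p))
        inside = np.all((offspring >= 0.0) & (offspring <= 1.0), axis=1)
        points.append(offspring[inside])
        family.extend([i] * int(inside.sum()))
    X = np.vstack(points) if points else np.empty((0, p))
    return X, np.array(family), parents
```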
3. Hypothesis testing in cluster analysis
A major problem in cluster analysis consists in the interpretation of the constructed
clusters of objects and the assessment of their relevance for the underlying practical
problem. A range of strategies can be proposed in order to solve this problem, including:
(a) Descriptive and exploratory methods for determining the properties of the
clusters, either in terms of the observed data or by using secondary (background)
information that has not yet been used in the classification process (e.g.,
Bock 1981);
(b) A substance-related analysis of the classes that looks for intuitive, ’natural’
explanations or interpretations of the differences existing among the obtained
classes;
(c) A quantitative evaluation of the benefits that can be gained by using the constructed classification in practice (e.g., in marketing, administration, official statistics,
libraries etc.);
(d) A qualitative or quantitative validation of the clusters by comparing them to
classifications obtained from other clustering methods, from alternative data
(for the same objects) or from traditional systematics (see Lapointe 1997);
(e) Inferential statistics which is based on probabilistic models and proceeds essentially by classical hypothesis testing.
It is this latter issue (e) that will be discussed in this section. In fact, there is a long
list of clustering-related questions that can be investigated by hypothesis testing. A
full account is given in Bock (1996a). Here we will address only two of the major
problems: Testing for homogeneity and checking the adequacy of a calculated classification (model).
3.1 Testing for homogeneity
Each clustering algorithm provides a classification of objects even if the underlying
data show no cluster structure and are homogeneous in some sense. In this case, the
resulting classification will typically be an artifact of the algorithm and might lead
to wrong conclusions, e.g. when searching for ’natural’ classes of objects. In order
to avoid this error, it will be useful to check, before applying a clustering algorithm,
some hypothesis of ’homogeneity’ or ’randomness’ and perform clustering only if this
hypothesis is rejected. Depending on the type of data, the following models for ’homogeneity’ or ’randomness’ have been considered in this context (also see Bock (1985,
1989a, 1996a) and Gordon (1996, 1997)):
HG: X1, ..., Xn are uniformly distributed in a finite domain G ⊂ Rp (to be estimated
from the data).

H^uni: X1, ..., Xn have the same (often unknown) unimodal distribution density f0(x)
in Rp, with the special case:

H1^N: X1, ..., Xn all have the same p-dimensional normal distribution Np(µ, σ²Ep).

H^D: All n(n − 1)/2 dissimilarities Dkl, k < l, are i.i.d., each with an arbitrary (or a specified)
continuous distribution density f(d); this implies the two following models:

H^perm: All (n(n − 1)/2)! rankings of the dissimilarities Dkl, k < l, are equally probable.

H^{n,M}: For each fixed number M of ’similar’ pairs of objects {k, l} (i.e. with Dkl smaller
than a given threshold d > 0), these M links are purely randomly assigned to
the set of all n(n − 1)/2 pairs of objects.
For testing one of these hypotheses versus a general alternative of non-homogeneity
we may consider, e.g., the empirical distribution of the Euclidean distances Dkl :=
||Xk − Xl||, the nearest neighbour distances Dk = min_{l≠k} {Dkl}, maximin distances or
gap statistics such as T := maxk {Dk } or T ∗ , the radius of the maximum ball that can
be placed in the window G without containing a data point Xk , and various other
test statistics. A survey of the resulting homogeneity tests is given, e.g., by Dubes
and Jain (1979), Dubes and Zeng (1987) and Bock (1985, 1989a, 1996a).
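As an example of such a homogeneity test, the following Monte Carlo sketch uses the gap-type statistic T = max_k Dk (the largest nearest-neighbour distance) and, as a crude stand-in for the unknown window G, rescales the data to the unit hypercube and simulates uniform samples there; the choice of statistic, window and number of replications is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def largest_nn_distance(X):
    """Gap-type statistic T = max_k D_k with D_k the nearest-neighbour distance of x_k."""
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1).max()

def monte_carlo_homogeneity_test(X, n_sim=999, rng=None):
    """Monte Carlo p-value of T under H_G (uniformity), taking the rescaled
    bounding box of the data as the window; large T speaks against homogeneity."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    t_obs = largest_nn_distance(Z)
    t_sim = np.array([largest_nn_distance(rng.uniform(size=(n, p))) for _ in range(n_sim)])
    p_value = (1 + np.sum(t_sim >= t_obs)) / (n_sim + 1)
    return t_obs, p_value
```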
A better power performance is to be expected from tests which are tailored to some
clustering alternative of the type that has been defined in section 2. For example,
there exists a range of tests for bimodality and multimodality which are suited to
the concept of mode clusters or mixtures (Silverman 1981, Hartigan 1985, Sawitzki
1996) or related to single linkage clusters (i.e., the connected components of suitable
similarity graphs). In particular, the graph-theoretical and combinatorial methods
proposed by Ling (1973), Godehardt (1990), Godehardt and Horsch (1996), and Van
Cutsem and Ycart (1996a,b, 1997) are designed to test the hypotheses H^perm and
H^{n,M}, just by comparing the single linkage hierarchy calculated from the data to the
one to be expected for random data under these hypotheses.
3.2 Testing versus parametric clustering models
The fixed-partition clustering model Hm, (2.1), and the corresponding mixture model
Hm^mix, (2.16), are specified by a parametric density family (typically with a unimodal
density f(x; ϑ)), and they reduce to the same homogeneity model H1 for m = 1
or ϑ1 = ··· = ϑm. Thus testing H1 against the alternatives Hm or Hm^mix can, in
principle, be performed with the likelihood ratio test (LRT) statistics
principle, be performed with the likelihood ratio test (LRT) statistics
Tm
L (C ∗ )
L
:= 2 log LHm = 2 log HLm n
H1
H1
and
L mix
Tmmix := 2 log LHm
H1
(3.1)
where LHm, LHm^mix and LH1 denote the likelihood of the data maximized under the
models Hm, Hm^mix and H1, respectively, for a fixed number m > 1 of clusters, and Cn*
is the optimum m-partition resulting from (2.2) or (2.3). Unfortunately, the classical asymptotic LRT theory (yielding χ² distributions for Tm) fails for these clustering models, due either to the fact that H1 is on the ’boundary’ of Hm^mix under the
parametrization (2.16) or to the discrete character of the parameter C in the fixed-partition model (2.1) (see also Hartigan 1985, Bock 1996a). However, there exist
some special investigations relating to these two test criteria.
3.2.1 Testing versus the mixture model
The case of a one-dimensional normal mixture f(x) ∼ Σ_{i=1}^m πi N(µi, σ²) has been investigated by Everitt (1981), Thode et al. (1988) and Böhning (1994), who present
simulated percentiles of Tm^mix under N(0, 1) for various sample sizes n and m = 2.
The power of this LRT is investigated by Mendell et al. (1991, 1993), who find,
e.g., that n ≥ 50 is needed to have 50% power to detect a difference |µ1 − µ2| ≥ 3σ
with 0.1 ≤ π1 ≤ 0.9 (also see Milligan (1981, 1996), Bock (1996a)). The paper of
Böhning (1994) extends these results to the case of one-dimensional exponential families and shows that inside these families, the asymptotic distribution of Tm^mix remains
(approximately) stable. For more general cases we recommend determining suitable
percentiles of Tm^mix by simulation instead of resorting, e.g., to heuristic formulas.
More theoretical investigations are presented by Titterington et al. (1985), Titterington (1990), Goffinet et al. (1992), and Böhning et al. (1994). Those authors
show, for two-component mixtures (with partially fixed parameters), that under H1
the asymptotic distribution of T2^mix (for n → ∞) is a mixture of the unit mass at 0
and a χ² distribution with one degree of freedom. Ghosh and Sen (1985) show that the asymptotic distribution
of T2^mix is closely related to a suitable Gaussian process, and Berdai and Garel (1994)
present the corresponding tabulations. – An alternative method for testing H1 versus
Hm^mix has been proposed by Bock (1977, 1985, 1996a, chap. 6.6) and uses, as test
statistic, the average similarity among the sample points x1, ..., xn, which should be
larger under H1 than under the mixture alternative.
3.2.2 Testing for the fixed-classification model; the max-F test
In contrast to the mixture model, the fixed-classification model (2.1) is defined in
terms of an unknown m-partition C = (C1, ..., Cm) of the n objects, for a fixed number m of classes and a given family of densities f(x; ϑ). Therefore the LRT using Tm,
(3.1), can be interpreted either as
• a test for homogeneity H1 versus the clustering structure Hm;
• a test for the significance or suitability of the calculated (optimum) classification Cn* of the n objects;
• a test for the existence of m > 1 ’natural’ classes in the data set (versus the
hypothesis of one class only).
Thus, depending on the interpretation, the analysis of the LR test has many facets.
Under the assumption that X1, ..., Xn are i.i.d., all with the same density f(x) (describing either homogeneity or a mixture of distributions), the almost sure convergence
and the asymptotic normality of the parameter estimates ϑ̂i has been intensively studied, e.g., in Bryant and Williamson (1978), Pollard (1982), Pärna (1986), and Bryant
(1991). It appears that the asymptotic behaviour of these estimates is closely related
to the solution of the ’continuous’ clustering problem
G(B, θ) := Σ_{i=1}^m ∫_{Bi} [− log f(x; ϑi)] · f(x) dx → min_{B,θ}   (3.2)
where minimization is over all m-partitions B = (B1 , ..., Bm ) of Rp and all parameter
vectors θ = (ϑ1 , ..., ϑm ). Instead of going into details here (see Bock 1996a) we will
focus on the special case of the normal distribution model described in section 2.1.1
where Xk ∼ N (µi , σ 2 Ep ) for k ∈ Ci under Hm , and Xk ∼ N (µ, σ 2 Ep ) for all k in case
of H1 . Here the LR test reduces to the intuitive max-F test defined by
∗
kmn
:=
=:
Pm
|Ci | · ||xCi − x||2
max Pmi=1P
2
C
k∈Ci ||xk − xCi ||
i=1
max
C
SSB(C)
SSB(Cn∗ )
=
SSW (C)
SSW (Cn∗ )
(3.3)
> c decide for Hm
≤ c decide for H1
(3.4)
where Cn∗ minimizes the variance criterion (2.7). In this case the continuous optimization problem (3.2) reduces to
G(B, µ) := Σ_{i=1}^m ∫_{Bi} ||x − µi||² · f(x) dx → min_{B,µ} =: G*_m   (3.5)
as an analogue to (2.6), and its solution B ∗ is necessarily a stationary partition, i.e.
a minimum-distance partition of Rp generated by its own class centroids, the conditional expectations µ∗i := Ef [X|X ∈ Bi∗ ], i = 1, ..., m.
For the one-dimensional case, the optimum partition B* of R1 is given by Cox (1957)
and Bock (1974, p. 179) for the cases f ∼ N(0, 1) and f ∼ U([−√3, √3]) with variance 1. For two- and three-dimensional normals f ∼ Np(µ, Ep), a range of stationary
partitions B has been calculated by Baubkus (1985) (m = 2, ..., 6) and Flury (1993);
the ellipsoidal case has been considered by Baubkus (1985), Flury (1993, m = 4),
Kipper and Pärna (1992), Tarpey et al. (1995) and Jank (1996, 2 ≤ m ≤ 4). For
the two-dimensional normal N2(0, E2) some stationary partitions as well as their numerical characteristics are reproduced in Tab. 1; for example, the three quite distinct
5-partitions B^{5,1}, B^{5,2}, B^{5,3} differ in their G5-values by no more than 0.013 (for other
cases see Bock 1996a). It is conjectured that for m = 2 to 5 this list includes the
optimum partitions of R², marked by the asterisks ** in Tab. 1, but a formal proof of optimality exists only for m = 2 and 3 classes (Baubkus 1985).
In order to apply the max-F test, the critical threshold (percentile) c must be calculated from the null distribution of k*_mn under some f ∼ H1. While this distribution is intractable for finite n, the asymptotic normality of g*_mn and k*_mn has been
proved under some regularity conditions on f, with asymptotic expectations G*_m and
κ*_m := (Ef[||X − Ef[X]||²] − G*_m)/G*_m, the continuous analogues of the minimum
variance and max-F criteria (2.7) and (3.3), respectively (see Bryant and Williamson
1978, Hartigan 1978 (for p = 1), Bock 1985 (for p ≥ 1)). Since the regularity conditions include the uniqueness of the optimum partition B* of (3.5), these results
cannot be applied to the rotation-invariant density f ∼ Np(0, Ep) if p > 1, but suitable simulations have been conducted (see Hartigan 1978, Bock 1996a, Jank 1996).
For example, Jank (1996) and Jank and Bock (1996) found that for n ≥ 100 (in
particular: n = 1000) the null distribution of the standardized values (g*_mn − a)/b
and (k*_mn − a)/b is satisfactorily approximated by a N(0, 1) distribution (in the range
[−2, 2], say) if a and b are chosen to be the empirical mean and standard deviation of the
optimum values g*_mn and k*_mn, respectively (results of the k-means algorithm), from
N = 1600 simulations of {x1, ..., xn} under N2(0, E2) and m = 2, ..., 5.

m = 2, B^2:     µi = (±0.79789, 0)′;   pi ≡ 1/2;   G2 = 1.36338**;   κ2 = 0.46694
m = 3, B^3:     µi = 1.03648 · (cos(2πi/3), sin(2πi/3))′;   pi ≡ 1/3;   G3 = 0.92570**;   κ3 = 1.16052
m = 4, B^{4,1}: µi = 1.12838 · (cos(πi/2), sin(πi/2))′;   pi ≡ 1/4;   G4 = 0.72676**;   κ4 = 1.75194
m = 4, B^{4,2}: µi = 1.27910 · (cos(2πi/3), sin(2πi/3))′ (i = 1, 2, 3), µ4 = (0, 0)′;   pi = 0.24034 (i = 1, 2, 3), p4 = 0.27898;   G4 = 0.82034;   κ4 = 1.43801
m = 5, B^{5,1}: µi = 1.36334 · (cos(πi/2), sin(πi/2))′ (i = 1, ..., 4), µ5 = (0, 0)′;   pi = 0.18636 (i = 1, ..., 4), p5 = 0.25457;   G5 = 0.61448**;   κ5 = 2.25477
m = 5, B^{5,2}: µ1,2 = (±0.70505, 0.87119)′, µ5,3 = (±1.34917, −0.52106)′, µ4 = (0, −0.89064)′;   p1,2 = 0.22217, p5,3 = 0.14580, p4 = 0.26405;   G5 = 0.62246;   κ5 = 2.21305
m = 5, B^{5,3}: µi = 1.17246 · (cos(2πi/5), sin(2πi/5))′;   pi ≡ 1/5;   G5 = 0.62533;   κ5 = 2.19830

Tab. 1: Stationary partitions B = (B1, ..., Bm) of R² with m = 2, ..., 5 classes for the continuous variance criterion (3.5) if f ∼ N2(0, E2), with the class centers µi = Ef[X|X ∈ Bi], class percentages pi = P(X ∈ Bi) and criterion values Gm := G(B, µ). The ratio κm := (2 − Gm)/Gm is the continuous analogue to k*_mn, (3.3). ** marks the optimum or best known m-partitions.
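The Monte Carlo approach just described can be imitated with a few lines of code. The following sketch (illustrative, not the exact procedure of Jank and Bock) estimates percentiles of k*_mn under f ∼ Np(0, Ep) by simulation, approximating the SSQ-optimal partition with the k-means algorithm of scikit-learn; sample sizes and the number of replications are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_f_statistic(X, m, n_init=10, random_state=0):
    """max-F statistic k*_mn of (3.3): SSB/SSW for the (approximately) SSQ-optimal
    m-partition delivered by the k-means algorithm."""
    km = KMeans(n_clusters=m, n_init=n_init, random_state=random_state).fit(X)
    ssw = km.inertia_                                     # within-class sum of squares
    ssb = ((X - X.mean(axis=0)) ** 2).sum() - ssw         # total SSQ = SSB + SSW
    return ssb / ssw

def simulated_null_percentile(n, p, m, alpha=0.05, n_sim=500, rng=None):
    """Monte Carlo (1 - alpha)-percentile of k*_mn under H1: f ~ N_p(0, E_p)."""
    rng = np.random.default_rng(rng)
    stats = [max_f_statistic(rng.standard_normal((n, p)), m) for _ in range(n_sim)]
    return float(np.quantile(stats, 1 - alpha))
```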
3.2.3 The LRT for the convex cluster case; convex cluster tests
A less investigated case is provided by clustering models where each class is characterized by a uniform distribution on a convex domain of Rp . To be specific, we
consider the following three convex clustering models which all involve a system of
m unknown non-overlapping convex sets D1 , ..., Dm ⊂ Rp (to be estimated from the
data):
Hm: Fixed-classification model:
Xk ∼ U(Di) for all k ∈ Ci, with an unknown m-partition C = (C1, ..., Cm) of O.

Hm^mix: Mixture model:
Xk ∼ Σ_{i=1}^m πi · U(Di) for k = 1, ..., n.

Hm^uni: Pseudo-mixture model:
Xk ∼ U(D1 + ··· + Dm) for k = 1, ..., n (constrained to the union D1 + ··· + Dm).

These models have been introduced by Bock (1997) as a generalization of some work by
Rasson et al. (1988, 1994) related to Hm^uni (also see Rasson 1997).
Here we consider the problem of testing the hypothesis of homogeneity HG (i.e., Xk ∼
U(G) for some unknown convex G ⊂ Rp) versus one of these clustering alternatives.
We find the following LR tests, where Gn := conv{x1, ..., xn} denotes the convex hull
of all n data points, D̂i := H(Ci) the convex hull of all data points belonging to a
class Ci ⊂ O, and c is a critical threshold to be obtained from the HG distribution
of the test statistics:
• HG versus Hm:

Tm := − Σ_{i=1}^m (|Ci*|/n) · log[ volp(H(Ci*)) / volp(Gn) ]   { > c: decide for Hm;  ≤ c: accept uniformity HG }   (3.6)

where the partition C* = (C1*, ..., Cm*) minimizes the clustering criterion (2.8).
• HG versus Hm^mix:

Tm^mix := Σ_{i=1}^m (|Ci*|/n) · log[ (|Ci*|/n) / (volp(H(Ci*)) / volp(Gn)) ]   { > c: decide for Hm^mix;  ≤ c: accept uniformity HG }   (3.7)

where C* = (C1*, ..., Cm*) is the partition that minimizes the clustering criterion

gm^mix(C) := Σ_{i=1}^m |Ci| · log( volp(H(Ci)) / |Ci| ) → min_C

over all partitions C = (C1, ..., Cm) with disjoint convex hulls H(C1), ..., H(Cm),
all with a positive volume.
• HG versus Hm^uni:

Denote by C* = (C1*, ..., Cm*) the m-partition which minimizes the volume clustering criterion

λ(C) := volp(H(C1)) + ··· + volp(H(Cm)) → min_C,   (3.8)

i.e. the sum of the volumes of the m class-specific convex hulls D̂i := H(Ci) ⊂
Gn (supposed to be non-overlapping). Then Vn := Gn − (H(C1*) + ··· + H(Cm*))
is the maximum ’empty space’ that is left in Gn outside of the cluster domains
D̂i, and the LRT reduces to the following empty space test:

Tm^uni := volp(Vn) / volp(Gn)   { > c: decide for clustering Hm^uni;  ≤ c: accept uniformity HG }   (3.9)
This test is a multivariate generalization of various one-dimensional gap tests
(see Bock (1989a, 1996a, 1997), Rasson and Kubushishi (1994), Hardy (1997)).
Since all these test criteria are invariant against any regular affine transformation of
the data, their distribution under HG can be determined (e.g., by simulations) for
some standardized form of G such as the unit square or the unit ball in Rp, without
restriction of generality.
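A sketch of the empty space statistic (3.9) for a given candidate partition: since the class hulls are assumed disjoint, the empty space volume is the hull volume of all points minus the sum of the class hull volumes. The combinatorial minimization of (3.8) and the overlap check are omitted here.

```python
import numpy as np
from scipy.spatial import ConvexHull

def empty_space_statistic(X, labels):
    """Empty space statistic T_m^uni of (3.9) for a given partition with
    (assumed) non-overlapping class hulls."""
    vol_G = ConvexHull(X).volume
    vol_classes = sum(ConvexHull(X[labels == i]).volume for i in np.unique(labels))
    return 1.0 - vol_classes / vol_G
```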
3.3 Power considerations and the comparative assessment of classifications
Practical experience and theoretical investigations have shown that the power of clustering tests is far from being satisfactory. For example, in normal distribution mixture
cases a considerable separation of the classes is needed in order to detect a hidden
clustering structure with a satisfactorily large probability (see Everitt 1981, Mendell
et al. 1993, Bock 1996a). As a matter of fact, this difficulty seems to be an intrinsic
feature of the classification problem, and not only a technical deficiency of our statistical tools. The result of any clustering test must therefore be interpreted with care,
more in an indicative than in a conclusive sense.
Quite generally, it may be doubted if the classical hypothesis testing paradigm with
only two alternative decisions, acceptance and rejection of a hypothesis H0, is
appropriate for the clustering framework, where it would be much more realistic
and useful to distinguish various grades of classifiability which interpolate between
the extreme cases of ’homogeneity or randomness’ and an ’obvious clustering structure’. Instead of defining corresponding quantitative measures of classifiability here,
we propose a more qualitative approach, the comparative assessment of classifications
(CAC): It proceeds by defining, prior to any clustering algorithm, some ’benchmark
clustering situations’ Hm^ε, i.e., data configurations or distribution models which show
a more or less marked classification structure, indexed by ε (not necessarily a real
number, but possibly a measure of class separation; see below). These configurations
can be selected in cooperation between practitioners and statisticians. Then, after having
calculated a special (e.g., optimum) classification Cn* for the given data {x1, ..., xn},
we compare this classification with those to be expected under Hm^ε, for various degrees ε. Thus we place the observed data into a network of various different clustering
situations Hm^ε in order to get an idea of their underlying structure.
In the case of the fixed-classification normal model Hm , (2.1), with class-specific
distributions Np (µi , σ 2 Ep ), this idea can be realized as follows:
1. Determine suitably parametrized benchmark clusterings Hm^ε, e.g.:

   • A partition C of O with m classes Ci of equal sizes ni = |Ci| = n/m,
     whose class centroids µi are sufficiently different, e.g., ||µi − µj||/σ ≥ ε for
     all i ≠ j, or (1/m) Σ_{i=1}^m ||µi − µ||²/σ² ≥ ε.

   • A normal mixture Σ_{i=1}^m πi · Np(µi, σ²Ep) with ε := Σ_{i=1}^m πi · ||µi − µ||².

2. Consider the LRT statistics Tm or, equivalently, the max-F statistics k*_mn for
   the hypothesis H0: µ1 = ··· = µm.

3. Determine or estimate some characteristics Q of the (intractable) probability
   distribution of Tm or k*_mn under the benchmark situations Hm^ε selected in 1.,
   e.g., by simulating a large number of data sets {X̃1, ..., X̃n} under Hm^ε and
   calculating the empirical mean, median or some other empirical percentile of
   the resulting values of Tm or k*_mn.

4. Compare the values Tm or k*_mn calculated from the original data {x1, ..., xn}
   (possibly after a suitable standardization) to the characteristics Q of the
   benchmark situations.

5. The clustering tendency or classifiability of the data {x1, ..., xn} and the relevance of the calculated classification Cn* is then described, illustrated and quantified (by ε) by confronting them to those benchmark situations Hm^ε which
   show a weaker clustering behaviour (in the sense that, e.g., Q ≤ Tm or Q ≤ k*_mn)
   and, on the other hand, to those which describe a stronger clustering structure
   (i.e., where the converse inequality holds).
It is obvious that this CAC strategy is related to a formal test of a hypothesis ε ≤ ε0
versus ε > ε0, but it is more flexible due to the arbitrary selection of suitable benchmark situations. Its generalization to other clustering models is obvious.
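The CAC procedure of steps 1.-5. can be sketched as follows for the normal benchmark family; the particular family (equal class sizes, centers ε·i on the first coordinate axis with σ = 1), the grid of ε values, the number of replications and the median as the characteristic Q are all illustrative choices. The max-F statistic is computed as in the earlier simulation sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_f(X, m):
    """max-F statistic k*_mn (3.3) based on the k-means partition."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
    ssw = km.inertia_
    return (((X - X.mean(axis=0)) ** 2).sum() - ssw) / ssw

def cac_benchmark_comparison(X, m, eps_grid=(0.0, 1.0, 2.0, 3.0), n_sim=200, rng=None):
    """Steps 1.-5.: simulate benchmark data sets H_m^eps with equal class sizes and
    class centers eps*i on the first coordinate axis (sigma = 1), record the median
    max-F value for every eps, and return it with the observed max-F of the data."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    sizes = [n // m + (1 if i < n % m else 0) for i in range(m)]
    e1 = np.zeros(p); e1[0] = 1.0                     # unit vector of the first coordinate
    medians = {}
    for eps in eps_grid:
        stats = [max_f(np.vstack([eps * i * e1 + rng.standard_normal((ni, p))
                                  for i, ni in enumerate(sizes)]), m)
                 for _ in range(n_sim)]
        medians[eps] = float(np.median(stats))
    return max_f(X, m), medians
```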
4. Final remarks
In this paper we have described various probabilistic and inferential tools for cluster
analysis. These methods provide a firm basis for deriving suitable clustering strategies and allow for a quantitative evaluation of classification results and clustering
methods, including error probabilities and risk functions. In particular, various test
statistics can safeguard against a too rash acceptance of a clustering structure and
help to validate calculated classifications.
On the other hand, the application of these methods is not at all easy and self-evident:
Problems arise, e.g., when selecting a suitable clustering model and an appropriate
family of densities f(x; ϑ), or when different types of cluster shapes occur simultaneously in the same data set. We have seen that the probability distribution
of many test statistics is hard to obtain. Moreover, our analysis is
always based on only one sample of the n objects, so that we cannot evaluate the
stability or variation of the resulting classification (as would be possible if repeated
samples were available for the same objects).
When comparing the risks and benefits of probability-based versus deterministic clustering approaches (which proceed, e.g., by intuitive clustering criteria or heuristic algorithms), we see that the same deficiencies exist, in some other and disguised form,
for the latter methods as well. It is recommended here to combine both approaches in
an exploratory way and thereby profit from both points of view. The CAC strategy
presented above is an example of such an analysis.
References
Anderson, J.J. (1985): Normal mixtures and the number of clusters problem. Computational Statistics Quarterly 2, 3-14.
P. Arabie, L. Hubert and G. De Soete (eds.) (1996): Clustering and Classification. World
Science Publishers, River Edge/NJ.
Baubkus, W. (1985): Minimizing the variance criterion in cluster analysis: Optimal configurations in the multidimensional normal case. Diploma thesis, Institute of Statistics,
Technical University of Aachen, Germany.
Berdai, A., and B. Garel (1994): Performances d’un test d’homogénéité contre une hypothèse de mélange gaussien. Revue de Statistique Appliquée 42 (1), 63-79.
Bernardo, J.M. (1994): Optimizing prediction with hierarchical models: Bayesian clustering. In: P.R. Freeman, A.F.M. Smith (Eds.): Aspects of uncertainty. Wiley, New York,
1994, 67-76.
Binder, D.A. (1978): Bayesian cluster analysis. Biometrika 65, 31-38.
Bock, H.H. (1968): Statistische Modelle für die einfache und doppelte Klassifikation von
normalverteilten Beobachtungen. Dissertation, Univ. Freiburg i. Brsg., Germany.
Bock, H.H. (1969): The equivalence of two extremal problems and its application to the
iterative classification of multivariate data. Report of the Conference ’Medizinische Statistik’, Forschungsinstitut Oberwolfach, February 1969, 10pp.
Bock, H.H. (1972): Statistische Modelle und Bayes’sche Verfahren zur Bestimmung einer
unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18 (1972) 120-132.
Bock, H.H. (1974): Automatische Klassifikation (Clusteranalyse). Vandenhoeck & Ruprecht,
Göttingen, 480 pp.
Bock, H.H. (1977): On tests concerning the existence of a classification. In: Proc. First
Symposium on Data Analysis and Informatics, Versailles, 1977, Vol. II. Institut de Recherche
d’Informatique et d’Automatique (IRIA), Le Chesnay, 1977, 449-464.
Bock, H.H. (1984): Statistical testing and evaluation methods in cluster analysis. In: J.K.
Ghosh & J. Roy (Eds.): Golden Jubilee Conference in Statistics: Applications and new
directions. Calcutta, December 1981. Indian Statistical Institute, Calcutta, 1984, 116-146.
Bock, H.H. (1985): On some significance tests in cluster analysis. J. of Classification 2,
77-108.
Bock, H.H. (1986): Loglinear models and entropy clustering methods for qualitative data.
In: W. Gaul, M. Schader (Eds.), Classification as a tool of research. North Holland, Amsterdam, 1986, 19-26.
Bock, H.H. (1987): On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: H. Bozdogan and A.K. Gupta (eds.): Multivariate statistical
modeling and data analysis. Reidel, Dordrecht, 1987, 17-34.
Bock, H.H. (Ed.) (1988): Classification and related methods of data analysis. Proc. First IFCS
Conference, Aachen, 1987. North Holland, Amsterdam.
Bock, H.H. (1989a): Probabilistic aspects in cluster analysis. In: O. Opitz (Ed.): Conceptual and
numerical analysis of data. Springer-Verlag, Heidelberg, 1989, 12-44.
Bock, H.H. (1989b): A probabilistic clustering model for graphs and similarity relations. Paper presented at the Fall Meeting 1989 of the Working Group ’Numerical Classification and Data Analysis’
of the Gesellschaft für Klassifikation, Essen, November 1989.
Bock, H.H. (1994): Information and entropy in cluster analysis. In: H. Bozdogan et al. (Eds.): Multivariate statistical modeling, Vol. II. Proc. 1st US/Japan Conference on the Frontiers of Statistical
Modeling: An Informational Approach. Univ. of Tennessee, Knoxville, 1992. Kluwer, Dordrecht,
1994, 115-147.
Bock, H.H. (1996a): Probability models and hypotheses testing in partitioning cluster analysis. In:
P. Arabie et al. (Eds.), 1996, 377-453.
Bock, H.H. (1996b): Probabilistic models in cluster analysis. Computational Statistics and Data
Analysis 23, 5-28.
Bock, H.H. (1996c): Probabilistic models in partitional cluster analysis. In: A. Ferligoj and A.
Kramberger (Eds.): Developments in data analysis. Metodološki zvezki, 12, Faculty of Social Sciences Press (Fakulteta za druzbene vede, FDV), Ljubljana, 1996, 3-25.
Bock, H.H. (1996d): Probabilistic models and statistical methods in partitional classification problems. Written version of a Tutorial Session organized by the Japanese Classification Society and the
Japan Market Association, Tokyo, April 2-3, 1996, 50-68.
Bock, H.H. (1997): Probability models for convex clusters. In: R. Klar and O. Opitz (Eds.): Classification and knowledge organization. Springer-Verlag, Heidelberg, 1997, 3-14.
Bock, H.H., and W. Polasek (Eds.) (1996): Data analysis and information systems: Statistical and
conceptual approaches. Springer-Verlag, Heidelberg, 1996.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., & Lindsay, B.G. (1994): The distribution of
the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of
the Institute of Mathematical Statistics 46, 373-388.
Bryant, P. (1988): On characterizing optimization-based clustering methods. J. of Classification 5,
81-84.
Bryant, P.G. (1991): Large-sample results for optimization-based clustering methods. J. of Classification 8, 31-44.
Bryant, P.G., and J.A. Williamson (1978): Asymptotic behaviour of classification maximum likelihood estimates. Biometrika 65, 273-281.
Céleux, G., & Diebolt, J. (1985): The SEM algorithm: A probabilistic teacher algorithm derived
from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, 73-82.
Cox, D.R. (1957): A note on grouping. J. Amer. Statist. Assoc. 52, 543-547.
Cressie, N. (1991): Statistics for spatial data. Wiley, New York.
Diday, E. (1973): Introduction à l’analyse factorielle typologique. Rapport de Recherche no. 27,
IRIA, Le Chesnay, France, 13 pp.
Diday, E., Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.) (1994): New approaches
in classification and data analysis. Studies in Classification, Data Analysis, and Knowledge Organization, vol. 6. Springer-Verlag, Heidelberg, 186-193.
Dubes, R., and Jain, A.K. (1979): Validity studies in clustering methodologies. Pattern Recognition
11, 235-254.
Dubes, R.C., and Zeng, G. (1987): A test for spatial homogeneity in cluster analysis. J. of Classification 4, 33-56.
Everitt, B. S. (1981): A Monte Carlo investigation of the likelihood ratio test for the number of
components in a mixture of normal distributions. Multivariate Behavioural Research 16, 171-180.
Fahrmeir, L., Hamerle, A. and G. Tutz (Eds.) (1996): Multivariate statistische Verfahren. Walter
de Gruyter, Berlin - New York.
Fahrmeir, L., Kaufmann, H.L., and H. Pape (1980): Eine konstruktive Eigenschaft optimaler Partitionen bei stochastischen Klassifikationsproblemen. Methods of Operations Research 37, 337-347.
Flury, B.D. (1993): Estimation of principal points. Applied Statistics 42, 139-151.
W. Gaul & D. Pfeifer (Eds.) (1996): From data to knowledge. Theoretical and practical aspects of
classification, data analysis and knowledge organization. Springer-Verlag, Heidelberg.
Ghosh, J. K., & Sen, P. K. (1985): On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In: L.M. LeCam, R.A. Ohlsen (Eds.): Proc. Berkeley
Conference in honor of Jerzy Neyman and Jack Kiefer. Vol.II, Wadsworth, Monterey, 1985, 789-806.
Godehardt, E. (1990): Graphs as structural models. The application of graphs and multigraphs in
cluster analysis. Friedrich Vieweg & Sohn, Braunschweig, 240pp.
Godehardt, E., and Horsch, A. (1996): Graph-theoretic models for testing the homogeneity of data.
In: W. Gaul & D. Pfeifer (Eds.), 1996, 167-176.
Goffinet, B., Loisel, P., and B. Laurent (1992): Testing in normal mixture models when the proportions are known. Biometrika 79, 842-846.
Gordon, A.D. (1994): Identifying genuine clusters in a classification. Computational Statistics and
Data Analysis 18, 561-581.
Gordon, A.D. (1996): Null models in cluster validation. In: W. Gaul and D. Pfeifer (Eds.), 1996,
32-44.
Gordon, A.D. (1997a): Cluster validation. This volume, 22-39.
Gordon, A.D. (1997b): How many clusters? An investigation of five procedures for detecting nested
cluster structure. This volume, 109-116.
Hardy, A. (1997): A split and merge algorithm for cluster analysis. Lecture at IFCS-96, Kobe.
Hartigan, J.A. (1978): Asymptotic distributions for clustering criteria. Ann. Statist. 6, 117-131.
Hartigan, J.A. (1985): Statistical theory in clustering. J. of Classification 2, 63-76.
Hayashi, Ch. (1974): Method of quantification (Suryoka no hoho.) Chap. 15: Quantification by
probability - its problems. Keizai Shinposha, Tokyo.
Hayashi, Ch. (1993): Treatise on behaviormetrics (Kodo-keiryo gaku josetsu). Chap. 13.4. AsakuraShoten, Tokyo.
Jain, A.K., and Dubes, R.C. (1988): Algorithms for clustering data. Prentice Hall, Englewood
Cliffs, NJ.
Jank, W. (1996): A study on the variance criterion in cluster analysis: Optimum and stationary
partitions of Rp and the distribution of related clustering criteria. (In German). Diploma thesis,
Institute of Statistics, Technical University of Aachen, Aachen, 204pp.
Jank, W., and Bock, H.H. (1996): Optimal partitions of R2 and the distribution of the variance and
max-F criterion. Paper presented at the 20th Annual Conference of the Gesellschaft für Klassifikation, Freiburg, Germany, March 1996.
Kipper, S., and Pärna, K. (1992): Optimal k−centres for a two-dimensional normal distribution.
Acta et Commentationes Universitatis Tartuensis, Tartu Ülikooli TOIMEISED 942, 21-27.
Lapointe, F.-J. (1997): To validate and how to validate? That is the real question. This volume,
71-88.
Ling, R.F. (1973): A probability theory of cluster analysis. J. Amer. Statist. Assoc. 68, 159-164.
McLachlan, G.J., and K.E. Basford (1988): Mixture models. Inference and applications to clustering. Marcel Dekker, New York - Basel.
Mendell, N.P., Thode, H.C., & Finch, S.J. (1991): The likelihood ratio test for the two-component
normal mixture problem: power and sample-size analysis. Biometrics 47, 1143-1148. Correction:
48 (1992) 661.
Mendell, N.P., Finch, S.J., and Thode, H.C. (1993): Where is the likelihood ratio test powerful for
detecting two-component normal mixtures? Biometrics 49, 907-915.
Milligan, G. W. (1981): A review of Monte Carlo tests of cluster analysis. Multivariate Behavioural
Research 16, 379-401.
Milligan, G.W. (1996): Clustering validation: Results and implications for applied analyses. In: P.
Arabie et al. (Eds.), 1996, 341-375.
Milligan, G. W., and M.C. Cooper (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159-179.
Pärna, K. (1986): Strong consistency of k-means clustering criterion in separable metric spaces.
Tartu Riikliku Ülikooli, TOIMEISED 733, 86-96.
Pollard, D. (1982): A central limit theorem for k-means clustering. Ann. Probab. 10, 919-926.
Rasson, J.-P. (1997): Convexity methods in classification. This volume, 99-106.
Rasson, J.-P., Hardy, A., and Weverbergh, D. (1988): Point process, classification and data analysis.
In: H.H. Bock (Ed.), 1988, 245-256.
Rasson, J.-P., and Kubushishi, T. (1994): The gap test: an optimal method for determining the
number of natural classes in cluster analysis. In: E. Diday et al. (eds.), 1994, 186-193.
Ripley, B.D. (1981): Spatial statistics. Wiley, New York.
Sawitzki, G. (1996): The excess-mass approach and the analysis of multi-modality. In: W. Gaul
and D. Pfeifer (Eds.), 1996, 203-211.
Silverman, B.W. (1981): Using kernel density estimates to investigate multimodality. J. Royal
Statist. Soc. B 43, 97-99.
Snijders, T.A.B. and K. Nowicki (1996): Estimation and prediction for stochastic blockmodels for
graphs with latent block structure. J. of Classification 13, 75-100.
Symons, M.J. (1981): Clustering criteria and multivariate normal mixtures. Biometrics 37, 35-43.
Tarpey, Th., Li, L., Flury, B.D. (1995): Principal points and self-consistent points of elliptical
distributions. Annals of Statistics 23, 103-112.
Thode, H.C., Finch, S.J., & Mendell, N.R. (1988): Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals. Biometrics 44, 1195-1201.
Titterington, D.M. (1990): Some recent research in the analysis of mixture distributions. Statistics
21, 619-641.
Titterington, D.M., A.F.M. Smith and U.E. Makov (1985): Statistical analysis of finite mixture
distributions. Wiley, New York.
Van Cutsem, B., and Ycart, B. (1996a): Probability distributions on indexed dendrograms and
related problems of classifiability. In H.H. Bock and W. Polasek (Eds.), 1996, 73-87.
Van Cutsem, B., and Ycart, B. (1996b): Combinatorial structures and structures for classification.
Computational Statistics and Data Analysis 23, 169-188.
Van Cutsem, B., and Ycart, B. (1997): Random dendrograms for classifiability testing. This volume,
133-144.