Model–based clustering with the assignment problem

María Teresa Gallegos and Gunter Ritter∗
Faculty of Mathematics and Computer Science
University of Passau
D-94030 Passau, Germany
Nov 5, 2006
Abstract: We establish a statistical model for clustering in the presence of gross outliers
and derive its ML and MAP clustering estimators. Their computation leads to constrained
trimming algorithms of k–means type. One of them includes a λ–assignment problem known
from combinatorial optimization. We also estimate the numbers of clusters and outliers with
the algorithms obtained and illustrate them with some examples.
Key Words: model–based clustering, constrained classification, combinatorial optimization,
assignment problem, outliers, model selection
1 Introduction
Clustering means grouping data sets into homogeneous, separated subsets. If cluster shapes
are spherical or elliptical and if the data come from a mixture of normal distributions then
much progress towards an automatic treatment has been made in the past 40 years. In the
1960’s, the following cluster criteria, optimal for uncontaminated, normal data and applicable
in the case of a known number of clusters, had been established:
• The pooled trace criterion, Ward [49]. It is the ML criterion for equal cluster sizes and
spherical normal populations of equal variances, Scott and Symons [39].
• The pooled determinant criterion, Friedman and Rubin [13]. It is the ML criterion for equal cluster sizes and general normal populations of equal shapes, Scott and Symons loc. cit.
• The heteroscedastic determinant criterion. This criterion is optimal in the case of equal cluster sizes and general normal populations, Scott and Symons loc. cit.
By that time it was well understood how criteria could be established based on other normal
submodels, e.g. with independent features, and on different cross–cluster constraints. In 1981,
Symons [42] noticed that subtracting the entropy of the cluster proportions from the log–
likelihood could in some sense optimally handle unknown and unequal cluster sizes.
∗ E-mail: [email protected]
Steinhaus [41] and independently Jancey [23] proposed an efficient algorithm for minimizing
the trace criterion. It is now called the “k–means” algorithm, a name coined by MacQueen [28]
who proposed and analyzed a similar algorithm in a somewhat different situation. The method
consists essentially of an alternating application of ML parameter estimation and ML discriminant analysis. Except for this method, it took some time until dedicated algorithms suitable for
efficiently reducing the other criteria mentioned above became known. Indeed, Schroeder [38]
realized that all criteria could be optimized by similar, alternating algorithms. In the case
of general normal populations, e.g., it is sufficient to replace the Euclidean distance used in
k–means with the Mahalanobis distance.
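To make this alternating scheme concrete, here is a minimal Python sketch of one such step for the pooled-covariance normal case: ML parameter estimation followed by reassignment via the Mahalanobis distance. The function name, data layout, and fixed number of clusters g are our own illustration, not taken from the works cited.

```python
import numpy as np

def kmeans_mahalanobis_step(x, labels, g):
    """One alternating step of a k-means-type algorithm (sketch):
    ML parameter estimation for each cluster, then reassignment of every
    point by minimal Mahalanobis distance w.r.t. the pooled covariance."""
    n, d = x.shape
    means = np.stack([x[labels == j].mean(axis=0) for j in range(g)])
    # pooled within-cluster scatter = ML estimate of the common covariance
    W = sum(np.cov(x[labels == j].T, bias=True) * (labels == j).sum()
            for j in range(g)) / n
    Winv = np.linalg.inv(W)
    # squared Mahalanobis distance of each point to each cluster mean
    diffs = x[:, None, :] - means[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diffs, Winv, diffs)
    return d2.argmin(axis=1)
```

Iterating this step until the labeling no longer changes reduces the pooled criterion monotonically, exactly as the Euclidean k-means loop does for the trace criterion.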
The criteria and algorithms mentioned so far are not robust against outliers. Small outliers may be handled by applying suitable distributional families such as the multivariate
Student (or Pearson’s type VII) distribution. Gross outliers need a different approach. It is
well known that the usual breakdown points of the methods above even vanish. An effective
way of dealing with the related problem of robust parameter estimation is Rousseeuw’s [36]
minimum–covariance–determinant criterion, MCD. Rousseeuw and van Driessen [37] proposed an efficient algorithm for its approximate computation. This algorithm, too, is alternating. Pesch [33] established a statistical model for which MCD is an ML–estimator. His
model assumes an independent, albeit non-stationary sequence of observations.
Rocke and Woodruff [35], see also [51], proposed on heuristic grounds a trimmed extension
of Scott and Symons’s heteroscedastic determinant criterion which they called MINO. A
trimmed extension of the k–means algorithm was designed by García–Escudero, Gordaliza, and Matrán [17]. In [16], we proposed a homoscedastic statistical model for clustering with
outliers deriving from it a pooled cluster criterion and a related trimming algorithm. In that
paper we also computed some breakdown points, thus showing the robustness of criterion and
algorithm. We finally mention that a synoptic presentation of much of the state of the art at
the turn of the millennium appears in Fraley and Raftery [12].
In the present paper, we propose a heteroscedastic clustering model with arbitrary populations
and with outliers extending Scott and Symons’s heteroscedastic normal model and the statistical model with spurious outliers established in [16] to this case. From this model we derive
trimmed clustering criteria. In the normal ML case we retrieve Rocke and Woodruff’s MINO.
Optimizing the heteroscedastic criteria is not easy and we propose alternating algorithms for
this task, the single–point and multi–point algorithms. Interestingly, the multi–point algorithm contains a famous classical problem from combinatorial optimization, λ–assignment.
We also show how to use the algorithms for estimating the numbers of clusters and outliers.
We finally report on our experience with some data sets.
2 Statistical model and constrained clustering criteria
We consider a very general statistical clustering model with r regular observations and n − r
“spurious” outliers in a sample space E. Spuriousness [16] is meant as a model for observations
that are unpredictable in the sense that they obey no statistical law. We feel that the best
way of handling this idea in a statistical (!) framework is assuming that each outlier comes
from its own population with a nuisance parameter. The following is the main assumption on
the spurious outliers.
(SV_o) An outlier X_i : Ω → E, i ∈ 1..n, obeys a parametric model with parameter ψ_i ∈ Ψ_i such that the likelihood integrated w.r.t. some prior measure τ_i on Ψ_i satisfies

∫_{Ψ_i} f_{X_i}[x | ψ_i] τ_i(dψ_i) = 1,    (1)

i.e., does not depend on x. We will later consider the ψ_i's as nuisance parameters to be integrated out. There are two important and sufficiently general situations where (SV_o) holds.
(A) The sample space E = R^d is Euclidean, Ψ_i = E, the outliers obey a location model

X_i = U_i + ψ_i

with some (unknown) random noise U_i : (Ω, P) → E, and τ_i is Lebesgue measure on Ψ_i. Indeed, in this case, the conditional Lebesgue density is f_{X_i}[x | ψ_i] = f_{U_i}(x − ψ_i) and, hence,

∫_{Ψ_i} f_{X_i}[x | ψ_i] dψ_i = 1.
(B) The parameter set Ψ_i is a singleton and the distribution of X_i is taken as the reference measure for its density. This case includes the idea of outliers "uniformly distributed" on some domain.
Each of the r regular observations comes from one of g populations represented by a density f_{γ_j}, j ∈ 1..g, in some dominated parametric statistical model with parameter set Γ_j. This allows the consideration of discrete and continuous sample spaces of all kinds.
A labeling of the n objects is a mapping l : 1..n → 0..g: l_i = j ∈ 1..g stands for the assignment of object i to class j, and l_i = 0 specifies it as an outlier. A labeling is admissible if each cluster j ∈ 1..g contains at least b_j data points. These numbers may serve several purposes. They may first reflect prior information on the cluster sizes. Secondly, the number b_j has to be chosen in such a way as to secure ML–estimation of the population parameters. Assume the full normal model, N_{m_j,V_j}, and general position of the data, i.e., any d + 1 observations are affinely independent. Then the m.l.e.'s of m_j and V_j exist if and only if #C_j ≥ d + 1 for all j ∈ 1..g, and we need b_j ≥ d + 1. Similar statements can be made for normal submodels. If the V_j's are assumed to be diagonal and if x_{i_1,k} ≠ x_{i_2,k} for any two observations i_1 ≠ i_2 and all k ∈ 1..d, then it is sufficient to have at least two elements in each cluster. The same is true in the spherical normal model, where this minimum number even suffices if observations are only pairwise different. The same can be said about Lebesgue continuous elliptically contoured models. The conditions on the positions of the data are guaranteed with probability one if the data arise from Lebesgue continuous models. Let Λ_r denote the set of all admissible labelings l : 1..n → 0..g of the n objects such that n − r labels are 0.
Combining regular observations and outliers, we may use the Cartesian product

Λ_r × ∏_{j=1}^g Γ_j × ∏_{i=1}^n Ψ_i

as the parameter set of our complete model with g classes and n − r outliers. The number r of regular objects is considered fixed; as to its choice see Sect. 3.4.
The density of the ith observation for the parameter values l ∈ Λ_r, γ = (γ_1, …, γ_g), and ψ = (ψ_1, …, ψ_n) is

f_{X_i}[x | l, γ, ψ_i] = f_{γ_j}(x),          if l_i = j ∈ 1..g,
f_{X_i}[x | l, γ, ψ_i] = f_{X_i}[x | ψ_i],   if l_i = 0, cf. Eqn. (1).
We assume that the sequence of observations (X_i)_{i=1}^n is statistically independent but not necessarily i.i.d. unless there are no outliers, n = r. By the product formula, the likelihood for the data set x = (x_1, …, x_n) w.r.t. the product reference measure on E^n is

f_X(l, γ, ψ) = ∏_{j=1}^g ∏_{i: l_i=j} f_{γ_j}(x_i) · ∏_{i: l_i=0} f_{X_i}[x_i | ψ_i].

Considering the parameters ψ_i of the outliers as nuisances, the likelihood integrated w.r.t. the prior measures τ_i is, by Eqn. (1),

f_X(l, γ) = ∏_{j=1}^g ∏_{i: l_i=j} f_{γ_j}(x_i).    (2)
Maximizing first w.r.t. all γ_j's and taking the logarithm, we infer that the ML–estimate of l is given by the constrained trimmed ML–criterion

ML_l(x) = argmin_{l ∈ Λ_r} Σ_{j=1}^g Σ_{i: l_i=j} −ln f_{γ_j(l)}(x_i),    (3)

where γ_j(l) is the m.l.e. of the parameter γ_j ∈ Γ_j w.r.t. cluster j. This criterion is optimal if it is known a priori that the clusters are of (about) the same size. If f_γ is normal, the summands in (3) become essentially the log–determinants of the scatter matrices and we find Rocke and Woodruff's [35] MINO. In order to account for unequal sizes, Symons [42] noted that the entropy of the relative cluster sizes had to be added to the ML–criterion. Therefore, the constrained trimmed MAP–criterion is

MAP_l(x) = argmin_{l ∈ Λ_r} { rH((n_j(l)/r)_j) − Σ_{j=1}^g Σ_{i: l_i=j} ln f_{γ_j(l)}(x_i) }.    (4)

Here, n_j(l) is the size of cluster j w.r.t. l and H((n_j(l)/r)_j) = −Σ_{j=1}^g (n_j(l)/r) ln(n_j(l)/r) is the entropy of the cluster proportions. Adding this term to the criterion discourages equal cluster sizes, in which case the entropy is maximum.
The criterion includes a number of special cases. First, any state space E is allowed, be
it discrete or continuous. Secondly, various statistical models on E may be used. It is not
necessary that each population belong to the same model. Thirdly, one may assume that
there are spurious outliers in which case it is necessary to specify a lower bound on the
number of regular objects, r. One may also collect possible outliers in one cluster by choosing
a particular model for them, e.g. a distribution uniform on the convex hull of the data, cf.
[11]. In this case we choose r = n. The special case with g = 1 and the normal distribution is
Rousseeuw’s [36] MCD.
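As an illustration, the MAP criterion (4) can be evaluated for a given labeling by plugging in the clusterwise m.l.e.'s. The sketch below assumes a univariate normal model; the function name `map_criterion` and the use of SciPy are our own choices, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def map_criterion(x, labels, r):
    """Trimmed MAP criterion (4) for a univariate normal model.
    labels[i] in 0..g, with 0 marking the n - r outliers; per-cluster
    ML estimates are substituted for the population parameters."""
    regular = labels > 0
    assert regular.sum() == r
    crit = 0.0
    for j in np.unique(labels[regular]):
        xj = x[labels == j]
        nj = len(xj)
        m, s = xj.mean(), xj.std()        # ML estimates (ddof = 0)
        crit -= nj * np.log(nj / r)       # clusterwise share of r*H
        crit -= norm.logpdf(xj, loc=m, scale=s).sum()
    return crit
```

A labeling that splits well-separated groups correctly scores lower than one that mixes them, which is what the minimization exploits.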
In some rare cases, computing the ML–estimates of the parameters γ_j is simple. Popular examples are the normal and coin tossing cases, where ML–estimation reduces to simple summation. Even then, minimizing the criteria (3) and (4) over the combinatorial structure of all admissible assignments Λ_r is not a simple task. Only for the MCD estimator was it recently shown that optimization can be carried out in polynomial time, cf. [5], where an algorithm of complexity O(nd²) based on elliptical separability is designed. In general, one has recourse to general optimization schemes such as local descent methods combined with multistart or MCMC algorithms. A shortcoming of these methods is the need to update the parameters with each move, also the unsuccessful ones. More efficient algorithms detect whether a move will be successful before the parameters are updated. We deal here with a dedicated alternating algorithm of the k–means type. The general version in the non–contaminated case is due to Schroeder [38]. The following proposition extends it to the case with spurious outliers. A version for the m.l.e. in the pooled normal case appears in [16].
2.1 Proposition
Let l and l_new be two admissible labelings s.th.

Σ_{j=1}^g Σ_{i: l_new,i=j} ( ln n_j(l) + ln f_{γ_j(l)}(x_i) ) > Σ_{j=1}^g Σ_{i: l_i=j} ( ln n_j(l) + ln f_{γ_j(l)}(x_i) ).    (5)

(a) Then l_new strictly improves the MAP–criterion (4) w.r.t. l.
(b) The same is true for the ML–criterion (3) if the hypothesis is replaced with

Σ_{j=1}^g Σ_{i: l_new,i=j} ln f_{γ_j(l)}(x_i) > Σ_{j=1}^g Σ_{i: l_i=j} ln f_{γ_j(l)}(x_i).
Proof. We give the proof for Case (a); Case (b) is similar. Applying the hypothesis, the entropy inequality, and ML–estimation in this order, we have the following chain of estimates:

−rH((n_j(l)/r)_j) + Σ_j Σ_{i: l_i=j} ln f_{γ_j(l)}(x_i)
 = Σ_j Σ_{i: l_i=j} ( ln(n_j(l)/r) + ln f_{γ_j(l)}(x_i) )
 < Σ_j Σ_{i: l_new,i=j} ( ln(n_j(l)/r) + ln f_{γ_j(l)}(x_i) )
 = Σ_j n_j(l_new) ln(n_j(l)/r) + Σ_j Σ_{i: l_new,i=j} ln f_{γ_j(l)}(x_i)
 ≤ Σ_j n_j(l_new) ln(n_j(l_new)/r) + Σ_j Σ_{i: l_new,i=j} ln f_{γ_j(l)}(x_i)
 ≤ Σ_j n_j(l_new) ln(n_j(l_new)/r) + Σ_j Σ_{i: l_new,i=j} ln f_{γ_j(l_new)}(x_i)
 = −rH((n_j(l_new)/r)_j) + Σ_j Σ_{i: l_new,i=j} ln f_{γ_j(l_new)}(x_i).    □
The fact that both sides in the hypotheses of Proposition 2.1 contain the same population
parameters γj (l) substantially reduces the complexity of the optimization. There are several
ways to exploit the foregoing proposition. A first method moves one object from one cluster
to another or swaps an outlier and a regular object. Define the following admissible moves
starting from the admissible labeling l:
(c) i : l_i → j,   j, l_i ∈ 1..g, j ≠ l_i, if the size of cluster l_i is > b_{l_i};
(o) k : 0 → j, i : l_i → 0,   j, l_i ∈ 1..g, if the size of cluster l_i is > b_{l_i} or if j = l_i.
Denote the assignment resulting from move (c) or (o) by l_new. The difference between the left and right sides of (5) amounts to

u^(c)(i, j) = ln( n_j(l) / n_{l_i}(l) ) + ln f_{γ_j(l)}(x_i) − ln f_{γ_{l_i}(l)}(x_i),
u^(o)(i, k, j) = ln( n_j(l) / n_{l_i}(l) ) + ln f_{γ_j(l)}(x_k) − ln f_{γ_{l_i}(l)}(x_i),

respectively. If the difference is strictly positive for some admissible move, then the proposition asserts that the move improves the MAP–criterion (4). For the ML–criterion (3), the term involving the sizes is dropped. This leads to the following reduction step.
The single–point reduction step
// Input: An admissible labeling l;
// Output: an admissible labeling l new with smaller criterion or the response “stop.”
(1) Compute the parameters γj (l) w.r.t. l, using update formulae if possible;
(2) determine the (admissible) move (c) or (o) with maximal u^(c) or u^(o);
(3) if this maximum is > 0 then update l accordingly and return the updated labeling;
else respond “stop.”
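The single-point reduction step might be sketched in Python as follows. The conventions are our own illustration: labels in 0..g with 0 for outliers, a precomputed matrix `logf` of log-densities under the current m.l.e.'s, and the function name `single_point_step`.

```python
import numpy as np

def single_point_step(labels, logf, b):
    """One single-point reduction step for the MAP criterion (sketch).
    labels : current admissible labeling (0 = outlier)
    logf   : (n, g) matrix, logf[i, j-1] = ln f_{gamma_j(l)}(x_i)
             computed once from the parameters of the current labeling
    b      : minimum cluster sizes b_1..b_g
    Applies the admissible move (c) or (o) of maximal gain, if positive."""
    n, g = logf.shape
    sizes = np.array([(labels == j + 1).sum() for j in range(g)])
    outliers = np.flatnonzero(labels == 0)
    best_gain, best_move = 0.0, None
    for i in np.flatnonzero(labels > 0):
        li = labels[i]
        movable = sizes[li - 1] > b[li - 1]
        for j in range(1, g + 1):
            base = np.log(sizes[j - 1]) - np.log(sizes[li - 1]) - logf[i, li - 1]
            # move (c): reassign regular object i to cluster j
            if j != li and movable and base + logf[i, j - 1] > best_gain:
                best_gain, best_move = base + logf[i, j - 1], [(i, j)]
            # move (o): swap outlier k into cluster j, object i out
            if movable or j == li:
                for k in outliers:
                    if base + logf[k, j - 1] > best_gain:
                        best_gain, best_move = base + logf[k, j - 1], [(i, 0), (k, j)]
    if best_move is None:
        return labels, False          # respond "stop"
    new = labels.copy()
    for idx, lab in best_move:
        new[idx] = lab
    return new, True
```

The gains computed in the loops are exactly u^(c)(i, j) and u^(o)(i, k, j); only the labeling is updated, the parameters are re-estimated once per step.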
The proposition also suggests the following, often more efficient, multi–point reduction step for improving the MAP–criterion: Given a labeling l, compute the numbers n_j(l) and the parameters γ_j(l) and assign each observation x_i to the cluster l_new,i = j such that the quantity ln n_j(l) + ln f_{γ_j(l)}(x_i) (or f_{γ_j(l)}(x_i)) is maximum. If the sum of the largest r of these numbers exceeds the sum for the original labeling, then the proposition assures us that the new labeling l_new has a smaller MAP–criterion (4) (ML–criterion (3)) if it is admissible. This is the main scheme of an efficient estimation procedure. Its advantage over a single–point reduction step is the fact that it allows the whole data set to be explored without updating the parameters after each successful reassignment of a single point. Since (n_j(l)/r) f_{γ_j(l)}(x_i) is just the posterior density given the parameters of the current labeling used in discriminant analysis, the step is intuitively appealing: it assigns each observation to the class determined by the MAP– (or ML–)discriminant rule.
Unfortunately, in the present heteroscedastic case, the naive procedure described above does not guarantee admissibility of the new assignment, in particular if there is a small cluster or if the data set is small. This is contrary to the homoscedastic case, where clusters of one element do not prevent a clustering from being admissible. Although the naive procedure sometimes works, this fact may render it unstable. In order to ensure admissibility, we actually have to solve the following constrained optimization problem.
The multi–point optimization problem
For given numbers n_j and parameters γ_j, minimize Σ_i ( −ln n_{l_i} − ln f_{γ_{l_i}}(x_i) ) over all admissible labelings l, i.e., subject to the constraints

#{i ∈ 1..n | l_i = j} ≥ b_j,   j ∈ 1..g,
#{i ∈ 1..n | l_i = 0} = n − r.
The natural numbers bj are defined in such a way as to enable ML–estimation of the parameters of population j, see the discussion of admissibility in Sect. 2. The number r of regular
objects must not be smaller than their sum.
There is an exact and at the same time efficient solution to the multi–point optimization problem. It is suitable to first reformulate it in a form common in combinatorial optimization. A labeling l may be represented by a zero–one matrix y of size n × (g + 1) by putting y_{i,j} = 1 if and only if l_i = j, i.e., if l assigns object i to cluster j. A zero–one matrix y is admissible, i.e., corresponds to an admissible labeling, if it satisfies the constraints Σ_j y_{i,j} = 1 for each i (each object has exactly one label), Σ_i y_{i,0} = n − r (there are n − r outliers), and Σ_i y_{i,j} ≥ b_j for all j ≥ 1 (each cluster j contains at least b_j elements). Using this matrix and defining the "cost" coefficients¹

u_{i,j} = −ln n_j(l) − ln f_{γ_j(l)}(x_i),   i ∈ 1..n, j ∈ 1..g,

we may reformulate the multi–point optimization problem as a binary linear optimization problem, cf. [32], in the following way. Note that Σ_j b_j ≤ r ≤ n.
(BLO)   Σ_{i=1}^n Σ_{j=1}^g u_{i,j} y_{i,j} minimal over all y ∈ R^{n×(g+1)} subject to the constraints

Σ_j y_{i,j} = 1,   i ∈ 1..n,
Σ_i y_{i,0} = n − r,
Σ_i y_{i,j} ≥ b_j,   j ∈ 1..g,
y_{i,j} ∈ {0, 1}.
Although binary linear optimization is NP–hard, in general, the present problem has an efficient solution. First, it is easy to construct a solution that satisfies all but the third constraint.
Just assign object i to argmin_j u_{i,j}. If this solution happens to satisfy also the third constraint, then it is plainly optimal. In the opposite case, the deficient classes j in this solution contain exactly b_j objects in an optimal solution. Indeed, if an originally deficient class j were of size > b_j, then at least one of the forced elements would be free to go to its natural class, thus
reducing the target function.
We show in Section 3 that the present multi–point optimization problem may be transferred
to another one with an efficient solution.
¹ The negative posterior log–likelihood with the parameters replaced by their m.l.e.'s w.r.t. l.
3 Multi–point algorithm and λ–assignment
3.1 Transfer of multi–point optimization to a λ–assignment problem
A linear optimization problem of the form

(TP)   Σ_{i,j} u_{i,j} z_{i,j} minimal over z ∈ R^{n×m} subject to the constraints

Σ_j z_{i,j} = a_i,   i ∈ 1..n,
Σ_i z_{i,j} = b_j,   j ∈ 1..m,
z_{i,j} ≥ 0,

is called a transportation or Hitchcock problem, see [32]. Here, (u_{i,j}) is a real n by m matrix of weights or costs, and the a_i and b_j are real numbers ≥ 0 s.th. Σ a_i = Σ b_j. Plainly this condition is necessary and sufficient for a solution to exist.
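To make (TP) concrete, a toy instance can be solved directly as a linear program. The sketch below uses SciPy's `linprog`; the instance numbers are our own example, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Small Hitchcock/transportation instance: n = 2 supplies, m = 3 demands.
u = np.array([[4., 1., 3.],
              [2., 5., 6.]])          # costs u_ij
a = [5., 4.]                          # supplies a_i, sum = 9
b = [3., 4., 2.]                      # demands b_j, sum = 9
n, m = u.shape
# equality constraints: row sums = a_i, column sums = b_j
A_eq = []
for i in range(n):                    # sum_j z_ij = a_i
    row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1; A_eq.append(row)
for j in range(m):                    # sum_i z_ij = b_j
    col = np.zeros(n * m); col[j::m] = 1; A_eq.append(col)
res = linprog(u.ravel(), A_eq=np.array(A_eq), b_eq=a + b, bounds=(0, None))
z = res.x.reshape(n, m)               # optimal transportation plan
```

Because the constraint matrix of a transportation problem is totally unimodular, the LP optimum is attained at an integral vertex whenever the supplies and demands are integral; this is the "free lunch" exploited below for (λA).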
Our (BLO) problem is not yet a transportation problem for two reasons: first, the constraints contain also an inequality and, secondly, the transportation problem is not restricted to binary solutions. We first use a trick to overcome the first problem; fortunately, the second turns out to be a free lunch. Let us introduce an artificial (g + 1)th class. It is meant to accommodate the excess members w.r.t. the constraints ≥ b_j of the g natural classes. Thus, defining the additional cost coefficients

u_{i,g+1} = min_{j∈1..g} u_{i,j},

we find the special transportation problem

(λA)   Σ_{i=1}^n Σ_{j=1}^{g+1} u_{i,j} z_{i,j} minimal over z ∈ R^{n×(g+2)} subject to the constraints

Σ_j z_{i,j} = 1,   i ∈ 1..n,
Σ_i z_{i,0} = n − r,
Σ_i z_{i,j} = b_j,   j ∈ 1..g,
Σ_i z_{i,g+1} = r − Σ_j b_j,
z_{i,j} ≥ 0,   i ∈ 1..n, j ∈ 0..(g + 1).
Note that the "supplies" a_i of all n objects i are unitary. Such a transportation problem is often called a λ–assignment problem, λ ∈ N^{g+2} denoting the array of the "demands" n − r, b_j, and r − Σ_j b_j. Weights u_{i,0} of the outliers are not needed; they may be put to any constant value if necessary, e.g. 0. A matrix z that satisfies the constraints of (λA) will be called feasible. Contrary to an admissible solution, a feasible solution is defined by equalities. The assignment matrix z is illustrated in Fig. 1. Note that the hypothesis on the sums of the right sides of the constraints is satisfied, so that (λA) possesses an optimal solution z.
Now, the constraints of (λA) are integral. By the Integral Circulation Theorem [26], Theorem 12.1, or [8], Theorem 4.5, we may assume that z is integral. The first constraint implies that it is even binary, representing an assignment.
Figure 1: Table of assignments of objects to classes in the transportation problem associated with the linear optimization problem: the n × (g + 2) array (z_{i,j}) has row sums 1 and column sums n − r, b_1, …, b_g, r − Σ_j b_j.
With the optimal solution to (λA) we associate an optimal solution to (BLO). Define the matrix y ∈ R^{n×(g+1)} by moving, in each line i, the excess members collected in cluster g + 1 to the class which minimizes u_{i,j}, j ∈ 1..g. We claim that y optimizes (BLO). Indeed, let ỹ be admissible for (BLO). Define a feasible matrix z′ by moving excess members i from the classes to the artificial class g + 1. By definition of u_{i,g+1}, the value of ỹ w.r.t. (BLO) is at least that of z′ w.r.t. (λA) and, by optimality, the latter value is at least that of the optimal solution z w.r.t. (λA), which, again by definition of u_{i,g+1}, equals that of y w.r.t. (BLO). Hence, ỹ is at best as good as y, and y is an optimal solution to the original problem (BLO).
Optimal solutions to (λA) and (BLO) are actually equivalent. Given an optimal solution to (BLO), move any excess elements assigned to classes 1..g to class g + 1. Note that any class that contains excess elements contains no forced elements, since they could choose a cheaper class. Therefore, the new assignment creates the same total cost.
Thus, the following multi–point reduction step reduces the criterion. It uses (λA), and the right sides b_j appearing there have to be chosen in such a way as to guarantee existence of the m.l.e.'s. (Plainly, the parameter r must satisfy Σ_j b_j ≤ r.)
The multi–point reduction step (λ–assignment version)
// Input: An admissible labeling, its parameters γ_j(l), and its criterion;
// Output: an admissible labeling l_new with its parameters and (smaller) criterion, or the response "stop."
(1) solve the λ–assignment problem (λA) with weights
u_{i,j} = −ln n_j(l) − ln f_{γ_j(l)}(x_i), j ∈ 1..g,   u_{i,g+1} = min_{j∈1..g} u_{i,j},   i ∈ 1..n;
(2) move each object in cluster g + 1 to the cluster where it causes minimal cost;
(3) compute parameters and criterion of the solution to the clustering problem thus obtained;
(4) if the criterion has improved, then return it together with this solution and its parameters;
else "stop."
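Since each object has unit supply, (λA) can also be reduced to a standard assignment problem by replicating each class column according to its demand. The following Python sketch (function name and layout our own) does this with SciPy's assignment solver and then performs step (2):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lambda_assignment_step(u, b, n_out):
    """Solve (lambda-A) by column replication: each class becomes as many
    columns as its demand, so the unit-supply transportation problem turns
    into an ordinary square assignment problem.
    u : (n, g) weights u[i, j] for object i and cluster j + 1
    b : demands b_1..b_g; n_out : number of outliers, n - r.
    Returns labels in 0..g; objects landing in the artificial class g + 1
    are moved to their cheapest cluster, as in step (2) above."""
    n, g = u.shape
    r = n - n_out
    surplus = r - sum(b)
    cols, owner = [], []
    for _ in range(n_out):                  # outlier class: constant weight
        cols.append(np.zeros(n)); owner.append(0)
    for j in range(g):                      # b_j copies of cluster j's column
        for _ in range(b[j]):
            cols.append(u[:, j]); owner.append(j + 1)
    for _ in range(surplus):                # artificial class g + 1
        cols.append(u.min(axis=1)); owner.append(-1)
    rows, chosen = linear_sum_assignment(np.column_stack(cols))
    labels = np.empty(n, dtype=int)
    for i, c in zip(rows, chosen):
        labels[i] = u[i].argmin() + 1 if owner[c] == -1 else owner[c]
    return labels
```

Replicating a column b_j times forces exactly b_j objects into class j, so the equality demands of (λA) are met by construction; the artificial columns carry the cost min_j u_{i,j}, exactly as prescribed for u_{i,g+1}.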
A reduction step is the combination of parameter estimation and discriminant analysis. In this
sense, the Hitchcock version of the multipoint reduction step may be viewed as a combination
of parameter estimation and constrained discriminant analysis. The transportation problem
in connection with constrained discriminant analysis appears already in Tso and Graham [47].
These authors use the constraints in automatic chromosome classification to ensure the correct
number of chromosomes of a class in a biological cell. We learnt about their work in Tso et
al. [48], where Tso and Graham’s method is extended to karyotyping. Another application
of λ–assignment to clustering appears in Aurenhammer et al. [4], where a constrained least
squares assignment problem akin to k–means with fixed cluster centers and sizes is considered.
3.2 Algorithms for the λ–assignment problem
The multi–point reduction step requires the solution to a λ–assignment problem. Since reduction steps are executed many times during a program run, a few words are in order about
known algorithms for its solution and about their complexities. Both have been extensively
studied in operations research and combinatorial optimization. As a linear optimization problem, λ–assignment may be solved by the simplex method. In fact, it is a special transportation
problem which, in turn, is a special case of, and surprisingly [32] equivalent to, a minimum–
cost flow problem. An instance of the latter problem is specified by a directed graph, net flows
in its nodes, and arc capacities and costs. The aim is to determine a flow with minimum overall cost satisfying the required net flows in all nodes without violating the given capacities.
Classical adaptations of the simplex method tailored to the particularities of min–cost flow are the network simplex method [40] and primal–dual algorithms, cf. Cook et al. [8], 4.2 and 4.3, such as the out–of–kilter method [14, 26]. The network simplex method seems to be most
4.3, such as the out–of–kilter method [14, 26]. The network simplex method seems to be most
popular in applications although these algorithms are exponential in the worst case.
Polynomial algorithms for minimum–cost flow have also existed for some time. It is common to represent the cost matrix as the adjacency matrix of a weighted graph (V, E) with node set V of size n and edge set E of size m. A polynomial example is Orlin's [31] O(n(m + n log n) log n) algorithm for the uncapacitated min–cost flow problem. Other algorithms use the concept of scaling introduced by Edmonds and Karp [9], who used it to produce the first polynomial algorithm for the general min–cost flow problem. Scaling solves a series of subproblems with approximated instance parameters, capacities or costs or both. Edmonds and Karp showed with their cost–scaling algorithm that min–cost flow was a low complexity problem. However, opinions about the usefulness of scaling in programs are mixed, see the comments in [19], Sec. 9 and [15], Sec. 4. In the context of λ–assignment the capacities may be defined as 1, and capacity scaling becomes trivial, although the algorithms may of course be applied. An example is Gabow and Tarjan's [15] O(n(m + n log n) log U) algorithm, log U being the bit length used to represent network capacities. In the case of λ–assignment, log U = 1, so that the complexities of Orlin's and Gabow and Tarjan's algorithms are equal in this case.
Now, the λ–assignment problem has some special features that reduce its complexity compared with the general min–cost flow problem. First, its graph is bipartite, its node set being partitioned into two subsets, the objects i and the clusters j, such that each edge connects an object with a cluster. For bipartite network flow algorithms it is often possible to sharpen the complexity bounds by using the sizes of the smaller (k) and the larger (n) node subsets as parameters, Gusfield, Martel, and Fernández-Baca [21]. Second, the bipartite network is unbalanced in the sense that the number of clusters is (at least in our case) much smaller than the number of objects. Bounds based on the two subsets are particularly effective for unbalanced networks. Third, all row sums of the assignment matrix are equal to 1, which implies that the capacities may be chosen unitary. As a consequence, the algorithms for the λ–assignment problem mentioned at the beginning become polynomial. E.g., the time complexity of the out–of–kilter method becomes O(nm) = O(kn²) since m = kn here, see [26], p. 157.
Algorithms dedicated to the λ–assignment problem are due to Kleinschmidt et al. [24] and Achatz et al. [1]. These authors propose algorithms with asymptotic run time O(kn²). Both draw from Balinski's work on the Hirsch conjecture about convex polytopes and relate their work to an algorithm of Hung and Rom [22].
The algorithms mentioned so far remain at least quadratic in the size of the larger node set, n. By contrast, two cost–scaling min–cost flow algorithms are linear in n: Goldberg and Tarjan's [19], Theorem 6.5, algorithm solves the min–cost flow problem on a bipartite network asymptotically in O(k²n log(kC)) time, see Ahuja et al. [2], and the "two–edge push" algorithm of the latter authors needs O((km + k³) log(kC)) time. In our application, m = kn. Both estimates contain the bit length, log C, for representing costs.
A different low–complexity approach is due to Tokuyama and Nakano [44, 46, 45]. These authors state and prove a geometric characterization of optimal solutions to the λ–assignment and transportation problems by a so–called splitter, a k–vector that partitions the Euclidean k–space into k closed cones. The corresponding subdivision of the lines of the cost matrix describes an optimal assignment. Tokuyama and Nakano design a deterministic and a randomized strategy for splitter finding that run in O(k²n ln n) time and O(kn + k^{5/2} √n ln^{3/2} n) expected time, respectively. Their algorithms are almost linear in n and close to the absolute lower bound O(kn) if k is small, the case of interest for (λA).
The disadvantage of the geometric method is that the whole procedure is activated even if the
unconstrained assignment automatically produces a feasible (and hence optimal) solution.
3.3 Heuristic feasible solutions
Besides exact solutions, there are reasons to say a word about heuristic feasible solutions to the λ–assignment problem. First, they may be used for "multipoint reduction steps" on their own. Moreover, some of the graph methods presented in Sect. 3.2 need an initial feasible solution for starting or profit from a good one. The network simplex method, e.g., needs a primal feasible solution, and the method presented in [1] needs a dual feasible solution. While arbitrary feasible solutions are easily produced, good initial feasible solutions can be constructed by means of greedy heuristics. We propose here two. If the bounds b_j are small enough, the heuristics often produce even optimal solutions.
Each reduction step receives a set of parameters γ_j from which all weights u_{i,j}, i ∈ 1..n, j ∈ 1..g, are computed. The first two heuristics construct primal feasible solutions. Both start from the best unconstrained assignment of the clustering problem, which can be easily attained by sorting the numbers u_i = min_j u_{i,j}. More precisely:
Basic primal heuristic
(1) sort the numbers u_i = min_j u_{i,j} upward, i ∈ 1..n;
(2) attach label 0 to the last n − r objects in the ordered list;
(3) assign each of the remaining r objects to the class in 1..g where the minimum is attained;
(4) if this assignment is admissible, then stop (it is optimal);
else
(α) starting from element n − r in the ordered list and going downwards, assign surplus elements to arbitrary deficient classes until they contain exactly b_j elements;
(β) assign the largest remaining surplus elements in the classes to class g + 1.
If only an admissible instead of a feasible solution is required, then we drop step (4)(β). We next refine the "else" part of step (4). Steps (1), (2), and (3) are as before.
Refined primal heuristic
(1) sort the numbers u_i = min_j u_{i,j} upward, i ∈ 1..n;
(2) denote the set of the last n − r objects in the ordered list by L;
(3) assign each of the remaining r objects to the class in 1..g where the minimum is attained;
(4) if this assignment is admissible then stop (it is optimal);
else
(α) let D be the set of deficient classes in 1..g and let δ be their total deficiency;
(β) starting from element n − r in the ordered list and going downwards, move the first δ elements in surplus classes to L;
(γ) sort the object–class pairs (i, j) ∈ L × D upward according to their weights u_{i,j} to obtain an array (i_1, j_1), (i_2, j_2), …, (i_{#L·#D}, j_{#L·#D});
(δ) scan all pairs (i_k, j_k) in this list starting from k = 1, assigning object i_k to class j_k unless i_k already appears in some class or j_k is saturated, until all classes are saturated;
(ε) assign the yet unassigned elements in L to the outlier class;
(ζ) assign the largest remaining surplus elements in the classes to class g + 1.
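The refined heuristic admits a similar sketch under the same assumptions (again with our own naming; step (ζ) is dropped, so the result is admissible only):

```python
import numpy as np

def refined_primal_heuristic(u, b, r):
    """Sketch of the refined primal heuristic.  Assumptions as before:
    u is the n x g weight matrix, b[j] the minimal size of class j,
    r the number of regular objects, label -1 the outlier class."""
    n, g = u.shape
    b = np.asarray(b)
    order = np.argsort(u.min(axis=1))          # (1)
    L = list(order[r:])                        # (2) last n - r objects
    labels = np.full(n, -1)
    for i in order[:r]:                        # (3) best unconstrained class
        labels[i] = int(u[i].argmin())
    sizes = np.bincount(labels[labels >= 0], minlength=g)
    if np.all(sizes >= b):                     # (4) admissible -> stop
        return labels
    D = np.flatnonzero(sizes < b)              # (alpha) deficient classes ...
    delta = int((b - sizes)[D].sum())          # ... and their total deficiency
    moved = 0
    for i in order[:r][::-1]:                  # (beta) move delta surplus
        if moved == delta:                     #        elements to L
            break
        j = labels[i]
        if sizes[j] > b[j]:
            sizes[j] -= 1
            labels[i] = -1
            L.append(i)
            moved += 1
    pairs = sorted((u[i, j], i, j) for i in L for j in D)   # (gamma)
    for _, i, j in pairs:                      # (delta) greedy assignment
        if labels[i] == -1 and sizes[j] < b[j]:
            labels[i] = j
            sizes[j] += 1
        if np.all(sizes[D] >= b[D]):
            break
    return labels                  # (epsilon) the rest of L stays outlying
```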
In Sect. 3.1, we exploited the fact that any cluster j with more than b_j members in an optimal solution contained no forced elements. Plainly, both heuristics share this property, since such members would be free to change classes.
Both heuristics are much faster than any of the solution algorithms while often yielding a low value of the criterion, the refined heuristic a lower one than the basic. However, contrary to λ–assignment, the reductions in the sense of Proposition 2.1 they provide are not optimal, in general. In most cases the criterion decreases, although one may construct examples where this fails: Consider the data set {−40, −8, −6, 0, 1, 2, 3, 40}, let g = 4 and b_1 = b_2 = b_3 = b_4 = 2, and assume that there are no outliers. Suppose that the parameters are generated from the initial partition C_1 = {−40, 3}, C_2 = {−8, 1}, C_3 = {−6, 0}, C_4 = {2, 40}. They are m_1 = −18.5, m_2 = −3.5, m_3 = −3.0, m_4 = 21.0, and v_1 = 462.25, v_2 = 20.25, v_3 = 9.0,
v_4 = 361.0. The matrix of weights (rows i = 1..8 in the order of the data, columns j = 1..4)

    (u_{i,j}) =    9.91   71.57  157.08   18.97
                   9.15    6.78    7.75   10.99
                   9.25    6.09    5.97   10.68
                   9.65    6.39    5.97    9.88
                   9.73    6.78    6.75    9.77
                   9.82    7.27    7.75    9.66
                   9.91    7.87    8.97    9.56
                  16.31   99.23  210.41    9.66
generates the free partition {−40}, {−8, 2, 3}, {−6, 0, 1}, {40} and the second heuristic
modifies it to {−40, 1}, {−8, 2}, {−6, 0}, {3, 40}. The score of the latter is 39.73 whereas the
initial partition has the lower score 39.67.
The problem dual to (λA) reads

(λA*)   Σ_{i=1}^n p_i + Σ_{j=0}^{g+1} b_j q_j   maximal over (p, q) ∈ R^{n+g+2}
        subject to the constraints p_i + q_j ≤ u_{i,j}, i ∈ 1..n, j ∈ 0..(g+1).
Let us find the best dual feasible solution with p_i = α constant. Since the target function may here be written Σ_j b_j (α + q_j), and since the constraints read α + q_j ≤ u_{i,j}, we may as well assume α = 0, obtaining a simple dual feasible solution.
Dual heuristic   p_i = 0 and q_j = min_i u_{i,j}.
3.4  Overall algorithms
Single– and multi–point reduction steps both receive a labeling and improve it according to
Proposition 2.1. Their iteration thus gradually decreases the criterion. Since there are only
finitely many labelings the iteration must stall after a finite number of steps with the “stop”
signal. We have reached one approximation to the requested partition. It is self–consistent
in the sense that it reproduces its parental parameters. While a solution that optimizes the
criterion shares this property, the solution obtained from the iteration is not optimal w.r.t. the
criterion, in general. In fact, clustering is known to be NP–hard, see Garey and Johnson [18].
Therefore, we have to employ the multistart method in order to achieve a value of the criterion
as low as possible in the time available. Of course, the number of repetitions that lead to a
solution near the optimum depends heavily on the size of the data set, on the parameters g
and r, and on the statistical model chosen.
Although the heuristic reduction step does not always improve the criterion it may be used
in a similar way. However, it does not necessarily terminate with a self–consistent solution.
Iteration of single– or (heuristic or exact) multi–point reduction steps thus gives rise to three
optimization methods for fixed numbers of clusters and outliers. The multi–point is superior
to the single–point algorithm in speed. In difficult situations, it is useful to refine the result
of a multi–point search by single–point steps. The advantage of the constrained multi–point
reduction step presented here is its ability to produce a reasonable solution even if a cluster
tends towards becoming too small by the end of an iteration.
Each series of reduction steps needs a first solution to begin with. We generate an unbiased partition of the whole data set into two parts, one of size r, the “regular” elements, and its complement. There exists a simple but sophisticated algorithm that accomplishes just this in one sweep through the data, see Knuth [25], p. 136 ff. Moreover, we assign the objects of the part of size r to g clusters in an unbiased way. The clusters must be large enough to allow estimation of their parameters. In general, this requirement does not pose a problem unless g is large. In this way, the clusters will be of about equal size with very high probability. It follows that the entropy of the mixing rates is large at the beginning, making it easy to fulfill the condition of Proposition 2.1(a).
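The one-sweep split into r regular elements and their complement can be sketched with Knuth's selection-sampling technique; the function below combines it with the unbiased assignment to g clusters. The function name and the seed parameter are ours.

```python
import random

def random_first_solution(n, r, g, seed=None):
    """Unbiased first solution.  One sweep selects exactly r "regular"
    objects by selection sampling (Knuth, TAOCP Vol. 2), each r-subset
    being equally likely; regular objects receive a uniform random
    label in 1..g, all others the outlier label 0."""
    rng = random.Random(seed)
    labels = [0] * n
    needed = r
    for t in range(n):
        # select object t with probability needed / (n - t)
        if (n - t) * rng.random() < needed:
            labels[t] = rng.randint(1, g)
            needed -= 1
            if needed == 0:
                break
    return labels
```

Because the selection probability rises to one as the remaining quota approaches the remaining number of objects, the sweep always returns exactly r regular objects.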
4  Numbers of clusters and outliers
An important part of a clustering process is the estimation of the numbers of clusters and outliers. It is generally accepted that there is no rigorous definition of the notion of an “outlier,” just as there is none of a “cluster.” What is more, the notions of outlier and cluster overlap. In the present heteroscedastic model, outliers either originate from a rare class, so that a number of elements sufficient for ML parameter estimation has not been observed, or they constitute a set whose structure, e.g. shape, is not in harmony with the posited populations. In the first case, the outliers may find some extreme cluster members that complement them to a sufficiently large cluster. Sometimes a sufficient number of “outliers” may give rise to a cluster, and it may sometimes even be reasonable to assume the absence of outliers. In this way, outliers may be traded for clusters. The model also often accommodates clusters whose structures do not conform with its populations. Therefore, in both cases, the numbers of clusters and outliers are not uniquely determined and are subject to interpretation.
If the two numbers are roughly known, then postulating numbers of clusters and outliers that exceed the actual ones to some extent may be tolerated. As the effect of a larger number of assumed outliers, each cluster will just lose a number of extreme members, and the parameters will not be much affected. It is therefore justified to use a lower bound on the number of regular objects as the number r; see also Rousseeuw and Van Driessen [37], Section 5. If one assumes one or two more clusters than there actually are, then clusters of minimal size b_j are split off by the MAP method, but the estimated parameters of the other clusters remain essentially unchanged. In this case, it is sufficient to apply the estimator with one suitable pair (g, r). It must, however, be kept in mind that it is harmful to choose g too small or r too large. The parameter estimates may break down through the influence of a single remaining gross outlier or one missing class.
In practical applications, both the number of clusters and the number of outliers are often not even roughly known and have to be estimated. A large body of literature has been, and still is, devoted to these subjects, which remain challenging. So far we have designed a tool that allows us to establish optimal clusterings for all pairs (g, r) of numbers of clusters and outliers. This is a substantial reduction of the complexity of the problem but not yet its solution. We need a strategy to select the most promising pair (g, r), or maybe a few of them, which should then be analyzed more closely.
Concerning the number of clusters, there are essentially three approaches, cf. [30, 20]: cluster validation, the so-called elbow criterion, and model selection criteria. Cluster validation may be divided into two branches: tests and validity measures. The classical test, due to Wolfe [50],
is a likelihood ratio test for the hypothesis of k clusters against (k − 1) clusters. Bock [7]
discusses some significance tests for distinguishing between the hypothesis of a homogeneous
population vs. the alternative of heterogeneity. Validity measures are functionals of partitions
and usually measure the quality of cluster separation and of cluster compactness; see, e.g.,
Bezdek et al. [6]. Often, the total within–cluster sum of squared distances about the centroids
is used as a measure of compactness and the total between–cluster sum of squared distances
for separation; cf. Milligan and Cooper [30] and the abridged presentation of their paper by
Gordon [20]. The elbow criterion identifies the number of clusters as the location where the
decrease of some cluster criterion flattens markedly. For a recent refinement of this method
we refer the reader to Tibshirani et al. [43].
Maximum likelihood estimation prefers a large number of clusters. A model selection criterion hampers this tendency by adding a penalty term to the minimum of the negative log–likelihood or of the negative posterior log–density. A popular model selection criterion is Schwarz’ Bayesian Information Criterion, BIC. Although originally proposed for exponential families, there is some theoretical and practical evidence that supports it as a means for estimating the number of clusters of a clustering model, too; see the discussion in McLachlan and Peel [29], Ch. 6. According to Leroux [27], BIC does not underestimate the true number of components, asymptotically. However, most model selection criteria tend towards overestimating this number, accepting solutions with small spurious clusters. In the present case, BIC is obtained from the minimum in the ML (3) or the MAP criterion (4) by adding the penalty term 2p · ln r, p being the total dimension of the parametric model. In the MAP case, p is g − 1 (for the mixture rates) plus the sum of the dimensions of the g cluster populations.
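Assuming full-covariance d-variate normal populations (a constrained submodel would contribute fewer parameters per population), the penalty term can be computed as follows; the function name is ours.

```python
import math

def bic_penalty(g, d, r, map_case=True):
    """BIC penalty 2p ln r.  Each full-covariance d-variate normal
    population contributes d mean parameters plus d(d+1)/2 covariance
    parameters; the MAP case adds the g - 1 mixture rates."""
    p = g * (d + d * (d + 1) // 2) + (g - 1 if map_case else 0)
    return 2 * p * math.log(r)
```

For instance, with g = 3 clusters in d = 3 dimensions and r = 300 regular objects, the MAP case gives p = 3 · 9 + 2 = 29.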
We therefore take a different approach with the MAP method. It determines the number of classes and, at the same time, the number of outliers if populations are normal. It is based on the following two observations:
(i) If, for fixed r, the assumed number of clusters is increased beginning with the “true” value, then MAP splits off a small cluster (b_j or approximately b_j points lying close to some hyperplane); if it is decreased, then no small clusters are created.
(ii) If, for the true g, the assumed number r of regular objects is decreased, then some or all clusters lose some of their members.
The first half of statement (i) cannot be made for the ML method. The presence of the small cluster is not a particular property of the d + 1 points. If they are removed, then others assume their role.
Positing that the true clustering contains no small cluster, we infer the following algorithm for selecting valid partitions. The integer m(g, r) is the size of the smallest cluster obtained for the optimal MAP partition with parameters g ≥ 1 and r ≤ n.
Selection algorithm
(1) Establish a table of the integers m(g, r) for all reasonable pairs (g, r);
(2) for each r, determine the largest g such that m(j, r) is not small for any j ≤ g and mark
the pair (g, r);
(3) in each line, scan the marked pairs beginning with low values of r until a partition is
found that satisfies some cluster validity criterion; return the corresponding pair.
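Steps (1) and (2) might be sketched as follows, assuming the table is given as a mapping (g, r) → m(g, r) and that a cluster counts as small when its size is at most a given threshold (d + 1 in the experiments of Sect. 4.1). Step (3) requires an external validity test and is omitted; the function name is ours.

```python
def marked_pairs(m, small):
    """Step (2) of the selection algorithm: for each r, mark the
    largest g such that m(j, r) is not small for any j <= g.
    m maps pairs (g, r) to minimal cluster sizes; a size <= small
    counts as a small cluster."""
    marked = []
    for r in sorted({r for _, r in m}):
        best = None
        for g in sorted(g for g, rr in m if rr == r):
            if m[(g, r)] <= small:
                break              # a small cluster appears: stop scanning
            best = g
        if best is not None:
            marked.append((best, r))
    return marked
```

On a table like that of Data Set B, where already the two-cluster solution without outliers contains a small cluster, the scan correctly marks the pair (1, r) for that column.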
It is, of course, sufficient to perform the procedure with a lacunary set of numbers r. Plainly, each column contains some marked pair. The returned pairs are candidates for reasonable choices of g and r. There is at most one returned pair per line in the table, so the complexity of the problem is again largely reduced. If it turns out that no marked pair is valid, then the distributional assumptions are questionable.
In order to further reduce the number of possible solutions, we use validation. We accept all pairs (g, r) for which none of the g clusters is rejected by a χ²-goodness-of-fit test with the categories defined by some quantiles of the squared Mahalanobis distances w.r.t. the estimated parameters, cf. Ritter and Gallegos [34].
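The validation statistic can be sketched as follows. The category edges are assumed to be precomputed, e.g. as the χ²_d quantiles at probabilities 0, 1/10, …, 1, which under the fitted d-variate normal yield ten categories of equal probability; the Yates correction used in Table 3 is omitted, and the function name is ours.

```python
import numpy as np

def gof_statistic(x, mean, cov, edges):
    """Chi-square goodness-of-fit statistic for one fitted cluster.
    Under the fitted d-variate normal, the squared Mahalanobis
    distances are chi^2_d distributed, so edges containing the chi^2_d
    quantiles at 0, 1/k, ..., 1 give k categories of equal probability
    1/k; with k = 10 the statistic has 9 degrees of freedom."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    observed = np.histogram(d2, bins=edges)[0]
    expected = len(d2) / (len(edges) - 1)
    return float(((observed - expected) ** 2 / expected).sum())
```

With ten categories, a solution whose poorest cluster yields a statistic of at least 14.7 would be rejected at the p-value 0.1 used below.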
Means           (0, 0, 0)         (−6, 3, 6)           (6, 6, 4)
Covariance      diagonal           4.0                  4.0
matrices        (9.0, 4.0, 1.0)   −3.2  4.0             3.2  4.0
                                  −0.2  0.0  1.0        2.8  2.4  2.0
Eigenvalues     9.0  4.0  1.0      7.20  1.07  0.73     9.11  0.88  0.016
Cardinalities   200                50                   50

Table 1: Structures of the three 3D normal populations from which Data Set A is sampled. Only the lower triangles of the symmetric covariance matrices are shown; their eigenvalues are also given. The data set contains 30 additional “outliers” uniformly distributed on the cube [−15, 15]³.
4.1  Simulation studies
We illustrate the method presented in Section 3 with two contaminated data sets, A and B, and Anderson’s [3] well-known four-dimensional Iris Data Set. Data Set A is sampled from the three normal populations specified in Table 1. The corresponding mixture appears in [29], p. 218. To this basic data set we add 30 outliers uniformly distributed in the cube [−15.0, 15.0]³. Data Set B has six features and consists of 100 regular observations, 50 sampled from N(7e₁, I) and 50 from N(7e₂, I), and of two outliers from N(0, 100I). Iris consists of 50 observations of each of the three subspecies setosa, versicolor, and virginica, with only few outliers. Its entries being given in the form *.* with only two digits of precision, its observations are not in general position. Observations 102 and 143 are even equal. We, therefore, add a number uniformly
distributed in the interval [−0.05, 0.05] to each entry. Moreover, since the entries are positive,
we take logarithms.
The MAP clusterings for Data Sets A and B with the known numbers of classes and outliers are essentially the original ones. The estimated parameters for Data Set A with these input parameters are shown on top of Table 4. The original biological partition of Iris is essentially reproduced by the ML method with three clusters; just two versicolors are considered virginicas, and one virginica is falsely assigned to versicolor. The MAP clustering with three clusters unites these two subspecies, identifying only two major clusters and a small spurious cluster of minimal size five.
Data Set A
 n−r:   0    5   10   15   20   25   30   35   40   45   50   55   60
 g=1:  330  325  320  315  310  305  300  295  290  285  280  275  270
 g=2:  134   47   56   47   47   47   47   46   46   46   46   46   46
 g=3:   47   47   49   47   47   49*  47*  47*  48*  46*  42*  42*  41*
 g=4:   32*  27*  22*  14*  12*   4    4    4    4    4    4    4    4
 g=5:    4    4    4    4    4    4    4    4    4    4    4    4    4
 g=6:    4    4    4    4    4    4    4    4    4    4    4    4    4
 g=7:    4    4    4    4    4    4    4    4    4    4    4    4    4

Data Set B
 n−r:    0    1    2    3    4    6    8   10   12   14   16
 g=1:  102* 101  100   99   98   96   94   92   90   88   86
 g=2:    7   50*  50*  49*  49*  47*  46*  45*  42*  40*  39*
 g=3:    7    7    7    7    7    7    7    7    7    7    7
 g=4:    7    7    7    7    7    7    7    7    7    7    7
 g=5:    7    7    7    7    7    7    7    7    7    7    7

Iris
 n−r:    0    5   10   15   20   25   30
 g=1:  150  145  140  135  130  125  120
 g=2:   50*  45*  42*  39*  38*  34*  30*
 g=3:    5    5    5    5    5    5    5
 g=4:    5    5    5    5    5    5    5
 g=5:    5    5    5    5    5    5    5

Table 2: Table of the minimal cluster sizes m(g, r), see Sect. 4, for the three data sets A (top), B (center), and Iris (bottom) w.r.t. MAP. In each column, the marked entry (*) terminates the interval of pairs without small clusters. In each line, the leftmost marked entry which passes the validation test indicates a possible solution to the clustering problem.
Table 2 shows the minimal sizes m(g, r) defined in Sect. 4 for various values of g and n − r and the three data sets. The pairs marked according to Sect. 4(2) are highlighted in the table. The table nicely illustrates observation (i) in Sect. 4: all entries below the marked entries are d + 1. In order to determine the number of outliers, we apply a χ²-goodness-of-fit test with the ten circular categories of equal probability 0.1 after sphering the data with the estimated parameters. We reject all solutions whose poorest cluster is rejected at a p-value of 0.1, say. Since there are nine degrees of freedom, this means a value of the χ²-statistic of ≥ 14.7. Two solutions for Data Set A and g = 4 are not rejected, the one devoid of outliers and the one with 15 outliers. The first accepts the outliers as a cluster and is the proposed partition for g = 4. The second solution removes extreme elements in the corners of the cube as outliers and accepts the rest as a cluster. The test also does not reject the solutions with 3 clusters and between 30 and 40 outliers, and so we also accept the one with 30 outliers. Its sizes are 202, 51, and 47. This solution is close to the original partition.
Data Set A, g = 4
 n−r =  0:  12.5  11.0  10.7   9.3
 n−r =  5:  23.2  12.5  11.0   9.3
 n−r = 10:  21.9  12.5  11.0   9.3
 n−r = 15:  14.2  13.2  12.1   3.5
 n−r = 20:  15.8  12.5  11.0   8.5

Data Set A, g = 3
 n−r = 25:  20.1  14.3  10.9
 n−r = 30:  12.5  11.6  11.0
 n−r = 35:  11.1   7.5   3.5
 n−r = 40:  11.3   7.5   3.1
 n−r = 45:  15.9  15.0   2.5
 n−r = 50:  19.1  15.1   2.5
 n−r = 55:  22.8  19.1   2.5
 n−r = 60:  37.7  20.8   3.7

Iris, g = 2
 n−r =  0:  13.3   4.9
 n−r =  5:  12.2   4.9
 n−r = 10:   8.6   4.1
 n−r = 15:  10.6   3.4
 n−r = 20:  14.8  11.5
 n−r = 25:  19.1  10.9
 n−r = 30:  10.8   8.7

Table 3: Yates-corrected χ²-goodness-of-fit test with nine degrees of freedom for estimating the number of outliers. The rows show the values of the test statistic for the g clusters of the clustering results for the Data Sets A and Iris with g clusters and n − r outliers. See also the text.
The marked pairs for Data Set B are displayed in the center of Table 2. Interestingly, the solution with two clusters and no outliers contains a small cluster, so that the solution with one cluster and no outliers is marked. The χ²-test, of course, rejects it, as it rejects the solution with two clusters and one outlier. It does not reject the solution with two clusters and two outliers, with exactly the original clustering. This is not surprising, as the two clusters are very well separated and the outliers are sufficiently far away. Data Set B was chosen precisely in order to demonstrate that the sequence of marked entries in Table 2 may also contain downward jumps, contrary to what one might at first conjecture.
The Iris Data Set has served for demonstrating the performance of many clustering algorithms, beginning with Fisher [10]. Our MAP method is not able to identify the correct number of clusters, clearly identifying two. Some authors contend that this is the number of clusters that a geometric method should estimate. We disagree with this opinion. Setosa is completely separated from the two other classes, and versicolor and virginica touch, but they, too, are fairly well separated. Otherwise, the ML estimator with three classes would assign a good amount of the overlap to a wrong class, which, as we already mentioned, is not the case. Simulated normal data show the same behavior: if two clusters approach each other, there comes a point when ML still estimates the right sizes and clusterings whereas MAP unites the touching clusters. The reason for this misbehavior is therefore not a bad choice of the distributional assumptions. Rather, the influence of the entropy term in the MAP criterion (4) turns out to be too strong, so that the estimates of the sizes break down. The χ²-goodness-of-fit test presented on the right of Table 3 indicates that Iris does not contain many outliers, if any.
Estimated parameters with the known numbers of clusters (3) and outliers (30):

 Means          (0.09, 0.13, −0.02)   (−5.94, 2.94, 5.95)   (6.38, 6.40, 4.28)
 Covariance      9.1                    4.0                   3.4
 matrices        0.4  3.7              −2.3  2.7              2.6  3.0
                 0.2  0.0  1.0         −0.5 −0.0  1.2         2.5  2.0  1.8
 Eigenvalues     9.1  3.7  1.0          4.0  1.4  1.0         3.4  1.1  0.02
 Cardinalities   202                    51                    47

Estimation with the proposed method (3 clusters, 35 outliers):

 Means          (0.02, 0.13, −0.04)   (−5.86, 2.84, 5.97)   (6.19, 6.21, 4.14)
 Covariance      8.7                    3.3                   2.6
 matrices        0.4  3.8              −2.3  2.5              2.2  3.2
                 0.1  0.0  0.96        −0.2  0.2  1.0         1.9  1.7  1.4
 Eigenvalues     8.7  3.7  1.0          3.3  0.9  0.9         2.6  1.3  0.02
 Cardinalities   200                    48                    47

Optimal solution with four clusters and 30 outliers:

 Means          (0.09, 0.13, −0.02)   (−5.83, 2.94, 5.90)   (6.26, 6.37, 4.21)   (8.57, 7.90, 6.35)
 Covariance      9.1                    3.5                   2.3                  12.7
 matrices        0.4  3.7              −2.3  2.8              1.9  2.7              9.8  9.5
                 0.2  0.0  1.0         −0.2  0.0  1.0         1.6  1.5  1.2        10.3  9.7  9.9
 Eigenvalues     9.1  3.7  1.0          3.5  1.2  1.0         2.3  1.1  0.02       12.7  2.0  3.1·10⁻¹¹
 Cardinalities   202                    50                    44                    4

Table 4: Estimations for Data Set A. Top: estimated parameters with known numbers of clusters and outliers (30). Center: the estimation with the proposed method has 3 clusters and 35 outliers. Bottom: the parameters for the optimal solution with four clusters and 30 outliers. Only the lower triangles of the symmetric covariance matrices are shown. The scatter matrix of the small cluster has a very small eigenvalue, the cluster thus lying close to some hyperplane. See also the text.
References
[1] Hans Achatz, Peter Kleinschmidt, and Konstantinos Paparrizos. A dual forest algorithm for the assignment problem. In Peter Gritzmann and Bernd Sturmfels, editors, Applied Geometry and Discrete Mathematics, volume 4 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 1–10. American Mathematical Society, Providence, RI, 1991.
[2] Ravindra K. Ahuja, James B. Orlin, Clifford Stein, and Robert Endre Tarjan. Improved algorithms
for bipartite network flows. SIAM J. Computing, 23:906–933, 1994.
[3] E. Anderson. The irises of the Gaspé Peninsula. Bull. Amer. Iris Soc., 59:2–5, 1935.
[4] F. Aurenhammer, F. Hoffmann, and B. Aronov. Minkowski–type theorems and least–squares
clustering. Algorithmica, 20:61–76, 1998.
[5] Thorsten Bernholt and Paul Fischer. The complexity of computing the MCD–estimator. Theor.
Comp. Science, 326:383–398, 2004.
[6] James C. Bezdek, James Keller, Raghu Krisnapuram, and Nikhil R. Pal. Fuzzy Models and
Algorithms for Pattern Recognition and Image Processing. The Handbooks of Fuzzy Sets Series.
Kluwer, Boston, London, Dordrecht, 1999.
[7] Hans-Hermann Bock. On some significance tests in cluster analysis. J. Classification, 2:77–108,
1985.
[8] William J. Cook, William H. Cunningham, William R. Pulleyblank, and Alexander Schrijver.
Combinatorial Optimization. Wiley, New York etc., 1998.
[9] J. Edmonds and R.M. Karp. Theoretical improvements in algorithmic efficiency for network flow
problems. J. ACM, 19:248–264, 1972.
[10] R.A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188,
1936.
[11] Chris Fraley and Adrian E. Raftery. How many clusters? Which clustering method? Answers via model–based cluster analysis. The Computer Journal, 41:578–588, 1998.
[12] Chris Fraley and Adrian E. Raftery. Model–based clustering, discriminant analysis, and density
estimation. JASA, 97:611–631, 2002.
[13] H.P. Friedman and J. Rubin. On some invariant criteria for grouping data. Journal of the
American Statistical Association, 62:1159–1178, 1967.
[14] D.R. Fulkerson. An out-of-kilter method for minimal cost flow problems. J. SIAM, 9:18–27, 1961.
[15] Harold N. Gabow and Robert Endre Tarjan. Faster scaling algorithms for network problems.
SIAM J. Computing, 18:1013–1036, 1989.
[16] Marı́a Teresa Gallegos and Gunter Ritter. A robust method for cluster analysis. Annals of
Statistics, 33:347–380, 2005.
[17] Luis Angel Garcı́a-Escudero, Alfonso Gordaliza, and Carlos Matrán. Trimming tools in exploratory data analysis. Journal of Computational and Graphical Statistics, 12:434–449, 2003.
[18] Michael R. Garey and David S. Johnson. Computers and Intractability. Freeman, San Francisco, 1979.
[19] Andrew V. Goldberg and Robert Endre Tarjan. Finding minimum–cost circulations by successive
approximation. Math. of OR, 15:430–466, 1990.
[20] A. D. Gordon. Classification, volume 82 of Monographs on Statistics and Applied Probability.
CRC Press, second edition, 1999.
[21] Dan Gusfield, Charles Martel, and David Fernandez-Baca. Fast algorithms for bipartite network
flow. SIAM J. Comput., 16:237–251, 1987.
[22] Ming S. Hung and Walter O. Rom. Solving the assignment problem by relaxation. Operations
Research, 28:969–982, 1980.
[23] R.C. Jancey. Multidimensional group analysis. Australian J. Botany, 14:127–130, 1966.
[24] Peter Kleinschmidt, Carl W. Lee, and Heinz Schannath. Transportation problems which can be
solved by use of Hirsch-paths for the dual problem. Math. Prog., 37:153–168, 1987.
[25] Donald E. Knuth. The Art of Computer Programming, volume 2. Addison–Wesley, Reading,
Menlo Park, London, Amsterdam, Don Mills, Sydney, 2nd edition, 1981.
[26] E.L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston,
New York, 1976.
[27] B.G. Leroux. Consistent estimation of a mixing distribution. Ann. Stat., 20:1350–1360, 1992.
[28] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc.
5th Berkeley Symp. Math. Statist. Probab., pages 281–297, 1967.
[29] Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley, New York etc., 2000.
[30] G.W. Milligan and M.C. Cooper. An examination of procedures for determining the number of
clusters in a data set. Psychometrika, 50:159–179, 1985.
[31] James B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In Proc. 20th ACM
Symp. Theory of Computing, pages 377–387, 1988.
[32] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization. Prentice–Hall,
Englewood Cliffs, New Jersey, 1982.
[33] Christoph Pesch. Eigenschaften des gegenüber Ausreissern robusten MCD-Schätzers und Algorithmen zu seiner Berechnung. PhD thesis, Universität Passau, Fakultät für Mathematik und
Informatik, 2000.
[34] Gunter Ritter and Marı́a Teresa Gallegos. Outliers in statistical pattern recognition and an
application to automatic chromosome classification. Patt. Rec. Lett., 18:525–539, 1997.
[35] David M. Rocke and David L. Woodruff. A synthesis of outlier detection and cluster identification. Technical report, University of California, Davis, 1999. http://handel.cipic.ucdavis.edu/~dmrocke/Synth5.pdf.
[36] Peter J. Rousseeuw. Multivariate estimation with high breakdown point. In Wilfried Grossmann, Georg Ch. Pflug, István Vincze, and Wolfgang Wertz, editors, Mathematical Statistics and
Applications, volume 8B, pages 283–297. Reidel, Dordrecht–Boston–Lancaster–Tokyo, 1985.
[37] Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance
determinant estimator. Technometrics, 41:212–223, 1999.
[38] Anne Schroeder. Analyse d’un mélange de distributions de probabilités de même type. Revue de
Statistique Appliquée, 24:39–62, 1976.
[39] A.J. Scott and M. J. Symons. Clustering methods based on likelihood ratio criteria. Biometrics,
27:387–397, 1971.
[40] Robert Sedgewick. Algorithms in C, volume 5 - Graph Algorithms. Addison-Wesley, Boston etc.,
third edition, 2002.
[41] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 4:801–804,
1956.
[42] M. J. Symons. Clustering criteria and multivariate normal mixtures. Biometrics, 37:35–43, 1981.
[43] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in
a data set via the gap statistic. J. Royal Statistical Society, Series B, 63:411–423, 2001.
[44] Takeshi Tokuyama and Jun Nakano. Geometric algorithms for a minimum cost assignment problem. In Proc. 7th ACM Symp. on Computational Geometry, pages 262–271, 1991.
[45] Takeshi Tokuyama and Jun Nakano. Efficient algorithms for the Hitchcock transportation problem. SIAM J. Comput., 24:563–578, 1995.
[46] Takeshi Tokuyama and Jun Nakano. Geometric algorithms for the minimum cost assignment
problem. Random Structures and Algorithms, 6:393–406, 1995.
[47] M.K.S. Tso and J. Graham. The transportation algorithm as an aid to chromosome classification.
Patt. Rec. Lett., 1:489–496, 1983.
[48] M.K.S. Tso, P. Kleinschmidt, I. Mitterreiter, and J. Graham. An efficient transportation algorithm
for automatic chromosome karyotyping. Patt. Rec. Lett., 12:117–126, 1991.
[49] Joe H. Ward, Jr. Hierarchical grouping to optimize an objective function. J. Am. Statist. Ass.,
58:236–244, 1963.
[50] J.H. Wolfe. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Res.,
5:329–350, 1970.
[51] David L. Woodruff and Torsten Reiners. Experiments with, and on, algorithms for maximum
likelihood clustering. CSDA, 47:237–252, 2004.