Model–based clustering with the assignment problem Marı́a Teresa Gallegos and Gunter Ritter∗ Faculty of Mathematics and Computer Science University of Passau D-94030 Passau, Germany Nov 5, 2006 Abstract: We establish a statistical model for clustering in the presence of gross outliers and derive its ML and MAP clustering estimators. Their computation leads to constrained trimming algorithms of k–means type. One of them includes a λ–assignment problem known from combinatorial optimization. We also estimate the numbers of clusters and outliers with the algorithms obtained and illustrate them with some examples. Key Words: model–based clustering, constrained classification, combinatorial optimization, assignment problem, outliers, model selection 1 Introduction Clustering means grouping data sets into homogeneous, separated subsets. If cluster shapes are spherical or elliptical and if the data come from a mixture of normal distributions then much progress towards an automatic treatment has been made in the past 40 years. In the 1960’s, the following cluster criteria, optimal for uncontaminated, normal data and applicable in the case of a known number of clusters, had been established: • The pooled trace criterion, Ward [49]. It is the ML criterion for equal cluster sizes and spherical normal populations of equal variances, Scott and Symons [39]. • The pooled determinant criterion, Friedman and Rubin [13]. It is the ML criterion for equal cluster sizes and general normal populations of equal shapes, Scott and Symons loc. cit.. • The heteroscedastic determinant criterion. This criterion is optimal in the case of equal cluster sizes and general normal populations, Scott and Symons loc. cit.. By that time it was well understood how criteria could be established based on other normal submodels, e.g. with independent features, and on different cross–cluster constraints. In 1981, Symons [42] noticed that subtracting the entropy of the cluster proportions from the log– likelihood could in some sense optimally handle unknown and unequal cluster sizes. ∗ E-mail: [email protected] 1 Steinhaus [41] and independently Jancey [23] proposed an efficient algorithm for minimizing the trace criterion. It is now called the “k–means” algorithm, a name coined by MacQueen [28] who proposed and analyzed a similar algorithm in a somewhat different situation. The method consists essentially of an alternating application of ML parameter estimation and ML discriminant analysis. Except for this method it took some time until dedicated algorithms suitable for efficiently reducing the other criteria mentioned above became known. Indeed, Schroeder [38] realized that all criteria could be optimized by similar, alternating algorithms. In the case of general normal populations, e.g., it is sufficient to replace the Euclidean distance used in k–means with the Mahalanobis distance. The criteria and algorithms mentioned so far are not robust against outliers. Small outliers may be handled by applying suitable distributional families such as the multivariate Student (or Pearson’s type VII) distribution. Gross outliers need a different approach. It is well known that the usual breakdown points of the methods above even vanish. An effective way of dealing with the related problem of robust parameter estimation is Rousseeuw’s [36] minimum–covariance–determinant criterion, MCD. Rousseeuw and van Driessen [37] proposed an efficient algorithm for its approximate computation. This algorithm, too, is alternating. 
Pesch [33] established a statistical model for which MCD is an ML–estimator. His model assumes an independent, albeit non-stationary, sequence of observations. Rocke and Woodruff [35], see also [51], proposed on heuristic grounds a trimmed extension of Scott and Symons's heteroscedastic determinant criterion which they called MINO. A trimmed extension of the k–means algorithm was designed by García-Escudero, Gordaliza, and Matrán [17]. In [16], we proposed a homoscedastic statistical model for clustering with outliers, deriving from it a pooled cluster criterion and a related trimming algorithm. In that paper we also computed some breakdown points, thus showing robustness of criterion and algorithm. We finally mention that a synoptic presentation of much of the state of the art at the turn of the millennium appears in Fraley and Raftery [12].

In the present paper, we propose a heteroscedastic clustering model with arbitrary populations and with outliers, extending Scott and Symons's heteroscedastic normal model and the statistical model with spurious outliers established in [16] to this case. From this model we derive trimmed clustering criteria. In the normal ML case we retrieve Rocke and Woodruff's MINO. Optimizing the heteroscedastic criteria is not easy and we propose alternating algorithms for this task, the single–point and multi–point algorithms. Interestingly, the multi–point algorithm contains a famous classical problem from combinatorial optimization, λ–assignment. We also show how to use the algorithms for estimating the numbers of clusters and outliers. We finally report on our experience with some data sets.

2 Statistical model and constrained clustering criteria

We consider a very general statistical clustering model with r regular observations and n − r "spurious" outliers in a sample space E. Spuriousness [16] is meant as a model for observations that are unpredictable in the sense that they obey no statistical law. We feel that the best way of handling this idea in a statistical (!) framework is assuming that each outlier comes from its own population with a nuisance parameter. The following is the main assumption on the spurious outliers.

(SVo) An outlier X_i : Ω → E, i ∈ 1..n, obeys a parametric model with parameter ψ_i ∈ Ψ_i such that the likelihood integrated w.r.t. some prior measure τ_i on Ψ_i satisfies
\[
\int_{\Psi_i} f_{X_i}[x \mid \psi_i]\, \tau_i(d\psi_i) = 1, \tag{1}
\]
i.e., does not depend on x.

We will later consider the ψ_i's as nuisance parameters to be integrated out. There are two important and sufficiently general situations where (SVo) holds.

(A) The sample space E = R^d is Euclidean, Ψ_i = E, the outliers obey a location model X_i = U_i + ψ_i with some (unknown) random noise U_i : (Ω, P) → E, and τ_i is Lebesgue measure on Ψ_i. Indeed, in this case, the conditional Lebesgue density is f_{X_i}[x | ψ_i] = f_{U_i}(x − ψ_i) and, hence,
\[
\int_{\Psi_i} f_{X_i}[x \mid \psi_i]\, d\psi_i = 1.
\]

(B) The parameter set Ψ_i is a singleton and the distribution of X_i is taken as the reference measure for its density. This case includes the idea of outliers "uniformly distributed" on some domain.
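The identity underlying case (A) holds for any noise density. The following minimal numerical check is ours and not part of the paper; the standard normal noise density and the point x = 3.7 are arbitrary illustrative choices.

```python
# Numerical check of (SVo) in the location model, case (A): for a fixed
# observation x, the integral over the nuisance location psi of
# f_U(x - psi) equals 1, whatever the noise density f_U.
from scipy.integrate import quad
from scipy.stats import norm

x = 3.7                                          # any fixed observation
value, abserr = quad(lambda psi: norm.pdf(x - psi), -50.0, 50.0)
print(value)                                     # ~1.0, independently of x
```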
Each of the r regular observations comes from one of g populations represented by a density f_{γ_j}, j ∈ 1..g, in some dominated parametric statistical model with parameter set Γ_j. This allows the consideration of discrete and continuous sample spaces of all kinds. A labeling of the n objects is a mapping l : 1..n → 0..g: l_i = j ∈ 1..g stands for the assignment of object i to class j and l_i = 0 specifies it as an outlier.

A labeling is admissible if each cluster j ∈ 1..g contains at least b_j data points. These numbers may serve several purposes. They may first reflect prior information on the cluster sizes. Secondly, the number b_j has to be chosen in such a way as to secure ML–estimation of the population parameters. Assume the full normal model, N_{m_j,V_j}, and general position of the data, i.e. any d + 1 observations are affinely independent. Then the m.l.e.'s of m_j and V_j exist if and only if #C_j ≥ d + 1 for all j ∈ 1..g, and we need b_j ≥ d + 1. Similar statements can be made for normal submodels. If the V_j's are assumed to be diagonal and if x_{i_1,k} ≠ x_{i_2,k} for any two observations i_1 ≠ i_2 and all k ∈ 1..d, then it is sufficient to have at least two elements in each cluster. The same is true in the spherical normal model, where this minimum number even suffices if the observations are only pairwise different. The same can be said about Lebesgue continuous elliptically contoured models. The conditions on the positions of the data are guaranteed with probability one if the data arise from Lebesgue continuous models.

Let Λ_r denote the set of all admissible labelings l : 1..n → 0..g of the n objects such that n − r labels are 0. Combining regular observations and outliers, we may use the Cartesian product
\[
\Lambda_r \times \prod_{j=1}^{g} \Gamma_j \times \prod_{i=1}^{n} \Psi_i
\]
as the parameter set of our complete model with g classes and n − r outliers. The number r of regular objects is considered fixed; as to its choice see Sect. 3.4.

The density of the ith observation for the parameter values l ∈ Λ_r, γ = (γ_1, …, γ_g), and ψ = (ψ_1, …, ψ_n) is
\[
f_{X_i}[x \mid l, \gamma, \psi_i] =
\begin{cases}
f_{\gamma_j}(x), & l_i = j \in 1..g, \\
f_{X_i}[x \mid \psi_i], & l_i = 0,
\end{cases}
\]
cf. Eqn. (1). We assume that the sequence of observations (X_i)_{i=1}^{n} is statistically independent but not necessarily i.i.d. unless there are no outliers, n = r. By the product formula, the likelihood for the data set x = (x_1, …, x_n) w.r.t. the product reference measure on E^n is
\[
f_X(l, \gamma, \psi) = \prod_{j=1}^{g} \prod_{l_i = j} f_{\gamma_j}(x_i) \prod_{l_i = 0} f_{X_i}[x_i \mid \psi_i].
\]
Considering the parameters ψ_i of the outliers as nuisances, the likelihood integrated w.r.t. the prior measures τ_i is, by Eqn. (1),
\[
f_X(l, \gamma) = \prod_{j=1}^{g} \prod_{l_i = j} f_{\gamma_j}(x_i). \tag{2}
\]
Maximizing first w.r.t. all γ_j's and taking the logarithm, we infer that the ML–estimate of l is given by the constrained trimmed ML–criterion
\[
l^{*}_{\mathrm{ML}}(x) = \operatorname*{argmin}_{l \in \Lambda_r} \sum_{j=1}^{g} \sum_{l_i = j} -\ln f_{\gamma_j(l)}(x_i), \tag{3}
\]
where γ_j(l) is the m.l.e. of the parameter γ_j ∈ Γ_j w.r.t. cluster j. This criterion is optimal if it is known a priori that the clusters are of (about) the same size. If f_γ is normal, the summands in (3) become essentially the log–determinants of the scatter matrices and we find Rocke and Woodruff's [35] MINO. In order to account for unequal sizes, Symons [42] noted that the entropy of the relative cluster sizes had to be added to the ML–criterion. Therefore, the constrained trimmed MAP–criterion is
\[
l^{*}_{\mathrm{MAP}}(x) = \operatorname*{argmin}_{l \in \Lambda_r} \Big\{\, r\, H\big((n_j(l)/r)_j\big) - \sum_{j=1}^{g} \sum_{l_i = j} \ln f_{\gamma_j(l)}(x_i) \Big\}. \tag{4}
\]
Here, n_j(l) is the size of cluster j w.r.t. l and H((n_j(l)/r)_j) = −Σ_{j=1}^{g} (n_j(l)/r) ln(n_j(l)/r) is the entropy of the cluster proportions. Adding this term to the criterion penalizes equal cluster sizes, for which the entropy is maximal. (A small sketch of how (3) and (4) are evaluated for a fixed labeling is given below.)

The criterion includes a number of special cases. First, any state space E is allowed, be it discrete or continuous. Secondly, various statistical models on E may be used; it is not necessary that each population belong to the same model. Thirdly, one may assume that there are spurious outliers, in which case it is necessary to specify a lower bound on the number of regular objects, r. One may also collect possible outliers in one cluster by choosing a particular model for them, e.g. a distribution uniform on the convex hull of the data, cf. [11]; in this case we choose r = n. The special case with g = 1 and the normal distribution is Rousseeuw's [36] MCD.
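For illustration, here is a minimal sketch (our own code, not the authors' implementation) that evaluates the criteria (3) and (4) for a fixed admissible labeling when every population is a full multivariate normal and γ_j(l) is the usual ML mean and covariance of cluster j.

```python
import numpy as np
from scipy.stats import multivariate_normal

def trimmed_criteria(X, labels, g):
    """Values of the trimmed ML criterion (3) and MAP criterion (4) for a
    fixed admissible labeling.  X: (n, d) data, labels[i] in 0..g with 0
    marking an outlier.  Each population is a full normal fitted by its ML
    estimates; admissibility (cluster sizes >= d + 1) is assumed."""
    r = int((labels > 0).sum())                   # number of regular objects
    neg_loglik = 0.0                              # sum of -ln f_{gamma_j(l)}(x_i)
    entropy = 0.0                                 # H((n_j(l)/r)_j)
    for j in range(1, g + 1):
        C = X[labels == j]
        m_j = C.mean(axis=0)
        V_j = np.cov(C, rowvar=False, bias=True)  # ML (biased) covariance
        neg_loglik -= multivariate_normal.logpdf(C, mean=m_j, cov=V_j).sum()
        entropy -= (len(C) / r) * np.log(len(C) / r)
    return neg_loglik, r * entropy + neg_loglik   # criterion (3), criterion (4)
```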
In some rare cases, computing the ML–estimates of the parameters γ_j is simple. Popular examples are the normal and coin tossing cases, where ML–estimation reduces to simple summation. Even then, minimizing the criteria (3) and (4) over the combinatorial structure of all admissible assignments Λ_r is not a simple task. Only for the MCD estimator was it recently shown that optimization can be carried out in polynomial time, cf. [5], where an algorithm of complexity O(n^{d^2}) based on elliptical separability is designed. In general, one has recourse to general optimization schemes such as local descent methods combined with multistart, or MCMC algorithms. A shortcoming of these methods is the need to update the parameters with each move, also the unsuccessful ones. More efficient algorithms detect whether a move will be successful before the parameters are updated. We deal here with a dedicated alternating algorithm of the k–means type. The general version in the non–contaminated case is due to Schroeder [38]. The following proposition extends it to the case with spurious outliers. A version for the m.l.e. in the pooled normal case appears in [16].

2.1 Proposition  Let l and l_new be two admissible labelings such that
\[
\sum_{j=1}^{g} \sum_{i:\, l_{\mathrm{new},i}=j} \big( \ln n_j(l) + \ln f_{\gamma_j(l)}(x_i) \big) \;>\; \sum_{j=1}^{g} \sum_{i:\, l_i=j} \big( \ln n_j(l) + \ln f_{\gamma_j(l)}(x_i) \big). \tag{5}
\]
(a) Then l_new strictly improves the MAP–criterion (4) w.r.t. l.
(b) The same is true for the ML–criterion (3) if the hypothesis is replaced with
\[
\sum_{j=1}^{g} \sum_{i:\, l_{\mathrm{new},i}=j} \ln f_{\gamma_j(l)}(x_i) \;>\; \sum_{j=1}^{g} \sum_{i:\, l_i=j} \ln f_{\gamma_j(l)}(x_i).
\]

Proof. We give the proof for Case (a); Case (b) is similar. Both sides of (5) contain exactly r summands, so (5) continues to hold when n_j(l) is replaced with n_j(l)/r. Applying the hypothesis, the entropy inequality, and ML–estimation, in this order, we have the following chain of estimates:
\[
\begin{aligned}
-\,r H\big((n_j(l)/r)_j\big) + \sum_j \sum_{i:\,l_i=j} \ln f_{\gamma_j(l)}(x_i)
&= \sum_j \sum_{i:\,l_i=j} \Big( \ln\tfrac{n_j(l)}{r} + \ln f_{\gamma_j(l)}(x_i) \Big) \\
&< \sum_j \sum_{i:\,l_{\mathrm{new},i}=j} \Big( \ln\tfrac{n_j(l)}{r} + \ln f_{\gamma_j(l)}(x_i) \Big) \\
&= \sum_j n_j(l_{\mathrm{new}}) \ln\tfrac{n_j(l)}{r} + \sum_j \sum_{i:\,l_{\mathrm{new},i}=j} \ln f_{\gamma_j(l)}(x_i) \\
&\le \sum_j n_j(l_{\mathrm{new}}) \ln\tfrac{n_j(l_{\mathrm{new}})}{r} + \sum_j \sum_{i:\,l_{\mathrm{new},i}=j} \ln f_{\gamma_j(l)}(x_i) \\
&\le \sum_j n_j(l_{\mathrm{new}}) \ln\tfrac{n_j(l_{\mathrm{new}})}{r} + \sum_j \sum_{i:\,l_{\mathrm{new},i}=j} \ln f_{\gamma_j(l_{\mathrm{new}})}(x_i) \\
&= -\,r H\big((n_j(l_{\mathrm{new}})/r)_j\big) + \sum_j \sum_{i:\,l_{\mathrm{new},i}=j} \ln f_{\gamma_j(l_{\mathrm{new}})}(x_i). \qquad \Box
\end{aligned}
\]

The fact that both sides in the hypotheses of Proposition 2.1 contain the same population parameters γ_j(l) substantially reduces the complexity of the optimization. There are several ways to exploit the foregoing proposition. A first method moves one object from one cluster to another or swaps an outlier and a regular object. Starting from the admissible labeling l, define the following admissible moves:

(c) i : l_i → j, for j, l_i ∈ 1..g, j ≠ l_i, if the size of cluster l_i is > b_{l_i} (object i is reassigned from cluster l_i to cluster j);
(o) k : 0 → j, i : l_i → 0, for j, l_i ∈ 1..g, if the size of cluster l_i is > b_{l_i} or if j = l_i (the outlier k and the regular object i swap their roles).

Denote the assignment resulting from move (c) or (o) by l_new. The difference between the left and right sides of (5) amounts to
\[
u^{(c)}(i,j) = \ln\frac{n_j(l)}{n_{l_i}(l)} + \ln f_{\gamma_j(l)}(x_i) - \ln f_{\gamma_{l_i}(l)}(x_i),
\]
\[
u^{(o)}(i,k,j) = \ln\frac{n_j(l)}{n_{l_i}(l)} + \ln f_{\gamma_j(l)}(x_k) - \ln f_{\gamma_{l_i}(l)}(x_i),
\]
respectively.
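A minimal sketch (ours; the names and the precomputed array logf are assumptions, not the authors' code) of how these gains can be scanned and the best admissible move selected; the formal single–point reduction step follows below. For the ML–criterion, the log–ratio of the cluster sizes is simply dropped.

```python
import numpy as np

def best_single_point_move(labels, logf, b):
    """Return the admissible move (c) or (o) with the largest gain, or None.
    labels: int array with values in 0..g (0 = outlier);
    logf[i, j-1] = ln f_{gamma_j(l)}(x_i) under the current parameters;
    b[j-1] = minimal admissible size of cluster j."""
    g = logf.shape[1]
    sizes = np.array([(labels == j).sum() for j in range(1, g + 1)])
    outliers = np.where(labels == 0)[0]
    best_gain, best_move = 0.0, None
    for i, li in enumerate(labels):
        if li == 0:
            continue
        shrinkable = sizes[li - 1] > b[li - 1]
        for j in range(1, g + 1):
            size_term = np.log(sizes[j - 1] / sizes[li - 1])
            if shrinkable and j != li:              # move (c): i goes from li to j
                gain = size_term + logf[i, j - 1] - logf[i, li - 1]
                if gain > best_gain:
                    best_gain, best_move = gain, ("c", i, j)
            if shrinkable or j == li:               # move (o): swap i with an outlier k
                for k in outliers:
                    gain = size_term + logf[k, j - 1] - logf[i, li - 1]
                    if gain > best_gain:
                        best_gain, best_move = gain, ("o", i, k, j)
    return best_move                                # None means "stop"
```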
If the difference is strictly positive for some admissible move, then the proposition asserts that the move improves the MAP–criterion (4). For the ML–criterion (3), the term involving the sizes is dropped. This leads to the following reduction step.

The single–point reduction step
// Input: an admissible labeling l;
// Output: an admissible labeling l_new with smaller criterion, or the response "stop."
(1) compute the parameters γ_j(l) w.r.t. l, using update formulae if possible;
(2) determine the (admissible) move (c) or (o) with maximal u^(c) or u^(o);
(3) if this maximum is > 0, then update l accordingly and return the updated labeling; else respond "stop."

The proposition also suggests the following, often more efficient, multi–point reduction step for improving the MAP–criterion: Given a labeling l, compute the numbers n_j(l) and the parameters γ_j(l) and assign each observation x_i to the cluster l_new,i = j for which the quantity ln n_j(l) + ln f_{γ_j(l)}(x_i) (or f_{γ_j(l)}(x_i)) is maximal. If the sum of the largest r of these numbers exceeds the corresponding sum for the original labeling, then the proposition assures us that the new labeling l_new has a smaller MAP–criterion (4) (ML–criterion (3)) if it is admissible. This is the main scheme of an efficient estimation procedure. Its advantage over a single–point reduction step is that it allows the whole data set to be explored without updating the parameters after each successful reassignment of a single point. Since (n_j(l)/r) f_{γ_j(l)}(x_i) is just the posterior density, given the parameters of the current labeling, used in discriminant analysis, the step is intuitively appealing: it assigns each observation to the class determined by the MAP– (or ML–)discriminant rule.

Unfortunately, in the present heteroscedastic case the naive procedure just described does not guarantee admissibility of the new assignment, in particular if there is a small cluster or if the data set is small. This is contrary to the homoscedastic case, where clusters of one element do not prevent a clustering from being admissible. Although the naive procedure sometimes works, this fact may render it unstable.
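Before formulating the constrained version, here is a hedged sketch (ours) of the naive multi–point step together with the admissibility check it may fail; for the ML variant the term ln n_j(l) is omitted from the score.

```python
import numpy as np

def naive_multipoint_step(logf, sizes, r, b):
    """Unconstrained MAP reassignment: score[i, j] = ln n_j(l) + ln f_{gamma_j(l)}(x_i).
    Each object gets its best cluster, the n - r worst-scoring objects become
    outliers, and the admissibility of the result is reported."""
    score = np.log(sizes)[None, :] + logf             # shape (n, g)
    best_cluster = score.argmax(axis=1) + 1           # tentative label in 1..g
    order = np.argsort(-score.max(axis=1))            # best-scoring objects first
    labels = np.zeros(logf.shape[0], dtype=int)       # 0 = outlier
    labels[order[:r]] = best_cluster[order[:r]]
    new_sizes = np.array([(labels == j).sum() for j in range(1, logf.shape[1] + 1)])
    return labels, bool(np.all(new_sizes >= b))       # labeling, admissible?
```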
In order to ensure admissibility, we actually have to solve the following constrained optimization problem.

The multi–point optimization problem. For given numbers n_j and parameters γ_j, minimize
\[
\sum_{i:\, l_i \ge 1} \big( -\ln n_{l_i} - \ln f_{\gamma_{l_i}}(x_i) \big)
\]
over all admissible labelings l, i.e., subject to the constraints
\[
\#\{i \in 1..n \mid l_i = j\} \ge b_j, \quad j \in 1..g, \qquad \#\{i \in 1..n \mid l_i = 0\} = n - r.
\]
The natural numbers b_j are defined in such a way as to enable ML–estimation of the parameters of population j; see the discussion of admissibility in Sect. 2. The number r of regular objects must not be smaller than their sum.

There is an exact and at the same time efficient solution to the multi–point optimization problem. It is suitable to first reformulate it in a form common in combinatorial optimization. A labeling l may be represented by a zero–one matrix y of size n × (g + 1) by putting y_{i,j} = 1 if and only if l_i = j, i.e., if l assigns object i to cluster j. A zero–one matrix y is admissible, i.e., corresponds to an admissible labeling, if it satisfies the constraints Σ_j y_{i,j} = 1 for each i (each object has exactly one label), Σ_i y_{i,0} = n − r (there are n − r outliers), and Σ_i y_{i,j} ≥ b_j for all j ≥ 1 (each cluster j contains at least b_j elements).

Using this matrix and defining the "cost" coefficients
\[
u_{i,j} = -\ln n_j(l) - \ln f_{\gamma_j(l)}(x_i), \qquad i \in 1..n,\; j \in 1..g
\]
(the negative posterior log–likelihoods with the parameters replaced by their m.l.e.'s w.r.t. l), we may reformulate the multi–point optimization problem as a binary linear optimization problem, cf. [32], in the following way. Note that Σ_j b_j ≤ r ≤ n.

(BLO)  Minimize Σ_{i=1}^{n} Σ_{j=1}^{g} u_{i,j} y_{i,j} over all y ∈ R^{n×(g+1)} subject to the constraints
    Σ_j y_{i,j} = 1,  i ∈ 1..n,
    Σ_i y_{i,0} = n − r,
    Σ_i y_{i,j} ≥ b_j,  j ∈ 1..g,
    y_{i,j} ∈ {0, 1}.

Although binary linear optimization is NP–hard in general, the present problem has an efficient solution. First, it is easy to construct a solution that satisfies all but the third constraint: just assign object i to argmin_j u_{i,j}. If this solution happens to satisfy also the third constraint, then it is plainly optimal. In the opposite case, the deficient classes j in this solution contain exactly b_j objects in an optimal solution. Indeed, if an originally deficient class j were of size > b_j, then at least one of the forced elements would be free to go to its natural class, thus reducing the target function. We show in Section 3 that the present multi–point optimization problem may be transferred to another one with an efficient solution.

3 Multi–point algorithm and λ–assignment

3.1 Transfer of multi–point optimization to a λ–assignment problem

A linear optimization problem of the form

(TP)  Minimize Σ_{i,j} u_{ij} z_{ij} over z ∈ R^{n×m} subject to the constraints
    Σ_j z_{ij} = a_i,  i ∈ 1..n,
    Σ_i z_{ij} = b_j,  j ∈ 1..m,
    z_{i,j} ≥ 0,

is called a transportation or Hitchcock problem, see [32]. Here, (u_{i,j}) is a real n by m matrix of weights or costs and the a_i and b_j are real numbers ≥ 0 such that Σ a_i = Σ b_j. Plainly, this condition is necessary and sufficient for a solution to exist. Our (BLO) problem is not yet a transportation problem, for two reasons: first, the constraints also contain an inequality and, secondly, the transportation problem is not restricted to binary solutions. We first use a trick to overcome the first problem; fortunately, the second turns out to be a free lunch. Let us introduce an artificial (g + 1)st class. It is meant to accommodate the excess members w.r.t. the constraints ≥ b_j of the g natural classes. Thus, defining the additional cost coefficients
\[
u_{i,g+1} = \min_{j \in 1..g} u_{i,j},
\]
we find the special transportation problem

(λA)  Minimize Σ_{i=1}^{n} Σ_{j=1}^{g+1} u_{i,j} z_{i,j} over z ∈ R^{n×(g+2)} subject to the constraints
    Σ_j z_{i,j} = 1,  i ∈ 1..n,
    Σ_i z_{i,0} = n − r,
    Σ_i z_{i,j} = b_j,  j ∈ 1..g,
    Σ_i z_{i,g+1} = r − Σ_j b_j,
    z_{i,j} ≥ 0,  i ∈ 1..n, j ∈ 0..(g + 1).

Note that the "supplies" a_i of all its n objects i are unitary. Such a transportation problem is often called a λ–assignment problem, λ ∈ N^{g+2} denoting the array of the "demands" n − r, b_j, and r − Σ b_j. Weights u_{i,0} of the outliers are not needed; they may be put to any constant value if necessary, e.g. 0. A matrix z that satisfies the constraints of (λA) will be called feasible. Contrary to an admissible solution, a feasible solution is defined by equalities. The assignment matrix z is illustrated in Fig. 1. Note that the hypothesis on the sums of the right sides of the constraints is satisfied, so that (λA) possesses an optimal solution z. Now, the constraints of (λA) are integral. By the Integral Circulation Theorem ([26], Theorem 12.1, or [8], Theorem 4.5), we may assume that z is integral. The first constraint then implies that it is even binary, representing an assignment.
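As an illustration of this structure (our own sketch, not the authors' implementation), (λA) can be handed to any off-the-shelf LP solver; by the integrality just discussed, an optimal vertex of the linear relaxation is already binary. The helper below uses scipy's linprog and assumes that the cost matrix u for the real classes has been precomputed.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lambda_assignment(u, b, r):
    """Solve (lambda-A) for the (n, g) cost matrix u, minimal sizes b and
    r regular objects.  Returns labels in 0..g (0 = outlier); objects parked
    in the artificial class g+1 are moved back to their cheapest real class,
    as in step (2) of the multi-point reduction step."""
    n, g = u.shape
    cols = g + 2                                      # classes 0, 1..g, g+1
    cost = np.zeros((n, cols))
    cost[:, 1:g + 1] = u
    cost[:, g + 1] = u.min(axis=1)                    # artificial class
    A_eq, b_eq = [], []
    for i in range(n):                                # unit supply of object i
        row = np.zeros(n * cols)
        row[i * cols:(i + 1) * cols] = 1
        A_eq.append(row)
        b_eq.append(1.0)
    demand = [n - r] + [float(x) for x in b] + [r - float(np.sum(b))]
    for j in range(cols):                             # prescribed class sizes
        col = np.zeros(n * cols)
        col[j::cols] = 1
        A_eq.append(col)
        b_eq.append(demand[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, 1), method="highs")
    z = res.x.reshape(n, cols).round().astype(int)
    labels = z.argmax(axis=1)
    excess = labels == g + 1
    labels[excess] = u[excess].argmin(axis=1) + 1     # cheapest real class
    return labels
```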
[Figure 1 about here. It shows the n × (g + 2) table of the assignment variables z_{i,j}, i ∈ 1..n, j ∈ 0..(g + 1): each row sums to 1, column 0 sums to n − r, column j ∈ 1..g sums to b_j, and column g + 1 sums to r − Σ_j b_j.]

Figure 1: Table of assignments of objects to classes in the transportation problem associated with the linear optimization problem.

With the optimal solution to (λA) we associate an optimal solution to (BLO). Define the matrix y ∈ R^{n×(g+1)} by moving, in each line i, the excess members collected in cluster g + 1 to the class which minimizes u_{i,j}, j ∈ 1..g. We claim that y optimizes (BLO). Indeed, let z be admissible for (BLO). Define a feasible matrix z′ by moving excess members i from the classes to the artificial class g + 1. By definition of u_{i,g+1}, the value of z w.r.t. (BLO) is at least that of z′ w.r.t. (λA) and, by optimality, the latter value is at least that of the optimal solution of (λA) which, again by definition of u_{i,g+1}, equals that of y w.r.t. (BLO). Hence, z is no better than y and y is an optimal solution to the original problem (BLO).

Optimal solutions to (λA) and (BLO) are actually equivalent. Given an optimal solution to (BLO), move any excess elements assigned to classes 1..g to class g + 1. Note that any class that contains excess elements contains no forced elements, since those could choose a cheaper class. Therefore, the new assignment creates the same total cost.

Thus, the following multi–point reduction step reduces the criterion. It uses (λA), and the right sides b_j appearing there have to be chosen in such a way as to guarantee existence of the m.l.e.'s. (Plainly, the parameter r must satisfy Σ_j b_j ≤ r.)

The multi–point reduction step (λ–assignment version)
// Input: an admissible labeling, its parameters γ_j(l), and its criterion;
// Output: an admissible labeling l_new with its parameters and (smaller) criterion, or the response "stop."
(1) solve the λ–assignment problem (λA) with the weights
      u_{i,j} = −ln n_j(l) − ln f_{γ_j(l)}(x_i),  j ∈ 1..g,  i ∈ 1..n,
      u_{i,g+1} = min_{j∈1..g} u_{i,j};
(2) move each object in cluster g + 1 to the cluster where it causes minimal cost;
(3) compute parameters and criterion of the solution to the clustering problem thus obtained;
(4) if the criterion has improved, then return it together with this solution and its parameters; else respond "stop."

A reduction step is the combination of parameter estimation and discriminant analysis. In this sense, the Hitchcock version of the multi–point reduction step may be viewed as a combination of parameter estimation and constrained discriminant analysis. The transportation problem in connection with constrained discriminant analysis appears already in Tso and Graham [47]. These authors use the constraints in automatic chromosome classification to ensure the correct number of chromosomes of a class in a biological cell. We learnt about their work from Tso et al. [48], where Tso and Graham's method is extended to karyotyping. Another application of λ–assignment to clustering appears in Aurenhammer et al. [4], where a constrained least–squares assignment problem akin to k–means with fixed cluster centers and sizes is considered.

3.2 Algorithms for the λ–assignment problem

The multi–point reduction step requires the solution of a λ–assignment problem. Since reduction steps are executed many times during a program run, a few words are in order about known algorithms for its solution and about their complexities. Both have been studied extensively in operations research and combinatorial optimization.
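To make the network-flow view discussed in the following paragraphs concrete, here is a hedged sketch (ours) that states (λA) as a minimum-cost-flow instance and solves it with the network simplex routine of the networkx library; costs are scaled and rounded to integers because that routine expects integral data.

```python
import networkx as nx
import numpy as np

def lambda_assignment_flow(u, b, r, scale=10**6):
    """(lambda-A) as min-cost flow: every object is a supply node of one unit,
    every class a demand node (n - r, b_1..b_g, r - sum(b)); unit-capacity
    edges carry the rounded costs.  Returns labels in 0..g (0 = outlier)."""
    n, g = u.shape
    G = nx.DiGraph()
    for i in range(n):
        G.add_node(("obj", i), demand=-1)                 # one unit of supply
    class_demand = [n - r] + [int(x) for x in b] + [r - int(np.sum(b))]
    for j, d in enumerate(class_demand):
        G.add_node(("cls", j), demand=d)                  # prescribed class size
    c_art = u.min(axis=1)                                 # artificial class g+1
    for i in range(n):
        G.add_edge(("obj", i), ("cls", 0), weight=0, capacity=1)
        for j in range(1, g + 1):
            G.add_edge(("obj", i), ("cls", j),
                       weight=int(round(scale * u[i, j - 1])), capacity=1)
        G.add_edge(("obj", i), ("cls", g + 1),
                   weight=int(round(scale * c_art[i])), capacity=1)
    _, flow = nx.network_simplex(G)
    labels = np.zeros(n, dtype=int)
    for i in range(n):
        j = next(cls for (_, cls), f in flow[("obj", i)].items() if f == 1)
        if j == g + 1:                                    # excess member
            labels[i] = int(u[i].argmin()) + 1            # cheapest real class
        elif j >= 1:
            labels[i] = j
    return labels
```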
As a linear optimization problem, λ–assignment may be solved by the simplex method. In fact, it is a special transportation problem which, in turn, is a special case of, and surprisingly [32] equivalent to, a minimum– cost flow problem. An instance of the latter problem is specified by a directed graph, net flows in its nodes, and arc capacities and costs. The aim is to determine a flow with minimum overall cost satisfying the required net flows in all nodes without violating the given capacities. Classical adaptations of the simplex method tailored to the particularities of min–cost flow are the network simplex method, [40] and primal–dual algorithms, cf. Cook et al. [8], 4.2 and 4.3, such as the out–of–kilter method [14, 26]. The network simplex method seems to be most popular in applications although these algorithms are exponential in the worst case. Polynomial algorithms for minimum–cost flow have also existed for some time. It is common to represent the cost matrix as the adjacency matrix of a weighted graph (V, E) with node set V of size n and edge set E of size m. A polynomial example is Orlin’s [31] O n(m+n log n) log n algorithm for the uncapacitated min–cost flow problem. Other algorithms use the concept of scaling introduced by Edmonds and Karp [9] who used it to produce the first polynomial algorithm for the general min–cost flow problem. Scaling solves a series of subproblems with approximated instance parameters, capacities or costs or both. Edmonds and Karp showed with their cost–scaling algorithm that min–cost flow was a low complexity problem. However, opinions about the usefulness of scaling in programs are mixed, see the comments in [19], Sec. 9 and [15], Sec. 4. In the context of λ–assignment the capacities may be defined as 1 and capacity scaling becomes trivial although the algorithms may of course be applied. An example is Gabow and Tarjan’s [15] O n(m + n log n) log U algorithm, log U being the bit 10 length used to represent network capacities. In the case of λ–assignment, log U = 1 so that the complexities of Orlin’s and Gabow and Tarjan’s algorithms are equal in this case. Now, the λ–assignment problem has some special features that reduce its complexity compared with the general min–cost flow problem. First, its graph is bipartite, its node set being partitioned in two subsets, the objects i and the clusters j, such that each edge connects an object with a cluster. For bipartite network flow algorithms it is often possible to sharpen the complexity bounds by using the sizes of the smaller (k) and the larger (n) node subsets as parameters, Gusfield, Martel, and Fernadez-Baca [21]. Second, the bipartite network is unbalanced in the sense that the number of clusters is (at least in our case) much smaller than the number of objects. Bounds based on the two subsets are particularly effective for unbalanced networks. Third, all row sums of the assignment matrix are equal, and equal to 1 which implies that the capacities may be chosen unitary. As a consequence the algorithms for the lambda assignment problem mentioned at the beginning become polynomial. E.g., the time complexity of the out–of–kilter method becomes O(nm) = O(kn 2 ) since m = kn here, see [26], p. 157. Algorithms dedicated to the λ–assignment problem are due to Kleinschmidt et al. [24] and Achatz et al. [1]. These authors propose algorithms with asymptotic run time O kn2 . 
Both draw from Balinski’s work on the Hirsch conjecture about convex polytopes and relate their work to an algorithm of Hung and Rom [22]. The algorithms mentioned so far remain at least quadratic in the size of the larger node set, n. By contrast, two cost–scaling min–cost flow algorithms are linear in n: Goldberg and Tarjan’s [19], Theorem 6.5, algorithm solves the min–cost flow problem on a bipartite network asymptotically in O k 2 n log(kC) time, see Ahuja et al. [2], and the “two–edge push” algorithm of the latter authors needs O (km + k 3 ) log(kC) time. In our application, m = kn. Both estimates contain the bit length, log C, for representing costs. A different low–complexity approach is due to Tokuyama and Nakano [44, 46, 45]. These authors state and prove a geometric characterization of optimal solutions to the λ–assignment and transportation problems by a so–called splitter, a k–vector that partitions the Euclidean k–space into k closed cones. The corresponding subdivision of the lines of the cost matrix describes an optimal assignment. Tokuyama and Nakano design a deterministic and a ran- √ domized strategy for splitter finding that run in O(k 2 n ln n) time and O kn + k 5/2 n ln3/2 n expected time, respectively. Their algorithms are almost linear in n and close to the absolute lower bound O kn if k is small, the case of interest for (λA). The disadvantage of the geometric method is that the whole procedure is activated even if the unconstrained assignment automatically produces a feasible (and hence optimal) solution. 3.3 Heuristic feasible solutions Besides exact solutions, there are reasons to say a word about heuristic feasible solutions to the λ–assignment problem. First, they may be used for “multipoint reduction steps” on their own. Moreover, some of the graphical methods presented in Sect. 3.2 need an initial feasible solution for starting or profit from a good one. The network simplex method, e.g., needs a primal feasible solution and the method presented in [1] needs a dual feasible solution. While arbitrary feasible solutions are easily produced, good initial feasible solutions can be 11 constructed by means of greedy heuristics. We propose here two. If the bounds b j are small enough, the heuristics often produce even optimal solutions. Each reduction step receives a set of parameters γ j from which all weights ui,j , i ∈ 1..n, j ∈ 1..g, are computed. The first two heuristics construct primal feasible solutions. Both start from the best unconstrained assignment of the clustering problem which can be easily attained by sorting the numbers ui = minj ui,j . More precisely: Basic primal heuristic (1) sort the numbers ui = minj ui,j upward, i ∈ 1..n; (2) attach label 0 to the last n − r objects in the ordered list; (3) assign each of the remaining r objects to the class 1..g where the minimum is attained; (4) if this assignment is admissible then stop (it is optimal); else (α) starting from element n−r in the ordered list and going downwards, assign surplus elements to arbitrary deficient classes until they contain exactly b j elements; (β) assign the largest remaining surplus elements in the classes to class g + 1. If an admissible instead of a feasible solution is required, only, then we drop step (4)(β). We next refine the “else” part of step (4). Steps (1), (2), and (3) are as before. 
Refined primal heuristic
(1) sort the numbers u_i = min_j u_{i,j} upward, i ∈ 1..n;
(2) denote the set of the last n − r objects in the ordered list by L;
(3) assign each of the remaining r objects to the class in 1..g where the minimum is attained;
(4) if this assignment is admissible, then stop (it is optimal); else
  (α) let D be the set of deficient classes in 1..g and let δ be their total deficiency;
  (β) starting from element n − r in the ordered list and going downwards, move the first δ elements in surplus classes to L;
  (γ) sort the object–class pairs (i, j) ∈ L × D upward according to their weights u_{i,j} to obtain an array (i_1, j_1), (i_2, j_2), …, (i_{#L·#D}, j_{#L·#D});
  (δ) scan all pairs (i_k, j_k) in this list starting from k = 1, assigning object i_k to class j_k unless i_k already appears in some class or j_k is saturated, until all classes are saturated;
  (ε) assign the yet unassigned elements in L to the outlier class;
  (ζ) assign the largest remaining surplus elements in the classes to class g + 1.

In Sect. 3.1, we exploited the fact that any cluster j with more than b_j members in an optimal solution contains no forced elements. Plainly, both heuristics share this property, since such members would be free to change classes. Both heuristics are much faster than any of the exact solution algorithms while often yielding a low value of the criterion, the refined heuristic a lower one than the basic. However, contrary to λ–assignment, the reductions in the sense of Proposition 2.1 they provide are not optimal, in general. In most cases the criterion decreases, although one may construct examples where this fails: Consider the data set {−40, −8, −6, 0, 1, 2, 3, 40}, let g = 4 and b_1 = b_2 = b_3 = b_4 = 2, and assume that there are no outliers. Suppose that the parameters are generated from the initial partition C_1 = {−40, 3}, C_2 = {−8, 1}, C_3 = {−6, 0}, C_4 = {2, 40}. They are m_1 = −18.5, m_2 = −3.5, m_3 = −3.0, m_4 = 21.0 and v_1 = 462.25, v_2 = 20.25, v_3 = 9.0, v_4 = 361.0. The matrix of weights (rows corresponding to the clusters 1..4, columns to the data points in increasing order) is
\[
(u_{i,j}) = \begin{pmatrix}
9.91 & 9.15 & 9.25 & 9.65 & 9.73 & 9.82 & 9.91 & 16.31 \\
71.57 & 6.78 & 6.09 & 6.39 & 6.78 & 7.27 & 7.87 & 99.23 \\
157.08 & 7.75 & 5.97 & 5.97 & 6.75 & 7.75 & 8.97 & 210.41 \\
18.97 & 10.99 & 10.68 & 9.88 & 9.77 & 9.66 & 9.56 & 9.66
\end{pmatrix}^{T}.
\]
It generates the free partition {−40}, {−8, 2, 3}, {−6, 0, 1}, {40}, and the second heuristic modifies it to {−40, 1}, {−8, 2}, {−6, 0}, {3, 40}. The score of the latter is 39.73, whereas the initial partition has the lower score 39.67.

The problem dual to (λA) reads

(λA*)  Maximize Σ_{i=1}^{n} p_i + Σ_{j=0}^{g+1} b_j q_j over (p, q) ∈ R^{n+g+2} subject to the constraints
    p_i + q_j ≤ u_{i,j},  i ∈ 1..n, j ∈ 0..(g + 1)

(here b_0 = n − r and b_{g+1} = r − Σ_{j=1}^{g} b_j denote the remaining demands). Let us find the best dual feasible solution with p_i = α = constant. Since the target function may then be written Σ_j b_j (α + q_j), and since the constraints read α + q_j ≤ u_{i,j}, we may as well assume α = 0, obtaining a simple dual feasible solution.

Dual heuristic:  p_i = 0 and q_j = min_i u_{i,j}.

3.4 Overall algorithms

Single– and multi–point reduction steps both receive a labeling and improve it according to Proposition 2.1. Their iteration thus gradually decreases the criterion. Since there are only finitely many labelings, the iteration must stall after a finite number of steps with the "stop" signal. We have then reached one approximation to the requested partition. It is self–consistent in the sense that it reproduces its parental parameters. While a solution that optimizes the criterion shares this property, the solution obtained from the iteration is not optimal w.r.t. the criterion, in general.
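A minimal driver (our sketch; `reduction_step` is a placeholder for any of the single–point, exact multi–point, or heuristic steps above and is assumed to return None when it cannot improve) showing the iteration just described:

```python
def iterate_reduction_steps(labels, reduction_step, max_iter=10000):
    """Iterate a reduction step until it responds 'stop' (returns None).
    The final labeling is self-consistent: it reproduces its own parameters."""
    for _ in range(max_iter):
        improved = reduction_step(labels)
        if improved is None:
            break
        labels = improved
    return labels
```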
In fact, clustering is known to be NP–hard, see Garey and Johnson [18]. Therefore, we have to employ the multistart method in order to achieve a value of the criterion as low as possible in the time available. Of course, the number of repetitions that lead to a solution near the optimum depends heavily on the size of the data set, on the parameters g and r, and on the statistical model chosen. Although the heuristic reduction step does not always improve the criterion, it may be used in a similar way. However, it does not necessarily terminate with a self–consistent solution. Iteration of single– or (heuristic or exact) multi–point reduction steps thus gives rise to three optimization methods for fixed numbers of clusters and outliers. The multi–point algorithm is superior to the single–point algorithm in speed. In difficult situations, it is useful to refine the result of a multi–point search by single–point steps. The advantage of the constrained multi–point reduction step presented here is its ability to produce a reasonable solution even if a cluster tends towards becoming too small by the end of an iteration.

Each series of reduction steps needs a first solution to begin with. We generate an unbiased partition of the whole data set in two parts, one of size r, the "regular" elements, and its complement. There exists a simple but sophisticated algorithm that accomplishes just this in one sweep through the data, see Knuth [25], p. 136 ff. Moreover, we assign the objects of the part of size r to g clusters in an unbiased way. The clusters must be large enough to allow estimation of their parameters. In general, this requirement does not pose a problem unless g is large. In this way, the clusters will be of about equal size with very high probability. It follows that the entropy of the mixing rates is large at the beginning, making it easy to fulfill the condition of Proposition 2.1(a).

4 Numbers of clusters and outliers

An important part of a clustering process is the estimation of the number of clusters and outliers. It is generally accepted that there is no rigorous definition of the notion of an "outlier," as there is none of a "cluster," either. What is more, the notions of outlier and cluster overlap. In the present heteroscedastic model, outliers either originate from a rare class, so that a number of elements sufficient for m.l. parameter estimation has not been observed, or they constitute a set whose structure, e.g. shape, is not in harmony with the posited populations. In the first case, the outliers may find some extreme cluster members that complement them to a sufficiently large cluster. Sometimes, a sufficient number of "outliers" may give rise to a cluster, and it may sometimes even be reasonable to assume the absence of outliers. In this way, outliers may be traded for clusters. The model often accommodates also clusters whose structures do not conform with its populations. Therefore, in both cases, the numbers of clusters and outliers are not uniquely determined and are subject to interpretation.

If the two numbers are roughly known, then postulating numbers of clusters and outliers that exceed the actual ones to some extent may be tolerated. As the effect of a larger number of assumed outliers, each cluster just loses a number of extreme members and the parameters are not affected too much. It is therefore justified to use a lower bound on the number of regular objects as the number r; see also Rousseeuw and Van Driessen [37], Section 5.
If one assumes one or two more clusters than there actually are then clusters of minimal size bj are split off by the MAP method but the estimated parameters of the other clusters remain essentially unchanged. In this case, it is sufficient to apply the estimator with one suitable pair (g, r). It must however be kept in mind that it is harmful to choose g too small or r too large. The parameter estimates may break down through the influence of a single remaining gross outlier or one missing class. 14 In practical applications both the numbers of clusters and of outliers are often not even roughly known and have to be estimated. A large amount of literature has been and is still devoted to these subjects which are still considered challenges. So far we have designed a tool that allows to establish optimal clusterings for all pairs (g, r) of numbers of clusters and outliers. This is a substantial reduction of the complexity of the problem but not yet its solution. We need a strategy to select the most promising pair (g, r) or maybe a few of them which should then be more closely analyzed. Concerning the number of clusters, there are essentially three approaches, cf. [30, 20], cluster validation, the so-called elbow criterion, and model selection criteria. Cluster validation may be divided in two branches: tests and validity measures. The classical test, due to Wolfe [50], is a likelihood ratio test for the hypothesis of k clusters against (k − 1) clusters. Bock [7] discusses some significance tests for distinguishing between the hypothesis of a homogeneous population vs. the alternative of heterogeneity. Validity measures are functionals of partitions and usually measure the quality of cluster separation and of cluster compactness; see, e.g., Bezdek et al. [6]. Often, the total within–cluster sum of squared distances about the centroids is used as a measure of compactness and the total between–cluster sum of squared distances for separation; cf. Milligan and Cooper [30] and the abridged presentation of their paper by Gordon [20]. The elbow criterion identifies the number of clusters as the location where the decrease of some cluster criterion flattens markedly. For a recent refinement of this method we refer the reader to Tibshirani et al. [43]. Maximum likelihood estimation prefers a large number of clusters. A model selection criterion hampers this tendency by adding to the minimum of the negative log–likelihood or of the negative posterior log–density a penalty term. A popular model selection criterion is Schwarz’ Bayesian Information Criterion, BIC. Although originally proposed for exponential families, there is some theoretical and practical evidence that supports it as a means for estimating the number of clusters of a clustering model, too; see the discussion in McLachlan and Peel [29], Ch. 6. According to Leroux [27], BIC does not underestimate the true number of components, asymptotically. However, most model selection criteria tend towards overestimating this number accepting solutions with small spurious clusters. In the present case, BIC is obtained from the minimum in the ML (3) or the MAP criterion (4) by adding the penalty term 2p · ln r, p being the total dimension of the parametric model. In the MAP case, p is g − 1 (for the mixture rates) plus the sum of the dimensions of the g cluster populations. We therefore take a different approach with the MAP–method. 
It determines the number of classes and, at the same time, the number of outliers if the populations are normal, and it is based on the following two observations:
(i) If, for fixed r, the assumed number of clusters is increased beginning with the "true" value, then MAP splits off a small cluster (b_j or approximately b_j points lying close to some hyperplane); if it is decreased, then no small clusters are created.
(ii) If, for the true g, the assumed number r of regular objects is decreased, then some or all clusters lose some of their members.
The first half of statement (i) cannot be made for the ML–method. The presence of the small cluster is not a particular property of the d + 1 points: if they are removed, then others assume their role.

Positing that the true clustering contains no small cluster, we infer the following algorithm for selecting valid partitions. The integer m(g, r) is the size of the smallest cluster obtained for the optimal MAP partition with parameters g ≥ 1 and r ≤ n.

Selection algorithm
(1) Establish a table of the integers m(g, r) for all reasonable pairs (g, r);
(2) for each r, determine the largest g such that m(j, r) is not small for any j ≤ g and mark the pair (g, r);
(3) in each line, scan the marked pairs beginning with low values of r until a partition is found that satisfies some cluster validity criterion; return the corresponding pair.

It is, of course, sufficient to perform the procedure with a lacunary set of numbers r. Plainly, each column contains some marked pair. The returned pairs are candidates for reasonable choices of g and r. There is at most one returned pair per line in the table, so the complexity of the problem is again largely reduced. If it turns out that no marked pair is valid, then the distributional assumptions are questionable. In order to further reduce the number of possible solutions we use validation. We accept all pairs (g, r) for which none of the g clusters is rejected by a χ² goodness–of–fit test with the categories defined by some quantiles of the squared Mahalanobis distances w.r.t. the estimated parameters, cf. Ritter and Gallegos [34].
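A hedged sketch (ours) of this selection scan together with a per-cluster validation statistic of the kind just described; the names, the threshold `small`, the scan order of r, and the omission of the Yates correction used in Table 3 are our simplifications.

```python
import numpy as np
from scipy.stats import chi2

def chi2_cluster_statistic(C, mean, cov, n_cat=10):
    """Goodness-of-fit statistic for one fitted cluster: categories are the
    n_cat equiprobable classes of the squared Mahalanobis distance, which is
    chi-square with d degrees of freedom under the normal model."""
    d = C.shape[1]
    dev = C - mean
    md2 = np.einsum("ij,jk,ik->i", dev, np.linalg.inv(cov), dev)
    cuts = chi2.ppf(np.linspace(0, 1, n_cat + 1)[1:-1], df=d)
    observed = np.bincount(np.searchsorted(cuts, md2), minlength=n_cat)
    expected = len(C) / n_cat
    return ((observed - expected) ** 2 / expected).sum()

def select_g_r(m, g_values, r_values, passes_validation, small):
    """Selection algorithm of Sect. 4: m[(g, r)] is the minimal cluster size
    of the optimal MAP partition; for each r mark the largest g whose clusters
    all exceed `small`, then scan the marked pairs in the given order of r and
    return the first pair whose partition passes the validation test."""
    marked = {}
    for r in r_values:
        ok = [g for g in sorted(g_values)
              if all(m[(j, r)] > small for j in g_values if j <= g)]
        if ok:
            marked[r] = max(ok)
    for r in r_values:
        if r in marked and passes_validation(marked[r], r):
            return marked[r], r
    return None
```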
                        Population 1        Population 2           Population 3
 Means                  (0, 0, 0)           (−6, 3, 6)             (6, 6, 4)
 Covariance             diagonal             4.0  −3.2  −0.2        4.0  3.2  2.8
 matrices               (9.0, 4.0, 1.0)     −3.2   4.0   0.0        3.2  4.0  2.4
                                            −0.2   0.0   1.0        2.8  2.4  2.0
 Eigenvalues            9.0  4.0  1.0        7.20  1.07  0.73       9.11  0.88  0.016
 Cardinalities          200                  50                     50

Table 1: Structures of the three 3D normal populations from which Data Set A is sampled. The eigenvalues of the covariance matrices are also shown. The data set contains 30 additional "outliers" uniformly distributed on the cube [−15, 15]³.

4.1 Simulation studies

We illustrate the method presented in Section 3 with two contaminated data sets, A and B, and Anderson's [3] well–known four–dimensional Iris Data Set. Data Set A is sampled from the three normal populations specified in Table 1. The corresponding mixture appears in [29], p. 218. To this basic data set we add 30 outliers uniformly distributed in the cube [−15.0, 15.0]³. Data Set B has six features and consists of 100 regular observations, 50 sampled from N_{7e_1, I} and 50 from N_{7e_2, I}, and of two outliers from N_{0, 100I}. Iris consists of 50 observations of each of the three subspecies setosa, versicolor, and virginica, with only few outliers. Its entries being given in the form *.* with only two digits of precision, its observations are not in general position; observations 102 and 143 are even equal. We therefore add a number uniformly distributed in the interval [−0.05, 0.05] to each entry. Moreover, since the entries are positive, we take logarithms.

The MAP clusterings for Data Sets A and B with the known numbers of classes and outliers are essentially the original ones. The estimated parameters for Data Set A and these input parameters are shown on top of Table 4. The original biological partition of Iris is essentially reproduced by the ML–method with three clusters; just two versicolors are considered virginicas and one virginica is falsely assigned to versicolor. The MAP clustering with three clusters unites these two subspecies, identifying only two major clusters and a small spurious cluster of minimal size five.

Data Set A
 g \ n−r     0     5    10    15    20    25    30    35    40    45    50    55    60
   1       330   325   320   315   310   305   300   295   290   285   280   275   270
   2       134    47    56    47    47    47    47    46    46    46    46    46    46
   3        47    47    49    47    47    49*   47*   47*   48*   46*   42*   42*   41*
   4        32*   27*   22*   14*   12*    4     4     4     4     4     4     4     4
   5         4     4     4     4     4     4     4     4     4     4     4     4     4
   6         4     4     4     4     4     4     4     4     4     4     4     4     4
   7         4     4     4     4     4     4     4     4     4     4     4     4     4

Data Set B
 g \ n−r     0     1     2     3     4     6     8    10    12    14    16
   1       102*  101   100    99    98    96    94    92    90    88    86
   2         7    50*   50*   49*   49*   47*   46*   45*   42*   40*   39*
   3         7     7     7     7     7     7     7     7     7     7     7
   4         7     7     7     7     7     7     7     7     7     7     7
   5         7     7     7     7     7     7     7     7     7     7     7

Iris
 g \ n−r     0     5    10    15    20    25    30
   1       150   145   140   135   130   125   120
   2        50*   45*   42*   39*   38*   34*   30*
   3         5     5     5     5     5     5     5
   4         5     5     5     5     5     5     5
   5         5     5     5     5     5     5     5

Table 2: Minimal cluster sizes m(g, r), see Sect. 4, for the three data sets A (top), B (center), and Iris (bottom) w.r.t. MAP. In each column, the starred entry terminates the interval of pairs without small clusters. In each line, the leftmost starred entry which passes the validation test indicates a possible solution to the clustering problem.

Table 2 shows the minimal sizes m(g, r) defined in Sect. 4 for various values of g and n − r and the three data sets. The pairs marked according to Sect. 4(2) are starred. The table nicely illustrates observation (i) in Sect. 4: all entries below the starred entries are d + 1.

In order to determine the number of outliers, we apply a χ²–goodness–of–fit test with the ten circular categories of equal probability 0.1 after sphering the data with the estimated parameters. We reject all solutions whose poorest cluster is rejected at a p–value of 0.1, say. Since there are nine degrees of freedom, this means a value of the χ²–statistic of ≥ 14.7. Two solutions for Data Set A and g = 4 are not rejected, the one devoid of outliers and the one with 15 outliers. The first accepts the outliers as a cluster and is the proposed partition for g = 4. The second solution removes extreme elements in the corners of the cube as outliers and accepts the rest as a cluster. The test also does not reject the solutions with 3 clusters and between 30 and 40 outliers, and so we accept also the one with 30 outliers. Its cluster sizes are 202, 51, and 47. This solution is close to the original partition.

Data Set A, g = 4
 n − r      0     5    10    15    20
         12.5  23.2  21.9  14.2  15.8
         11.0  12.5  12.5  13.2  12.5
         10.7  11.0  11.0  12.1  11.0
          9.3   9.3   9.3   3.5   8.5

Data Set A, g = 3
 n − r     25    30    35    40    45    50    55    60
         20.1  12.5  11.1  11.3  15.9  19.1  22.8  37.7
         14.3  11.6   7.5   7.5  15.0  15.1  19.1  20.8
         10.9  11.0   3.5   3.1   2.5   2.5   2.5   3.7

Iris, g = 2
 n − r      0     5    10    15    20    25    30
         13.3  12.2   8.6  10.6  14.8  19.1  10.8
          4.9   4.9   4.1   3.4  11.5  10.9   8.7

Table 3: Yates–corrected χ²–goodness–of–fit test with nine degrees of freedom for estimating the number of outliers. The rows show the values of the test statistic for the Data Sets A and Iris and clustering results with g clusters and n − r outliers. See also the text.

The marked pairs for Data Set B are displayed in the center of Table 2.
Interestingly, the solution with two clusters and no outliers contains a small cluster so that the solution with one cluster and no outliers is marked. The χ 2 –test, of course, rejects it as it rejects the solution with two clusters and one outlier. It does not reject the solution with two clusters and two outliers with exactly the original clustering. This is not surprising as the two clusters are very well separated and the outliers are sufficiently far away. Data Set B was just chosen in order to demonstrate that the boldface line in Table 2 may also contain downward jumps, contrary to what one might at first conjecture. The Iris Data Set has served for demonstrating the performance of many clustering algorithms, beginning with Fisher [10]. Our MAP–method is not able to identify the correct number of clusters, clearly identifying two. Some authors contend that this is the number of clusters that a geometric method should estimate. We disagree with this opinion. Setosa is completely separated from the two other classes and versicolor and virginica touch but they, too, are fairly well separated. Otherwise, the ML estimator with three classes would assign a good amount of the overlapping to a wrong class which is not the case as we already mentioned. Simulated normal data show the same behavior: if two clusters approach each other there comes a point when ML still estimates the right sizes and clusterings whereas MAP unites the touching clusters. The reason for this misbehavior is, therefore, not a bad choice of the distributional assumptions. The influence of the entropy term in the MAP–criterion (4) rather turns out to be too strong so that the estimates of the sizes break down. The χ 2 –goodness–of–fit test presented on the right of Table 3 indicates that Iris does not contain many outliers, if any. 18 Means (0.09, 0.13, −0.02) 9.1 0.4 3.7 Covariance 0.2 0.0 1.0 matrices Eigenvalues 9.1 3.7 1.0 Cardinalities 202 (−5.94, 2.94, 5.95) 4.0 −2.3 2.7 −0.5 −0.0 1.2 4.0 1.4 1.0 51 Means (0.02, 0.13, −0.04) 2.84, 5.97) (−5.86, 8.7 3.3 −2.3 2.5 0.4 3.8 Covariance 0.1 0.0 0.96 −0.2 0.2 1.0 matrices Eigenvalues 8.7 3.7 1.0 3.3 0.9 0.9 Cardinalities 200 48 Means (0.09, 0.13, −0.02) 9.1 0.4 3.7 Covariance 0.2 0.0 1.0 matrices Eigenvalues 9.1 3.7 1.0 Cardinalities 202 (−5.83, 2.94, 5.90) 3.5 −2.3 2.8 −0.2 0.0 1.0 3.5 1.2 1.0 50 (6.38, 6.40, 4.28) 3.4 2.6 3.0 2.5 2.0 1.8 3.4 1.1 0.02 47 (6.19, 6.21, 4.14) 2.6 2.2 3.2 1.9 1.7 1.4 2.6 1.3 0.02 47 (6.26, 6.37, 4.21) 2.3 1.9 2.7 1.6 1.5 1.2 2.3 1.1 0.02 44 (8.57, 7.90, 6.35) 12.7 9.8 9.5 10.3 9.7 9.9 12.7 2.0 3.1 · 10 −11 4 Table 4: Estimations for Data Set A. Top: estimated parameters with known numbers of clusters and outliers (30). Center: the estimation with the proposed method has 3 clusters and 35 outliers. Bottom: the parameters for the optimal solution with four clusters and 30 outliers. The scatter matrix of the small cluster has a very small eigenvalue thus lying close to some hyperplane. See also the text. References [1] Hans Achatz, Peter Kleinschmidt, and Konstaninos Paparrizos. A dual forest algorithm for the assignment problem. In Peter Gritzmann and Bernd Sturmfels, editors, Applied Geometry and Discrete Mathematics, volume four of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 1–10. American Mathematical Society, Providence, RI, 1991. [2] Ravindra K. Ahuja, James B. Orlin, Clifford Stein, and Robert Endre Tarjan. Improved algorithms for bipartite network flows. SIAM J. Computing, 23:906–933, 1994. [3] E. 
Anderson. The irises of the gaspe peninsula. Bull. Amer. Iris Soc., 59:2–5, 1935. [4] F. Aurenhammer, F. Hoffmann, and B. Aronov. Minkowski–type theorems and least–squares clustering. Algorithmica, 20:61–76, 1998. [5] Thorsten Bernholt and Paul Fischer. The complexity of computing the MCD–estimator. Theor. Comp. Science, 326:383–398, 2004. [6] James C. Bezdek, James Keller, Raghu Krisnapuram, and Nikhil R. Pal. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. The Handbooks of Fuzzy Sets Series. Kluwer, Boston, London, Dordrecht, 1999. [7] Hans-Hermann Bock. On some significance tests in cluster analysis. J. Classification, 2:77–108, 1985. [8] William J. Cook, William H. Cunningham, William R. Pulleyblank, and Alexander Schrijver. Combinatorial Optimization. Wiley, New York etc., 1998. [9] J. Edmonds and R.M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM, 19:248–264, 1972. 19 [10] R.A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188, 1936. [11] Chris Fraley and Adrian E. Raftery. How many clusters? which clustering method? answers via model–based cluster analysis. The Computer Journal, 41:578–588, 1998. [12] Chris Fraley and Adrian E. Raftery. Model–based clustering, discriminant analysis, and density estimation. JASA, 97:611–631, 2002. [13] H.P. Friedman and J. Rubin. On some invariant criteria for grouping data. Journal of the American Statistical Association, 62:1159–1178, 1967. [14] D.R. Fulkerson. An out-of-kilter method for minimal cost flow problems. J. SIAM, 9:18–27, 1961. [15] Harold N. Gabow and Robert Endre Tarjan. Faster scaling algorithms for network problems. SIAM J. Computing, 18:1013–1036, 1989. [16] Marı́a Teresa Gallegos and Gunter Ritter. A robust method for cluster analysis. Annals of Statistics, 33:347–380, 2005. [17] Luis Angel Garcı́a-Escudero, Alfonso Gordaliza, and Carlos Matrán. Trimming tools in exploratory data analysis. Journal of Computational and Graphical Statistics, 12:434–449, 2003. [18] Michael R. Garey and David S. Johnson. Computers and Intractibility. Freeman, San Francisco, 1979. [19] Andrew V. Goldberg and Robert Endre Tarjan. Finding minimum–cost circulations by successive approximation. Math. of OR, 15:430–466, 1990. [20] A. D. Gordon. Classification, volume 82 of Monographs on Statistics and Applied Probability. CRC Press, second edition, 1999. [21] Dan Gusfield, Charles Martel, and David Fernandez-Baca. Fast algorithms for bipartite network flow. SIAM J. Comput., 16:237–251, 1987. [22] Ming S. Hung and Walter O. Rom. Solving the assignment problem by relaxation. Operations Research, 28:969–982, 1980. [23] R.C. Jancey. Multidimensional group analysis. Australian J. Botany, 14:127–130, 1966. [24] Peter Kleinschmidt, Carl W. Lee, and Heinz Schannath. Transportation problems which can be solved by use of Hirsch-paths for the dual problem. Math. Prog., 37:153–168, 1987. [25] Donald E. Knuth. The Art of Computer Programming, volume 2. Addison–Wesley, Reading, Menlo Park, London, Amsterdam, Don Mills, Sydney, 2nd edition, 1981. [26] E.L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York, 1976. [27] B.G. Leroux. Consistent estimation of a mixing distribution. Ann. Stat., 20:1350–1360, 1992. [28] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Probab., pages 281–297, 1967. [29] Geoffrey McLachlan and David Peel. 
Finite Mixture Models. Wiley, New York etc., 2000. [30] G.W. Milligan and M.C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50:159–179, 1985. [31] James B. Orlin. A faster strongly polynomial minimum cost flow algorithm. In Proc. 20th ACM Symp. Theory of Computing, pages 377–387, 1988. [32] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization. Prentice–Hall, Englewood Cliffs, New Jersey, 1982. 20 [33] Christoph Pesch. Eigenschaften des gegenüber Ausreissern robusten MCD-Schätzers und Algorithmen zu seiner Berechnung. PhD thesis, Universität Passau, Fakultät für Mathematik und Informatik, 2000. [34] Gunter Ritter and Marı́a Teresa Gallegos. Outliers in statistical pattern recognition and an application to automatic chromosome classification. Patt. Rec. Lett., 18:525–539, 1997. [35] David M. Rocke and David L. Woodruff. A synthesis of outlier detection and cluster identification. Technical report, University of California, Davis, 1999. http://handel.cipic.ucdavis.edu/∼dmrocke/Synth5.pdf. [36] Peter J. Rousseeuw. Multivariate estimation with high breakdown point. In Wilfried Grossmann, Georg Ch. Pflug, István Vincze, and Wolfgang Wertz, editors, Mathematical Statistics and Applications, volume 8B, pages 283–297. Reidel, Dordrecht–Boston–Lancaster–Tokyo, 1985. [37] Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212–223, 1999. [38] Anne Schroeder. Analyse d’un mélange de distributions de probabilités de même type. Revue de Statistique Appliquée, 24:39–62, 1976. [39] A.J. Scott and M. J. Symons. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387–397, 1971. [40] Robert Sedgewick. Algorithms in C, volume 5 - Graph Algorithms. Addison-Wesley, Boston etc., third edition, 2002. [41] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 4:801–804, 1956. [42] M. J. Symons. Clustering criteria and multivariate normal mixtures. Biometrics, 37:35–43, 1981. [43] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. J. Royal Statistical Society, Series B, 63:411–423, 2001. [44] Takeshi Tokuyama and Jun Nakano. Geometric algorithms for a minimum cost assignment problem. In Proc. 7th ACM Symp. on Computational Geometry, pages 262–271, 1991. [45] Takeshi Tokuyama and Jun Nakano. Efficient algorithms for the Hitchcock transportation problem. SIAM J. Comput., 24:563–578, 1995. [46] Takeshi Tokuyama and Jun Nakano. Geometric algorithms for the minimum cost assignment problem. Random Structures and Algorithms, 6:393–406, 1995. [47] M.K.S. Tso and J. Graham. The transportation algorithm as an aid to chromosome classification. Patt. Rec. Lett., 1:489–496, 1983. [48] M.K.S. Tso, P. Kleinschmidt, I. Mitterreiter, and J. Graham. An efficient transportation algorithm for automatic chromosome karyotyping. Patt. Rec. Lett., 12:117–126, 1991. [49] Joe H. Ward, Jr. Hierarchical grouping to optimize an objective function. J. Am. Statist. Ass., 58:236–244, 1963. [50] J.H. Wolfe. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Res., 5:329–350, 1970. [51] David L. Woodruff and Torsten Reiners. Experiments with, and on, algorithms for maximum likelihood clustering. CSDA, 47:237–252, 2004. 21