A constrained robust proposal for mixture modeling avoiding spurious solutions∗

L.A. García-Escudero, A. Gordaliza, C. Matrán and A. Mayo-Iscar

Departamento de Estadística e Investigación Operativa
Universidad de Valladolid and IMUVA. Valladolid, Spain†
Abstract
The high prevalence of spurious solutions and the disturbing effect of outlying observations in mixture modeling are well-known problems that pose serious difficulties for non-expert practitioners of this kind of model in different applied areas. An approach that combines Trimmed Maximum Likelihood ideas with the imposition of restrictions on the maximization problem is presented and studied in this paper. The proposed methodology is shown to have nice mathematical properties as well as good performance in avoiding the appearance of spurious solutions in a fairly automatic manner.

Keywords: Mixture models, constraints, robustness, trimming, eigenvalue restrictions, maximum likelihood estimation.
1 Introduction
The problem of estimating the parameters in finite mixture models has received a lot of
attention. This interest is mainly due to the flexibility of this kind of model to adapt to different settings, as well as to the existence of feasible EM-type algorithms that provide approximate solutions. However, it is well known that a few outlying observations can have clearly undesirable effects on the fitted mixture. Moreover, in general, the classical maximum-likelihood estimation approach for these models often leads to ill-posed problems because of the unboundedness of the objective function to maximize,
∗ This research is partially supported by the Spanish Ministerio de Ciencia e Innovación, grant MTM2011-28657-C02-01.
† Departamento de Estadística e Investigación Operativa. Facultad de Ciencias. Universidad de Valladolid. 47002, Valladolid. Spain.
which favors the appearance of non-interesting local maximizers and degenerate or spurious
solutions.
Lack of robustness in mixture fitting arises when the sample contains a certain proportion of observations that have been generated by some strange mechanism and do not follow the underlying population model. Moreover, practitioners and users of these models may not be aware of the presence of contaminated data. Thus, robustified versions of the procedures at hand are needed. Just to cite some of the most important ones: the
MCLUST with noise component in Fraley and Raftery (1998), Mixtures of t-distributions
in McLachlan and Peel (2000), the Trimmed Likelihood mixture fitting method in Neykov
et al. (2007), a “Forward Search” approach in Calò (2008), Trimmed ML estimation of contaminated mixtures in Gallegos and Ritter (2009), and the robust improper ML estimator
(RIMLE) introduced in Coretto and Hennig (2010).
Some of these previously cited robust proposals for mixture modeling are based on trimming, which has been shown to be a simple, flexible, powerful and computationally feasible
way to robustify statistical methods in many different problems and different frameworks.
In this work, we will focus on this trimming approach and we concentrate on the problem of
fitting a mixture of $G$ normal components to a given data set $\{x_1, ..., x_n\}$ in $\mathbb{R}^p$ by maximizing a "trimmed mixture likelihood" (see Neykov et al. 2007 and Gallegos and Ritter 2009) defined as
$$\sum_{i=1}^{n} z(x_i) \log\left( \sum_{g=1}^{G} \pi_g \varphi(x_i; \mu_g, \Sigma_g) \right), \tag{1}$$
where $\varphi(\cdot; \mu, \Sigma)$ stands for the probability density function of the $p$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, and where $z$ is a trimming indicator function that tells us whether observation $x_i$ is trimmed off ($z(x_i) = 0$) or not ($z(x_i) = 1$). A fraction $\alpha$ of the observations are allowed to be unassigned or trimmed off by imposing that $\sum_{i=1}^{n} z(x_i) = [n(1-\alpha)]$.
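To make the objective concrete, the following minimal sketch (our own illustration in Python, assuming numpy and scipy are available; not the authors' implementation) evaluates (1) for fixed parameters and a trimming level $\alpha$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def trimmed_mixture_loglik(X, weights, means, covs, alpha):
    """Evaluate (1): keep the [n(1-alpha)] observations with the largest
    mixture density and sum their log mixture densities."""
    n = X.shape[0]
    # D(x_i; theta) = sum_g pi_g * phi(x_i; mu_g, Sigma_g)
    dens = sum(w * multivariate_normal.pdf(X, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))
    kept = np.sort(dens)[n - int(n * (1 - alpha)):]  # untrimmed part
    return np.sum(np.log(kept))
```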
The high prevalence of spurious and undesirable solutions in maximum likelihood mixture modeling is a well-known problem. For instance, many examples were presented and studied in McLachlan and Peel (2000), giving strong evidence of how frequently such spurious solutions appear. A distinctive feature of this problem is that spurious solutions
can even appear when this maximum likelihood methodology is applied to artificial data
sets drawn from the models for which the estimator was specially designed, i.e. data sets
generated from a finite mixture model of known distributions (even without adding any kind
of contamination).
Practitioners and applied researchers in general, from any field of knowledge and who do not necessarily have advanced skills in multivariate statistics, increasingly demand automatic answers to mixture-model estimation problems, answers that should be free from the risk of providing spurious solutions. Moreover, there is a clear need for the implementation
of on-line detection systems able to work in an automatized way, without close supervision
from an expert. In these cases, as the estimation process cannot be carried out as an “ad
hoc” analysis supervised by an expert statistician, it seems much more sensible to try to
design automatized methods that are self-applicable to each data set at hand.
In our opinion, this need can be addressed in a satisfactory way through procedures that incorporate constraints in the maximization process defining the mixture fitting estimation method. These procedures make the estimation problem mathematically well-posed and so automatically avoid, or at least minimize, the risk of spurious solutions appearing. Moreover, a correct statement of the problem from the mathematical point of view also allows the statistical properties of the methodology to be studied. Several proposals following this kind of approach, based on constraining the maximization problem, have been made and studied in the recent literature (see, e.g., Ingrassia and Rocci 2007, García-Escudero et al. 2008 and 2012 and Gallegos and Ritter 2009).
In this work, we introduce and study an estimator obtained by combining the restrictions used in García-Escudero et al. (2008), associated with the robust model-based clustering methodology TCLUST, and the trimmed likelihood methodology introduced in Neykov et al. (2007) in the setting of mixture model estimation. These restrictions play a key role in the proofs of the existence and consistency results. They also serve to prevent the appearance of spurious solutions and to provide different alternative fitted mixtures
depending on the strength of these constraints. To be more specific, one of the main aims of this work is to analyze the prevalence of spurious solutions in (trimmed) maximum likelihood estimation across settings that include different sample sizes, different dimensions of the problem and different amounts of contamination.
The paper is organized as follows. After this brief introduction and motivation section,
Section 2 introduces the proposed methodology and states its main properties, such as existence and consistency under mild conditions. An algorithm for the practical implementation
of this approach is presented in Section 3. Section 4 shows a simulation study carried out in
the normal multivariate mixture setting to illustrate the good performance of the proposed
methodology. Finally, some concluding remarks are presented and discussed in Section 5.
2 Problem statement and theoretical results
Let us assume that the sample $\{x_1, ..., x_n\} \subset \mathbb{R}^p$ is an i.i.d. sample from an underlying distribution $P$. We could ideally assume that $P$ is a mixture of $G$ $p$-variate normal components but, in the results presented, only mild assumptions on the underlying distribution $P$ are required. Given this sample, the proposed approach is based on the
maximization of the trimmed mixture log-likelihood (1), as given in Neykov et al. (2007)
and Gallegos and Ritter (2009), but with an additional constraint:
(ER) Eigenvalues-ratio constraint: $M/m \leq c$, for
$$M = \max_{g=1,...,G} \, \max_{l=1,...,p} \lambda_l(\Sigma_g) \quad \text{and} \quad m = \min_{g=1,...,G} \, \min_{l=1,...,p} \lambda_l(\Sigma_g),$$
with $\lambda_l(\Sigma_g)$, for $g = 1, ..., G$ and $l = 1, ..., p$, being the eigenvalues of the scatter matrices $\Sigma_g$ and $c \geq 1$ being a fixed constant.
This type of constraint simultaneously controls differences between groups and departures
from sphericity by constraining the relative length of the axes of the equidensity ellipsoids
based on ϕ(·; µg , Σg ). The smaller c, the more similarly scattered and spherical the mixture
components should be. In fact, these ellipsoids reduce to balls with the same radius in the
most constrained case c = 1.
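For illustration, verifying whether a candidate set of scatter matrices satisfies (ER) only requires their eigenvalues; a small sketch (our own, assuming numpy):

```python
import numpy as np

def satisfies_er(covs, c):
    """True if the ratio between the largest and the smallest eigenvalue,
    taken over all groups and all coordinates, does not exceed c."""
    eigs = np.concatenate([np.linalg.eigvalsh(S) for S in covs])
    return eigs.max() / eigs.min() <= c
```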
The previously stated empirical problem in (1) admits a theoretical or population counterpart:

Constrained trimmed mixture-fitting problem: Given a probability measure $P$, maximize
$$E_P\left[ z(\cdot) \log\left( \sum_{g=1}^{G} \pi_g \varphi(\cdot; \mu_g, \Sigma_g) \right) \right], \tag{2}$$
in terms of the trimming indicator function $z : \mathbb{R}^p \mapsto \{0, 1\}$ such that $E_P z(\cdot) = 1 - \alpha$ and the parameters $\theta = (\pi_1, ..., \pi_G, \mu_1, ..., \mu_G, \Sigma_1, ..., \Sigma_G)$ corresponding to weights $\pi_g \in [0, 1]$ with $\sum_{g=1}^{G} \pi_g = 1$, location vectors $\mu_g \in \mathbb{R}^p$ and symmetric positive definite $(p \times p)$-matrices $\Sigma_g$ satisfying the (ER) constraint for a fixed constant $c \geq 1$.
The set of θ parameters which obey these conditions is denoted by Θc throughout this
paper.
The trimming indicator function $z$ indicates whether $x \in \mathbb{R}^p$ is trimmed (when $z(x) = 0$) or not (when $z(x) = 1$). In fact, it is easy to see that $z$ can be obtained from $\theta$. Given $\theta = (\pi_1, ..., \pi_G, \mu_1, ..., \mu_G, \Sigma_1, ..., \Sigma_G)$, let us define the functions
$$D_g(x; \theta) = \pi_g \varphi(x; \mu_g, \Sigma_g) \quad \text{and} \quad D(x; \theta) = \sum_{g=1}^{G} D_g(x; \theta).$$
Then,
$$z(x) := z(x; \theta) = I\{x : D(x; \theta) \geq R(\theta; P)\}, \tag{3}$$
with
$$R(\theta; P) = \inf_u \{H(u; \theta, P) \geq \alpha\} \quad \text{for} \quad H(u; \theta, P) = P(D(\cdot; \theta) \leq u).$$
If $P_n$ stands for the empirical measure $P_n = (1/n)\sum_{i=1}^{n} \delta_{x_i}$, we recover the original empirical problem by replacing $P$ by $P_n$. Note that $E_{P_n} z(\cdot; \theta) = 1 - \alpha$ cannot be exactly achieved for some $\alpha$ values, but this familiar fact is not very important in our reasoning.
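In the empirical case, $R(\theta; P_n)$ is just the $[n\alpha]$-th smallest mixture density value, so $z(\cdot; \theta)$ can be recovered as in the following sketch (our own, assuming numpy):

```python
import numpy as np

def trimming_indicator(dens, alpha):
    """dens: array with D(x_i; theta) for i = 1..n; returns z as a 0/1 array."""
    n = len(dens)
    z = np.ones(n)
    # trim off the proportion alpha of points with the smallest mixture density
    z[np.argsort(dens)[:int(n * alpha)]] = 0
    return z
```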
In this section, we give results guaranteeing the existence of solutions for both the empirical and population problems, together with a consistency result of the empirical solutions toward the population one. These two results only require mild assumptions on the underlying distribution $P$. Namely, we require $P$ not to be concentrated on $G$ points after trimming, which would make the trimmed mixture fitting approach completely inappropriate for modeling $P$.
(PR) The distribution P is not concentrated on G points after removing a probability mass
equal to α.
This condition trivially holds for absolutely continuous distributions as well as for empirical measures corresponding to large enough samples drawn from an absolutely continuous
distribution.
We can state the following general existence result:
Proposition 2.1 If (PR) holds for the distribution P , then there exists some θ ∈ Θc such
that the maximum of (2) under (ER) is achieved.
The following consistency result also holds:
Proposition 2.2 Let us assume that (PR) holds for the distribution P which has a strictly
positive density function and that θ0 is the unique maximizer of (2) under (ER). If θn ∈ Θc
denotes a sample version estimator based on the empirical measure Pn , then θn → θ0 almost
surely.
The proofs of these results, which are given in the Appendix, follow arguments similar to those given for the existence and consistency results of the TCLUST method in García-Escudero et al. (2008) and those for the constrained mixture fitting approach presented in García-Escudero et al. (2013).
The presented approach is obviously not affine equivariant due to the type of constraints considered. Although the approach comes closer to affine equivariance for large values of the constant $c$, it is in any case recommended to (robustly) standardize the variables when very different measurement scales are involved.
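For instance, a simple coordinatewise robust standardization (one possible choice, not prescribed by the paper) is the median/MAD transformation:

```python
import numpy as np

def robust_standardize(X):
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent for the normal standard deviation
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    return (X - med) / mad
```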
3 Algorithm
Algorithms for maximizing the trimmed likelihood in (1), without considering the constraint on the scatter matrices, were proposed and justified in Neykov et al. (2007) and Gallegos and Ritter (2009). In this section, we propose a feasible algorithm based on those introduced in García-Escudero et al. (2013) and Fritz et al. (2013), where the scatter matrix constraint is implemented. The precise rationale behind the proposed algorithm can be followed in these last two cited works and will not be detailed here.
The proposed algorithm may be described as follows:
1. Initialization: The algorithm is initialized nstart times by selecting different $\theta^{(0)} = (\pi_1^{(0)}, ..., \pi_G^{(0)}, \mu_1^{(0)}, ..., \mu_G^{(0)}, \Sigma_1^{(0)}, ..., \Sigma_G^{(0)})$. With this in mind, we simply propose to randomly select $G \times (p + 1)$ observations and to accordingly compute $G$ mean centers $\mu_g^{(0)}$ and $G$ scatter matrices $\Sigma_g^{(0)}$ from them. The cluster scatter matrix constraints (to be described in Step 2.2) are applied to these $\Sigma_g^{(0)}$ if needed. Weights $\pi_1^{(0)}, ..., \pi_G^{(0)}$ in the interval $(0, 1)$ and summing up to 1 are randomly chosen.
2. Trimmed EM steps: The following steps are alternately executed until convergence (i.e., $\theta^{(l+1)} = \theta^{(l)}$) or a maximum number of iterations iter.max is reached. The way trimming is done is similar to how "concentration" steps were applied in this framework in Neykov et al. (2007), but here we also impose the eigenvalues-ratio constraint on the scatter matrices. A sketch of the constrained scatter update in Step 2.2 is given after this list.
2.1. E- and C-steps: Using the notation in the previous section, the observations in $I = \{i : D(x_i; \theta^{(l)}) \leq R(\theta^{(l)}; P_n)\}$ are those which are tentatively discarded in this iteration of the algorithm. In other words, if
$$D(x_{(1)}; \theta^{(l)}) \leq D(x_{(2)}; \theta^{(l)}) \leq \cdots \leq D(x_{(n)}; \theta^{(l)}),$$
then $R(\theta^{(l)}; P_n) = D(x_{([n\alpha])}; \theta^{(l)})$, so we are discarding the proportion $\alpha$ of observations with the smallest $D(x_i; \theta^{(l)})$ values.
As in other mixture fitting EM algorithms, we compute posterior probabilities as
$$\tau_g(x_i; \theta^{(l)}) = D_g(x_i; \theta^{(l)}) / D(x_i; \theta^{(l)}), \quad \text{for } i = 1, ..., n.$$
However, unlike in standard EM algorithms, the $\tau_g(x_i; \theta^{(l)})$ values for the discarded observations are modified as
$$\tau_g(x_i; \theta^{(l)}) = 0 \quad \text{for all } g = 1, ..., G, \text{ when } i \in I.$$
2.2. M-step: The parameters are updated as
$$\pi_g^{(l+1)} = \sum_{i=1}^{n} \tau_g(x_i; \theta^{(l)}) \big/ [n(1 - \alpha)]$$
and
$$\mu_g^{(l+1)} = \sum_{i=1}^{n} \tau_g(x_i; \theta^{(l)}) x_i \Big/ \sum_{i=1}^{n} \tau_g(x_i; \theta^{(l)}).$$
The sample covariance matrices are initially updated as
$$T_g = \sum_{i=1}^{n} \tau_g(x_i; \theta^{(l)}) (x_i - \mu_g^{(l+1)})(x_i - \mu_g^{(l+1)})' \Big/ \sum_{i=1}^{n} \tau_g(x_i; \theta^{(l)}).$$
Unfortunately, these matrices may not satisfy the required eigenvalue-ratio constraint. In this case, the singular-value decomposition $T_g = U_g' D_g U_g$ is considered, with $U_g$ being an orthogonal matrix and $D_g = \operatorname{diag}(d_{g1}, d_{g2}, ..., d_{gp})$ a diagonal matrix. The truncated eigenvalues are defined as
$$[d_{gl}]_t = \begin{cases} d_{gl} & \text{if } d_{gl} \in [t, ct] \\ t & \text{if } d_{gl} < t \\ ct & \text{if } d_{gl} > ct \end{cases}$$
with $t$ as some threshold value. The scatter matrices are finally updated as
$$\Sigma_g^{(l+1)} = U_g' D_g^* U_g,$$
with $D_g^* = \operatorname{diag}\big([d_{g1}]_{t_{\mathrm{opt}}}, [d_{g2}]_{t_{\mathrm{opt}}}, ..., [d_{gp}]_{t_{\mathrm{opt}}}\big)$ and $t_{\mathrm{opt}}$ minimizing
$$t \mapsto \sum_{g=1}^{G} \pi_g^{(l+1)} \sum_{l=1}^{p} \left( \log([d_{gl}]_t) + \frac{d_{gl}}{[d_{gl}]_t} \right). \tag{4}$$
3. Evaluate target function: After applying the trimmed EM steps, the associated value of the target function is computed (we set $z(x_i) = 0$ if $i \in I$ and $z(x_i) = 1$ if $i \notin I$). The set of parameters yielding the highest value of this target function and the associated trimming indicator function $z$ are returned as the algorithm's final output.

Proposition 3.2 in Fritz et al. (2013) shows that $t_{\mathrm{opt}}$ can be obtained by evaluating the real-valued function in (4) only $2pG + 1$ times.
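The following sketch (our own simplified Python version, assuming numpy; Fritz et al. 2013 give the exact procedure with only $2pG + 1$ evaluations) implements the truncation $[d_{gl}]_t$ and a search for $t_{\mathrm{opt}}$. It uses the fact that, between consecutive "knots" $d_{gl}$ and $d_{gl}/c$, the truncation pattern in (4) is fixed and the stationary point of (4) has a closed form:

```python
import numpy as np

def truncate(d, t, c):
    """[d]_t as defined above: clip each eigenvalue into [t, c*t]."""
    return np.clip(d, t, c * t)

def objective(d, pi, t, c):
    """The function in (4): sum_g pi_g * sum_l (log([d_gl]_t) + d_gl/[d_gl]_t).
    d: (G, p) array of eigenvalues; pi: (G,) array of weights."""
    dt = truncate(d, t, c)
    return np.sum(pi[:, None] * (np.log(dt) + d / dt))

def optimal_truncation(d, pi, c):
    """Return the truncated eigenvalues and (an approximation of) t_opt."""
    dd = d.ravel()
    w = np.repeat(pi, d.shape[1])        # weight pi_g attached to each d_gl
    knots = np.sort(np.unique(np.concatenate([dd, dd / c])))
    cands = list(knots)
    for lo, hi in zip(knots[:-1], knots[1:]):
        mid = 0.5 * (lo + hi)            # truncation pattern is constant on (lo, hi)
        below = dd < mid                 # eigenvalues that would be raised to t
        above = dd > c * mid             # eigenvalues that would be lowered to c*t
        denom = w[below].sum() + w[above].sum()
        if denom > 0:
            # stationary point of (4) for this fixed truncation pattern
            t_star = (w[below] @ dd[below] + (w[above] @ dd[above]) / c) / denom
            cands.append(np.clip(t_star, lo, hi))
    t_opt = min(cands, key=lambda t: objective(d, pi, t, c))
    return truncate(d, t_opt, c), t_opt
```

Once the truncated eigenvalues are available, each $\Sigma_g^{(l+1)}$ is recomposed from the decomposition of $T_g$ as described in Step 2.2.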
4 Simulation study
A simulation study has been carried out to test the effectiveness of the proposed methodology in avoiding spurious maximizers in ML estimation of mixture models and in giving solutions in accordance with the underlying population model.
The simulation scheme is as follows:
• Population mixture model. The underlying original population mixture model $P$ is made of two normal components in dimension $p = 2$ with density
$$0.5 \cdot N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) + 0.5 \cdot N_2\!\left( \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 1.49 & 0 \\ 0 & 1 \end{pmatrix} \right). \tag{5}$$
Higher-dimension situations are generated by adding independent identically distributed standard normal variables in the additional coordinates. In other words, in all cases, there are just two informative coordinates identifying the components of the mixture and the remaining coordinates are merely noise (a data-generating sketch is given after this list).
• Dimension. The simulation scheme includes dimensions p = 2, 6 and 10.
• Contamination. Three different scenarios are considered. S1: The first one corresponds
to the described mixture model without contamination. S2: The second includes 10%
contamination placed far away from the two mixture components and very concentrated
around the point (17, 12) in the p = 2 case and around the point (17, 12, 0, ..., 0) in
higher dimensional cases. S3: In the third scenario, we add 10% background noise
uniformly distributed on a hyperrectangle containing all genuine observations.
• Sample size. We consider sample sizes equal to $n = 100$ and $n = 200$. Higher sample sizes were also tried but are not reported here, because no relevant additional findings were observed.
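The following sketch (our own, assuming numpy) generates data under this scheme. The paper describes the S2 contamination only as "very concentrated" around the stated point, so the 0.1 dispersion used here is an assumption:

```python
import numpy as np

def simulate(n, p, scenario, seed=0):
    """Generate scenario 'S1', 'S2' or 'S3' in dimension p >= 2."""
    rng = np.random.default_rng(seed)
    n_clean = n if scenario == "S1" else int(0.9 * n)
    g = rng.integers(0, 2, n_clean)                 # component labels, weights 0.5/0.5
    means = np.array([[0.0, 0.0], [3.0, 4.0]])
    sds = np.array([[1.0, 1.0], [np.sqrt(1.49), 1.0]])
    X = rng.normal(means[g], sds[g])                # two informative coordinates, model (5)
    X = np.hstack([X, rng.normal(size=(n_clean, p - 2))])   # pure-noise coordinates
    if scenario == "S2":    # 10% tight contamination near (17, 12, 0, ..., 0)
        out = np.zeros((n - n_clean, p))
        out[:, :2] = [17.0, 12.0]
        X = np.vstack([X, out + 0.1 * rng.normal(size=out.shape)])  # assumed dispersion
    elif scenario == "S3":  # 10% uniform background noise on a bounding box
        lo, hi = X.min(axis=0), X.max(axis=0)
        X = np.vstack([X, rng.uniform(lo, hi, size=(n - n_clean, p))])
    return X
```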
In all situations obtained by varying the above factors, we perform the estimation process
through the proposed methodology, assuming that the true number of components of the
mixture G = 2 is known, and varying the trimming level α and the strength of the restrictions
through the constant c in the following way:
• Trimming level. We include the results corresponding to the untrimmed case (α = 0)
and a trimmed solution using a trimming level α = 0.2, which should be sufficient to
deal with a 10% contamination added in all the scenarios.
• Restriction factor. We consider the procedure under three different values of the restriction factor: $c = 4$, a low restriction factor (the length of the longest axis cannot exceed twice that of the shortest); $c = 100$ (the length of the longest axis cannot exceed ten times that of the shortest); and $c = \infty$, i.e., the unrestricted ML estimator.
In all scenarios, the algorithm is initialized nstart = 1000 times and the performance of
the proposed methodology is evaluated through two measurements:
• High Concordance Index (HCI). This measurement indicates the proportion of initializations that lead to solutions having at least 90% concordance in the assignments with the "true" solution. We say that the "true" solution is achieved if the non-trimmed observations are classified very closely to how the classification would be done if the mixture model (5) were known in advance. The classification is made by assigning each observation to the component that gives it the largest posterior probability. The concordance is calculated through the Adjusted Rand Index in Hubert and Arabie (1985) (see the sketch after this list). This index is computed using only the subset of non-contaminated and untrimmed observations.
• Prevalence of Spurious Solutions (PSS). In this case we record the proportion of initializations that lead to spurious solutions with higher likelihood than the highest likelihood obtained with the "true" solutions. Although they have larger likelihoods, they do not correspond to the mixtures we are actually looking for.
Therefore, the interest is in having HCI strictly greater than 0 (i.e., giving the algorithm the chance of detecting the true solution throughout the random initializations) and PSS equal to 0 (i.e., avoiding spurious solutions, with larger likelihood values than the "true" solution, which the algorithm would eventually prefer).
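As an illustration of the concordance check behind HCI, a sketch assuming scikit-learn is available for the Adjusted Rand Index (the 90% cutoff is applied to the ARI computed on the non-contaminated, untrimmed observations, as described above):

```python
from sklearn.metrics import adjusted_rand_score

def concordant(true_labels, fitted_labels, threshold=0.9):
    """True if the fitted classification reaches at least 90% concordance
    with the 'true' classification; HCI is the proportion of random
    initializations for which this holds."""
    return adjusted_rand_score(true_labels, fitted_labels) >= threshold
```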
Table 1 shows the results of the simulation study. The appearance of spurious solutions generally increases with the dimension in all scenarios. The problem is especially dramatic in the highest dimension case $p = 10$, where many spurious solutions are easily found. Low sample sizes favor the appearance of spurious solutions, while their prevalence diminishes when the sample size increases. To deal with this problem in the $p = 10$ case, a larger sample size than those used in this simulation study would clearly be needed.
Trimming favors solutions close to the true assignments in contaminated cases and diminishes the prevalence of spurious solutions. Moreover, as expected, the stronger the restriction, the lower the prevalence of spurious solutions. In addition, the strength of the restrictions has to be increased to deal with high-dimensional cases.
In all scenarios, the recommendation would be to use the proposed methodology, i.e., Restricted Trimmed Likelihood versions of the procedure, with $\alpha$ greater than the expected contamination level and moderate values of the restriction factor $c$, according to the heteroscedasticity expected in the model. As can be checked in the table, in all cases the combination of the values $c = 4$ for the restriction factor and $\alpha = 0.2$ for the trimming level has provided PSS $= 0$ and reasonably high HCI values.
To visualize some of the results obtained in this simulation study, Figures 1 and 2 show the classifications obtained by this trimming approach when $p = 2$ and $n = 200$. The trimmed observations are shown as black points. The same number of random initializations nstart and concentration steps iter.max have been used in all these graphs.

Figure 1 shows how trimmed ML with $G = 2$ provides classifications closer to the "true" solutions when contamination is present (cases S2 and S3). This figure also shows that considering a trimming size higher than needed is not too detrimental. The same value of the restriction factor, $c = 4$, has been used throughout.

Figure 2 shows the effect of the restriction factor $c$ on the output of the algorithm for the data set in scenario S2. We take $\alpha = 0.2$, which supposedly allows us to remove the 10% of contamination.
Table 1: Simulation results.

                            S1             S2             S3
   n    p     c    α    HCI   PSS     HCI   PSS     HCI   PSS
 100    2     4    0   .996  .000    .000  1       .000  .728
 100    2     4   .2   .988  .000    .983  .000    .993  .000
 100    2   100    0   .773  .000    .000  1       .000  1
 100    2   100   .2   .552  .000    .659  .000    .883  .000
 100    2     ∞    0   .599  .000    .000  1       .000  1
 100    2     ∞   .2   .402  .001    .396  .000    .673  .000
 100    6     4    0   .984  .000    .000  1       .053  .909
 100    6     4   .2   .987  .000    .990  .000    .992  .000
 100    6   100    0   .072  .000    .019  .964    .005  .978
 100    6   100   .2   .045  .000    .109  .031    .031  .000
 100    6     ∞    0   .022  .005    .003  .859    .001  .787
 100    6     ∞   .2   .013  .016    .003  .663    .004  .169
 100   10     4    0   .486  .000    .000  1       .176  .402
 100   10     4   .2   .768  .000    .974  .000    .962  .000
 100   10   100    0   .000  .000    .000  1       .001  .987
 100   10   100   .2   .000  1       .000  .044    .001  .680
 100   10     ∞    0   .000  1       .000  1       .000  1
 100   10     ∞   .2   .000  1       .000  1       .000  1
 200    2     4    0   .994  .000    .000  1       .119  .871
 200    2     4   .2   .995  .000    .253  .000    .994  .000
 200    2   100    0   .931  .000    .000  1       .000  1
 200    2   100   .2   .846  .000    .308  .006    .865  .000
 200    2     ∞    0   .799  .000    .000  1       .000  1
 200    2     ∞   .2   .728  .000    .164  .163    .687  .000
 200    6     4    0   .993  .000    .000  1       .126  .578
 200    6     4   .2   .995  .000    .989  .000    .994  .000
 200    6   100    0   .246  .000    .000  1       .000  1
 200    6   100   .2   .093  .000    .152  .067    .138  .000
 200    6     ∞    0   .075  .000    .000  1       .000  1
 200    6     ∞   .2   .032  .001    .002  .721    .030  .059
 200   10     4    0   .992  .000    .000  1       .107  .883
 200   10     4   .2   .993  .000    .993  .000    .989  .000
 200   10   100    0   .005  .000    .000  1       .000  1
 200   10   100   .2   .006  .000    .057  .026    .009  .000
 200   10     ∞    0   .000  1       .000  1       .000  1
 200   10     ∞   .2   .000  1       .000  1       .000  1
[Figure 1 about here: six panels (a)-(f), scatter plots of the simulated data on the (x1, x2) plane with fitted mixture density contours.]
Figure 1: Fitted mixtures with $G = 2$ and $c = 4$ for the data set in scenario S1 with $\alpha = 0$ (a) and $\alpha = 0.2$ (b); scenario S2 with $\alpha = 0$ (c) and $\alpha = 0.2$ (d); and scenario S3 with $\alpha = 0$ (e) and $\alpha = 0.2$ (f). Different plotting symbols are used to represent the observation classifications made by using the posterior probabilities. The black solid points are the trimmed observations. The contour plots represent the fitted mixtures.
We can see that $c = 1$ yields spherical mixture components, since the restriction forces the mixture component scatter matrices to be equal to the same diagonal matrix. We can also see in Figure 2(c) that the method is not able to resist the effect of the contamination in the (almost) unconstrained case $c = 10^{10}$. Recall that a trimming level $\alpha = 0.2$ (larger than the contamination level) was taken. Due to the lack of constraints, the solution in Figure 2(c), with very different mixture component scatters, has a larger likelihood than that corresponding to the mixture that only fits the two main mixture components in (5). If we allow a 20% proportion of observations to be discarded, the mixtures in Figures 2(a) and (b) provide more sensible fits than that in (c), where a component with just 10% of the observations is detected while the two (main) components, each of them with 45% of the observations, are merged together. It may be argued that the problem is that the
number of components $G$ has been wrongly chosen, since $G = 3$ could have been a better choice. However, this argument is not so clear whenever the contamination is more difficult to model with an additional normally distributed mixture component. Moreover, Table 1 shows that PSS is equal to 1 for unrestricted ($c = \infty$) applications of trimmed ML procedures in higher-dimensional cases ($p = 10$), even in the uncontaminated S1 scenario, where $\alpha = 0$ and $G = 2$ are clearly the correct choices of these two parameters. On the other hand, the use of smaller values of $c$ avoids the detection of spurious solutions even in this high-dimensional case.
[Figure 2 about here: three panels with fitted mixture density contours for the S2 data, (a) c = 1, (b) c = 4, (c) c = 10^10.]
Figure 2: Fitted mixtures for the data in scenario S2 with $n = 200$ when $G = 2$ and $\alpha = 0.2$. Restriction value $c = 1$ is used in (a), $c = 4$ in (b) and $c = 10^{10}$ (almost unrestricted) in (c).
5 Discussion
We have presented a constrained robust approach for mixture modeling. We have seen that the proposed trimming approach, together with the consideration of restrictions on the relative sizes of the mixture component scatter matrices, has nice theoretical properties and good robustness behavior. This approach is useful to prevent the detection of spurious solutions.
Since the proposed methodology aims to be very general, several parameters ($\alpha$, $G$ and $c$) are involved. In fact, these parameters are clearly interrelated. For instance, Figure 3 is obtained by considering $n = 200$ observations from the mixture
$$0.4\, N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) + 0.4\, N_2\!\left( \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 1.49 & 0 \\ 0 & 1 \end{pmatrix} \right) + 0.2\, N_2\!\left( \begin{pmatrix} 1.5 \\ 2 \end{pmatrix}, \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix} \right).$$
A large value of $c$, $\alpha = 0$ and $G = 3$ allow the methodology to find the third, most scattered mixture component accounting for 20% of the data, as shown in Figure 3(a). However, a smaller value of $c$ (i.e., when we are interested in mixture components with more similar scatters) does not allow this more scattered component to be obtained in Figure 3(b). Since this component cannot be fitted with an additional mixture component, it is then better to consider $G = 2$ and trim off the most scattered observations, as shown in Figure 3(c).
[Figure 3 about here: three panels (a)-(c) with fitted mixture density contours for the three-component data set described above; axes x1 and x2.]
Figure 3: Fitted mixtures for the three-component mixture in Section 5 when $G = 3$, $\alpha = 0$ and $c = 10$ in (a); $G = 3$, $\alpha = 0$ and $c = 1$ in (b); and $G = 2$, $\alpha = 0.2$ and $c = 4$ in (c).
Therefore, choosing all the involved parameters cannot be a fully automatic task and requires an active role to be played by the user of the mixture fitting methodology, who must specify the "type" of mixture components that are actually being searched for. Thus, for instance, the specification of parameter $c$ allows the user to set the maximum allowable difference between mixture component scatters, and the use of $\alpha > 0$ implies that not all the observations must be fitted. In any case, tools such as those presented in García-Escudero et al. (2011) can be easily adapted to this robust mixture fitting framework. These adapted tools would be useful in providing guidance on how to make a sensible choice of the parameters $\alpha$ and $G$, given $c$. They are based on monitoring the trimmed mixture likelihoods, i.e., the maximum values attained by the target function (1), when moving $\alpha$ and $G$ in a controlled way. In fact, this idea was already explored in Neykov et al. (2007), but considering the monitoring of BIC-penalized trimmed mixture likelihoods. With respect to parameter $c$, we have seen that not many essentially different solutions are really obtained when moving $c$, and it would be interesting to develop tools for exploring all of them in an easy way.
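A sketch of the monitoring idea just described follows. Here `fit_trimmed_mixture` is a hypothetical wrapper around the algorithm of Section 3 that returns the attained value of (1); the name and signature are ours, not an existing API:

```python
def monitor_trimmed_likelihoods(X, c,
                                alphas=(0.0, 0.05, 0.1, 0.15, 0.2),
                                Gs=(1, 2, 3, 4)):
    """Record the maximum trimmed mixture likelihood while moving alpha and G
    for a fixed restriction factor c."""
    curves = {}
    for G in Gs:
        curves[G] = [fit_trimmed_mixture(X, G=G, alpha=a, c=c).loglik  # hypothetical
                     for a in alphas]
    return curves  # plot one curve per G against alpha and look for stabilization
```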
A Appendix: Proofs of existence and consistency results

A.1 Existence
As previously commented (and using the same notation as in Section 2), the trimmed mixture likelihood problem can be defined in terms of the maximization over $\theta \in \Theta_c$ of
$$L(\theta, P) := E_P\left[ z(\cdot; \theta) \log(D(\cdot; \theta)) \right]. \tag{6}$$
Let us thus consider a sequence $\{\theta_n\}_{n=1}^{\infty} = \{(\pi_1^n, ..., \pi_G^n, \mu_1^n, ..., \mu_G^n, \Sigma_1^n, ..., \Sigma_G^n)\}_{n=1}^{\infty} \subset \Theta_c$ such that
$$\lim_{n \to \infty} L(\theta_n, P) = \sup_{\theta \in \Theta_c} L(\theta, P) = M > -\infty \tag{7}$$
(the boundedness from below in (7) can be easily obtained just by considering $\pi_1 = 1$, $\mu_1 = 0$ and setting the other weights to 0, with arbitrary choices of the other location and scatter parameters).

Since $[0, 1]^G$ is a compact set, we can extract a subsequence from $\{\theta_n\}_{n=1}^{\infty}$ (denoted in the same way as the original one) such that
$$\pi_g^n \to \pi_g \in [0, 1] \quad \text{for } 1 \leq g \leq G, \tag{8}$$
and satisfying, for some $k \in \{0, 1, ..., G\}$ (a relabeling could be needed), that
$$\mu_g^n \to \mu_g \in \mathbb{R}^p \quad \text{for } 1 \leq g \leq k \quad \text{and} \quad \min_{g > k} \|\mu_g^n\| \to \infty. \tag{9}$$
With respect to the scatter matrices, under (ER), we can also consider a further subsequence verifying one (and only one) of these possibilities:
$$\Sigma_g^n \to \Sigma_g \quad \text{for } 1 \leq g \leq G, \tag{10}$$
or
$$M_n = \max_{g=1,...,G} \max_{l=1,...,p} \lambda_l(\Sigma_g^n) \to \infty, \tag{11}$$
or
$$m_n = \min_{g=1,...,G} \min_{l=1,...,p} \lambda_l(\Sigma_g^n) \to 0. \tag{12}$$
In the proof of the following lemma, we use a bound relating $L(\theta, P)$ to the target function $CL(\theta, P)$ of the TCLUST clustering problem (trimmed classification likelihood) introduced in García-Escudero et al. (2008). This clustering approach was defined in terms of the maximization over $\theta \in \Theta_c$ of
$$CL(\theta, P) := E_P\left[ \sum_{g=1}^{G} z_g(\cdot; \theta) \log D_g(\cdot; \theta) \right] \tag{13}$$
with $z_g(x; \theta) = I\big(\{x : D_C(x; \theta) = D_g(x; \theta)\} \cap \{x : D_C(x; \theta) \geq R(\theta; P)\}\big)$ and where $D_C(x; \theta) = \max\{D_1(x; \theta), ..., D_G(x; \theta)\}$.
Lemma A.1 Given a sequence satisfying (7), if $P$ satisfies (PR) then only the convergence (10) is possible.

Proof: The proof follows from Lemma A.1 in García-Escudero et al. (2008) just by taking into account the bound
$$L(\theta, P) = E_P\left[ z(\cdot; \theta) \log \sum_{g=1}^{G} D_g(\cdot; \theta) \right] \leq E_P\left[ z(\cdot; \theta) \log\Big( G \max_{g=1,...,G} D_g(\cdot; \theta) \Big) \right] \tag{14}$$
$$\leq (1 - \alpha) \log G + E_P\left[ \sum_{g=1}^{G} z_g(\cdot; \theta) \log D_g(\cdot; \theta) \right] = (1 - \alpha) \log G + CL(\theta, P).$$
Lemma A.2 Given a sequence satisfying (7) and assuming condition (PR), if every $\pi_g$ in (8) verifies $\pi_g > 0$ for $g = 1, ..., G$, then $k = G$ in (9).

Proof: If $k = 0$, we can see that $L(\theta_n; P) \to -\infty$ by using the bound (14) and Lemma 2 in García-Escudero et al. (2008). So, let us directly assume that $k > 0$. We will prove that
$$E_P\left[ z(\cdot; \theta_n) \log \sum_{g=1}^{G} D_g(\cdot; \theta_n) \right] - E_P\left[ \tilde{z}(\cdot; \theta_n) \log \sum_{g=1}^{k} D_g(\cdot; \theta_n) \right] \to 0, \tag{15}$$
with $\tilde{z}(x; \theta) = I\{x : \tilde{D}(x; \theta) \geq \tilde{R}_\alpha(\theta, P)\}$ for $\tilde{D}(x; \theta) = \sum_{g=1}^{k} D_g(x; \theta)$, $\tilde{H}(u; \theta, P) = P(\tilde{D}(\cdot; \theta) \leq u)$ and $\tilde{R}_\beta(\theta; P) = \inf_u \{\tilde{H}(u; \theta, P) \geq \beta\}$. Notice that this $\tilde{z}(x; \theta)$ corresponds to the trimming indicator function that would have been obtained when considering only the first $k$ components of the mixture determined by $\theta$.

In order to prove (15), let us consider $\tilde{\theta}$ as the limit of a convergent subsequence of $\{(\pi_1^n, ..., \pi_k^n, \mu_1^n, ..., \mu_k^n, \Sigma_1^n, ..., \Sigma_k^n)\}_{n=1}^{\infty}$. Recall that such a $\tilde{\theta}$ always exists due to (8), (9) and Lemma A.1.

Take the compact set $B_\beta = \{x : \tilde{D}(x; \tilde{\theta}) \geq \tilde{R}_\beta(\tilde{\theta}, P)\}$ and $\varepsilon$ with $0 < \varepsilon < \tilde{R}_\beta(\tilde{\theta}; P)$. For each $x \in B_\beta$, there always exists $n_0$ such that
$$D(x; \theta_n) \geq \tilde{D}(x; \theta_n) \geq \tilde{D}(x; \tilde{\theta}) - \varepsilon \geq \tilde{R}_\beta(\tilde{\theta}; P) - \varepsilon > 0$$
for every $n \geq n_0$. Since $B_\beta$ is a compact set, $D(x; \theta_n)$ is uniformly bounded from above when $x \in B_\beta$. Moreover, by applying the optimality of $z(\cdot; \theta_n)$ and $\tilde{z}(\cdot; \theta_n)$, there also exists an $n_0'$ such that
$$\{x : z(x; \theta_n) = 1\} \subset \big\{x : \log(D(x; \theta_n)) \geq \log(\tilde{R}_\beta(\tilde{\theta}; P) - \varepsilon)\big\} \tag{16}$$
and
$$\{x : \tilde{z}(x; \theta_n) = 1\} \subset \big\{x : \log(\tilde{D}(x; \theta_n)) \geq \log(\tilde{R}_\beta(\tilde{\theta}; P) - \varepsilon)\big\}, \tag{17}$$
for every $n \geq n_0'$ when $\beta > 1 - \alpha$.

From (16) and (17) and the previous uniform (lower and upper) bounds on $D(x; \theta_n)$, we have that
$$z(\cdot; \theta_n) \log \sum_{g=1}^{G} D_g(\cdot; \theta_n) - \tilde{z}(\cdot; \theta_n) \log \sum_{g=1}^{k} D_g(\cdot; \theta_n)$$
is a uniformly bounded sequence. This fact, together with its pointwise convergence, proves (15) by applying the dominated convergence theorem.

Therefore, taking into account (15), we have $\limsup_{n \to \infty} L(\theta_n; P) \leq L(\tilde{\theta}; P)$. Given that $\sum_{g=1}^{k} \pi_g < 1$, the proof ends by noting that we could replace the weights $\pi_1, ..., \pi_G$ by weights $\pi_1^*, ..., \pi_G^*$ with $\pi_{k+1}^* = ... = \pi_G^* = 0$. This new solution would lead to a contradiction with the optimality in (2), and we conclude that $k = G$.
Proof of Proposition 2.1: Taking into account the previous lemmas, the proof is exactly the same as that of Proposition 2 in García-Escudero et al. (2008).
A.2 Consistency
Given $\{x_n\}_{n=1}^{\infty}$ an i.i.d. random sample from an underlying (unknown) probability distribution $P$, let $\{\theta_n\}_{n=1}^{\infty} = \{(\pi_1^n, ..., \pi_G^n, \mu_1^n, ..., \mu_G^n, \Sigma_1^n, ..., \Sigma_G^n)\}_{n=1}^{\infty} \subset \Theta_c$ denote a sequence of empirical estimators obtained by solving the problem (2) with $P$ being the sequence of empirical measures $\{P_n\}_{n=1}^{\infty}$, under the eigenvalue-ratio constraint (ER) (notice that the index $n$ now stands for the sample size).

First we prove that there exists a compact set $K \subset \Theta_c$ such that $\theta_n \in K$ for $n$ large enough, with probability 1.
Lemma A.3 If $P$ satisfies (PR), then the minimum (resp. maximum) eigenvalue $m_n$ (resp. $M_n$) of the $\Sigma_g^n$ matrices cannot verify $m_n \to 0$ (resp. $M_n \to \infty$).

Proof: The proof is trivial after applying the bound (14) and Lemma A.4 in García-Escudero et al. (2008).
Lemma A.4 If (PR) holds for the distribution $P$, then we can choose empirical centers $\mu_g^n$ such that their norms are uniformly bounded with probability 1.

Proof: We use the same reasoning as in Lemma A.5 of García-Escudero et al. (2008), combined with the bound (14) applied to $P = P_n$, to prove that $k > 0$.

Therefore, we can assume that $k > 0$. We need to prove (15), but replacing $P$ by $P_n$ (note that $z(\cdot; \theta_n)$ and $\tilde{z}(\cdot; \theta_n)$ are then also defined in terms of $P_n$). This can be done by following a reasoning similar to that applied in the proof of Lemma A.2, after applying standard empirical process theory tools.
In the following lemma, we use the same notation and terminology as in van der Vaart and Wellner (1996).

Lemma A.5 Given a compact set $K \subset \Theta_c$, a compact set $B \subset \mathbb{R}^p$ and $[a, b] \subset \mathbb{R}$, the class of functions
$$\mathcal{H} := \big\{ I_B(\cdot)\, I_{[d, \infty)}\big(D(\cdot; \theta)\big) \log(D(\cdot; \theta)) : \theta \in K,\ d \in [a, b] \big\} \tag{18}$$
is a Glivenko-Cantelli class.

Proof: It has already been proven in García-Escudero et al. (2008) that, for $B$ a fixed compact set,
$$\mathcal{G} := \big\{ I_B(\cdot) \log(D(\cdot; \theta)) : \theta \in K \big\}$$
is a Glivenko-Cantelli class of functions. From this property, the same property for the class $\mathcal{H}$ can be obtained through the consideration of the subgraphs, $\{\log(D(\cdot; \theta)) \geq d : d \in [a, b]\}$, of the functions in $-\mathcal{G}$ and the application of Theorem 2.10.6 in van der Vaart and Wellner (1996).
The following lemma is analogous to Lemma A.7 in García-Escudero et al. (2008) and its proof is exactly the same:

Lemma A.6 Let $P$ be an absolutely continuous distribution with a strictly positive density function. Then, for every compact subset $K$, we have that
$$\sup_{\theta \in K} |R(\theta; P_n) - R(\theta; P)| \to 0, \quad P\text{-a.e.} \tag{19}$$

Proof of Proposition 2.2: Taking into account Lemma A.5, the result follows from Corollary 3.2.3 in van der Vaart and Wellner (1996). Notice that Lemmas A.3 and A.4 are needed to guarantee the existence of a compact set $K$ such that the sequence of empirical estimators satisfies $\{\theta_n\}_{n=1}^{\infty} \subset K$. Lemma A.6 is needed in order to guarantee the existence of an interval $[a, b]$ including $R(\theta_n; P_n)$.
References

[1] Calò, D.G. (2008), "Mixture Models in Forward Search Methods for Outlier Detection," in Studies in Classification, Data Analysis, and Knowledge Organization (eds: Preisach, C., Burkhardt, H., Schmidt-Thieme, L. and Decker, R.), Springer Berlin Heidelberg, 103-110.

[2] Coretto, P. and Hennig, C. (2010), "A simulation study to compare robust clustering methods based on mixtures," Adv. Data Anal. Classif., 4, 111-135.

[3] Fraley, C. and Raftery, A.E. (1998), "How many clusters? Which clustering method? Answers via model-based cluster analysis," The Computer J., 41, 578-588.

[4] Fritz, H., García-Escudero, L.A. and Mayo-Iscar, A. (2013), "A fast algorithm for robust constrained clustering," Comput. Stat. Data Anal., 61, 124-136.

[5] Gallegos, M.T. and Ritter, G. (2009), "Trimmed ML estimation of contaminated mixtures," Sankhya (Ser. A), 71, 164-220.

[6] García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2008), "A general trimming approach to robust cluster analysis," Ann. Statist., 36, 1324-1345.

[7] García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering," Stat. Comput., 21, 585-599.

[8] García-Escudero, L.A., Gordaliza, A., Matrán, C. and Mayo-Iscar, A. (2013), "Avoiding spurious local maximizers in mixture modelling," preprint available at http://www.eio.uva.es/infor/personas/slm2_web.pdf.

[9] Hubert, L. and Arabie, P. (1985), "Comparing partitions," J. Classif., 2, 193-218.

[10] Ingrassia, S. and Rocci, R. (2007), "Constrained monotone EM algorithms for finite mixture of multivariate Gaussians," Comput. Stat. Data Anal., 51, 5339-5351.

[11] McLachlan, G. and Peel, D. (2000), Finite Mixture Models, John Wiley & Sons, Ltd., New York.

[12] Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007), "Robust fitting of mixtures using the trimmed likelihood estimator," Comput. Stat. Data Anal., 52, 299-308.

[13] van der Vaart, A.W. and Wellner, J.A. (1996), Weak Convergence and Empirical Processes, Springer, New York.