The Consistency of Estimators in Finite Mixture Models

© Board of the Foundation of the Scandinavian Journal of Statistics 2001. Published by Blackwell Publishers Ltd, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA. Vol. 28: 603–616, 2001.
R. C. H. CHENG and W. B. LIU
University of Kent
ABSTRACT. The parameters of a finite mixture model cannot be consistently estimated when the data come from an embedded distribution with fewer components than that being fitted, because the distribution is represented by a subset of the parameter space, and not by a single point. Feng & McCulloch (1996) give conditions, not easily verified, under which the maximum likelihood (ML) estimator will converge to an arbitrary point in this subset. We show that these conditions can be considerably weakened. Even though embedded distributions may not be uniquely represented in the parameter space, estimators of quantities of interest, such as the mean or variance of the distribution, may nevertheless be consistent in the conventional sense. We give an example of some practical interest where the ML estimators are √n-consistent. Similarly, consistent statistics can usually be found to test a simpler model against the full model. We suggest a test statistic suitable for a general class of models and propose a parametric bootstrap test, based on this statistic, for when the simpler model is correct.

Key words: embedded model, indeterminacy, maximum likelihood, parametric bootstrap
1. Introduction
Finite mixture models have been much studied both theoretically and practically, and a very thorough review has been given by Titterington et al. (1985). However, there is one problem, giving rise to serious theoretical and practical difficulties, that has not been so well investigated. It occurs when data come from a mixture model with fewer components than that actually being fitted. The test of the null hypothesis that there are fewer components than under the alternative is then non-regular, because certain parameters in the alternative are indeterminate under the null. Moreover, under the null, the indeterminate parameters cannot be consistently estimated. This problem of parameter indeterminacy has been reviewed by Smith (1989) and by Cheng & Traylor (1995), who focus on the hypothesis testing problem. In this paper we mainly consider the parameter estimation problem.
As in the case of hypothesis testing, the estimation problem occurs because the model being considered has more components than are actually needed. The true model is then an embedded special case and is not uniquely defined in the parameter space embracing the full model. It is therefore represented not by a single point, but by any point of a certain subset of the parameter space. This complication was first discussed by Redner (1981). However, Redner's work did not address the case where the estimator of a mixture model approaches a boundary point of the parameter space, which is actually very important in applications. Redner's theorems 3–4 are based directly on Wald's technique, which has two essential parts: one dealing with any compact interior set, and the other handling boundary points by assuming that the density function goes to zero whenever the parameters approach a boundary point. As the mixture density function may not tend to zero when the parameters approach a boundary point of the (e.g. unbounded) parameter space, Wald's result and approach (and therefore Redner's) cannot be adopted directly. That seems to be why Redner, in his theorems 5–6, did not address this important issue, instead assuming that the estimator remains in a compact subset. In a recent paper, Feng & McCulloch (1996) establish a consistency result for finite mixture models in this case. In that paper, they give conditions under
which the vector of maximum likelihood (ML) estimators of the parameters will converge to an arbitrary point in the subset representing the true model, allowing the estimators to approach a boundary point of the parameter space. However, their result assumes some complicated conditions, including uniform high-order differentiability in neighbourhoods of the indeterminate sets, which are rather difficult to check, so that even in relatively simple cases it is not easily established whether the result holds. We show that the classic consistency argument given by Wald (1949) can be extended to cover this situation. In our work we apply a compactification technique: we first compactify the parameter space by including all the boundary points in an extended parameter space, and then show that the mixture density function can be extended to this new space in a continuous fashion. We then extend Wald's technique to the extended parameter space. The conditions under which consistency is shown to hold by this approach are considerably weaker and, in particular, do not require differentiability. These conditions are much easier to check, and we give explicit examples covering mixtures of location-scale models and mixtures of exponential models, including such cases as mixtures of normals. The proof is also short.
The problem can alternatively be dealt with by reparameterization of the model, so that each subset of points representing a given embedded model is mapped into a single point. Though such reparameterizations do exist, they are model dependent and are in general difficult to define simply. Moreover, such a reparameterization may not guarantee the complete removal of the non-regularity, so that the question of when estimators are efficient is in general still unresolved. We therefore do not discuss such reparameterizations in general here; however, we do give an example of some practical interest where estimators of important parameters are √n-consistent irrespective of whether the true model is the embedded or the full model. Ghosh & Sen (1985) mention the possibility of reparameterization but do not pursue the idea, as their results require differentiability conditions. Their suggested reparameterization (1985, p. 791) does not have the desired properties.
The likelihood ratio, which is not consistent because of the non-regularity, cannot be used to test for an embedded model. However, consistent statistics can usually be found, and we give an example that can be used for a wide class of mixture models. We suggest a parametric bootstrap method, using this statistic, for carrying out the test, and report numerical results in which the statistic has performed satisfactorily.
2. Consistency of maximum likelihood estimators
Let $L^1$ and $B^+$ denote the following spaces of integrable functions on the interval $(-\infty, \infty)$:

$$L^1 = \Big\{ f : f \text{ measurable}, \; \|f\| = \int_{-\infty}^{\infty} |f| < \infty \Big\},$$

$$B^+ = \{ f : f \in L^1, \; \|f\| = 1, \; f \geq 0 \}.$$

Let $f_1, f_2 \in L^1$. It follows that $f_1 = f_2$ in $L^1$ if and only if $f_1(x) = f_2(x)$ almost everywhere in $R^1$. Let $\Omega_1, \Omega_2$ be two closed sets in $R^m$. We define a metric between the two sets as follows:

$$\mathrm{dis}(\Omega_1, \Omega_2) = \mathrm{dis}(\Omega_2, \Omega_1) = \inf_{y \in \Omega_2} \inf_{x \in \Omega_1} |x - y|.$$

When $\Omega_1, \Omega_2$ are singleton sets (i.e. single points), this metric agrees with the classical Euclidean distance. The following is used later.
Property 1
(i) $\mathrm{dis}(\Omega_1, \Omega_2) = 0$ if and only if there are sequences of points, $\{x_n\}$ in $\Omega_1$ and $\{y_n\}$ in $\Omega_2$, such that $|x_n - y_n| \to 0$ as $n \to \infty$.
(ii) $\mathrm{dis}(x_n, \Omega) \to 0$ if and only if there is a sequence $\{y_n\}$ of points in $\Omega$ such that $|x_n - y_n| \to 0$ as $n \to \infty$.
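The metric dis can be illustrated with a small numerical sketch (our own illustration, with the closed sets approximated by finite lists of points in $R^m$; the function name is ours):

```python
import numpy as np

def dis(omega1, omega2):
    """Distance between two sets represented by finite arrays of
    points in R^m: the infimum of |x - y| over pairs (x, y)."""
    omega1 = np.atleast_2d(np.asarray(omega1, dtype=float))
    omega2 = np.atleast_2d(np.asarray(omega2, dtype=float))
    # all pairwise Euclidean distances; the metric is the smallest one
    diffs = omega1[:, None, :] - omega2[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min())

# Singleton sets: the metric agrees with ordinary Euclidean distance.
print(dis([[0.0, 0.0]], [[3.0, 4.0]]))      # 5.0
# Sets sharing a point have distance 0, as in property 1(i).
print(dis([[0.0], [1.0]], [[1.0], [2.0]]))  # 0.0
```

Note that dis is a separation measure, not a true metric on sets (it vanishes whenever the sets merely intersect); convergence to the set Γ(α⁰, θ⁰) in the theorems below is stated in exactly this sense.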
Consider the following mixture model:

$$h(x, \alpha, \theta) = \sum_{i=1}^{k} \alpha_i f_i(x, \theta_i),$$

where

$$\alpha \in A \equiv \{ (\alpha_1, \alpha_2, \ldots, \alpha_k) : \alpha_i \geq 0 \; (i = 1, \ldots, k), \; \textstyle\sum \alpha_i = 1 \},$$
$$\theta \in \Theta \equiv \{ (\theta_1, \theta_2, \ldots, \theta_k) : \theta_i \in \Theta_i \; (i = 1, \ldots, k) \},$$

and the $\Theta_i$ $(i = 1, \ldots, k)$ are closed convex sets belonging to $R^p$.
Let $\Gamma = A \times \Theta$. For any given $(\alpha^0, \theta^0) \in \Gamma$ such that $h(\cdot, \alpha^0, \theta^0) \in B^+$, we define the set

$$\Gamma(\alpha^0, \theta^0) = \{ (\alpha, \theta) : (\alpha, \theta) \in \Gamma \text{ and } h(\cdot, \alpha, \theta) = h(\cdot, \alpha^0, \theta^0) \}.$$

It can be shown that $\Gamma(\alpha^0, \theta^0)$ is a closed set if certain regularity conditions, for example assumption (c) below, hold. $\Gamma(\alpha^0, \theta^0)$ need not, however, be a singleton set. There are two distinct ways in which such non-uniqueness arises.
The first occurs because the same set of parameter values can be ascribed to different permutations of the mixture components. The problem therefore occurs because the full parameter space is made up of a number of identical copies of what is in essence the same subspace. This problem is well known and is frequently referred to as the interchangeability problem. It can be handled simply by placing an order restriction on the values of just one of the parameter components; see for example Richardson & Green (1997). For simplicity of exposition and notation we do not include this order restriction explicitly, but simply agree to regard all such permutation copies as being one and the same.
The second problem of non-uniqueness is much more serious. Ignoring permutations, there are other cases when $\Gamma(\alpha^0, \theta^0)$ is not a singleton set but is actually a continuum of parameter values, where, for any other point $(\alpha^1, \theta^1) \in \Gamma(\alpha^0, \theta^0)$ in this continuum, we have $h(\cdot, \alpha^1, \theta^1) = h(\cdot, \alpha^0, \theta^0)$. Thus $h(\cdot, \alpha^0, \theta^0)$ does not have a unique representation in $\Gamma$. This happens whenever $(\alpha^0, \theta^0)$ has a value for which $h(x, \alpha^0, \theta^0)$ takes the reduced form

$$h(x, \alpha^0, \theta^0) = \sum_{i=1}^{l} \alpha_{m_i} f_{m_i}(x, \theta_{m_i}),$$

with $l \leq k - 1$, $\theta_{m_i} \in \Theta_{m_i}$. Clearly the ML (or for that matter any other) estimator will not be consistent in this situation in the sense of converging to a unique point $(\alpha^0, \theta^0)$. If $(\alpha^0, \theta^0)$ is the true parameter value in the sense that $h(\cdot, \alpha^0, \theta^0)$ is the correct model, then we can at best only expect that the ML estimator converges to $\Gamma(\alpha^0, \theta^0)$ in, for example, the Hausdorff distance. This is essentially theorem 1 below. We now give sufficient conditions for the theorem to hold. These conditions are convenient to check, but not necessarily a minimal set. For any $1 \leq i \leq k$, let $\theta_i^0 \in \Theta_i$ and write $E_{\theta_i^0}(g) = \int_{-\infty}^{\infty} g(x) f_i(x, \theta_i^0)\,dx$.
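To make this second kind of non-uniqueness concrete, the following sketch (our own illustration; it assumes a two-component mixture of unit-variance normal location densities) checks numerically that three quite different parameter points all represent the same embedded single-normal density:

```python
import numpy as np

def normal_pdf(x, mu):
    # unit-variance normal density with mean mu
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

def h(x, alpha, th1, th2):
    # two-component normal mixture, h = alpha f(., th1) + (1-alpha) f(., th2)
    return alpha * normal_pdf(x, th1) + (1 - alpha) * normal_pdf(x, th2)

x = np.linspace(-5.0, 5.0, 201)
target = normal_pdf(x, 0.0)          # the embedded model: a single N(0, 1)
# Three distinct parameter points, all giving the identical density:
reprs = [(0.3, 0.0, 0.0),            # th1 = th2 = 0, alpha arbitrary
         (0.0, 7.5, 0.0),            # alpha = 0, th1 arbitrary
         (1.0, 0.0, -2.0)]           # alpha = 1, th2 arbitrary
for a, t1, t2 in reprs:
    assert np.allclose(h(x, a, t1, t2), target)
print("all three parameter points represent the same density")
```

The three points above form a sample from the continuum Γ(α⁰, θ⁰); no point-valued estimator can distinguish between them from data.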
i
Assumption (a)
Let $1 \leq i \leq k$. For any $\theta_i \in \Theta_i$, $f_i(\cdot, \theta_i) \in B^+$. The support of $f_i$ is independent of $\theta_i$. Furthermore, $f_i(\cdot, \theta_i^1) = f_i(\cdot, \theta_i^2)$ in $B^+$ only if $\theta_i^1 = \theta_i^2$.
Assumption (b)
Let $1 \leq i \leq k$, and $g_i(x, \theta_i) = \max\{ f_i(x, \theta_i), 1 \}$. For any $\theta_i \in \Theta_i$,

$$E_{\theta_i^0}[\log\{ f_i(\theta_i) \}] > -\infty$$

on the support of $f_i$, and

$$E_{\theta_i^0}[\log\{ g_i(\theta_i) \}] < \infty.$$

Also

$$E_{\theta_i^0}\Big[ \log\Big\{ \sup_{\theta_i \in \Theta_i,\, |\theta_i - \theta_i^0| \leq r} g_i(\theta_i) \Big\} \Big] < \infty,$$

for $r > 0$ sufficiently small, and

$$E_{\theta_i^0}\Big[ \log\Big\{ \sup_{\theta_i \in \Theta_i,\, |\theta_i| \geq r} g_i(\theta_i) \Big\} \Big] < \infty,$$

for $r$ sufficiently large.
Assumption (c)
Let $1 \leq i \leq k$. For almost every fixed $x \in R^1$,

$$\lim_{|\theta_i| \to \infty} f_i(x, \theta_i) = 0, \qquad \lim_{\theta_i \to \theta_i^0} f_i(x, \theta_i) = f_i(x, \theta_i^0),$$

if $\theta_i, \theta_i^0 \in \Theta_i$.
We use the following two lemmas in the proof of theorem 1. The first lemma shows that assumption (b) holds for $h$ as well if it does so for the $f_i$. For any $g \in L^1$, let $E_{(\alpha^0, \theta^0)}\{ g(x) \} = \int_{-\infty}^{\infty} g(x) h(x, \alpha^0, \theta^0)\,dx$.

Lemma 1
For any $(\alpha, \theta) \in \Gamma$, assumption (b) holds with $k = 1$, $\theta_1$ replaced by $(\alpha, \theta)$, $\theta_1^0$ by $(\alpha^0, \theta^0)$ and $f_1(\cdot, \theta_1)$ by $h(\cdot, \alpha, \theta)$.

Proof. The desired results follow from the following inequalities: $\alpha \log(f_1) + (1 - \alpha)\log(f_2) \leq \log\{ \alpha f_1 + (1 - \alpha) f_2 \}$, and, for $f_1, f_2 \geq 1$, $\log\{ \alpha f_1 + (1 - \alpha) f_2 \} \leq \log 2 + \log(f_1) + \log(f_2)$.
The second lemma extends a well-known inequality for density functions to non-density functions.

Lemma 2
Let $C = \{ f \in L^1 : \|f\| < 1, \; f > 0 \}$. Then for any $f \in C$ and $g \in B^+$

$$\int_{-\infty}^{\infty} \log(f/g)\, g\; dx < 0.$$

Proof. The inequality follows from Jensen's inequality.
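Lemma 2 can be checked numerically in a simple case. The sketch below (our own illustration) takes $g$ to be the $N(0, 1)$ density and $f$ to be $0.9$ times the $N(1, 1)$ density, so that $\|f\| = 0.9 < 1$ and $f \in C$; analytically the integral equals $\log 0.9 - 1/2$:

```python
import numpy as np

def phi(x, mu=0.0):
    # unit-variance normal density
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-20.0, 20.0, 200001)
g = phi(x)                  # g is a density, so g is in B+
f = 0.9 * phi(x, mu=1.0)    # integrates to 0.9 < 1, so f is in C

y = np.log(f / g) * g
# trapezoidal rule for the integral of log(f/g) g over the grid
integral = float(((y[1:] + y[:-1]) / 2 * np.diff(x)).sum())
print(round(integral, 4))   # log(0.9) - 1/2 = -0.6054, negative as lemma 2 asserts
```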
We denote ML estimators not only by the conventional $(\hat\alpha, \hat\theta)$ but, when discussing the convergence of sequences of ML estimators, also by $(\alpha^n, \theta^n)$; in the latter case the circumflex is suppressed for simplicity. Note that for simplicity of terminology we refer to $(\alpha^n, \theta^n)$ as the ML estimator. However, for any given $n$, $(\alpha^n, \theta^n)$ is not necessarily unique, so that it can actually be any one of a set of possible choices. The point, however, as the following theorem makes clear, is that any one of these values is allowed.
Theorem 1
Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $h_0(x) = h(x, \alpha^0, \theta^0)$ satisfying assumptions (a)–(c). Assume that the parameter space $\Gamma$ is a closed convex set. Let the ML estimator of the model be $(\alpha^n, \theta^n)$. Then

$$\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0) \} \to 0 \quad \text{w.p. 1}.$$

Proof. Without loss of generality, we can assume that $\Gamma$ is compact. This is because we can always extend $\Theta_i$ to include all its infinite cluster points so that it is a compact set. Each set in $\Gamma$ then consists of finite points and these infinite cluster points. At any such infinite cluster point, $h$ can be uniquely defined as a result of assumption (c). At such cluster points $h$ is not necessarily zero but always has $\int_{-\infty}^{\infty} h \leq 1$ from assumption (c).
We adopt the approach used in Wald (1949). First we show that

$$P\Big\{ \lim_{n \to \infty} \sup_{(\alpha, \theta) \in S} \frac{h(X_1, \alpha, \theta) h(X_2, \alpha, \theta) \cdots h(X_n, \alpha, \theta)}{h_0(X_1) h_0(X_2) \cdots h_0(X_n)} = 0 \Big\} = 1, \qquad (1)$$

where $S$ is any closed subset of $\Gamma$ such that $\mathrm{dis}\{S, \Gamma(\alpha^0, \theta^0)\} > 0$. To show (1), we only need to confirm that for each point $(\alpha^*, \theta^*) \in S$ there is always a neighbourhood, $N(\alpha^*, \theta^*)$, of the point such that

$$E_{(\alpha^0, \theta^0)} \log\Big\{ \sup_{(\alpha, \theta) \in N(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} < E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^0, \theta^0) \}. \qquad (2)$$

To this end, first assume that $(\alpha^*, \theta^*)$ is a finite point. Let $\{N_i(\alpha^*, \theta^*)\}$ $(i = 1, 2, \ldots)$ be a sequence of decreasing neighbourhoods of the point $(\alpha^*, \theta^*)$ such that $\cap_{i \geq 1} N_i(\alpha^*, \theta^*) = (\alpha^*, \theta^*)$. Without loss of generality, we can assume that $E_{(\alpha^0, \theta^0)} \log\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \}$ exists for $i = 1, 2, \ldots$, owing to assumption (b). It follows from our assumptions that

$$\lim_{i \to \infty} \log\Big\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} = \log\{ h(\alpha^*, \theta^*) \}.$$

It is also clear that we have

$$\lim_{i \to \infty} E_{(\alpha^0, \theta^0)} \log\Big\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} \geq E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^*, \theta^*) \}. \qquad (3)$$

The sequence $\big\{ E_{(\alpha^0, \theta^0)} \log\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \} \big\}$ is clearly decreasing, so that

$$\log\Big\{ \sup_{(\alpha, \theta) \in N_1(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} - \log\Big\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} \geq 0.$$

Then it follows from Fatou's lemma and (3) that

$$\lim_{i \to \infty} E_{(\alpha^0, \theta^0)} \log\Big\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} = E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^*, \theta^*) \} < E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^0, \theta^0) \}.$$

Therefore (2) follows when $(\alpha^*, \theta^*)$ is a finite point.
Let $(\alpha^*, \theta^*)$ be an infinite point. We first show that (2) is true when $N(\alpha^*, \theta^*)$ degenerates into the single point $(\alpha^*, \theta^*)$. From our assumption we know $h(\alpha^*, \theta^*)$ has the following form:

$$h(x, \alpha^*, \theta^*) = \sum_{i=1}^{l} \alpha^*_{m_i} f_{m_i}(x, \theta^*_{m_i}),$$

where $0 \leq l \leq k - 1$ and $\alpha^*_{m_i} f_{m_i}(\cdot, \theta^*_{m_i}) > 0$. If $\sum_{i=1}^{l} \alpha^*_{m_i} < 1$, then from lemma 2,

$$E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^*, \theta^*) \} < E_{(\alpha^0, \theta^0)} \log h(\alpha^0, \theta^0).$$

Otherwise, we only need to show that $h(\alpha^*, \theta^*) \neq h(\alpha^0, \theta^0)$. Assume this is not true (here we cannot yet say that $(\alpha^*, \theta^*) \in \Gamma(\alpha^0, \theta^0)$, as it is not a finite point of $\Gamma$). Then $(\alpha^*, \theta^*) \in \Gamma(\alpha^0, \theta^0)$, as it is the limit point of the sequence $\{ (\alpha_1^s, \ldots, \alpha_k^s, \theta_1^s, \ldots, \theta_k^s) \} \subset \Gamma(\alpha^0, \theta^0)$, where

$$\alpha_j^s = \alpha_j^* \text{ if } j = m_i, \text{ otherwise } \alpha_j^s = 0; \qquad \theta_j^s = \theta_j^* \text{ if } j = m_i, \text{ otherwise } |\theta_j^s| \to \infty.$$

This is impossible, as we have $\mathrm{dis}\{S, \Gamma(\alpha^0, \theta^0)\} > 0$. Now let $N_i(\alpha^*, \theta^*)$ be a sequence of decreasing neighbourhoods of the point $(\alpha^*, \theta^*)$ such that $\cap_i N_i = (\alpha^*, \theta^*)$. Again from lemma 1 and Fatou's lemma

$$\lim_{i \to \infty} E_{(\alpha^0, \theta^0)} \log\Big\{ \sup_{(\alpha, \theta) \in N_i(\alpha^*, \theta^*)} h(\alpha, \theta) \Big\} \leq E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^*, \theta^*) \} < E_{(\alpha^0, \theta^0)} \log\{ h(\alpha^0, \theta^0) \}.$$

Therefore we have proved (2). Then, by using the Heine–Borel finite open cover theorem and the same technique as used in the proof of theorem 1 in Wald (1949), (1) follows.
From (1) we can show that $\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0) \} \to 0$ w.p. 1, by using the same arguments as in the proof of theorem 2 of Wald (1949) but changing $|\theta - \theta^0|$ into $\mathrm{dis}\{ (\alpha, \theta), \Gamma(\alpha^0, \theta^0) \}$.
Theorem 1 reduces to the usual form of consistency for the case where $\Gamma(\alpha^0, \theta^0)$ is a singleton set, and is therefore a natural extension of well-established consistency results for ML estimators. It also implies the convergence results in Feng & McCulloch (1996).
Of the three assumptions, (b) is usually the most tedious to verify. The following gives a simpler condition that is sufficient for assumption (b) to hold.

Corollary 1
Let the $f_i$ satisfy assumptions (a) and (c). Suppose $|f_i(x, \theta)| \leq M(x)$ for all $1 \leq i \leq k$, where $E_{\theta_i^0} M < \infty$ and $E_{\theta_i^0}[\log\{ f_i(\theta_i) \}] > -\infty$. Then theorem 1 holds.
Theorem 1 shows that when the true model is an indeterminate case, the ML estimator converges in Hausdorff distance towards the indeterminate set of points $\Gamma(\alpha^0, \theta^0)$ defining the model. This does not in itself guarantee that the fitted model tends to the true model, which is what is really needed. This important point is not discussed by Feng & McCulloch (1996), but is dealt with in our case by the following theorem, which shows that assumption (c) guarantees the continuity of $h$ under the Hausdorff distance.
Theorem 2
Let assumption (c) hold. Assume that the parameter space $\Gamma$ is a closed convex set. If $\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0) \} \to 0$, then

$$h(x, \alpha^n, \theta^n) \to h(x, \alpha^0, \theta^0), \quad \text{a.e.}$$

Further, if there is a function $M(x)$, integrable on any finite interval of $R^1$, such that $|h(x, \alpha^n, \theta^n)| \leq M(x)$, then

$$H(x, \alpha^n, \theta^n) \to H(x, \alpha^0, \theta^0),$$

where $H(x, \alpha, \theta)$ is the distribution function of the random variable with density function $h(x, \alpha, \theta)$.

Proof. Assume that $\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0) \} \to 0$. Then there is a sequence of points in $\Gamma(\alpha^0, \theta^0)$, denoted by $(\alpha_0^n, \theta_0^n)$, such that $|\alpha_0^n - \alpha^n| + |\theta_0^n - \theta^n| \to 0$. Therefore $h(x, \alpha^n, \theta^n) - h(x, \alpha^0, \theta^0) = h(x, \alpha^n, \theta^n) - h(x, \alpha_0^n, \theta_0^n) \to 0$, a.e., from assumption (c) and property 1.
Now assume that there is a local majorizing function $M(x)$ for the sequence of densities. It then follows that, for any fixed $x$ and $R > 0$, we have

$$\int_{-R}^{x} h(t, \alpha^n, \theta^n)\,dt \to \int_{-R}^{x} h(t, \alpha^0, \theta^0)\,dt, \qquad \int_{-R}^{R} h(t, \alpha^n, \theta^n)\,dt \to \int_{-R}^{R} h(t, \alpha^0, \theta^0)\,dt.$$

For any $\delta > 0$, let $R$ be sufficiently large so that $\int_{-R}^{R} h(t, \alpha^0, \theta^0)\,dt > 1 - \delta/2$. Then there is an $N > 0$ such that for $n > N$, $\int_{-R}^{R} h(t, \alpha^n, \theta^n)\,dt > 1 - \delta$. Therefore

$$1 = \int h(t, \alpha^n, \theta^n)\,dt = \left( \int_{-\infty}^{-R} + \int_{-R}^{R} + \int_{R}^{\infty} \right) h > 1 - \delta + \left( \int_{-\infty}^{-R} + \int_{R}^{\infty} \right) h.$$

Consequently, for any $\delta > 0$ there are $R > 0$, $N > 0$ such that

$$\int_{R}^{\infty} h(t, \alpha^n, \theta^n)\,dt + \int_{-\infty}^{-R} h(t, \alpha^n, \theta^n)\,dt < \delta.$$

It follows that for any fixed $x$

$$\int_{-\infty}^{x} h(t, \alpha^n, \theta^n)\,dt \to \int_{-\infty}^{x} h(t, \alpha^0, \theta^0)\,dt.$$

This proves the second conclusion of theorem 2.
The existence of a local majorizing function is a weak requirement. It holds, for example, when $h(\cdot, \alpha^n, \theta^n)$ is bounded from above.
From theorems 1 and 2 we can show the following existence result for ML estimators in mixture models.

Theorem 3
Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $h(x, \alpha^0, \theta^0)$ satisfying assumptions (a)–(c). Assume that the parameter space $\Gamma$ is a closed convex set. Let the ML estimator of the model be $(\alpha^n, \theta^n)$. Then

$$\Pr\{ \exists N > 0 \text{ such that } (\alpha^n, \theta^n) \text{ exists for all } n > N \} = 1.$$

Proof. Suppose that for some $n$ and sample $X_i$, $(\alpha^n, \theta^n)$ does not exist. Let $(\alpha_m^n, \theta_m^n)$, $m = 1, 2, \ldots$, be a maximizing sequence of the likelihood function

$$L(X, \alpha, \theta) = h(X_1, \alpha, \theta) h(X_2, \alpha, \theta) \cdots h(X_n, \alpha, \theta),$$

i.e. $\lim_{m \to \infty} L(X, \alpha_m^n, \theta_m^n) = \sup_{(\alpha, \theta) \in \Gamma} L(X, \alpha, \theta)$. Then $\{ (\alpha_m^n, \theta_m^n) \}$ must be outside of $\{ (\alpha, \theta) : \mathrm{dis}[(\alpha, \theta), \Gamma_0] \leq 1 \} \cap \Gamma$ when $m$ is large enough, from theorem 2. Therefore we can find a parameter $(\tilde\alpha^n, \tilde\theta^n)$ such that $\mathrm{dis}\{ (\tilde\alpha^n, \tilde\theta^n), \Gamma_0 \} > 1$ and $|L(X, \tilde\alpha^n, \tilde\theta^n)/L(X, \alpha^0, \theta^0)| > 1/2$. Therefore $\sup_{\mathrm{dis}\{(\alpha,\theta),\Gamma_0\} \geq 1} |L(X, \alpha, \theta)/L(X, \alpha^0, \theta^0)| > 1/2$. It then follows that

$$\{ \exists \text{ infinitely many } n \text{ for which } (\alpha^n, \theta^n) \text{ does not exist} \} \subset \Big\{ \exists \text{ infinitely many } n \text{ such that } \sup_{\mathrm{dis}\{(\alpha,\theta),\Gamma_0\} \geq 1} |L(X, \alpha, \theta)/L(X, \alpha^0, \theta^0)| > 1/2 \Big\}.$$

The second is a null event from the proof of theorem 1.
We now give examples of the application of the above results.
Example 1. Special cases of the following very simple model have been much studied (see Hartigan, 1985; Berman, 1986; Smith, 1989, for examples), though mainly from the point of view of likelihood ratio tests. Let

$$\Theta = \{ \theta : 0 \leq \theta < \infty \}, \qquad (4)$$

$\gamma = (\alpha, \theta) \in [0, 1] \times \Theta$. Suppose that for any $\theta \in \Theta$, $f(\cdot, \theta) \in B^+$. Let $f^0(x) = f(x, 0)$, and $E(g) = \int_{-\infty}^{\infty} g f^0\,dx$. Consider the mixture model:

$$h(\alpha, \theta, x) = (1 - \alpha) f(x, 0) + \alpha f(x, \theta). \qquad (5)$$

Indeterminacy occurs when either $\alpha = 0$ or $\theta = 0$, or, in the terminology of the above, when $\Gamma(\alpha, \theta)$ is not a singleton set. Clearly, therefore, this happens only if $\alpha\theta = 0$. All other points correspond to determinate cases. Therefore let

$$\Gamma_0 = \{ (\alpha, \theta) : (\alpha, \theta) \in [0, 1] \times \Theta, \; \alpha = 0 \text{ or } \theta = 0 \}, \qquad (6)$$

which is a closed set in $[0, 1] \times \Theta$. The single model defined by $\Gamma_0$ is $f^0$. We are interested in the consistency of the ML estimators in this case and apply theorem 1 to obtain the following.

Result 1
Suppose assumptions (a)–(c) are satisfied by $f$, i.e. they hold with $k = 1$, $\theta_1 = \theta$, $\Theta_1 = \Theta$ as given in (4) and $f_1(\theta) = f(\theta)$. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $f(\cdot, 0)$. Let the ML estimator of the model (5) be $(\alpha^n, \theta^n)$. Then

$$\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma_0 \} \to 0 \quad \text{w.p. 1}.$$

Proof. We apply theorem 1 with $\Theta_1 = \{0\}$, $\Theta_2 = [0, \infty)$, $f_1 = f_2 = f$.

An explicit model satisfying this result is where $f(x, \theta)$ is the normal density $f(x, \theta) = (2\pi)^{-1/2} \exp[-(x - \theta)^2/2]$.
The practical significance of result 1 is that it shows that it is meaningless to try to estimate either $\alpha$ or $\theta$ separately when the true model is $f^0$. This raises the question whether the combined quantity $\xi = \alpha\theta$ can be consistently estimated. This will be considered in the next section.
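Result 1 can be illustrated numerically. The sketch below is our own rough illustration, not part of the formal development: it uses SciPy's general-purpose bounded optimizer (with a few starting points) as a stand-in for a dedicated ML routine, fits model (5) with normal $f$ to samples from $f(\cdot, 0)$, and reports $\mathrm{dis}\{(\hat\alpha, \hat\theta), \Gamma_0\}$, which for the set (6) is simply $\min(\hat\alpha, \hat\theta)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def negloglik(params, x):
    alpha, theta = params
    # model (5) with f the unit-variance normal density
    dens = ((1 - alpha) * np.exp(-x ** 2 / 2)
            + alpha * np.exp(-(x - theta) ** 2 / 2)) / np.sqrt(2 * np.pi)
    return -np.sum(np.log(dens))

dists = []
for rep in range(5):
    x = rng.standard_normal(500)   # data from the embedded model f(., 0)
    fits = [minimize(negloglik, start, args=(x,), method="L-BFGS-B",
                     bounds=[(0.0, 1.0), (0.0, 10.0)])
            for start in [(0.5, 1.0), (0.2, 3.0), (0.8, 0.5)]]
    a_hat, th_hat = min(fits, key=lambda r: r.fun).x
    # dis{(alpha, theta), Gamma_0} = min(alpha, theta) for Gamma_0 in (6)
    dists.append(min(a_hat, th_hat))
print([round(d, 3) for d in dists])   # each distance is typically near 0
```

Across replications the individual estimates $(\hat\alpha, \hat\theta)$ wander over $\Gamma_0$, but their distance to $\Gamma_0$ stays small, which is exactly the set-valued consistency of result 1.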
Example 2. Let $f(x) \in B^+$. Then

$$f_L(x, \mu, \sigma) = \sigma^{-1} f\Big( \frac{x - \mu}{\sigma} \Big) \qquad (7)$$

is a location-scale model. Consider the mixture of two such models:

$$h(x, \alpha, \theta) = \alpha f_L(x, \mu_1, \sigma_1) + (1 - \alpha) f_L(x, \mu_2, \sigma_2), \qquad (8)$$

with parameter space

$$\Gamma = \{ (\alpha, \theta) : (\alpha, \theta) \in [0, 1] \times \Theta \}, \qquad (9)$$

where

$$\Theta = R^1 \times R^1_\epsilon \times R^1 \times R^1_\epsilon, \qquad R^1_\epsilon = \{ x \in R^1 : x \geq \epsilon > 0 \}. \qquad (10)$$

It is not unreasonable to assume that the variances have a positive lower bound, as the cases of zero variance are degenerate. Without this lower bound the likelihood becomes unbounded if $\sigma_1$ or $\sigma_2 \to 0$, and ML, based on a global maximum, does not then give consistent estimators. In this case one can use a modified ML estimator, for example the maximum product of spacings estimator suggested by Cheng & Amin (1983). We will discuss this in a forthcoming paper.
Let

$$\Gamma(\mu, \sigma) = \{ (\alpha, \theta) \in [0, 1] \times \Theta : \alpha = 0; \; \mu_2 = \mu, \; \sigma_2 = \sigma \}$$
$$\cup \; \{ (\alpha, \theta) \in [0, 1] \times \Theta : \alpha = 1; \; \mu_1 = \mu, \; \sigma_1 = \sigma \}$$
$$\cup \; \{ (\alpha, \theta) \in [0, 1] \times \Theta : \mu_1 = \mu_2 = \mu, \; \sigma_1 = \sigma_2 = \sigma \},$$

for $-\infty < \mu < \infty$ and $0 < \sigma < \infty$, (11)

i.e. $\Gamma(\mu, \sigma)$ is the set of all points in the original parameterization representing the pure location-scale model (7). Let

$$\Gamma_0 = \bigcup_{(\mu, \sigma)} \Gamma(\mu, \sigma). \qquad (12)$$

Thus $\Gamma_0$ is the set of all points for which the model (8) is indeterminate. Theorem 1 gives the consistency of the ML estimator in this case. We have the following.

Result 2
Suppose assumptions (a)–(c) are satisfied with $k = 1$, $\theta_1 = (\mu, \sigma)$, $\Theta_1 = R^1 \times R^1_\epsilon$ as given in (10), and $f_1(\cdot, \theta) = f_L(\cdot, \mu, \sigma)$. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $f_L(\cdot, \mu^0, \sigma^0)$. Let the ML estimator of the model (8) be $(\alpha^n, \theta^n)$. Then

$$\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\mu^0, \sigma^0) \} \to 0 \quad \text{w.p. 1}.$$

Proof. Similarly, we apply theorem 1 with $f_1(x, \mu, \sigma) = f_2(x, \mu, \sigma) = f_L(x, \mu, \sigma)$, $\Theta_1 = R^1 \times R^1_\epsilon$, $\Theta_2 = R^1 \times R^1_\epsilon$. Assumptions (a)–(c) are satisfied (with $k = 2$), since $\sigma_i \geq \epsilon$.

A frequently met case is when $f$ is the standard normal density function. It is easy to check that all the conditions required in result 2 hold for this case.
3. √n-Consistency and test statistics
Convergence of the ML estimator to a set rather than to a point means that estimates of individual parameters may remain unstable. It would therefore be attractive if the parameters could be transformed so that each set $\Gamma(\alpha, \theta)$ maps into a separate singleton set, and where all parameters can be consistently estimated. Such transformations do exist, but are somewhat messy to define in full generality, being closely dependent on the particular form of the $\Gamma(\alpha, \theta)$. We therefore do not discuss such transformations here. However, it is worth noting that certain estimators in broad classes are √n-consistent. We consider an interesting case. Titterington et al. (1985) consider the mixture model with components each with the same exponential density
$$f(x, \theta) = \exp\Big\{ b(x) + \sum_{i=1}^{p} t_i(x)\theta_i - a(\theta) \Big\},$$
and point out that the likelihood equations must satisfy

$$\sum_{j=1}^{k} \alpha_j \frac{\partial a(\theta^{(j)})}{\partial \theta_i^{(j)}} = n^{-1} \sum_{l=1}^{n} t_i(x_l), \qquad i = 1, 2, \ldots, p. \qquad (13)$$

Examples of this property have been observed by a number of authors (see Titterington et al., 1985, for references) but it has not been discussed in the context of indeterminacy. If we therefore retain the original parameterization but simply define, as additional parameters, $\xi_i = \sum_{j=1}^{k} \alpha_j \partial a(\theta^{(j)})/\partial \theta_i^{(j)}$, and define our estimator $\hat\xi_i$ of $\xi_i$ as being the solution of the likelihood equation, then we have immediately from (13) that $\hat\xi_i = n^{-1} \sum_{l=1}^{n} t_i(x_l)$, so that when $E\{t_i(X)\}$ and $\mathrm{var}\{t_i(X)\}$ are finite the $\hat\xi_i$ are asymptotically normally distributed and so are √n-consistent. (Note that we have not investigated the possibility that the MLE lies on the boundary, when it might not necessarily satisfy (13); we do not therefore claim that $\hat\xi_i$ will necessarily always be the MLE, though we conjecture that it usually will be.)
A typical case is that of the two-component normal mixture, which has mean and variance

$$\psi_1 = \alpha\mu_1 + (1 - \alpha)\mu_2, \qquad \psi_2 = \alpha(1 - \alpha)(\mu_1 - \mu_2)^2 + \alpha\sigma_1^2 + (1 - \alpha)\sigma_2^2. \qquad (14)$$

If we therefore let $t_1(x) = x$ and $t_2(x) = x^2$, then the likelihood equation estimators of $\psi_1$ and $\psi_2$, just defined, are

$$\hat\psi_1 = \bar{x} = n^{-1} \sum_{i=1}^{n} x_i, \qquad \hat\psi_2 = n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad (15)$$

i.e. the sample mean and variance. Thus $\hat\psi_1$ and $\hat\psi_2$ are √n-consistent estimators of the mean and variance of the distribution whether the data come from a single normal model or a mixture of two normals.
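A quick simulation illustrates (14)–(15); this is our own sketch under arbitrarily chosen parameter values, not a computation from the paper. The sample mean and variance recover $\psi_1$ and $\psi_2$ for two-component normal mixture data:

```python
import numpy as np

rng = np.random.default_rng(1)

def psi_hats(x):
    # likelihood-equation estimators (15): sample mean and variance
    xbar = x.mean()
    return xbar, ((x - xbar) ** 2).mean()

# population mean and variance from (14), for illustrative parameter values
alpha, mu1, mu2, s1, s2 = 0.4, -1.0, 2.0, 1.0, 0.5
psi1 = alpha * mu1 + (1 - alpha) * mu2
psi2 = (alpha * (1 - alpha) * (mu1 - mu2) ** 2
        + alpha * s1 ** 2 + (1 - alpha) * s2 ** 2)

n = 200_000
comp = rng.random(n) < alpha                  # component membership
x = np.where(comp, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))
p1, p2 = psi_hats(x)
print(round(p1 - psi1, 3), round(p2 - psi2, 3))   # both errors near 0
```

Repeating this with data drawn from a single normal gives the same behaviour, since (15) does not depend on which representation of the model is true.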
Difficulties like those pointed out by Hartigan (1985) show that the likelihood ratio cannot be used to test the null hypothesis that the embedded model is the true model. However, in broad classes simple test statistics can be found that are consistent in the conventional sense under the null hypothesis.
Consider, for example, the location-scale model (8). With no loss of generality, we suppose that $\sigma_1^2 \geq \sigma_2^2$ to avoid the same model being duplicated simply through interchange of $(\alpha, \mu_1, \sigma_1^2)$ with $(1 - \alpha, \mu_2, \sigma_2^2)$. Here, if we define

$$\psi(\alpha, \theta) = \alpha(1 - \alpha)\{ (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2 \}, \qquad (16)$$

we find that

$$\psi(\alpha, \theta) = 0 \text{ if and only if } (\alpha, \mu_1, \mu_2, \sigma_1, \sigma_2) \in \Gamma_0, \qquad (17)$$

so that $\psi(\alpha, \theta)$ is an indicator of the region corresponding to the embedded model. However, because $\Gamma_0$ is not necessarily bounded, the property (17) is not in itself sufficient to guarantee consistency directly. The following result gives two conditions under either of which consistency is obtained.
Result 3
Suppose assumptions (a)–(c) are satisfied by $f$ as in result 2. Let $\psi(\alpha, \theta)$ be as defined in (16) and let $(\alpha^n, \theta^n)$ be the ML estimator using the model (8).

(i) If $\{X_n\}$ is drawn from the full model (8) with parameter value $(\alpha^0, \theta^0)$, then

$$\psi(\alpha^n, \theta^n) \to \psi(\alpha^0, \theta^0), \quad \text{w.p. 1}.$$

(ii) If $\{X_n\}$ is drawn from the location-scale model (7) with parameter value $(\mu^0, \sigma^0)$, then

$$\psi(\alpha^n, \theta^n) \to 0, \quad \text{w.p. 1}, \qquad (18)$$

provided either
(a) $\{ (\alpha^n, \theta^n) \}$ is bounded (w.p. 1), or
(b) $\psi_2^n \to (\sigma^0)^2$ (w.p. 1),
where $\psi_2$ is as defined in (14).

Proof. Case (i) is just the regular case; we therefore need only consider case (ii). In this latter case the assumptions allow theorem 1 to be applied to show that $\mathrm{dis}\{ (\alpha^n, \theta^n), \Gamma(\mu^0, \sigma^0) \} \to 0$ w.p. 1. It follows from property 1 that there is a sequence $\{ (\bar\alpha^n, \bar\theta^n) \} \subset \Gamma(\mu^0, \sigma^0)$ such that $|(\alpha^n, \theta^n) - (\bar\alpha^n, \bar\theta^n)| \to 0$, and, since $\psi(\bar\alpha^n, \bar\theta^n) = 0$ from the definition of $\Gamma(\mu^0, \sigma^0)$,

$$\psi(\alpha^n, \theta^n) = \psi(\alpha^n, \theta^n) - \psi(\bar\alpha^n, \bar\theta^n).$$

If $\{ (\alpha^n, \theta^n) \}$ is bounded, so is $\{ (\bar\alpha^n, \bar\theta^n) \}$. Then it is easy to see, from the uniform continuity of the mapping $\psi$ on any bounded domain, that (18) holds.
Now assume that $\psi_2^n \to (\sigma^0)^2$. If $\{ \bar\alpha^n \}$ does not converge to 1 or 0, then $0 < \bar\alpha^n < 1$, so that $\bar\mu_1^n = \bar\mu_2^n = \mu^0$, $\bar\sigma_1^n = \bar\sigma_2^n = \sigma^0$. Therefore $\{ (\bar\alpha^n, \bar\theta^n) \}$ is bounded, so that (18) is true.
Suppose that $\{ \bar\alpha^n \}$ converges to 1 or 0, say 1. Then $\bar\alpha^n = 1$, so that $\bar\mu_1^n = \mu^0$, $\bar\sigma_1^n = \sigma^0$. Therefore $\mu_1^n \to \mu^0$, $\sigma_1^n \to \sigma^0$. However $\psi_2^n \to (\sigma^0)^2$; that is, $\alpha^n(1 - \alpha^n)(\mu_1^n - \mu_2^n)^2 \to 0$ and $(1 - \alpha^n)(\sigma_2^n)^2 \to 0$. It follows that $\psi(\alpha^n, \theta^n) \to 0$. Thus (18) holds.
Result 3 gives two conditions under either of which $\hat\psi \to 0$ w.p. 1 if and only if $(\alpha, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) \in \Gamma_0$. The first condition, that $\{ (\alpha^n, \theta^n) \}$ be bounded, is satisfied if the parameters are bounded, a condition not too unreasonable in practice. As pointed out earlier, the normal mixture model is an example where the second condition is satisfied. This follows from (15), which shows that $\psi_2^n$ $(\equiv \hat\psi_2) = n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ in this case, so that $\psi_2^n \to (\sigma^0)^2$ (w.p. 1) if the sample is drawn from the straight normal distribution.
Neither condition is needed if a normalization of the parameters is first carried out. Let $\tilde\theta = (\tilde\mu_1, \tilde\mu_2, \tilde\sigma_1, \tilde\sigma_2)$, where

$$\tilde\mu_i = \mu_i/(1 + \mu_i^2)^{1/2}, \qquad \tilde\sigma_i = \sigma_i/(1 + \sigma_i), \qquad i = 1, 2, \qquad (19)$$

and let $\tilde\psi$ be precisely the same as $\psi$ except that $\mu_i$ and $\sigma_i$ are replaced by $\tilde\mu_i$ and $\tilde\sigma_i$ respectively $(i = 1, 2)$. Then an argument similar to that for the case where $\{ (\alpha^n, \theta^n) \}$ is bounded can be used to show that, as $\{ (\tilde\mu^n, \tilde\sigma^n) \}$ is bounded, we have the following.

Result 3′
Result 3 applies with $\psi$ replaced by $\tilde\psi$, without the need of the final conditions (a) or (b).

When result 3 or 3′ holds, $\psi$ provides a test statistic of the null hypothesis that $h = f_L$. Though the non-regularity is not entirely removed, the results do give us a practical means of carrying out the test of hypothesis. We give a parametric bootstrap method for doing this and report some numerical results in the next section.
4. Bootstrap test of the embedded model
For simplicity we attach a subscript to è to distinguish between models. Thus we write F(, è M ),
and F(, è E ) for the full mixture and embedded models respectively, and we write è0M or è0E to
indicate the true parameter value according to whether the full or embedded model is the correct
one.
When result 3 or 39 holds for the two-component mixture model (8), the parameter ø in (16)
provides a test of the null hypothesis
H 0 : F(, è E ),
è E unknown,
versus H 1 : F(, è M ), è M unknown:
(20)
~ converges w.p. 1 to the true value, ø0 ˆ 0 if and only if H 0 is true. The obvious form for a
as ø,
size á test is as follows.
Suppose the sample X is drawn from F(·, θ^0), and the full model is fitted, giving F(·, θ̂_M). Let ψ̂ be the statistic calculated from θ̂_M. The distribution of ψ̂, which depends on θ^0, will be denoted by G(·, θ^0). This distribution will differ depending on whether θ^0 = θ_M^0 or θ^0 = θ_E^0. The null distribution is G(·, θ_E^0). If this is known, we can compare the test statistic ψ̂ with γ(α), the (1 − α)100 percentile of G(·, θ_E^0), and reject the null if

  ψ̂ > γ(α).
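With a finite set of bootstrap replicates, γ(α) can be estimated by an order statistic. The following is a minimal sketch with our own helper names, using a simple "ceiling" percentile convention (one of several reasonable choices):

```python
import math

def critical_value(psi_boot, alpha):
    """Estimate gamma(alpha), the (1 - alpha)*100 percentile of the
    null distribution, from bootstrap statistics psi_boot."""
    s = sorted(psi_boot)
    # index of the smallest order statistic with at least a fraction
    # (1 - alpha) of the sample at or below it
    k = max(0, math.ceil((1.0 - alpha) * len(s)) - 1)
    return s[k]

def reject_null(psi_hat, psi_boot, alpha=0.05):
    """Reject H0 when the observed statistic exceeds gamma(alpha)."""
    return psi_hat > critical_value(psi_boot, alpha)
```

For example, with B = 199 bootstrap values and α = 0.05, the critical value is the 190th ordered replicate.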
Because of the non-regularity, the distribution G(·, θ_E^0) is difficult to obtain theoretically. However, parametric bootstrap sampling is an easy way to estimate G(·, θ_E^0). We draw B (bootstrap) samples X^(i), i = 1, 2, . . ., B, from F(·, θ̂_E) and obtain from these the estimates θ̂_E^(i), i = 1, 2, . . ., B, and, in turn from these, the test statistics ψ̂^(i), i = 1, 2, . . ., B. By construction these have distribution G(·, θ̂_E), and the empirical distribution function of the ψ̂^(i) therefore tends to G(·, θ̂_E) as B → ∞. The above theory gives conditions under which θ̂_E → θ_E^0 w.p. 1 when the embedded (null) model is correct. Strictly we need to show that this implies G(·, θ̂_E) → G(·, θ_E^0); however, because of the non-regular nature of the problem, this seems difficult to do in general. The same problem occurs with the parametric resampling method proposed by Feng & McCulloch (1996) for constructing confidence intervals. However, as these authors point out, this desirable property is only obtained in the asymptotic limit anyway. For finite samples, simulation studies are usually the easiest way to verify that the approximation is satisfactory.
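The resampling scheme just described can be sketched generically. In this sketch, `fit_null`, `draw_null`, and `stat` are our own placeholder hooks (the paper's versions would fit the embedded model by ML and compute ψ̂ from a fitted mixture); the absolute-skewness statistic in the demonstration is a stand-in for ψ̂, chosen only to keep the sketch self-contained:

```python
import random
import statistics

def bootstrap_null_stats(x, fit_null, draw_null, stat, B=199, seed=0):
    """Draw B samples from the fitted embedded model F(., theta^_E) and
    return the statistics psi^(1), ..., psi^(B); their EDF estimates
    the null distribution G(., theta^_E)."""
    rng = random.Random(seed)
    theta_e = fit_null(x)                       # theta^_E from the data
    return [stat(draw_null(theta_e, len(x), rng)) for _ in range(B)]

# --- stand-in components for a N(mu, sigma^2) embedded model ---
def fit_normal(x):
    return statistics.mean(x), statistics.pstdev(x)

def draw_normal(theta, n, rng):
    mu, sd = theta
    return [rng.gauss(mu, sd) for _ in range(n)]

def abs_skew(x):
    # illustrative statistic only -- NOT the paper's psi of eq. (16)
    m, s = statistics.mean(x), statistics.pstdev(x)
    return abs(sum((v - m) ** 3 for v in x) / (len(x) * s ** 3))
```

The observed statistic on the data would then be compared with the (1 − α)100 percentile of the returned bootstrap values.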
It is hoped to discuss the effectiveness of such tests more fully elsewhere, but as a brief example we present some simulation results for the mixture of two normals for sample sizes n = 20 and 50. In each case ψ̂, the ML estimator of (16), was used as the test statistic. Its distribution was constructed as follows. First a sample of size n was drawn from N(0, 1) and the ML estimates μ̂ and σ̂ obtained. Then 1000 bootstrap samples, each of size n, were drawn from N(μ̂, σ̂^2). For each bootstrap sample the mixture model (8) was fitted by the EM algorithm and ψ̂ then calculated. The EDF of the resulting 1000 such ψ̂s then estimates G(·, μ̂, σ̂), which in turn estimates G(·, 0, 1).
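The EM fitting step for the two-component normal mixture can be sketched as follows. This is our own minimal implementation (fixed iteration count, crude moment-based starting values, and a floor on the standard deviations to avoid the well-known degenerate likelihood spikes), not the authors' code:

```python
import math
import random

def npdf(v, m, s):
    """Density of N(m, s^2) at v."""
    return math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def em_two_normals(x, iters=100, s_floor=1e-3):
    """Fit a*N(m1, s1^2) + (1 - a)*N(m2, s2^2) to the data x by EM."""
    n = len(x)
    mean = sum(x) / n
    sd = max(s_floor, math.sqrt(sum((v - mean) ** 2 for v in x) / n))
    # starting values: equal weights, means one sd either side of the mean
    a, m1, m2, s1, s2 = 0.5, mean - sd, mean + sd, sd, sd
    for _ in range(iters):
        # E-step: posterior probability each point came from component 1
        r = []
        for v in x:
            p1 = a * npdf(v, m1, s1)
            p2 = (1.0 - a) * npdf(v, m2, s2)
            r.append(p1 / (p1 + p2) if p1 + p2 > 0.0 else 0.5)
        # M-step: weighted moment updates
        w = sum(r)
        a = w / n
        m1 = sum(ri * v for ri, v in zip(r, x)) / w
        m2 = sum((1.0 - ri) * v for ri, v in zip(r, x)) / (n - w)
        s1 = max(s_floor, math.sqrt(sum(ri * (v - m1) ** 2 for ri, v in zip(r, x)) / w))
        s2 = max(s_floor, math.sqrt(sum((1.0 - ri) * (v - m2) ** 2 for ri, v in zip(r, x)) / (n - w)))
    return a, m1, m2, s1, s2
```

In the simulation described above, a routine of this kind would be run once per bootstrap sample, with ψ̂ then computed from the fitted parameters.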
To illustrate how the variability of μ̂, σ̂^2 affects the variability of G(·, μ̂, σ̂), m (= 100) such EDFs were calculated and are shown in Figs 1 and 2. Selected percentage points, together with an indication of their spread, are given in Table 1.
It will be seen that the spread of estimates appears quite narrow. This is confirmed when we consider the power of the statistic in detecting alternatives. To study this, EDFs were also constructed in exactly the way just given, except that for each EDF the initial sample was drawn from the mixture model 0.7N(0, 1) + 0.3N(μ_2, 1) and the bootstrap samples drawn from α̂N(μ̂_1, σ̂_1^2) + (1 − α̂)N(μ̂_2, σ̂_2^2), where α̂, μ̂_1, μ̂_2, σ̂_1, σ̂_2 were the ML estimates obtained from the initial sample. For n = 20 and 50, Figs 1 and 2 show m = 50 such EDFs for each of the cases μ_2 = 1, 2, 3. As anticipated by the values of ψ^0 (= 0.21, 0.84, 1.89 when μ_2 = 1, 2, 3 respectively), the power is low when μ_2 = 1. This is not surprising, as the difference in shape between the models 0.7N(0, 1) + 0.3N(1, 1) and N(0, 1) is small compared with the inherent variability in the relatively small samples considered, and so is probably difficult to detect by any method. The difference between the models becomes marked as μ_2 increases, and when μ_2 = 3 the power of the test seems quite satisfactory.

Fig. 1. EDFs of ψ̂, each constructed from 1000 samples of size 20. The number of EDFs shown is 100 for the initial model N(0, 1) and 50 for each of the initial models 0.7N(0, 1) + 0.3N(μ_2, 1), μ_2 = 1, 2, 3.

Fig. 2. EDFs of ψ̂, each constructed from 1000 samples of size 50. The number of EDFs shown is 100 for the initial model N(0, 1) and 50 for each of the initial models 0.7N(0, 1) + 0.3N(μ_2, 1), μ_2 = 1, 2, 3.

Table 1. Selected percentage points γ(α) of ψ̂ for the N(0, 1) model, estimated from 100 EDFs each constructed from 1000 samples of size n

           α      γmin(α)   γ(α)    γmax(α)   SD
  n = 20   0.10   0.744     0.787   0.833     0.018
           0.05   0.852     0.900   0.947     0.021
           0.01   1.031     1.136   1.234     0.040
  n = 50   0.10   0.553     0.585   0.618     0.013
           0.05   0.626     0.666   0.704     0.015
           0.01   0.755     0.809   0.875     0.027
References

Berman, M. (1986). Some unusual examples of extrema associated with hypothesis tests when nuisance parameters are present under the alternative. In Proceedings of the Pacific statistics congress (eds I. S. Francis, B. F. J. Manly & F. C. Lam), North Holland, Amsterdam.
Cheng, R. C. H. & Amin, N. A. K. (1983). Estimating parameters in continuous univariate distributions with a shifted origin. J. Roy. Statist. Soc. Ser. B 45, 394–403.
Cheng, R. C. H. & Traylor, L. (1995). Non-regular maximum likelihood problems. J. Roy. Statist. Soc. Ser. B 57, 3–44.
Feng, Z. D. & McCulloch, C. E. (1996). Using bootstrap likelihood ratios in finite mixture models. J. Roy. Statist. Soc. Ser. B 58, 593–608.
Ghosh, J. K. & Sen, P. K. (1985). On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proceedings of the Berkeley symposium in honor of J. Neyman and J. Kiefer (eds L. LeCam & R. A. Olshen), Vol. II, 789–806, Wadsworth, New York.
Hartigan, J. A. (1985). A failure of likelihood asymptotics for the mixture model. In Proceedings of the Berkeley symposium in honor of J. Neyman and J. Kiefer (eds L. LeCam & R. A. Olshen), Vol. II, 807–810, Wadsworth, New York.
Redner, R. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann. Statist. 9, 225–228.
Richardson, S. & Green, P. J. (1997). A Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59, 473–484.
Smith, R. L. (1989). A survey of nonregular problems. Proceedings of the International Statistical Institute conference, 47th session, Paris, 353–372.
Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley, Chichester.
Wald, A. (1949). Note on the consistency of the maximum likelihood estimates. Ann. Math. Statist. 20, 595–601.
Received April 1998, in final form October 2000

Wenbin Liu, Canterbury Business School, University of Kent, Canterbury CT2 7PE, UK.