© Board of the Foundation of the Scandinavian Journal of Statistics 2001. Published by Blackwell Publishers Ltd, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA. Vol. 28: 603–616, 2001

The Consistency of Estimators in Finite Mixture Models

R. C. H. CHENG and W. B. LIU
University of Kent

ABSTRACT. The parameters of a finite mixture model cannot be consistently estimated when the data come from an embedded distribution with fewer components than that being fitted, because the distribution is represented by a subset of the parameter space, and not by a single point. Feng & McCulloch (1996) give conditions, not easily verified, under which the maximum likelihood (ML) estimator will converge to an arbitrary point in this subset. We show that these conditions can be considerably weakened. Even though embedded distributions may not be uniquely represented in the parameter space, estimators of quantities of interest, like the mean or variance of the distribution, may nevertheless be consistent in the conventional sense. We give an example of some practical interest where the ML estimators are $\sqrt{n}$-consistent. Similarly, consistent statistics can usually be found to test a simpler model against a full model. We suggest a test statistic suitable for a general class of models and propose a parametric bootstrap test, based on this statistic, for when the simpler model is correct.

Key words: embedded model, indeterminacy, maximum likelihood, parametric bootstrap

1. Introduction

Finite mixture models have been much studied both theoretically and practically, and a very thorough review has been given by Titterington et al. (1985). However, one problem that gives rise to serious theoretical and practical difficulties has not been so well investigated. It occurs when data come from a mixture model with fewer components than the model actually being fitted.
The test of the null hypothesis that there are fewer components than under the alternative is then non-regular, because certain parameters in the alternative are indeterminate under the null. Moreover, under the null, the indeterminate parameters cannot be consistently estimated. This problem of parameter indeterminacy has been reviewed by Smith (1989) and by Cheng & Traylor (1995), who focus on the hypothesis testing problem. In this paper we consider mainly the parameter estimation problem. As in the case of hypothesis testing, the estimation problem occurs because the model being considered has more components than are actually needed. The true model is then an embedded special case and is not uniquely defined in the parameter space embracing the full model. It is therefore represented not by a single point but by any point of a certain subset of the parameter space. This complication was first discussed by Redner (1981). However, Redner's work did not address the case where the estimator of a mixture model approaches a boundary point of the parameter space, which is actually very important in applications. Redner's theorems 3–4 are based directly on Wald's technique, which has two essential parts: one dealing with any compact interior set and the other handling boundary points by assuming that the density function goes to zero whenever the parameters approach a boundary point. As the mixture density function may not tend to zero when the parameters approach a boundary point of the (e.g. unbounded) parameter space, Wald's result and approach (and therefore Redner's) cannot be adopted directly. That seems to be why Redner, in theorems 5–6, did not address this important issue, instead assuming that the estimator remains in a compact subset. In a recent paper, Feng & McCulloch (1996) establish a consistency result for finite mixture models in this case. They give conditions under which the vector of maximum likelihood (ML) estimators of the parameters will converge to an arbitrary point in the subset representing the true model, allowing the estimators to approach a boundary point of the parameter space. However, their result assumes some complicated conditions, including uniform high-order differentiability in neighbourhoods of the indeterminate sets, which are rather difficult to check, so that even in relatively simple cases it is not easily established whether the result holds. We show that the classic consistency argument given by Wald (1949) can be extended to cover this situation. In our work we apply a compactification technique: we first compactify the parameter space by including all the boundary points in an extended parameter space, and then show that the mixture density function can be extended continuously to this new space. We then extend Wald's technique to the extended parameter space. The conditions under which consistency is shown to hold by this approach are considerably weaker and, in particular, do not require differentiability. These conditions are much easier to check, and we give explicit examples covering mixtures of location-scale models and mixtures of exponential models, including such cases as mixtures of normals. The proof is also short. The problem can alternatively be dealt with by reparameterization of the model, so that each subset of points representing a given embedded model is mapped into a single point. Though such reparameterizations do exist, they are model dependent and are in general difficult to define simply. Moreover, such a reparameterization may not guarantee the complete removal of the non-regularity, so the question of when estimators are efficient remains in general unresolved.
We do not therefore discuss such reparameterizations in general here; however, we do give an example of some practical interest where estimators of important parameters are $\sqrt{n}$-consistent irrespective of whether the true model is the embedded or the full model. Ghosh & Sen (1985) mention the possibility of reparameterization but do not pursue the idea, as their results require differentiability conditions. Their suggested reparameterization (1985, p. 791) does not have the desired properties. The likelihood ratio, which is not consistent because of the non-regularity, cannot be used to test for an embedded model. However, consistent statistics can usually be found, and we give an example that can be used for a wide class of mixture models. We suggest a parametric bootstrapping method, using this statistic, for carrying out the test and report numerical results where the statistic has performed satisfactorily.

2. Consistency of maximum likelihood estimators

Let $L^1$ and $\mathcal{B}$ denote the following spaces of integrable functions on the interval $(-\infty, \infty)$:
$$L^1 = \Big\{f : f \text{ measurable},\ \|f\| = \int_{-\infty}^{\infty} |f|\,dx < \infty\Big\},$$
$$\mathcal{B} = \{f : f \in L^1,\ \|f\| = 1,\ f \geq 0\}.$$
Let $f_1, f_2 \in L^1$. It follows that $f_1 = f_2$ in $L^1$ if and only if $f_1(x) = f_2(x)$ almost everywhere in $R^1$. Let $\Omega_1, \Omega_2$ be two closed sets in $R^m$. We define a metric between the two sets as follows:
$$\mathrm{dis}(\Omega_1, \Omega_2) = \mathrm{dis}(\Omega_2, \Omega_1) = \inf_{y\in\Omega_2}\inf_{x\in\Omega_1} |x - y|.$$
When $\Omega_1, \Omega_2$ are singleton sets (i.e. single points), this metric agrees with the classic Euclidean distance. The following is used later.

Property 1
(i) $\mathrm{dis}(\Omega_1, \Omega_2) = 0$ if and only if there are sequences of points, $\{x_n\}$ in $\Omega_1$ and $\{y_n\}$ in $\Omega_2$, such that $|x_n - y_n| \to 0$ as $n \to \infty$.
(ii) $\mathrm{dis}(x_n, \Omega) \to 0$ if and only if there is a sequence $\{y_n\}$ of points in $\Omega$ such that $|x_n - y_n| \to 0$ as $n \to \infty$.
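For finite (or discretized) sets this distance can be computed directly. The sketch below is our own illustration, not from the paper; it evaluates $\mathrm{dis}$ by minimizing the pairwise Euclidean distance.

```python
import math

def dis(omega1, omega2):
    """Distance between two finite (or discretized) sets in R^m:
    the infimum of |x - y| over x in omega1, y in omega2."""
    return min(math.dist(x, y) for x in omega1 for y in omega2)

# For singleton sets the metric reduces to ordinary Euclidean distance.
p1 = [(0.0, 0.0)]
p2 = [(3.0, 4.0)]
print(dis(p1, p2))  # 5.0

# Note dis = 0 does not require the sets to share a point: by
# property 1(i) it only requires sequences that approach each other.
```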
Consider the following mixture model:
$$h(x, \alpha, \theta) = \sum_{i=1}^{k} \alpha_i f_i(x, \theta_i),$$
where $\alpha \in A = \{(\alpha_1, \alpha_2, \ldots, \alpha_k) : \alpha_i \geq 0\ (i = 1, \ldots, k),\ \Sigma\alpha_i = 1\}$, $\theta \in \Theta = \{(\theta_1, \theta_2, \ldots, \theta_k) : \theta_i \in \Theta_i\ (i = 1, \ldots, k)\}$, and the $\Theta_i$ $(i = 1, \ldots, k)$ are closed convex sets belonging to $R^p$. Let $\Gamma = A \times \Theta$. For any given $(\alpha^0, \theta^0) \in \Gamma$ such that $h(\cdot, \alpha^0, \theta^0) \in \mathcal{B}$, we define the set
$$\Gamma(\alpha^0, \theta^0) = \{(\alpha, \theta) : (\alpha, \theta) \in \Gamma \text{ and } h(\cdot, \alpha, \theta) = h(\cdot, \alpha^0, \theta^0)\}.$$
It can be shown that $\Gamma(\alpha^0, \theta^0)$ is a closed set if certain regularity conditions, for example assumption (c) below, hold. $\Gamma(\alpha^0, \theta^0)$, however, is not a singleton set. There are two distinct ways in which such non-uniqueness arises. The first occurs because the same set of parameter values can be ascribed to different permutations of the mixture components. The problem therefore occurs because the full parameter space is made up of a number of identical copies of what is in essence the same subspace. This problem is well known and is frequently referred to as the interchangeability problem. It can be handled simply by placing an order restriction on the values of just one of the parameter components; see for example Richardson & Green (1997). For simplicity of exposition and notation we do not include this order restriction explicitly, but simply agree to regard all such permutation copies as one and the same. The second problem of non-uniqueness is much more serious. Ignoring permutations, there are other cases when $\Gamma(\alpha^0, \theta^0)$ is not a singleton set but is actually a continuum of parameter values, when, for any other point $(\alpha^1, \theta^1) \in \Gamma(\alpha^0, \theta^0)$ in this continuum, we have $h(\cdot, \alpha^1, \theta^1) = h(\cdot, \alpha^0, \theta^0)$. Thus $h(\cdot, \alpha^0, \theta^0)$ does not have a unique representation in $\Gamma$. This happens whenever $(\alpha^0, \theta^0)$ has a value for which $h(x, \alpha^0, \theta^0)$ takes the reduced form
$$h(x, \alpha^0, \theta^0) = \sum_{i=1}^{l} \alpha_{m_i} f_{m_i}(x, \theta_{m_i}),$$
with $l \leq k - 1$, $\theta_{m_i} \in \Theta_{m_i}$.
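The continuum of representations is easy to see numerically. A minimal sketch (our own illustration, using a two-component normal mixture) checks that several distinct parameter points all yield the same embedded single-normal density:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def h(x, alpha, th1, th2):
    """Two-component normal mixture; th_i = (mu_i, sigma_i)."""
    return alpha * normal_pdf(x, *th1) + (1 - alpha) * normal_pdf(x, *th2)

# The embedded N(0,1) is represented by a continuum of parameter
# points: any alpha with theta1 = theta2, and alpha = 1 with theta2
# arbitrary, give the same density h.
for x in (-2.0, -0.5, 0.0, 1.3):
    a = h(x, 0.3, (0.0, 1.0), (0.0, 1.0))
    b = h(x, 0.8, (0.0, 1.0), (0.0, 1.0))
    c = h(x, 1.0, (0.0, 1.0), (5.0, 2.0))
    assert abs(a - b) < 1e-12 and abs(a - c) < 1e-12
```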
Clearly the ML (or, for that matter, any other) estimator will not be consistent in this situation in the sense of converging to a unique point $(\alpha^0, \theta^0)$. If $(\alpha^0, \theta^0)$ is the true parameter value in the sense that $h(\cdot, \alpha^0, \theta^0)$ is the correct model, then we can at best only expect the ML estimator to converge to $\Gamma(\alpha^0, \theta^0)$ in, for example, the Hausdorff distance. This is essentially theorem 1 below. We now give sufficient conditions for the theorem to hold. These conditions are convenient to check but not necessarily a minimal set. For any $1 \leq i \leq k$, let $\theta_i^0 \in \Theta_i$ and write $E_{\theta_i^0}(g) = \int_{-\infty}^{\infty} g(x) f_i(x, \theta_i^0)\,dx$.

Assumption (a)
Let $1 \leq i \leq k$. For any $\theta_i \in \Theta_i$, $f_i(\cdot, \theta_i) \in \mathcal{B}$. The support of $f_i$ is independent of $\theta_i$. Furthermore $f_i(\cdot, \theta_i^1) = f_i(\cdot, \theta_i^2)$ in $\mathcal{B}$ only if $\theta_i^1 = \theta_i^2$.

Assumption (b)
Let $1 \leq i \leq k$, and $g_i(x, \theta_i) = \max\{f_i(x, \theta_i), 1\}$. For any $\theta_i \in \Theta_i$,
$$E_{\theta_i^0}[\log\{f_i(\theta_i)\}] > -\infty$$
on the support of $f_i$, and
$$E_{\theta_i^0}[\log\{g_i(\theta_i)\}] < \infty.$$
Also
$$E_{\theta_i^0}\Big[\log\Big\{\sup_{\theta_i\in\Theta_i,\,|\theta_i-\theta_i^0|\leq r} g_i(\theta_i)\Big\}\Big] < \infty \quad \text{for } r > 0 \text{ sufficiently small},$$
and
$$E_{\theta_i^0}\Big[\log\Big\{\sup_{\theta_i\in\Theta_i,\,|\theta_i|\geq r} g_i(\theta_i)\Big\}\Big] < \infty \quad \text{for } r > 0 \text{ sufficiently large}.$$

Assumption (c)
Let $1 \leq i \leq k$. For almost every fixed $x \in R$,
$$\lim_{|\theta_i|\to\infty} f_i(x, \theta_i) = 0,$$
and
$$\lim_{\theta_i\to\theta_i^0} f_i(x, \theta_i) = f_i(x, \theta_i^0), \quad \text{if } \theta_i, \theta_i^0 \in \Theta_i.$$

We use the following two lemmas in the proof of theorem 1. The first lemma shows that assumption (b) holds for $h$ as well if it does so for the $f_i$. For any $g \in L^1$, let $E_{(\alpha^0,\theta^0)}\{g(x)\} = \int_{-\infty}^{\infty} g(x)\,h(x, \alpha^0, \theta^0)\,dx$.

Lemma 1
For any $(\alpha, \theta) \in \Gamma$, assumption (b) holds with $k = 1$, $\theta_1$ replaced by $(\alpha, \theta)$, $\theta_1^0$ by $(\alpha^0, \theta^0)$ and $f_1(\cdot, \theta_1)$ by $h(\cdot, \alpha, \theta)$.

Proof. The desired results follow from the following inequalities:
$$\alpha\log(f_1) + (1-\alpha)\log(f_2) \leq \log\{\alpha f_1 + (1-\alpha)f_2\},$$
and, for $f_1, f_2 \geq 1$,
$$\log\{\alpha f_1 + (1-\alpha)f_2\} \leq \log 2 + \log(f_1) + \log(f_2).$$
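The two inequalities in this proof are elementary: the first is concavity of the logarithm, and the second uses $\alpha f_1 + (1-\alpha)f_2 \leq 2 f_1 f_2$ when $f_1, f_2 \geq 1$. They can be spot-checked numerically:

```python
import math

# Numerical spot-check of the two inequalities in the proof of lemma 1,
# at arbitrary values f1, f2 >= 1 and mixing weight alpha in [0, 1].
alpha, f1, f2 = 0.3, 2.5, 7.0
lhs = alpha * math.log(f1) + (1 - alpha) * math.log(f2)
mid = math.log(alpha * f1 + (1 - alpha) * f2)
rhs = math.log(2) + math.log(f1) + math.log(f2)
assert lhs <= mid    # concavity of log
assert mid <= rhs    # alpha*f1 + (1-alpha)*f2 <= 2*f1*f2 when f1, f2 >= 1
```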
The second lemma extends a well-known inequality for density functions to non-density functions.

Lemma 2
Let $C = \{f \in L^1 : \|f\| < 1,\ f > 0\}$. Then for any $f \in C$ and $g \in \mathcal{B}$,
$$\int_{-\infty}^{\infty} \log(f/g)\,g\,dx < 0.$$

Proof. The inequality follows from Jensen's inequality.

We denote ML estimators not only by the conventional $(\hat\alpha, \hat\theta)$ but, when discussing the convergence of sequences of ML estimators, also by $(\alpha^n, \theta^n)$; in the latter case the circumflex is suppressed for simplicity. Note that for simplicity of terminology we refer to $(\alpha^n, \theta^n)$ as the ML estimator. However, for any given $n$, $(\alpha^n, \theta^n)$ is not necessarily unique, so it can actually be any one of a set of possible choices. The point, however, as the following theorem makes clear, is that any one of these values is allowed.

Theorem 1
Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $h_0(x) = h(x, \alpha^0, \theta^0)$ satisfying assumptions (a)–(c). Assume that the parameter space $\Gamma$ is a closed convex set. Let the ML estimator of the model be $(\alpha^n, \theta^n)$. Then
$$\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0)\} \to 0 \quad \text{w.p. 1}.$$

Proof. Without loss of generality, we can assume that $\Gamma$ is compact. This is because we can always extend $\Theta_i$ to include all its infinite cluster points so that it is a compact set. Each set in $\Gamma$ then consists of finite points and these infinite cluster points. At any such infinite cluster point, $h$ can be uniquely defined as a result of assumption (c). At such cluster points $h$ is not necessarily zero but always has $\int_{-\infty}^{\infty} h \leq 1$, from assumption (c). We adopt the approach used in Wald (1949). First we show that
$$P\Big\{\lim_{n\to\infty}\ \sup_{(\alpha,\theta)\in S}\ \frac{h(X_1, \alpha, \theta)\,h(X_2, \alpha, \theta)\cdots h(X_n, \alpha, \theta)}{h_0(X_1)\,h_0(X_2)\cdots h_0(X_n)} = 0\Big\} = 1, \qquad (1)$$
where $S$ is any closed subset of $\Gamma$ such that $\mathrm{dis}\{S, \Gamma(\alpha^0, \theta^0)\} > 0$.
To show (1), we only need to confirm that for each point $(\alpha^*, \theta^*) \in S$ there is always a neighbourhood $N(\alpha^*, \theta^*)$ of the point such that
$$E_{(\alpha^0,\theta^0)}\Big[\log\Big\{\sup_{(\alpha,\theta)\in N(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\}\Big] < E_{(\alpha^0,\theta^0)}[\log\{h(\alpha^0, \theta^0)\}]. \qquad (2)$$
To this end, first assume that $(\alpha^*, \theta^*)$ is a finite point. Let $\{N_i(\alpha^*, \theta^*)\}$ $(i = 1, 2, \ldots)$ be a sequence of decreasing neighbourhoods of the point $(\alpha^*, \theta^*)$ such that $\bigcap_{i\geq 1} N_i(\alpha^*, \theta^*) = (\alpha^*, \theta^*)$. Without loss of generality, we can assume that $E_{(\alpha^0,\theta^0)}\log\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\}$ exists for $i = 1, 2, \ldots$, owing to assumption (b). It follows from our assumptions that
$$\lim_{i\to\infty}\log\Big\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} = \log\{h(\alpha^*, \theta^*)\}.$$
It is also clear that we have
$$\lim_{i\to\infty} E_{(\alpha^0,\theta^0)}\log\Big\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} \geq E_{(\alpha^0,\theta^0)}\log\{h(\alpha^*, \theta^*)\}. \qquad (3)$$
The sequence $\{E_{(\alpha^0,\theta^0)}\log\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\}\}$ is clearly decreasing, so that
$$\log\Big\{\sup_{(\alpha,\theta)\in N_1(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} - \log\Big\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} \geq 0.$$
Then it follows from Fatou's lemma and (3) that
$$\lim_{i\to\infty} E_{(\alpha^0,\theta^0)}\log\Big\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} = E_{(\alpha^0,\theta^0)}\log\{h(\alpha^*, \theta^*)\} < E_{(\alpha^0,\theta^0)}\log\{h(\alpha^0, \theta^0)\}.$$
Therefore (2) follows when $(\alpha^*, \theta^*)$ is a finite point. Now let $(\alpha^*, \theta^*)$ be an infinite point. We first show that (2) is true when $N(\alpha^*, \theta^*)$ degenerates into the single point $(\alpha^*, \theta^*)$. From our assumption we know $h(\alpha^*, \theta^*)$ has the following form:
$$h(x, \alpha^*, \theta^*) = \sum_{i=1}^{l} \alpha_{m_i} f_{m_i}(x, \theta_{m_i}),$$
where $0 \leq l \leq k-1$ and $\alpha_{m_i} f_{m_i}(\cdot, \theta_{m_i}) > 0$. If $\sum_{i=1}^{l}\alpha_{m_i} < 1$, then from lemma 2,
$$E_{(\alpha^0,\theta^0)}\log\{h(\alpha^*, \theta^*)\} < E_{(\alpha^0,\theta^0)}\log h(\alpha^0, \theta^0).$$
Otherwise, we only need to show that $h(\alpha^*, \theta^*) \neq h(\alpha^0, \theta^0)$. Assume this is not true (here we cannot yet say that $(\alpha^*, \theta^*) \in \Gamma(\alpha^0, \theta^0)$, as it is not a finite point of $\Gamma$). Then $(\alpha^*, \theta^*)$ is the limiting point of the sequence $\{(\alpha_1^s, \ldots, \alpha_k^s) \times (\theta_1^s, \ldots, \theta_k^s)\} \subset \Gamma(\alpha^0, \theta^0)$, where
$$\alpha_j^s = \alpha_j \text{ if } j = m_i, \text{ otherwise } \alpha_j^s = 0; \qquad \theta_j^s = \theta_j \text{ if } j = m_i, \text{ otherwise } \theta_j^s \to \infty.$$
This is impossible, as we have $\mathrm{dis}\{S, \Gamma(\alpha^0, \theta^0)\} > 0$. Now let $N_i(\alpha^*, \theta^*)$ be a sequence of decreasing neighbourhoods of the point $(\alpha^*, \theta^*)$ such that $\bigcap_i N_i(\alpha^*, \theta^*) = (\alpha^*, \theta^*)$. Again from lemma 1 and Fatou's lemma,
$$\lim_{i\to\infty} E_{(\alpha^0,\theta^0)}\log\Big\{\sup_{(\alpha,\theta)\in N_i(\alpha^*,\theta^*)} h(\alpha, \theta)\Big\} \leq E_{(\alpha^0,\theta^0)}\log\{h(\alpha^*, \theta^*)\} < E_{(\alpha^0,\theta^0)}\log\{h(\alpha^0, \theta^0)\}.$$
Therefore we have proved (2). Then, by using the Heine–Borel finite open cover theorem and the same technique as used in the proof of theorem 1 of Wald (1949), (1) follows. From (1) we can show that $\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0)\} \to 0$ w.p. 1, by using the same arguments as in the proof of theorem 2 of Wald (1949) but changing $|\theta - \theta^0|$ into $\mathrm{dis}\{(\alpha, \theta), \Gamma(\alpha^0, \theta^0)\}$.

Theorem 1 reduces to the usual form of consistency for the case where $\Gamma(\alpha^0, \theta^0)$ is a singleton set, and is therefore a natural extension of well-established consistency results for ML estimators. It also implies the convergence results of Feng & McCulloch (1996). Of the three assumptions, (b) is usually the most tedious to verify. The following gives a simpler condition that is sufficient for assumption (b) to hold.

Corollary 1
Let the $f_i$ satisfy assumptions (a) and (c). Suppose $|f_i(x, \theta)| \leq M(x)$ for all $1 \leq i \leq k$, where $E_{\theta_i^0} M < \infty$ and $E_{\theta_i^0}[\log\{f_i(\theta_i)\}] > -\infty$. Then theorem 1 holds.

Theorem 1 shows that when the true model is an indeterminate case, the ML estimator converges in Hausdorff distance towards the indeterminate set of points $\Gamma(\alpha^0, \theta^0)$ defining the model. This does not in itself guarantee that the fitted model tends to the true model, which is what is really needed. This important point is not discussed by Feng & McCulloch (1996), but is dealt with in our case by the following theorem, which shows that assumption (c) guarantees the continuity of $h$ under the Hausdorff distance.

Theorem 2
Let assumption (c) hold.
Assume that the parameter space $\Gamma$ is a closed convex set. If $\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0)\} \to 0$, then
$$h(x, \alpha^n, \theta^n) \to h(x, \alpha^0, \theta^0), \quad \text{a.e.}$$
Further, if there is a function $M(x)$, integrable on any finite interval of $R^1$, such that $|h(x, \alpha^n, \theta^n)| \leq M(x)$, then
$$H(x, \alpha^n, \theta^n) \to H(x, \alpha^0, \theta^0),$$
where $H(x, \alpha, \theta)$ is the distribution function of the random variable with density function $h(x, \alpha, \theta)$.

Proof. Assume that $\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\alpha^0, \theta^0)\} \to 0$. Then there is a sequence of points in $\Gamma(\alpha^0, \theta^0)$, denoted by $(\alpha^{0n}, \theta^{0n})$, such that $|\alpha^{0n} - \alpha^n| + |\theta^{0n} - \theta^n| \to 0$. Therefore
$$h(x, \alpha^n, \theta^n) - h(x, \alpha^0, \theta^0) = h(x, \alpha^n, \theta^n) - h(x, \alpha^{0n}, \theta^{0n}) \to 0, \quad \text{a.e.}$$
from assumption (c) and property 1. Now assume that there is a local majorizing function $M(x)$ for the sequence of densities. It then follows that, for any fixed $x$ and $R > 0$, we have $\int_{-R}^{x} h(t, \alpha^n, \theta^n)\,dt \to \int_{-R}^{x} h(t, \alpha^0, \theta^0)\,dt$ and $\int_{-R}^{R} h(t, \alpha^n, \theta^n)\,dt \to \int_{-R}^{R} h(t, \alpha^0, \theta^0)\,dt$. For any $\delta > 0$, let $R$ be sufficiently large so that $\int_{-R}^{R} h(t, \alpha^0, \theta^0)\,dt > 1 - \delta/2$. Then there is an $N > 0$ such that for $n > N$, $\int_{-R}^{R} h(t, \alpha^n, \theta^n)\,dt > 1 - \delta$. Since $\int_{-\infty}^{\infty} h(t, \alpha^n, \theta^n)\,dt = 1$, it follows that, for any $\delta > 0$, there are $R > 0$ and $N > 0$ such that
$$\int_{-\infty}^{-R} h(t, \alpha^n, \theta^n)\,dt + \int_{R}^{\infty} h(t, \alpha^n, \theta^n)\,dt < \delta.$$
It follows that for any fixed $x$,
$$\int_{-\infty}^{x} h(t, \alpha^n, \theta^n)\,dt \to \int_{-\infty}^{x} h(t, \alpha^0, \theta^0)\,dt.$$
This proves the second conclusion of theorem 2.

The existence of a local majorizing function is a weak requirement. It holds, for example, when $h(\cdot, \alpha^n, \theta^n)$ is bounded from above. From theorems 1 and 2 we can show the following existence result for ML estimators in mixture models.

Theorem 3
Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $h(x, \alpha^0, \theta^0)$ satisfying assumptions (a)–(c). Assume that the parameter space $\Gamma$ is a closed convex set. Let the ML estimator of the model be $(\alpha^n, \theta^n)$. Then
$$\Pr\{\exists N > 0 \text{ such that } (\alpha^n, \theta^n) \text{ exists for all } n > N\} = 1.$$

Proof. Suppose that for some $n$ and sample $X_i$, $(\alpha^n, \theta^n)$ does not exist. Let $(\alpha^{nm}, \theta^{nm})$, $m = 1, 2, \ldots$, be the maximizing sequence of the likelihood function
$$L(X, \alpha, \theta) = h(X_1, \alpha, \theta)\,h(X_2, \alpha, \theta)\cdots h(X_n, \alpha, \theta),$$
i.e. $\lim_{m\to\infty} L(X, \alpha^{nm}, \theta^{nm}) = \sup_{(\alpha,\theta)\in\Gamma} L(X, \alpha, \theta)$. Then $\{(\alpha^{nm}, \theta^{nm})\}$ must lie outside $\{(\alpha, \theta) : \mathrm{dis}[(\alpha, \theta), \Gamma_0] \leq 1\} \cap \Gamma$ when $m$ is large enough, from theorem 2; here $\Gamma_0$ denotes $\Gamma(\alpha^0, \theta^0)$. Therefore we can find a parameter $(\tilde\alpha^n, \tilde\theta^n)$ such that $\mathrm{dis}\{(\tilde\alpha^n, \tilde\theta^n), \Gamma_0\} > 1$ and $|L(X, \tilde\alpha^n, \tilde\theta^n)/L(X, \alpha^0, \theta^0)| > 1/2$. Therefore $\sup_{\mathrm{dis}\{(\alpha,\theta),\Gamma_0\}\geq 1} |L(X, \alpha, \theta)/L(X, \alpha^0, \theta^0)| > 1/2$. It then follows that
$$\{\exists \text{ infinitely many } n \text{ for which } (\alpha^n, \theta^n) \text{ does not exist}\} \subset \Big\{\exists \text{ infinitely many } n \text{ such that } \sup_{\mathrm{dis}\{(\alpha,\theta),\Gamma_0\}\geq 1} |L(X, \alpha, \theta)/L(X, \alpha^0, \theta^0)| > 1/2\Big\}.$$
The second is a null event, by (1) in the proof of theorem 1.

We now give examples of the application of the above results.

Example 1. Special cases of the following very simple model have been much studied (see Hartigan, 1985; Berman, 1986; Smith, 1989, for examples), though mainly from the point of view of likelihood ratio tests. Let
$$\Theta = \{\theta : 0 \leq \theta < \infty\}, \qquad (4)$$
and $\beta = (\alpha, \theta) \in [0, 1] \times \Theta$. Suppose that for any $\theta \in \Theta$, $f(\cdot, \theta) \in \mathcal{B}$. Let $f_0(x) = f(x, 0)$, and $E(g) = \int_{-\infty}^{\infty} g f_0\,dx$. Consider the mixture model:
$$h(\alpha, \theta, x) = (1-\alpha)f(x, 0) + \alpha f(x, \theta). \qquad (5)$$
Indeterminacy occurs when either $\alpha = 0$ or $\theta = 0$, or, in the terminology above, when $\Gamma(\alpha, \theta)$ is not a singleton set. Clearly this happens only if $\alpha\theta = 0$. All other points correspond to determinate cases. Therefore let
$$\Gamma_0 = \{(\alpha, \theta) : (\alpha, \theta) \in [0, 1] \times \Theta,\ \alpha = 0 \text{ or } \theta = 0\}, \qquad (6)$$
which is a closed set in $[0, 1] \times \Theta$. The single model defined by $\Gamma_0$ is $f_0$. We are interested in the consistency of the ML estimators in this case and apply theorem 1 to get the following.

Result 1
Suppose assumptions (a)–(c) are satisfied by $f$, i.e. they hold with $k = 1$, $\theta_1 = \theta$, $\Theta_1 = \Theta$ as given in (4) and $f_1(\theta) = f(\theta)$. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $f(\cdot, 0)$. Let the ML estimator of the model (5) be $(\alpha^n, \theta^n)$. Then
$$\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma_0\} \to 0 \quad \text{w.p. 1}.$$

Proof. We apply theorem 1 with $\Theta_1 = \{0\}$, $\Theta_2 = [0, \infty)$, $f_1 = f_2 = f$.

An explicit model satisfying this result is that where $f(x, \theta)$ is the normal density $f(x, \theta) = (2\pi)^{-1/2}\exp[-(x-\theta)^2/2]$. The practical significance of result 1 is that it shows it is meaningless to try to estimate either $\alpha$ or $\theta$ separately when the true model is $f_0$. This raises the question whether the combined quantity $\xi = \alpha\theta$ can be consistently estimated. This will be considered in the next section.

Example 2. Let $f(x) \in \mathcal{B}$. Then
$$f_L(x, \mu, \sigma) = \sigma^{-1} f\Big(\frac{x-\mu}{\sigma}\Big) \qquad (7)$$
is a location-scale model. Consider the mixture of two such models:
$$h(x, \alpha, \theta) = \alpha f_L(x, \mu_1, \sigma_1) + (1-\alpha)f_L(x, \mu_2, \sigma_2), \qquad (8)$$
with parameter space
$$\Gamma = \{(\alpha, \theta) : (\alpha, \theta) \in [0, 1] \times \Theta\} \qquad (9)$$
where
$$\Theta = R^1 \times R^1_\epsilon \times R^1 \times R^1_\epsilon, \qquad R^1_\epsilon = \{x \in R^1 : x \geq \epsilon > 0\}. \qquad (10)$$
It is not unreasonable to assume that the variances have a positive lower bound, as the cases of zero variances are degenerate. Without this lower bound the likelihood becomes unbounded if $\sigma_1$ or $\sigma_2 \to 0$, and ML, based on a global maximum, does not then give consistent estimators. In this case one can use a modified ML estimator, for example the maximum product of spacings estimator suggested by Cheng & Amin (1983). We will discuss this in a forthcoming paper. Let
$$\Gamma(\mu, \sigma) = \{(\alpha, \theta) \in [0, 1] \times \Theta : \alpha = 0;\ \mu_2 = \mu,\ \sigma_2 = \sigma\} \cup \{(\alpha, \theta) \in [0, 1] \times \Theta : \alpha = 1;\ \mu_1 = \mu,\ \sigma_1 = \sigma\} \cup \{(\alpha, \theta) \in [0, 1] \times \Theta : \mu_1 = \mu_2 = \mu,\ \sigma_1 = \sigma_2 = \sigma\},$$
$$-\infty < \mu < \infty,\ 0 < \sigma < \infty, \qquad (11)$$
i.e. $\Gamma(\mu, \sigma)$ is the set of all points in the original parameterization representing the pure location-scale model (7).
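Under our reading of (11), the distance from a parameter point to $\Gamma(\mu^0, \sigma^0)$ can be computed branch by branch, since the free coordinates of each branch project to themselves. The sketch below is our own illustration (function names are ours, and the $\sigma_i \geq \epsilon$ constraint of (10) is ignored on the assumption that the projections already satisfy it):

```python
import math

def dis_to_gamma(alpha, mu1, s1, mu2, s2, mu0, s0):
    """Euclidean distance from (alpha, theta) to Gamma(mu0, s0) of (11),
    minimizing over its three branches; coordinates left free by a
    branch contribute nothing to that branch's distance."""
    branch_a0 = math.sqrt(alpha ** 2 + (mu2 - mu0) ** 2 + (s2 - s0) ** 2)
    branch_a1 = math.sqrt((1 - alpha) ** 2 + (mu1 - mu0) ** 2 + (s1 - s0) ** 2)
    branch_eq = math.sqrt((mu1 - mu0) ** 2 + (s1 - s0) ** 2
                          + (mu2 - mu0) ** 2 + (s2 - s0) ** 2)
    return min(branch_a0, branch_a1, branch_eq)

# Points in Gamma(0, 1) are at distance 0, whichever branch they lie on.
assert dis_to_gamma(0.3, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0) == 0.0  # mu1=mu2, s1=s2
assert dis_to_gamma(1.0, 0.0, 1.0, 7.0, 3.0, 0.0, 1.0) == 0.0  # alpha = 1
# A nearby point is close: here the alpha = 1 branch is the nearest.
print(dis_to_gamma(0.95, 0.1, 1.0, 6.0, 2.0, 0.0, 1.0))
```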
Let
$$\Gamma_0 = \bigcup_{(\mu,\sigma)} \Gamma(\mu, \sigma). \qquad (12)$$
Thus $\Gamma_0$ is the set of all points for which the model (8) is indeterminate. Theorem 1 gives the consistency of the ML estimator in this case. We have the following.

Result 2
Suppose assumptions (a)–(c) are satisfied with $k = 1$, $\theta_1 = (\mu, \sigma)$, $\Theta_1 = R^1 \times R^1_\epsilon$ as given in (10), and $f_1(\cdot, \theta) = f_L(\cdot, \mu, \sigma)$. Let $X = (X_1, X_2, \ldots, X_n)$ be i.i.d. observations with density $f_L(\cdot, \mu^0, \sigma^0)$. Let the ML estimator of the model (8) be $(\alpha^n, \theta^n)$. Then
$$\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\mu^0, \sigma^0)\} \to 0 \quad \text{w.p. 1}.$$

Proof. Similarly, we apply theorem 1 with $f_1(x, \mu, \sigma) = f_2(x, \mu, \sigma) = f_L(x, \mu, \sigma)$, $\Theta_1 = R^1 \times R^1_\epsilon$, $\Theta_2 = R^1 \times R^1_\epsilon$. Assumptions (a)–(c) are satisfied (with $k = 2$), since $\sigma_i \geq \epsilon$.

A frequently met case is when $f$ is the standard normal density function. It is easy to check that all the conditions required in result 2 hold for this case.

3. $\sqrt{n}$-Consistency and test statistics

Convergence of the ML estimator to a set rather than to a point means that estimates of individual parameters may remain unstable. It would therefore be attractive if the parameters could be transformed so that each set $\Gamma(\alpha, \theta)$ maps into a separate singleton set, with all parameters consistently estimable. Such transformations do exist, but are somewhat messy to define in full generality, being closely dependent on the particular form of the $\Gamma(\alpha, \theta)$. We therefore do not discuss such transformations here. However, it is worth noting that certain estimators in broad classes are $\sqrt{n}$-consistent. We consider an interesting case. Titterington et al. (1985) consider the mixture model with components each having the same exponential-family density
$$f(x, \theta) = \exp\Big\{b(x) + \sum_{i=1}^{p} t_i(x)\theta_i - a(\theta)\Big\},$$
and point out that the likelihood equations must satisfy
$$\sum_{j=1}^{k} \alpha_j\,\frac{\partial a(\theta^{(j)})}{\partial\theta_i^{(j)}} = n^{-1}\sum_{l=1}^{n} t_i(x_l), \qquad i = 1, 2, \ldots, p. \qquad (13)$$
Examples of this property have been observed by a number of authors (see Titterington et al., 1985, for references) but have not been discussed in the context of indeterminacy. If we therefore retain the original parameterization but simply define as additional parameters
$$\xi_i = \sum_{j=1}^{k} \alpha_j\,\partial a(\theta^{(j)})/\partial\theta_i^{(j)},$$
and define our estimator $\hat\xi_i$ of $\xi_i$ as the solution of the likelihood equation, then we have immediately from (13) that $\hat\xi_i = n^{-1}\sum_{l=1}^{n} t_i(x_l)$, so that when $E\{t_i(X)\}$ and $\mathrm{var}\{t_i(X)\}$ are finite, the $\hat\xi_i$ are asymptotically normally distributed and so are $\sqrt{n}$-consistent. (Note that we have not investigated the possibility that the MLE lies on the boundary, when it might not necessarily satisfy (13); we do not therefore claim that $\hat\xi_i$ will necessarily always be the MLE, though we conjecture that it usually will be.) A typical case is the two-component normal mixture, which has mean and variance
$$\psi_1 = \alpha\mu_1 + (1-\alpha)\mu_2,$$
$$\psi_2 = \alpha(1-\alpha)(\mu_1 - \mu_2)^2 + \alpha\sigma_1^2 + (1-\alpha)\sigma_2^2. \qquad (14)$$
If we therefore let $t_1(x) = x$ and $t_2(x) = x^2$, then the likelihood equation estimators of $\psi_1$ and $\psi_2$, just defined, are
$$\hat\psi_1 = \bar{x} = n^{-1}\sum_{i=1}^{n} x_i, \quad \text{and} \quad \hat\psi_2 = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad (15)$$
i.e. the sample mean and variance. Thus $\hat\psi_1$ and $\hat\psi_2$ are $\sqrt{n}$-consistent estimators of the mean and variance of the distribution whether the data come from a single normal model or a mixture of two normals. Difficulties like those pointed out by Hartigan (1985) show that the likelihood ratio cannot be used to test the null hypothesis that the embedded model is the true model. However, in broad classes simple test statistics can be found that are consistent in the conventional sense under the null hypothesis. Consider for example the location-scale model (8). With no loss of generality, we suppose that $\sigma_1^2 \geq \sigma_2^2$ to avoid the same model being duplicated simply through interchange of $(\alpha, \mu_1, \sigma_1^2)$ with $((1-\alpha), \mu_2, \sigma_2^2)$.
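Equation (14) is simply the mean and variance of the two-component normal mixture, which can be confirmed against the raw moments (our own check, not part of the paper):

```python
# Check (14) against the raw moments of the mixture: with
# E[X]   = a*m1 + (1-a)*m2 and
# E[X^2] = a*(m1^2 + s1^2) + (1-a)*(m2^2 + s2^2),
# the variance E[X^2] - E[X]^2 must agree with psi2.
def psi12(a, m1, m2, s1, s2):
    psi1 = a * m1 + (1 - a) * m2
    psi2 = a * (1 - a) * (m1 - m2) ** 2 + a * s1 ** 2 + (1 - a) * s2 ** 2
    return psi1, psi2

a, m1, m2, s1, s2 = 0.7, 0.0, 3.0, 1.0, 2.0
psi1, psi2 = psi12(a, m1, m2, s1, s2)
ex = a * m1 + (1 - a) * m2
ex2 = a * (m1 ** 2 + s1 ** 2) + (1 - a) * (m2 ** 2 + s2 ** 2)
assert abs(psi1 - ex) < 1e-12
assert abs(psi2 - (ex2 - ex ** 2)) < 1e-12
```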
Here, if we define
$$\psi(\alpha, \theta) = \alpha(1-\alpha)\{(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2\}, \qquad (16)$$
we find that
$$\psi(\alpha, \theta) = 0 \text{ if and only if } (\alpha, \mu_1, \mu_2, \sigma_1, \sigma_2) \in \Gamma_0, \qquad (17)$$
so that $\psi(\alpha, \theta)$ is an indicator of the region corresponding to the embedded model. However, because $\Gamma_0$ is not necessarily bounded, the property (17) is not in itself sufficient to guarantee consistency directly. The following result gives two conditions under either of which consistency is obtained.

Result 3
Suppose assumptions (a)–(c) are satisfied by $f$ as in result 2. Let $\psi(\alpha, \theta)$ be as defined in (16) and let $(\alpha^n, \theta^n)$ be the ML estimator using the model (8).
(i) If $\{X_n\}$ is drawn from the full model (8) with parameter value $(\alpha^0, \theta^0)$, then
$$\psi(\alpha^n, \theta^n) \to \psi(\alpha^0, \theta^0), \quad \text{w.p. 1}.$$
(ii) If $\{X_n\}$ is drawn from the location-scale model (7) with parameter value $(\mu^0, \sigma^0)$, then
$$\psi(\alpha^n, \theta^n) \to 0, \quad \text{w.p. 1}, \qquad (18)$$
provided either (a) $\{(\alpha^n, \theta^n)\}$ is bounded (w.p. 1) or (b) $\psi_2^n \to (\sigma^0)^2$ (w.p. 1), where $\psi_2$ is as defined in (14).

Proof. Case (i) is just the regular case; we therefore need only consider case (ii). In this latter case the assumptions allow theorem 1 to be applied to show that $\mathrm{dis}\{(\alpha^n, \theta^n), \Gamma(\mu^0, \sigma^0)\} \to 0$ w.p. 1. It follows from property 1 that there is a sequence $\{(\bar\alpha^n, \bar\theta^n)\} \subset \Gamma(\mu^0, \sigma^0)$ such that $|(\alpha^n, \theta^n) - (\bar\alpha^n, \bar\theta^n)| \to 0$, and, from the definition of $\Gamma(\mu^0, \sigma^0)$, $\psi(\bar\alpha^n, \bar\theta^n) = 0$, so that
$$\psi(\alpha^n, \theta^n) = \psi(\alpha^n, \theta^n) - \psi(\bar\alpha^n, \bar\theta^n).$$
If $\{(\alpha^n, \theta^n)\}$ is bounded, so is $\{(\bar\alpha^n, \bar\theta^n)\}$. Then it is easy to see, from the uniform continuity of the mapping $\psi$ on any bounded domain, that (18) holds. Now assume that $\psi_2^n \to (\sigma^0)^2$. If $\{\alpha^n\}$ does not converge to 1 or 0, then $0 < \bar\alpha^n < 1$, so that $\bar\mu_1^n = \bar\mu_2^n = \mu^0$, $\bar\sigma_1^n = \bar\sigma_2^n = \sigma^0$. Therefore $\{(\alpha^n, \theta^n)\}$ is bounded, so that (18) is true. Suppose that $\{\alpha^n\}$ converges to 1 or 0, say 1. Then $\bar\alpha^n = 1$, so that $\bar\mu_1^n = \mu^0$, $\bar\sigma_1^n = \sigma^0$. Therefore $\mu_1^n \to \mu^0$, $\sigma_1^n \to \sigma^0$. However $\psi_2^n \to (\sigma^0)^2$; that is, $\alpha^n(1-\alpha^n)(\mu_1^n - \mu_2^n)^2 \to 0$ and $(1-\alpha^n)(\sigma_2^n)^2 \to 0$. It follows that $\psi(\alpha^n, \theta^n) \to 0$. Thus (18) holds.

Result 3 gives two conditions under either of which $\hat\psi \to 0$ w.p. 1 if and only if $(\alpha, \mu_1, \mu_2, \sigma_1, \sigma_2) \in \Gamma_0$. The first condition, that $\{(\alpha^n, \theta^n)\}$ be bounded, is satisfied if the parameters are bounded, a condition not too unreasonable in practice. As pointed out earlier, the normal mixture model is an example where the second condition is satisfied. This follows from (15), which shows that $\psi_2^n$ $(= \hat\psi_2) = n^{-1}\sum_{i=1}^{n}(x_i - \bar x)^2$ in this case, so that $\psi_2^n \to (\sigma^0)^2$ (w.p. 1) if the sample is drawn from the straight normal distribution. Neither condition is needed if a normalization of the parameters is first carried out. Let $\tilde\theta = (\tilde\mu_1, \tilde\mu_2, \tilde\sigma_1, \tilde\sigma_2)$, where
$$\tilde\mu_i = \mu_i/(1 + \mu_i^2)^{1/2}, \quad \tilde\sigma_i = \sigma_i/(1 + \sigma_i), \quad i = 1, 2, \qquad (19)$$
and let $\tilde\psi$ be precisely the same as $\psi$ except that $\mu_i$ and $\sigma_i$ are replaced by $\tilde\mu_i$ and $\tilde\sigma_i$ respectively $(i = 1, 2)$. Then an argument similar to the case where $\{(\alpha^n, \theta^n)\}$ is bounded can be used to show that, as $\{(\tilde\mu^n, \tilde\sigma^n)\}$ is bounded, we have the following.

Result 3′
Result 3 applies with $\psi$ replaced by $\tilde\psi$, without the need of the final conditions (a) or (b).

When result 3 or 3′ holds, $\psi$ provides a test statistic of the null hypothesis that $h = f_L$. Though the non-regularity is not entirely removed, the results do give us a practical means of carrying out the test of hypothesis; we give a parametric bootstrap method for doing this and report some numerical results in the next section.

4. Bootstrap test of the embedded model

For simplicity we attach a subscript to $\theta$ to distinguish between models.
Thus we write $F(\cdot, \theta_M)$ and $F(\cdot, \theta_E)$ for the full mixture and embedded models respectively, and we write $\theta_M^0$ or $\theta_E^0$ to indicate the true parameter value according to whether the full or the embedded model is the correct one. When result 3 or 3′ holds for the two-component mixture model (8), the parameter $\psi$ in (16) provides a test of the null hypothesis
$$H_0 : F(\cdot, \theta_E),\ \theta_E \text{ unknown, versus } H_1 : F(\cdot, \theta_M),\ \theta_M \text{ unknown}, \qquad (20)$$
as $\hat\psi$, or $\tilde\psi$, converges w.p. 1 to the true value $\psi^0$, which is 0 if and only if $H_0$ is true. The obvious form for a size-$\alpha$ test is as follows. Suppose the sample $X$ is drawn from $F(\cdot, \theta^0)$, and the full model is fitted, giving $F(\cdot, \hat\theta_M)$. Let $\hat\psi$ be the statistic calculated from $\hat\theta_M$. The distribution of $\hat\psi$, which depends on $\theta^0$, will be denoted by $G(\cdot, \theta^0)$. This distribution will differ depending on whether $\theta^0 = \theta_M^0$ or $\theta^0 = \theta_E^0$. The null distribution is $G(\cdot, \theta_E^0)$. If this is known, we can compare the test statistic $\hat\psi$ with $\gamma(\alpha)$, the $(1-\alpha)100$ percentile of $G(\cdot, \theta_E^0)$, and reject the null if
$$\hat\psi > \gamma(\alpha).$$
Because of the non-regularity, the distribution $G(\cdot, \theta_E^0)$ is difficult to obtain theoretically. However, parametric bootstrap sampling is an easy way to estimate $G(\cdot, \theta_E^0)$. We draw $B$ (bootstrap) samples $X^{(i)}$, $i = 1, 2, \ldots, B$, from $F(\cdot, \hat\theta_E)$ and obtain from these the full-model fits $\hat\theta_M^{(i)}$, $i = 1, 2, \ldots, B$, and, in turn from these, the test statistics $\hat\psi^{(i)}$, $i = 1, 2, \ldots, B$. By construction these have distribution $G(\cdot, \hat\theta_E)$, and the empirical distribution function of the $\hat\psi^{(i)}$ therefore tends to $G(\cdot, \hat\theta_E)$ as $B \to \infty$. The above theory then gives conditions under which $\hat\theta_E \to \theta_E^0$ w.p. 1 when the embedded (null) model is correct. Strictly we need to show that this implies $G(\cdot, \hat\theta_E) \to G(\cdot, \theta_E^0)$; however, because of the non-regular nature of the problem, this seems difficult to do in general. This same problem occurs with the parametric resampling method proposed by Feng & McCulloch (1996) for constructing confidence intervals.
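The procedure just described can be sketched end to end. The following is a minimal illustration, not the authors' code: `em_two_normals` is a bare-bones, single-start EM fit of (8) with a variance floor standing in for the $\epsilon$ bound of (10), and `bootstrap_test` draws $B$ parametric bootstrap samples from the fitted embedded (single normal) model and compares $\hat\psi$ with the $(1-\alpha)$ percentile of their EDF. A serious implementation would use many EM starts and a much larger $B$.

```python
import math, random

def em_two_normals(xs, iters=200, eps=1e-3):
    """Bare-bones EM for the two-component normal mixture (8).
    Single start, moment-based initialization, variance floor eps."""
    xs = list(xs)
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    a, m1, m2 = 0.5, mean - 0.5 * var ** 0.5, mean + 0.5 * var ** 0.5
    v1 = v2 = max(var, eps)
    for _ in range(iters):
        # E-step: responsibilities of component 1.
        r = []
        for x in xs:
            p1 = a * math.exp(-0.5 * (x - m1) ** 2 / v1) / math.sqrt(v1)
            p2 = (1 - a) * math.exp(-0.5 * (x - m2) ** 2 / v2) / math.sqrt(v2)
            r.append(p1 / (p1 + p2))
        # M-step: weighted moments.
        w1 = sum(r); w2 = n - w1
        a = w1 / n
        m1 = sum(ri * x for ri, x in zip(r, xs)) / w1
        m2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / w2
        v1 = max(sum(ri * (x - m1) ** 2 for ri, x in zip(r, xs)) / w1, eps)
        v2 = max(sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, xs)) / w2, eps)
    return a, m1, m2, v1 ** 0.5, v2 ** 0.5

def psi_hat(xs):
    """The statistic (16) evaluated at the fitted full model."""
    a, m1, m2, s1, s2 = em_two_normals(xs)
    return a * (1 - a) * ((m1 - m2) ** 2 + (s1 - s2) ** 2)

def bootstrap_test(xs, B=99, alpha=0.05, rng=random):
    """Parametric bootstrap: fit the embedded model, resample from it,
    and compare psi_hat with the (1 - alpha) percentile of its EDF."""
    n = len(xs)
    mu_e = sum(xs) / n
    sd_e = (sum((x - mu_e) ** 2 for x in xs) / n) ** 0.5
    stat = psi_hat(xs)
    boot = sorted(psi_hat([rng.gauss(mu_e, sd_e) for _ in range(n)])
                  for _ in range(B))
    crit = boot[int((1 - alpha) * B)]
    return stat, crit, stat > crit
```

Since $\psi \geq 0$ by construction, the test is one-sided, exactly as in the rejection rule above.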
However, as these authors point out, this desirable property is only obtained in the asymptotic limit anyway. For finite samples, simulation studies are usually the easiest way to verify that the approximation is satisfactory.

It is hoped to discuss the effectiveness of such tests more fully elsewhere, but as a brief example we present some simulation results for the mixture of two normals for sample sizes n = 20 and 50. In each case ψ̂, the ML estimator of (16), was used as the test statistic. Its distribution was constructed as follows. First a sample of size n was drawn from N(0, 1) and the ML estimates μ̂ and σ̂ obtained. Then 1000 bootstrap samples, each of size n, were drawn from N(μ̂, σ̂^2). For each bootstrap sample the mixture model (8) was fitted by the EM algorithm and ψ̂ then calculated. The EDF of the resulting 1000 such ψ̂s then estimates G(·, μ̂, σ̂), which in turn estimates G(·, 0, 1).

To illustrate how the variability of μ̂, σ̂^2 affects the variability of G(·, μ̂, σ̂), m (= 100) such EDFs were calculated and are shown in Figs 1 and 2. Selected percentage points, together with an indication of their spread, are given in Table 1. It will be seen that the spread of estimates appears quite narrow. This is confirmed when we consider the power of the statistic in detecting alternatives. To study this, EDFs were also constructed in exactly the way just given, except that for each EDF the initial sample was drawn from the mixture model 0.7N(0, 1) + 0.3N(μ_2, 1) and the bootstrap samples drawn from α̂N(μ̂_1, σ̂_1^2) + (1 − α̂)N(μ̂_2, σ̂_2^2), where α̂, μ̂_1, μ̂_2, σ̂_1, σ̂_2 were the ML estimates obtained from the initial sample. For n = 20 and 50, Figs 1 and 2 show m = 50 such EDFs for each of the cases μ_2 = 1, 2, 3. As anticipated by the values of ψ^0 (= 0.21, 0.84, 1.89 when μ_2 = 1, 2, 3 respectively), the power is low when μ_2 = 1. This is not surprising, as the difference in shape between the models 0.7N(0, 1) + 0.3N(1, 1) and N(0, 1) is small compared with the inherent variability in the relatively small samples considered, and so is probably difficult to detect by any method. The difference between the models becomes marked as μ_2 increases, and when μ_2 = 3 the power of the test seems quite satisfactory.

Fig. 1. EDFs of ψ̂, each constructed from 1000 samples of size 20. The number of EDFs shown is 100 for the initial model N(0, 1) and 50 for each of the initial models 0.7N(0, 1) + 0.3N(μ_2, 1), μ_2 = 1, 2, 3.

Fig. 2. EDFs of ψ̂, each constructed from 1000 samples of size 50. The number of EDFs shown is 100 for the initial model N(0, 1) and 50 for each of the initial models 0.7N(0, 1) + 0.3N(μ_2, 1), μ_2 = 1, 2, 3.

Table 1. Selected percentage points γ(α) of ψ̂ for the N(0, 1) model, estimated from 100 EDFs each constructed from 1000 samples of size n

n = 20
α       γmin(α)   γ(α)    γmax(α)   SD
0.10    0.744     0.787   0.833     0.018
0.05    0.852     0.900   0.947     0.021
0.01    1.031     1.136   1.234     0.040

n = 50
α       γmin(α)   γ(α)    γmax(α)   SD
0.10    0.553     0.585   0.618     0.013
0.05    0.626     0.666   0.704     0.015
0.01    0.755     0.809   0.875     0.027

References

Berman, M. (1986). Some unusual examples of extrema associated with hypothesis tests when nuisance parameters are present under the alternative. In Proceedings of the Pacific Statistics Congress (eds I. S. Francis, B. F. J. Manly & F. C. Lam), North-Holland, Amsterdam.

Cheng, R. C. H. & Amin, N. A. K. (1983). Estimating parameters in continuous univariate distributions with a shifted origin. J. Roy. Statist. Soc. Ser. B 45, 394–403.

Cheng, R. C. H. & Traylor, L. (1995). Non-regular maximum likelihood problems. J. Roy. Statist. Soc. Ser. B 57, 3–44.

Feng, Z. D. & McCulloch, C. E. (1996). Using bootstrap likelihood ratios in finite mixture models. J. Roy. Statist. Soc. Ser. B 58, 593–608.

Ghosh, J. K. & Sen, P. K. (1985). On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proceedings of the Berkeley Symposium in Honor of J. Neyman and J. Kiefer (eds L. LeCam & R. A. Olshen), Vol. II, 789–806, Wadsworth, New York.

Hartigan, J. A. (1985). A failure of likelihood asymptotics for the mixture model. In Proceedings of the Berkeley Symposium in Honor of J. Neyman and J. Kiefer (eds L. LeCam & R. A. Olshen), Vol. II, 807–810, Wadsworth, New York.

Redner, R. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann. Statist. 9, 225–228.

Richardson, S. & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59, 731–792.

Smith, R. L. (1989). A survey of nonregular problems. In Proceedings of the International Statistical Institute Conference, 47th Session, Paris, 353–372.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley, Chichester.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 595–601.

Received April 1998, in final form October 2000

Wenbin Liu, Canterbury Business School, University of Kent, Canterbury CT2 7PE, UK.