Label Switching Under Model Uncertainty
Sylvia Frühwirth-Schnatter
Department of Applied Statistics
Johannes Kepler Universität Linz, Austria
Workshop on Mixture Estimation and Applications, Edinburgh, March 3-5, 2010
Outline of the Talk
• The label switching problem
• Relabeling through clustering in the point process representation
• Relabeling when the number of components is unknown
• Some real applications
• Overfitting heterogeneity of component-specific parameters
The Label Switching Problem
Let the component-specific parameter θ_k take values in Θ and consider relabelling the draws (θ_1, . . . , θ_K) of a mixture with K components.
• Most papers work in the full parameter space Θ^K to identify suitable permutations of the labels (Celeux, 1998; Celeux et al., 2000; Stephens, 2000b; Marin et al., 2005; Jasra et al., 2005; Nobile and Fearnside, 2007; Sperrin et al., 2009; Spezia, 2009)
• "Simple" relabeling: operates in Θ or even a subspace Θ̃ ⊂ Θ (Frühwirth-Schnatter, 2001)
PART I
Clustering in the point process representation
Clustering in the Point Process Representation
See Frühwirth-Schnatter (2006), page 97.
The Point Process Representation of a Finite Mixture Model
Stephens (2000a): any finite mixture distribution has a representation as a marked point process and may be seen as a distribution of the points {θ_1, . . . , θ_K} over the parameter space Θ.
Figure 1: Point process representation of univariate normal mixtures with 3 components (scatter plots of µ versus σ²)
The Point Process Representation of the MCMC Draws
Frühwirth-Schnatter (2006): the MCMC draws cluster around the "true" point {θ_1^true, . . . , θ_K^true} in the parameter space Θ.
Figure 2: Point process representations of mixture and MCMC draws (µ_k versus σ_k²)
Labelling Based on the Point Process Representation
Frühwirth-Schnatter (2001) suggested using a point process representation of the MCMC draws to identify a mixture model.
A visual inspection of these plots allows one to study the differences in the component-specific parameters and to formulate an identifiability constraint.
Although this works quite well in lower dimensions, it is difficult or even impossible to extend this method to higher-dimensional problems.
The Point Process Representation in Higher Dimension
Consider the following mixture of 4 multivariate normals of dimension r = 6 with

(µ_1 µ_2 µ_3 µ_4) =
  [ −2  −2  −2   0 ]
  [  3   0  −3   3 ]
  [  4   4   4   4 ]
  [  0   0   0   0 ]
  [  0   2   0   0 ]
  [  1   0   1   0 ] ,

Σ_1 = 0.5 I_r,  Σ_2 = 4 I_r + 0.2 e_r,  Σ_3 = 4 I_r − 0.2 e_r,  Σ_4 = I_r.

θ_k = (µ_k, vech(Σ_k)) contains r + r(r + 1)/2 = 27 coefficients.
The Point Process Representation in Higher Dimension
Figure 3: Two-dimensional projections of the point process representation (pairwise scatter plots of the component means µ_·,1, . . . , µ_·,6)
The Point Process Representation in Higher Dimension
Figure 4: Two-dimensional projections of the point process representation (trace(Σ) versus log(det Σ), and largest versus smallest eigenvalue of Σ)
The Point Process Representation of the MCMC Draws
Figure 5: Point process representation of 5000 draws (1000 observations); pairwise scatter plots of the component means
Clustering in the Point Process Representation
Use k-means clustering in the point process representation of the MCMC draws (Frühwirth-Schnatter, 2006, p. 97):
• Identify K clusters instead of K! clusters as in Celeux (1998)
• Apply k-means clustering to all KM posterior draws of the parameter vector θ_k^(m), k = 1, . . . , K, m = 1, . . . , M.
• This delivers a classification index I_k^(m) ∈ {1, . . . , K}, k = 1, . . . , K, m = 1, . . . , M.
It is usually sufficient to consider a subset of the component-specific parameters to obtain these classification indices.
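A minimal numpy-only sketch of this clustering step; Lloyd's algorithm stands in for any k-means implementation, and all variable names are illustrative:

```python
import numpy as np

def classify_draws(theta, K, n_iter=50, seed=0):
    """Pool the M*K posterior draws theta[m, k, :] and run a basic
    k-means (Lloyd's algorithm); returns a classification index in
    {0, ..., K-1} for every draw, shaped (M, K)."""
    rng = np.random.default_rng(seed)
    X = theta.reshape(-1, theta.shape[-1])            # (M*K, d) pooled draws
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # assign each pooled draw to the nearest cluster center
        idx = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1),
                        axis=1)
        # update centers, keeping the old center if a cluster went empty
        centers = np.array([X[idx == k].mean(axis=0) if np.any(idx == k)
                            else centers[k] for k in range(K)])
    return idx.reshape(theta.shape[:2])

# toy example: M = 200 sweeps of a 2-component mixture in d = 2 dimensions
rng = np.random.default_rng(1)
mu_true = np.array([[-4.0, 0.0], [3.0, 0.0]])
theta = mu_true[None, :, :] + 0.1 * rng.standard_normal((200, 2, 2))
I = classify_draws(theta, K=2)                        # shape (200, 2)
```

With well-separated components, every sweep m delivers a row I[m] that is a permutation of the labels, which is exactly the input the relabelling step needs.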
Relabelling
Use ρ_m = (I_1^(m), . . . , I_K^(m)), m = 1, . . . , M, to identify the mixture:
• Check if ρ_m is a permutation of {1, . . . , K} (Condition (3.57) is not sufficient). If ρ_m is a permutation, a unique labelling is achieved by reordering the draws through ρ_m:

  η_k^(m) ⇒ η̃_{ρ_m(k)}^(m),   θ_k^(m) ⇒ θ̃_{ρ_m(k)}^(m)

• The same permutation is used to relabel the MCMC draws S^(m) = (S_1^(m), . . . , S_N^(m)) of the hidden allocations:

  S̃_i^(m) = ρ_m(S_i^(m)),   i = 1, . . . , N.
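The permutation check and the reordering can be sketched as follows (zero-based labels and illustrative names; this is not the book's exact implementation):

```python
import numpy as np

def relabel(eta, theta, I):
    """For each sweep m, rho = I[m] is checked to be a permutation of
    {0, ..., K-1}; if so, component k is moved to position rho[k] in the
    weight and parameter draws.  Sweeps whose classification is not a
    permutation are discarded.  eta is (M, K), theta is (M, K, ...)."""
    M, K = I.shape
    kept, eta_out, theta_out = [], [], []
    for m in range(M):
        rho = I[m]
        if len(set(rho.tolist())) != K:   # not a permutation of the labels
            continue
        eta_new = np.empty_like(eta[m])
        theta_new = np.empty_like(theta[m])
        eta_new[rho] = eta[m]             # eta-tilde_{rho(k)} = eta_k
        theta_new[rho] = theta[m]
        kept.append(m); eta_out.append(eta_new); theta_out.append(theta_new)
    return np.array(kept), np.array(eta_out), np.array(theta_out)

# two sweeps whose labels switched between sweeps:
eta = np.array([[0.2, 0.8], [0.8, 0.2]])
theta = np.array([[-4.0, 3.0], [3.0, -4.0]])
I = np.array([[0, 1], [1, 0]])
kept, eta_r, theta_r = relabel(eta, theta, I)

# the same permutation maps the allocations: S-tilde_i = rho(S_i)
S = np.array([0, 1, 1, 0])          # allocations of sweep 1
S_tilde = I[1][S]
```

After relabelling, both sweeps carry the components in the same order, so posterior means of η_k and θ_k can be computed component by component.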
Application to the Example
• The component-specific parameter θ_k contains r + r(r + 1)/2 = 27 coefficients. Use only the component mean, i.e. ϑ_k = (µ_k,1 · · · µ_k,r)′; ϑ_k contains 6 elements.
• k-means clustering identifies 4 clusters in MK = 20 000 realizations of the 6-dimensional variable ϑ_k^(m). For each ϑ_k^(m) a classification index I_k^(m) results.
All classification sequences ρ_m = (I_1^(m), . . . , I_4^(m)), m = 1, . . . , M, are permutations of {1, . . . , 4}.
One could add measures describing Σ_k to ϑ_k.
Application to the Example
Figure 6: Point process representation of 5000 MCMC draws (pairwise scatter plots of the component means)
Recent High Dimensional Applications
Frühwirth-Schnatter and Pyne (2010): mixtures of univariate and multivariate skew-normal and skew-t distributions; application to Alzheimer's disease data and flow cytometry; use the entire θ_k for univariate mixtures and ϑ_k = E(y_i|θ_k) for multivariate mixtures
Pamminger and Frühwirth-Schnatter (2010): clustering a panel of categorical time series using finite mixtures of Markov chain models; application to wage mobility data on a panel of 10 000 individuals from the Austrian labor market; ϑ_k contains the persistence probabilities of the cluster-specific transition matrices
Extension to Markov Switching Models
Easily extended to more general mixture models such as Markov switching models (Frühwirth-Schnatter, 2008)
Hahn et al. (2010): continuous-time, multivariate Markov switching models; application to modeling multivariate financial time series; use

  ϑ_k = (µ_k,1 · · · µ_k,r, Diag(Σ_k))′.
PART II
Relabeling when the number of components
is unknown
Relabeling when the number of components is unknown
• There exists no unique labeling when the model overfits the number of components
• Choose the prior on the weight distribution carefully
• Pick the right estimator for K^true (K^true is the "true" number of non-empty, non-identical components)
The Role of the Prior on the Weight Distribution
η = (η_1, . . . , η_K) ∼ D(e_1, . . . , e_K), i.e.

  p(η) ∝ ∏_{k=1}^{K} η_k^{e_k − 1}.

The hyperparameters e_1, . . . , e_K are very influential, in particular if the model is overfitting.
Very popular choice: η ∼ D(1, . . . , 1) (uniform on the unit simplex)
Reference Type Priors
Bernardo and Girón (1988) consider a mixture with two components, where the component-specific parameters are known:
• For K = 2, the Beta prior η_1 ∼ B(0.5, 0.5) approximates the reference prior when the mixture components are well-separated.
• For K > 2, the prior η ∼ D(0.5, . . . , 0.5) approaches the reference prior when the component mixture densities have pairwise disjoint support.
Reference Type Priors
When the mixture is overfitting, the following holds:
• For K = 2, the uniform prior η_1 ∼ B(1, 1) approximates the reference prior.
• For K > 2, the reference prior tends to the Dirichlet prior η ∼ D(1, . . . , 1) when all component mixture densities converge to the same distribution.
But is the reference prior a good choice for an overfitting mixture?
The likelihood for overfitting mixtures
The likelihood is hell for overfitting mixtures because it reflects both ways of cutting back to the true number of components:
• Leave one group empty: η_k is shrunken toward 0; θ_k is identified only through the prior p(θ_k)
• Let two components have identical parameters: θ_k − θ_l is shrunken toward 0; only the sum of the component weights η_k + η_l is identified.
Hence, under the reference prior, the posterior is very blurred, because it mixes these two unidentifiability modes.
The likelihood for overfitting mixtures
Figure 7: Simulated data with µ_1 = 1, µ_2 = 1.5, σ_1² = σ_2² = 1, N = 100; surface and contours of the integrated mixture likelihood p(y|µ_1, µ_2)
Help the Posterior Distribution to focus
Stick-breaking representation of the Dirichlet distribution:

  η_k = (1 − ∑_{j=1}^{k−1} η_j) φ_k,   k = 1, . . . , K,

  φ_k ∼ B(e_k, ∑_{j=k+1}^{K} e_j),   k = 1, . . . , K − 1,   with φ_K = 1.
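The identity is easy to verify by simulation; a numpy sketch (the exchangeable choice e_j ≡ 4 with K = 5 is purely illustrative):

```python
import numpy as np

def dirichlet_stick_breaking(e, rng):
    """Draw eta ~ D(e_1, ..., e_K) via stick breaking:
    phi_k ~ Beta(e_k, sum_{j>k} e_j) and
    eta_k = (1 - eta_1 - ... - eta_{k-1}) * phi_k, with phi_K = 1."""
    K = len(e)
    eta, remaining = np.empty(K), 1.0
    for k in range(K - 1):
        phi = rng.beta(e[k], np.sum(e[k + 1:]))
        eta[k] = remaining * phi
        remaining -= eta[k]
    eta[K - 1] = remaining            # phi_K = 1 uses up the rest of the stick
    return eta

rng = np.random.default_rng(0)
e = np.full(5, 4.0)                   # exchangeable case e_j = e_0 = 4, K = 5
draws = np.array([dirichlet_stick_breaking(e, rng) for _ in range(20000)])
```

The simulated weights sum to one by construction, and their Monte Carlo means recover the Dirichlet means e_k / ∑_j e_j = 1/5.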
Help the Posterior Distribution to focus
Choose a prior which decides whether you prefer empty groups or identical components for overfitting mixtures.
Exchangeable prior with e_j ≡ e_0:
• e_0 > 1: bounds φ_k away from 0 and 1; forces non-empty groups; E(φ_k) = 1/(K − (k − 1)) with finite variance; e.g. e_0 = 4 (Frühwirth-Schnatter, 2006).
• e_0 < 1: no finite variance; forces φ_k either toward 0 (empty group) or toward 1 (all remaining groups empty), a shrinkage prior; e.g. e_0 = α/K_max, the Dirichlet approximation to the Dirichlet process prior (Ishwaran et al., 2001)
Help the Posterior Distribution to focus
Figure 8: Priors of φ_1, . . . , φ_4 for K = 5; e_0 = 4 (left-hand side) and e_0 = 0.1 (right-hand side)
Help the Posterior Distribution to focus
Figure 9: Priors of φ_1, . . . , φ_4 for K = 5; e_0 = 1
How to identify a unique labelling
Most papers recommend using some model selection method to estimate K^true and then identifying a unique labeling for a model with fixed K = K̂^true.
Popular methods: marginal likelihoods, BIC, RJMCMC
How is K, the number of components in the fitted mixture, related to K^true (Nobile, 2004)? Is the number of components K or the number of non-empty components K_n closer to K^true?
The answer is not independent of the prior on the weights.
The meaning of K
If e_0 > 1, then empty groups are a priori unlikely. If a mixture with K components is overfitting, then:
• The number K_n of non-empty groups is a priori equal to the number of components K in the fitted mixture.
• Hence, a priori, K_n will overestimate K^true if the mixture is overfitting.
Use the marginal likelihood p(y|K) or the posterior p(K|y) in RJMCMC to estimate K^true.
The meaning of K
If e_0 < 1, then empty groups are a priori likely. If a mixture with K components is overfitting, then:
• The number K_n of non-empty groups is a priori smaller than the number of components K in the fitted mixture.
• Hence, a priori, the marginal likelihood p(y|K) and the posterior p(K|y) in RJMCMC will overestimate K^true (Nobile, 2004).
• The number K_n of non-empty groups will be a much better estimator of K^true.
The meaning of K
Estimate K_n from the posterior draws of N_k(S), where N_k(S) is the number of observations allocated to group k:

  K_n = K − ∑_{k=1}^{K} I{N_k(S) = 0}.

Note that K_n is invariant to label switching.
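Computed per allocation draw, this estimator is one line of counting; a small sketch with an invented toy draw:

```python
import numpy as np

def n_nonempty(S, K):
    """K_n = K - #{k : N_k(S) = 0} for one allocation draw S with
    labels in {1, ..., K}; by construction invariant to relabelling."""
    counts = np.bincount(S, minlength=K + 1)[1:]   # N_1(S), ..., N_K(S)
    return K - int(np.sum(counts == 0))

# toy draw: 6 observations, K = 5 fitted components, groups 3 and 5 empty
S = np.array([1, 1, 2, 2, 2, 4])
```

Permuting the labels of S leaves the count of empty groups, and hence K_n, unchanged.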
Example - Univariate Mixture of Normals
K^true = 2, µ_1 = −4, µ_2 = 3, σ_1² = σ_2² = 0.5, η_1 = 0.2
1000 observations, 5000 MCMC draws

Table 1: Comparison of various priors of the weight distribution

  e_0   K   log p(y|K)   K_n        non-permutations
  4     1   -2471.7      -          -
        2   -1575.9      2 (5000)   53/5000
        3   -1578.0      3 (5000)   2342/5000
        4   -1629.2      4 (5000)   4660/5000
        5   -1636.2      5 (5000)   4660/5000
  0.1   2   -1576.55     2 (5000)   61/5000
        3   -1576.46     2 (4223)   4398/5000
        4   -1623.44     2 (4070)   4841/5000
        5   -1623.77     2 (3740)   4841/5000
Example - Univariate Mixture of Normals
How about the uniform "reference prior"?

Table 2: Comparison of various priors of the weight distribution

  e_0   K   log p(y|K)   K_n        non-permutations
  1     2   -1583.87     2 (5000)   0/5000
        3   -1583.49     3 (4997)   5000/5000
        4   -1585.94     4 (4897)   5000/5000
        5   -1592.73     5 (4652)   5000/5000
Example - Multivariate Mixture of Normals
K^true = 4 (extremely well-separated clusters), r = 8 dimensional
1000 observations, 5000 MCMC draws

Table 3: Comparison of various priors of the weight distribution

  e_0   K   log p(y|K)   K_n        non-permutations
  4     4   -12807.05    4 (5000)   0/5000
        5   -12825.86    4 (5000)   3982/5000
        6   -12846.82    4 (4992)   4626/5000
  0.1   4   -12814.36    4 (5000)   0/5000
        5   -12817.88    4 (5000)   4398/5000
        6   -12824.00    4 (4998)   4629/5000
  1     4   -12809.22    4 (5000)   0/5000
        5   -12817.55    4 (4997)   3887/5000
        6   -12828.06    4 (5000)   4630/5000
PART III
A Real Application
Panel Data Analysis from the Austrian Labor Market
The data set consists of time series for N = 9 809 men entering
the labour market in the years 1975 to 1980 at an age of at most
25 years. The time series represent gross monthly wages in May of successive years; their individual lengths range from 2 to 27 years, with a median length of 23.
Following Weber (2001), the gross monthly wage is divided into six categories labelled 0 to 5:
Category 0: zero income or non-employment
Categories 1 to 5: quintiles of the income distribution.
Panel Data Analysis from the Austrian Labor Market
Individual wage mobility time series of twenty employees:
[Figure: twenty individual time series of wage categories 0 to 5 over the observation period.]
Pamminger and Frühwirth-Schnatter (2010)
Model-based clustering applied to the whole time series y_i = (y_i,1, . . . , y_i,T_i).
Hierarchical mixture model:
• To capture the persistence of wage categories for an individual, use a first-order Markov chain model with transition matrix ξ_si.
• The rows of ξ_si follow a Dirichlet distribution with cluster-specific parameters:

  ξ_si,j·|(S_i = k) ∼ D(e_k,j0, . . . , e_k,j5),   j = 0, . . . , 5.
Pamminger and Frühwirth-Schnatter (2010)
• Each cluster is characterized by a "typical" cluster-specific transition matrix ξ_k = E(ξ_si) in group k. However, ξ_si is allowed to vary around ξ_k.
It is possible to work out the marginal distribution p(y_i|S_i = k, e_k):

  p(y_i|e_k) = ∏_{j=0}^{5} [ Γ(∑_{l=0}^{5} e_k,jl) / ∏_{l=0}^{5} Γ(e_k,jl) ] · [ ∏_{l=0}^{5} Γ(N_i,jl + e_k,jl) / Γ(∑_{l=0}^{5} (N_i,jl + e_k,jl)) ],

where N_i,jl is the number of transitions from state j to state l observed in time series i.
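This marginal likelihood is a product of Dirichlet-multinomial normalising-constant ratios, one per row of the transition count matrix; a stdlib-only sketch (function and variable names are illustrative):

```python
import math

def log_marginal(N, e):
    """log p(y_i | e_k): N[j][l] counts transitions j -> l in series i,
    e[j][l] are the Dirichlet parameters of row j.  Each row j contributes
    Gamma(sum_l e_jl) / Gamma(sum_l (N_jl + e_jl))
      * prod_l Gamma(N_jl + e_jl) / Gamma(e_jl)."""
    total = 0.0
    for Nj, ej in zip(N, e):
        total += math.lgamma(sum(ej)) - math.lgamma(sum(ej) + sum(Nj))
        total += sum(math.lgamma(n + a) - math.lgamma(a)
                     for n, a in zip(Nj, ej))
    return total

e = [[0.5] * 6 for _ in range(6)]      # illustrative prior parameters
N0 = [[0] * 6 for _ in range(6)]       # an empty series gives p(y_i|e_k) = 1
```

As a sanity check, a series with a single transition from state j to state l has marginal probability e_jl / ∑_l e_jl, the prior mean of that transition probability.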
Pamminger and Frühwirth-Schnatter (2010)

Table 4: Model selection criteria for various numbers H of clusters; criteria depending on sample size are computed with sample size n_1 = N and sample size n_2 = ∑_{i=1}^{N} T_i.

  H   AIC        BIC (n_1)  BIC (n_2)  AWE (n_1)  AWE (n_2)  CLC        ICL-BIC (n_1)  ICL-BIC (n_2)
  1   401183.0   401441.9   401551.5   401880.8   402100.1   401111.0   401441.9       401551.5
  2   392033.4   392558.4   392780.7   394842.1   395286.7   395002.9   395673.9       395896.2
  3   389080.9   389871.9   390206.9   394206.1   394876.0   395176.2   396187.2       396522.2
  4   386851.8   387908.8   388356.5   393567.7   394463.1   394683.3   396034.4       396482.1
  5   385852.4   387175.5   387735.9   393894.4   395015.0   394947.0   396638.1       397198.5
  6   385004.7   386594.0   387267.0   394466.5   395812.6   395308.4   397339.6       398012.6

Model selection criteria and discussion with the economists lead to choosing K = 4
Pamminger and Frühwirth-Schnatter (2010)
Model identification through clustering in the point process representation
Figure 10: Scatter plots of the MCMC draws of the persistence probabilities (ξ_k,00^(m), ξ_k,11^(m)) (left-hand side), (ξ_k,00^(m), ξ_k,22^(m)) (middle) and (ξ_k,11^(m), ξ_k,22^(m)) (right-hand side)
Results
Visualizations of the estimated transition matrices ξ̂_1, ξ̂_2, ξ̂_3, ξ̂_4
[Figure: four transition-matrix plots over wage categories 0 to 5, one per cluster: unemployed (0.2417), climbers (0.3996), low wage (0.196), flexible (0.1627).]
Results
Posterior expectation of the wage distribution π_h,t over the wage categories 0 to 5 after a period of t years in the various clusters
[Figure: bar charts of π_h,t for t = 0, 1, . . . , 5, 10, 15, 20, 25, 30, 50, 100 years in the clusters unemployed, climbers, low wage and flexible.]
Results
Typical group members within each cluster: wage careers of the five individuals with the highest posterior classification probability
[Figure: five wage-category time series per cluster (unemployed, climbers, low wage, flexible).]
Results
Wage careers of the individuals no. 10, 25, 50, 100 and 200 in the posterior classification probability ranking
[Figure: five wage-category time series per cluster (unemployed, climbers, low wage, flexible).]
PART IV
Overfitting Heterogeneity
of component specific parameters
Example 2
N = 1000 8-dimensional observations generated by the following mixture of 4 multivariate normals with

(µ_1 µ_2 µ_3 µ_4) =
  [ −1  −1   1   1 ]
  [  0   0   0   0 ]
  [  4   4   4   4 ]
  [  0   0   0   0 ]
  [ −1   1  −1   1 ]
  [  0   0   0   0 ]
  [  1   1   1   1 ]
  [  0   0   0   0 ] .
Example 2 - Fit a mixture with 4 components
Figure 11: Fit a mixture with 4 components, priors as in Stephens (1997); point process representation of 5000 MCMC draws
Common Priors on Component Specific Location Parameters
Usually, normal priors with fixed hyperparameters, e.g.
• Univariate mixtures of normals (Richardson and Green, 1997): µ_k ∼ N(m, R²), where m and R are the midpoint and the length of the observation interval.
• Multivariate mixtures of normals (Stephens, 1997): µ_k,l ∼ N(m_l, R_l²), l = 1, . . . , r, where m_l and R_l are the midpoint and the length of the observation interval of the lth feature.
• Mixtures of regression models (Hurn et al., 2003): β_k,j ∼ N(0, 10) or β_k,j ∼ N(0, 10σ_j²), j = 1, . . . , d.
Shrinkage Priors
Use shrinkage priors well-known from variable selection (Park and Casella, 2008; Fahrmeir et al., 2010) - work in progress.
E.g. for multivariate mixtures of normals:

  µ_k,l ∼ Lap(m_l, B_l),

where B_l = R_l/√2 implies Var(µ_k,l) = 2B_l² = R_l², or use a hyperprior p(B_l).
Hierarchical formulation:

  µ_k,l|ψ_k,l ∼ N(m_l, ψ_k,l),   ψ_k,l ∼ E(1/(2B_l²)).
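The normal-exponential hierarchy can be checked by simulation: mixing a normal over an exponential variance with mean 2B_l² reproduces the Laplace distribution (note that numpy's `exponential` takes the mean as its scale argument, so the rate 1/(2B_l²) becomes scale 2B_l²; the chosen values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m_l, B_l = 0.0, 1.5
n = 200000

# direct Laplace draws mu_{k,l} ~ Lap(m_l, B_l)
direct = rng.laplace(m_l, B_l, size=n)

# hierarchical draws: psi ~ Exp with rate 1/(2 B_l^2), i.e. mean 2 B_l^2,
# then mu | psi ~ N(m_l, psi)
psi = rng.exponential(2.0 * B_l ** 2, size=n)
hier = rng.normal(m_l, np.sqrt(psi))
```

Both samples share the variance 2B_l², which equals R_l² under the calibration B_l = R_l/√2 from the slide.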
MCMC Estimation
Easy extension of "simple MCMC":
• Conditional on the prior variances ψ = (ψ_k,l, l = 1, . . . , r, k = 1, . . . , K), run the usual MCMC (i.e. sampling the component-specific parameters, the weight distribution, and the allocations)
• The posterior of ψ_k,l^{−1}|µ_k,l follows an inverse Gaussian distribution:

  ψ_k,l^{−1}|µ_k,l ∼ InvGau( 1/(B_l |µ_k,l − m_l|), 1/B_l² ).
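This conditional can be drawn with numpy's `wald` sampler, which takes the inverse-Gaussian mean and shape λ; a sketch for a single coefficient (all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m_l, B_l = 0.0, 1.5
mu_kl = 2.0                          # current draw of mu_{k,l}

# psi_{k,l}^{-1} | mu_{k,l} ~ InvGau(1/(B_l |mu_{k,l} - m_l|), 1/B_l^2)
mean = 1.0 / (B_l * abs(mu_kl - m_l))
lam = 1.0 / B_l ** 2
inv_psi = rng.wald(mean, lam, size=100000)
psi = 1.0 / inv_psi                  # prior variances used in the next sweep
```

The first parameter of the inverse Gaussian is its expectation, which gives a cheap Monte Carlo check on the sampler.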
MCMC Estimation
Allow B_l to be an unknown hyperparameter with conjugate prior B_l ∼ G^{−1}(s_l, S_l).
Update B_l at each MCMC sweep, i.e. sample B_l|µ_1,l, . . . , µ_K,l from the appropriate inverted Gamma posterior:

  B_l|µ_1,l, . . . , µ_K,l ∼ G^{−1}( s_l + K, S_l + ∑_{k=1}^{K} |µ_k,l − m_l| ).
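A sketch of this hyperparameter update, drawing from the inverted Gamma by inverting a Gamma draw (the hyperparameter values and current µ draws are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s_l, S_l, m_l = 2.0, 1.0, 0.0            # illustrative hyperparameters
mu_l = np.array([-1.0, -0.3, 0.4, 1.2])  # current draws mu_{1,l}, ..., mu_{K,l}
K = len(mu_l)

# B_l | mu_{1,l}, ..., mu_{K,l} ~ G^{-1}(s_l + K, S_l + sum_k |mu_{k,l} - m_l|)
shape = s_l + K
scale = S_l + np.abs(mu_l - m_l).sum()
B_draws = scale / rng.gamma(shape, size=200000)   # inverted Gamma via 1/Gamma
```

In an actual sampler one draw per sweep suffices; the large sample here only serves to check the posterior mean scale/(shape − 1) of the inverted Gamma.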
Example 2 - Use a Shrinkage Prior
Figure 12: Point process representation of 5000 MCMC draws
Example 2 - Identification
Figure 13: Point process representation of 4983 identified MCMC draws
Concluding Remarks
• Simple relabeling works
• Be decisive when choosing the prior on the weights
• Consequences for RJMCMC and Bayesian nonparametric methods for mixtures
• Shrinkage priors might help if many coefficients are not truly component-specific
References
Bernardo, J. M. and F. G. Girón (1988). A Bayesian analysis of
simple mixture problems. In J. M. Bernardo, M. H. DeGroot,
D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 3,
pp. 67–78. New York: Clarendon.
Celeux, G. (1998). Bayesian inference for mixture: The label
switching problem. In P. J. Green and R. Rayne (Eds.),
COMPSTAT 98, pp. 227–232. Heidelberg: Physica.
Celeux, G., M. Hurn, and C. P. Robert (2000). Computational and
inferential difficulties with mixture posterior distributions. Journal
of the American Statistical Association 95, 957–970.
Fahrmeir, L., T. Kneib, and S. Konrath (2010). Bayesian
regularisation in structured additive regression: A unifying
perspective on shrinkage, smoothing and predictor selection.
Statistics and Computing XX, to appear.
Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo
estimation of classical and dynamic switching and mixture models.
Journal of the American Statistical Association 96, 194–209.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov
Switching Models. New York: Springer.
Frühwirth-Schnatter, S. (2008). Comment on article by T. Rydén
on “EM versus Markov chain Monte Carlo for estimation of hidden
Markov models”. Bayesian Analysis 3, 689–698.
Frühwirth-Schnatter, S. and S. Pyne (2010).
Bayesian
Inference for Finite Mixtures of Univariate and Multivariate
Skew Normal and Skew-t Distributions. Biostatistics. doi:
10.1093/biostatistics/kxp062.
Hahn, M., S. Frühwirth-Schnatter, and J. Sass (2010). Markov
chain Monte Carlo methods for parameter estimation in
multidimensional continuous time Markov switching models.
Journal of Financial Econometrics 8, 88–121.
Hurn, M., A. Justel, and C. P. Robert (2003). Estimating mixtures of
regressions. Journal of Computational and Graphical Statistics 12,
55–79.
Ishwaran, H., L. F. James, and J. Sun (2001). Bayesian model
selection in finite mixtures by marginal density decompositions.
Journal of the American Statistical Association 96, 1316–1332.
Jasra, A., C. C. Holmes, and D. A. Stephens (2005). Markov chain
Monte Carlo methods and the label switching problem in Bayesian
mixture modelling. Statistical Science 20, 50–67.
Marin, J.-M., K. Mengersen, and C. P. Robert (2005). Bayesian
modelling and inference on mixtures of distributions. In D. Dey and
C. Rao (Eds.), Bayesian Thinking, Modelling and Computation,
Volume 25 of Handbook of Statistics, Chapter 16. Amsterdam:
North-Holland.
Nobile, A. (2004). On the posterior distribution of the number
of components in a finite mixture. The Annals of Statistics 32,
2044–2073.
Nobile, A. and A. Fearnside (2007). Bayesian finite mixtures with
an unknown number of components: The allocation sampler.
Statistics and Computing 17, 147–162.
Pamminger, C. and S. Frühwirth-Schnatter (2010). Bayesian clustering of categorical time series using finite mixtures of Markov chain models. Research Report, IFAS, http://www.ifas.jku.at/.
Park, T. and G. Casella (2008). The Bayesian Lasso. Journal of the
American Statistical Association 103, 681–686.
Richardson, S. and P. J. Green (1997). On Bayesian analysis of
mixtures with an unknown number of components. Journal of the
Royal Statistical Society, Ser. B 59, 731–792.
Sperrin, M., T. Jaki, and E. Wit (2009). Probabilistic relabelling
strategies for the label switching problem in Bayesian mixture
models. Statistics and Computing, accepted.
Spezia, L. (2009). Reversible jump and the label switching problem
in hidden Markov models. Journal of Statistical Planning and
Inference 139, 2305–2315.
Stephens, M. (1997). Bayesian Methods for Mixtures of Normal
Distributions. Ph.D. thesis, University of Oxford.
Stephens, M. (2000a). Bayesian analysis of mixture models with
an unknown number of components – An alternative to reversible
jump methods. The Annals of Statistics 28, 40–74.
Stephens, M. (2000b). Dealing with label switching in mixture
models. Journal of the Royal Statistical Society, Ser. B 62,
795–809.
Weber, A. (2001). State dependence and wage dynamics: A
heterogeneous Markov chain model for wage mobility in Austria.
Research report, Institute for Advanced Studies, Vienna.