
Mixtures of Shifted Asymmetric Laplace Distributions
Brian C. Franczak, Ryan P. Browne, and Paul D. McNicholas∗
Department of Mathematics & Statistics, University of Guelph.
Abstract
A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and
classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the
relationship with the generalized inverse Gaussian distribution. This approach is mathematically elegant
and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated
on both simulated and real data to illustrate clustering and classification applications. In these analyses,
our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the
popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future
work.
1 Introduction
Finite mixture models are based on the underlying assumption that a population is a convex combination of
a finite number of densities. They therefore lend themselves quite naturally to classification and clustering
problems. Formally, a random vector X arises from a parametric finite mixture distribution if, for all x ⊂ X,
we can write its density as
\[
f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_g(x \mid \theta_g),
\]
where πg > 0, such that \sum_{g=1}^{G} πg = 1, are the mixing proportions, f1(x | θ1), . . . , fG(x | θG) are called component densities, and ϑ = (π, θ1, . . . , θG) is the vector of parameters with π = (π1, . . . , πG). The component densities f1(x | θ1), . . . , fG(x | θG) are usually taken to be of the same type, most often multivariate Gaussian. In the event that the component densities are multivariate Gaussian, the density of the mixture model is f(x | ϑ) = \sum_{g=1}^{G} πg φ(x | µg, Σg), where φ(x | µg, Σg) is the multivariate Gaussian density with
mean µg and covariance matrix Σg . The popularity of the multivariate Gaussian distribution is due to
its mathematical tractability and its flexibility in capturing densities; we will return to this latter point in
Section 4.1. Herein, we shall follow convention and use the term model-based clustering to mean clustering using mixture models. Model-based classification (e.g., McNicholas, 2010), or partial classification (cf.
McLachlan, 1992, Section 2.7), can be regarded as a semi-supervised version of model-based clustering.
At the time of the review paper of Fraley and Raftery (2002), almost all work on clustering and classification using mixture models had been based on Gaussian mixture models (e.g., Banfield and Raftery (1993),
Celeux and Govaert (1995), Ghahramani and Hinton (1997), Fraley and Raftery (1998), Tipping and Bishop
(1999), McLachlan and Peel (2000a), and McLachlan and Peel (2000b), amongst others). An important
example of non-Gaussian work from this time is the early work on clustering using mixtures of multivariate t-distributions carried out by McLachlan and Peel (1998) and Peel and McLachlan (2000). This work
∗Department of Mathematics & Statistics, University of Guelph, Guelph, Ontario, N1G 2W1, Canada. E-mail: [email protected].
was the forerunner to several papers on clustering using mixtures of multivariate t-distributions, including
those by McLachlan et al. (2007), Andrews and McNicholas (2011), and Baek and McLachlan (2011). Work
has also burgeoned on skew-normal distributions (e.g., Lin, 2009), skew-t distributions (e.g., Lin, 2010; Lee
and McLachlan, 2011; Vrbik and McNicholas, 2012), and other non-elliptically contoured distributions (e.g.,
Karlis and Meligkotsidou, 2007; Karlis and Santourian, 2009; Browne et al., 2012).
The recent burgeoning of non-Gaussian approaches to model-based clustering and classification has coincided with yet more papers on Gaussian approaches. These include work on extensions of mixtures of factor
analyzers (McNicholas and Murphy, 2008, 2010; Baek et al., 2010) and developments on variable selection
and dimension reduction for Gaussian model-based clustering (Raftery and Dean, 2006; Maugis et al., 2009;
Scrucca, 2010). Nevertheless, non-Gaussian approaches have certainly developed more rapidly than Gaussian approaches over the last few years. This paper introduces a non-Gaussian approach that
allows for skewness while also parameterizing location and scale. Our approach is effective while also being mathematically elegant and relatively computationally straightforward. The methodology is developed
in Section 2, parameter estimation via deterministic annealing and an expectation-maximization algorithm
is outlined in Section 3, and both simulated and real data analyses are used to illustrate our approach
(Section 4). The paper concludes with a summary and suggestions for future work (Section 5).
2 Methodology

2.1 Generalized Inverse Gaussian Distribution
The density of a random variable X following a generalized inverse Gaussian (GIG) distribution is given by
\[
q(x) = \frac{(a/b)^{\nu/2}\, x^{\nu-1}}{2 K_{\nu}(\sqrt{ab})} \exp\left\{-\frac{ax + b/x}{2}\right\}, \tag{1}
\]
for x > 0, where a, b ∈ R+ , ν ∈ R, and Kν is the modified Bessel function of the third kind with index ν.
There are several special cases of the GIG distribution, such as the gamma distribution (b = 0, ν > 0) and
the inverse Gaussian distribution (ν = −1/2). The GIG distribution was introduced by Good (1953) and
its statistical properties were laid down by Barndorff-Nielsen and Halgreen (1977), Blæsild (1978), Halgreen
(1979), and Jørgensen (1982). It has some attractive properties including the tractability of the following
expected values:
\[
E[X] = \frac{\sqrt{b}\, K_{\nu+1}(\sqrt{ab})}{\sqrt{a}\, K_{\nu}(\sqrt{ab})}, \qquad
E[1/X] = \frac{\sqrt{a}\, K_{\nu+1}(\sqrt{ab})}{\sqrt{b}\, K_{\nu}(\sqrt{ab})} - \frac{2\nu}{b}. \tag{2}
\]

2.2 Shifted Asymmetric Laplace Distribution
Consider a p-dimensional random vector Z from a centralized asymmetric Laplace (CAL) distribution (Kotz
et al., 2001). The density of Z is given by
\[
f(z \mid \alpha, \Sigma) = \frac{2\exp\{z'\Sigma^{-1}\alpha\}}{(2\pi)^{p/2}|\Sigma|^{1/2}}
\left(\frac{z'\Sigma^{-1}z}{2 + \alpha'\Sigma^{-1}\alpha}\right)^{\nu/2} K_{\nu}(u), \tag{3}
\]
where ν = (2 − p)/2, u = √[(2 + α′Σ⁻¹α) z′Σ⁻¹z], Σ is a covariance matrix, and α ∈ Rp represents the skewness in each dimension. Kotz et al. (2001) use the notation Z ∼ ALp(α, Σ) to indicate that the random variable Z follows a p-dimensional CAL distribution and provide an extensive list of properties.
The CAL density (3) is prohibitive for model-based clustering and classification applications because it would force each component density to be joined at the same origin. To address this problem, consider Z ∼ ALp(α, Σ) and introduce a shift parameter µ ∈ Rp by considering a random vector X = (Z + µ) ∼ SALp(α, Σ, µ), where SALp(α, Σ, µ) denotes a p-dimensional shifted (non-centralized) asymmetric Laplace (SAL) distribution with density given by
\[
\xi(x \mid \alpha, \Sigma, \mu) = \frac{2\exp\{(x - \mu)'\Sigma^{-1}\alpha\}}{(2\pi)^{p/2}|\Sigma|^{1/2}}
\left(\frac{\delta(x, \mu \mid \Sigma)}{2 + \alpha'\Sigma^{-1}\alpha}\right)^{\nu/2} K_{\nu}(u), \tag{4}
\]
where u = √[(2 + α′Σ⁻¹α) δ(x, µ | Σ)], δ(x, µ | Σ) = (x − µ)′Σ⁻¹(x − µ) is the squared Mahalanobis distance between x and µ, and ν, α, and Σ are defined as before.
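For computational reference, (4) can be coded directly; the short Python sketch below is our own illustration, not the authors' implementation, and the function name dsal and its interface are ours. It evaluates the SAL density for a single observation using the modified Bessel function kv from scipy.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the third kind, K_nu


def dsal(x, alpha, Sigma, mu):
    """Evaluate the SAL density (4) at a single observation x (a sketch)."""
    p = len(mu)
    nu = (2.0 - p) / 2.0
    Sinv = np.linalg.inv(Sigma)
    d = x - mu
    delta = d @ Sinv @ d                        # squared Mahalanobis distance delta(x, mu | Sigma)
    a = 2.0 + alpha @ Sinv @ alpha              # 2 + alpha' Sigma^{-1} alpha
    u = np.sqrt(a * delta)
    log_det = np.linalg.slogdet(Sigma)[1]
    front = 2.0 * np.exp(d @ Sinv @ alpha) / ((2.0 * np.pi) ** (p / 2) * np.exp(0.5 * log_det))
    return front * (delta / a) ** (nu / 2.0) * kv(nu, u)
```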
Kotz et al. (2001) note that the random variable Z ∼ ALp(α, Σ) can be generated through the relationship Z = Wα + √W Y, where W is a random variable from an exponential distribution with mean 1 and Y ∼ N(0, Σ) is generated independently of W. Therefore, the random variable X ∼ SALp(α, Σ, µ) can be generated through the relationship X = µ + Wα + √W Y, and so X | W = w ∼ N(µ + wα, wΣ).
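This normal–exponential mixture representation also gives a direct simulation recipe. A minimal sketch, assuming numpy is available (the function name rsal is ours):

```python
import numpy as np


def rsal(n, alpha, Sigma, mu, seed=None):
    """Draw n observations from SAL_p(alpha, Sigma, mu) via X = mu + W*alpha + sqrt(W)*Y."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    w = rng.exponential(scale=1.0, size=n)                     # W ~ exponential with mean 1
    y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)    # Y ~ N(0, Sigma), independent of W
    return mu + w[:, None] * alpha + np.sqrt(w)[:, None] * y
```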
The distribution of W conditional on the data can be computed through the use of Bayes’ theorem, i.e.,
fW (w | X = x) = fX (x | W = w)h(w)/fX (x),
where X | W = w ∼ N(µ + wα, wΣ), W ∼ exp(1), and fX(x) is the density of the shifted asymmetric
Laplace distribution given in (4). It follows that
\[
f_W(w \mid X = x) = \frac{w^{\nu-1}}{2}
\left(\frac{\delta(x, \mu \mid \Sigma)}{2 + \alpha'\Sigma^{-1}\alpha}\right)^{-\nu/2}
\frac{\exp\left\{-\frac{1}{2w}\,\delta(x, \mu \mid \Sigma) - \frac{w}{2}\left(2 + \alpha'\Sigma^{-1}\alpha\right)\right\}}
{K_{\nu}\!\left(\sqrt{\left(2 + \alpha'\Sigma^{-1}\alpha\right)\delta(x, \mu \mid \Sigma)}\right)}, \tag{5}
\]
where ν, α, µ, Σ, and δ(x, µ | Σ) are as defined for (4). Recalling the density of a GIG random variable (1), it then follows from (5) that fW(w | X = x) is the GIG density with a ≡ 2 + α′Σ⁻¹α and b ≡ δ(x, µ | Σ); cf. Barndorff-Nielsen (1997).
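Because W | X = x is GIG with a ≡ 2 + α′Σ⁻¹α and b ≡ δ(x, µ | Σ), the conditional expectations needed later in the E-step reduce to the Bessel-function ratios in (2). A hedged Python sketch (our own helper, not part of the paper):

```python
import numpy as np
from scipy.special import kv


def gig_expectations(a, b, nu):
    """E[W | x] and E[1/W | x] for W | x ~ GIG(a, b, nu), following (2)."""
    s = np.sqrt(a * b)
    ratio = kv(nu + 1.0, s) / kv(nu, s)
    e_w = np.sqrt(b / a) * ratio                      # E[W | x]
    e_winv = np.sqrt(a / b) * ratio - 2.0 * nu / b    # E[1/W | x]
    return e_w, e_winv
```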
We introduce a finite mixture of SAL distributions so that the gth component density is SALp(αg, Σg, µg), where the parameters are as defined for (4). The density of a mixture of SAL distributions is
\[
f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g\, \xi(x \mid \alpha_g, \Sigma_g, \mu_g),
\]
where ϑ is the vector of all model parameters and ξ(x | αg, Σg, µg) is the SAL density from (4).
3 Parameter Estimation

3.1 EM Algorithm
The expectation-maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure for finding
the maximum likelihood estimates when data are incomplete or are treated as such. EM algorithm computations are based on the complete-data likelihood, i.e., the likelihood of the observed data plus the latent
or missing data. In the E-step, the expected value of the complete-data log-likelihood is computed, and in
the M-step, this value is maximized with respect to the model parameters. The E- and M-steps are then
iterated until some convergence criterion is attained.
A common criticism of the use of the EM algorithm for model-based clustering is that the singularity-riddled likelihood surface makes parameter estimation unreliable and heavily dependent on the starting
values. To help overcome this problem, Zhou and Lange (2010) introduced a deterministic annealing algorithm that flattens the likelihood surface by introducing an auxiliary variable v ∈ [0, 1]. We will illustrate this
approach by using a version of this deterministic annealing algorithm in conjunction with an EM algorithm
to fit our SAL mixture model (cf. Section 3.3). Note that Eltoft et al. (2006) give an example of an EM-type
algorithm for fitting a multivariate Laplace distribution.
3.2 Application to Mixture of SAL Distributions
For our SAL mixture models, the complete-data comprise the observed x1, . . . , xn, the component membership labels τ1, . . . , τn, and the latent variables W1, . . . , Wn. For each i, we have τi = (τi1, . . . , τiG), where τig = 1 if observation i is in component g and τig = 0 otherwise, for i = 1, . . . , n and g = 1, . . . , G. The complete-data likelihood is given by
\[
\mathcal{L} = \prod_{i=1}^{n}\prod_{g=1}^{G}\left[\pi_g\, \phi\!\left(x_i \mid \mu_g + w_i\alpha_g,\, w_i\Sigma_g\right) h(w_i)\right]^{\tau_{ig}}, \tag{6}
\]
with the same notation used previously and where φ(xi | µg + wiαg, wiΣg) is the density of a multivariate Gaussian distribution with mean µg + wiαg and covariance matrix wiΣg, and h(wi) = exp{−wi} is the exponential density with mean 1. The expected value of the complete-data log-likelihood is given by
\[
\begin{aligned}
Q = {}& \sum_{g=1}^{G} n_g \log \pi_g - \frac{np}{2}\log 2\pi - \frac{p}{2}\sum_{i=1}^{n}\sum_{g=1}^{G}\hat{\tau}_{ig}\, E[\log W_i \mid x_i] + \sum_{g=1}^{G}\frac{n_g}{2}\log\left|\Sigma_g^{-1}\right| + \sum_{i=1}^{n}\sum_{g=1}^{G}\hat{\tau}_{ig}\,(x_i - \mu_g)'\Sigma_g^{-1}\alpha_g \\
& - \frac{1}{2}\sum_{i=1}^{n}\sum_{g=1}^{G}\hat{\tau}_{ig}\, E[1/W_i \mid x_i]\,(x_i - \mu_g)'\Sigma_g^{-1}(x_i - \mu_g) - \frac{1}{2}\sum_{i=1}^{n}\sum_{g=1}^{G}\hat{\tau}_{ig}\, E[W_i \mid x_i]\,\alpha_g'\Sigma_g^{-1}\alpha_g - \sum_{i=1}^{n}\sum_{g=1}^{G}\hat{\tau}_{ig}\, E[W_i \mid x_i],
\end{aligned}
\]
where n_g = \sum_{i=1}^{n}\hat{\tau}_{ig} and
\[
\hat{\tau}_{ig} := E[\tau_{ig} \mid x_i] = \frac{\pi_g\, \xi(x_i \mid \alpha_g, \Sigma_g, \mu_g)}{\sum_{j=1}^{G} \pi_j\, \xi(x_i \mid \alpha_j, \Sigma_j, \mu_j)}. \tag{7}
\]
The expected values E [Wi | xi ] and E [1/Wi | xi ] are computed using the formulae in (2).
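In code, the E-step therefore amounts to evaluating (7) together with the GIG expectations for every observation and component. A minimal sketch building on the dsal and gig_expectations helpers sketched in Section 2 (again ours, not the authors' implementation):

```python
import numpy as np


def e_step(X, pi, alphas, Sigmas, mus):
    """Return responsibilities per (7) and E[W_i | x_i], E[1/W_i | x_i] for each component."""
    n, p = X.shape
    G = len(pi)
    nu = (2.0 - p) / 2.0
    dens = np.empty((n, G))
    e_w = np.empty((n, G))
    e_winv = np.empty((n, G))
    for g in range(G):
        Sinv = np.linalg.inv(Sigmas[g])
        a = 2.0 + alphas[g] @ Sinv @ alphas[g]
        for i, x in enumerate(X):
            dens[i, g] = pi[g] * dsal(x, alphas[g], Sigmas[g], mus[g])
            b = (x - mus[g]) @ Sinv @ (x - mus[g])    # delta(x_i, mu_g | Sigma_g)
            e_w[i, g], e_winv[i, g] = gig_expectations(a, b, nu)
    tau = dens / dens.sum(axis=1, keepdims=True)       # hat tau_ig as in (7)
    return tau, e_w, e_winv
```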
In the M-step, we maximize Q with respect to the model parameters to get the updates. Specifically, the
mixing proportions, skewness parameter, and shift parameter are updated by π̂g = ng /n,
\[
\hat{\alpha}_g = \frac{\sum_{i=1}^{n}\hat{\tau}_{ig} E[1/W_i \mid x_i]\sum_{j=1}^{n}\hat{\tau}_{jg}\, x_j - n_g\sum_{i=1}^{n}\hat{\tau}_{ig} E[1/W_i \mid x_i]\, x_i}{\sum_{i=1}^{n}\hat{\tau}_{ig} E[W_i \mid x_i]\sum_{j=1}^{n}\hat{\tau}_{jg} E[1/W_j \mid x_j] - n_g^2},
\]
and
\[
\hat{\mu}_g = \frac{\sum_{i=1}^{n}\hat{\tau}_{ig} E[W_i \mid x_i]\sum_{j=1}^{n}\hat{\tau}_{jg} E[1/W_j \mid x_j]\, x_j - n_g\sum_{i=1}^{n}\hat{\tau}_{ig}\, x_i}{\sum_{i=1}^{n}\hat{\tau}_{ig} E[W_i \mid x_i]\sum_{j=1}^{n}\hat{\tau}_{jg} E[1/W_j \mid x_j] - n_g^2},
\]
respectively. Each component covariance matrix Σ̂g is updated by
\[
\hat{\Sigma}_g = S_g - \hat{\alpha}_g r_g' - r_g \hat{\alpha}_g' + \hat{\alpha}_g\hat{\alpha}_g'\,\frac{1}{n_g}\sum_{i=1}^{n}\hat{\tau}_{ig} E[W_i \mid x_i], \tag{8}
\]
where Sg = (1/ng) \sum_{i=1}^{n} τ̂ig E[1/Wi | xi](xi − µ̂g)(xi − µ̂g)′ and rg = (1/ng) \sum_{i=1}^{n} τ̂ig (xi − µ̂g).
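The M-step updates above have closed forms, so they translate directly into code. The following sketch updates one component using the E-step quantities; it follows our reconstruction of the formulae above and is offered as an illustration only (e_w and e_winv denote the columns of E[Wi | xi] and E[1/Wi | xi] for component g):

```python
import numpy as np


def m_step_component(X, tau_g, e_w, e_winv):
    """Closed-form updates for mu_g, alpha_g, and Sigma_g for a single component."""
    n_g = tau_g.sum()
    A = np.sum(tau_g * e_w)                                   # sum_i tau_ig E[W_i | x_i]
    B = np.sum(tau_g * e_winv)                                # sum_i tau_ig E[1/W_i | x_i]
    C = (tau_g[:, None] * X).sum(axis=0)                      # sum_i tau_ig x_i
    D = (tau_g[:, None] * e_winv[:, None] * X).sum(axis=0)    # sum_i tau_ig E[1/W_i | x_i] x_i
    denom = A * B - n_g ** 2
    alpha_hat = (B * C - n_g * D) / denom
    mu_hat = (A * D - n_g * C) / denom
    R = X - mu_hat
    S = (tau_g[:, None] * e_winv[:, None] * R).T @ R / n_g    # S_g
    r = (tau_g[:, None] * R).sum(axis=0) / n_g                # r_g
    Sigma_hat = (S - np.outer(alpha_hat, r) - np.outer(r, alpha_hat)
                 + np.outer(alpha_hat, alpha_hat) * (A / n_g))
    return mu_hat, alpha_hat, Sigma_hat
```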
The E- and M-steps are iterated until convergence. Practically, the E-step consists of updating the values
of the expected values τ̂ig , E [Wi | xi ], and E [1/Wi | xi ]. At the first iteration, these updates are based on
the initialized values of the parameter estimates π̂g , α̂g , µ̂g , and Σ̂g (cf. Section 3.3). For all other iterations,
these updates are based on the values of the parameter estimates from the previous iteration. The M-step
consists of updating the values of the parameters π̂g , α̂g , µ̂g , and Σ̂g based on the expected values from the
E-step. The E- and M-steps are iterated until convergence (cf. Section 3.4).
At convergence, the τ̂ig are the a posteriori probabilities of component membership for each observation
and can be used to cluster the observations into groups. Predicted classifications are obtained via maximum a
posteriori (MAP) probabilities, where MAP{τ̂ig } = 1 if maxg {τ̂ig } occurs at component g and MAP{τ̂ig } = 0
otherwise.
3.3 Initialization
The deterministic annealing algorithm (Zhou and Lange, 2010) is the same as the EM algorithm described
in Section 3.1, except that now
\[
E[\tau_{ig} \mid x_i] = \frac{\left[\pi_g\, \xi(x_i \mid \alpha_g, \Sigma_g, \mu_g)\right]^{v}}{\sum_{h=1}^{G}\left[\pi_h\, \xi(x_i \mid \alpha_h, \Sigma_h, \mu_h)\right]^{v}} \tag{9}
\]
in each E-step. Here, the auxiliary parameter v ∈ [0, 1], which is drawn from an increasing sequence of user-specified length, transforms the likelihood surface to improve the chances of finding the dominant mode. The user-specified sequence runs from 0 to 1 and its length determines how many iterations of the deterministic annealing algorithm will be performed. The annealing algorithm itself is initialized using random starting
values of πg , αg , Σg , and µg . In our analyses (Section 4), we run the deterministic annealing algorithm ten
times and choose the values that give the highest likelihood as the starting values for our EM algorithms.
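The annealed E-step differs from (7) only in that each weighted component density is raised to the power v before normalization; a minimal sketch reusing the dsal helper (our own illustration):

```python
import numpy as np


def annealed_e_step(X, pi, alphas, Sigmas, mus, v):
    """Responsibilities per (9), with component weights tempered by v in [0, 1]."""
    n, G = X.shape[0], len(pi)
    dens = np.empty((n, G))
    for g in range(G):
        dens[:, g] = [pi[g] * dsal(x, alphas[g], Sigmas[g], mus[g]) for x in X]
    tempered = dens ** v
    return tempered / tempered.sum(axis=1, keepdims=True)
```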
3.4 Convergence

3.4.1 Aitken acceleration
The Aitken acceleration (Aitken, 1926) is used to determine convergence of our EM algorithms. An EM algorithm can be considered to have converged when l∞^(k+1) − l^(k+1) < ε, where l^(k+1) is the log-likelihood at iteration k + 1 and
\[
l_{\infty}^{(k+1)} = l^{(k)} + \frac{l^{(k+1)} - l^{(k)}}{1 - a^{(k)}}
\]
is an asymptotic estimate of the log-likelihood at iteration k + 1 (cf. Böhning et al., 1994). The value a^(k) = [l^(k+1) − l^(k)]/[l^(k) − l^(k−1)] is the Aitken acceleration at iteration k. For the analyses herein, we use a slightly modified convergence criterion, stopping our EM algorithms when l∞^(k+1) − l^(k) < ε (cf. Lindsay, 1995).
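This stopping rule amounts to a few lines of code. A sketch, assuming loglik holds the sequence of log-likelihood values from successive iterations and that the tolerance eps is user-chosen (the default below is purely illustrative):

```python
def aitken_converged(loglik, eps=1e-5):
    """Stop when the Aitken-based asymptotic estimate satisfies l_inf^(k+1) - l^(k) < eps."""
    if len(loglik) < 3:
        return False
    l_km1, l_k, l_kp1 = loglik[-3], loglik[-2], loglik[-1]
    if l_k == l_km1:                               # log-likelihood has stopped changing
        return True
    a_k = (l_kp1 - l_k) / (l_k - l_km1)            # Aitken acceleration at iteration k
    l_inf = l_k + (l_kp1 - l_k) / (1.0 - a_k)      # asymptotic estimate of the log-likelihood
    return (l_inf - l_k) < eps
```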
3.4.2 Dealing with infinite likelihood
As our EM algorithm iterates, we must handle the complications that arise when computing µ̂g . Specifically,
as the algorithm iterates towards convergence, the value of µ̂g will tend to an observation xi . This happens
because of the [δ(x, µ | Σ)/(2 + α′Σ⁻¹α)]^{ν/2} term in the multivariate SAL density (4). Although these
estimates of µ̂g maximize the likelihood, they create computational issues when trying to determine the
remaining parameter values and, specifically, the expected value E [1/Wi | xi ].
To overcome this problem, we stop searching for µg when µ̂g has the same value as some xi because the
log-likelihood becomes infinite at this point. We proceed by taking the value of µ̂g at the iteration before it
becomes equal to any xi (we denote this value µ̂*g) as the estimate for µg. We then update αg using
\[
\hat{\alpha}_g^{*} = \frac{\sum_{i=1}^{n}\hat{\tau}_{ig}\left(x_i - \hat{\mu}_g^{*}\right)}{\sum_{i=1}^{n}\hat{\tau}_{ig} E[W_i \mid x_i]},
\]
and then update Σg via (8) using µ̂*g and α̂*g. We use a different approach to overcome this problem in the deterministic annealing algorithm, simply preventing the expected value E[1/Wi | xi] from exceeding a value of − log(1 − v) at each iteration. We acknowledge that our solution to this problem is a simple-minded
one. However, we have found it to be quite effective and a more thorough exploration of this problem in
general is the subject of ongoing work.
3.5 Model-Based Classification
Suppose we have n observations and k of these observations have known group memberships. We can
use the group memberships of these k observations to estimate memberships for the remaining n − k observations within a joint likelihood framework. This approach, known as model-based classification, is a
semi-supervised version of model-based clustering. Without loss of generality, we order the observations
x1 , . . . , xk , xk+1 , . . . , xn so that the first k have known group memberships. Therefore, the values of τig are
known for i = 1, . . . , k and the SAL model-based classification likelihood is given by
\[
\mathcal{L}\left(x_1, \ldots, x_n, \tau_1, \ldots, \tau_n \mid \vartheta\right) =
\prod_{i=1}^{k}\prod_{g=1}^{G}\left[\pi_g\, \xi\left(x_i \mid \alpha_g, \Sigma_g, \mu_g\right)\right]^{\tau_{ig}}
\prod_{j=k+1}^{n}\sum_{h=1}^{H}\pi_h\, \xi\left(x_j \mid \alpha_h, \Sigma_h, \mu_h\right), \tag{10}
\]
where H ≥ G. Parameter estimation is carried out in an analogous fashion to model-based clustering.
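For reference, the classification log-likelihood corresponding to (10) with H = G can be coded as follows; known is a boolean mask indicating which observations are labelled and labels gives their components (our own sketch, reusing the dsal helper, not the authors' implementation):

```python
import numpy as np


def classification_loglik(X, labels, known, pi, alphas, Sigmas, mus):
    """log of (10) with H = G: labelled points use their own component, the rest are mixed over."""
    ll = 0.0
    for i, x in enumerate(X):
        dens = np.array([pi[g] * dsal(x, alphas[g], Sigmas[g], mus[g]) for g in range(len(pi))])
        ll += np.log(dens[labels[i]]) if known[i] else np.log(dens.sum())
    return ll
```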
3.6 Model Selection and Performance
The Bayesian information criterion (BIC; Schwarz, 1978) is commonly used for Gaussian mixture model
selection:
BIC = 2l(x | ϑ̂) − ν log n,
where l(x | ϑ̂) is the maximized log-likelihood, ϑ̂ is the maximum likelihood estimate of ϑ, ν is the number
of free parameters in the model, and n is the number of observations. In the analyses herein, we prefer to use
a model selection approach that is specifically designed for clustering and classification applications. To this
end, we use the integrated classification likelihood (ICL; Biernacki et al., 2000), which is calculated using
\[
\text{ICL} \approx \text{BIC} + \sum_{i=k+1}^{n}\sum_{g=1}^{G}\text{MAP}\{\hat{z}_{ig}\}\log \hat{z}_{ig},
\]
where \sum_{i=k+1}^{n}\sum_{g=1}^{G} MAP{ẑig} log ẑig, the estimated mean entropy, reflects the uncertainty in the classification of observation i into group g.
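Given a fitted model, the ICL as used here is just the BIC penalized by the estimated entropy term. A sketch with hypothetical arguments (loglik is the maximized log-likelihood, n_params the number of free parameters, and tau the matrix of ẑig for the observations entering the entropy term):

```python
import numpy as np


def icl(loglik, n_params, n, tau, tiny=1e-300):
    """BIC plus the MAP-weighted entropy of the a posteriori probabilities."""
    bic = 2.0 * loglik - n_params * np.log(n)
    map_mask = (tau == tau.max(axis=1, keepdims=True)).astype(float)   # MAP{z_ig}
    return bic + np.sum(map_mask * np.log(tau + tiny))
```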
To assess clustering and classification performance, we use a cross tabulation of our MAP classifications
against the true group memberships. We then compute the adjusted Rand index (ARI; Hubert and Arabie,
1985). The Rand index (Rand, 1971) was introduced to compare partitions. It is the ratio of pairs that
should be and are together plus pairs that should be and are apart, divided by the total number of pairs.
The Rand index takes a value between 0 and 1, where 1 indicates perfect agreement. An unattractive
feature of the Rand index is that it has a positive expected value under random classification. To correct
this undesirable property, Hubert and Arabie (1985) introduced the ARI to account for chance agreement.
The ARI also takes a value of 1 when classification is perfect but has an expected value of 0 under random
classification. The ARI can also take negative values and this happens for classifications that are worse than
would be expected by chance.
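In practice, the ARI can be computed directly from the predicted and true labels; for instance, scikit-learn provides an implementation (mentioned here as an external convenience for illustration, not something used in the paper):

```python
from sklearn.metrics import adjusted_rand_score

# toy example: three of four observations are classified consistently with the truth
true_labels = [0, 0, 1, 1]
map_labels = [0, 0, 1, 0]
print(adjusted_rand_score(true_labels, map_labels))   # 1 = perfect agreement, ~0 under random labels
```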
4 Data Analyses

4.1 Introduction
The SAL mixture models are applied to simulated data (Section 4.2) and to two real data sets: the famous Old
Faithful geyser data (Section 4.3) and data on cellular localization sites for proteins in yeast (Section 4.4).
The simulation study was conducted to assess the accuracy of our SAL parameter estimates, including
selection of the number of components, and to draw comparison with the Gaussian mixture model approach.
In the real analyses, we illustrate the SAL approach, judging its performance alongside that of Gaussian
mixtures.
On the face of it, a comparison of our SAL approach to Gaussian mixtures might be considered a little
empty because one would expect Gaussian mixtures to use more than one component to model a cluster with
skewness. Indeed, there has been a lot of work within the literature on merging Gaussian components (e.g.,
Baudry et al., 2010; Hennig, 2010). However, our examples illustrate that when predicted classifications
differ for the SAL and Gaussian approaches, merging Gaussian components cannot always be used to rectify
shortcomings in the Gaussian results (cf. Sections 4.2 and 4.4). One may also argue that the mixture
of multivariate skew-t distributions could also be used to accommodate skewness. This is true, but two
different representations of the mixture of skew-t distributions have appeared within the literature (cf. Sahu
et al., 2003; Pyne et al., 2009) and neither has the elegance of SAL mixtures; this is most apparent in
the intractability of the E-step for the skew-t approach (cf. Lin, 2010; Lee and McLachlan, 2011; Vrbik and
McNicholas, 2012).
Note that, for all analyses, we use the same deterministic annealing starting values for the mixtures of
Gaussian distributions as for our SAL mixtures. Further, we treat each analysis as a genuine clustering
problem by removing all labels; we then apply SAL and Gaussian mixture models.
4.2 Simulation Study
We used the relationship between the SAL and Gaussian distributions (cf. Section 2.2) to generate multivariate SAL data. Specifically, we simulated 25 data sets for n = 500 with p = 2 dimensions and G = 2
components (e.g., Figure 1). The data were generated using skewness parameters α1 = (2, 1) and α2 = (2, 2),
shift parameters µ1 = (0, −2) and µ2 = (0, 5), and covariance matrices
\[
\Sigma_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
\qquad \text{and} \qquad
\Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
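Under these parameter settings, one replicate of the simulated data can be generated with the rsal sketch from Section 2.2 (our own illustration of the design; equal component sizes of 250 are an assumption, as the mixing proportions are not stated explicitly):

```python
import numpy as np

alpha1, alpha2 = np.array([2.0, 1.0]), np.array([2.0, 2.0])
mu1, mu2 = np.array([0.0, -2.0]), np.array([0.0, 5.0])
Sigma1 = np.array([[1.0, 0.5], [0.5, 1.0]])
Sigma2 = np.eye(2)

# 250 observations per component gives n = 500 (equal component sizes are our assumption)
X_sim = np.vstack([rsal(250, alpha1, Sigma1, mu1, seed=1),
                   rsal(250, alpha2, Sigma2, mu2, seed=2)])
```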
Figure 1: Example of a simulated data set (n = 500), coloured by component.
The clustering results (Table 1) show that the SAL mixtures gave almost perfect clustering performance
over all 25 runs (average ARI=0.9968). The Gaussian approach, however, gave relatively poor clustering
performance (average ARI=0.4988) and never returned the correct number of clusters. The most common
Gaussian solution (chosen 21 times) has five components, e.g., Figure 2, and each of the other four solutions
had four components.
Table 1: Summary for the analysis of the 25 simulated data sets using our SAL mixtures, and using the
Gaussian approach.
                          G = 2 selected    Average ARI (std. dev.)
SAL                       100%              0.9968 (0.00516)
Gaussian                  0%                0.4988 (0.06052)
Figure 2: One of the G = 5 Gaussian solutions for the simulated data, coloured by MAP classification
results.
The five-component Gaussian solution depicted in Figure 2 is typical of 21 of the cases observed in this
simulation. Particularly striking here is that the components cannot be combined to give the correct cluster
memberships. Instead, there is a clear ‘noise’ group comprising points that do not fit neatly within any of
the other four components. Although one might argue that this problem is obvious by inspection in two
dimensions, and might be resolved in some post hoc fashion, it would be virtually impossible to detect or
resolve in all but very low dimensional examples.
4.3 Old Faithful Geyser Data
The famous Old Faithful geyser data, which are available as faithful in R (R Development Core Team,
2012), comprise a two-variable data set measuring the waiting time between eruptions and the duration of
272 eruptions of the Old Faithful geyser in Yellowstone National Park. These data are well-known as an
example of skewness and they have been used many times to illustrate approaches to analyzing skewed data
(e.g., Ali et al., 2010). Our mixture of SAL distributions and mixtures of Gaussian distributions were fitted
to the geyser data. There are no ‘true’ classifications for these data but both approaches selected sensible
groups with identical classifications. The associated contour plots (Figure 3) illustrate that the fit to the
data is better for the mixture of SAL distributions. Note that applying a mixture of skew-t distributions to
these data will also give nicely fitting contours (cf. Vrbik and McNicholas, 2012). However, as discussed in
Section 4.1, the mixture of SAL distributions is less computationally cumbersome.
Figure 3: Model-based clustering results with contours for the SAL and Gaussian mixture models (GMMs)
on the Old Faithful data.
4.4 Yeast Data

4.4.1 The Data

Figure 4: The yeast data with the CYT and ME3 location sites highlighted.
The yeast data, which are available from the UCI machine learning repository, contain cellular localization
sites of 1,484 proteins. The development of these data, as well as classification results using a ‘rule-based
expert system’, are discussed by Nakai and Kanehisa (1991, 1992). For illustration, we consider three
variables: McGeoch’s method for signal sequence recognition (mcg), the score of the ALOM membrane
spanning region prediction program (alm), and the score of discriminant analysis of the amino acid content
of vacuolar and extracellular proteins (vac). The goal of our cluster analysis is to distinguish between the
two localization sites, CYT (cytosolic or cytoskeletal) and ME3 (membrane protein, no N-terminal signal),
cf. Figure 4.
4.4.2 Model-Based Clustering Results
The SAL and Gaussian mixture models were fitted to these data for G = 1, . . . , 5 components and the best fitting model was chosen using the ICL. The chosen SAL mixture model had two components (ICL=−5173.133)
and the chosen Gaussian mixture model had three components (ICL=−5161.878). The classification performance of the SAL mixture model (ARI=0.81) is superior to that of the Gaussian mixture model (ARI=0.56),
as illustrated in Table 2 and Figure 5. Again, this superiority cannot be negated by merging Gaussian components.
Table 2: Clustering results for the best SAL and Gaussian mixture model, as chosen by ICL, for the yeast
data. The localization sites are cross-tabulated against our predicted classifications (A, B, C) in each case.
                 SAL                 GMM
            A       B         A       B       C
CYT       448      15       379      12      72
ME3        14     149        13      11     139
Figure 5: Clustering results for the SAL and Gaussian mixture models on the yeast data; the colours highlight
the predicted component memberships.
Now, one may argue that the Gaussian mixture model would perform better under a different model
selection criterion. Therefore, we also investigated the Gaussian mixture model with G = 2 (Table 3).
Surprisingly, this Gaussian mixture model gives very poor clustering performance, producing classifications
that are worse than would be expected by guessing (i.e., ARI=−0.088 < 0).
Table 3: Classification results for the two component Gaussian mixture model on the yeast data. The
localization sites are cross-tabulated against our predicted classifications (A, B).
            A       B
CYT       106     357
ME3         1     162
4.4.3 Model-Based Classification Results
To compare model-based classification within the SAL and Gaussian mixture modelling frameworks, we
analyze the yeast data with 70% of the group memberships taken to be known. We set G = H = 2 and each
model was fitted using 25 different random 70/30 partitions of the data. The aggregate classification results
(Table 4) and ARIs (0.86 and −0.080, respectively) indicate that the SAL mixture models outperform their
Gaussian counterparts by some margin.
Table 4: Aggregate classification results for the SAL and Gaussian mixture models for the yeast data
with 70% of the labels taken as known. The localization sites are cross-tabulated against our predicted
classifications (A, B) for the observations with unknown labels in each case.
                 SAL                  GMM
            A        B           A        B
CYT      3403       71         964     2502
ME3        87     1139          29     1205
The poor performance of the Gaussian mixture modelling approach here is surprising because the parameter estimates are computed with 70% of the localization sites taken as known. This result raises questions
around the efficacy of the semi-supervised Gaussian model-based classification approach, as well as further
reinforcing the need for more flexible non-Gaussian approaches. Once again, the poor performance of the
Gaussian approach on these data cannot be mitigated by merging components.
5 Summary and Discussion
A mixture of SAL distributions model was introduced and applied for both clustering and classification.
An EM algorithm was used for parameter estimation, with starting values obtained using the deterministic
annealing approach of Zhou and Lange (2010). To account for both parsimony and entropy, the ICL was used
to select the number of mixture components. Our model-based clustering approach was illustrated on both
real and simulated data.
In the simulation, data were generated from a SAL mixture model with two components that were very
well separated. The SAL mixtures gave near-perfect results on these data whereas the Gaussian mixture
models consistently overestimated the number of components. Furthermore, it is particularly notable that
the overestimation of the number of components by the Gaussian mixture models could not be resolved by
merging components.
We also analyzed two real data sets. For the Old Faithful data, we considered only model-based clustering
with the SAL and Gaussian mixture models and both gave the same predicted group memberships. However,
a contour plot revealed that the SAL mixture model captured the shape of the data far more effectively
than its Gaussian counterpart. The yeast protein location data presented a much more difficult clustering
problem, and so we also used these data to illustrate model-based classification. For model-based clustering,
the chosen SAL model gave very good clustering performance (G = 2, ARI=0.81) and outperformed its
Gaussian counterpart (G = 3, ARI=0.56). Furthermore, when we forced G = 2 components, the Gaussian
mixture modelling approach gave worse clustering performance than would be expected by guessing (ARI< 0).
In the model-based classification applications, we set G = H = 2 and considered 25 random subsets of
the data for which 70% of the locations were taken as known. The SAL mixtures again gave excellent
performance (ARI=0.86) but the Gaussian mixtures had negative ARIs; it is surprising that the Gaussian
approach gave very poor classification performance in the semi-supervised case when 70% of the locations
were taken as known. This result calls into question the efficacy of the Gaussian model-based classification
approach. Again, as with the simulation study, the poor performance of Gaussian mixtures on the yeast
data could not be mitigated by merging components.
A decade on from the landmark paper of Fraley and Raftery (2002), we have put forth a case for
substantial departure from the Gaussian model-based clustering paradigm. Unlike the skew-normal and skew-t approaches, which are perhaps less substantial departures, our approach is elegant and computationally straightforward. This paper reinforces the fact that the literature is moving away from Gaussian approaches
with some alacrity but it should not be taken as being pejorative of Gaussian mixture models. Gaussian
mixture models remain a highly effective method for clustering and classifying certain types of data, and will
forever be the midwife that introduced mixture model-based clustering. Gaussian mixtures can, however,
give misleading results and, as we have illustrated, one cannot rely on rectifying poor performance by merging
components.
As mentioned in Section 3.4.2, a better solution to the issue of infinite log-likelihood values in our EM
algorithm is the subject of ongoing work. Future work will focus on the introduction of parsimony into
these models by decomposing the covariance structure in the same way as described by Celeux and Govaert
(1995) for the Gaussian mixture model (Σg = λg Dg Ag D′g). Analysis of very high-dimensional data with our
SAL mixtures will be explored along the lines of work by McNicholas and Murphy (2008, 2010) and Baek
et al. (2010), who focused on extensions of mixtures of factor analyzers. Handling noise will be considered,
both via the addition of a uniform component (cf. Banfield and Raftery, 1993) and through the trimmed
clustering procedures of Gallegos and Ritter (2005). Work will also be conducted on developing a model
selection technique to compare clustering results from SAL mixtures with those from other mixtures, such
as mixtures of skew-t distributions.
Acknowledgements
This work was supported by an Ontario Graduate Scholarship, the University Research Chair in Computational Statistics, and an Early Researcher Award from the Ontario Ministry of Research and Innovation.
References
Aitken, A. C. (1926). On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal
Society of Edinburgh 46, 289–305.
Ali, M. M., J. Woo, and S. Nadarajah (2010). Some skew symmetric inverse reflected distributions. Brazilian
Journal of Probability and Statistics 24 (1), 1–23.
Andrews, J. L. and P. D. McNicholas (2011). Mixtures of modified t-factor analyzers for model-based
clustering, classification, and discriminant analysis. Journal of Statistical Planning and Inference 141 (4),
1479–1486.
Baek, J. and G. J. McLachlan (2011). Mixtures of common t-factor analyzers for clustering high-dimensional
microarray data. Bioinformatics 27, 1269–1276.
Baek, J., G. J. McLachlan, and L. K. Flack (2010). Mixtures of factor analyzers with common factor loadings:
Applications to the clustering and visualization of high-dimensional data. IEEE Transactions on Pattern
Analysis and Machine Intelligence 32, 1298–1309.
Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (3), 803–821.
Barndorff-Nielsen, O. and C. Halgreen (1977). Infinite divisibility of the hyperbolic and generalized inverse
Gaussian distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 38, 309–311.
Barndorff-Nielsen, O. E. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling.
Scandinavian Journal of Statistics 24 (1), 1–13.
Baudry, J.-P., A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo (2010). Combining mixture components
for clustering. Journal of Computational and Graphical Statistics 19, 332–353.
Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture model for clustering with the integrated
completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (7), 719–725.
Blæsild, P. (1978). The shape of the generalized inverse Gaussian and hyperbolic distributions. Research
Report 37, Department of Theoretical Statistics, Aarhus University, Denmark.
Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood
ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of
Statistical Mathematics 46, 373–388.
Browne, R. P., P. D. McNicholas, and M. D. Sparling (2012). Model-based learning using a mixture of
mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine
Intelligence 34 (4), 814–817.
Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28,
781–793.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society: Series B 39 (1), 1–38.
Eltoft, T., T. Kim, and T. Lee (2006). On the multivariate Laplace distribution. IEEE Signal Processing
Letters 13 (5), 300–303.
Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.
Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation.
Journal of the American Statistical Association 97 (458), 611–631.
Gallegos, M. T. and G. Ritter (2005). A robust method for cluster analysis. Annals of Statistics 33, 347–380.
Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters.
Biometrika 40, 237–260.
Halgreen, C. (1979). Self-decomposability of the generalized inverse Gaussian and hyperbolic distributions.
Z. Wahrscheinlichkeitstheorie Verw. Gebiete 47, 13–18.
Hennig, C. (2010). Methods for merging Gaussian mixture components. Advances in Data Analysis and
Classification 4, 3–34.
Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.
Jørgensen, B. (1982). Statistical Properties of the Generalized Inverse Gaussian Distribution. New York:
Springer-Verlag.
Karlis, D. and L. Meligkotsidou (2007). Finite mixtures of multivariate Poisson distributions with application.
Journal of Statistical Planning and Inference 137 (6), 1942–1960.
Karlis, D. and A. Santourian (2009). Model-based clustering with non-elliptically contoured distributions.
Statistics and Computing 19 (1), 73–83.
Kotz, S., T. J. Kozubowski, and K. Podgorski (2001). The Laplace Distribution and Generalizations: A
Revisit with Applications to Communications, Economics, Engineering, and Finance (1st ed.). Birkhäuser
Boston.
Lee, S. and G. J. McLachlan (2011). On the fitting of mixtures of multivariate skew t-distributions via the
EM algorithm. arXiv:1109.4706.
Lin, T.-I. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of
Multivariate Analysis 100, 257–265.
Lin, T.-I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing 20, 343–356.
Lindsay, B. G. (1995). Mixture models: Theory, geometry and applications. In NSF-CBMS Regional
Conference Series in Probability and Statistics, Volume 5. Hayward, California: Institute of Mathematical Statistics.
Maugis, C., G. Celeux, and M.-L. Martin-Magniette (2009). Variable selection for clustering with Gaussian
mixture models. Biometrics 65 (3), 701–709.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New Jersey: John
Wiley & Sons.
McLachlan, G. J., R. W. Bean, and L. B.-T. Jones (2007). Extension of the mixture of factor analyzers
model to incorporate the multivariate t-distribution. Computational Statistics and Data Analysis 51 (11),
5327–5338.
McLachlan, G. J. and D. Peel (1998). Robust cluster analysis via mixtures of multivariate t-distributions.
In Lecture Notes in Computer Science, Volume 1451, pp. 658–666. Berlin: Springer-Verlag.
McLachlan, G. J. and D. Peel (2000a). Finite Mixture Models. New York: John Wiley & Sons.
McLachlan, G. J. and D. Peel (2000b). Mixtures of factor analyzers. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, pp. 599–606. Morgan Kaufmann.
McNicholas, P. D. (2010). Model-based classification using latent Gaussian mixture models. Journal of
Statistical Planning and Inference 140 (5), 1175–1181.
McNicholas, P. D. and T. B. Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18, 285–296.
McNicholas, P. D. and T. B. Murphy (2010). Model-based clustering of microarray expression data via latent
Gaussian mixture models. Bioinformatics 26 (21), 2705–2712.
Nakai, K. and M. Kanehisa (1991). Expert system for predicting protein localization sites in gram-negative
bacteria. Proteins 11 (2), 95–110.
Nakai, K. and M. Kanehisa (1992). A knowledge base for predicting protein localization sites in eukaryotic
cells. Genomics 14 (4), 897–911.
Peel, D. and G. J. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and
Computing 10 (4), 339–348.
Pyne, S., X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan, P. Tamayo,
D. A. Hafler, P. L. De Jager, and J. P. Mesirov (2009). Automated high-dimensional flow cytometric data
analysis. Proceedings of the National Academy of Sciences 106, 8519–8524.
R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing.
Raftery, A. E. and N. Dean (2006). Variable selection for model-based clustering. Journal of the American
Statistical Association 101 (473), 168–178.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American
Statistical Association 66, 846–850.
Sahu, S. K., D. K. Dey, and M. D. Branco (2003). A new class of multivariate skew distributions with
application to Bayesian regression models. Canadian Journal of Statistics 31, 129–150.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.
Scrucca, L. (2010). Dimension reduction for model-based clustering. Statistics and Computing 20 (4), 471–
484.
Tipping, T. E. and C. M. Bishop (1999). Mixtures of probabilistic principal component analysers. Neural
Computation 11 (2), 443–482.
Vrbik, I. and P. D. McNicholas (2012). Analytic calculations for the EM algorithm for multivariate skew-t
mixture models. Statistics and Probability Letters 82 (6), 1169–1174.
Zhou, H. and K. L. Lange (2010). On the bumpy road to the dominant mode. Scandinavian Journal of
Statistics 37 (4), 612–631.