
Learning mixtures by simplifying kernel density estimators
Olivier Schwander⋆ and Frank Nielsen⋆†
⋆ Laboratoire d'Informatique, École Polytechnique, Palaiseau, France
† Sony Computer Science Laboratories Inc., Tokyo, Japan
{schwander,nielsen}@lix.polytechnique.fr
Abstract. Gaussian mixture models are a widespread tool for modeling various and complex probability density functions. They can be estimated by various means, often using Expectation-Maximization or Kernel Density Estimation. In addition to these well-known algorithms, new and promising stochastic modeling methods include Dirichlet Process mixtures and k-Maximum Likelihood Estimators. Most of the methods, including Expectation-Maximization, lead to compact models but may be expensive to compute. On the other hand, Kernel Density Estimation yields large models which are computationally cheap to build. In this paper we present new methods to get high-quality models that are both compact and fast to compute. This is accomplished by the simplification of Kernel Density Estimators. The simplification is a clustering method based on k-means-like algorithms. Like all k-means algorithms, our methods rely on divergences and centroid computations, and we use two different divergences (and their associated centroids), Bregman and Fisher-Rao. Along with the description of the algorithms, we describe the pyMEF library, a Python library designed for the manipulation of mixtures of exponential families. Unlike most of the other existing tools, this library allows the use of any exponential family instead of being limited to a particular distribution. This generic library allows one to rapidly explore the different available exponential families in order to choose the one best suited for a particular application. We evaluate the proposed algorithms by building mixture models on examples from a bio-informatics application. The quality of the resulting models is measured in terms of log-likelihood and of Kullback-Leibler divergence.

Keywords: Kernel Density Estimation, simplification, Expectation-Maximization, k-means, Bregman, Fisher-Rao
1 Introduction
Statistical methods are nowadays commonplace in modern signal processing. There are basically two major approaches for modeling experimental data by probability distributions: we may either consider a semi-parametric modeling by a finite mixture model learnt using the Expectation-Maximization (EM) procedure, or alternatively choose a non-parametric modeling using a Kernel Density Estimator (KDE).
On the one hand, mixture modeling requires to fix or learn the number of components but provides a useful compact representation of the data. On the other hand, a KDE finely describes the underlying empirical distribution at the expense of a dense model size. In this paper, we present a novel statistical modeling method that efficiently simplifies a KDE model with respect to an underlying distance between Gaussian kernels. We consider the Fisher-Rao metric and the Kullback-Leibler divergence. Since the underlying Fisher-Rao geometry of Gaussians is hyperbolic without a closed-form equation for the centroids, we rather adopt a close approximation that bears the name of hyperbolic model centroid, and show its use in a single-step clustering method. We report on experiments which show that the KDE simplification paradigm is a competitive approach over the classical EM, in terms of both processing time and quality.
In Section 2, we present generic results about exponential families: definition, Legendre transform, the various forms of parametrization, and the associated Bregman divergences. These preliminary notions allow us to introduce the Bregman hard clustering algorithm for the simplification of mixtures.
In Section 3, we present the mixture models and we briefly describe some algorithms to build them.
In Section 4, we introduce tools for the simplification of mixture models. We begin with the well-known Bregman Hard Clustering and present our new tool, the Model Hard Clustering [23], which makes use of an expression of the Fisher-Rao distance for the univariate Gaussian distribution. The Fisher-Rao distance is expressed using the Poincaré hyperbolic distance and the associated centroids are computed with model centroids. Moreover, since an iterative algorithm may be too slow in time-critical applications, we introduce a one-step clustering method which consists in removing the iterative part of a traditional k-means and keeping only the first step of the computation. This method is shown experimentally to achieve the same approximation quality (in terms of log-likelihood) at the cost of a small increase in the number of components of the mixtures.
In Section 5, we describe our new software library pyMEF aimed at the manipulation of mixtures of exponential families. The goal of this library is to unify
the various tools used to build mixtures which are usually limited to one kind
of exponential family. The use of the library is further explained with a short
tutorial.
In Section 6, we study experimentally the performance of our methods
through two applications. First we give a simple example of the modeling of
the intensity histogram of an image which shows that the proposed methods
are competitive in terms of log-likelihood. Second, a real-world application in
bio-informatics is presented where the models built by the proposed methods
are compared to reference state-of-the-art models built using Dirichlet Process
Mixtures.
2 Exponential families
2.1 Definition and examples
A wide range of usual probability density functions belong to the class of exponential families: the Gaussian distribution but also the Beta, Gamma, Rayleigh distributions and many more. An exponential family is a set of probability mass or probability density functions admitting the following canonical decomposition:

p(x; θ) = exp(⟨t(x), θ⟩ − F(θ) + k(x))    (1)

with
– t(x) the sufficient statistic,
– θ the natural parameters,
– ⟨·, ·⟩ the inner product,
– F the log-normalizer,
– k(x) the carrier measure.
The log-normalizer characterizes the exponential family [5]. It is a strictly convex and differentiable function equal to:

F(θ) = log ∫_x exp(⟨t(x), θ⟩ + k(x)) dx    (2)
The next paragraphs detail the decomposition of some common distributions.
Univariate Gaussian distribution The normal distribution is an exponential family: the usual formulation of the density function

f(x; µ, σ²) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²))

matches the canonical decomposition of the exponential families with

t(x) = (x, x²),
(θ₁, θ₂) = (µ/σ², −1/(2σ²)),
F(θ₁, θ₂) = −θ₁²/(4θ₂) + ½ log(−π/θ₂),    (3)
k(x) = 0.
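To make the decomposition concrete, here is a minimal NumPy sketch (illustrative code, not the pyMEF API; the function names are ours) that rebuilds the univariate Gaussian density from Eq. (1) using the quantities of Eq. (3):

import numpy as np

def t(x):
    # Sufficient statistic t(x) = (x, x^2).
    return np.array([x, x * x])

def source_to_natural(mu, sigma2):
    # Source parameters (mu, sigma^2) -> natural parameters (theta1, theta2).
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def F(theta):
    # Log-normalizer F(theta1, theta2) = -theta1^2/(4 theta2) + (1/2) log(-pi/theta2).
    return -theta[0] ** 2 / (4.0 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

def pdf(x, theta):
    # Density rebuilt from the canonical form of Eq. (1), with k(x) = 0.
    return np.exp(np.dot(t(x), theta) - F(theta))

# Sanity check against the usual normal density at x = 0.3 for mu = 1, sigma^2 = 4.
mu, sigma2, x = 1.0, 4.0, 0.3
theta = source_to_natural(mu, sigma2)
direct = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(pdf(x, theta), direct)  # both values agree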
Multivariate Gaussian distribution The multivariate normal distribution (d is the dimension of the space of the observations)

f(x; µ, Σ) = 1/((2π)^{d/2} √det(Σ)) exp(−(x − µ)ᵀ Σ⁻¹ (x − µ)/2)    (4)

can be described using the canonical parameters as follows:

t(x) = (x, −xxᵀ),
(θ₁, θ₂) = (Σ⁻¹µ, ½Σ⁻¹),
F(θ₁, θ₂) = ¼ tr(θ₂⁻¹θ₁θ₁ᵀ) − ½ log det θ₂ + (d/2) log π,
k(x) = 0.
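The same kind of sanity check can be done in the multivariate case; the following sketch (plain NumPy, not the pyMEF API) evaluates the canonical form against the usual density:

import numpy as np

def source_to_natural(mu, Sigma):
    # (mu, Sigma) -> natural parameters (theta1, theta2) = (Sigma^-1 mu, (1/2) Sigma^-1).
    Sinv = np.linalg.inv(Sigma)
    return Sinv @ mu, 0.5 * Sinv

def F(theta1, theta2):
    # Log-normalizer of the multivariate Gaussian (see the decomposition above).
    d = theta1.shape[0]
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            - 0.5 * np.log(np.linalg.det(theta2))
            + 0.5 * d * np.log(np.pi))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
theta1, theta2 = source_to_natural(mu, Sigma)

x = np.array([0.5, 0.5])
# <t(x), theta> = <x, theta1> + <-x x^T, theta2> = x.theta1 - x^T theta2 x, and k(x) = 0.
canonical = np.exp(x @ theta1 - x @ theta2 @ x - F(theta1, theta2))
usual = np.exp(-0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)) / (
    (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))
print(canonical, usual)  # both values agree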
2.2 Dual parametrization
The natural parameter space used in the previous section admits a dual space. This dual parametrization of the exponential families comes from the properties of the log-normalizer. Since it is a strictly convex and differentiable function, it admits a dual representation through the Legendre-Fenchel transform:

F*(η) = sup_θ {⟨θ, η⟩ − F(θ)}    (5)

The maximum is attained for η = ∇F(θ). The parameters η are called expectation parameters since η = E[t(x)].
The gradients of F and of its dual F* are inverse functions of each other:

∇F = (∇F*)⁻¹    (6)

and F* itself can be computed by:

F*(η) = ∫ (∇F)⁻¹(η) dη + constant.    (7)

Notice that this integral is often difficult to compute and the convex conjugate F* of F may not be known in closed form. We can bypass the anti-derivative operation by plugging into Eq. (5) the optimal value ∇F(θ*) = η (that is, θ* = (∇F)⁻¹(η)). We get

F*(η) = ⟨(∇F)⁻¹(η), η⟩ − F((∇F)⁻¹(η))    (8)

This requires the reciprocal gradient (∇F)⁻¹ = ∇F*, but allows us to discard the constant of integration in Eq. (7).
Thus a member of an exponential family can be described equivalently with
the natural parameters or with the dual expectation parameters.
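For the univariate Gaussian, both directions of this duality are available in closed form; the sketch below (illustrative NumPy code, not the pyMEF API) shows the round trip θ → η → θ:

import numpy as np

def grad_F(theta):
    # eta = grad F(theta) = E[t(x)] = (mu, mu^2 + sigma^2).
    t1, t2 = theta
    return np.array([-t1 / (2.0 * t2), t1 ** 2 / (4.0 * t2 ** 2) - 1.0 / (2.0 * t2)])

def grad_F_dual(eta):
    # theta = grad F*(eta), the reciprocal mapping of Eq. (6).
    var = eta[1] - eta[0] ** 2           # sigma^2
    return np.array([eta[0] / var, -1.0 / (2.0 * var)])

theta = np.array([1.0 / 4.0, -1.0 / 8.0])   # mu = 1, sigma^2 = 4
eta = grad_F(theta)                          # expectation parameters (1, 5)
print(eta, grad_F_dual(eta))                 # the round trip recovers theta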
2.3 Bregman divergences
The Kullback-Leibler (KL) divergence between two members of the same exponential family can be computed in closed form using a bijection between Bregman divergences and exponential families. Bregman divergences are a family of divergences parametrized by the set of strictly convex and differentiable functions F:

B_F(p‖q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩    (9)

F is a strictly convex and differentiable function called the generator of the Bregman divergence.
The family of Bregman divergences generalizes many of the usual divergences, for example:
– the squared Euclidean distance, for F(x) = x²,
– the Kullback-Leibler (KL) divergence, with the Shannon negative entropy F(x) = Σ_{i=1}^d x_i log x_i (also called Shannon information).
Banerjee et al. [2] showed that Bregman divergences are in bijection with the exponential families through the generator F. This bijection allows one to compute the Kullback-Leibler divergence between two members of the same exponential family:

KL(p(x; θ₁), p(x; θ₂)) = ∫_x p(x; θ₁) log (p(x; θ₁) / p(x; θ₂)) dx    (10)
                       = B_F(θ₂, θ₁)    (11)

where F is the log-normalizer of the exponential family and the generator of the associated Bregman divergence.
Thus, computing the Kullback-Leibler divergence between two members of the same exponential family is equivalent to computing a Bregman divergence between their natural parameters (with swapped order).
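As a concrete illustration of Eq. (11), the following NumPy sketch (not the pyMEF API) computes the KL divergence between two univariate Gaussians through the Bregman divergence of their swapped natural parameters, and checks it against the classical closed-form expression:

import numpy as np

def F(theta):
    return -theta[0] ** 2 / (4.0 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

def grad_F(theta):
    return np.array([-theta[0] / (2.0 * theta[1]),
                     theta[0] ** 2 / (4.0 * theta[1] ** 2) - 1.0 / (2.0 * theta[1])])

def bregman(p, q):
    # B_F(p || q) = F(p) - F(q) - <p - q, grad F(q)>, Eq. (9).
    return F(p) - F(q) - np.dot(p - q, grad_F(q))

def natural(mu, sigma2):
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

theta_a, theta_b = natural(0.0, 1.0), natural(1.0, 2.0)
kl = bregman(theta_b, theta_a)   # KL(p_a : p_b) = B_F(theta_b, theta_a), Eq. (11)
# Classical formula for Gaussians: log(s_b/s_a) + (s_a^2 + (mu_a - mu_b)^2)/(2 s_b^2) - 1/2.
check = 0.5 * np.log(2.0) + (1.0 + 1.0) / (2.0 * 2.0) - 0.5
print(kl, check)  # both values agree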
2.4 Bregman centroids
Except for the squared Euclidean distance and the squared Mahalanobis distance, Bregman divergences are not symmetric. This leads to two sided definitions of Bregman centroids:
– the left-sided one

c_L = arg min_x Σ_i ω_i B_F(x, p_i)    (12)

– and the right-sided one

c_R = arg min_x Σ_i ω_i B_F(p_i, x)    (13)

These two centroids are centroids by optimization, that is, the unique solution of an optimization problem. Using this principle and various symmetrizations of the KL divergence, we can design symmetrized Bregman centroids:
– Jeffreys-Bregman divergences:

S_F(p, q) = (B_F(p, q) + B_F(q, p)) / 2    (14)

– Jensen-Bregman divergences [18]:

J_F(p, q) = (B_F(p, (p+q)/2) + B_F(q, (p+q)/2)) / 2    (15)

– Skew Jensen-Bregman divergences [18]:

J_F^{(α)}(p, q) = α B_F(p, αp + (1 − α)q) + (1 − α) B_F(q, αp + (1 − α)q)    (16)
Closed-form formulas are known for the left- and right-sided centroids [2]:

c_R = arg min_x Σ_i ω_i B_F(p_i, x)    (17)
    = Σ_{i=1}^n ω_i p_i    (18)

c_L = arg min_x Σ_i ω_i B_F(x, p_i)    (19)
    = ∇F*( Σ_i ω_i ∇F(p_i) )    (20)
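The following NumPy sketch (illustrative, not the pyMEF API) evaluates both closed forms on univariate Gaussian natural parameters:

import numpy as np

def grad_F(theta):
    t1, t2 = theta
    return np.array([-t1 / (2 * t2), t1 ** 2 / (4 * t2 ** 2) - 1 / (2 * t2)])

def grad_F_dual(eta):
    var = eta[1] - eta[0] ** 2
    return np.array([eta[0] / var, -1 / (2 * var)])

def natural(mu, sigma2):
    return np.array([mu / sigma2, -1 / (2 * sigma2)])

thetas = np.array([natural(0.0, 1.0), natural(3.0, 0.5), natural(5.0, 2.0)])
weights = np.array([0.5, 0.3, 0.2])

# Right-sided centroid, Eq. (18): weighted average of the natural parameters.
c_right = weights @ thetas

# Left-sided centroid, Eq. (20): average of the gradients, mapped back through grad F*.
c_left = grad_F_dual(weights @ np.array([grad_F(t) for t in thetas]))

print(c_right, c_left)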
3 Mixture Models
3.1 Statistical mixtures
Mixture models are a widespread tool for modeling complex data in a lot of various domains, from image processing to medical data analysis through speech recognition. This success is due to the capacity of these models to estimate the probability density function (pdf) of complex random variables. For a mixture f of n components, the probability density function takes the form:

f(x) = Σ_{i=1}^n ω_i g(x; θ_i)    (21)

where ω_i denotes the weight of component i (Σ_i ω_i = 1) and θ_i are the parameters of the exponential family g.
Gaussian mixture models (GMM) are a universal special case used in the large majority of mixture model applications:

f(x) = Σ_{i=1}^n ω_i g(x; µ_i, σ_i²)    (22)

Each component g(x; µ_i, σ_i²) is a normal distribution, either univariate or multivariate.
Even if GMMs are the most used mixture models, mixtures of exponential families like Gamma, Beta or Rayleigh distributions are common in some fields [14,12].
3.2 Getting mixtures
We present here some well-known algorithms to build mixtures. For more details,
please have a look at the references cited in the next paragraphs.
Expectation-Maximization The most common tool for the estimation of the
parameters of a mixture model is the Expectation-Maximization (EM) algorithm
[8]. It maximizes the likelihood of the density estimation by iteratively computing
the expectation of the log-likelihood using the current estimate of the parameters
(E step) and by updating the parameters in order to maximize the log-likelihood
(M step).
Even if originally considered for Mixtures of Gaussians (MoGs), Expectation-Maximization has been extended by Banerjee et al. [2] to learn mixtures of arbitrary exponential families.
The pitfall is that this method leads only to a local maximum of the log-likelihood. Moreover, the number of components is difficult to choose.
Dirichlet Process Mixtures To avoid the problem of the choice of the number of components, it has been proposed to use a mixture model with an infinite number of components. This can be done with a Dirichlet process mixture (DPM) [20], which uses a Dirichlet process to build priors for the mixing proportions of the components. If one needs a finite mixture, it is easy to sort the components according to their weights ω_i and to keep only the components above some threshold. The main drawback is that building the model requires evaluating a Dirichlet process with a Markov Chain Monte-Carlo method (for example the Metropolis algorithm), which is computationally costly.
Kernel Density Estimation The kernel density estimator (KDE) [19] (also known as the Parzen windows method) avoids the problem of the choice of the number of components by using one component (a Gaussian kernel) centered on each point of the dataset. All the components share the same weight and, since the µ_i parameters come directly from the data points, the only remaining parameters are the σ_i, which are chosen equal to a constant called the bandwidth. The critical part of the algorithm is the choice of the bandwidth: a lot of studies have been made to automatically tune this parameter (see [25] for a comprehensive survey), but it can also be chosen by hand depending on the dataset. Since there is one Gaussian component per point in the dataset, a mixture built with a kernel density estimator is difficult to manipulate: the size is large and common operations are slow (evaluation of the density, random sampling, etc.) since it is necessary to loop over all the components of the mixture.
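As a minimal illustration of the construction described above (plain NumPy, not the pyMEF KDE helper; the bandwidth h and the function names are ours), a KDE is simply a Gaussian mixture with one equally-weighted component per data point:

import numpy as np

def kde_components(data, h):
    # One Gaussian per data point, equal weights 1/n, common bandwidth h.
    n = len(data)
    return np.full(n, 1.0 / n), np.asarray(data, dtype=float), np.full(n, h)

def kde_pdf(x, weights, means, sigmas):
    # Evaluating the density loops over every component: the cost is linear in n.
    x = np.atleast_1d(x)[:, None]
    comps = np.exp(-(x - means) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comps @ weights

w, mu, s = kde_components([0.1, 0.4, 2.3, 2.5], h=0.3)
print(kde_pdf([0.0, 2.4], w, mu, s))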
Pros and cons The main drawbacks of the EM algorithm are the risk of converging to a local optimum and the number of iterations needed to find this optimum. While it may be costly, this time is only spent during the learning step. On the other hand, learning a KDE is nearly free but evaluating the associated pdf is costly since we need to loop over each component of the mixture. Given the typical size of a dataset (a 120 × 120 image leads to 14400 components), the mixture can be unsuitable for time-critical applications. Dirichlet process mixtures usually give high-precision models which are very useful in some applications [3], but at a computational cost which is not affordable in most applications.
Since mixtures with a low number of components have proved their capacity
to model complex data (Figure 1), it would be useful to build such a mixture
avoiding the costly learning step of EM or DPM.
4 Simplification of kernel density estimators
4.1 Bregman Hard Clustering
The Bregman Hard Clustering algorithm is an extension of the celebrated k-means clustering algorithm to the class of Bregman divergences [2]. It has been proposed in Garcia et al. [10] to use this method for the simplification of mixtures of exponential families. Similarly to the Lloyd k-means algorithm, the goal is to minimize the following cost function, for the simplification of an n-component mixture into a k-component mixture (with k < n):

L = min_{θ'₁,…,θ'_k} Σ_i min_{1≤j≤k} B_F(θ'_j, θ_i)    (23)

where F is the log-normalizer of the considered exponential family, the θ_i are the natural parameters of the source mixture and the θ'_j are the natural parameters of the target mixture.
With the bijection between exponential families and Bregman divergences, the cost function L can be written in terms of the Kullback-Leibler divergence:

L = min_{c₁,…,c_k} Σ_i min_{1≤j≤k} KL(x_i, c_j)    (24)

where the x_i are the components of the original mixture and the c_j are the components of the target mixture. With this reformulation, the Bregman Hard Clustering is shown to be a k-means with the Kullback-Leibler divergence (instead of the usual L2-based distance). As in the L2 version, the k-means involves two steps: assignment and centroid update. The centroids of the clusters are here computed using the closed-form formulas presented in Section 2.4.
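A minimal sketch of this simplification loop is given below for univariate Gaussian components, in plain NumPy (illustrative code, not the pyMEF BregmanHardClustering class): components are assigned to the centroid minimizing the right-sided KL divergence, and each centroid is updated as the right-sided KL centroid of its cluster, i.e. the weighted average of the expectation parameters mapped back through ∇F* (Section 2.4):

import numpy as np

def F(th):
    return -th[0] ** 2 / (4 * th[1]) + 0.5 * np.log(-np.pi / th[1])

def grad_F(th):
    return np.array([-th[0] / (2 * th[1]), th[0] ** 2 / (4 * th[1] ** 2) - 1 / (2 * th[1])])

def grad_F_dual(eta):
    var = eta[1] - eta[0] ** 2
    return np.array([eta[0] / var, -1 / (2 * var)])

def kl(theta_p, theta_q):
    # KL(p : q) = B_F(theta_q : theta_p), Eq. (11).
    return F(theta_q) - F(theta_p) - np.dot(theta_q - theta_p, grad_F(theta_p))

def simplify(thetas, weights, k, iters=20, seed=0):
    # thetas: (n, 2) array of natural parameters; weights: (n,) mixture weights.
    rng = np.random.default_rng(seed)
    centroids = thetas[rng.choice(len(thetas), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each component goes to its closest centroid in KL.
        labels = np.array([np.argmin([kl(t, c) for c in centroids]) for t in thetas])
        # Update step: right-sided KL centroid of each cluster (moment average).
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask] / weights[mask].sum()
                centroids[j] = grad_F_dual(w @ np.array([grad_F(t) for t in thetas[mask]]))
    return centroids, labels

# The one-step variant mentioned in the introduction amounts to calling
# simplify(thetas, weights, k, iters=1).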
Though left-sided, right-sided and symmetrized formulations of this optimization problem can be used, it has been shown experimentally in [10] that the right-sided Bregman Hard Clustering performs better in terms of Kullback-Leibler error. This experimental result is explained theoretically by a theorem stating that the right-sided centroid is the best single-component approximation of a mixture model in terms of Kullback-Leibler divergence. Introduced by Pelletier [1], a complete and more precise proof of this result is given in the following section.
4.2 Kullback-Leibler centroids as geometric projections
Pelletier proved ([1], Theorem 4.1) that the right-sided KL barycenter p̄* can be interpreted as the information-theoretic projection of the mixture model distribution p̃ ∈ P onto the model exponential family sub-manifold E_F:

p̄* = arg min_{p ∈ E_F} KL(p̃ : p)    (25)

Since a mixture of exponential families is not an exponential family (p̃ ∉ E_F),¹ it yields a neat interpretation: the best KL approximation of a mixture of components of the same exponential family is the exponential family member defined by the right-sided KL barycenter of the mixture parameters.
Let θ_i^j for j ∈ {1, ..., d} be the d coordinates in the primal coordinate system of parameter θ_i.
Let us write for short θ = θ(p) and θ̄* = θ(p̄*) the natural coordinates of p and p̄*, respectively. Similarly, denote by η = η(p), η̄ = η(p̄), and η̄* = η(p̄*) the dual moment coordinates of p, p̄ and p̄*, respectively.
We have

KL(p̃ : p) = ∫ p̃(x) log (p̃(x) / p(x)) dx    (26)
          = E_p̃[log p̃] − E_p̃[log p]    (27)
          = E_p̃[log p̃] − E_p̃[⟨θ, t(x)⟩ − F(θ) + k(x)]    (28)
          = E_p̃[log p̃] + F(θ) − ⟨θ, E_p̃[t(x)]⟩ − E_p̃[k(x)]    (29)

since E_p̃[F(θ)] = F(θ) ∫ p̃(x) dx = F(θ).
Using the fact that E_p̃[t(x)] = E_{Σ_{i=1}^n w_i p_F(x;θ_i)}[t(x)] = Σ_{i=1}^n w_i E_{p_F(x;θ_i)}[t(x)] = Σ_{i=1}^n w_i η_i = η̄*, it follows that

¹ The product of exponential families is an exponential family.
Fig. 1. Top to bottom, left to right: original image, original histogram, raw KDE (14400 components) and simplified mixture (8 components). Even with very few components compared to the mixture produced by the KDE, the simplified mixture still reproduces very well the shape of the histogram.
KL(p̃ : p) = E_p̃[log p̃] + F(θ) − E_p̃[k(x)] − ⟨θ, Σ_{i=1}^n w_i η_i⟩    (30)
          = E_p̃[log p̃] + F(θ) − E_p̃[k(x)] − ⟨θ, η̄*⟩.    (31)

Let us now add for mathematical convenience the neutralized sum F(θ̄*) + ⟨θ̄*, η̄*⟩ − F(θ̄*) − ⟨θ̄*, η̄*⟩ = 0 to the former equation.
Since

KL(p̄* : p) = B_F(θ : θ̄*) = F(θ) − F(θ̄*) − ⟨θ − θ̄*, η̄*⟩,    (32)

and

KL(p̃ : p̄*) = E_p̃[log p̃] − E_p̃[k(x)] + F(θ̄*) − ⟨θ̄*, η̄*⟩,    (33)

we end up with the following Pythagorean sum:

KL(p̃ : p) = E_p̃[log p̃] + F(θ) − E_p̃[k(x)] − ⟨η̄*, θ⟩
            + F(θ̄*) + ⟨θ̄*, η̄*⟩ − F(θ̄*) − ⟨θ̄*, η̄*⟩    (34, 35)

KL(p̃ : p) = KL(p̄* : p) + KL(p̃ : p̄*)    (36)
This expression is therefore minimized for KL(p̄∗ : p) = 0 (since KL(p̄∗ : p) ≥
0), that is for p = p̄∗ . The closest distribution of EF to p̃ ∈ P is given by the
dual barycenter. In other words, distribution p̄∗ is the right-sided KL projection
of the mixture model onto the model sub-manifold. Geometrically speaking, it
is the projection of p̃ via the mixture connection: the m-connection. Figure 2
illustrates the projection operation.
Fig. 2. Projection operation from the mixture manifold P (manifold of probability distributions) to the model exponential family sub-manifold E_F, along the m-geodesic joining p̃ and p̄*.
This theoretically explains why the right-sided KL centroid (i.e., the left-sided Bregman centroid) is preferred for simplifying mixtures [16] emanating from a kernel density estimator.
Fig. 3. Right-sided (dashed line) and left-sided (dotted line) Kullback-Leibler centroids of a 2-component Gaussian mixture model. The left-sided centroid focuses on the highest mode of the mixture while the right-sided one tries to cover the supports of all the components. Pelletier's result says the right-sided centroid is the closest Gaussian to the mixture.
4.3 Model Hard Clustering
The statistical manifold of the parameters of exponential families can be studied through the framework of Riemannian geometry. It has been proved by Čencov [6] that the Fisher-Rao metric is the only meaningful Riemannian metric on the statistical manifold:

I(θ) = [g_ij] = E[ (∂ log p / ∂θ_i) (∂ log p / ∂θ_j) ]    (37)
The Fisher-Rao distance (FRD) between two distributions is computed using the length of the geodesic path between the two points on the statistical manifold:

FRD(p(x; θ₁), p(x; θ₂)) = min_{θ(t)} ∫₀¹ √( (dθ/dt)ᵀ I(θ) (dθ/dt) ) dt    (38)

with θ such that θ(0) = θ₁ and θ(1) = θ₂.
This integral is not known in the general case and is usually difficult to compute (see [21] for a numerical approximation in the case of the Gamma distribution).
However, it is known that in the case of a normal distribution the Fisher-Rao metric yields a hyperbolic geometry [13,7].
However, it is known in the case of a normal distribution that the Fisher-Rao
metric yields an hyperbolic geometry [13,7].
For univariate Gaussian, a closed-form formula of the Fisher-Rao distance
can be expressed, using the Poincaré hyperbolic distance in the Poincaré upper
half-plane:
FRD(f (x; µp , σp2 ), f (x; µq , σq2 ))
=
√
2 ln
|( √p2 , σp ) − ( √q2 , σq )| + |( √p2 , σp ) − ( √p2 , σp )| (39)
µ
µ
µ
µ
µ
µ
µ
µ
|( √p2 , σp ) − ( √p2 , σp )| − |( √p2 , σp ) − ( √p2 , σp )|
where | · | denotes the L2 Euclidean distance.
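A direct transcription of Eq. (39) is straightforward; the sketch below (illustrative code, not the pyMEF API) computes the distance through the Poincaré half-plane points (µ/√2, σ), using complex numbers for the Euclidean norms:

import numpy as np

def fisher_rao(mu_p, sigma_p, mu_q, sigma_q):
    zp = complex(mu_p / np.sqrt(2.0), sigma_p)   # Gaussian -> half-plane point
    zq = complex(mu_q / np.sqrt(2.0), sigma_q)
    num = abs(zp - zq.conjugate()) + abs(zp - zq)
    den = abs(zp - zq.conjugate()) - abs(zp - zq)
    return np.sqrt(2.0) * np.log(num / den)

print(fisher_rao(0.0, 1.0, 0.0, 2.0))   # sqrt(2) * log(2) for a pure scale change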
In order to perform the k-means iterations using the Fisher-Rao distance, we need to define centroids in the hyperbolic space. Model centroids, introduced by Galperin [9] and successfully used in [22] for hyperbolic centroidal Voronoi tessellations, are a way to define centroids in the three kinds of constant-curvature spaces (namely Euclidean, hyperbolic and spherical). For a d-dimensional curved space, it starts with finding a (d + 1)-dimensional model in the Euclidean space. For a 2D hyperbolic space, it will be the Minkowski model, that is the upper sheet of the hyperboloid −x² − y² + z² = 1.
Fig. 4. Computation of the centroid c given the system (ω₁, p₁), (ω₂, p₂) in the Klein disk, via the Minkowski model.
First, each point p (with coordinates (x_p, y_p)) lying on the Klein disk is embedded in the Minkowski model:

x_{p'} = x_p / √(1 − x_p² − y_p²),   y_{p'} = y_p / √(1 − x_p² − y_p²),   z_{p'} = 1 / √(1 − x_p² − y_p²)    (40)

Next the center of mass of the points is computed:

c'' = Σ_i ω_i p'_i    (41)

This point needs to be normalized to lie on the Minkowski model, so we look for the intersection between the vector Oc'' and the hyperboloid:

c' = c'' / √(−x_{c''}² − y_{c''}² + z_{c''}²)    (42)

From this point in the Minkowski model, we can use the reverse transform in order to get a point in the original Klein disk [17]:

x_c = x_{c'} / z_{c'},   y_c = y_{c'} / z_{c'}    (43)
Although this scheme gives the centroid of points located on the Klein disk, it is not sufficient since the parameters of the Gaussian distribution live in the Poincaré upper half-plane [7]. Thus we need to convert points from one model to another, using the Poincaré disk as an intermediate step. For a point (a, b) on the half-plane, let z = a + ib; the mapping with the Poincaré disk is:

z' = (z − i) / (z + i),   z = i(z' + 1) / (1 − z')    (44)

And for a point p on the Poincaré disk, the mapping with a point k on the Klein disk is:

p = k / (1 + √(1 − ⟨k, k⟩)),   k = 2p / (1 + ⟨p, p⟩)    (45)
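Putting Eqs. (40)-(45) together, a model centroid of univariate Gaussians can be sketched as follows (plain NumPy, not the pyMEF API; the function names are ours): each (µ, σ) is mapped from the half-plane to the Klein disk, lifted to the Minkowski model, averaged, normalized back onto the hyperboloid, and mapped back:

import numpy as np

def halfplane_to_klein(mu, sigma):
    z = complex(mu / np.sqrt(2.0), sigma)        # half-plane point, as in Eq. (39)
    zp = (z - 1j) / (z + 1j)                     # half-plane -> Poincaré disk, Eq. (44)
    p = np.array([zp.real, zp.imag])
    return 2.0 * p / (1.0 + p @ p)               # Poincaré disk -> Klein disk, Eq. (45)

def klein_to_halfplane(k):
    p = k / (1.0 + np.sqrt(1.0 - k @ k))         # Klein -> Poincaré disk, Eq. (45)
    z = 1j * (complex(p[0], p[1]) + 1.0) / (1.0 - complex(p[0], p[1]))  # Eq. (44)
    return z.real * np.sqrt(2.0), z.imag         # back to (mu, sigma)

def model_centroid(gaussians, weights):
    # gaussians: list of (mu, sigma) pairs; weights: nonnegative weights.
    ks = np.array([halfplane_to_klein(mu, s) for mu, s in gaussians])
    lift = np.array([np.append(k, 1.0) / np.sqrt(1.0 - k @ k) for k in ks])  # Eq. (40)
    c2 = weights @ lift                                                      # Eq. (41)
    c1 = c2 / np.sqrt(c2[2] ** 2 - c2[0] ** 2 - c2[1] ** 2)                  # Eq. (42)
    return klein_to_halfplane(c1[:2] / c1[2])                                # Eq. (43)

print(model_centroid([(0.0, 1.0), (2.0, 1.0)], np.array([0.5, 0.5])))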
5 Software library
5.1 Presentation
Several tools are already available to build mixture models, either for mixtures of Gaussian distributions or for mixtures of other distributions. But these tools are usually dedicated to a particular family of distributions.
In order to provide a unified and powerful framework for the manipulation of arbitrary mixture models, we develop pyMEF, a Python library dedicated to mixtures of exponential families.
Given the success of Gaussian mixture models, numerous other software packages are already available to deal with them:
– some R packages: MCLUST (http://www.stat.washington.edu/mclust/) and MIX (http://icarus.math.mcmaster.ca/peter/mix/),
– MIXMOD [4], which also works on multinomial distributions and provides bindings for Matlab and Scilab,
– PyMix [11], another Python library which goes beyond simple mixtures with context-specific independence mixtures and dependence trees,
– scikits.learn, a Python module for machine learning (http://scikitlearn.sf.net),
– jMEF [16,10], the only other library dealing with mixtures of exponential families, written in Java.
Although exponential families other than normal distributions have been successfully used in the literature (see [12] as an example for the Beta distribution), it was done using an implementation specific to the underlying distribution. The improvement of libraries such as jMEF and pyMEF is to introduce genericity: changing the exponential family means simply changing a parameter of the Bregman Soft Clustering (equivalent to performing an EM task), and not completely rewriting the algorithm.
Moreover, the choice of a good distribution is a difficult problem in itself, and is often inspected experimentally, by looking at the shape of the histogram or by comparing a performance score (the log-likelihood or any meaningful score in the considered application) computed with mixtures of various distributions. It is worthwhile to use a unified framework instead of different libraries from various sources with various interfaces.
The goal of the pyMEF library is to provide a consistent framework with various algorithms to build mixtures (Bregman Soft Clustering) and various information-theoretic simplification methods (Bregman Hard Clustering, Burbea-Rao Hard Clustering [15], Fisher Hard Clustering), along with some widespread exponential families:
– univariate Gaussian,
– multivariate Gaussian,
– generalized Gaussian,
– multinomial,
– Rayleigh,
– Laplacian.
Another goal of pyMEF is to be easily extensible and more distributions are planned, such as:
– Dirichlet,
– Gamma,
– von Mises-Fisher.
5.2 Extending pyMEF
The set of available exponential families can be easily extended by users. Following the principle of Flash Cards introduced in [16] for jMEF, it is sufficient to implement in a Python class the functions describing the distribution (a sketch is given below):
– the core of the family (the log-normalizer F and its gradient ∇F, the carrier measure k and the sufficient statistic t),
– the dual characterization with the Legendre dual of F (F* and ∇F*),
– the conversions between the three parameter spaces (source to natural, natural to expectation, expectation to source, and their reciprocals).
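As a hedged sketch of what such a Flash Card may look like, here is the univariate Gaussian written as a plain Python class; the method names below are illustrative assumptions of ours, not the actual pyMEF interface:

import numpy as np

class UnivariateGaussianCard:
    # Core of the family.
    @staticmethod
    def t(x):                      # sufficient statistic
        return np.array([x, x * x])

    @staticmethod
    def k(x):                      # carrier measure
        return 0.0

    @staticmethod
    def F(theta):                  # log-normalizer
        return -theta[0] ** 2 / (4 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

    @staticmethod
    def grad_F(theta):
        return np.array([-theta[0] / (2 * theta[1]),
                         theta[0] ** 2 / (4 * theta[1] ** 2) - 1 / (2 * theta[1])])

    # Dual characterization.
    @staticmethod
    def F_dual(eta):
        var = eta[1] - eta[0] ** 2
        return -0.5 * np.log(var) - 0.5 * np.log(2 * np.pi * np.e)

    @staticmethod
    def grad_F_dual(eta):
        var = eta[1] - eta[0] ** 2
        return np.array([eta[0] / var, -1 / (2 * var)])

    # Conversions between the three parameter spaces.
    @staticmethod
    def source_to_natural(l):      # (mu, sigma^2) -> theta
        return np.array([l[0] / l[1], -1 / (2 * l[1])])

    @staticmethod
    def natural_to_expectation(theta):
        return UnivariateGaussianCard.grad_F(theta)

    @staticmethod
    def expectation_to_source(eta):
        return np.array([eta[0], eta[1] - eta[0] ** 2])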
5.3 An example with a Gaussian Mixture Model
We present here a basic example of a pyMEF session. The following can be used interactively in the Python top-level or be part of a larger program. This allows both a rapid exploration of a dataset and the development of a real application with the same tools.
We begin by loading the required modules:
import numpy
from matplotlib import pyplot
from pyMEF.Build import BregmanSoftClustering, KDE
from pyMEF.Simplify import BregmanHardClustering
from pyMEF.Families import UnivariateGaussian
An example dataset (6550 samples) is loaded using standard numpy functions:

data = numpy.loadtxt("data.txt")
data = data.reshape(data.shape[0], 1)
An 8-component mixture model is built on this dataset using the Bregman Soft Clustering algorithm (also known as EM in the Gaussian case):

em = BregmanSoftClustering(data, 8, UnivariateGaussian, ())
mm_em = em.run()
Another mixture is built using Kernel Density Estimation (leading to a 6550-component mixture):

mm_kde = KDE(data, UnivariateGaussian, ())
This very large model is then simplified into an 8-component mixture with the Bregman Hard Clustering algorithm:

kmeans = BregmanHardClustering(mm_kde, 8)
mm_s = kmeans.run()
We finally compute the log-likelihood of the models (original and simplified):

print "EM:", mm_em.logLikelihood(data)
print "KDE:", mm_kde.logLikelihood(data)
print "Simplified KDE:", mm_s.logLikelihood(data)
For illustration purposes (see Figure 5), we plot the histogram of the original data and the three computed models (pyMEF does not provide any display functions; we rely instead on the powerful matplotlib² library).
pyplot.subplot(2, 2, 1)
pyplot.hist(data, 1000)
pyplot.xlim(0, 20)

x = numpy.arange(0, 20, 0.1)
pyplot.subplot(2, 2, 2)
pyplot.plot(x, mm_em(x))
pyplot.subplot(2, 2, 3)
pyplot.plot(x, mm_kde(x))
pyplot.subplot(2, 2, 4)
pyplot.plot(x, mm_s(x))

pyplot.show()
A real application would obviously use multiple runs of the soft and hard clustering algorithms to avoid being trapped in a bad local optimum, which can be reached by the two local optimization methods.
In this example, the Bregman Soft Clustering gives the best result in terms of log-likelihood (Table 1), but the model is visually not really satisfying (there are a lot of local maxima near the first mode of the histogram, instead of just one mode). The models relying on Kernel Density Estimation give a slightly worse log-likelihood but are visually more convincing. The important point is the quality of the simplified model: while having far fewer components (8 instead of 6550), the simplified model is nearly identical to the original KDE (both visually and in terms of log-likelihood).
Model            Log-likelihood
EM               -18486.7957123
KDE              -18985.4483699
Simplified KDE   -19015.0604457

Table 1. Log-likelihood of the three computed models. EM still gives the best value and the simplified KDE has nearly the same log-likelihood as the original KDE.
² http://matplotlib.sourceforge.net/
Fig. 5. Output from the pyMEF demo. Top-left, the histogram of the data; top-right, the model computed by EM; bottom-left, the KDE; bottom-right, the simplified KDE. The visual appearance is quite bad for EM while it is very good for both the KDE and the simplified KDE, even with far fewer components in the simplified version.
5.4 Examples with other exponential families
Although the Gaussian case is the most widespread and the most universal one, a lot of other exponential families are useful in particular applications. We present here two examples implemented in pyMEF using the formulas detailed in [16].

Rayleigh distribution Rayleigh mixture models are used in the field of Intravascular UltraSound imaging [24] for segmentation and classification tasks. We present in Figure 6 an example of the learning of a Rayleigh mixture model on a synthetic dataset built from a 5-component mixture of Rayleigh distributions. The graphics shown in this figure have been generated with the following script (for the sake of brevity, we omit here the loops used to select the best model among several trials). Notice how similar this code is to the previous example, showing the genericity of our library: using different exponential families for the mixtures is just a matter of changing one parameter in the program.
import sys, numpy

from pyMEF import MixtureModel
from pyMEF.Build import BregmanSoftClustering
from pyMEF.Simplify import BregmanHardClustering
from pyMEF.Families import Rayleigh

# Original mixture
k = 5
mm = MixtureModel(5, Rayleigh, ())
mm[0].source((1.,))
mm[1].source((10.,))
mm[2].source((3.,))
mm[3].source((5.,))
mm[4].source((7.,))

# Data sample
data = mm.rand(10000)

# Bregman Soft Clustering k=5
em5 = BregmanSoftClustering(data, 5, Rayleigh, ())
em5.run()
mm_em5 = em5.mixture()

# Bregman Soft Clustering k=32 + Simplification
em32 = BregmanSoftClustering(data, 32, Rayleigh, ())
em32.run()
mm_em32 = em32.mixture()

kmeans5 = BregmanHardClustering(mm_em32, 5)
kmeans5.run()
mm_simplified = kmeans5.mixture()
Laplace distribution Although Laplace distributions are exponential families only when their mean is zero, zero-mean Laplacian mixture models are used in various applications. Figure 7 presents the same experiment as Figure 6 and has been generated with exactly the same script, simply by replacing all occurrences of the word Rayleigh by the word CenteredLaplace.
Fig. 6. Rayleigh mixture models. The top-left figure is the true mixture (synthetic data) and the top-right one is the histogram of 10000 samples drawn from the true mixture. The bottom-left figure is a mixture built with the Bregman Soft Clustering algorithm (with 5 components) and the bottom-right one is a mixture built by first getting a 32-component mixture with Bregman Soft Clustering and then simplifying it to a 5-component mixture with the Bregman Hard Clustering algorithm.
Fig. 7. Laplace mixture models. The top-left figure is the true mixture (synthetic data) and the top-right one is the histogram of 10000 samples drawn from the true mixture. The bottom-left figure is a mixture built with the Bregman Soft Clustering algorithm (with 5 components) and the bottom-right one is a mixture built by first getting a 32-component mixture with Bregman Soft Clustering and then simplifying it to a 5-component mixture with the Bregman Hard Clustering algorithm.
6 Applications
6.1 Experiments on images
We study here the quality, in terms of log-likelihood, and the computation time of the proposed methods compared to a baseline Expectation-Maximization algorithm. The source distribution is the intensity histogram of the famous Lena image (see Figure 1). As explained in Section 4.1, for the Kullback-Leibler divergence we report only results for right-sided centroids since this flavor performs better (as indicated by the theory) than the two others and has the same computational cost. The third and fourth methods are the Model centroid, both with a full k-means and with only one iteration.
The top part of Figure 8 shows the evolution of the log-likelihood as a function of the number of components k. First, we see that all the algorithms perform nearly the same and converge very quickly to a maximum value (the KL curve is merged with the EM one).
The Kullback-Leibler divergence and the Fisher-Rao metric perform similarly but they are rather different from a theoretical standpoint: KL assumes an underlying flat geometry while Fisher-Rao is related to the curved hyperbolic geometry of Gaussian distributions. However, at infinitesimal scale (or on dense compact clusters) they behave the same.
The bottom part of Figure 8 describes the running time (in seconds) as a function of k. Despite the fact that the quality of the mixtures is nearly identical, the costs are very different. The Kullback-Leibler divergence is very slow (even in closed form, the formulas are quite complex to evaluate). While achieving the same log-likelihood, the Model centroid is the fastest method, significantly faster than EM.
While being slower to converge when k increases, the one-step model clustering still performs well and is roughly two times faster than a complete k-means. The initialization is random: we do not use k-means++ here since its cost during initialization cancels the benefit of performing only one step.
6.2 Prediction of 3D structures of RNA molecules
RNA molecules play an important role in many biological processes. The understanding of the functions of these molecules depends on the study of their 3D structure. A common approach is to use knowledge-based potentials built from inter-atomic distances coming from experimentally determined structures. Recent works use mixture models [3] to model the distribution of the inter-atomic distances.
In the original work presented in [3], the authors use Dirichlet Process Mixtures to build the mixture models. This gives high-quality mixtures, both in terms of log-likelihood and in the context of the application, but with a high computational cost which is not affordable for building thousands of mixtures. We study here the effectiveness of our proposed simplification methods compared to reference high-quality mixtures built with Dirichlet Process Mixtures.
Fig. 8. Log-likelihood of the simplified models and computation time (top: log-likelihood vs. number of components; bottom: running time in seconds vs. number of components; curves: Expectation-Maximization, Bregman Hard Clustering, Model Hard Clustering, one-step Model Hard Clustering). All the algorithms reach the same log-likelihood maximum with quite few components (but the one-step model centroid needs a few more components than all the others). Model centroid based clusterings are the fastest methods; Kullback-Leibler clustering is even slower than EM due to the computational cost of the KL divergence and centroids.
We evaluate the quality of our simplified models in an absolute way, with the log-likelihood, and in a relative way, with the Kullback-Leibler divergence between a mixture built with a Dirichlet Process Mixture and a simplified mixture.
A more detailed study of this topic is presented in [26].
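The Kullback-Leibler divergence between two mixtures has no closed form; the estimator sketched below is an assumption of ours for illustration (the standard Monte Carlo estimate), as the choice of estimator is not stated here:

import numpy as np

def mc_kl(sample_f, log_pdf_f, log_pdf_g, n=100000, seed=0):
    # Monte Carlo estimate of KL(f : g) = E_f[log f(X) - log g(X)].
    x = sample_f(n, np.random.default_rng(seed))
    return np.mean(log_pdf_f(x) - log_pdf_g(x))

# Tiny check on two Gaussians where the KL divergence is known in closed form (0.5):
log_f = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
log_g = lambda x: -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)
print(mc_kl(lambda n, rng: rng.normal(0.0, 1.0, n), log_f, log_g))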
Method                                  Log-likelihood
DPM                                     -18420.6999452
KDE                                     -18985.4483699
KDE + Bregman Hard Clustering           -18998.3203038
KDE + Model Hard Clustering             -18974.0717664
KDE + One-step Model Hard Clustering    -19322.2443988

Table 2. Log-likelihood of the models built by the state-of-the-art Dirichlet Process Mixture, by Kernel Density Estimation, and by our new simplification methods. DPM is better but the proposed simplification methods perform as well as the KDE.
KL    DPM    KDE    BHC    MHC    One-step MHC
DPM   0.0    0.051  0.060  0.043  0.066
KDE   0.090  0.0    0.018  0.002  0.016

Table 3. Kullback-Leibler divergence matrix for models built by Dirichlet Process Mixture (DPM), by Kernel Density Estimation (KDE), by the Bregman Hard Clustering (BHC), by the Model Hard Clustering (MHC) and by the one-step Model Hard Clustering. We limit the rows of the table to DPM and KDE since, by the nature of the Kullback-Leibler divergence, the left term of the divergence is supposed to be the "true" distribution and the right term the estimated distribution (the left term comes from the rows and the right term from the columns).
Both DPM and KDE produce high-quality models (see Table 2): the first with a high computational cost, the second with a high number of components. Moreover, these two models are very close in Kullback-Leibler divergence: this means that one may choose between the two algorithms depending on the most critical point, time or size, in their application.
Simplified models get nearly identical log-likelihood values. Only the one-step Model Hard Clustering leads to a significant loss in likelihood.
Simplified models using Bregman and Model Hard Clustering are both close to the reference DPM model and to the original KDE (Table 3). Moreover, the Model Hard Clustering outperforms the Bregman Hard Clustering in the two cases. As expected, the one-step Model Hard Clustering is the furthest: it will depend on the application whether the decrease in computation time is worth the loss in quality.
7 Conclusion
We presented a novel modeling paradigm which is both fast and accurate. From Kernel Density Estimates, which are precise but difficult to use due to their size, we are able to build new models which achieve the same approximation quality while being faster to compute and compact. We introduce a new mixture simplification method, the Model Hard Clustering, which relies on the Fisher-Rao metric to perform the simplification. Since closed-form formulas are not known in the general case, we exploit the underlying hyperbolic geometry, allowing us to use the Poincaré hyperbolic distance and the model centroids, which are a notion of centroids in constant-curvature spaces.
Models simplified by the Bregman Hard Clustering and by the Model Hard Clustering both have a quality comparable to models built by Expectation-Maximization or by Kernel Density Estimation. But the Model Hard Clustering does not only give very high-quality models, it is also faster than the usual Expectation-Maximization. The quality of the models simplified by the Model Hard Clustering justifies the use of the model centroids as a substitute for the Fisher-Rao centroids.
Both Model and Bregman Hard Clustering are also competitive with state-of-the-art approaches in a bio-informatics application for the modeling of the 3D structure of RNA molecules, giving models which are very close, in terms of Kullback-Leibler divergence, to reference models built with Dirichlet Process Mixtures.
Acknowledgments.
The authors would like to thank Julie Bernauer (INRIA team Amib, LIX,
École Polytechnique) for insightful discussions about the bio-informatics application of our work and for providing us with the presented dataset. FN (5793b870)
would like to thank Dr Kitano and Dr Tokoro for their support.
References
1. B. Pelletier. Informative barycentres in statistics. Annals of the Institute of Statistical Mathematics, 57(4):767–780, December 2005.
2. A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.
3. J. Bernauer, X. Huang, A.Y.L. Sim, and M. Levitt. Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation. RNA, 17(6):1066, 2011.
4. C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. Model-based cluster and discriminant analysis with the MIXMOD software. Computational Statistics & Data Analysis, 51(2):587–600, 2006.
5. L. D. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. IMS, 1986.
6. N. N. Čencov. Statistical decision rules and optimal inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, Providence, R.I., 1982. Translation from the Russian edited by Lev J. Leifman.
7. S.I.R. Costa, S.A. Santos, and J.E. Strapasson. Fisher information matrix and hyperbolic geometry. In Information Theory Workshop, 2005 IEEE, page 3 pp., Aug.–Sept. 2005.
8. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
9. G.A. Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. Communications in Mathematical Physics, 154(1):63–84, 1993.
10. V. Garcia, F. Nielsen, and R. Nock. Levels of details for Gaussian mixture models. Computer Vision – ACCV 2009, pages 514–525, 2010.
11. B. Georgi, I.G. Costa, and A. Schliep. PyMix – the Python mixture package – a tool for clustering of heterogeneous biological data. BMC Bioinformatics, 11(1):9, 2010.
12. Y. Ji, C. Wu, P. Liu, J. Wang, and K.R. Coombes. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21(9):2118, 2005.
13. R. E. Kass and P. W. Vos. Geometrical Foundations of Asymptotic Inference. John Wiley & Sons, September 1987.
14. I. Mayrose, N. Friedman, and T. Pupko. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics, 21(Suppl 2), 2005.
15. F. Nielsen, S. Boltz, and O. Schwander. Bhattacharyya clustering with applications to mixture simplifications. In IEEE International Conference on Pattern Recognition, Istanbul, Turkey, 2010. ICPR'10.
16. F. Nielsen and V. Garcia. Statistical exponential families: A digest with flash cards. arXiv:0911.4863, November 2009.
17. F. Nielsen and R. Nock. Hyperbolic Voronoi diagrams made easy. arXiv:0903.3287, March 2009.
18. F. Nielsen and R. Nock. Jensen-Bregman Voronoi diagrams and centroidal tessellations. In Voronoi Diagrams in Science and Engineering (ISVD), 2010 International Symposium on, pages 56–65. IEEE, 2010.
19. E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
20. C.E. Rasmussen. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 12:554–560, 2000.
21. F. Reverter and J.M. Oller. Computing the Rao distance for Gamma distributions. Journal of Computational and Applied Mathematics, 157(1):155–167, 2003.
22. G. Rong, M. Jin, and X. Guo. Hyperbolic centroidal Voronoi tessellation. In Proceedings of the 14th ACM Symposium on Solid and Physical Modeling, SPM '10, pages 117–126, New York, NY, USA, 2010. ACM.
23. O. Schwander and F. Nielsen. Model centroids for the simplification of kernel density estimators. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012.
24. J.C. Seabra, F. Ciompi, O. Pujol, J. Mauri, P. Radeva, and J. Sanches. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Transactions on Biomedical Engineering, 58(5):1314–1324, May 2011.
25. S.J. Sheather and M.C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), 53(3):683–690, 1991.
26. A. Y. L. Sim, O. Schwander, M. Levitt, and J. Bernauer. Evaluating mixture models for building RNA knowledge-based potentials. Journal of Bioinformatics and Computational Biology (to appear), 2012.