Simplification and comparison of mixtures of Exponential families

Séminaire MIA, La Rochelle
Olivier Schwander
Joint work with Frank Nielsen (LIX/Sony CSL), Julie Bernauer
(LIX/INRIA), Adelene Sim, Michael Levitt (Stanford)
LIX – École Polytechnique
March 20, 2012
Olivier Schwander
Simplification and comparison of mixture models
Outline
Exponential Families
Definition and Examples
Bregman divergences
Mixture Models
Statistical mixtures
Getting mixtures
Simplification of GMM
Software library
Presentation
Comparison
Upper-bounding Kullback-Leibler
Other ground distances
Comparing Kullback-Leibler and EMD
A ubiquitous family
Game: guess the corresponding distributions
(Images from Wikipedia)
A ubiquitous family
Gaussian or normal (generic, isotropic Gaussian, diagonal Gaussian, rectified Gaussian or Wald distributions, log-normal), Poisson, Bernoulli, binomial, multinomial (trinomial, Hardy-Weinberg distribution), Laplacian, Gamma (including the chi-squared), Beta, exponential, Wishart, Dirichlet, Rayleigh, probability simplex, negative binomial, Weibull, von Mises-Fisher, Pareto, skew logistic, hyperbolic secant, etc.
Exponential families
Definition
p(x; λ) = p_F(x; θ) = exp(⟨t(x)|θ⟩ − F(θ) + k(x))
- λ: source parameter
- t(x): sufficient statistic
- θ: natural parameter
- F(θ): log-normalizer
- k(x): carrier measure
F is a strictly convex and differentiable function; ⟨·|·⟩ is a scalar product.
Examples
Poisson distribution
p(x; λ) = λ^x exp(−λ) / x!
- t(x) = x
- θ = log λ
- F(θ) = exp(θ)
- k(x) = −log(x!)
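As a sanity check on the decomposition above, the canonical form exp(⟨t(x)|θ⟩ − F(θ) + k(x)) can be evaluated numerically and compared with the usual Poisson mass function (a minimal sketch; the helper names are mine):

```python
import math

# Poisson as an exponential family: t(x) = x, theta = log(lam),
# F(theta) = exp(theta), k(x) = -log(x!)
def poisson_canonical(x, lam):
    theta = math.log(lam)
    return math.exp(x * theta - math.exp(theta) - math.lgamma(x + 1))

def poisson_classic(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)
```

Both expressions agree for any x and λ, which confirms the choice k(x) = −log(x!).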
Multivariate normal distribution
p(x; µ, Σ) = (2π)^(−d/2) (det Σ)^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
Exponential family:
- θ = (θ₁, θ₂) = (Σ⁻¹µ, ½Σ⁻¹)
- F(θ) = ¼ tr(θ₂⁻¹ θ₁ θ₁ᵀ) − ½ log det θ₂ + (d/2) log π
- t(x) = (x, −x xᵀ)
- k(x) = 0
Composite vector-matrix inner product:
⟨θ, θ′⟩ = θ₁ᵀ θ₁′ + tr(θ₂ᵀ θ₂′)
Bregman divergences
Bregman divergence
B_F(p, q) = F(p) − F(q) − ⟨p − q|∇F(q)⟩
F is a strictly convex and differentiable function.
Bregman ball
Itakura-Saito divergence:
- left-sided ball
- right-sided ball
The left-sided ball is convex since B_F(·, q) is convex.
Very useful in sound processing applications.
Bregman centroids
Non-symmetric divergence:
B_F(p‖q) ≠ B_F(q‖p)
Various notions of centroids:
- left-sided: min_x Σ_i ω_i B_F(x, p_i)
- right-sided: min_x Σ_i ω_i B_F(p_i, x)
- more with symmetrizations of Bregman divergences!
Here, the mean is defined by optimization.
Closed-form formulas
Right-sided centroid:
c^R = (1/n) Σ_{i=1..n} p_i
Left-sided centroid:
c^L = ∇F*( (1/n) Σ_i ∇F(p_i) )
(with ∇F* = (∇F)⁻¹)
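For instance, with the Itakura-Saito generator F(x) = −log x, the right-sided centroid is the arithmetic mean while the left-sided one works out to the harmonic mean. A sketch of the unweighted formulas (∇F(x) = −1/x, so (∇F)⁻¹(y) = −1/y; the function names are mine):

```python
def right_centroid(points):
    # c^R = (1/n) * sum of the p_i
    return sum(points) / len(points)

def left_centroid(points, gradF, gradF_inv):
    # c^L = (grad F)^{-1} applied to the average gradient
    avg = sum(gradF(p) for p in points) / len(points)
    return gradF_inv(avg)

# Itakura-Saito generator F(x) = -log x on the positive reals
gradF = lambda x: -1.0 / x
gradF_inv = lambda y: -1.0 / y

pts = [1.0, 2.0, 4.0]
```

On pts the right-sided centroid is 7/3 (arithmetic mean) and the left-sided one is 12/7 (harmonic mean).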
Legendre duality
Legendre-Fenchel transform: strictly convex and differentiable functions come in pairs.
Legendre dual:
F*(η) = sup_θ { ⟨θ, η⟩ − F(θ) }
The supremum is attained for η = ∇F(θ).
For Bregman divergences:
B_F(θ_p, θ_q) = B_{F*}(η_q, η_p)
Expectation parameters
For exponential families:
η = E[t(X)] = µ = ∇F(θ)
Properties:
- ∇F* = (∇F)⁻¹
- F* = ∫ ∇F* = ∫ (∇F)⁻¹
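For the Poisson family these relations can be checked explicitly: F(θ) = exp(θ), so η = ∇F(θ) = λ, F*(η) = η log η − η, and ∇F*(η) = log η = (∇F)⁻¹(η). A small numerical check (variable names are mine):

```python
import math

F = math.exp                              # Poisson log-normalizer
gradF = math.exp                          # eta = grad F(theta) = lambda
Fstar = lambda eta: eta * math.log(eta) - eta
gradFstar = math.log                      # inverse of gradF

theta = 0.7
eta = gradF(theta)
```

Besides ∇F* = (∇F)⁻¹, the pair satisfies the Fenchel-Young equality F(θ) + F*(η) = ⟨θ, η⟩ at η = ∇F(θ).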
Dual parametrization
(figure omitted)
Flash cards
(figure omitted)
Bijection with EF
For each EF, a Bregman divergence, and reciprocally:
p_F(x; θ) = exp(−B_{F*}(t(x), η) + F*(t(x)) + k(x))
Relation with the Kullback-Leibler divergence:
KL(p_F(·; θ₁), p_F(·; θ₂)) = B_{F*}(η₁, η₂) = B_F(θ₂, θ₁)
Closed-form formula for KL between members of the same EF.
Bregman Hard Clustering: k-means with KL.
Banerjee, Merugu, Dhillon, Ghosh. Clustering with Bregman divergences. 2005
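Bregman hard clustering is ordinary k-means with the squared Euclidean distance replaced by B_F; since the right-sided centroid is always the arithmetic mean, the Lloyd update is unchanged. A minimal 1D sketch (a toy illustration, not the Banerjee et al. implementation):

```python
import math

def bregman_kmeans(xs, k, F, gradF, iters=20):
    """Lloyd iterations: assign by B_F(x, c); right-sided centroids are means."""
    B = lambda p, q: F(p) - F(q) - (p - q) * gradF(q)
    centroids = list(xs[:k])                      # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: B(x, centroids[j]))
            clusters[j].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

# generalized KL generator on the positive reals: F(x) = x log x - x
F = lambda x: x * math.log(x) - x
gradF = math.log
```

On two well-separated groups of positive numbers this recovers the two group means.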
Bregman Soft Clustering: EM
Initialization:
- clustering of the dataset
- for cluster i: η_i = (1/n_i) Σ_j t(x_j)
Expectation step:
p(i, j) = ω_j exp(−B_{F*}(t(x_i), η_j)) / Σ_l ω_l exp(−B_{F*}(t(x_i), η_l))
(the exp(k(x_i)) factors cancel)
Maximization step:
ω_j = (1/N) Σ_i p(i, j)
η_j = Σ_i p(i, j) t(x_i) / Σ_i p(i, j)
Mixtures of Exponential families
Mixture:
- Pr(X = x) = Σ_i ω_i Pr(X = x | µ_i, Σ_i)
- each Pr(X = x | µ_i, Σ_i) is a member of an EF
Famous special case: Gaussian Mixture Models (GMM)
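A mixture density is just a convex combination of component densities; a minimal 1D GMM evaluation (the helper names are mine):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # 1D normal density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, weights, mus, sigmas):
    # Pr(X = x) = sum_i w_i * Pr(X = x | mu_i, sigma_i)
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))
```

With two identical components weighted ½ + ½ this reduces to the single-component density, a quick consistency check.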
Getting mixtures
Expectation-Maximization
Kernel density estimation:
- a.k.a. Parzen windows method
- one kernel per data point (often a Gaussian kernel)
- fixed bandwidth
(figures: a point set and its kernel density estimate)
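A kernel density estimate is itself a mixture with one equally-weighted component per data point; a 1D sketch with a Gaussian kernel and fixed bandwidth (my own function name):

```python
import math

def kde(x, data, bandwidth):
    """Parzen window estimate: average of one Gaussian kernel per point."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) / norm
               for xi in data) / len(data)
```

With a single data point the estimate is exactly one Gaussian kernel, which is why a KDE over n points is an n-component mixture.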
Why simplification?
Did you notice the time needed to load the previous slide?
- vectorial format for the KDE
- need to load 120 × 120 = 14400 Gaussians
KDE: good approximation, but
- very large mixture: time and memory problems
- a low number of components is often enough (→ EM)
EM: small approximation, but we may want a fixed number of components without learning a new mixture
- EM is slow
- we may not have the original dataset, just the model
k-means
(figures: the KDE mixture before and after simplification by k-means)
Fisher divergence
Riemannian metric on the statistical manifold.
Fisher information matrix:
g_ij = I(θ_i, θ_j) = E[ (∂/∂θ_i) log p(X; θ) · (∂/∂θ_j) log p(X; θ) ]
ds² = Σ_i Σ_j g_ij dθ_i dθ_j
Fisher divergence
Known for 0-mean Gaussians:
- not really interesting for mixtures...
- open problem for the other cases
For 1D data:
- Poincaré hyperbolic distance in the Poincaré upper half-plane
FRD(f_p, f_q) = √2 ln [ ( |(µ_p/√2, σ_p) − (µ_q/√2, −σ_q)| + |(µ_p/√2, σ_p) − (µ_q/√2, σ_q)| ) / ( |(µ_p/√2, σ_p) − (µ_q/√2, −σ_q)| − |(µ_p/√2, σ_p) − (µ_q/√2, σ_q)| ) ]
Fisher centroids
No closed-form formula:
- even for 1D Gaussians
- brute-force search for the minimizer? not very elegant
Model centroids
Centroid in constant curvature spaces:
- from the Poincaré upper half-plane to the Poincaré disk
- from the Poincaré disk to the Klein disk
- from the Klein disk to the Minkowski model
- center of mass ω₁p′₁ + ω₂p′₂ and renormalization in the Minkowski model
- from the Minkowski model back to ... the Poincaré upper half-plane
(figure: the construction in the Klein disk and the Minkowski model)
Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. 1993
Experiments: log-likelihood
(figure: log-likelihood vs. number of components for Expectation-Maximization, Bregman Hard Clustering, Model Hard Clustering, One-step Model Hard Clustering)
- EM and k-means with KL are very good, no matter the number of components
- k-means with model centroids and one-step k-means with model centroids just need a few more components
Experiments: time
(figure: running time vs. number of components for the same four methods)
- KL is even slower than EM (a closed-form formula does not mean cheap computation)
pyMEF: a Python library for Exponential families
Manipulation of mixtures of EF:
- direct creation of mixtures
- learning of mixtures: Bregman soft clustering
- simplification of mixtures: Bregman hard clustering, model hard clustering
- visualization
Goals:
- generic framework for EF (and Information Geometry)
- rapid prototyping (Python shell)
Flash cards

class MyEF(ExponentialFamily):
    ...
    def lambda2theta(self, l):
        ...
    def theta2lambda(self, t):
        ...
    def eta2lambda(self, e):
        ...
    def lambda2eta(self, l):
        ...
    def t(self, x):
        ...
    def k(self, x):
        ...
    def F(self, t):
        ...
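Filling in that skeleton for the Poisson family gives a concrete idea of the interface (a sketch: the `ExponentialFamily` base class and method names follow the slide, not necessarily the actual pyMEF API):

```python
import math

class ExponentialFamily:          # stand-in base class for this sketch
    pass

class Poisson(ExponentialFamily):
    def lambda2theta(self, l): return math.log(l)
    def theta2lambda(self, t): return math.exp(t)
    def lambda2eta(self, l):   return l              # eta = E[t(X)] = lambda
    def eta2lambda(self, e):   return e
    def t(self, x):            return x
    def k(self, x):            return -math.lgamma(x + 1)   # -log(x!)
    def F(self, t):            return math.exp(t)
```

The canonical density exp(⟨t(x)|θ⟩ − F(θ) + k(x)) then reproduces the usual Poisson mass function.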
Earth Mover Distance
A.k.a.:
- mass transport problem
- optimal transport problem
- Wasserstein/Monge/Kantorovich metric
EMD(P, Q) = min_F Σ_{i,j} F(i, j) d(g_i, g_j)
(minimized over feasible flows F between the components g_i of P and g_j of Q)
Upper-bounding Kullback-Leibler
Log-sum inequality:
(Σ_i a_i) log(Σ_i a_i / Σ_i b_i) ≤ Σ_i a_i log(a_i / b_i)
Hence, for mixtures:
KL(P, P′) ≤ KL(ω, ω′) + Σ_i ω_i KL(f_i, f′_i)
Can be improved for a well-chosen ordering of the components!
Looking for the best permutation:
- an optimal transport problem
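For two Gaussian mixtures aligned component-by-component, this bound is cheap to evaluate with the closed-form Gaussian KL; a 1D sketch (function names are mine):

```python
import math

def kl_gauss(m1, s1, m2, s2):
    # closed-form KL between the 1D Gaussians N(m1, s1^2) and N(m2, s2^2)
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def kl_discrete(w, w2):
    # KL between the weight vectors
    return sum(a * math.log(a / b) for a, b in zip(w, w2))

def kl_mixture_bound(w, mus, sigmas, w2, mus2, sigmas2):
    # KL(P, P') <= KL(omega, omega') + sum_i omega_i KL(f_i, f'_i)
    return kl_discrete(w, w2) + sum(
        wi * kl_gauss(m, s, m2, s2)
        for wi, m, s, m2, s2 in zip(w, mus, sigmas, mus2, sigmas2))
```

For identical mixtures both terms vanish, so the bound is tight at zero.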
The upper bound as an EMD
KL(P; P′) ≤ KL(ω, ω′_{σ*}) + Σ_i ω_i KL(f_i, f′_{σ*(i)})
          = Σ_i ( ω_i log(ω_i / ω′_{σ*(i)}) + ω_i KL(f_i, f′_{σ*(i)}) )
Ground distance:
GD((α, f), (β, g)) = α log(α/β) + α KL(f, g)
Finding σ*:
- Hungarian algorithm, O(n³)
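For small mixtures the optimal permutation σ* can also be found by brute force over the n! assignments (the Hungarian algorithm, e.g. scipy's `linear_sum_assignment`, does the same in O(n³)); a stdlib-only sketch over a generic pairwise cost matrix:

```python
from itertools import permutations

def best_permutation(cost):
    """Minimize sum_i cost[i][sigma(i)] over all permutations sigma (O(n!))."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))
```

Here `cost[i][j]` would be the ground distance GD between component i of P and component j of P′.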
Kullback-Leibler divergence
Sound information-theoretic interpretation:
KL(p, q) = ∫ p log(p/q)
Closed-form formula between exponential families:
- nice for Gaussian Mixture Models
Bhattacharyya distance
Amount of overlap between densities:
B(p, q) = −log ∫ √(p q)
Closed-form formula between exponential families:
- nice for Gaussian Mixture Models
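Between two 1D Gaussians the overlap integral has a well-known closed form; a sketch (the function name is mine):

```python
import math

def bhattacharyya_gauss(m1, s1, m2, s2):
    # B = -log int sqrt(p q) for 1D Gaussians:
    # (m1-m2)^2 / (4 (v1+v2)) + 1/2 log((v1+v2) / (2 sqrt(v1 v2)))
    v1, v2 = s1 ** 2, s2 ** 2
    return ((m1 - m2) ** 2 / (4 * (v1 + v2))
            + 0.5 * math.log((v1 + v2) / (2 * math.sqrt(v1 * v2))))
```

Unlike KL it is symmetric in its arguments, and it vanishes for identical densities.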
Mass transport distance
A real mass transportation problem:
MT(p, q)² = |µ_p − µ_q|² + tr( Σ_p + Σ_q − 2 (Σ_p^{1/2} Σ_q Σ_p^{1/2})^{1/2} )
Closed-form formula between Gaussian distributions:
- nice for Gaussian Mixture Models
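In 1D the matrix square roots collapse and the formula reduces to MT(p, q)² = (µ_p − µ_q)² + (σ_p − σ_q)²; a sketch (the function name is mine):

```python
def w2_gauss_1d(m1, s1, m2, s2):
    # squared Wasserstein-2 (mass transport) distance between 1D Gaussians
    return (m1 - m2) ** 2 + (s1 - s2) ** 2
```

Note that it stays finite even when the supports barely overlap, unlike KL which blows up.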
A toy retrieval application
Relevant examples:
- a "good" distance: KL
- taking the closest 10% from the query
Evaluation:
- precision at rank k
- mean average precision
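Both evaluation measures take a few lines given a ranked list of retrieved items and the set of relevant ones (a sketch with my own helper names):

```python
def precision_at_k(ranked, relevant, k):
    # fraction of the top-k retrieved items that are relevant
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def average_precision(ranked, relevant):
    # mean of precision@i over the ranks i where a relevant item appears
    score, hits = 0.0, 0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)
```

The mean average precision (mAP) is then the mean of `average_precision` over all queries.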
Is the ranking the same?
(figures: ranking with KL vs. ranking with EMD)
Precision at rank k
(figure: precision vs. short-list size for Bhat, MT, KL, Maj KL)
EMD + Bhattacharyya is the most similar distance to KL.
mAP
(figure: mean average precision for Bhat, MT, KL, Maj KL)
- low scores: the distances are very different
- EMD + Bhattacharyya is still the most similar distance to KL
Conclusion
A better way to get mixtures:
- compact mixtures
- fast to learn
- fast to use
New ways to compare mixtures:
- quite different from KL
- a lot faster