Class Probabilities and the Log-sum-exp Trick

Oren Freifeld
Computer Science, Ben-Gurion University
May 14, 2017
Disclaimer
Both the problem and the solution described in these slides are widely
known. I don’t remember where I saw the solution the first time, and
couldn’t find out who should be credited with discovering it. Some of my
derivations below are based on Ryan Adams's post at
https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/
Numerical Issues with Computing Class Probabilities
We often need to compute, for data point x, expressions such as:
$$p(z = k \mid \theta, x) \propto w_k \exp(l_k) \qquad \text{and} \qquad \frac{w_k \exp(l_k)}{\sum_{k'=1}^{K} w_{k'} \exp(l_{k'})}$$
where $l_k \in \mathbb{R}$ for all $k \in \{1, \ldots, K\}$.
Here, $l_k$ does not necessarily stand for a log-likelihood; rather, it stands for the
nominal value of the exponent of the $k$-th term of interest.
Example
In EM for GMM, the E step involves
$$r_{i,k} = \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} \, \mathcal{N}(x_i; \mu_{k'}, \Sigma_{k'})}
= \frac{\pi_k (2\pi)^{-n/2} |\Sigma_k|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{\sum_{k'=1}^{K} \pi_{k'} (2\pi)^{-n/2} |\Sigma_{k'}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'})\right)}$$

$$= \frac{\overbrace{\pi_k |\Sigma_k|^{-1/2}}^{w_k} \exp\overbrace{\left(-\tfrac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}^{l_k}}{\sum_{k'=1}^{K} \pi_{k'} |\Sigma_{k'}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu_{k'})^T \Sigma_{k'}^{-1} (x_i - \mu_{k'})\right)}$$
Remark
Here, the $\pi$ in the $2\pi$ term (which cancels out anyway) is the number $\pi$,
while $\pi_k$ is the weight of the $k$-th component; this is confusing, but it is a fairly
standard notation, especially in Bayesian statistics.
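In code, $w_k$ and $l_k$ are best formed directly in log space. Below is a minimal NumPy sketch with made-up toy parameters (the names `gmm_log_terms`, `pis`, `mus`, `Sigmas` are mine, not from the slides); the $(2\pi)^{-n/2}$ factor is dropped since it cancels in $r_{i,k}$:

```python
import numpy as np

def gmm_log_terms(x, pis, mus, Sigmas):
    """Return log(w_k) and l_k for each GMM component,
    dropping the (2*pi)^(-n/2) factor that cancels in r_{i,k}."""
    log_w, l = [], []
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        # log w_k = log pi_k - (1/2) log|Sigma_k|
        _, logdet = np.linalg.slogdet(Sigma_k)
        log_w.append(np.log(pi_k) - 0.5 * logdet)
        # l_k = -(1/2) (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
        d = x - mu_k
        l.append(-0.5 * d @ np.linalg.solve(Sigma_k, d))
    return np.array(log_w), np.array(l)

# Toy 2-component, 2-D example (made-up numbers).
x = np.array([1.0, 1.0])
pis = [0.3, 0.7]
mus = [np.array([1.0, 1.0]), np.array([1.0, 2.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
log_w, l = gmm_log_terms(x, pis, mus, Sigmas)
```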
Numerical Issues with Computing Class Probabilities
If $l_k < 0$ and $|l_k|$ is too large, we might have situations where (on a
computer) $\exp(l_k) = 0$ for all $k$. Thus, $\sum_{k'=1}^{K} w_{k'} \exp(l_{k'})$ will be zero.
Similarly, if $l_k > 0$ (which can happen, for example, with some non-Gaussian
conditional class probabilities), we might get $+\infty$ (and/or overflow) if $l_k$ is
too large.
These issues arise in many clustering problems, including (either
Bayesian or non-Bayesian) mixture models.
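The underflow is easy to reproduce; a small NumPy illustration (the specific exponent values are made up):

```python
import numpy as np

l = np.array([-1000.0, -1001.0, -1002.0])  # large-magnitude negative exponents
naive = np.exp(l)                          # each exp underflows to 0.0
# Normalizing then divides 0 by 0, giving nan instead of probabilities.
with np.errstate(invalid="ignore"):
    naive_probs = naive / naive.sum()
```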
The Log-sum-exp Trick
Fact
$$\forall a \in \mathbb{R} \text{ and } \forall \{l_k\}_{k=1}^{K} \subset \mathbb{R}: \qquad
\log \sum_{k=1}^{K} \exp(l_k) = a + \log \sum_{k=1}^{K} \exp(l_k - a)$$
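The fact translates directly into a stable log-sum-exp routine; a minimal sketch (my own implementation, using the usual choice $a = \max_k l_k$):

```python
import numpy as np

def logsumexp(l):
    """Compute log(sum_k exp(l_k)) via the identity
    a + log(sum_k exp(l_k - a)), with a = max_k l_k."""
    a = np.max(l)
    return a + np.log(np.sum(np.exp(l - a)))

# Agrees with the naive formula where the latter is still finite...
small = np.array([0.5, 1.0, 1.5])
# ...and stays finite where the naive formula gives log(0) = -inf.
big_negative = np.array([-1000.0, -1001.0])
```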
The Log-sum-exp Trick
Proof.
$$\log \sum_{k=1}^{K} \exp(l_k) = \log \sum_{k=1}^{K} \exp(l_k - a + a) = \log \sum_{k=1}^{K} \exp(l_k - a) \exp(a)$$
$$= \log \left( \exp(a) \sum_{k=1}^{K} \exp(l_k - a) \right) = \log \exp(a) + \log \sum_{k=1}^{K} \exp(l_k - a) = a + \log \sum_{k=1}^{K} \exp(l_k - a)$$
The Log-sum-exp Trick
Fact
$$\forall a \in \mathbb{R} \text{ and } \forall \{l_k\}_{k=1}^{K} \subset \mathbb{R}: \qquad
\frac{\exp(l_k)}{\sum_{k'=1}^{K} \exp(l_{k'})} = \frac{\exp(l_k - a)}{\sum_{k'=1}^{K} \exp(l_{k'} - a)}$$
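This is the identity behind the familiar "shifted softmax"; a small sketch (the function name `softmax` is mine) checking that the shift does not change the result:

```python
import numpy as np

def softmax(l, a=0.0):
    """exp(l_k - a) / sum_k' exp(l_k' - a); by the fact above,
    the result is the same for every choice of a."""
    e = np.exp(l - a)
    return e / e.sum()

l = np.array([1.0, 2.0, 3.0])
p0 = softmax(l)              # a = 0: the direct (problematic) formula
p1 = softmax(l, a=l.max())   # a = max_k l_k: the stable choice
```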
Proof.
(1) $\log \sum_{k=1}^{K} \exp(l_k) = a + \log \sum_{k=1}^{K} \exp(l_k - a)$ (by the previous fact)

(2) $\exp(l_k) = \exp\left(\log \exp(l_k)\right) \overset{\text{(1) with } K=1}{=} \exp\left(a + \log \exp(l_k - a)\right)$

(3) $\sum_{k=1}^{K} \exp(l_k) = \exp\left(\log \sum_{k=1}^{K} \exp(l_k)\right) \overset{(1)}{=} \exp\left(a + \log \sum_{k=1}^{K} \exp(l_k - a)\right)$

(4) $\dfrac{\exp(l_k - a)}{\sum_{k'=1}^{K} \exp(l_{k'} - a)} = \dfrac{\exp\left(\log \exp(l_k - a)\right)}{\exp\left(\log \sum_{k'=1}^{K} \exp(l_{k'} - a)\right)} = \dfrac{\exp(a) \exp\left(\log \exp(l_k - a)\right)}{\exp(a) \exp\left(\log \sum_{k'=1}^{K} \exp(l_{k'} - a)\right)}$

$\phantom{(4)} = \dfrac{\exp\left(a + \log \exp(l_k - a)\right)}{\exp\left(a + \log \sum_{k'=1}^{K} \exp(l_{k'} - a)\right)} \overset{(2)\&(3)}{=} \dfrac{\exp(l_k)}{\sum_{k'=1}^{K} \exp(l_{k'})}$
The Log-sum-exp Trick
$$\frac{\exp(l_k - a)}{\sum_{k'=1}^{K} \exp(l_{k'} - a)} = \frac{\exp(l_k)}{\sum_{k'=1}^{K} \exp(l_{k'})}$$
Choose $a = \max_k l_k$ and compute the LHS, not the problematic RHS.
This prevents $+\infty$, and even if some values vanish, there will be at
least one survivor ($e^{\max_k l_k - a} = e^0 = 1 > 0$), so the denominator will be
strictly positive (and finite).
More generally, instead of computing
$$\frac{w_k \exp(l_k)}{\sum_{k'=1}^{K} w_{k'} \exp(l_{k'})}$$
use
$$\frac{\exp(l_k + \log w_k - a)}{\sum_{k'=1}^{K} \exp(l_{k'} + \log w_{k'} - a)}$$
where $a = \max_k (l_k + \log w_k)$ and where we also used the fact that
$w_k \exp(l_k) = \exp(\log w_k) \exp(l_k) = \exp(l_k + \log w_k)$.
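Putting the pieces together, a sketch of the full recipe for the weighted case (the function name and the toy numbers are mine):

```python
import numpy as np

def class_probabilities(log_w, l):
    """Stable w_k exp(l_k) / sum_k' w_k' exp(l_k'),
    given log-weights log_w and exponents l."""
    s = l + log_w               # s_k = l_k + log w_k
    a = s.max()                 # a = max_k (l_k + log w_k)
    e = np.exp(s - a)           # at least one entry is exp(0) = 1
    return e / e.sum()

# Exponents far too negative for the naive formula:
log_w = np.log(np.array([0.2, 0.5, 0.3]))
l = np.array([-2000.0, -2001.0, -2000.5])
probs = class_probabilities(log_w, l)
```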