CS-E4820 Machine Learning: Advanced Probabilistic Methods
Pekka Parviainen, Pekka Marttinen, Sami Remes, Pedram Daee (Spring 2017)
Exercise solutions, round 9, due 23:59 on 24th March, 2017
Please return your solutions on Peergrade.io as a single anonymized PDF file.

Problem 1. “Stochastic Gradient Ascent.” In this exercise we will apply stochastic gradient ascent to the familiar two-component Gaussian mixture
$$p(x_n \mid \theta, \pi) = (1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1).$$

(a) Compute the partial derivatives $\frac{\partial L}{\partial \theta}$ and $\frac{\partial L}{\partial \pi}$ of the log-likelihood
$$L(\theta, \pi) = \sum_{n=1}^{N} \log p(x_n \mid \theta, \pi).$$

(b) The parameter $\pi$ is constrained to be in the interval $(0, 1)$, which is problematic for unconstrained optimization with SGD. We can apply a change of variable $\pi = \eta(\hat{\pi})$, where
$$\eta(z) = \frac{1}{1 + \exp(-z)},$$
so that
$$\hat{\pi} = \eta^{-1}(\pi) = \log \frac{\pi}{1 - \pi}$$
can be treated as an unconstrained parameter. Compute the partial derivative of the log-likelihood, $\frac{\partial L}{\partial \hat{\pi}}$, with respect to the new parameter $\hat{\pi}$.
Hint: Chain rule for differentiation, $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$.

(c) Implement stochastic gradient ascent using a mini-batch size $S = 20$ and step-size schedule $\gamma_t = (t + \tau)^{-\kappa}$ with forgetting rate $\kappa = 0.9$ and delay $\tau = 1$ by writing the code to compute the gradients in simple_sgd_template.m. Plot the estimated values as a function of iterations of the algorithm, and compare them to the true values (which can be plotted as horizontal lines in the same plot). Does the algorithm seem to converge to the true values?

Solution.

(a) Since $\frac{\partial}{\partial \theta} L(\theta, \pi) = \sum_{n=1}^{N} \frac{\partial}{\partial \theta} \log p(x_n \mid \theta, \pi)$, we can compute the derivatives for each point separately as follows:
$$\begin{aligned}
\frac{\partial}{\partial \theta} \log p(x_n \mid \theta, \pi)
&= \frac{\partial}{\partial \theta} \log\bigl((1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr) \\
&= \frac{\pi \frac{\partial}{\partial \theta} \mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)} \\
&= \frac{\pi (x_n - \theta)\,\mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)}.
\end{aligned}$$
Similarly for $\pi$:
$$\begin{aligned}
\frac{\partial}{\partial \pi} \log p(x_n \mid \theta, \pi)
&= \frac{\partial}{\partial \pi} \log\bigl((1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr) \\
&= \frac{\frac{\partial}{\partial \pi}\bigl[(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr]}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)} \\
&= \frac{-\mathcal{N}(x_n \mid 0, 1) + \mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)}.
\end{aligned}$$

(b) The chain rule gives
$$\frac{\partial L}{\partial \hat{\pi}} = \frac{\partial L}{\partial \pi} \frac{\partial \pi}{\partial \hat{\pi}},$$
so we only need to compute
$$\begin{aligned}
\frac{\partial \pi}{\partial \hat{\pi}} = \frac{\partial}{\partial \hat{\pi}} \eta(\hat{\pi})
&= \frac{\partial}{\partial \hat{\pi}} \frac{1}{1 + \exp(-\hat{\pi})}
 = \frac{\exp(-\hat{\pi})}{(1 + \exp(-\hat{\pi}))^2} \\
&= \frac{1}{1 + \exp(-\hat{\pi})} \cdot \frac{\exp(-\hat{\pi})}{1 + \exp(-\hat{\pi})}
 = \frac{1}{1 + \exp(-\hat{\pi})} \left(1 - \frac{1}{1 + \exp(-\hat{\pi})}\right)
 = \pi (1 - \pi).
\end{aligned}$$

(c) Trace plot shown in Figure 1. The code:

for iter = 1:n_iter
    gradient = [0 0]; % Partial derivatives of the log-likelihood w.r.t. theta and pi.
    for sample_round = 1:batch_size
        n = ceil(rand * n_samples); % Select a random sample
        % Partial derivative w.r.t. theta
        der_th = (pi .* normpdf(x(n), th, 1) .* (x(n) - th)) ./ ...
            ((1 - pi) .* normpdf(x(n), 0, 1) + pi .* normpdf(x(n), th, 1));
        % Partial derivative w.r.t. pi
        der_pi = (-1 .* normpdf(x(n), 0, 1) + normpdf(x(n), th, 1)) ./ ...
            ((1 - pi) .* normpdf(x(n), 0, 1) + pi .* normpdf(x(n), th, 1));
        gradient = gradient + [der_th, der_pi];
    end
    % Update theta
    th = th + step_sizes(iter) * gradient(1);
    % Update pi (via the unconstrained parameter pi_hat)
    pi_hat = pi_hat + step_sizes(iter) .* gradient(2) * pi * (1 - pi);
    pi = 1 / (1 + exp(-pi_hat));
end

Figure 1: SGD trace plot.
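To run the snippet above outside the course template, the surrounding setup (data, initialization, and step sizes) is also needed. The following is a minimal self-contained sketch, not the template's actual code: the true values θ = 2 and π = 0.3, the zero initialization, and the iteration count are illustrative assumptions. The mixing weight is stored in a variable p to avoid shadowing MATLAB's built-in pi.

% Minimal self-contained SGD sketch (assumed setup, not the course template).
rng(0);
n_samples  = 1000;
theta_true = 2;                      % assumed true component mean
pi_true    = 0.3;                    % assumed true mixing weight
z = rand(n_samples, 1) < pi_true;                        % latent component indicators
x = randn(n_samples, 1) + z .* theta_true;               % x ~ N(0,1) or N(theta_true,1)

n_iter     = 2000;
batch_size = 20;                     % mini-batch size S
kappa = 0.9; tau = 1;
step_sizes = ((1:n_iter) + tau).^(-kappa);               % gamma_t = (t + tau)^(-kappa)

th = 0; pi_hat = 0;                  % initial values (assumption)
p = 1 / (1 + exp(-pi_hat));          % mixing weight, p = eta(pi_hat)
th_est = zeros(n_iter, 1); pi_est = zeros(n_iter, 1);

for iter = 1:n_iter
    gradient = [0 0];
    for sample_round = 1:batch_size
        n = ceil(rand * n_samples);                      % pick a random data point
        mix = (1 - p) * normpdf(x(n), 0, 1) + p * normpdf(x(n), th, 1);
        der_th = p * normpdf(x(n), th, 1) * (x(n) - th) / mix;        % from (a)
        der_pi = (normpdf(x(n), th, 1) - normpdf(x(n), 0, 1)) / mix;  % from (a)
        gradient = gradient + [der_th, der_pi];
    end
    th     = th + step_sizes(iter) * gradient(1);
    pi_hat = pi_hat + step_sizes(iter) * gradient(2) * p * (1 - p);   % chain rule from (b)
    p      = 1 / (1 + exp(-pi_hat));
    th_est(iter) = th; pi_est(iter) = p;
end

% Trace plot: estimates vs. iteration, with the true values as dashed horizontal lines.
figure; hold on;
plot(1:n_iter, th_est, 'b');
plot(1:n_iter, pi_est, 'r');
plot([1 n_iter], [theta_true theta_true], 'b--');
plot([1 n_iter], [pi_true pi_true], 'r--');
hold off; xlabel('iteration'); legend('\theta estimate', '\pi estimate');

With a setup of this kind, both traces should settle near the dashed true-value lines, which is the convergence behaviour asked about in part (c).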
Problem 2. “Stochastic Variational Inference.” Read the pseudo-code for the stochastic variational algorithm for the simple Gaussian mixture model (attached under Lecture 9), and do the following:

(a) Implement the steps (10), (12), and (13) of the pseudo-code in simple_svi_template.m. (Note that step 12 is already partially implemented.)

(b) Plot the true (as a horizontal line) and estimated values of the global parameters θ and π with respect to the number of iterations of the algorithm. Do this for batch sizes of 1, 2, 10, and 50. Comment briefly on the results.

Solution.

(a) Code below.

(b) Trace plots shown in Figure 2. Increasing the batch size seems to increase the convergence rate, which makes sense as the algorithm sees more data faster.

for iter = 1:n_iter
    % Allocate:
    pi_par = zeros(batch_size, 2); % First and second variational parameters for q(pi) for each sampled data point.
    eta = zeros(batch_size, 2);    % First and second variational parameters for q(th) for each sampled data point.

    % (4) Values that depend on q(pi) and that are needed when computing the
    % responsibilities. These are the same for all possible samples, and can
    % therefore be calculated once outside the inner loop.
    E_log_pi = psi(alpha_pi) - psi(alpha_pi + beta_pi);
    E_log_pi_c = psi(beta_pi) - psi(alpha_pi + beta_pi);

    for sampling_round = 1:batch_size
        selected_index = ceil(rand * n_samples); % Select randomly a sample whose local parameters to update

        % (7):
        E_log_var = (x(selected_index) - normal_mean).^2 + 1/normal_precision;

        % (8) Compute the responsibilities, factor q(z)
        log_rho1 = E_log_pi_c - 0.5 .* log(2*pi) - 0.5 .* (x(selected_index).^2);
        log_rho2 = E_log_pi - 0.5 .* log(2*pi) - 0.5 .* E_log_var;
        max_log_rho = max(log_rho1, log_rho2); % Normalize to avoid numerical problems when exponentiating.
        rho1 = exp(log_rho1 - max_log_rho);
        rho2 = exp(log_rho2 - max_log_rho);
        r2 = rho2 ./ (rho1 + rho2);
        r1 = 1 - r2;

        % (9) Compute intermediate global variational parameters of the
        % factor q(pi) = Beta(par1, par2), assuming the sampled data item is
        % replicated n_samples times.
        N1 = n_samples .* r1;
        N2 = n_samples .* r2;
        pi_par(sampling_round, 1) = N2 + alpha0; % First parameter
        pi_par(sampling_round, 2) = N1 + alpha0; % Second parameter

        % (10) Compute intermediate variational (natural) parameters of the
        % factor q(theta) = normal(par1, par2), par1 = mu/var, par2 = -1/(2*var),
        % where mu and var are the mean and variance of the Gaussian
        % distribution. Again assume that the sampled data item is
        % replicated n_samples times.
        x2_avg = x(selected_index);
        th_precision = beta0 + N2;
        th_var = 1 / th_precision;
        th_mu = th_var .* N2 .* x2_avg;
        eta(sampling_round, 1) = th_mu ./ th_var;     % The 1st natural parameter
        eta(sampling_round, 2) = -1 ./ (2 .* th_var); % The 2nd natural parameter
    end

    % (12) Update global variational parameters of factor q(pi)
    aux = mean(pi_par, 1); % New estimates, average over sampled data points.
    alpha_pi_new = aux(1);
    beta_pi_new = aux(2);
    alpha_pi = (1 - step_sizes(iter)) .* alpha_pi + step_sizes(iter) .* alpha_pi_new; % Updated estimate (combination of old and new)
    beta_pi = (1 - step_sizes(iter)) .* beta_pi + step_sizes(iter) .* beta_pi_new;

    % (12) Update global variational parameters of factor q(th)
    aux = mean(eta, 1); % Two natural parameters of the normal, averaged over sampled data points.
    eta_new1 = aux(1);
    eta_new2 = aux(2);
    eta1 = (1 - step_sizes(iter)) .* eta1 + step_sizes(iter) .* eta_new1;
    eta2 = (1 - step_sizes(iter)) .* eta2 + step_sizes(iter) .* eta_new2;

    % (13) Compute the 'standard' mean and precision parameters of
    % the factor q(th).
    normal_precision = -2 .* eta2;
    normal_mean = eta1 ./ normal_precision;

    % Keep track of the current estimates
    pi_est(iter) = alpha_pi / (alpha_pi + beta_pi);
    th_est(iter) = -0.5 .* eta1 ./ eta2;
end

Figure 2: Results with SVI.
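For context, the loop above relies on quantities that simple_svi_template.m defines beforehand (priors, initial variational parameters, and the step-size schedule). A sketch of one possible initialization is given below; the prior hyperparameters, initial values, iteration count, and step-size schedule (the same form as in Problem 1) are illustrative assumptions chosen so the loop runs, not the template's actual settings.

% Illustrative setup for the SVI loop above (assumed values, not from the template).
n_samples  = numel(x);               % data vector x assumed to be given
n_iter     = 2000;
batch_size = 10;                     % try 1, 2, 10, 50 for part (b)

alpha0 = 1;                          % symmetric Beta(alpha0, alpha0) prior on pi (assumption)
beta0  = 1;                          % prior precision of theta, i.e. theta ~ N(0, 1/beta0) (assumption)

alpha_pi = alpha0; beta_pi = alpha0;                % initial q(pi) = Beta(alpha_pi, beta_pi)
normal_mean = 0; normal_precision = beta0;          % initial q(theta)
eta1 = normal_mean * normal_precision;              % 1st natural parameter of q(theta)
eta2 = -0.5 * normal_precision;                     % 2nd natural parameter of q(theta)

kappa = 0.9; tau = 1;
step_sizes = ((1:n_iter) + tau).^(-kappa);          % Robbins-Monro step-size schedule

pi_est = zeros(n_iter, 1);           % traces of the point estimates for plotting
th_est = zeros(n_iter, 1);

After the loop, plotting pi_est and th_est against the iteration index, together with horizontal lines at the true values, gives the trace plots compared across batch sizes in part (b).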