CS-E4820 Machine Learning: Advanced Probabilistic Methods
Pekka Parviainen, Pekka Marttinen, Sami Remes, Pedram Daee (Spring 2017)
Exercise solutions, round 9, due 23:59 on 24th March, 2017
Please return your solutions on Peergrade.io as a single anonymized PDF file.

Problem 1. “Stochastic Gradient Ascent.” In this exercise we will apply stochastic gradient ascent to the familiar two-component Gaussian mixture
$$p(x_n \mid \theta, \pi) = (1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1).$$

(a) Compute the partial derivatives $\frac{\partial L}{\partial \theta}$ and $\frac{\partial L}{\partial \pi}$ of the log-likelihood
$$L(\theta, \pi) = \sum_{n=1}^{N} \log p(x_n \mid \theta, \pi).$$

(b) The parameter $\pi$ is constrained to be in the interval $(0, 1)$, which is problematic for unconstrained optimization with SGD. We can apply a change of variable $\pi = \eta(\hat{\pi})$, where
$$\eta(z) = \frac{1}{1 + \exp(-z)},$$
so that
$$\hat{\pi} = \eta^{-1}(\pi) = \log \frac{\pi}{1 - \pi}$$
can be treated as an unconstrained parameter. Compute the partial derivative of the log-likelihood, $\frac{\partial L}{\partial \hat{\pi}}$, with respect to the new parameter $\hat{\pi}$.
Hint: Chain rule for differentiation, $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$.

(c) Implement stochastic gradient ascent using a mini-batch size $S = 20$ and step-size schedule $\gamma_t = (t + \tau)^{-\kappa}$ with forgetting rate $\kappa = 0.9$ and delay $\tau = 1$ by writing the code to compute the gradients in simple_sgd_template.m. Plot the estimated values as a function of iterations of the algorithm, and compare them to the true values (which can be plotted as horizontal lines in the same plot). Does the algorithm seem to converge to the true values?

Solution.

(a) Since $\frac{\partial}{\partial \theta} L(\theta, \pi) = \sum_{n=1}^{N} \frac{\partial}{\partial \theta} \log p(x_n \mid \theta, \pi)$, we can compute the derivatives for each point separately as follows:
$$\begin{aligned}
\frac{\partial}{\partial \theta} \log p(x_n \mid \theta, \pi)
&= \frac{\partial}{\partial \theta} \log\bigl((1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr) \\
&= \frac{\pi \frac{\partial}{\partial \theta} \mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)} \\
&= \frac{\pi (x_n - \theta)\,\mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)}.
\end{aligned}$$
Similarly for $\pi$:
$$\begin{aligned}
\frac{\partial}{\partial \pi} \log p(x_n \mid \theta, \pi)
&= \frac{\partial}{\partial \pi} \log\bigl((1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr) \\
&= \frac{\frac{\partial}{\partial \pi}\bigl[(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)\bigr]}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)} \\
&= \frac{-\mathcal{N}(x_n \mid 0, 1) + \mathcal{N}(x_n \mid \theta, 1)}{(1 - \pi)\,\mathcal{N}(x_n \mid 0, 1) + \pi\,\mathcal{N}(x_n \mid \theta, 1)}.
\end{aligned}$$

(b) The chain rule gives
$$\frac{\partial L}{\partial \hat{\pi}} = \frac{\partial L}{\partial \pi} \frac{\partial \pi}{\partial \hat{\pi}},$$
so we only need to compute
$$\begin{aligned}
\frac{\partial \pi}{\partial \hat{\pi}} = \frac{\partial}{\partial \hat{\pi}} \eta(\hat{\pi})
&= \frac{\partial}{\partial \hat{\pi}} \frac{1}{1 + \exp(-\hat{\pi})}
 = \frac{\exp(-\hat{\pi})}{(1 + \exp(-\hat{\pi}))^2} \\
&= \frac{1}{1 + \exp(-\hat{\pi})} \cdot \frac{\exp(-\hat{\pi})}{1 + \exp(-\hat{\pi})}
 = \frac{1}{1 + \exp(-\hat{\pi})} \left(1 - \frac{1}{1 + \exp(-\hat{\pi})}\right)
 = \pi (1 - \pi).
\end{aligned}$$

(c) Trace plot shown in Figure 1. The code:

for iter = 1:n_iter
    gradient = [0 0]; % Partial derivatives of the log-likelihood w.r.t. theta and pi.
    for sample_round = 1:batch_size
        n = ceil(rand * n_samples); % Select a random sample
        % Partial derivative w.r.t. theta
        der_th = (pi .* normpdf(x(n), th, 1) .* (x(n) - th)) ./ ...
            ((1 - pi) .* normpdf(x(n), 0, 1) + pi .* normpdf(x(n), th, 1));
        % Partial derivative w.r.t. pi
        der_pi = (-1 .* normpdf(x(n), 0, 1) + normpdf(x(n), th, 1)) ./ ...
            ((1 - pi) .* normpdf(x(n), 0, 1) + pi .* normpdf(x(n), th, 1));
        gradient = gradient + [der_th, der_pi];
    end
    % Update theta
    th = th + step_sizes(iter) * gradient(1);
    % Update pi (via the unconstrained parameter pi_hat)
    pi_hat = pi_hat + step_sizes(iter) .* gradient(2) * pi * (1 - pi);
    pi = 1 / (1 + exp(-pi_hat));
end

Figure 1: SGD trace plot.
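To run the snippet above outside the course template, the surrounding setup (data, initialization, and step sizes) is also needed. The following is a minimal self-contained sketch, not the template's actual code: the true values θ = 2 and π = 0.3, the zero initialization, and the iteration count are illustrative assumptions. The mixing weight is stored in a variable p to avoid shadowing MATLAB's built-in pi.

% Minimal self-contained SGD sketch (assumed setup, not the course template).
rng(0);
n_samples  = 1000;
theta_true = 2;                      % assumed true component mean
pi_true    = 0.3;                    % assumed true mixing weight
z = rand(n_samples, 1) < pi_true;                        % latent component indicators
x = randn(n_samples, 1) + z .* theta_true;               % x ~ N(0,1) or N(theta_true,1)

n_iter     = 2000;
batch_size = 20;                     % mini-batch size S
kappa = 0.9; tau = 1;
step_sizes = ((1:n_iter) + tau).^(-kappa);               % gamma_t = (t + tau)^(-kappa)

th = 0; pi_hat = 0;                  % initial values (assumption)
p = 1 / (1 + exp(-pi_hat));          % mixing weight, p = eta(pi_hat)
th_est = zeros(n_iter, 1); pi_est = zeros(n_iter, 1);

for iter = 1:n_iter
    gradient = [0 0];
    for sample_round = 1:batch_size
        n = ceil(rand * n_samples);                      % pick a random data point
        mix = (1 - p) * normpdf(x(n), 0, 1) + p * normpdf(x(n), th, 1);
        der_th = p * normpdf(x(n), th, 1) * (x(n) - th) / mix;        % from (a)
        der_pi = (normpdf(x(n), th, 1) - normpdf(x(n), 0, 1)) / mix;  % from (a)
        gradient = gradient + [der_th, der_pi];
    end
    th     = th + step_sizes(iter) * gradient(1);
    pi_hat = pi_hat + step_sizes(iter) * gradient(2) * p * (1 - p);   % chain rule from (b)
    p      = 1 / (1 + exp(-pi_hat));
    th_est(iter) = th; pi_est(iter) = p;
end

% Trace plot: estimates vs. iteration, with the true values as dashed horizontal lines.
figure; hold on;
plot(1:n_iter, th_est, 'b');
plot(1:n_iter, pi_est, 'r');
plot([1 n_iter], [theta_true theta_true], 'b--');
plot([1 n_iter], [pi_true pi_true], 'r--');
hold off; xlabel('iteration'); legend('\theta estimate', '\pi estimate');

With a setup of this kind, both traces should settle near the dashed true-value lines, which is the convergence behaviour asked about in part (c).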
Problem 2. “Stochastic Variational Inference.” Read the pseudo-code for the stochastic variational algorithm for the simple Gaussian mixture model (attached under Lecture 9), and do the following:

(a) Implement the steps (10), (12), and (13) of the pseudo-code in simple_svi_template.m. (Note that step 12 is already partially implemented.)

(b) Plot the true (as a horizontal line) and estimated values of the global parameters θ and π with respect to the number of iterations of the algorithm. Do this for batch sizes of 1, 2, 10, and 50. Comment briefly on the results.

Solution.

(a) Code below.

(b) Trace plots shown in Figure 2. Increasing the batch size seems to increase the convergence rate, which makes sense as the algorithm sees more data faster.

for iter = 1:n_iter
    % Allocate:
    pi_par = zeros(batch_size, 2); % First and second variational parameters for q(pi) for each sampled data point.
    eta = zeros(batch_size, 2);    % First and second variational parameters for q(th) for each sampled data point.

    % (4) Values that depend on q(pi) and that are needed when computing the
    % responsibilities. These are the same for all possible samples, and can
    % therefore be calculated once outside the inner loop.
    E_log_pi = psi(alpha_pi) - psi(alpha_pi + beta_pi);
    E_log_pi_c = psi(beta_pi) - psi(alpha_pi + beta_pi);

    for sampling_round = 1:batch_size
        selected_index = ceil(rand * n_samples); % Select randomly a sample whose local parameters to update

        % (7):
        E_log_var = (x(selected_index) - normal_mean).^2 + 1/normal_precision;

        % (8) Compute the responsibilities, factor q(z)
        log_rho1 = E_log_pi_c - 0.5 .* log(2*pi) - 0.5 .* (x(selected_index).^2);
        log_rho2 = E_log_pi - 0.5 .* log(2*pi) - 0.5 .* E_log_var;
        max_log_rho = max(log_rho1, log_rho2); % Normalize to avoid numerical problems when exponentiating.
        rho1 = exp(log_rho1 - max_log_rho);
        rho2 = exp(log_rho2 - max_log_rho);
        r2 = rho2 ./ (rho1 + rho2);
        r1 = 1 - r2;

        % (9) Compute intermediate global variational parameters of the
        % factor q(pi) = Beta(par1, par2), assuming the sampled data item is
        % replicated n_samples times.
        N1 = n_samples .* r1;
        N2 = n_samples .* r2;
        pi_par(sampling_round, 1) = N2 + alpha0; % First parameter
        pi_par(sampling_round, 2) = N1 + alpha0; % Second parameter

        % (10) Compute intermediate variational (natural) parameters of the
        % factor q(theta) = normal(par1, par2), par1 = mu/var, par2 = -1/(2*var),
        % where mu and var are the mean and variance of the Gaussian
        % distribution. Again assume that the sampled data item is
        % replicated n_samples times.
        x2_avg = x(selected_index);
        th_precision = beta0 + N2;
        th_var = 1 / th_precision;
        th_mu = th_var .* N2 .* x2_avg;
        eta(sampling_round, 1) = th_mu ./ th_var;     % The 1st natural parameter
        eta(sampling_round, 2) = -1 ./ (2 .* th_var); % The 2nd natural parameter
    end

    % (12) Update global variational parameters of factor q(pi)
    aux = mean(pi_par, 1); % New estimates, average over sampled data points.
    alpha_pi_new = aux(1);
    beta_pi_new = aux(2);
    alpha_pi = (1 - step_sizes(iter)) .* alpha_pi + step_sizes(iter) .* alpha_pi_new; % Updated estimate (combination of old and new)
    beta_pi = (1 - step_sizes(iter)) .* beta_pi + step_sizes(iter) .* beta_pi_new;

    % (12) Update global variational parameters of factor q(th)
    aux = mean(eta, 1); % Two natural parameters of the normal, averaged over sampled data points.
    eta_new1 = aux(1);
    eta_new2 = aux(2);
    eta1 = (1 - step_sizes(iter)) .* eta1 + step_sizes(iter) .* eta_new1;
    eta2 = (1 - step_sizes(iter)) .* eta2 + step_sizes(iter) .* eta_new2;

    % (13) Compute the 'standard' mean and precision parameters of
    % the factor q(th).
    normal_precision = -2 .* eta2;
    normal_mean = eta1 ./ normal_precision;

    % Keep track of the current estimates
    pi_est(iter) = alpha_pi / (alpha_pi + beta_pi);
    th_est(iter) = -0.5 .* eta1 ./ eta2;
end

Figure 2: Results with SVI.
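For context, the loop above relies on quantities that simple_svi_template.m defines beforehand (priors, initial variational parameters, and the step-size schedule). A sketch of one possible initialization is given below; the prior hyperparameters, initial values, iteration count, and step-size schedule (the same form as in Problem 1) are illustrative assumptions chosen so the loop runs, not the template's actual settings.

% Illustrative setup for the SVI loop above (assumed values, not from the template).
n_samples  = numel(x);               % data vector x assumed to be given
n_iter     = 2000;
batch_size = 10;                     % try 1, 2, 10, 50 for part (b)

alpha0 = 1;                          % symmetric Beta(alpha0, alpha0) prior on pi (assumption)
beta0  = 1;                          % prior precision of theta, i.e. theta ~ N(0, 1/beta0) (assumption)

alpha_pi = alpha0; beta_pi = alpha0;                % initial q(pi) = Beta(alpha_pi, beta_pi)
normal_mean = 0; normal_precision = beta0;          % initial q(theta)
eta1 = normal_mean * normal_precision;              % 1st natural parameter of q(theta)
eta2 = -0.5 * normal_precision;                     % 2nd natural parameter of q(theta)

kappa = 0.9; tau = 1;
step_sizes = ((1:n_iter) + tau).^(-kappa);          % Robbins-Monro step-size schedule

pi_est = zeros(n_iter, 1);           % traces of the point estimates for plotting
th_est = zeros(n_iter, 1);

After the loop, plotting pi_est and th_est against the iteration index, together with horizontal lines at the true values, gives the trace plots compared across batch sizes in part (b).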