Sharif University of Technology
Probabilistic Graphical Models (CE-956)
M. Soleymani
Homework 3: Learning
Deadline: 17 Ordibehesht

1: Parameter Learning in Bayesian Networks (25 Pts)

Figure 1: A Markov chain of length L.

1. (6 Pts) Figure 1 illustrates a Markov chain of length $L$ with $S$ states (i.e. $X_\ell \in \{1, 2, \dots, S\}$). The observations consist of $N$ instances of this chain; that is, the data set is
$$D = \{X_1^{(n)}, X_2^{(n)}, \dots, X_L^{(n)}\}_{n=1}^{N}.$$
$\pi \in \mathbb{R}^S$ and $A \in \mathbb{R}^{S \times S}$ are the model parameters. More precisely, the generative process is as follows:
$$p(X_1) = \mathrm{Categorical}(\pi_1, \dots, \pi_S), \qquad p(X_{\ell+1} = i \mid X_\ell = j) = A_{ij} \qquad (1)$$
Use maximum likelihood estimation to find the parameters $\pi$ and $A$. (Note: the Bayesian network of Figure 1 has shared CPDs, so we cannot use the maximum likelihood equations for Bayesian networks with disjoint CPDs.)

Figure 2: A Markov chain of length L. Unlike Figure 1, $\pi$ and $A$ are random variables.

2. (3 Pts) Figure 2 illustrates the Bayesian approach to learning the parameters $\pi$ and $A$. $\beta \in \mathbb{R}^S$ and $\alpha \in \mathbb{R}^{S \times S}$ are hyper-parameters. The prior distributions over $\pi$ and $A$ are
$$p(\pi) = \mathrm{Dirichlet}(\beta_1, \dots, \beta_S), \qquad p(A_{1j}, A_{2j}, \dots, A_{Sj}) = \mathrm{Dirichlet}(\alpha_{1j}, \alpha_{2j}, \dots, \alpha_{Sj}) \qquad (2)$$
Find the posterior distribution $p(\pi, A \mid D)$. (Note: you may use the fact that the Dirichlet distribution is the conjugate prior of the categorical distribution and use the equations of this table without proof.)

3. (4 Pts) Prove the following statement, and then compare the maximum likelihood and Bayesian approaches: "as $N \to \infty$, the expected value of the posterior distribution tends to the maximum likelihood estimate."

4. (4 Pts) Find the predictive distribution $p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid D)$. You can first prove Equation 3. The posterior distribution is a Dirichlet distribution, and the moments of the Dirichlet distribution are tractable and available here; you can use those equations to simplify the right-hand side of Equation 3.
$$p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid D) = \mathbb{E}_{(\pi, A) \sim p(\pi, A \mid D)}\!\left[\, p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid \pi, A) \,\right] \qquad (3)$$

5. (3 Pts) In the posterior distribution, the columns of $A$ are independent, but according to Figure 2 the columns of $A$ are not d-separated. Is this a contradiction? (Hint: you can refer to pages 746-747 of Koller's book.)

6. (5 Pts) Discuss the effect of the hyper-parameters $\alpha$ and $\beta$. Using a BDe prior is advantageous, especially when we want to learn the structure of a Bayesian network and its CPD parameters jointly. Describe some advantages of the BDe prior. (Hint: you can refer to page 806 of Koller's book.)

2: Parameter Learning of Markov Random Fields (44 Pts)

2.1 Consider an MRF with the log-linear form
$$P(X, \theta) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{c=1}^{M} \theta_c^T f_c(X_c) \Big\}.$$

1. (4 Pts) Prove that $\nabla^2_{\theta_c} \log Z(\theta) = \mathrm{var}(f_c)$.

2. (3 Pts) Why is the log-likelihood concave?

3. (5 Pts) As stated in the slides, learning the parameters of an MRF with gradient descent requires an inference step at every iteration, which is not efficient. An alternative to maximizing the likelihood is to maximize the pseudo-likelihood, defined as $\prod_i p(X_i \mid X_{-i})$. It can be a good estimate of the likelihood function (why?). The question is how to maximize this function; we want to use the gradient descent method. First show that each $p(X_i \mid X_{-i})$ can be written in the form $\frac{1}{Z_i(\theta)} \prod_{c:\, X_i \in c} f_c$.

4. (3 Pts) Then express $\nabla_{\theta_c} \log Z_i(\theta)$ in the form of an expectation.

5. (4 Pts) Write out each step of the gradient descent completely (a generic sketch of one such update appears after this question list).

6. (5 Pts) What are the pros and cons of using the pseudo-likelihood? Compare it with maximum likelihood. When would you use the pseudo-likelihood instead of the maximum likelihood, or vice versa?
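For reference, here is a minimal sketch of the gradient step discussed in parts 1-5 above, written in Python rather than in Dimple (which Section 2.2 asks you to use). It only illustrates the moment-matching form of the update that follows from the standard identity $\nabla_{\theta_c} \log Z(\theta) = \mathbb{E}_\theta[f_c]$; the names `fit_mrf`, `data_expectations`, and `model_expectations` are illustrative placeholders, not part of the assignment.

```python
# Minimal sketch: gradient ascent on the average log-likelihood of a log-linear MRF
# (equivalently, gradient descent on its negative). Using
#   grad_{theta_c} log Z(theta) = E_theta[f_c(X_c)],
# the gradient of the average log-likelihood is  E_data[f_c] - E_theta[f_c].
# `model_expectations` stands in for whatever inference routine returns E_theta[f_c].

import numpy as np

def fit_mrf(data_expectations, model_expectations, theta0, lr=0.1, n_iters=500):
    """data_expectations: array with E_data[f_c] for every factor c.
    model_expectations: function theta -> array of E_theta[f_c] (runs inference).
    theta0: initial parameter vector, one entry per factor parameter."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        grad = data_expectations - model_expectations(theta)  # moment matching
        theta += lr * grad                                     # ascent step
    return theta
```

For the pseudo-likelihood of parts 3-5, the same skeleton applies with the model expectations replaced by expectations under the local conditionals $p(X_i \mid X_{-i})$, which only require the local normalizers $Z_i(\theta)$ rather than full inference.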
2.2 (20 Pts) Consider the following Ising model with binary (0/1) random variables. 10000 samples are provided in the data file, and you are supposed to learn the parameters with the maximum likelihood and the pseudo-likelihood methods.
$$P(X, \theta) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{c=1}^{5} \theta_c X_c + \sum_{c=1}^{5} \theta_{c, c+1} X_c X_{c+1} \Big\} \qquad (4)$$

Probabilistic programming languages are developed specifically for probabilistic modelling and inference. These languages support many inference techniques, so you do not need to implement them yourself. One such language is Dimple; you should learn it and use it for this question.

1. Learn the parameters with maximum likelihood estimation. (You should implement the gradient descent yourself; in each step you may use any exact inference method supported in Dimple.)

2. Learn the parameters with the pseudo-likelihood objective function. (Pseudo-likelihood is supported in this language and you may use it.)

3. Compare the results.

3: Structure Learning in Markov Random Fields (16 Pts)

Suppose we have a discrete pairwise MRF $P_\theta(X) = \frac{1}{Z(\theta)} \exp\{\sum_{(s,t) \in E} \theta_{st} X_s X_t\}$ in which each random variable can only take values in $\{-1, 1\}$. We have independent samples from this MRF as training data, and the aim is to learn its structure. The structure of a PGM determines the conditional independence relations between its random variables. In this problem, we are going to learn the structure using $\ell_1$-regularized logistic regression [1]. The main idea is to think of the MRF as a weighted complete graph with zero weight on the non-existing edges.

1. (4 Pts) Prove the following statement:
$$P_\theta(X_r \mid X_{-r}) = \frac{\exp\big(2 X_r \sum_{t \in V \setminus r} \theta_{rt} X_t\big)}{1 + \exp\big(2 X_r \sum_{t \in V \setminus r} \theta_{rt} X_t\big)}$$

2. (4 Pts) As you can see, this has exactly the form of a logistic regression problem. Determine what the classification problem should be, and identify the random variables and the parameters we want to learn.

3. (4 Pts) Suppose we start from node $r$ and want to learn $\theta_{\setminus r} = \{\theta_{ru},\, u \in V \setminus r\}$ from the observations $\{x^{1}, \dots, x^{n}\}$. Show that we should solve the following optimization problem:
$$\min_{\theta_{\setminus r}} \ \frac{1}{n} \sum_{i=1}^{n} \log\Big\{ \exp\Big(x_r^i \sum_{t \in V \setminus r} \theta_{rt} x_t^i\Big) + \exp\Big(-x_r^i \sum_{t \in V \setminus r} \theta_{rt} x_t^i\Big) \Big\} \;-\; \sum_{u \in V \setminus r} \theta_{ru} \hat{\mu}_{ru} \;+\; \lambda \lVert \theta_{\setminus r} \rVert_1 \qquad (5)$$
where $\hat{\mu}_{ru} = \frac{1}{n} \sum_{i=1}^{n} x_r^i x_u^i$.

4. (4 Pts) By solving this problem for each $r$, we can learn all the parameters. What kinds of problems could occur? (Hint: refer to the paper!)

4: EM Algorithm (15 Pts)

Figure 3: A simple model for document classification.

Figure 3 illustrates a model for document classification. We have $D$ documents, and each document belongs to one of $T$ classes. The variable $z_d \in \{1, 2, \dots, T\}$ specifies the class of the $d$-th document. The $d$-th document contains $N_d$ words $\{w_{d1}, w_{d2}, \dots, w_{dN_d}\}$, and there are $V$ words in the dictionary. $\theta \in \mathbb{R}^T$ and $\varphi \in \mathbb{R}^{T \times V}$ are the parameters of the model. The generative process is as follows:

1. For $1 \le d \le D$, draw $z_d \sim \mathrm{Categorical}(\theta_1, \dots, \theta_T)$.
2. For $1 \le d \le D$ and $1 \le n \le N_d$, draw $w_{dn} \sim \mathrm{Categorical}(\varphi_{z_d 1}, \varphi_{z_d 2}, \dots, \varphi_{z_d V})$.

$\theta_t$ is the probability of assigning a document to the $t$-th class, and $\varphi_{tv}$ is the probability of adding the $v$-th word to a document of the $t$-th class. The $D$ documents are observed. We want to estimate $\theta$ and $\varphi$ using the EM algorithm. Derive the E-step and M-step equations.

References

[1] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al., "High-dimensional Ising model selection using $\ell_1$-regularized logistic regression," The Annals of Statistics, 2010.