Sharif University of Technology
Probabilistic Graphical Models (CE-956)
M. Soleymani
Homework 3: Learning
Deadline: 17 Ordibehesht
1: Parameter Learning in Bayesian Networks (25 Pts)
Figure 1: A Markov chain of length L.
1. (6 Pts) Figure 1 illustrates a Markov chain of length $L$ with $S$ states (i.e. $X_\ell \in \{1, 2, \dots, S\}$).
The observations consist of $N$ instances of this chain; that is, the dataset is
$D = \{X_1^{(n)}, X_2^{(n)}, \dots, X_L^{(n)}\}_{n=1}^{N}$. The model parameters are $\pi \in \mathbb{R}^{S}$ and
$A \in \mathbb{R}^{S \times S}$. More precisely, the generative process is as follows:
$$p(X_1) = \mathrm{Categorical}(\pi_1, \dots, \pi_S), \qquad p(X_{\ell+1} = i \mid X_\ell = j) = A_{ij} \tag{1}$$
Use maximum likelihood estimation to find the parameters π and A. (Note: The Bayesian network of
Figure 1 has shared CPDs, so we cannot use the maximum likelihood equations for Bayesian networks
with disjoint CPDs.)
Figure 2: A Markov chain of length L. Unlike Figure 1, π and A are random variables.
2. (3 Pts) Figure 2 illustrates the Bayesian approach to learning the parameters π and A.
$\beta \in \mathbb{R}^{S}$ and $\alpha \in \mathbb{R}^{S \times S}$ are hyper-parameters. The prior distributions over π and A are as follows:
$$p(\pi) = \mathrm{Dirichlet}(\beta_1, \dots, \beta_S), \qquad p(A_{1j}, A_{2j}, \dots, A_{Sj}) = \mathrm{Dirichlet}(\alpha_{1j}, \alpha_{2j}, \dots, \alpha_{Sj}) \tag{2}$$
Find the posterior distribution $p(\pi, A \mid D)$. (Note: You can use the fact that the Dirichlet distribution
is a conjugate prior for the categorical distribution, and you can use the equations of this table without
proof.)
3. (4 Pts) Prove the following statement (and then compare the maximum likelihood and Bayesian
approaches): "As N → ∞, the expected value of the posterior distribution converges to the
maximum likelihood estimate."
4. (4 Pts) Find the predictive distribution $p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid D)$. You can first prove
Equation 3. The posterior distribution is a Dirichlet distribution, and the moments of the Dirichlet
distribution are tractable and available here. You can use these equations to simplify the
right-hand side of Equation 3.
$$p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid D) = \mathbb{E}_{(\pi, A) \sim p(\pi, A \mid D)}\left[ p(X_1^{(N+1)}, \dots, X_L^{(N+1)} \mid \pi, A) \right] \tag{3}$$
5. (3 Pts) In the posterior distribution, the columns of A are independent. However, according to
Figure 2, the columns of A are not d-separated. Is this a contradiction? (Hint: You can refer to
pages 746 and 747 of Koller’s book.)
6. (5 Pts) Discuss the effect of the hyper-parameters α and β. Using a BDe prior is advantageous,
especially when we want to learn the structure of a Bayesian network and its CPD parameters
jointly. Describe some advantages of the BDe prior. (Hint: You can refer to page 806 of
Koller’s book.)
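As a companion to the generative process in Equation 1, the following is a minimal Python sketch that samples chains and tallies the initial-state and transition counts that the likelihood depends on. It is an illustration under stated assumptions, not a solution: the values of `pi` and `A`, the function names `sample_chain` and `count_statistics`, and the 0-based state labels are all hypothetical choices made for this example.

```python
import random

def sample_chain(pi, A, L, rng):
    """Draw one chain X_1..X_L; states are 0-based here (0..S-1)."""
    S = len(pi)
    states = list(range(S))
    x = rng.choices(states, weights=pi)[0]        # X_1 ~ Categorical(pi)
    chain = [x]
    for _ in range(L - 1):
        # Column convention of Eq. (1): A[i][j] = p(next = i | current = j)
        column = [A[i][x] for i in states]
        x = rng.choices(states, weights=column)[0]
        chain.append(x)
    return chain

def count_statistics(chains, S):
    """Tally initial states and transitions, pooled over all positions."""
    init_counts = [0] * S
    trans_counts = [[0] * S for _ in range(S)]    # trans_counts[i][j]: j -> i
    for chain in chains:
        init_counts[chain[0]] += 1
        for prev, nxt in zip(chain, chain[1:]):
            trans_counts[nxt][prev] += 1
    return init_counts, trans_counts

rng = random.Random(0)
pi = [0.5, 0.3, 0.2]                              # hypothetical values
A = [[0.8, 0.1, 0.3],
     [0.1, 0.7, 0.3],
     [0.1, 0.2, 0.4]]                             # each column sums to 1
N, L = 200, 10
chains = [sample_chain(pi, A, L, rng) for _ in range(N)]
init_counts, trans_counts = count_statistics(chains, S=3)
```

Transitions are pooled over all positions because the CPD A is shared along the chain, which is the point of the note in item 1.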
2: Parameter Learning of Markov Random Field (44 Pts)
2.1 Consider an MRF with the log-linear form: $P(X, \theta) = \frac{1}{Z(\theta)} \exp\{\sum_{c=1}^{M} \theta_c^{T} f_c(X_c)\}$.
1. (4 Pts) Prove that $\nabla^2_{\theta_c} \log Z(\theta) = \mathrm{var}(f_c)$.
2. (3 Pts) Why is the log likelihood concave?
3. (5 Pts) As stated in the slides, learning the parameters of an MRF with gradient descent requires
inference at every step, which is not efficient. An alternative to maximizing the likelihood is to
maximize the pseudo-likelihood, defined as $\prod_i p(X_i \mid X_{-i})$. It can be a good
estimate of the likelihood function (why?). The problem is how to maximize this function.
We want to use the gradient descent method to maximize it. First show that we can write
each $p(X_i \mid X_{-i})$ in the form $\frac{1}{Z_i(\theta)} \prod_{c : X_i \in c} f_c$.
4. (3 Pts) Then compute $\nabla_{\theta_c} \log Z_i(\theta)$ in the form of an expectation.
5. (4 Pts) Write each step of the gradient descent completely.
6. (5 Pts) What are the pros and cons of using the pseudo-likelihood? Compare it with maximum
likelihood. When would you use the pseudo-likelihood instead of maximum likelihood, and vice
versa?
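To make the definition of the pseudo-likelihood concrete, here is a small Python sketch that evaluates both the exact log-likelihood and the log pseudo-likelihood of one configuration by brute force. It assumes, purely for illustration, a binary (0/1) chain with node and edge parameters; the function names and parameter values are hypothetical and not part of the assignment.

```python
import itertools
import math

def score(x, theta_node, theta_edge):
    """Unnormalized log-probability of a binary chain configuration x."""
    s = sum(t * xi for t, xi in zip(theta_node, x))
    s += sum(t * x[i] * x[i + 1] for i, t in enumerate(theta_edge))
    return s

def log_likelihood(x, theta_node, theta_edge):
    """Exact log p(x): needs the global partition function Z (2^n terms)."""
    n = len(x)
    log_z = math.log(sum(math.exp(score(y, theta_node, theta_edge))
                         for y in itertools.product([0, 1], repeat=n)))
    return score(x, theta_node, theta_edge) - log_z

def log_pseudo_likelihood(x, theta_node, theta_edge):
    """Sum over i of log p(x_i | x_-i); each term needs only a local Z_i."""
    total = 0.0
    for i in range(len(x)):
        # Only x_i varies; terms that do not involve x_i cancel in the ratio.
        alternatives = [x[:i] + (v,) + x[i + 1:] for v in (0, 1)]
        log_zi = math.log(sum(math.exp(score(y, theta_node, theta_edge))
                              for y in alternatives))
        total += score(x, theta_node, theta_edge) - log_zi
    return total

x = (1, 0, 1, 1)
theta_node = [0.2, -0.4, 0.1, 0.3]        # hypothetical parameters
theta_edge = [0.5, -0.3, 0.8]
ll = log_likelihood(x, theta_node, theta_edge)
pll = log_pseudo_likelihood(x, theta_node, theta_edge)
```

The global normalizer sums over 2^n configurations, while each local normalizer sums over only the two values of one variable. When all edge parameters are zero the variables are independent and the two quantities coincide.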
2.2 (20 Pts)
Consider the following Ising model with binary (0/1) random variables. 10000 samples are
provided in the data file, and you are supposed to learn the parameters with the maximum likelihood
and the pseudo-likelihood methods.
$$P(X, \theta) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{c=1}^{5} \theta_c X_c + \sum_{c=1}^{5} \theta_{c,c+1} X_c X_{c+1} \Big\} \tag{4}$$
Probabilistic programming languages are developed specifically for probabilistic modelling and inference. These languages support many inference techniques, so you do not need to implement
them yourself. One of these languages is Dimple. You should learn it and use it for this question.
1. Learn the parameters with maximum likelihood estimation. (You should implement
gradient descent yourself; in each step you can use any exact inference method supported in
Dimple.)
2. Learn the parameters with the pseudo-likelihood objective function. (Pseudo-likelihood is
supported in Dimple, and you can use it.)
3. Compare the results.
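The structure of the maximum likelihood loop can be sketched in plain Python, with brute-force enumeration standing in for the exact inference that Dimple would provide. This is an illustration only and does not use Dimple's API. It assumes a 5-node chain with 4 pairwise terms (adjust if the intended model has a different edge set), and it replaces the data file with synthetic samples drawn from hypothetical parameters `true_theta`.

```python
import itertools
import math
import random

N_NODES = 5
STATES = list(itertools.product([0, 1], repeat=N_NODES))

def features(x):
    """Node features x_c followed by pairwise features x_c * x_{c+1}."""
    return list(x) + [x[c] * x[c + 1] for c in range(N_NODES - 1)]

FEATS = [features(x) for x in STATES]

def model_distribution(theta):
    """Exact p(x) for all 32 states by enumeration (the inference step)."""
    weights = [math.exp(sum(t * f for t, f in zip(theta, fx))) for fx in FEATS]
    z = sum(weights)
    return [w / z for w in weights]

def model_moments(theta):
    """Expected value of every feature under the current model."""
    probs = model_distribution(theta)
    return [sum(p * fx[k] for p, fx in zip(probs, FEATS))
            for k in range(len(theta))]

def fit_ml(empirical, step=0.5, iters=3000):
    """Gradient ascent on the average log-likelihood."""
    theta = [0.0] * len(empirical)
    for _ in range(iters):
        expected = model_moments(theta)
        # Gradient of the average log-likelihood: E_data[f] - E_model[f]
        theta = [t + step * (e - m)
                 for t, e, m in zip(theta, empirical, expected)]
    return theta

# Synthetic stand-in for the data file: exact samples from hypothetical parameters.
rng = random.Random(0)
true_theta = [0.5, -0.3, 0.2, -0.1, 0.4, 1.0, -0.8, 0.6, -0.5]
data = rng.choices(STATES, weights=model_distribution(true_theta), k=2000)
empirical = [sum(features(x)[k] for x in data) / len(data) for k in range(9)]
theta_hat = fit_ml(empirical)
gradient = [e - m for e, m in zip(empirical, model_moments(theta_hat))]
```

Each iteration calls `model_moments`, which is the per-step inference cost that motivates the pseudo-likelihood alternative. Enumeration is feasible here only because the model has 32 joint states.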
3: Structure Learning in Markov Random Field (16 Pts)
Suppose we have a discrete pairwise MRF $P_\theta(X) = \frac{1}{Z(\theta)} \exp\{\sum_{(s,t) \in E} \theta_{st} X_s X_t\}$, where each random
variable takes values in $\{-1, 1\}$. We have independent samples from this MRF as
training data, and the aim is to learn its structure. The structure of a PGM determines
the conditional independence relations between its random variables. In this problem, we are going
to learn the structure using ℓ1-regularized logistic regression [1]. The main idea is to think of the
MRF as a weighted complete graph with zero weight on the non-existent edges.
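1. (4 Pts) Prove the following statement:
$$P_\theta(X_r \mid X_{-r}) = \frac{\exp\left(2 X_r \sum_{t \in V \setminus r} \theta_{rt} X_t\right)}{1 + \exp\left(2 X_r \sum_{t \in V \setminus r} \theta_{rt} X_t\right)}$$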
2. (4 Pts) As you can see, this has exactly the form of a logistic regression problem. Determine
what the classification problem is, and what the random variables and the parameters to be
learned are.
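3. (4 Pts) Suppose we start from node r and we want to learn $\theta_{\setminus r} = \{\theta_{ru}, u \in V \setminus r\}$ from
observations $\{x^1, \dots, x^n\}$. Show that we should solve the following optimization problem:
$$\min_{\theta_{\setminus r}} \; \frac{1}{n} \sum_{i=1}^{n} \log\Big\{ \exp\Big(x_r^i \sum_{t \in V \setminus r} \theta_{rt} x_t^i\Big) + \exp\Big(-x_r^i \sum_{t \in V \setminus r} \theta_{rt} x_t^i\Big) \Big\} - \sum_{u \in V \setminus r} \theta_{ru} \hat{\mu}_{ru} + \lambda \|\theta_{\setminus r}\|_1 \tag{5}$$
where $\hat{\mu}_{ru} = \frac{1}{n} \sum_{i=1}^{n} x_r^i x_u^i$ is the empirical moment, as defined in [1].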
4. (4 Pts) By solving it for each r we can learn all the parameters. What kind of problems could
occur? (Hint: refer to the paper!)
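The per-node regression can be illustrated with a short, dependency-free Python sketch. It minimizes the ℓ1-regularized objective of [1] for one node, with the linear term $\sum_t \theta_{rt} x_t^i$ inside the exponentials, using proximal gradient descent (ISTA). The choice of ISTA, the step size, the value of `lam`, and the 3-node test model with a single edge are assumptions made for this example, not the procedure prescribed by the paper.

```python
import itertools
import math
import random

def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v toward zero)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_neighborhood(samples, r, lam, step=0.2, iters=150):
    """Proximal gradient (ISTA) on the l1-regularized objective for node r."""
    others = [t for t in range(len(samples[0])) if t != r]
    n = len(samples)
    # Empirical moments mu_hat[u] = (1/n) * sum_i x_r^i * x_u^i
    mu_hat = [sum(x[r] * x[u] for x in samples) / n for u in others]
    theta = [0.0] * len(others)
    for _ in range(iters):
        grad = [-m for m in mu_hat]
        for x in samples:
            a = sum(th * x[t] for th, t in zip(theta, others))
            tanh_a = math.tanh(a)         # derivative of log(e^a + e^-a)
            for k, t in enumerate(others):
                grad[k] += tanh_a * x[t] / n
        theta = [soft_threshold(th - step * g, step * lam)
                 for th, g in zip(theta, grad)]
    return dict(zip(others, theta))

# Hypothetical test model: 3 nodes in {-1, +1}, one edge (0, 1) with weight 0.8.
rng = random.Random(0)
states = list(itertools.product([-1, 1], repeat=3))
weights = [math.exp(0.8 * x[0] * x[1]) for x in states]
samples = rng.choices(states, weights=weights, k=1000)
theta_hat = fit_neighborhood(samples, r=0, lam=0.05)
```

Here `theta_hat` maps each other node to its estimated weight for node 0. Nodes with a weight at or near zero are treated as non-neighbors, and repeating the fit for every r yields one neighborhood estimate per node.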
4: EM algorithm (15 Pts)
Figure 3: A simple model for document classification.
Figure 3 illustrates a model for document classification. We have D documents, and each document
belongs to one of T classes. The variable $z_d \in \{1, 2, \dots, T\}$ specifies the class of the d-th document.
The d-th document contains $N_d$ words $\{w_{d1}, w_{d2}, \dots, w_{dN_d}\}$. There are V words in the dictionary.
$\theta \in \mathbb{R}^{T}$ and $\phi \in \mathbb{R}^{T \times V}$ are the parameters of the model. The generative process is as follows:
1. For $1 \le d \le D$, draw $z_d \sim \mathrm{Categorical}(\theta_1, \dots, \theta_T)$.
2. For $1 \le d \le D$ and $1 \le n \le N_d$, draw $w_{dn} \sim \mathrm{Categorical}(\phi_{z_d 1}, \phi_{z_d 2}, \dots, \phi_{z_d V})$.
$\theta_t$ is the probability of assigning a document to the t-th class. $\phi_{tv}$ is the probability of adding the
v-th word to a document in the t-th class. The D documents are observed. We want to estimate θ and
ϕ using the EM algorithm. Derive the E-step and M-step equations.
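The two-step generative process above can be sketched in Python as follows. This only simulates data from the model and does not implement EM. The function name `sample_corpus`, the parameter values, and the 0-based class and word indices are hypothetical choices for illustration.

```python
import random

def sample_corpus(theta, phi, doc_lengths, rng):
    """Sample a class and a list of word indices for each document."""
    T, V = len(theta), len(phi[0])
    labels, docs = [], []
    for n_words in doc_lengths:
        # Step 1: z_d ~ Categorical(theta)
        z = rng.choices(range(T), weights=theta)[0]
        # Step 2: each word w_dn ~ Categorical(phi[z_d]), drawn independently
        words = rng.choices(range(V), weights=phi[z], k=n_words)
        labels.append(z)
        docs.append(words)
    return labels, docs

rng = random.Random(0)
theta = [0.6, 0.4]                        # hypothetical: T = 2 classes
phi = [[0.7, 0.1, 0.1, 0.1],              # hypothetical: V = 4 words
       [0.1, 0.1, 0.1, 0.7]]
doc_lengths = [5, 8, 3, 10]
labels, docs = sample_corpus(theta, phi, doc_lengths, rng)
```

In the learning problem only `docs` is observed. The entries of `labels` play the role of the latent variables $z_d$ that the E-step reasons about.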
References
[1] Ravikumar, Pradeep, Wainwright, Martin J., and Lafferty, John D. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 2010.