
40-957 Special Topics in Artificial Intelligence:
Probabilistic Graphical Models
Homework 4 (70 + 20 Pts)
Due date: 1393/3/2
Directions: Submit your code and report to "[email protected]" with the subject "HW4 Submission-student number" (for example: "HW4 Submission-90207018").
Late submission policy: Late submissions carry a penalty of 10% per day after the deadline.
Problem 1 (30 Pts)
In this question, you will implement algorithms for learning an HMM to model a customer and predict his behavior. At each time step the customer is in one of the states "rich" or "poor", and his probability of buying "expensive" or "cheap" items depends on his state. Since we only know the history of the customer's purchases, we can model the problem by an HMM with two hidden states ("rich" and "poor") and two kinds of observations (whether he buys "expensive" or "cheap" products). The parameters of this model (initial probabilities, transition probabilities, and observation probabilities) should be learned from the history of the $T = 5000$ previous observations (given in the "Observations.mat" file). We can then use the model to predict the next observations.
Hint: You may need to work in log-space to avoid numerical problems when dealing with joint probabilities of O(T) variables.
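For reference, here is a minimal numerically stable log-sum-exp helper in Matlab (the name logsumexp and its use as a separate file are only a suggestion); it can be reused inside the forward, backward, and Viterbi recursions of this problem:

function s = logsumexp(v)
% LOGSUMEXP  Stable computation of log(sum(exp(v))) for a vector v of
% log-probabilities; subtracting the maximum avoids underflow to -Inf.
    m = max(v);
    if ~isfinite(m)     % all entries are -Inf (zero total probability mass)
        s = m;
        return;
    end
    s = m + log(sum(exp(v - m)));
end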
1.1 (10 Pts) Implement the Baum-Welch algorithm for learning the HMM parameters (you are not allowed to use Matlab's built-in HMM functions). Run the algorithm for 10 different random initializations. Does the algorithm always converge to the same value? If not, choose the parameters with the largest likelihood $l(\theta) = P(\text{observations} \mid \theta)$.
Note: You can check the convergence of the algorithm by measuring the change in the value of the parameters (for example, you can stop iterating when no parameter changes by more than 0.00001).
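For instance, a possible stopping test inside the Baum-Welch loop could look like the sketch below (pi_est, A_est, B_est are the current estimates and the *_prev variables hold the values from the previous iteration; all names are placeholders):

delta = max([ max(abs(pi_est(:) - pi_prev(:))), ...
              max(abs(A_est(:)  - A_prev(:))),  ...
              max(abs(B_est(:)  - B_prev(:))) ]);   % largest parameter change
if delta < 1e-5
    break;    % no parameter changed by more than 0.00001: stop iterating
end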
1.2 (6 Pts) In this part, we will use an exhaustive search to find the model parameters $\theta = (\pi, A, B)$ which maximize the likelihood function $l(\theta)$. Note that there is a total of 5 free parameters: $\pi_1, A_{1,2}, A_{2,1}, B_{1,1}, B_{2,1}$.
Grid the whole parameter space by varying each parameter from 0 to 1 in steps of 0.1 ($11^5$ combinations in total), each time computing the log-likelihood of the first 500 observations, and store the results in an array. (It may take about an hour to compute them all; a loop sketch is given after the questions below.) Using this sampled log-likelihood function, answer the following questions:
1.2.1 How many local maxima does the likelihood function have? What is the worst one? (We call a point in this 5-dimensional grid a local maximum if moving one step in each of the 10 axis-aligned directions causes the likelihood to decrease.)
1.2.2 Find the parameter set which globally maximizes the likelihood. Is it the same as the one you found in part 1.1?
1.2.3 Is the global maximum unique? Why?
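A rough sketch of the grid loop for part 1.2 is given below; hmm_loglik is a placeholder for your own routine from part 1.1 that returns $\log P(O_1, \ldots, O_{500} \mid \theta)$, and O is the observation vector loaded from "Observations.mat":

vals = 0:0.1:1;                            % 11 grid points per parameter
L = -Inf(11, 11, 11, 11, 11);              % log-likelihood for every combination
for i1 = 1:11, for i2 = 1:11, for i3 = 1:11, for i4 = 1:11, for i5 = 1:11
    Pi = [vals(i1), 1 - vals(i1)];                          % [pi_1, pi_2]
    A  = [1 - vals(i2), vals(i2); vals(i3), 1 - vals(i3)];  % rows sum to 1
    B  = [vals(i4), 1 - vals(i4); vals(i5), 1 - vals(i5)];  % P(obs | state)
    L(i1, i2, i3, i4, i5) = hmm_loglik(Pi, A, B, O(1:500));
end, end, end, end, end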
1.3 (4 Pts) Consider an HMM with $n$ states, and suppose that $\theta^* = (\pi^*, A^*, B^*)$ is a global maximum of the likelihood function. Prove that if all elements of $A^*$ are distinct (i.e., $\forall i, j, k, l : (i, j) \neq (k, l) \rightarrow A^*_{i,j} \neq A^*_{k,l}$), then there exist at least $n!$ distinct $\theta$'s maximizing the likelihood (of which $\theta^*$ is just one).
1.4 (10 Pts) Using the best parameters obtained in part 1.2, and given that $O_1, \ldots, O_{5000}$ are all observed, answer the following questions:
1.4.1 Find the most probable state sequence $S_1^*, \ldots, S_{5000}^*$ using the Viterbi algorithm. Mention the last 50 states in your report.
$$S_1^*, \ldots, S_T^* = \arg\max_{S_1, \ldots, S_T} P(S_1, \ldots, S_T \mid O_1, \ldots, O_T, \theta^*) \qquad (1)$$
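For reference, the standard log-space Viterbi recursion (with $B_j(O_t)$ denoting the probability of observing $O_t$ in state $j$) is
$$\delta_1(j) = \log \pi_j + \log B_j(O_1), \qquad \delta_t(j) = \max_i \big[ \delta_{t-1}(i) + \log A_{i,j} \big] + \log B_j(O_t),$$
with backpointers $\psi_t(j) = \arg\max_i \big[ \delta_{t-1}(i) + \log A_{i,j} \big]$; the optimal sequence is recovered by backtracking from $S_T^* = \arg\max_j \delta_T(j)$.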
1.4.2 Find the most probable state of the customer at each time. Mention the last 50 states in your report.
$$\forall i : \hat{S}_i^* = \arg\max_{S_i} P(S_i \mid O_1, \ldots, O_T, \theta^*) \qquad (2)$$
1.4.3 What are the most probable next three observations?
$$(O_{5001}^*, O_{5002}^*, O_{5003}^*) = \arg\max_{O_{5001}, O_{5002}, O_{5003}} P(O_{5001}, O_{5002}, O_{5003} \mid O_1, \ldots, O_T, \theta^*)$$
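One way to organize this computation is through the filtered distribution $P(S_T \mid O_{1:T}, \theta^*)$ obtained from the forward pass:
$$P(O_{T+1}, O_{T+2}, O_{T+3} \mid O_{1:T}, \theta^*) = \sum_{S_T, \ldots, S_{T+3}} P(S_T \mid O_{1:T}, \theta^*) \prod_{t=T+1}^{T+3} A_{S_{t-1}, S_t}\, B_{S_t}(O_t),$$
which can be evaluated for each of the $2^3$ candidate observation triples.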
Problem 2 (15 Pts)
Assume we have an HMM with the following observation model (a Gaussian mixture model):
$$p(x_t \mid z_t = i, \theta) = \sum_{k=1}^{K} w_{ik}\, \mathcal{N}(x_t \mid \mu_{ik}, \Sigma_{ik}), \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T \qquad (3)$$
where $x_t \in \mathbb{R}^M$. In the above equation, $\theta = (\Pi, W, A, B)$, where $\Pi = [\pi_1, \ldots, \pi_N]$ (with $\pi_i = p(z_1 = i)$) is the initial state distribution, $W = \{w_{ik}\}$ ($i = 1, \ldots, N$; $k = 1, \ldots, K$) are the mixing proportions, $A(i, j) = p(z_t = j \mid z_{t-1} = i)$ is the transition matrix, and $B = \{\mu_{ik}, \Sigma_{ik}\}$ ($i = 1, \ldots, N$; $k = 1, \ldots, K$) are the parameters of the class-conditional densities.
Assume we want to learn the parameters $\theta$ from a set of training sequences $\{X^j = [x_1^j, \ldots, x_T^j]\}_{j=1}^{P}$ using the EM algorithm. However, in many applications the observations are high-dimensional vectors ($M$ is very large). Hence, estimating the parameters of $NK$ Gaussians ($NKM + NKM^2$ values) requires a large amount of data. A simple solution is to use just $K$ Gaussians instead of $NK$ Gaussians, and to let the state influence the mixing weights but not the means and covariances. This relaxed HMM model is called a tied-mixture HMM.
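In other words, under this tying the observation model of Eq. (3) becomes
$$p(x_t \mid z_t = i, \theta) = \sum_{k=1}^{K} w_{ik}\, \mathcal{N}(x_t \mid \mu_k, \Sigma_k), \qquad i = 1, \ldots, N,$$
with the means $\mu_k$ and covariances $\Sigma_k$ shared across all states, so only $KM + KM^2$ Gaussian parameters remain.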
2.1 (5 Pts) Plot the graphical representation of this model.
2.2 (10 Pts) Derive the E step and M step for learning θ using the training data.
Problem 3 (15 Pts)
CRFs have many applications in fields such as Computer Vision, Speech Recognition, and NLP. One application of CRFs in text analysis is finding the Noun Phrases (NPs) in a sentence. For instance, consider the following sentence:
"I am the heisenberg, and I want to kill you now at this place." We denote by $x_i$ the tokens in the sentence and by $y_i \in \Gamma = \{B, I, O\}$ the labels, where B, I, and O stand for Beginning of an NP, Intermediate token in an NP, and Other, respectively. For example, for the above sentence the labels would be:
I [B] am [O] the [B] heisenberg [I], and [O] I [B] want [O] to [O] kill [O] you [B] now [O] at [O] this [B] place [I].
Now, consider the following CRF model for the above problem:
$$P(y \mid x; w) = \frac{1}{Z(x; w)} \exp\Big\{ w^T \sum_{i=1}^{N} f(x_i, y_i, y_{i-1}) \Big\} \qquad (4)$$
where
$$Z(x; w) = \sum_{y' \in \Gamma^N} \exp\Big\{ w^T \sum_{i=1}^{N} f(x_i, y_i', y_{i-1}') \Big\} \qquad (5)$$
and $w \in \mathbb{R}^d$ is the free parameter of the model, $f : \Sigma \times \Gamma \times \Gamma \to \mathbb{R}^d$ is the feature vector ($\Sigma$ is the English vocabulary), and $N$ is the number of words in the sentence.
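As a small illustration of Eq. (4), the sketch below computes the unnormalized log-score $w^T \sum_i f(x_i, y_i, y_{i-1})$ of one fixed labeling in Matlab; featfun is a placeholder handle returning the $d$-dimensional feature vector, tokens is a cell array of words, labels encodes B, I, O as 1, 2, 3, and the convention used for $y_0$ is up to you:

START = 0;                 % placeholder label for y_0 (a convention you must fix)
score = 0;
for i = 1:numel(tokens)
    if i == 1
        yprev = START;
    else
        yprev = labels(i-1);
    end
    score = score + w' * featfun(tokens{i}, labels(i), yprev);
end
% exp(score) / Z(x; w) then gives P(y | x; w) as in Eq. (4).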
3.1 (3 Pts) Draw the graphical representation of the above model (for simplicity, assume that $f$ can be decomposed as $f(x_i, y_i, y_{i-1}) = g(x_i, y_i) + h(y_i, y_{i-1})$).
3.2 (2 Pts) Define two sample feature functions for this problem.
3.3 (6 Pts) Given the sequence $\{x_1, \ldots, x_N\}$, $w$, $f$, and the CRF, suggest a polynomial-time algorithm which calculates the marginal probability that the subsequence $\{x_j, \ldots, x_{j+k}\}$ is an NP.
Hint: a sequence $\{x_j, \ldots, x_{j+k}\}$ is an NP if and only if $y_j = B$ and $y_{j+1} = \ldots = y_{j+k} = I$.
3.4 (4 Pts) In order to learn the free parameter $w$ of the CRF from the data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$ using maximum likelihood, we should maximize the following conditional log-likelihood function:
$$L(w) = \sum_{(x,y) \in D} \log p(y \mid x; w) = \sum_{(x,y) \in D} \Big( w^T \sum_{i=1}^{N} f(x_i, y_i, y_{i-1}) - \log Z(x; w) \Big)$$
Show that:
$$\frac{\partial L}{\partial w} = \sum_{(x,y) \in D} \sum_{i=1}^{N} \Big( f(x_i, y_i, y_{i-1}) - \mathbb{E}_{p(y' \mid x; w)}\big[ f(x_i, y_i', y_{i-1}') \big] \Big)$$
Problem 4 (30 Pts)
In this problem, you are going to implement a Kalman filter for object tracking in the provided video sequence (you should track the red car). A sample frame of the sequence is shown in Fig. 1. The implemented Kalman filter should estimate the state $x_t = (x_t, y_t, h_t, w_t)$ of the red car, where $(x_t, y_t)$ denotes the location of the upper-left corner of the bounding box in frame $t$, and $h_t$ and $w_t$ denote the height and the width of the bounding box, respectively. The output of your code must be a video sequence in which each frame displays the state of the object as a bounding box (Fig. 2).
To implement the Kalman filter, consider the state and measurement equations
$$x_k = F_k x_{k-1} + v_{k-1}, \qquad z_k = H_k x_k + n_k$$
where $v_{k-1} \sim \mathcal{N}(0, Q_{k-1})$ and $n_k \sim \mathcal{N}(0, R_k)$ are the state and measurement noise, $F_k = H_k = I$, and $Q_k = R_k = \mathrm{diag}(\sigma_x, \sigma_y, \sigma_h, \sigma_w)$.
Since the noises are Gaussian and the state and measurement equations are linear, the state posterior given all observations so far, $p(x_k \mid z_{1:k})$, is always Gaussian; hence we only need to update the mean and covariance of this distribution.
Matlab skeleton code and supplementary functions have been provided; these functions accomplish tasks such as reading files, displaying images, and initializing the tracker. Hints are also provided in the comments of the skeleton code.
Figure 1 A sample frame of the video sequence for Problem 4.
4.1 (20 Pts) Implement the tracker. You must do the following steps:
1. Initialize the state $x_0$ with the object position in the first frame. Set $Q_k = R_k = \mathrm{diag}(4, 4, 2, 2)$.
2. Predict the object position using $p(x_k \mid z_{1:k-1}) = \mathcal{N}(m_{k|k-1}, P_{k|k-1})$.
3. To find the measurement $z_k$, generate 100 object candidates by sampling from $p(x_k \mid z_{1:k-1})$ and take the best one as $z_k$. To find the best candidate you must define an observation model (the observation model computes the likelihood of the observed image data given the corresponding state). This should be done by comparing a color histogram extracted from the candidate object's bounding box with a known color model extracted beforehand. You should model the likelihood as $P(y_k \mid x_k) = \exp(-\lambda D(h, h^*))$, where $y_k$ is the image of the current frame and $D(h, h^*)$ is the KL divergence between the two histograms (you may also use the Bhattacharyya distance), $h$ is the color histogram corresponding to the candidate $x_k$, and $h^*$ is a known color model (the histogram of the manually initialized bounding box of the object in the first frame, or of the best candidate in the previous frame). For your convenience, set $\lambda = 15$.
Figure 2 Sample frames of the output video sequence.
4. Update the object state using the observation $z_k$ to obtain $p(x_k \mid z_{1:k}) = \mathcal{N}(m_{k|k}, P_{k|k})$. In the case of zero-mean Gaussian noise, the parameters can easily be found as [1]:
$$m_{k|k-1} = F_k m_{k-1|k-1}$$
$$P_{k|k-1} = Q_{k-1} + F_k P_{k-1|k-1} F_k^T$$
$$m_{k|k} = m_{k|k-1} + K_k (z_k - H_k m_{k|k-1})$$
$$P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1}$$
where
$$S_k = H_k P_{k|k-1} H_k^T + R_k, \qquad K_k = P_{k|k-1} H_k^T S_k^{-1}$$
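Putting these equations together for this problem (where $F_k = H_k = I$), one predict/update cycle could look like the following sketch; m and P are assumed to hold the current posterior mean (4x1) and covariance (4x4), and z is the measurement chosen in step 3:

Q = diag([4 4 2 2]);   R = diag([4 4 2 2]);   % Qk = Rk = diag(4, 4, 2, 2)
% Predict step: p(x_k | z_{1:k-1}) = N(m_pred, P_pred), with F_k = I.
m_pred = m;
P_pred = P + Q;
% Update step with measurement z, using H_k = I.
S = P_pred + R;                     % innovation covariance S_k
K = P_pred / S;                     % Kalman gain K_k = P_{k|k-1} * inv(S_k)
m = m_pred + K * (z - m_pred);      % posterior mean m_{k|k}
P = P_pred - K * P_pred;            % posterior covariance P_{k|k}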
4.2 (10 Pts) Derive the Kalman filter equations for the mentioned linear-Gaussian filtering model with non-zero-mean noises, $v_{k-1} \sim \mathcal{N}(m_q, Q)$ and $n_k \sim \mathcal{N}(m_r, R)$.