40-957 Special Topics in Artificial Intelligence:
Probabilistic Graphical Models
Homework 5 (130 + 20)
Due date: 1393/4/5
Directions: Submit your code and report to "[email protected]"
with the subject "HW5 Submission-student number" (for example: "HW5
Submission-90207018")
Late submission policy: late submissions carry a penalty of 10% of the assignment grade per day after the deadline.
Problem 1 (15 Pts)
Consider a mixture of two one-dimensional Gaussians,

p(x) = w N(x; μ_1, σ_1^2) + (1 − w) N(x; μ_2, σ_2^2).    (1)

Now, assume the proposal distribution q(x′) is a Gaussian centered at the current state x,

q(x′) = N(x′ | x, σ_p^2).    (2)
1.1 Implement a Metropolis-Hastings sampler.
Now, assume you are given the parameters of the target distribution as

w = 0.3, μ_1 = 0, μ_2 = 10, σ_1 = σ_2 = 2.    (3)
1.2 Draw N = 100, 1000, 10000 samples for each σ_p = 0.1, 1, 10. Plot a histogram of the samples in each experiment and discuss the results.
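A minimal sketch of such a sampler in Python/NumPy is given below, with the Eq. 3 parameters hard-coded only for illustration; the function names, the starting point x0 = 0, and the use of every draw (no burn-in) are my own choices, not part of the assignment.

```python
import numpy as np

def gmm_pdf(x, w=0.3, mu1=0.0, mu2=10.0, s1=2.0, s2=2.0):
    """Target density of Eq. 1: a two-component 1-D Gaussian mixture."""
    n1 = np.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    n2 = np.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return w * n1 + (1 - w) * n2

def metropolis_hastings(n_samples, sigma_p, x0=0.0, rng=None):
    """Random-walk Metropolis-Hastings with the Gaussian proposal of Eq. 2.

    Because the proposal N(x' | x, sigma_p^2) is symmetric, the Hastings
    correction cancels and the acceptance probability is min(1, p(x')/p(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_prop = rng.normal(x, sigma_p)              # propose x' ~ N(x, sigma_p^2)
        accept_prob = min(1.0, gmm_pdf(x_prop) / gmm_pdf(x))
        if rng.random() < accept_prob:
            x = x_prop                               # accept the move
        samples[t] = x                               # otherwise keep the old state
    return samples

# Example run for one of the requested settings (N = 1000, sigma_p = 1).
if __name__ == "__main__":
    chain = metropolis_hastings(1000, sigma_p=1.0)
    print(chain.mean(), chain.std())
```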
Problem 2
Social networks are a well-known way to model the relations between individual entities in a group or community that are connected by binary links arranged in structured patterns. These entities can be practically anything: modules or functions of a large software system connected by dependence relations, human beings linked together by their everyday social interactions, proteins in the human genome related to each other by their possible interactions, or entities in financial networks.
In this problem, you are going to implement an MCMC sampling algorithm for link prediction tasks. Consider a graph with a set of N nodes, representing the entities to be modeled. Let Y be an N × N binary matrix that contains the links. For node pairs i ≠ j, we let y_ij = 1 if there is a relationship (e.g., friendship) from entity i to entity j, and y_ij = 0 otherwise. For simplicity, we focus on directed graphs, for which the relationship y_ij from entity i to entity j is distinct from the relationship y_ji from entity j to entity i. Unobserved links are left unfilled. The goal of link prediction is to learn a model from the observed links such that we can predict the values of the unfilled entries.
Recently, various approaches based on probabilistic models have been developed for link prediction, among which the Mixed Membership Stochastic Block Model (MMSB) and the Latent Feature Relational Model (LFR) have produced state-of-the-art results.
2.1. Mixed Membership Stochastic Block Model (MMSB) (40 Pts)
The MMSB model builds on class-based methods, which assume there is a finite number of classes (communities) and that each entity of the network belongs to one and only one of these classes. In such methods, the structure of the graph is determined entirely by these classes. More precisely, the existence of a link between two arbitrary nodes in the graph depends only on the classes of those entities. However, these classes are not observed. Hence, the main goal of class-based methods is to determine these unobserved classes, assign the entities to them, and infer the class interactions.
More precisely, this model assumes that there exists a set of hidden communities to which each node belongs (in this model, each node has a distribution over the communities). Conditioned on the latent community membership of each node, all relations are assumed to be generated independently (the relation between the links and the communities is illustrated in Figure 1). The link generation process for this model is as follows, assuming there are K unknown latent communities indexed by integers k = 1, ..., K (a code sketch of this generative process is given after the list).
• Each node i has a mixed membership vector π_i, where π_i denotes a multinomial distribution over the K communities drawn from a Dirichlet distribution (π_i ∼ Dir(α, ..., α)).
• Generate a matrix W where, for each pair of communities z_i and z_j, you draw W(z_i, z_j) ∼ Beta(1, 1); W(z_i, z_j) is the probability of a node in community z_i having a link to a node in community z_j.
• Generate a link from node i to node j (y_ij ∈ {0, 1}) as y_ij ∼ Bernoulli(π_i^T W π_j).
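The following NumPy sketch of these three generative steps is only meant to make the notation concrete; the graph size, the function name, and the dense-matrix representation are illustrative assumptions.

```python
import numpy as np

def sample_mmsb_graph(N=20, K=6, alpha=1.0, rng=None):
    """Sample one graph from the MMSB generative process described above."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: mixed membership pi_i ~ Dir(alpha, ..., alpha) for each node.
    Pi = rng.dirichlet(alpha * np.ones(K), size=N)      # N x K
    # Step 2: community interaction probabilities W(k, l) ~ Beta(1, 1).
    W = rng.beta(1.0, 1.0, size=(K, K))                 # K x K
    # Step 3: links y_ij ~ Bernoulli(pi_i^T W pi_j) for i != j.
    P = Pi @ W @ Pi.T                                   # pairwise link probabilities
    Y = (rng.random((N, N)) < P).astype(int)
    np.fill_diagonal(Y, 0)                              # no self-links
    return Y, Pi, W

Y, Pi, W = sample_mmsb_graph()
print(Y.sum(), "links sampled")
```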
Figure 1 The relation between the links and communities in the MMSB model.
2.1.1 Given fixed parameters Π = [π_1, ..., π_N], W, and observations Y for some subset of ordered entity pairs, derive a formula for the posterior distribution P(z_i | z_{/i}, Y, Π, W). Here, z_{/i} denotes the community assignments for all the entities except node i.
2.1.2 Given fixed entity assignments z, derive formulas for the posterior distributions of the model parameters, P(π_i | Π_{/i}, W, Y, z) and P(W | Π, z, Y). These should be members of some standard exponential family of distributions.
2.1.3 Implement a Gibbs sampler using the formulas from Parts 2.1.1 and 2.1.2.
Each iteration should resample each of the variables z, W, Π once. Initialize by
sampling the parameters W, Π, z from their prior distributions.
2.1.4 Apply your Gibbs sampler to the Advise-Network dataset, a network describing advice relationships among N = 71 attorneys in a New England law firm (this dataset and its description are provided). You must run the sampler 5 times, each time training on 80% of the data, using the remaining 20% as test data, and using a different random initialization (note that the training data is a subset of the directed node pairs (potential links), not of the actually present links). For the MMSB model assume α = 1 and K = 6. Run the sampler for 500 iterations from random initializations and use the last 300 samples for prediction. After collecting 300 samples {[W^(1), Z^(1), Π^(1)], ..., [W^(300), Z^(300), Π^(300)]} from the posterior distribution of the hidden variables, the predictive distribution of a missing link is estimated as the average of the predictive distributions for each of the collected samples. Assuming that we want to predict the missing link e_ij between entities i and j, the approximate predictive distribution is computed as in Eq. 4.
P(e_ij = 1 | Y) ≈ (1/300) Σ_{s=1}^{300} P(e_ij | W^(s), Z^(s), Π^(s)).    (4)
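A possible way to evaluate Eq. 4 once the samples have been collected is sketched below. It assumes the per-sample predictive takes the form π_i^T W π_j (the community indicators integrated out); replace that line with whatever form your derivation in Part 2.1.1 yields. All names are illustrative.

```python
import numpy as np

def predictive_link_prob(i, j, Pi_samples, W_samples):
    """Monte Carlo estimate of Eq. 4 for a missing link e_ij.

    Pi_samples: list of N x K membership matrices, one per retained sample.
    W_samples:  list of K x K community interaction matrices.
    """
    # Per-sample predictive, here taken as pi_i^T W pi_j.
    probs = [Pi[i] @ W @ Pi[j] for Pi, W in zip(Pi_samples, W_samples)]
    return float(np.mean(probs))
```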
2.2 Latent Feature Relational Model (LFR) (35 Pts)
The main problem with the above model is the kind of relational structure it considers for relational datasets. For instance, in a social network, it is natural to consider a class that contains "female graduate school athletes" and another that contains "female graduate school actresses". Class-based models have only two options to describe this property of such a dataset: one option is to merge the classes into a single class, and the other is to duplicate the knowledge about their common aspects. In the latter solution, where the number of classes would potentially double for every new attribute like this, there is a computational problem: the growth in the number of classes quickly leads to an overabundance of classes. Furthermore, if a person is both an actress and an athlete, class-based models must either add another class for that combination or resort to a mixed membership model, which can be interpreted as saying that the more a student is an athlete, the less she is an actress.
A nice approach that can handle this problem is to describe entities by a set of features. In the above example, there could be separate features for "graduate school student", "athlete", and "actress". So, each person is described by the presence or absence of each of these features, and the relation between different persons is determined by the features they share.
Motivated by the above example, LFR uses features to describe the nodes of the network, and the relationship between two nodes is determined by the presence or absence of each of these features.
In this model, links are generated as follows (the relation between the links and the features is illustrated in Figure 2; a code sketch of this generative process is given after the list).
• Each node i is assigned a K-dimensional binary vector z_i (K is the number of features), where each of its elements is drawn from a Bernoulli distribution as

z_ki ∼ Bernoulli(π_k), k = 1, ..., K, i = 1, ..., N,    (5)
π_k ∼ Beta(a_π/K, b_π(K − 1)/K), k = 1, ..., K,    (6)

where a_π and b_π are the hyper-parameters of the model.
• Generate a matrix W as W_ij ∼ N(0, 1) (in this model, instead of restricting the matrix elements to be in [0, 1], W can be real-valued).
• The probability of a link from node i to node j in Y is defined as

P(y_ij = 1 | z_i, z_j, W) = σ(z_i^T W z_j),    (7)

where σ(·) is the logistic function, which maps real numbers to [0, 1].
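As for the MMSB model, the following short NumPy sketch of this generative process is provided only to make the notation concrete; the default sizes, hyper-parameter values, and names are illustrative assumptions.

```python
import numpy as np

def sample_lfr_graph(N=20, K=10, a_pi=10.0, b_pi=5.0, rng=None):
    """Sample one graph from the LFR generative process (Eqs. 5-7)."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. 6: feature probabilities pi_k ~ Beta(a_pi/K, b_pi*(K-1)/K).
    pi = rng.beta(a_pi / K, b_pi * (K - 1) / K, size=K)
    # Eq. 5: binary feature vectors z_i with z_ki ~ Bernoulli(pi_k).
    Z = (rng.random((N, K)) < pi).astype(float)             # N x K
    # Real-valued weight matrix W with W_kl ~ N(0, 1).
    W = rng.normal(0.0, 1.0, size=(K, K))
    # Eq. 7: P(y_ij = 1) = sigmoid(z_i^T W z_j).
    logits = Z @ W @ Z.T
    P = 1.0 / (1.0 + np.exp(-logits))
    Y = (rng.random((N, N)) < P).astype(int)
    np.fill_diagonal(Y, 0)                                  # no self-links
    return Y, Z, W, pi
```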
Figure 2 The relation between the links and features in the LFR model.
2.2.1 Given fixed parameters Π = [π_1, ..., π_K], W, and observations Y for some subset of ordered entity pairs, derive a formula for the posterior distribution P(z_i | z_{/i}, Y, Π, W). Here, z_{/i} denotes the feature vectors for all the entities except node i.
2.2.2 Given the fixed entity assignments z, derive formulas for the posterior distributions of the model parameters, P(π_i | Π_{/i}, W, Y, z) and P(W | Π, z, Y). It is worth noting that since we do not have a conjugate prior on W in this model, we cannot directly sample W from its posterior. Hence, you should use a Metropolis-Hastings step for each weight, in which you propose a new weight from a normal distribution centered around the old one.
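A sketch of one such per-weight Metropolis-Hastings update is shown below, assuming the Bernoulli-logistic likelihood of Eq. 7 and the N(0, 1) prior on each weight; the set of observed pairs, the proposal standard deviation, and all names are assumptions for illustration. Recomputing the full log-posterior for every weight is wasteful (only the terms involving the changed entry differ) but keeps the sketch short.

```python
import numpy as np

def log_posterior_W(W, Z, Y, obs_pairs):
    """log p(Y_obs | Z, W) + log p(W) up to an additive constant, using the
    Bernoulli-logistic likelihood of Eq. 7 and N(0, 1) priors on the weights."""
    lp = -0.5 * np.sum(W ** 2)                        # Gaussian prior on W
    for (i, j) in obs_pairs:
        eta = Z[i] @ W @ Z[j]
        # log Bernoulli(y | sigmoid(eta)), written in a numerically stable form
        lp += Y[i, j] * eta - np.logaddexp(0.0, eta)
    return lp

def mh_update_weight(W, k, l, Z, Y, obs_pairs, step=0.5, rng=None):
    """One random-walk MH step on W[k, l]; proposal centered at the old value."""
    rng = np.random.default_rng() if rng is None else rng
    W_prop = W.copy()
    W_prop[k, l] = rng.normal(W[k, l], step)
    log_ratio = (log_posterior_W(W_prop, Z, Y, obs_pairs)
                 - log_posterior_W(W, Z, Y, obs_pairs))
    if np.log(rng.random()) < log_ratio:
        return W_prop                                 # accept the proposed weight
    return W                                          # reject, keep the old weight
```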
2.2.3 Implement an MCMC sampler using the formulas from Parts 2.2.1 and 2.2.2.
Each iteration should resample each of the z, W, Π variables once. Initialize by
sampling the parameters W, Π, z from their prior distributions.
2.2.4 Apply your MCMC sampler to the Advise-Network dataset using the setup of Part 2.1.4. For this model, use the following settings for the hyper-parameters:

K = 10, a_π = K, b_π = K/2.    (8)
Problem 3 (60 Pts)
The past several years have witnessed the rapid development of the theory and algorithms of sparse representation (or coding) and its successful applications in signal classification, image denoising, image reconstruction, compressive sensing, etc. Sparse codes can efficiently represent signals using a linear combination of basis elements, which are called atoms. A collection of atoms is referred to as a dictionary. This dictionary is defined as D = [d_1, d_2, ..., d_K] ∈ R^{M×K}, where d_i is the i-th atom with dimension M.
Suppose that we want to find some atoms of the dictionary D such that the reconstructed signal X̂ is as similar as possible to the original signal X, while the combination coefficients α are as sparse as possible. This problem can be formulated as

α̂ = argmin_α ‖α‖_0   s.t.   ‖X̂ − X‖_2^2 ≤ T,    (9)

where X̂ = Dα, T is a predefined threshold, and ‖·‖_0 is the l_0 norm, which is defined as

‖x‖_0 = #{j : x_j ≠ 0} = lim_{q→0} Σ_{j=1}^{M} |x_j|^q.    (10)
Unfortunately, due to the NP-hard nature of the above problem, α̂ cannot be computed efficiently, but it has been demonstrated that the l_1 norm also produces sparse solutions. Thus, Eq. 9 can be reformulated as

α̂ = argmin_α ‖X − Dα‖_2^2 + λ‖α‖_1.    (11)
In Eq. 11, λ is a regularization parameter that controls the tradeoff between sparsity and reconstruction error. If the dictionary is not a pre-defined one, we have to find both the sparse codes and a proper dictionary; thus, the general form of the optimization problem changes to

[α̂, D̂] = argmin_{α,D} ‖X − Dα‖_2^2 + λ‖α‖_1.    (12)
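For reference only (this problem asks for a probabilistic treatment), the optimization in Eq. 12 is essentially what standard, non-probabilistic dictionary learning solves. The sketch below uses scikit-learn's DictionaryLearning on toy data, assuming signals are stored as rows, i.e., the transpose of the X convention used here.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy data: 200 signals of dimension M = 64, stored as rows
# (scikit-learn's convention; Eq. 12 stores signals as columns of X).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))

dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(X)        # sparse codes, shape (200, 32)
D = dl.components_                 # learned dictionary, shape (32, 64)

print("reconstruction error:", np.linalg.norm(X - codes @ D))
print("fraction of nonzero coefficients:", np.mean(codes != 0))
```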
In recent years, some researchers have proposed probabilistic models for sparse representation (SR) and dictionary learning (DL) within the PGM framework that have led to state-of-the-art results.
Recently, Zhou et al. [1] have proposed a new framework for dictionary learning (DL) using probabilistic models. Suppose we are given a training set of N signals X = [x_1, x_2, ..., x_N] ∈ R^{M×N} with dimension M. In that paper, each signal x_i is modelled as a sparse combination of the atoms of a dictionary D ∈ R^{M×K}, plus an additive noise term ε_i, where K is the number of dictionary atoms. The matrix form of the model can be formulated as

X = DA + E,    (13)

where X_{M×N} is the set of input signals, A_{K×N} is the set of K-dimensional sparse codes, and E ∼ N(0, γ_x^{-1} I_M) is the zero-mean Gaussian noise with precision γ_x (I_M is an M × M identity matrix). Moreover, the matrix of sparse codes (i.e., A) is modeled as the element-wise multiplication of a binary matrix Z and a weight matrix S. Hence, the model of Equation 13 can be reformulated as

X = D(Z ∘ S) + E,    (14)

where ∘ denotes the element-wise (Hadamard) multiplication operator.
The hierarchical probabilistic model, given the training data X = {x_i}_{i=1}^N, can be expressed as

P(X | D, Z, S, γ_x) ∼ ∏_{j=1}^{N} N(x_j; D(z_j ∘ s_j), γ_x^{-1} I_M),    (15)
P(γ_x | a_x, b_x) ∼ Gamma(γ_x; a_x, b_x),    (16)
P(Z | Π) ∼ ∏_{i=1}^{N} ∏_{k=1}^{K} Bernoulli(z_ki; π_k),    (17)
P(Π | a_π, b_π, K) ∼ ∏_{k=1}^{K} Beta(π_k; a_π/K, b_π(K − 1)/K),    (18)
P(S | γ_s) ∼ ∏_{i=1}^{N} ∏_{k=1}^{K} N(s_ki; 0, γ_s^{-1}),    (19)
P(γ_s | a_s, b_s) ∼ Gamma(γ_s; a_s, b_s),    (20)
P(D | γ_d) ∼ ∏_{i=1}^{M} ∏_{k=1}^{K} N(d_ik; 0, γ_d^{-1}),    (21)
P(γ_d | a_d, b_d) ∼ Gamma(γ_d; a_d, b_d),    (22)

where Π = [π_1, π_2, ..., π_K]. Φ = {a_π, b_π, a_x, b_x, a_d, b_d, a_s, b_s, K} are the hyper-parameters of the proposed model.
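The following NumPy sketch draws one sample from this hierarchy (Eqs. 15-22), only to make the generative structure explicit before deriving the variational updates. The sizes and the moderate Gamma hyper-parameters used here are illustrative, not the diffuse values prescribed later in the assignment, and all names are mine.

```python
import numpy as np

def sample_bpfa_model(M=64, N=500, K=128, a_pi=128.0, b_pi=64.0,
                      a=2.0, b=2.0, rng=None):
    """Draw one sample of (D, Z, S, precisions, X) from Eqs. 15-22."""
    rng = np.random.default_rng() if rng is None else rng
    # Gamma priors on the precisions (Eqs. 16, 20, 22); numpy's gamma takes a
    # shape and a *scale*, so scale = 1/b for a rate parameterization.
    gamma_x = rng.gamma(a, 1.0 / b)
    gamma_s = rng.gamma(a, 1.0 / b)
    gamma_d = rng.gamma(a, 1.0 / b)
    D = rng.normal(0.0, gamma_d ** -0.5, size=(M, K))      # Eq. 21
    pi = rng.beta(a_pi / K, b_pi * (K - 1) / K, size=K)    # Eq. 18
    Z = (rng.random((K, N)) < pi[:, None]).astype(float)   # Eq. 17
    S = rng.normal(0.0, gamma_s ** -0.5, size=(K, N))      # Eq. 19
    E = rng.normal(0.0, gamma_x ** -0.5, size=(M, N))      # noise of Eq. 15
    X = D @ (Z * S) + E                                    # Eqs. 14-15
    return X, D, Z, S, pi, (gamma_x, gamma_s, gamma_d)
```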
In [1], Zhou et al. have used MCMC to approximate the posterior distribution of the latent variables given the observations, i.e., they approximate P(D, Z, S, Π, γ_x, γ_s, γ_d | X). The goal of this question is to develop mean field variational inference for approximating the posterior distribution of the latent variables. You should use a fully factorized variational distribution as
q(Π, Z, S, D, γ_s, γ_x, γ_d) = ∏_{k=1}^{K} ∏_{i=1}^{M} q_{π_k}(π_k) q_{d_ik}(d_ik) ∏_{j=1}^{N} ∏_{k=1}^{K} q_{z_kj}(z_kj) q_{s_kj}(s_kj) · q_{γ_s}(γ_s) q_{γ_x}(γ_x) q_{γ_d}(γ_d).
3.1.1 Draw the graphical representation of the above probabilistic model using plate notation.
3.1.2 Develop the variational update equations for the above model.
Now, you are given an image I ∈ R^{256×256} with zero-mean additive Gaussian noise at different noise variances σ^2 = 15, 25, 50 (Figure 3). Similar to [1], partition the image into roughly 62000 overlapping 8 × 8 blocks {x_i ∈ R^64}_{i=1}^{62000} and set the hyper-parameters as

a_x = b_x = a_s = b_s = a_d = b_d = 10^{-6}, K = 128, a_π = K, b_π = K/2.    (23)
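One possible way to extract the overlapping blocks in Python is sketched below; a 256 × 256 image with an 8 × 8 sliding window gives (256 − 8 + 1)^2 = 62001 patches, matching the roughly 62000 mentioned above. The function name and the column-stacking convention are my own.

```python
import numpy as np

def extract_patches(image, patch=8):
    """Collect all overlapping patch x patch blocks as columns of a matrix."""
    H, W = image.shape
    cols = []
    for r in range(H - patch + 1):
        for c in range(W - patch + 1):
            cols.append(image[r:r + patch, c:c + patch].reshape(-1))
    return np.stack(cols, axis=1)          # shape (patch*patch, n_patches)

# X = extract_patches(noisy_image)  ->  X.shape == (64, 62001) for a 256x256 image
```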
Figure 3 From left to right: the original image, the noisy image with σ^2 = 15, the noisy image with σ^2 = 25, and the noisy image with σ^2 = 50.
3.1.3 Implement your derived Variational Inference using the image data based
on the above setup. The original image (original.mat) and the noisy versions of it
(noisy15.mat, noisy25.mat, noisy50.mat) are provided.
3.1.4 Reconstruct the noisy images based on the learned dictionary and the sparse codes. Reconstruct each patch x_i as x̂_i = (1/T) Σ_{t=1}^{T} E_q[D](z_i^t ∘ s_i^t), where {z_i^t, s_i^t}_{t=1}^{T} are T i.i.d. samples from the approximate posterior distributions q(z_i) and q(s_i). Compute the PSNR measure for the denoised image (the code for computing the PSNR value is provided). Then, compare the results to the images denoised with the MCMC algorithm (the reconstructed images and their PSNR values for the MCMC algorithm are available in [1]).
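A sketch of the final re-assembly step is given below. It assumes each denoised patch (the average computed above) is stored as a column, in the same row-major sliding-window order used for extraction, and that overlapping pixels are averaged; this overlap-averaging convention is my assumption, since the assignment does not spell it out.

```python
import numpy as np

def reconstruct_image(patch_means, image_shape=(256, 256), patch=8):
    """Re-assemble denoised patches into an image by averaging overlaps.

    patch_means: array of shape (patch*patch, n_patches) whose i-th column is
    the averaged reconstruction of patch x_i from Part 3.1.4.
    """
    H, W = image_shape
    acc = np.zeros(image_shape)            # accumulated pixel values
    cnt = np.zeros(image_shape)            # how many patches cover each pixel
    idx = 0
    for r in range(H - patch + 1):
        for c in range(W - patch + 1):
            acc[r:r + patch, c:c + patch] += patch_means[:, idx].reshape(patch, patch)
            cnt[r:r + patch, c:c + patch] += 1.0
            idx += 1
    return acc / cnt
```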
References
[1] Mingyuan Zhou, Haojun Chen, John Paisley, Lu Ren, Guillermo Sapiro, and Lawrence Carin, "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations," in NIPS, 2009.