40-957 Special Topics in Artificial Intelligence: Probabilistic Graphical Models
Homework 5 (130 + 20 Pts)
Due date: 1393/4/5

Directions: Submit your code and report to "[email protected]" with the subject "HW5 Submission-student number" (for example: "HW5 Submission-90207018").

Late submission policy: a late submission carries a penalty of 10% off your assignment per day after the deadline.

Problem 1 (15 Pts)

Consider a mixture of two one-dimensional Gaussians,

    p(x) = w N(x; μ_1, σ_1²) + (1 − w) N(x; μ_2, σ_2²).    (1)

Now, assume the proposal distribution q(x′) is a Gaussian,

    q(x′) = N(x′ | x, σ_p²).    (2)

1.1 Implement a Metropolis-Hastings sampler. Now, assume you are given the parameters of the target distribution as

    w = 0.3,  μ_1 = 0,  μ_2 = 10,  σ_1 = σ_2 = 2.    (3)

1.2 Sample N = 100, 1000, 10000 points for each σ_p = 0.1, 1, 10. Plot a histogram of the samples in each experiment and discuss the results.
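The following is a minimal sketch of the kind of sampler Part 1.1 asks for, assuming a random-walk Gaussian proposal as in Eq. (2); the names target_pdf, metropolis_hastings, and sigma_p are illustrative choices, not part of the assignment.

```python
import numpy as np
from scipy.stats import norm

def target_pdf(x, w=0.3, mu1=0.0, mu2=10.0, s1=2.0, s2=2.0):
    # Two-component Gaussian mixture density from Eq. (1).
    return w * norm.pdf(x, mu1, s1) + (1 - w) * norm.pdf(x, mu2, s2)

def metropolis_hastings(n_samples, sigma_p, x0=0.0, rng=None):
    # Random-walk Metropolis-Hastings with proposal q(x'|x) = N(x'|x, sigma_p^2).
    # The proposal is symmetric, so the Hastings correction cancels.
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_prop = x + sigma_p * rng.standard_normal()
        accept_prob = min(1.0, target_pdf(x_prop) / target_pdf(x))
        if rng.random() < accept_prob:
            x = x_prop
        samples[t] = x
    return samples

# Example usage for one (N, sigma_p) setting of Part 1.2.
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    s = metropolis_hastings(10000, sigma_p=1.0)
    plt.hist(s, bins=60, density=True)
    plt.show()
```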
Problem 2

Social networks are a well-known way to model the relations between individual entities in a group or community that are connected by binary links arranged in structured patterns. These entities can be practically anything: modules or functions of a large software system connected by dependence relations, human beings linked together by their everyday social interactions, proteins in the human genome related to each other by their possible interactions, or nodes of a financial network. In this problem, you are going to implement MCMC sampling algorithms for link prediction tasks.

Consider a graph with a set of N nodes representing the entities to be modeled. Let Y be an N × N binary matrix that contains the links. For node pairs i ≠ j, we let y_ij = 1 if there is a relationship (e.g., friendship) from entity i to entity j, and y_ij = 0 otherwise. We focus on directed graphs, for which there is a distinct binary relationship y_ji from entity j to entity i. Unobserved links are left unfilled. The goal of link prediction is to learn a model from the observed links such that we can predict the values of the unfilled entries. Recently, various approaches based on probabilistic models have been developed for link prediction, among which the Mixed Membership Stochastic Block Model (MMSB) and the Latent Feature Relational Model (LFR) have produced state-of-the-art results.

2.1 Mixed Membership Stochastic Block Model (MMSB) (40 Pts)

The MMSB model builds on class-based models, which assume there is a finite number of classes (communities) and that each entity of the network belongs to one and only one of these classes. In such methods, the structure of the graph is determined entirely by these classes: the existence of a link between two arbitrary nodes depends only on the classes of those entities. However, the classes are not observed, hence the main goal of class-based methods is to determine these unobserved classes, assign the entities to them, and infer the class interactions. More precisely, the MMSB model assumes that there exists a set of hidden communities and that each node has a distribution over these communities. Conditioned on the latent community membership of each node, all relations are assumed to be generated independently (the relation between the links and communities is illustrated in Figure 1). The link generation process is as follows, assuming there are K latent communities indexed by k = 1, ..., K.

• Each node i has a mixed membership π_i, where π_i denotes a multinomial distribution over the K communities drawn from a Dirichlet distribution, π_i ~ Dir(α, ..., α).
• Generate a matrix W where, for each pair of communities z_i and z_j, W(z_i, z_j) ~ Beta(1, 1); here W(z_i, z_j) is the probability that a node in community z_i has a link to a node in community z_j.
• Generate a link from node i to node j (y_ij ∈ {0, 1}) as y_ij ~ Bernoulli(π_i^T W π_j).

Figure 1: The relation between the links and communities in the MMSB model.

2.1.1 Given fixed parameters Π = [π_1, ..., π_N], W, and observations Y for some subset of ordered entity pairs, derive a formula for the posterior distribution P(z_i | z_/i, Y, Π, W). Here, z_/i denotes the community assignments for all entities except node i.

2.1.2 Given fixed entity assignments z, derive formulas for the posterior distributions of the model parameters, P(π_i | Π_/i, W, Y, z) and P(W | Π, z, Y). These should be members of standard exponential-family distributions.

2.1.3 Implement a Gibbs sampler using the formulas from Parts 2.1.1 and 2.1.2. Each iteration should resample each of the variables z, W, and Π once. Initialize by sampling the parameters W, Π, and z from their prior distributions.

2.1.4 Apply your Gibbs sampler to the Advise-Network dataset, a network describing advice relationships among N = 71 attorneys in a New England law firm (this dataset and its description are provided). Run the experiment 5 times, each time training on 80% of the data, using the remaining 20% as test data, and using a different random initialization (note that the train/test split is over the directed node pairs, i.e., potential links, not over the actually present links). For the MMSB model assume α = 1 and K = 6. Run the sampler for 500 iterations from a random initialization and use the last 300 samples for prediction. After collecting 300 samples {[W^(1), Z^(1), Π^(1)], ..., [W^(300), Z^(300), Π^(300)]} from the posterior distribution of the hidden variables, the predictive distribution of a missing link is estimated as the average of the predictive distributions for each of the collected samples. Assuming we want to predict the missing link e_ij between entities i and j, the approximate predictive distribution is computed as Eq. (4); a small sketch of this averaging is given below.

    P(e_ij = 1 | Y) ≈ (1/300) Σ_{s=1}^{300} P(e_ij | W^(s), Z^(s), Π^(s)).    (4)
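The following is a minimal sketch of the sample averaging in Eq. (4), assuming the collected Π^(s) and W^(s) are stored as NumPy arrays and that P(e_ij | W^(s), Z^(s), Π^(s)) is taken to be the marginal link probability π_i^T W^(s) π_j; if you instead condition on the sampled indicators Z^(s), replace that line with the corresponding entry of W^(s). The function name mmsb_link_probability is illustrative.

```python
import numpy as np

def mmsb_link_probability(Pi_samples, W_samples, i, j):
    # Approximate P(e_ij = 1 | Y) as in Eq. (4) by averaging the per-sample
    # Bernoulli link probability pi_i^T W pi_j over the collected samples.
    # Pi_samples: list of (N, K) membership matrices, one per retained Gibbs sample.
    # W_samples:  list of (K, K) community-interaction matrices, same order.
    probs = [Pi[i] @ W @ Pi[j] for Pi, W in zip(Pi_samples, W_samples)]
    return float(np.mean(probs))
```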
2.2 Latent Feature Relational Model (LFR) (35 Pts)

The main limitation of the above model is the kind of relational structure it can represent. For instance, in a social network it is natural to consider a class containing "female graduate school athletes" and another containing "female graduate school actresses". Class-based models have only two options for describing such a dataset: merge the two classes into one, or duplicate the knowledge about their common aspects. In the latter case, every new attribute of this kind potentially doubles the number of classes, so the number of classes quickly grows into an overabundance, which is computationally problematic. Furthermore, if a person is both an actress and an athlete, a class-based model must either add yet another class for that combination or resort to a mixed-membership model, which can be interpreted as "the more a student is an athlete, the less she is an actress".

A nicer approach that handles this problem is to describe entities by a set of features. In the above example, there could be separate features for "graduate school student", "athlete", and "actress". Each person is then described by the presence or absence of each of these features, and the relation between two persons is determined by the features they share. Motivated by this example, LFR uses features to describe the nodes of the network, and the relationship between two nodes is determined by the presence or absence of each of these features. In this model, links are generated as follows (the relation between the links and features is illustrated in Figure 2).

• Each node i is assigned a K-dimensional binary feature vector z_i (K is the number of features), each element of which is drawn from a Bernoulli distribution:

      z_ki ~ Bernoulli(π_k),  k = 1, ..., K,  i = 1, ..., N,    (5)
      π_k ~ Beta(a_π/K, b_π(K − 1)/K),  k = 1, ..., K,    (6)

  where a_π and b_π are the hyper-parameters of the model.
• Generate a matrix W as W_ij ~ N(0, 1) (in this model, instead of restricting the matrix elements to lie in [0, 1], W can be real-valued).
• The probability of a link from node i to node j in Y is defined as

      P(y_ij | z_i, z_j, W) = σ(z_i^T W z_j),    (7)

  where σ(·) is the logistic function, which maps real numbers to [0, 1].

Figure 2: The relation between the links and features in the LFR model.

2.2.1 Given fixed parameters Π = [π_1, ..., π_K], W, and observations Y for some subset of ordered entity pairs, derive a formula for the posterior distribution P(z_i | z_/i, Y, Π, W). Here, z_/i denotes the features of all entities except node i.

2.2.2 Given fixed entity assignments z, derive formulas for the posterior distributions of the model parameters, P(π_k | Π_/k, W, Y, z) and P(W | Π, z, Y). Note that since we do not have a conjugate prior on W in this model, we cannot sample W directly from its posterior. Hence, you should use a Metropolis-Hastings step for each weight, in which you propose a new weight from a normal distribution centered at the old one; a sketch of this step is given after Part 2.2.4.

2.2.3 Implement an MCMC sampler using the formulas from Parts 2.2.1 and 2.2.2. Each iteration should resample each of the variables z, W, and Π once. Initialize by sampling the parameters W, Π, and z from their prior distributions.

2.2.4 Apply your sampler to the Advise-Network dataset using the setup of Part 2.1.4. For this model, use the following settings for the hyper-parameters:

    K = 10,  a_π = K,  b_π = K/2.    (8)
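The following is a minimal sketch of the per-weight Metropolis-Hastings step mentioned in Part 2.2.2, assuming the observed links are an N × N array Y with a boolean mask marking the observed (i ≠ j) pairs; the names lfr_log_likelihood and resample_weights_mh and the step size are illustrative assumptions, not prescribed by the assignment.

```python
import numpy as np

def lfr_log_likelihood(W, Z, Y, mask):
    # Log-likelihood of the observed links under Eq. (7), with the logistic link.
    # Z: (N, K) binary feature matrix; Y: (N, N) links; mask: (N, N) observed pairs.
    logits = Z @ W @ Z.T
    log_lik = Y * logits - np.logaddexp(0.0, logits)  # log Bernoulli(sigmoid(logits))
    return float(np.sum(log_lik[mask]))

def resample_weights_mh(W, Z, Y, mask, step=0.1, rng=None):
    # One Metropolis-Hastings sweep over the entries of W: propose
    # W_kl' ~ N(W_kl, step^2) and accept with the MH ratio combining the
    # likelihood change with the N(0, 1) prior on each weight (the symmetric
    # proposal cancels out of the ratio).
    rng = np.random.default_rng() if rng is None else rng
    cur_ll = lfr_log_likelihood(W, Z, Y, mask)
    K = W.shape[0]
    for k in range(K):
        for l in range(K):
            old = W[k, l]
            W[k, l] = old + step * rng.standard_normal()
            new_ll = lfr_log_likelihood(W, Z, Y, mask)
            log_ratio = (new_ll - cur_ll) - 0.5 * (W[k, l] ** 2 - old ** 2)
            if np.log(rng.random()) < log_ratio:
                cur_ll = new_ll      # accept the proposed weight
            else:
                W[k, l] = old        # reject and restore the old weight
    return W
```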
Problem 3 (60 Pts)

The past several years have witnessed the rapid development of the theory and algorithms of sparse representation (or coding) and its successful application to signal classification, image denoising, image reconstruction, compressive sensing, etc. Sparse codes efficiently represent signals as linear combinations of basis elements called atoms. A collection of atoms is referred to as a dictionary, defined as D = [d_1, d_2, ..., d_K] ∈ R^{M×K}, where d_i is the i-th atom with dimension M. Suppose we want to find some atoms of the dictionary D such that the reconstructed signal X̂ is as similar as possible to the original signal X, while the vector of combination coefficients α is as sparse as possible. This problem can be formulated as

    α̂ = argmin_α ||α||_0   s.t.  ||X̂ − X||_2² ≤ T,    (9)

where X̂ = Dα, T is a predefined threshold, and ||·||_0 is the l0 norm, defined as

    ||x||_0 = #{ j : x_j ≠ 0 } = lim_{q→0} Σ_{j=1}^{M} |x_j|^q.    (10)

Unfortunately, due to the NP-hard nature of the above problem, α̂ cannot be computed efficiently, but it has been demonstrated that the l1 norm also yields sparse solutions. Thus, Eq. (9) can be reformulated as

    α̂ = argmin_α ||X − Dα||_2² + λ||α||_1.    (11)

In Eq. (11), λ is a regularization parameter that controls the trade-off between sparseness and reconstruction error. If the dictionary is not pre-defined, we have to find both the sparse codes and a proper dictionary, so the general form of the optimization problem becomes

    [α̂, D̂] = argmin_{α,D} ||X − Dα||_2² + λ||α||_1.    (12)

In recent years, researchers have proposed probabilistic models for sparse representation (SR) and dictionary learning (DL) within the PGM framework that have led to state-of-the-art results. Recently, Zhou et al. [1] proposed a new probabilistic framework for dictionary learning. Suppose we are given a training set of N signals X = [x_1, x_2, ..., x_N] ∈ R^{M×N} with dimension M. In that paper, each signal x_i is modeled as a sparse combination of atoms of a dictionary D ∈ R^{M×K}, plus an additive noise ε_i, where K is the number of dictionary atoms. The matrix form of the model can be written as

    X = DA + E,    (13)

where X (M×N) is the set of input signals, A (K×N) is the set of K-dimensional sparse codes, and E ~ N(0, γ_x^{-1} I_M) is zero-mean Gaussian noise with precision γ_x (I_M is the M × M identity matrix). Moreover, the matrix of sparse codes A is modeled as an element-wise multiplication of a binary matrix Z and a weight matrix S. Hence, the model of Eq. (13) can be reformulated as

    X = D(Z ⊙ S) + E,    (14)

where ⊙ is the element-wise multiplication operator. The hierarchical probabilistic model, given the training data X = {x_i}_{i=1}^{N}, can be expressed as

    P(X | D, Z, S, γ_x) = ∏_{j=1}^{N} N(x_j; D(z_j ⊙ s_j), γ_x^{-1} I_M),    (15)
    P(γ_x | a_x, b_x) = Gamma(γ_x; a_x, b_x),    (16)
    P(Z | Π) = ∏_{i=1}^{N} ∏_{k=1}^{K} Bernoulli(z_ki; π_k),    (17)
    P(Π | a_π, b_π, K) = ∏_{k=1}^{K} Beta(π_k; a_π/K, b_π(K − 1)/K),    (18)
    P(S | γ_s) = ∏_{i=1}^{N} ∏_{k=1}^{K} N(s_ki; 0, γ_s^{-1}),    (19)
    P(γ_s | a_s, b_s) = Gamma(γ_s; a_s, b_s),    (20)
    P(D | γ_d) = ∏_{i=1}^{M} ∏_{k=1}^{K} N(d_ik; 0, γ_d^{-1}),    (21)
    P(γ_d | a_d, b_d) = Gamma(γ_d; a_d, b_d),    (22)

where Π = [π_1, π_2, ..., π_K], and Φ = {a_π, b_π, a_x, b_x, a_d, b_d, a_s, b_s, K} are the hyper-parameters of the model. In [1], Zhou et al. used MCMC to approximate the posterior distribution of the latent variables given the observations, i.e., they approximate P(D, Z, S, Π, γ_x, γ_s, γ_d | X). The goal of this question is to develop mean-field variational inference for approximating the posterior distribution of the latent variables. You should use a fully factorized variational distribution,

    q(Π, Z, S, D, γ_s, γ_x, γ_d) = ∏_{k=1}^{K} q(π_k) ∏_{i=1}^{M} ∏_{k=1}^{K} q(d_ik) ∏_{j=1}^{N} ∏_{k=1}^{K} q(z_kj) q(s_kj) · q(γ_s) q(γ_x) q(γ_d).

3.1.1 Draw the graphical representation of the above probabilistic model using plate notation.

3.1.2 Derive the variational update equations for the above model.

Now, you are given an image I ∈ R^{256×256} corrupted by zero-mean additive Gaussian noise with different noise variances σ² = 15, 25, 50 (Figure 3). Similar to [1], partition the image into 62000 overlapping 8 × 8 blocks {x_i ∈ R^64}_{i=1}^{62000} (a sketch of this step is given below) and set the hyper-parameters as

    a_x = b_x = a_s = b_s = a_d = b_d = 10^{-6},  K = 128,  a_π = K,  b_π = K/2.    (23)

Figure 3: From left to right: the original image, and noisy images with σ² = 15, σ² = 25, and σ² = 50.
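The following is a minimal patch-extraction sketch for the setup above, assuming the noisy image is loaded as a 2-D NumPy array (e.g., from noisy15.mat via scipy.io.loadmat); the function name extract_patches and the unit stride are illustrative assumptions.

```python
import numpy as np

def extract_patches(image, patch=8, stride=1):
    # Partition a 2-D image into overlapping patch x patch blocks and return
    # them as columns of a (patch*patch, n_patches) matrix, matching the
    # {x_i in R^64} layout used in Problem 3.
    H, W = image.shape
    cols = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            cols.append(image[r:r + patch, c:c + patch].reshape(-1))
    return np.stack(cols, axis=1)

# With a 256 x 256 image and unit stride this yields 249 * 249 = 62001 patches,
# i.e. essentially the 62000 overlapping blocks mentioned in the problem.
```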
3.1.3 Implement your derived variational inference on the image data using the above setup. The original image (original.mat) and its noisy versions (noisy15.mat, noisy25.mat, noisy50.mat) are provided.

3.1.4 Reconstruct the noisy images from the learned dictionary and the sparse codes. Reconstruct each patch x_i as

    x̂_i = (1/T) Σ_{t=1}^{T} E_q[D] (z_i^t ⊙ s_i^t),

where {z_i^t, s_i^t}_{t=1}^{T} are T i.i.d. samples from the approximate posterior distributions q(z_i) and q(s_i). Compute the PSNR measure for each denoised image (the code for computing the PSNR value is provided). Then compare your results to the images denoised with the MCMC algorithm (the reconstructed images and their PSNR values for the MCMC algorithm are reported in [1]). A sketch of reassembling the denoised patches into an image is given below.
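The following is a minimal sketch of reassembling the overlapping patch estimates of Part 3.1.4 into a full image, assuming the patches were produced (in the same order) by the extract_patches sketch above; reconstruct_image and psnr are illustrative helpers, and the provided PSNR code should be used for the numbers you report.

```python
import numpy as np

def reconstruct_image(patches_hat, image_shape, patch=8, stride=1):
    # Reassemble denoised patches (columns of patches_hat, ordered as produced
    # by extract_patches) by averaging the overlapping pixel estimates.
    H, W = image_shape
    acc = np.zeros(image_shape)
    cnt = np.zeros(image_shape)
    idx = 0
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            acc[r:r + patch, c:c + patch] += patches_hat[:, idx].reshape(patch, patch)
            cnt[r:r + patch, c:c + patch] += 1
            idx += 1
    return acc / cnt

def psnr(clean, denoised, peak=255.0):
    # Standard PSNR in dB, shown only for reference.
    mse = np.mean((clean.astype(float) - denoised.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```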
References

[1] Mingyuan Zhou, Haojun Chen, John Paisley, Lu Ren, Guillermo Sapiro, and Lawrence Carin, "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations," in NIPS, 2009.