CSC321 Winter 2017 — Homework 8

Deadline: Wednesday, March 29, at 11:59pm.

Submission: You must submit your solutions as a PDF file through MarkUs [1]. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

Late Submission: MarkUs will remain open until 2 days after the deadline; until that time, you should submit through MarkUs. If you want to submit the assignment more than 2 days late, please e-mail it to [email protected]. The reason for this is that MarkUs won't let us collect the homeworks until the late period has ended, and we want to be able to return them to you in a timely manner.

Weekly homeworks are individual work. See the Course Information handout [2] for detailed policies.

1. Categorical Distribution. Let's consider fitting the categorical distribution, a discrete distribution over K outcomes, which we'll number 1 through K. The probability of each category is explicitly represented with a parameter θ_k. For it to be a valid probability distribution, we clearly need θ_k ≥ 0 and ∑_k θ_k = 1. We'll represent each observation x as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this model, the probability of an observation can be written in the following form:

    p(x; θ) = ∏_{k=1}^{K} θ_k^{x_k}.

(a) [2pts] Determine the formula for the maximum likelihood estimate of the parameters in terms of the counts N_k = ∑_i x_k^{(i)} of all the outcomes. You may assume all of the counts are nonzero. Note that your solution needs to obey the constraints θ_k ≥ 0 and ∑_k θ_k = 1. You might want to see Khan Academy [3] for background on constrained optimization.

(b) [2pts] Now consider Bayesian parameter estimation. For the prior, we'll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e. vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:

    p(θ) ∝ θ_1^{a_1 − 1} · · · θ_K^{a_K − 1}.

A useful fact is that if θ ∼ Dirichlet(a_1, . . . , a_K), then

    E[θ_k] = a_k / ∑_{k'} a_{k'}.

Determine the posterior distribution p(θ | D), where D is the set of observations. From that, determine the posterior predictive probability that the next outcome will be k.

(c) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector θ.

[1] https://markus.teach.cs.toronto.edu/csc321-2017-01
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/syllabus.pdf
[3] https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/lagrange-multipliers-and-constrained-optimization/v/constrained-optimization-introduction

2. Gibbs Sampling. Suppose we have a Boltzmann machine with two variables, with a weight w connecting them and the same bias b for both variables. The variables both take values in {−1, 1}.

(a) [2pts] For the values w = 1 and b = −0.2, determine the probability distribution over all four configurations of the variables. Then determine the marginal probability p(x_1 = 1). Now do the same thing for w = 3 and b = −0.2. (Give your answers to three decimal places.)

(b) We'll initialize both variables to 1, and then run Gibbs sampling to try to sample from the model distribution. On odd-numbered steps, we will resample x_1 from its conditional distribution; on even-numbered steps, we will resample x_2 from its conditional distribution. Denote by α_t the probability that the value chosen by Gibbs sampling on the t-th step is a 1. Since we are initializing to 1, we have the initial condition α_0 = 1.
    i. [2pts] Give a recurrence that expresses α_t as a function of α_{t−1}.

    ii. [1pt] Using your recurrence, determine the value of α_100 for each of the two cases given above (i.e. (w = 1, b = −0.2) and (w = 3, b = −0.2)). Give your answer to three decimal places. We encourage you to try to derive an explicit formula for α_t, but we'll give you full credit if you just write a program to compute it.
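For part ii, a program along the following lines is acceptable. This is only a minimal sketch: it assumes the energy convention E(x_1, x_2) = −w x_1 x_2 − b x_1 − b x_2 (so p(x) ∝ exp(w x_1 x_2 + b x_1 + b x_2)), which you should check against the definition used in lecture, and the function names (boltzmann_joint, gibbs_alpha) are illustrative only. Make sure the iteration it performs agrees with the recurrence you derive in part i.

```python
import itertools
import math

def boltzmann_joint(w, b):
    """Enumerate the four configurations of (x1, x2) in {-1, +1}^2 and
    return the normalized distribution, assuming the energy
    E(x1, x2) = -w*x1*x2 - b*x1 - b*x2 (verify against the lecture notes)."""
    configs = list(itertools.product([-1, 1], repeat=2))
    unnormalized = {x: math.exp(w * x[0] * x[1] + b * x[0] + b * x[1])
                    for x in configs}
    Z = sum(unnormalized.values())
    return {x: p / Z for x, p in unnormalized.items()}

def gibbs_alpha(w, b, T=100):
    """Iterate a recurrence for alpha_t, the probability that the value
    resampled on step t equals +1, starting from alpha_0 = 1."""
    joint = boltzmann_joint(w, b)

    def cond_pos(s):
        # p(x_i = +1 | x_j = s); by symmetry the same for i = 1 and i = 2
        return joint[(1, s)] / (joint[(1, s)] + joint[(-1, s)])

    alpha = 1.0
    for _ in range(T):
        # the other variable is +1 with probability alpha_{t-1}
        alpha = alpha * cond_pos(1) + (1.0 - alpha) * cond_pos(-1)
    return alpha

for w in (1.0, 3.0):
    joint = boltzmann_joint(w, b=-0.2)
    marginal = joint[(1, 1)] + joint[(1, -1)]
    print(f"w={w}: p(x1=1) = {marginal:.3f}, alpha_100 = {gibbs_alpha(w, -0.2):.3f}")
```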