Mixture of Gaussians
Sargur Srihari, [email protected]

Mixtures of Gaussians
• Also called finite mixture models (FMMs):
  p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)
• Goal: determine the three sets of parameters π, μ, Σ using maximum likelihood

Gaussian Mixture Model
• A linear superposition of Gaussian components
• Provides a richer class of density models than the single Gaussian
• GMMs are formulated in terms of discrete latent variables
  – This provides deeper insight
  – It motivates the EM algorithm, which gives a maximum likelihood solution for the component parameters (means, covariances and mixing coefficients)

GMM Formulation
• Linear superposition of K Gaussians:
  p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)
• Introduce a latent variable z that has K possible states
  – Use a 1-of-K representation: let z = (z_1,..,z_K), whose elements satisfy z_k ∈ {0,1} and \sum_k z_k = 1
  – There are K possible states of z, corresponding to the K components
• Define the joint distribution p(x, z) = p(x|z) p(z)
  – x is the observed variable, z is the hidden (missing) variable
  – p(x|z) is the conditional distribution and p(z) the marginal over the latent variable

Specifying p(z), z = (z_1,..,z_K)
• Associate a probability with each component z_k
  – Denote p(z_k = 1) = π_k, where the parameters satisfy 0 ≤ π_k ≤ 1 and \sum_k \pi_k = 1
• Because z uses the 1-of-K representation, it follows that
  p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
  – With one component p(z_1) = \pi_1^{z_1}; with two components p(z_1, z_2) = \pi_1^{z_1} \pi_2^{z_2}
  – This holds since the components are mutually exclusive

Conditional distribution p(x|z)
• For a particular component (value of z):
  p(x \mid z_k = 1) = N(x \mid \mu_k, \Sigma_k)
• Thus p(x|z) can be written in the form
  p(x \mid z) = \prod_{k=1}^{K} N(x \mid \mu_k, \Sigma_k)^{z_k}
  – Due to the exponent z_k, all product terms except one equal one
• The marginal distribution of x is obtained by summing over all possible states of z:
  p(x) = \sum_z p(z)\, p(x \mid z) = \sum_z \prod_{k=1}^{K} \left[ \pi_k N(x \mid \mu_k, \Sigma_k) \right]^{z_k} = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)
• This is the standard form of a Gaussian mixture

Latent Variable
• Suppose we have observations x_1,..,x_N
• Because the marginal distribution has the form p(x) = \sum_z p(x, z), it follows that for every observed data point x_n there is a corresponding latent vector z_n, i.e., its sub-class
• Thus we have a formulation of the Gaussian mixture involving an explicit latent variable
  – We can now work with the joint distribution p(x, z) instead of the marginal p(x)
  – This leads to significant simplification through the introduction of expectation maximization

Another conditional probability (Responsibility)
• In EM, p(z|x) plays a central role
• The probability p(z_k = 1 | x) is denoted γ(z_k)
• From Bayes' theorem, using p(x, z) = p(x|z) p(z):
  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x \mid \mu_j, \Sigma_j)}
• View p(z_k = 1) = π_k as the prior probability of component k, and γ(z_k) = p(z_k = 1 | x) as the corresponding posterior probability
  – γ(z_k) is also the responsibility that component k takes for explaining the observation x
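The responsibility formula above translates directly into a few lines of R. The sketch below is illustrative only: it assumes the mvtnorm package for the multivariate Gaussian density, and the function and argument names (responsibilities, pis, mus, Sigmas) are hypothetical, not from the slides.

```r
# Responsibilities gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
library(mvtnorm)   # provides dmvnorm() for the multivariate Gaussian density

responsibilities <- function(X, pis, mus, Sigmas) {
  # X: N x D data matrix; pis: length-K mixing coefficients;
  # mus: list of K mean vectors; Sigmas: list of K covariance matrices.
  K <- length(pis)
  # N x K matrix of weighted component densities pi_k N(x_n | mu_k, Sigma_k)
  dens <- sapply(1:K, function(k) pis[k] * dmvnorm(X, mean = mus[[k]], sigma = Sigmas[[k]]))
  dens / rowSums(dens)   # normalize each row; rows sum to 1
}
```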
Plan of Discussion
• Next we look at:
  1. How to synthesize data from a mixture model
  2. How to estimate the parameters from the data

Synthesizing data from a mixture
• Example: 500 points from three Gaussians
• Use ancestral sampling
  – Start with the lowest-numbered node and draw a sample: generate a sample of z, called ẑ
  – Move to the successor node and draw a sample given the parent value: generate a value for x from the conditional p(x | ẑ)
• Samples from p(x, z), plotted according to the value of x and colored by the value of z, form the complete data set
• Samples from the marginal p(x), obtained by ignoring the values of z, form the incomplete data set

Illustration of responsibilities
• Evaluate, for every data point, the posterior probability of each component
  – The responsibility γ(z_nk) is associated with data point x_n
• Color each point using proportions of red, blue and green ink
  – If for a data point γ(z_n1) = 1, it is colored red
  – If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green ink and appears cyan

Maximum Likelihood for GMM
• We wish to model a data set {x_1,..,x_N} using a mixture of Gaussians (N items, each of dimension D)
• Represent the data by an N × D matrix X whose nth row is x_n^T
• Represent the N latent variables by an N × K matrix Z whose rows are z_n^T
• The Bayesian network for p(x, z) has z_n as the parent of x_n
• We are given N i.i.d. samples {x_n} with unknown latent vectors {z_n}; the parameters are the mixing coefficients π, the means μ and the covariances Σ
• The goal is to estimate these three sets of parameters

Likelihood Function for GMM
• The mixture density function is
  p(x) = \sum_z p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)
  where the summation over z marginalizes the joint
• Therefore the likelihood function is
  p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}
  where the product is over the N i.i.d. samples
• Therefore the log-likelihood function is
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}
  which we wish to maximize
• This is a more difficult problem than for a single Gaussian because the sum over k appears inside the logarithm

Maximization of the Log-Likelihood
• The goal is to estimate the three sets of parameters π, μ, Σ by maximizing
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}
• Before proceeding with the m.l.e. we briefly mention two technical issues:
  – The problem of singularities
  – The identifiability of mixtures

Problem of Singularities with Gaussian mixtures
• Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I
• A component whose mean coincides with a data point, μ_j = x_n, contributes to the likelihood function the term
  N(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}} \frac{1}{\sigma_j}
  since the exponential factor equals 1 when μ_j = x_n
• As σ_j → 0 this term goes to infinity
• Therefore maximization of the log-likelihood is not a well-posed problem
  – One component can collapse onto a single data point, taking the likelihood to an arbitrarily large value, while the other components assign finite values to the remaining points
  – This does not happen with a single Gaussian: the multiplicative factors contributed by the other data points go to zero, taking the overall likelihood to zero rather than infinity
  – It also does not happen in the Bayesian approach
• In practice the problem is avoided using heuristics, e.g., resetting the mean or covariance of a collapsing component
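For concreteness, the log-likelihood above can be evaluated with a short R function. This is a minimal sketch assuming the mvtnorm package and the same illustrative parameter layout as the responsibilities() sketch earlier; the function name gmm_loglik is not from the slides.

```r
# Log-likelihood ln p(X | pi, mu, Sigma) = sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }
library(mvtnorm)

gmm_loglik <- function(X, pis, mus, Sigmas) {
  K <- length(pis)
  dens <- sapply(1:K, function(k) pis[k] * dmvnorm(X, mean = mus[[k]], sigma = Sigmas[[k]]))
  sum(log(rowSums(dens)))   # sum over the N i.i.d. points of the log of the inner mixture sum
}
# If one component's mean sits exactly on a data point and its covariance shrinks
# toward zero, the corresponding rowSums(dens) entry blows up: this is the
# singularity discussed above, which practical implementations guard against.
```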
Problem of Identifiability
• A density p(x | θ) is identifiable if θ ≠ θ' implies there is an x for which p(x | θ) ≠ p(x | θ')
• A K-component mixture has a total of K! equivalent solutions
  – Corresponding to the K! ways of assigning the K sets of parameters to the K components
  – E.g., for K = 3, K! = 6: 123, 132, 213, 231, 312, 321 (two of the equivalent labelings of three Gaussian sub-classes are A, B, C and A, C, B)
  – For any given point in the space of parameter values there will be a further K!−1 additional points all giving exactly the same distribution
• However, any of the equivalent solutions is as good as the others

EM for Gaussian Mixtures
• EM is a method for finding maximum likelihood solutions for models with latent variables
• Begin with the log-likelihood function
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}
• We wish to find π, μ, Σ that maximize this quantity
• Take derivatives in turn with respect to
  – the means μ_k, and set them to zero
  – the covariance matrices Σ_k, and set them to zero
  – the mixing coefficients π_k, and set them to zero

EM for GMM: derivative with respect to μ_k
• Take the derivative of the log-likelihood with respect to the means μ_k and set it to zero
  – Making use of the exponential form of the Gaussian
  – Using the formulas \frac{d}{dx} \ln u = \frac{u'}{u} and \frac{d}{dx} e^{u} = e^{u} u'
• We get
  0 = \sum_{n=1}^{N} \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_n \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1} (x_n - \mu_k)
  – The fraction is the posterior probability γ(z_nk), and Σ_k^{-1} is the inverse of the covariance matrix

M.L.E. solution for the means
• Multiplying through by Σ_k (assuming non-singularity) and rearranging gives
  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
  where we have defined
  N_k = \sum_{n=1}^{N} \gamma(z_{nk})
  – N_k is the effective number of points assigned to cluster k
• The mean of the kth Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n

M.L.E. solution for the covariances
• Set the derivative with respect to Σ_k to zero
  – Making use of the m.l.e. solution for the covariance matrix of a single Gaussian:
  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
  – Similar to the result for a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability
  – The denominator is the effective number of points in the component

M.L.E. solution for the mixing coefficients
• Maximize \ln p(X \mid \pi, \mu, \Sigma) with respect to π_k
  – Must take into account that the mixing coefficients sum to one
  – Achieved using a Lagrange multiplier and maximizing
  \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)
  – Setting the derivative with respect to π_k to zero and solving gives
  \pi_k = \frac{N_k}{N}
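Given a matrix of responsibilities, the three re-estimation equations above amount to a few lines of R. This is a minimal sketch; the function name m_step and the argument gam (an N × K responsibility matrix, e.g., from the responsibilities() sketch earlier) are illustrative.

```r
# M-step re-estimation: mu_k, Sigma_k and pi_k from the current responsibilities
m_step <- function(X, gam) {
  N <- nrow(X); K <- ncol(gam)
  Nk <- colSums(gam)                                              # effective number of points per component
  mus <- lapply(1:K, function(k) colSums(gam[, k] * X) / Nk[k])   # responsibility-weighted means
  Sigmas <- lapply(1:K, function(k) {
    Xc <- sweep(X, 2, mus[[k]])                                   # center the data at the new mean
    t(Xc) %*% (gam[, k] * Xc) / Nk[k]                             # responsibility-weighted covariance
  })
  pis <- Nk / N                                                   # mixing coefficients
  list(pis = pis, mus = mus, Sigmas = Sigmas)
}
```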
Summary of m.l.e. expressions
• The GMM maximum likelihood estimates are
  Means: \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
  Covariance matrices: \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
  Mixing coefficients: \pi_k = \frac{N_k}{N}
  where N_k = \sum_{n=1}^{N} \gamma(z_{nk})
• All three are expressed in terms of the responsibilities

EM Formulation
• The results for μ_k, Σ_k, π_k are not closed-form solutions for the parameters
  – The responsibilities γ(z_nk) depend on those parameters in a complex way
• The results suggest an iterative solution
• This is an instance of the EM algorithm for the particular case of the GMM

Informal EM for GMM
• First choose initial values for the means, covariances and mixing coefficients
• Then alternate between two updates, called the E step and the M step
  – In the E step, use the current values of the parameters to evaluate the posterior probabilities (responsibilities)
  – In the M step, use these posterior probabilities to re-estimate the means, covariances and mixing coefficients

EM using the Old Faithful data
• (Figure: the data points with the initial mixture model; the initial E step determining responsibilities; the parameters re-evaluated after the first M step; and the fit after 2, 5 and 20 cycles)

Comparison with K-Means
• (Figure: the K-means result and the EM result on the same data)

Animation of EM for the Old Faithful data
• http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
• Code in R:

```r
# initial parameter estimates (chosen to be deliberately bad)
theta <- list(
  tau    = c(0.5, 0.5),
  mu1    = c(2.8, 75),
  mu2    = c(3.6, 58),
  sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
  sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2)
)
```

Practical Issues with EM
• EM takes many more iterations to converge than K-means
• Each cycle requires significantly more computation
• It is common to run K-means first in order to find a suitable initialization
  – Covariance matrices can be initialized to the covariances of the clusters found by K-means
• EM is not guaranteed to find the global maximum of the log-likelihood function

Summary of EM for GMM
• Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (means, covariances and mixing coefficients)
• Step 1: Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k, and evaluate the initial value of the log-likelihood
• Step 2 (E step): Evaluate the responsibilities using the current parameter values:
  \gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}
• Step 3 (M step): Re-estimate the parameters using the current responsibilities:
  \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
  \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
  \pi_k^{new} = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
• Step 4: Evaluate the log-likelihood
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}
  and check for convergence of either the parameters or the log-likelihood
  – If the convergence criterion is not satisfied, return to Step 2
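Putting Steps 2 to 4 together gives the iterative procedure. The sketch below is a minimal loop, assuming the responsibilities(), m_step() and gmm_loglik() sketches from earlier sections have been defined; the convergence tolerance and iteration cap are illustrative choices.

```r
# EM loop for a GMM: alternate E and M steps and monitor the log-likelihood
gmm_em <- function(X, pis, mus, Sigmas, max_iter = 200, tol = 1e-6) {
  ll_old <- -Inf
  for (iter in 1:max_iter) {
    gam <- responsibilities(X, pis, mus, Sigmas)   # Step 2: E step
    p   <- m_step(X, gam)                          # Step 3: M step
    pis <- p$pis; mus <- p$mus; Sigmas <- p$Sigmas
    ll  <- gmm_loglik(X, pis, mus, Sigmas)         # Step 4: evaluate log-likelihood
    if (abs(ll - ll_old) < tol) break              # convergence check
    ll_old <- ll
  }
  list(pis = pis, mus = mus, Sigmas = Sigmas, loglik = ll, iterations = iter)
}
```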
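As a usage sketch, the loop can be run on R's built-in Old Faithful data (the faithful data frame), starting either from the deliberately poor initialization quoted above or from a K-means warm start as suggested in the practical notes. Everything here is illustrative and not part of the original slides.

```r
X <- as.matrix(faithful)   # built-in data set; columns: eruptions, waiting

# Start from the deliberately bad initialization given above
fit <- gmm_em(X,
              pis    = c(0.5, 0.5),
              mus    = list(c(2.8, 75), c(3.6, 58)),
              Sigmas = list(matrix(c(0.8, 7, 7, 70), ncol = 2),
                            matrix(c(0.8, 7, 7, 70), ncol = 2)))
fit$mus          # estimated component means
fit$iterations   # number of EM cycles used

# K-means warm start: cluster means, within-cluster covariances and proportions
km      <- kmeans(X, centers = 2)
mus0    <- lapply(1:2, function(k) colMeans(X[km$cluster == k, , drop = FALSE]))
Sigmas0 <- lapply(1:2, function(k) cov(X[km$cluster == k, , drop = FALSE]))
pis0    <- as.numeric(table(km$cluster)) / nrow(X)
fit_km  <- gmm_em(X, pis0, mus0, Sigmas0)
```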