Contrastive Divergence Learning
Geoffrey E. Hinton
A discussion led by Oliver Woodford

Contents
• Maximum Likelihood learning
• Gradient descent based approach
• Markov Chain Monte Carlo sampling
• Contrastive Divergence
• Further topics for discussion:
  – Result biasing of Contrastive Divergence
  – Product of Experts
  – High-dimensional data considerations

Maximum Likelihood learning
• Given:
  – Probability model: $p(x; \Theta) = \frac{1}{Z(\Theta)} f(x; \Theta)$
    • $\Theta$ – the model parameters
    • $Z(\Theta)$ – the partition function, defined as $Z(\Theta) = \int f(x; \Theta)\,dx$
  – Training data: $X = \{x_k\}_{k=1}^{K}$
• Aim:
  – Find the $\Theta$ that maximizes the likelihood of the training data:
    $p(X; \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k; \Theta)$
  – Or, equivalently, that minimizes the negative log of the likelihood (the energy):
    $E(X; \Theta) = -\tfrac{1}{K}\log p(X; \Theta) = \log Z(\Theta) - \frac{1}{K}\sum_{k=1}^{K} \log f(x_k; \Theta)$
• Toy example:
  – $f(x; \Theta) = \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, with $\Theta = \{\mu, \sigma\}$
  – Known result: $Z(\Theta) = \sigma\sqrt{2\pi}$

Maximum Likelihood learning
• Method: set the gradient to zero at the minimum, $\frac{\partial E(X; \Theta)}{\partial \Theta} = 0$, where
  $\frac{\partial E(X; \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K}\sum_{i=1}^{K} \frac{\partial \log f(x_i; \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_X$
  Here $\langle \cdot \rangle_X$ is the expectation of $\cdot$ given the data distribution $X$.
• Toy example:
  $\frac{\partial E(X; \Theta)}{\partial \Theta} = \frac{\partial \log(\sigma\sqrt{2\pi})}{\partial \Theta} + \left\langle \frac{\partial}{\partial \Theta}\frac{(x-\mu)^2}{2\sigma^2} \right\rangle_X$
  $\frac{\partial E(X; \Theta)}{\partial \mu} = -\left\langle \frac{x-\mu}{\sigma^2} \right\rangle_X = 0 \;\Rightarrow\; \mu = \langle x \rangle_X$
  $\frac{\partial E(X; \Theta)}{\partial \sigma} = \frac{1}{\sigma} - \left\langle \frac{(x-\mu)^2}{\sigma^3} \right\rangle_X = 0 \;\Rightarrow\; \sigma = \sqrt{\left\langle (x-\mu)^2 \right\rangle_X}$
• Let's assume that there is no such analytical solution…

Gradient descent-based approach
• Move a fixed step size, $\eta$, in the direction of steepest gradient. (Not line search – see why later.)
• This gives the following parameter update equation:
  $\Theta_{t+1} = \Theta_t - \eta\,\frac{\partial E(X; \Theta_t)}{\partial \Theta_t} = \Theta_t - \eta\left( \frac{\partial \log Z(\Theta_t)}{\partial \Theta_t} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_X \right)$

Gradient descent-based approach
• Recall $Z(\Theta) = \int f(x; \Theta)\,dx$. Sometimes this integral will be algebraically intractable.
• This means we can calculate neither $E(X; \Theta)$ nor $\frac{\partial \log Z(\Theta)}{\partial \Theta}$ (hence no line search).
• However, with some clever substitution:
  $\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)}\frac{\partial Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)}\frac{\partial}{\partial \Theta}\int f(x; \Theta)\,dx = \frac{1}{Z(\Theta)}\int \frac{\partial f(x; \Theta)}{\partial \Theta}\,dx$
  $= \frac{1}{Z(\Theta)}\int f(x; \Theta)\,\frac{\partial \log f(x; \Theta)}{\partial \Theta}\,dx = \int p(x; \Theta)\,\frac{\partial \log f(x; \Theta)}{\partial \Theta}\,dx = \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{p(x; \Theta)}$
• So $\Theta_{t+1} = \Theta_t - \eta\left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{p(x; \Theta_t)} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_X \right)$,
  where $\left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{p(x; \Theta_t)}$ can be estimated numerically.

Markov Chain Monte Carlo sampling
• To estimate $\left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{p(x; \Theta)}$ we must draw samples from $p(x; \Theta)$.
• Since $Z(\Theta)$ is unknown, we cannot draw samples randomly from a cumulative distribution curve.
• Markov Chain Monte Carlo (MCMC) methods turn random samples into samples from a proposed distribution, without knowing $Z(\Theta)$.
• Metropolis algorithm (a code sketch follows this section):
  – Perturb the samples, e.g. $x_k' = x_k + \mathrm{randn}(\mathrm{size}(x_k))$
  – Reject $x_k'$ if $\frac{p(x_k'; \Theta)}{p(x_k; \Theta)} < \mathrm{rand}(1)$
  – Repeat the cycle for all samples until the distribution stabilizes.
• Stabilization takes many cycles, and there is no accurate criterion for determining when it has occurred.

Markov Chain Monte Carlo sampling
• Let us use the training data, $X$, as the starting point for our MCMC sampling.
• Notation: $X^0_\Theta$ – the training data; $X^n_\Theta$ – the training data after $n$ cycles of MCMC; $X^\infty_\Theta$ – samples from the proposed distribution with parameters $\Theta$.
• Our parameter update equation becomes:
  $\Theta_{t+1} = \Theta_t - \eta\left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^\infty_{\Theta_t}} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^0_{\Theta_t}} \right)$
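The Metropolis acceptance test above needs only the ratio $p(x_k'; \Theta)/p(x_k; \Theta) = f(x_k'; \Theta)/f(x_k; \Theta)$, so the unknown $Z(\Theta)$ cancels. Below is a minimal NumPy sketch of such a sampler (not from the original slides); the function name `metropolis_cycles`, the Gaussian step size and the per-sample accept/reject scheme are illustrative assumptions.

```python
import numpy as np

def metropolis_cycles(f, x0, n_cycles=1, step=1.0, rng=None):
    """Run n_cycles of Metropolis updates on a set of samples.

    f        : unnormalised density f(x; Theta); the partition function Z(Theta)
               cancels in the acceptance ratio, so it is never evaluated.
    x0       : array of starting samples (e.g. the training data X).
    n_cycles : number of MCMC cycles (1 gives the X^1 samples used by CD-1).
    step     : standard deviation of the Gaussian perturbation.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_cycles):
        x_prop = x + step * rng.standard_normal(x.shape)    # perturb every sample
        accept = f(x_prop) / f(x) >= rng.random(x.shape)    # reject if ratio < rand(1)
        x = np.where(accept, x_prop, x)                     # keep old sample on rejection
    return x
```

For the toy Gaussian one might set `f = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2))`: many cycles from arbitrary starting samples approximate $X^\infty_\Theta$, while a single cycle started from the training data gives the $X^1_\Theta$ used by contrastive divergence below.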
Contrastive divergence
• Let us make the number of MCMC cycles per iteration small, say even 1.
• Our parameter update equation is now:
  $\Theta_{t+1} = \Theta_t - \eta\left( \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^1_{\Theta_t}} - \left\langle \frac{\partial \log f(x; \Theta_t)}{\partial \Theta_t} \right\rangle_{X^0_{\Theta_t}} \right)$
  (A worked code sketch of this update for the toy Gaussian appears at the end of this document.)
• Intuition: 1 MCMC cycle is enough to move the data from the target distribution towards the proposed distribution, and so suggest which direction the proposed distribution should move to better model the training data.

Contrastive divergence bias
• We assume:
  $\frac{\partial E(X; \Theta)}{\partial \Theta} \approx \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^1_\Theta} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^0_\Theta}$
• ML learning is equivalent to minimizing $X^0_\Theta \,\|\, X^\infty_\Theta$, where $P \,\|\, Q = \int p(x) \log \frac{p(x)}{q(x)}\,dx$ (the Kullback–Leibler divergence).
• CD attempts to minimize $X^0_\Theta \,\|\, X^\infty_\Theta - X^1_\Theta \,\|\, X^\infty_\Theta$:
  $\frac{\partial}{\partial \Theta}\left( X^0_\Theta \,\|\, X^\infty_\Theta - X^1_\Theta \,\|\, X^\infty_\Theta \right) = \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^1_\Theta} - \left\langle \frac{\partial \log f(x; \Theta)}{\partial \Theta} \right\rangle_{X^0_\Theta} - \frac{\partial X^1_\Theta}{\partial \Theta}\,\frac{\partial \left( X^1_\Theta \,\|\, X^\infty_\Theta \right)}{\partial X^1_\Theta}$
• Usually $\frac{\partial X^1_\Theta}{\partial \Theta}\,\frac{\partial \left( X^1_\Theta \,\|\, X^\infty_\Theta \right)}{\partial X^1_\Theta} \approx 0$, but it can sometimes bias the results.
• See "On Contrastive Divergence Learning", Carreira-Perpiñán & Hinton, AISTATS 2005, for more details.

Product of Experts

Dimensionality issues
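As a concrete illustration of the CD-1 update for the toy Gaussian (this sketch is not part of the original slides), the following function performs one parameter update. The single Metropolis cycle mirrors the sampler sketched earlier, and the learning rate `eta` is an arbitrary illustrative choice.

```python
import numpy as np

def cd1_update(x0, mu, sigma, eta=0.01, rng=None):
    """One CD-1 update of the toy Gaussian parameters Theta = {mu, sigma}.

    x0  : training data X^0 (1-D NumPy array).
    eta : learning rate (arbitrary illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))  # unnormalised model density

    # One Metropolis cycle started from the training data: X^0 -> X^1.
    x0 = np.asarray(x0, dtype=float)
    x_prop = x0 + rng.standard_normal(x0.shape)              # perturb each sample
    accept = f(x_prop) / f(x0) >= rng.random(x0.shape)       # reject if ratio < rand(1)
    x1 = np.where(accept, x_prop, x0)

    # <d log f / d Theta> averaged over a set of samples, for Theta = (mu, sigma):
    #   d log f / d mu    = (x - mu) / sigma^2
    #   d log f / d sigma = (x - mu)^2 / sigma^3
    def grads(x):
        return np.array([np.mean((x - mu) / sigma ** 2),
                         np.mean((x - mu) ** 2 / sigma ** 3)])

    # Theta_{t+1} = Theta_t - eta * ( <.>_{X^1} - <.>_{X^0} )
    step_mu, step_sigma = eta * (grads(x1) - grads(x0))
    return mu - step_mu, sigma - step_sigma
```

Iterating `cd1_update` from a rough initial guess should drive $\mu$ towards $\langle x \rangle_X$ and $\sigma$ towards $\sqrt{\langle (x-\mu)^2 \rangle_X}$, i.e. the ML solution derived earlier, up to the small bias term discussed above.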