Lecture 9: Continuous Latent Variable Models

CSC2515 Fall 2007: Introduction to Machine Learning
Example: continuous underlying variables
• What are the intrinsic latent dimensions in these two datasets?
• How can we find these dimensions from the data?
Dimensionality Reduction vs. Clustering
• Training continuous latent variable models is often called dimensionality reduction, since there are typically many fewer latent dimensions than observed ones
• Examples: Principal Components Analysis, Factor Analysis,
Independent Components Analysis
• Continuous causes often more efficient at representing information
than discrete
• For example, if there are two factors, each with about 256 settings, we can describe the latent causes with two 8-bit numbers
• If we instead try to cluster the data, we need on the order of 2^16 ≈ 65,000 clusters
Generative View
• Each data example generated by first selecting a point from a
distribution in the latent space, then generating a point from the
conditional distribution in the input space
• Simple models: Gaussian distributions in both latent and data space, with a linear relationship between the two
• This view underlies Probabilistic PCA, Factor Analysis
• Will return to this later
Standard PCA
• Used for data compression, visualization, feature extraction,
dimensionality reduction
• Algorithm: to find M components underlying D-dimensional data (see the NumPy sketch below)
– select the top M eigenvectors of S (the data covariance matrix)
– project each input vector x_n into this subspace, e.g., z_{n1} = u_1^T x_n
• Full projection onto M dimensions: z_n = (u_1^T x_n, ..., u_M^T x_n) = U_M^T x_n, where U_M = [u_1, ..., u_M]
• Two views/derivations:
– Maximize variance of the projected data
– Minimize reconstruction error (distance from each datapoint to its projection)
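A minimal NumPy sketch of this algorithm; the function name pca_project and the synthetic data are illustrative choices, not course code:

```python
import numpy as np

def pca_project(X, M):
    """Project the D-dimensional rows of X onto the top-M principal components."""
    x_bar = X.mean(axis=0)                    # sample mean
    Xc = X - x_bar                            # centred data
    S = Xc.T @ Xc / X.shape[0]                # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    U_M = eigvecs[:, order[:M]]               # top-M eigenvectors of S
    Z = Xc @ U_M                              # z_ni = u_i^T (x_n - x_bar); drop the
                                              # centring to match z_n1 = u_1^T x_n literally
    return Z, U_M, x_bar

# Illustrative synthetic data: 500 points in 10-D with 3 dominant directions
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
Z, U_M, x_bar = pca_project(X, M=3)
```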
Standard PCA: Variance maximization
• One-dimensional example
• Objective: maximize the projected variance w.r.t. u_1

  \frac{1}{N} \sum_{n=1}^{N} \{ u_1^T x_n - u_1^T \bar{x} \}^2 = u_1^T S u_1

– where the sample mean and data covariance are:

  \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T

• Must constrain ||u_1||: via a Lagrange multiplier, maximize w.r.t. u_1

  u_1^T S u_1 + \lambda (1 - u_1^T u_1)
• Optimal u1 is principal component (eigenvector with maximal
eigenvalue)
Standard PCA: Extending to higher dimensions
• Extend solution to additional latent components: find variance-maximizing directions orthogonal to the previous ones
• Equivalent to Gaussian approximation to data
• Think of Gaussian as football (hyperellipsoid)
– Mean is center of football
– Eigenvectors of covariance matrix are axes of football
– Eigenvalues are lengths of axes
• PCA can be thought of as fitting the football to the data:
maximize volume of data projections in M-dimensional subspace
• Alternative formulation: minimize error, equivalent to minimizing
average distance from datapoint to its projection in subspace
Standard PCA: Error minimization
• Data points represented by projection onto M-dimensional subspace,
plus some distortion:
• Objective: minimize the distortion J (reconstruction error of x_n) w.r.t. the basis {u_i}

  J = \frac{1}{N} \sum_{n=1}^{N} || x_n - \tilde{x}_n ||^2

– where each point is reconstructed from its M projection coefficients plus fixed components:

  \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i, \qquad z_{nj} = x_n^T u_j, \qquad b_j = \bar{x}^T u_j

– substituting back in gives

  J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=M+1}^{D} u_i^T S u_i
• The objective is minimized when the D-M components are the
eigenvectors of S with lowest eigenvalues → same result
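A small standalone numerical check of this equivalence: the minimum average distortion J equals the sum of the D − M smallest eigenvalues of S (the synthetic data is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(1000, 8))
N, D, M = X.shape[0], X.shape[1], 3

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)               # ascending eigenvalues
U_M = eigvecs[:, np.argsort(eigvals)[::-1][:M]]    # top-M eigenvectors

# x_tilde_n = x_bar + U_M U_M^T (x_n - x_bar), equivalent to the slide's formula
X_tilde = x_bar + (Xc @ U_M) @ U_M.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.isclose(J, np.sort(eigvals)[: D - M].sum()))   # True: sum of discarded eigenvalues
```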
Applying PCA to faces
• Need to first reduce dimensionality of inputs (will see in tutorial
how to handle high-dimensional inputs) – down-sample images
• Run PCA on 2429 19x19 grayscale images (CBCL database)
• Compresses the data: can get good reconstructions with only 3
components
• Pre-processing: can apply a classifier to the latent representation – PPCA with 3 components obtains 79% accuracy on face/non-face discrimination on test data vs. 76.8% for a mixture of Gaussians with 84 states
• Can be good for visualization
Applying PCA to faces: Learned basis
Probabilistic PCA
• Probabilistic, generative view of data
• Assumptions:
– underlying latent variable has a Gaussian distribution
– linear relationship between latent and observed variables
– isotropic Gaussian noise in observed dimensions
  p(z) = N(z | 0, I)
  p(x | z) = N(x | Wz + µ, σ^2 I)
  x = Wz + µ + ε
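A short sketch of this generative process; the dimensions, W, µ and σ^2 below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 5, 2, 100000
W = rng.normal(size=(D, M))                       # latent-to-data linear map
mu = rng.normal(size=D)                           # data mean
sigma2 = 0.1                                      # isotropic noise variance

z = rng.normal(size=(N, M))                       # z ~ N(0, I)
eps = np.sqrt(sigma2) * rng.normal(size=(N, D))   # eps ~ N(0, sigma^2 I)
X = z @ W.T + mu + eps                            # x = W z + mu + eps

# Empirical covariance approaches W W^T + sigma^2 I (the marginal covariance, next slide)
emp_cov = np.cov(X, rowvar=False, bias=True)
print(np.abs(emp_cov - (W @ W.T + sigma2 * np.eye(D))).max())   # small; shrinks as N grows
```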
Probabilistic PCA: Marginal data density
• Columns of W are the principal components, σ^2 is the sensor noise variance
• Product of Gaussians is Gaussian: the joint p(z,x), the marginal
data distribution p(x) and the posterior p(z|x) are also Gaussian
• Marginal data density (predictive distribution):

  p(x) = ∫ p(z) p(x|z) dz = N(x | µ, W W^T + σ^2 I)
• Can derive by completing square in exponent, or by just
computing mean and covariance given that it is Gaussian:
  E[x] = E[µ + Wz + ε] = µ + W E[z] + E[ε] = µ + W·0 + 0 = µ

  C = Cov[x] = E[(x − µ)(x − µ)^T]
    = E[(µ + Wz + ε − µ)(µ + Wz + ε − µ)^T]
    = E[(Wz + ε)(Wz + ε)^T]
    = W E[z z^T] W^T + E[ε ε^T] = W W^T + σ^2 I
Probabilistic PCA: Joint distribution
• Joint density for PPCA (x is D-dim., z is M-dim.): Gaussian over (z, x) with mean (0, µ) and covariance [[I, W^T], [W, W W^T + σ^2 I]]
– where the cross-covariance terms come from:

  Cov[z, x] = E[(z − 0)(x − µ)^T] = E[z (µ + Wz + ε − µ)^T] = E[z (Wz + ε)^T] = W^T
• Note that evaluating the predictive distribution involves inverting C: reduce O(D^3) to O(M^3) by applying the matrix inversion lemma:

  C^{-1} = σ^{-2} I − σ^{-2} W (W^T W + σ^2 I)^{-1} W^T
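A quick numerical check of the inversion-lemma identity above, with a random W and an arbitrary σ^2; only an M × M matrix is inverted explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 50, 3
W = rng.normal(size=(D, M))
sigma2 = 0.5

C = W @ W.T + sigma2 * np.eye(D)                     # D x D
M_mat = W.T @ W + sigma2 * np.eye(M)                 # M x M (the only explicit inverse)
C_inv = np.eye(D) / sigma2 - (W @ np.linalg.inv(M_mat) @ W.T) / sigma2

print(np.allclose(C_inv, np.linalg.inv(C)))          # True
```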
Probabilistic PCA: Posterior distribution
• Inference in PPCA produces posterior distribution over latent z
• Derive by applying Gaussian conditioning formulas (see 2.3 in
book) to joint distribution
  p(z|x) = N(z | m, V)
  m = W^T (W W^T + σ^2 I)^{-1} (x − µ)
  V = I − W^T (W W^T + σ^2 I)^{-1} W
• Mean of inferred z is projection of centered x – linear operation
• Posterior variance does not depend on the input x at all!
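A sketch of the posterior computation. It uses the algebraically equivalent forms m = M^{-1} W^T (x − µ) and V = σ^2 M^{-1} with M = W^T W + σ^2 I, which follow from the inversion lemma and avoid any D × D inverse; the helper name ppca_posterior is mine:

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Posterior p(z | x) = N(z | m, V) for probabilistic PCA."""
    M_latent = W.shape[1]
    M_mat = W.T @ W + sigma2 * np.eye(M_latent)   # M x M, cheap to invert
    M_inv = np.linalg.inv(M_mat)
    m = M_inv @ W.T @ (x - mu)                    # = W^T (W W^T + sigma^2 I)^{-1} (x - mu)
    V = sigma2 * M_inv                            # = I - W^T (W W^T + sigma^2 I)^{-1} W
    return m, V

rng = np.random.default_rng(0)
D, M = 5, 2
W, mu, sigma2 = rng.normal(size=(D, M)), np.zeros(D), 0.1
m, V = ppca_posterior(rng.normal(size=D), W, mu, sigma2)
# V is the same for every x, as noted above
```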
Standard PCA: Zero-noise limit of PPCA
• Can derive standard PCA as limit of Probabilistic PCA (PPCA) as
σ2 → 0.
• ML parameters W* are the same
• Inference is easier: orthogonal projection
  lim_{σ^2 → 0} W^T (W W^T + σ^2 I)^{-1} = (W^T W)^{-1} W^T
• Posterior covariance is zero
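A small numerical illustration of the limit: as σ^2 shrinks, the posterior-mean operator approaches the orthogonal projection (W^T W)^{-1} W^T (random W chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 6, 2
W = rng.normal(size=(D, M))
proj = np.linalg.inv(W.T @ W) @ W.T                  # orthogonal projection operator

for sigma2 in [1.0, 1e-2, 1e-6]:
    A = W.T @ np.linalg.inv(W @ W.T + sigma2 * np.eye(D))
    print(sigma2, np.max(np.abs(A - proj)))          # gap shrinks with sigma^2
```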
Probabilistic PCA: Constrained covariance
• Marginal density for PPCA (x is D-dim., z is M-dim):
  p(x | θ) = N(x | µ, W W^T + σ^2 I)
– where θ = {W, µ, σ^2}
• Effective covariance is low-rank outer product of two long skinny
matrices plus a constant diagonal matrix
  Cov[x] = W W^T + σ^2 I   (a D×M matrix times its transpose, plus a constant diagonal)
• So PPCA is just a constrained Gaussian model:
– Standard Gaussian has D + D(D+1)/2 effective parameters
– Diagonal-covariance Gaussian has D+D, but cannot capture correlations
– PPCA: DM + 1 – M(M-1)/2, can represent M most significant correlations
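As a worked count (D = 100 and M = 5 chosen only for illustration): a full covariance has D(D+1)/2 = 5050 free parameters, a diagonal covariance has D = 100, and the PPCA covariance has DM + 1 − M(M−1)/2 = 500 + 1 − 10 = 491.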
Probabilistic PCA: Maximizing likelihood
  L(θ; X) = log p(X | θ) = \sum_n log p(x_n | θ)

    = − (N/2) log |C| − (1/2) \sum_n (x_n − µ)^T C^{-1} (x_n − µ) + const

    = − (N/2) log |C| − (1/2) Tr[ C^{-1} \sum_n (x_n − µ)(x_n − µ)^T ] + const

    = − (N/2) log |C| − (N/2) Tr[ C^{-1} S ] + const
• Fit parameters θ = {W, µ, σ^2} by maximum likelihood: make the model covariance match the observed covariance; the "distance" is the trace of their ratio
• Sufficient statistics: mean µ = (1/N) \sum_n x_n and sample covariance S
• Can solve for the ML params directly: the kth column of W is the eigenvector of S with the kth largest eigenvalue λ_k, scaled by (λ_k − σ^2)^{1/2}; σ^2 is the average of the discarded eigenvalues (those beyond the Mth), as in the sketch below
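A sketch of that closed-form ML fit; this follows the standard PPCA solution with the rotation ambiguity fixed to the identity, and the names and test data are illustrative:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood PPCA fit (rotation ambiguity fixed to identity)."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[M:].mean()                        # average of the discarded eigenvalues
    W = U[:, :M] * np.sqrt(lam[:M] - sigma2)       # scale each retained eigenvector
    return W, mu, sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 10)) + 0.2 * rng.normal(size=(2000, 10))
W, mu, sigma2 = ppca_ml(X, M=3)
# W W^T + sigma2 * I matches S in the top-M eigendirections;
# the discarded directions all get the averaged variance sigma2
```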
Probabilistic PCA: EM
• Rather than solving directly, can apply EM
• Need complete-data log likelihood
  log p(X, Z | µ, W, σ^2) = \sum_n [ log p(x_n | z_n) + log p(z_n) ]
• E step: compute the expectation of the complete log likelihood with respect to the posterior over the latent variables z, using the current parameters – requires E[z_n] and E[z_n z_n^T] from the posterior p(z|x)
• M step: maximize with respect to the parameters W and σ^2
• Iterative solution: update the parameters given the current expectations, then the expectations given the current parameters (see the sketch below)
• Nice property – avoids the direct O(ND^2) construction of the covariance matrix; instead involves sums over data cases, O(NDM), and can be implemented online, without storing the data
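A compact sketch of this EM iteration using the standard PPCA EM updates (E[z_n] and \sum_n E[z_n z_n^T] in the E step, then W and σ^2 in the M step); the variable names and fixed iteration count are my choices:

```python
import numpy as np

def ppca_em(X, M, n_iters=200):
    """EM for probabilistic PCA; never forms the D x D covariance matrix."""
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    rng = np.random.default_rng(0)
    W, sigma2 = rng.normal(size=(D, M)), 1.0
    for _ in range(n_iters):
        # E step: posterior sufficient statistics
        M_mat = W.T @ W + sigma2 * np.eye(M)
        M_inv = np.linalg.inv(M_mat)
        Ez = Xc @ W @ M_inv                          # rows are E[z_n] (M_inv is symmetric)
        sum_Ezz = N * sigma2 * M_inv + Ez.T @ Ez     # sum_n E[z_n z_n^T]
        # M step: re-estimate W and sigma^2
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(sum_Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, mu, sigma2
```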
Probabilistic PCA: Why bother?
• Seems like a lot of formulas and algebra to arrive at a model similar to standard PCA, but…
• Leads to understanding of underlying data model, assumptions
(e.g., vs. standard Gaussian, other constrained forms)
• Derive EM version of inference/learning: more efficient
• Can understand other models as generalizations, modifications
• More readily extend to mixtures of PPCA models
• Principled method of handling missing values in data
• Can generate samples from data distribution
Factor Analysis
• Can be viewed as generalization of PPCA
• Historical aside – controversial method, based on attempts to
interpret factors: e.g., analysis of IQ data identified factors related
to race
• Assumptions:
– underlying latent variable has a Gaussian distribution
– linear relationship between latent and observed variables
– diagonal Gaussian noise in data dimensions
p(z) = N (z|0, I)
p(x|z) = N (x|Wz + µ, Ψ)
• W: factor loading matrix (D x M)
• Ψ: noise covariance in data space (diagonal, i.e., axis-aligned; vs. PPCA’s spherical σ^2 I)
Factor Analysis: Distributions
• As in PPCA, the joint p(z,x), the marginal data distribution p(x)
and the posterior p(z|x) are also Gaussian
• Marginal data density (predictive distribution):
  p(x) = ∫ p(z) p(x|z) dz = N(x | µ, W W^T + Ψ)

• Joint density: Gaussian over (z, x) with mean (0, µ) and covariance [[I, W^T], [W, W W^T + Ψ]]
• Posterior, derived via Gaussian conditioning
  p(z|x) = N(z | m, V)
  m = W^T (W W^T + Ψ)^{-1} (x − µ)
  V = I − W^T (W W^T + Ψ)^{-1} W
Factor Analysis: Optimization
• Parameters are coupled, making it impossible to solve for ML
parameters directly, unlike PCA
• Must use EM, or other nonlinear optimization
• E step: compute the posterior p(z|x) – use the matrix inversion lemma to convert D × D matrix inversions into M × M ones
• M step: take derivatives of expected complete log likelihood with
respect to parameters
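A sketch of EM for factor analysis under these assumptions (standard updates: posterior statistics in the E step, then W and the diagonal of Ψ in the M step); the helper name and iteration count are illustrative:

```python
import numpy as np

def factor_analysis_em(X, M, n_iters=200):
    """EM for factor analysis: diagonal noise covariance Psi, M latent factors."""
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / N                                # sample covariance (for the Psi update)
    rng = np.random.default_rng(0)
    W, Psi = rng.normal(size=(D, M)), np.ones(D)     # Psi stored as a length-D diagonal
    for _ in range(n_iters):
        # E step: posterior statistics, only an M x M inverse is needed
        G = np.linalg.inv(np.eye(M) + (W.T / Psi) @ W)   # (I + W^T Psi^{-1} W)^{-1}
        Ez = Xc @ (W / Psi[:, None]) @ G                 # rows are E[z_n] (G is symmetric)
        sum_Ezz = N * G + Ez.T @ Ez                      # sum_n E[z_n z_n^T]
        # M step: update W, then the diagonal of Psi
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        Psi = np.diag(S - W @ (Ez.T @ Xc) / N)
    return W, mu, Psi
```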
Factor Analysis vs. PCA: Rotations
• In PPCA, the data can be rotated without changing anything: multiply the data by an orthogonal (rotation) matrix Q and obtain the same fit to the data
  µ ← Qµ
  W ← QW
  σ^2 I ← Q (σ^2 I) Q^T = σ^2 I   (spherical noise is unchanged by rotation)
• But the scale is important
• PCA looks for directions of large variance, so it will grab large
noise directions
Factor Analysis vs. PCA: Scale
• In FA, the data can be re-scaled without changing anything
• Multiply x_i by α_i:
  µ_i ← α_i µ_i
  W_ij ← α_i W_ij
  Ψ_i ← α_i^2 Ψ_i
• But rotation in data space is important
• FA looks for directions of large correlation in the data, so it will
not model large variance noise
Factor Analysis : Identifiability
• Factors in FA are non-identifiable: not guaranteed to find same
set of parameters – not just local minimum but invariance
• Rotate W by any orthogonal Q and the model stays the same – W only appears in the model through the outer product W W^T
• Replace W with WQ: (WQ)(WQ)^T = W (Q Q^T) W^T = W W^T
• So no single best setting of parameters
• Degeneracy makes unique interpretation of learned factors
impossible
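A two-line numerical illustration of this invariance, with a random W and a random orthogonal Q:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))       # random orthogonal matrix
print(np.allclose(W @ W.T, (W @ Q) @ (W @ Q).T))   # True: identical model covariance
```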
Independent Components Analysis (ICA)
• ICA is another continuous latent variable model, but it has a
non-Gaussian and factorized prior on the latent variables
• Good in situations where most of the factors are small most of the time and do not interact with each other
• Example: mixtures of speech signals
• Learning problem same as before: find weights from factors to
observations, infer the unknown factor values for given input
• ICA: factors are called “sources”, learning is “unmixing”
ICA Intuition
• Since latent variables assumed to be independent, trying to find
linear transformation of data that recovers independent causes
• Avoid degeneracies in Gaussian latent variable models: assume
non-Gaussian prior distribution for latents (sources)
• Often we use heavy-tailed source priors, e.g.,
  p(z_j) = 1 / (π cosh(z_j)) = 2 / (π (exp(z_j) + exp(−z_j)))
• Geometric intuition: find spikes in histogram
ICA Details
• Simplest form of ICA has as many outputs as sources (square)
and no sensor noise on the outputs:
  p(z) = \prod_k p(z_k), \qquad x = Vz
• Learning in this case can be done with gradient descent (plus
some “covariant” tricks to make updates faster and more stable)
• If we keep V square and assume isotropic Gaussian noise on the outputs, there is a simple EM algorithm
• Much more complex cases have been studied also: non-square,
time delays, etc.
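A minimal sketch of maximum-likelihood square, noiseless ICA with a tanh score function (the derivative of the log of the heavy-tailed prior above), using one standard natural-gradient "covariant" update, ΔA ∝ (I − tanh(s)s^T)A, for the unmixing matrix A ≈ V^{-1}. This is a common choice of trick, not necessarily the exact algorithm from the course; the mixing matrix and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two heavy-tailed sources (Laplacian, for illustration) mixed by a square matrix V
N = 20000
Z = rng.laplace(size=(N, 2))
V = np.array([[2.0, 1.0], [1.0, 1.5]])
X = Z @ V.T                                        # observed mixtures, x = V z

A = np.eye(2)                                      # unmixing matrix estimate
lr = 0.1
for _ in range(500):
    S = X @ A.T                                    # current source estimates s = A x
    grad = np.eye(2) - (np.tanh(S).T @ S) / N      # natural-gradient direction
    A += lr * grad @ A

# If unmixing worked, A @ V should be close to a scaled permutation matrix
print(A @ V)
```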