Face Recognition
Ying Wu
[email protected]
Electrical and Computer Engineering
Northwestern University, Evanston, IL
http://www.ece.northwestern.edu/~yingwu

Recognizing Faces?
- Lighting
- View
  [Figure: example face images under varying lighting and viewpoints]

Outline
- Bayesian Classification
- Principal Component Analysis (PCA)
- Fisher Linear Discriminant Analysis (LDA)
- Independent Component Analysis (ICA)

Bayesian Classification
- Classifier & Discriminant Function
- Discriminant Function for Gaussian
- Bayesian Learning and Estimation

Classifier & Discriminant Function
- Discriminant functions $g_i(x)$, $i = 1, \dots, C$
- Classifier: assign $x \to \omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
- Examples:
  - $g_i(x) = p(\omega_i \mid x)$
  - $g_i(x) = p(x \mid \omega_i)\, p(\omega_i)$
  - $g_i(x) = \ln p(x \mid \omega_i) + \ln p(\omega_i)$
- The choice of discriminant function is not unique.
- Decision boundary: the set where the two largest discriminant functions are equal, $g_i(x) = g_j(x)$.

Multivariate Gaussian
- $p(x) \sim N(\mu, \Sigma)$
- The principal axes (the directions) are given by the eigenvectors of $\Sigma$.
- The lengths of the axes (the uncertainties) are given by the eigenvalues of $\Sigma$.
  [Figure: 2-D Gaussian contour in the $(x_1, x_2)$ plane with its principal axes]

Mahalanobis Distance
- Mahalanobis distance is a normalized (covariance-weighted) distance:
  $\|x - \mu\|_M^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$
- Two points can satisfy $\|x_1 - \mu\|^2 > \|x_2 - \mu\|^2$ in Euclidean distance and yet $\|x_1 - \mu\|_M = \|x_2 - \mu\|_M$, because the covariance stretches some directions more than others.

Whitening
- Whitening: find a linear transformation (rotation and scaling) such that the covariance becomes the identity matrix, i.e., the uncertainty of each component is the same.
- Let $y = A^T x$. Then $p(x) \sim N(\mu, \Sigma)$ implies $p(y) \sim N(A^T\mu, A^T \Sigma A)$.
- Solution: $A = U \Lambda^{-1/2}$, where $\Sigma = U \Lambda U^T$.

Discriminant Functions for Gaussians
- Minimum-error-rate classifier: $g_i(x) = \ln p(x \mid \omega_i) + \ln p(\omega_i)$
- For $p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$:
  $g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$

Case I: $\Sigma_i = \sigma^2 I$
- Dropping class-independent constants:
  $g_i(x) = -\dfrac{\|x - \mu_i\|^2}{2\sigma^2} + \ln p(\omega_i) = -\dfrac{1}{2\sigma^2}\left[x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right] + \ln p(\omega_i)$
- The $x^T x$ term is also constant across classes, so
  $g_i(x) = \dfrac{1}{\sigma^2}\mu_i^T x - \dfrac{1}{2\sigma^2}\mu_i^T \mu_i + \ln p(\omega_i) = W_i^T x + W_{i0}$,
  a linear discriminant function.
- Boundary: $g_i(x) = g_j(x) \Rightarrow W^T(x - x_0) = 0$, where
  $W = \mu_i - \mu_j, \qquad x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\dfrac{p(\omega_i)}{p(\omega_j)}\,(\mu_i - \mu_j)$

Example
- Assume $p(\omega_i) = p(\omega_j)$. The decision boundary reduces to
  $(\mu_i - \mu_j)^T\left(x - \dfrac{\mu_i + \mu_j}{2}\right) = 0$,
  i.e., the hyperplane through the midpoint of $\mu_i$ and $\mu_j$, perpendicular to $\mu_i - \mu_j$.

Case II: $\Sigma_i = \Sigma$
- $g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i) + \ln p(\omega_i)$
- $g_i(x) = (\Sigma^{-1}\mu_i)^T x + \left(-\tfrac{1}{2}\mu_i^T \Sigma^{-1}\mu_i + \ln p(\omega_i)\right) = W_i^T x + W_{i0}$
- The decision boundary is still linear: $W^T(x - x_0) = 0$, where
  $W = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\ln p(\omega_i) - \ln p(\omega_j)}{(\mu_i - \mu_j)^T \Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$

Case III: $\Sigma_i$ arbitrary
- $g_i(x) = x^T A_i x + W_i^T x + W_{i0}$, where
  $A_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad W_i = \Sigma_i^{-1}\mu_i, \qquad W_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$
- The decision boundaries are no longer linear, but hyperquadrics!
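The three cases above are all instances of the same quadratic form. Below is a minimal numpy sketch (not from the original slides) of the general Case III minimum-error-rate rule; the helper names and the two class means, covariances, and priors in the toy example are invented purely for illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """Case III discriminant for one Gaussian class:
    g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 ln|Sigma| + ln p(w_i).
    The constant -d/2 ln(2*pi) is dropped since it is the same for every class."""
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * maha2 - 0.5 * logdet + np.log(prior)

def classify(x, mus, Sigmas, priors):
    """Minimum-error-rate rule: pick the class with the largest discriminant."""
    scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))

# Toy 2-D example with two classes (made-up numbers, just for illustration)
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
print(classify(np.array([0.5, 0.2]), mus, Sigmas, priors))   # -> 0
print(classify(np.array([2.8, 3.1]), mus, Sigmas, priors))   # -> 1
```

When all covariances are set to $\sigma^2 I$ or to a shared $\Sigma$, the same code reproduces the linear boundaries of Cases I and II, since the dropped and shared terms cancel between classes.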
Bayesian Learning
- Learning means "training", i.e., estimating some unknowns from training data.
- WHY?
  - It is very difficult to specify these unknowns by hand.
  - Hopefully, these unknowns can be recovered from the examples we collect.

Maximum Likelihood Estimation
- Collected examples $D = \{x_1, x_2, \dots, x_n\}$.
- Estimate the unknown parameters $\theta$ so that the data likelihood is maximized.
- Likelihood: $p(D \mid \theta) = \prod_{k=1}^n p(x_k \mid \theta)$
- Log-likelihood: $L(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^n \ln p(x_k \mid \theta)$
- ML estimate: $\theta^* = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta L(\theta)$

Case I: unknown $\mu$ (Gaussian, known $\Sigma$)
- $\ln p(x_k \mid \mu) = -\tfrac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \tfrac{1}{2}(x_k - \mu)^T\Sigma^{-1}(x_k - \mu)$
- $\dfrac{\partial \ln p(x_k \mid \mu)}{\partial \mu} = \Sigma^{-1}(x_k - \mu)$
- Setting $\sum_{k=1}^n \Sigma^{-1}(x_k - \hat\mu) = 0$ gives $\hat\mu = \dfrac{1}{n}\sum_{k=1}^n x_k$.

Case II: unknown $\mu$ and $\sigma^2$ (univariate; let $\Delta = \sigma^2$)
- $\ln p(x_k \mid \mu, \Delta) = -\tfrac{1}{2}\ln(2\pi\Delta) - \tfrac{1}{2\Delta}(x_k - \mu)^2$
- $\dfrac{\partial \ln p(x_k \mid \mu, \Delta)}{\partial \mu} = \dfrac{1}{\Delta}(x_k - \mu)$, so $\sum_{k=1}^n \dfrac{1}{\hat\Delta}(x_k - \hat\mu) = 0$
- $\dfrac{\partial \ln p(x_k \mid \mu, \Delta)}{\partial \Delta} = -\dfrac{1}{2\Delta} + \dfrac{(x_k - \mu)^2}{2\Delta^2}$, so $-\sum_{k=1}^n \dfrac{1}{2\hat\Delta} + \sum_{k=1}^n \dfrac{(x_k - \hat\mu)^2}{2\hat\Delta^2} = 0$
- Solution: $\hat\mu = \dfrac{1}{n}\sum_{k=1}^n x_k, \qquad \hat\sigma^2 = \hat\Delta = \dfrac{1}{n}\sum_{k=1}^n (x_k - \hat\mu)^2$
- Generalizing to the multivariate case:
  $\hat\mu = \dfrac{1}{n}\sum_{k=1}^n x_k, \qquad \hat\Sigma = \dfrac{1}{n}\sum_{k=1}^n (x_k - \hat\mu)(x_k - \hat\mu)^T$

Bayesian Estimation
- Collected examples $D = \{x_1, \dots, x_n\}$, drawn independently from a fixed but unknown distribution $p(x)$.
- Bayesian learning uses $D$ to determine $p(x \mid D)$, i.e., to learn a p.d.f.
- $p(x)$ is unknown, but it has a parametric form with parameters $\theta \sim p(\theta)$.
- Difference from ML: in Bayesian learning, $\theta$ is not a value but a random variable, and we need to recover the distribution of $\theta$ rather than a single value.

Bayesian Estimation (cont.)
- $p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$
- This follows from the total probability rule: $p(x \mid D)$ is a weighted average over all $\theta$.
- If $p(\theta \mid D)$ peaks very sharply about some value $\theta^*$, then $p(x \mid D) \approx p(x \mid \theta^*)$.

The Univariate Case
- Assume $\mu$ is the only unknown, with $p(x \mid \mu) \sim N(\mu, \sigma^2)$.
- $\mu$ is a random variable with prior $p(\mu) \sim N(\mu_0, \sigma_0^2)$: $\mu_0$ is the best initial guess of $\mu$, and $\sigma_0$ its uncertainty.
- $p(\mu \mid D) \propto p(D \mid \mu)\, p(\mu) = \prod_{k=1}^n p(x_k \mid \mu)\, p(\mu)$, where $p(x_k \mid \mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma_0^2)$.
- $p(\mu \mid D) \propto \exp\left\{-\dfrac{1}{2}\left[\left(\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\dfrac{1}{\sigma^2}\sum_{k=1}^n x_k + \dfrac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}$
- So $p(\mu \mid D)$ is also a Gaussian, for any number of training examples.

The Univariate Case (cont.)
- Write $p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$ and let $\hat\mu_n = \dfrac{1}{n}\sum_{k=1}^n x_k$. Then
  $\mu_n = \dfrac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat\mu_n + \dfrac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0, \qquad \sigma_n^2 = \dfrac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$
- $\mu_n$ is the best guess for $\mu$ after observing $n$ examples; $\sigma_n$ measures the uncertainty of this guess.
- $p(\mu \mid D)$ becomes more and more sharply peaked as more examples are observed, i.e., the uncertainty decreases: for large $n$, $\sigma_n^2 \approx \sigma^2/n \to 0$.
  [Figure: the posterior $p(\mu \mid x_1, \dots, x_n)$ for $n = 1, 5, 10, 30$, becoming sharper as $n$ grows]

The Multivariate Case
- $p(x \mid \mu) \sim N(\mu, \Sigma)$, $\quad p(\mu) \sim N(\mu_0, \Sigma_0)$
- Let $p(\mu \mid D) \sim N(\mu_n, \Sigma_n)$ and $\hat\mu_n = \dfrac{1}{n}\sum_{k=1}^n x_k$. Then
  $\mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat\mu_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0$
  $\Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$
- And $p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \sim N(\mu_n, \Sigma + \Sigma_n)$.
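The following is a small numpy sketch of the two estimators just derived, for 1-D Gaussian data; the sample values, the prior $(\mu_0, \sigma_0^2)$, and the assumed-known noise variance are made-up illustration values, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=50)       # toy 1-D training set (made up)

# Maximum-likelihood estimates (Case II: unknown mu and sigma^2)
mu_ml = D.mean()                                   # mu_hat = (1/n) sum x_k
var_ml = ((D - mu_ml) ** 2).mean()                 # sigma_hat^2 = (1/n) sum (x_k - mu_hat)^2

# Bayesian estimate of mu (univariate case): prior mu ~ N(mu0, sigma0^2),
# likelihood x | mu ~ N(mu, sigma^2) with sigma^2 assumed known
mu0, sigma0_sq = 0.0, 4.0                          # prior guess and its uncertainty (assumed)
sigma_sq = 1.5 ** 2                                # assumed-known noise variance
n = len(D)
mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * D.mean() \
     + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)

print(f"ML:       mu={mu_ml:.3f}, var={var_ml:.3f}")
print(f"Bayesian: mu_n={mu_n:.3f}, sigma_n^2={sigma_n_sq:.4f}  (shrinks roughly as sigma^2/n)")
```

Rerunning with larger `n` shows the Bayesian posterior mean moving toward the sample mean while its variance shrinks, exactly as the slides describe.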
PCA and Eigenface
- Principal Component Analysis (PCA)
- Eigenface for Face Recognition

PCA: motivation
- Pattern vectors are generally confined within some low-dimensional subspace.
- Recall the basic idea of the Fourier transform: a signal is (de)composed as a linear combination of a set of basis signals with different frequencies.

PCA: idea
- Approximate every sample $x_k$ by a point on the line through the mean $m$ with direction $e$: $x \approx m + \alpha e$.
- $J(\alpha_1, \dots, \alpha_n, e) = \sum_{k=1}^n \|(m + \alpha_k e) - x_k\|^2 = \sum_k \alpha_k^2\|e\|^2 - 2\sum_k \alpha_k\, e^T(x_k - m) + \sum_k \|x_k - m\|^2$
- $\dfrac{\partial J}{\partial \alpha_k} = 2\alpha_k - 2e^T(x_k - m) = 0 \;\Rightarrow\; \alpha_k = e^T(x_k - m)$ (using $\|e\| = 1$)

PCA
- Substituting $\alpha_k$ back:
  $J(e) = \sum_k \alpha_k^2 - 2\sum_k \alpha_k^2 + \sum_k \|x_k - m\|^2 = -e^T S e + \sum_k \|x_k - m\|^2$, where $S = \sum_k (x_k - m)(x_k - m)^T$
- $\arg\min_e J(e) = \arg\max_e e^T S e$ subject to $\|e\| = 1$
- Lagrangian: $e^* = \arg\max_e\; e^T S e - \lambda(e^T e - 1)$; the stationarity condition is $Se = \lambda e$, and since $e^T e = 1$, $e^T S e = \lambda$.
- To maximize $e^T S e$, we need to select $\lambda_{\max}$ and its eigenvector.

Algorithm
Learning the principal components from $\{x_1, x_2, \dots, x_n\}$:
(1) $m = \dfrac{1}{n}\sum_{k=1}^n x_k$, $\quad A = [x_1 - m, \dots, x_n - m]$
(2) $S = \sum_{k=1}^n (x_k - m)(x_k - m)^T = A A^T$
(3) Eigenvalue decomposition: $S = U \Lambda U^T$
(4) Sort the eigenvalues $\lambda_i$ and eigenvectors $u_i$ in decreasing order of $\lambda_i$
(5) Keep the leading eigenvectors: $P = [u_1, u_2, \dots, u_m]$ (a sample is then projected as $P^T(x - m)$)

PCA for Face Recognition
- Training data $D = \{x_1, \dots, x_M\}$
  - Dimension: stacking the pixels together makes a vector of dimension $N$.
  - Preprocessing: cropping, normalization.
- These faces should lie in a "face" subspace.
- Questions:
  - What is the dimension of this subspace?
  - How do we identify this subspace?
  - How do we use it for recognition?

Eigenface
- The Eigenface approach: M. Turk and A. Pentland, 1991.

An Issue
- In general, $N \gg M$. However, $S$, the covariance matrix, is $N \times N$!
- Difficulties:
  - $S$ is ill-conditioned: $\mathrm{rank}(S) \ll N$.
  - The eigenvalue decomposition of $S$ is expensive when $N$ is large.
- Solution?

Solution I
- Do the eigenvalue decomposition of $A^T A$, which is only $M \times M$:
  $A^T A v = \lambda v \;\Rightarrow\; A A^T (A v) = \lambda (A v)$
- To see this clearly: $(A A^T)(A v) = \lambda (A v)$, i.e., if $v$ is an eigenvector of $A^T A$, then $Av$ is an eigenvector of $A A^T$ with the same eigenvalue.
- Note: of course, you still need to normalize $Av$ to make it a unit vector.

Solution II
- You can simply use the SVD (singular value decomposition) of $A = [x_1 - m, \dots, x_M - m]$:
  $A = U \Sigma V^T$, where
  - $A$: $N \times M$
  - $U$: $N \times M$, $U^T U = I$
  - $\Sigma$: $M \times M$ diagonal
  - $V$: $M \times M$, $V^T V = V V^T = I$
- The columns of $U$ are then the eigenvectors of $A A^T = S$.

Fisher Linear Discriminant
- LDA
- PCA + LDA for Face Recognition

When does PCA fail?
- [Figure: a two-class example in which the leading principal component does not separate the classes]

Linear Discriminant Analysis
- Find an optimal linear mapping $W$ that captures the major differences between classes and discounts irrelevant factors.
- In the mapped space, the data are clustered by class.

Within/between-class scatter
- Class means: $m_1 = \dfrac{1}{n_1}\sum_{x \in D_1} x$, $\quad m_2 = \dfrac{1}{n_2}\sum_{x \in D_2} x$
- The linear transform: $y = W^T x$
- Projected means: $\tilde m_1 = \dfrac{1}{n_1}\sum_{x \in D_1} W^T x = W^T m_1$, $\quad \tilde m_2 = W^T m_2$
- Scatter matrices: $S_1 = \sum_{x \in D_1}(x - m_1)(x - m_1)^T$, $\quad S_2 = \sum_{x \in D_2}(x - m_2)(x - m_2)^T$
- Projected scatters: $\tilde S_1 = \sum_{y \in Y_1}(y - \tilde m_1)^2 = W^T S_1 W$, $\quad \tilde S_2 = \sum_{y \in Y_2}(y - \tilde m_2)^2 = W^T S_2 W$
- Within-class scatter: $S_W = S_1 + S_2$
- Between-class scatter: $S_B = (m_1 - m_2)(m_1 - m_2)^T$

Fisher LDA
- $J(W) = \dfrac{|\tilde m_1 - \tilde m_2|^2}{\tilde S_1 + \tilde S_2} = \dfrac{W^T (m_1 - m_2)(m_1 - m_2)^T W}{W^T (S_1 + S_2) W} = \dfrac{W^T S_B W}{W^T S_W W}$
- $W^* = \arg\max_W J(W)$
- $\max J(W) \;\Leftrightarrow\; S_B w = \lambda S_W w$ — this is a generalized eigenvalue problem.

Solution I
- If $S_W$ is not singular: $S_W^{-1} S_B w = \lambda w$, so you can simply do an eigenvalue decomposition of $S_W^{-1} S_B$.

Solution II
- Noticing:
  - $S_B w$ is always in the direction of $m_1 - m_2$ (why?)
  - We are only concerned about the direction of the projection, not its scale.
- We have $w = S_W^{-1}(m_1 - m_2)$.

Multiple Discriminant Analysis
- $m_i = \dfrac{1}{n_i}\sum_{x \in D_i} x$, $\quad y = W^T x$, $\quad \tilde m_i = \dfrac{1}{n_i}\sum_{y \in Y_i} y = W^T m_i$, $\quad \tilde m = \dfrac{1}{n}\sum_{k=1}^C n_k \tilde m_k$
- $S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^T$, $\quad S_W = \sum_{k=1}^C S_k$, $\quad \tilde S_W = W^T S_W W$
- $\tilde S_B = \sum_{i=1}^C n_i (\tilde m_i - \tilde m)(\tilde m_i - \tilde m)^T = W^T S_B W$
- $W^* = \arg\max_W \dfrac{|\tilde S_B|}{|\tilde S_W|} = \arg\max_W \dfrac{|W^T S_B W|}{|W^T S_W W|}$, solved by $S_B w_i = \lambda_i S_W w_i$.

Comparing PCA and LDA
- [Figure: the same two-class data projected onto the PCA direction and onto the LDA direction]

MDA for Face Recognition
- Under lighting variation, PCA does not work well! (Why? The leading principal components capture the lighting changes that dominate the image variance, rather than identity.)
- Solution: PCA + MDA.
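As a rough illustration of the pipeline above (eigenfaces via the SVD of $A$, then a two-class Fisher direction $w = S_W^{-1}(m_1 - m_2)$ computed inside the PCA subspace, in the spirit of PCA+LDA), here is a numpy sketch. The "images" are random vectors standing in for cropped, normalized faces; the sizes, labels, and variable names are arbitrary placeholders, not a real face dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for face data: M images of N pixels each, stacked as columns.
N, M = 32 * 32, 40                        # N >> M, as in the slides
X = rng.normal(size=(N, M))
labels = np.array([0] * 20 + [1] * 20)    # two hypothetical identities

# --- Eigenfaces via SVD (Solution II): A = U S V^T, columns of U span the face subspace
m = X.mean(axis=1, keepdims=True)
A = X - m                                 # A = [x_1 - m, ..., x_M - m], N x M
U, sing, Vt = np.linalg.svd(A, full_matrices=False)
k = 10                                    # keep the leading k eigenfaces
P = U[:, :k]                              # N x k basis (the "eigenfaces")
Y = P.T @ A                               # k x M PCA coefficients of each face

# --- Fisher direction in the PCA subspace (two classes): w = S_W^{-1}(m1 - m2)
m1 = Y[:, labels == 0].mean(axis=1)
m2 = Y[:, labels == 1].mean(axis=1)
S_W = np.zeros((k, k))
for c, mc in ((0, m1), (1, m2)):
    Dc = Y[:, labels == c] - mc[:, None]  # centered coefficients of class c
    S_W += Dc @ Dc.T                      # sum of (y - m_c)(y - m_c)^T
w = np.linalg.solve(S_W, m1 - m2)         # only the direction matters

# Project a toy probe face and compare its 1-D Fisher score to the class means
probe = rng.normal(size=(N,))
score = w @ (P.T @ (probe[:, None] - m)).ravel()
print("closer to class", 0 if abs(score - w @ m1) < abs(score - w @ m2) else 1)
```

Projecting into the PCA subspace first is what makes $S_W$ a small, invertible $k \times k$ matrix even though the raw pixel dimension $N$ far exceeds the number of samples $M$.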
Independent Component Analysis
- The cross-talk problem
- ICA

Cocktail party
- [Figure: two source signals $s_1(t)$, $s_2(t)$ are mixed by $A$ into three observed signals $x_1(t)$, $x_2(t)$, $x_3(t)$]
- Can you recover $s_1(t)$ and $s_2(t)$ from $x_1(t)$, $x_2(t)$ and $x_3(t)$?

Formulation
- $x_j = a_{j1}s_1 + \dots + a_{jn}s_n$ for every $j$, i.e., $X = \sum_{i=1}^n a_i s_i$, or $X = AS$.
- Both $A$ and $S$ are unknown! Can you recover both $A$ and $S$ from $X$?

The Idea
- $Y = W^T X = W^T A S = Z^T S$, so $y$ is a linear combination of $\{s_i\}$.
- Recall the central limit theorem: a sum of even two independent random variables is more Gaussian than the original random variables.
- So $Z^T S$ is generally more Gaussian than any single $s_i$, and it becomes least Gaussian exactly when it equals one of the $\{s_i\}$. Amazing!
- ICA therefore looks for directions $W$ that make $W^T X$ as non-Gaussian as possible (see the sketch at the end of these notes).

Face Recognition: Challenges
- View (pose) variation
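As a sketch of the non-Gaussianity argument referenced from "The Idea" slide above: the toy code below mixes two independent non-Gaussian sources, whitens the mixtures, and searches unit directions $w$ for the largest absolute kurtosis of $w^T x$, one simple non-Gaussianity measure. This is only a brute-force illustration, not a real ICA algorithm such as FastICA, and every signal and matrix in it is made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent non-Gaussian sources (made-up signals) and a made-up mixing matrix
n = 20000
S = np.vstack([np.sign(rng.normal(size=n)) * rng.exponential(size=n),   # super-Gaussian
               rng.uniform(-1, 1, size=n)])                             # sub-Gaussian
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                                      # observed mixtures, X = A S

# Whiten the mixtures so that searching over unit vectors w is enough
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = np.cov(Xc)
U, lam, _ = np.linalg.svd(Sigma)
Z = (U @ np.diag(lam ** -0.5) @ U.T) @ Xc      # whitened data, cov(Z) ~ I

def abs_kurtosis(y):
    """|excess kurtosis| as a crude non-Gaussianity measure (0 for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return abs((y ** 4).mean() - 3.0)

# A mixture w^T Z is most non-Gaussian when it lines up with a single source
angles = np.linspace(0, np.pi, 180, endpoint=False)
best = max(angles, key=lambda a: abs_kurtosis(np.array([np.cos(a), np.sin(a)]) @ Z))
w = np.array([np.cos(best), np.sin(best)])
est = w @ Z

# Check: the recovered signal should correlate strongly with one of the true sources
print([abs(np.corrcoef(est, s)[0, 1]).round(3) for s in S])
```

One of the two printed correlations should be close to 1, showing that maximizing non-Gaussianity of a single projection recovers a single source, which is exactly the intuition the central-limit-theorem argument gives.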