
Face Recognition
Ying Wu
[email protected]
Electrical and Computer Engineering
Northwestern University, Evanston, IL
http://www.ece.northwestern.edu/~yingwu
Recognizing Faces?
(Figure: example face images under varying lighting and viewpoint)
Outline
• Bayesian Classification
• Principal Component Analysis (PCA)
• Fisher Linear Discriminant Analysis (LDA)
• Independent Component Analysis (ICA)
Bayesian Classification
• Classifier & Discriminant Function
• Discriminant Function for Gaussian
• Bayesian Learning and Estimation
Classifier & Discriminant Function
• Discriminant functions $g_i(x)$, $i = 1, \dots, C$
• Classifier: assign $x \to \omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
• Examples:
  $g_i(x) = p(\omega_i \mid x)$
  $g_i(x) = p(x \mid \omega_i)\, p(\omega_i)$
  $g_i(x) = \ln p(x \mid \omega_i) + \ln p(\omega_i)$
• The choice of discriminant function is not unique
• Decision boundary: the locus where $g_i(x) = g_j(x)$
Multivariate Gaussian
$p(x) \sim N(\mu, \Sigma)$
(Figure: level sets of a 2-D Gaussian in the $(x_1, x_2)$ plane with its principal axes)
• The principal axes (the directions) are given by the eigenvectors of $\Sigma$
• The lengths of the axes (the uncertainty) are given by the eigenvalues of $\Sigma$
Mahalanobis Distance
(Figure: two points $x_1$, $x_2$ around the mean $\mu$ of an elongated Gaussian)
$\|x_1 - \mu\|^2 > \|x_2 - \mu\|^2$, and yet $\|x_1 - \mu\|_M = \|x_2 - \mu\|_M$
The Mahalanobis distance is a normalized (covariance-weighted) distance:
$\|x - \mu\|_M^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$
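As a small aside (not from the slides), here is a minimal NumPy sketch of this distance; the mean, covariance, and points below are made-up values:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    # Solve Sigma z = d rather than forming the explicit inverse (more stable).
    return float(d @ np.linalg.solve(Sigma, d))

# Points at very different Euclidean distances from mu can be equally "far"
# once the shape of the Gaussian (the covariance) is taken into account.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],    # large variance along x1
                  [0.0, 0.25]])  # small variance along x2
x1 = np.array([2.0, 0.0])
x2 = np.array([0.0, 0.5])
print(mahalanobis_sq(x1, mu, Sigma), mahalanobis_sq(x2, mu, Sigma))  # 1.0 and 1.0
```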
Whitening
• Whitening: find a linear transformation (rotation and scaling) such that the covariance becomes the identity matrix, i.e., the uncertainty is the same for each component
(Figure: an elongated Gaussian in $(x_1, x_2)$ mapped to an isotropic one in $(y_1, y_2)$)
$y = A^T x: \quad p(x) \sim N(\mu, \Sigma) \;\mapsto\; p(y) \sim N(A^T\mu, A^T \Sigma A)$
Solution: $A = U \Lambda^{-1/2}$, where $\Sigma = U \Lambda U^T$
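A brief NumPy sketch of this whitening transform (illustration only; the toy covariance and sample size are assumptions):

```python
import numpy as np

def whitening_matrix(Sigma):
    """A = U Lambda^{-1/2}, where Sigma = U Lambda U^T (eigendecomposition)."""
    lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
    return U @ np.diag(1.0 / np.sqrt(lam))  # A = U Lambda^{-1/2}

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=5000).T  # 2 x n samples
A = whitening_matrix(np.cov(X))
Y = A.T @ X                        # y = A^T x
print(np.cov(Y).round(2))          # approximately the identity matrix
```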
Disc. Func. for Gaussian
• Minimum-error-rate classifier:
$g_i(x) = \ln p(x \mid \omega_i) + \ln p(\omega_i)$
$g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$
Case I: $\Sigma_i = \sigma^2 I$
$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln p(\omega_i) = -\frac{1}{2\sigma^2}\left[x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right] + \ln p(\omega_i)$
The $x^T x$ term is constant across classes and can be dropped, giving
$g_i(x) = \left(\frac{1}{\sigma^2}\mu_i\right)^T x + \left(-\frac{1}{2\sigma^2}\mu_i^T \mu_i + \ln p(\omega_i)\right) = W_i^T x + W_{i0}$
a linear discriminant function.
Boundary:
$g_i(x) = g_j(x) \;\Rightarrow\; W^T(x - x_0) = 0$
where
$W = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\ln\frac{p(\omega_i)}{p(\omega_j)}\,(\mu_i - \mu_j)$
Example
• Assume $p(\omega_i) = p(\omega_j)$ and derive the decision boundary:
$(\mu_i - \mu_j)^T\left(x - \frac{\mu_i + \mu_j}{2}\right) = 0$
(Figure: the boundary is the hyperplane through the midpoint of $\mu_i$ and $\mu_j$, perpendicular to the line joining them)
Case II: $\Sigma_i = \Sigma$
$g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i) + \ln p(\omega_i)$
Dropping the class-independent term $-\tfrac{1}{2}x^T \Sigma^{-1} x$:
$g_i(x) = (\Sigma^{-1}\mu_i)^T x + \left(-\tfrac{1}{2}\mu_i^T \Sigma^{-1}\mu_i + \ln p(\omega_i)\right) = W_i^T x + W_{i0}$
The decision boundary is still linear: $W^T(x - x_0) = 0$, where
$W = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln p(\omega_i) - \ln p(\omega_j)}{(\mu_i - \mu_j)^T \Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)$
Case III: $\Sigma_i$ arbitrary
$g_i(x) = x^T A_i x + W_i^T x + W_{i0}$
where
$A_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad W_i = \Sigma_i^{-1}\mu_i, \qquad W_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$
The decision boundaries are no longer linear, but hyperquadrics!
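To make the three cases concrete, here is a hedged NumPy sketch of the general (Case III) Gaussian discriminant; the class means, covariances, and priors below are hypothetical:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln p(w_i)."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return -0.5 * quad - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)

def classify(x, means, covs, priors):
    """Assign x to the class with the largest discriminant value."""
    scores = [gaussian_discriminant(x, m, S, p)
              for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Two hypothetical classes with different covariances -> quadratic boundary.
means  = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
covs   = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]
print(classify(np.array([0.5, 0.2]), means, covs, priors))  # -> 0
```

With equal covariances the quadratic terms cancel and the same code reduces to the linear boundaries of Cases I and II.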
Bayesian Learning
• Learning means "training", i.e., estimating some unknowns from "training data"
• WHY?
  – It is very difficult to specify these unknowns by hand
  – Hopefully, these unknowns can be recovered from the collected examples
Maximum Likelihood Estimation
• Collected examples $D = \{x_1, x_2, \dots, x_n\}$
• Estimate the unknown parameters $\theta$ such that the data likelihood is maximized
• Likelihood: $p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$
• Log-likelihood: $L(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$
• ML estimate: $\theta^* = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta L(\theta)$
Case I: unknown µ
$\ln p(x_k \mid \mu) = -\tfrac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \tfrac{1}{2}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)$
$\frac{\partial \ln p(x_k \mid \mu)}{\partial \mu} = \Sigma^{-1}(x_k - \mu)$
$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0 \;\Leftrightarrow\; \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$
Case II: unknown µ and ∑
Work in one dimension and let $\Delta = \sigma^2$:
$\ln p(x_k \mid \mu, \Delta) = -\tfrac{1}{2}\ln 2\pi\Delta - \tfrac{1}{2\Delta}(x_k - \mu)^2$
$\frac{\partial \ln p(x_k \mid \mu, \Delta)}{\partial \mu} = \frac{1}{\Delta}(x_k - \mu), \qquad \frac{\partial \ln p(x_k \mid \mu, \Delta)}{\partial \Delta} = -\frac{1}{2\Delta} + \frac{(x_k - \mu)^2}{2\Delta^2}$
Setting the summed derivatives to zero,
$\sum_{k=1}^{n} \frac{1}{\hat{\Delta}}(x_k - \hat{\mu}) = 0, \qquad -\sum_{k=1}^{n} \frac{1}{2\hat{\Delta}} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{2\hat{\Delta}^2} = 0$
gives
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \hat{\Delta} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$
Generalizing to the multivariate case:
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^T$
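A quick NumPy sketch of these ML estimates (illustration; the data matrix X with one sample per row is an assumption):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates of a multivariate Gaussian: sample mean and 1/n covariance."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n   # note the 1/n, not 1/(n-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=2000)
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat.round(2), Sigma_hat.round(2))
```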
Bayesian Estimation
• Collected examples $D = \{x_1, x_2, \dots, x_n\}$, drawn independently from a fixed but unknown distribution $p(x)$
• Bayesian learning uses $D$ to determine $p(x \mid D)$, i.e., to learn a p.d.f.
• $p(x)$ is unknown, but has a parametric form with parameters $\theta \sim p(\theta)$
• Difference from ML: in Bayesian learning, $\theta$ is not a single value but a random variable, and we need to recover the distribution of $\theta$ rather than a point estimate
Bayesian Estimation
$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$
• This follows from the total probability rule: $p(x \mid D)$ is a weighted average of $p(x \mid \theta)$ over all $\theta$
• If $p(\theta \mid D)$ peaks very sharply about some value $\theta^*$, then $p(x \mid D) \approx p(x \mid \theta^*)$
The Univariate Case
• Assume $\mu$ is the only unknown, $p(x \mid \mu) \sim N(\mu, \sigma^2)$
• $\mu$ is a random variable with prior $p(\mu) \sim N(\mu_0, \sigma_0^2)$, i.e., $\mu_0$ is the best prior guess of $\mu$ and $\sigma_0$ is its uncertainty
$p(\mu \mid D) \propto p(D \mid \mu)\, p(\mu) = \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$
where $p(x_k \mid \mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma_0^2)$, so
$p(\mu \mid D) \propto \exp\left\{-\tfrac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}$
$p(\mu \mid D)$ is thus a Gaussian for any number of training examples
The Univariate Case
$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$, where
$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$
with $\hat{\mu}_n$ the sample mean of the $n$ examples.
• $\mu_n$ is the best guess for $\mu$ after observing $n$ examples; $\sigma_n$ measures the uncertainty of this guess
• $p(\mu \mid D)$ becomes more and more sharply peaked as more examples are observed, i.e., the uncertainty decreases: $\sigma_n^2 \to \frac{\sigma^2}{n}$
(Figure: $p(\mu \mid x_1, \dots, x_n)$ for $n = 1, 5, 10, 30$, peaking more sharply as $n$ grows)
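A small NumPy sketch of this univariate posterior update (illustration; the prior values mu0, sigma0 and the simulated data are assumptions):

```python
import numpy as np

def posterior_mu(x, mu0, sigma0, sigma):
    """Posterior N(mu_n, sigma_n^2) of the mean, with known noise std sigma."""
    n = len(x)
    mu_hat = np.mean(x)                                  # sample mean of the n examples
    denom = n * sigma0**2 + sigma**2
    mu_n = (n * sigma0**2 / denom) * mu_hat + (sigma**2 / denom) * mu0
    var_n = sigma0**2 * sigma**2 / denom
    return mu_n, var_n

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)         # true mu = 2, sigma = 1
for n in (1, 5, 10, 30, 1000):
    mu_n, var_n = posterior_mu(data[:n], mu0=0.0, sigma0=1.0, sigma=1.0)
    print(n, round(mu_n, 3), round(var_n, 4))            # variance shrinks toward sigma^2/n
```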
The Multivariate Case
$p(x \mid \mu) \sim N(\mu, \Sigma), \qquad p(\mu) \sim N(\mu_0, \Sigma_0)$
Let $p(\mu \mid D) \sim N(\mu_n, \Sigma_n)$. Then
$\mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0$
$\Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$
where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$, and
$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \sim N(\mu_n, \Sigma + \Sigma_n)$
PCA and Eigenface
• Principal Component Analysis (PCA)
• Eigenface for Face Recognition
PCA: motivation
• Pattern vectors are generally confined to some low-dimensional subspace
• Recall the basic idea of the Fourier transform:
  – A signal is decomposed into a linear combination of a set of basis signals with different frequencies
PCA: idea
Approximate each sample $x_k$ by a point on the line through $m$ with direction $e$: $x = m + \alpha e$
$J(\alpha_1, \dots, \alpha_n, e) = \sum_{k=1}^{n}\|(m + \alpha_k e) - x_k\|^2 = \sum_k \alpha_k^2\|e\|^2 - 2\sum_k \alpha_k e^T(x_k - m) + \sum_k\|x_k - m\|^2$
$\frac{\partial J}{\partial \alpha_k} = 2\alpha_k - 2e^T(x_k - m) = 0 \;\Rightarrow\; \alpha_k = e^T(x_k - m)$
PCA
Substituting $\alpha_k = e^T(x_k - m)$ back:
$J(e) = \sum_k \alpha_k^2 - 2\sum_k \alpha_k^2 + \sum_k\|x_k - m\|^2 = -e^T\left[\sum_k (x_k - m)(x_k - m)^T\right]e + \sum_k\|x_k - m\|^2 = -e^T S e + \sum_k\|x_k - m\|^2$
$\arg\min_e J(e) = \arg\max_e e^T S e \quad \text{s.t. } \|e\| = 1$
$e^* = \arg\max_e\; e^T S e + \lambda(e^T e - 1) \;\Rightarrow\; S e - \lambda e = 0, \text{ i.e., } e^T S e = \lambda$
To maximize $e^T S e$, we need to select the eigenvector associated with $\lambda_{\max}$
Algorithm
• Learning the principal components from $\{x_1, x_2, \dots, x_n\}$:
(1) $m = \frac{1}{n}\sum_{k=1}^{n} x_k$, $\quad A = [x_1 - m, \dots, x_n - m]$
(2) $S = \sum_{k=1}^{n}(x_k - m)(x_k - m)^T = A A^T$
(3) eigenvalue decomposition $S = U \Lambda U^T$
(4) sort the eigenvalues $\lambda_i$ and eigenvectors $u_i$ in decreasing order
(5) $P = [u_1, u_2, \dots, u_m]^T$
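A compact NumPy sketch of this algorithm (illustration only; the data matrix X with one sample per column and the number of components m are assumptions):

```python
import numpy as np

def pca(X, m):
    """Principal components of X (d x n, one sample per column).

    Returns the mean and the m x d projection matrix P whose rows are the
    leading eigenvectors of the scatter matrix S = A A^T.
    """
    mean = X.mean(axis=1, keepdims=True)
    A = X - mean                                  # A = [x_1 - m, ..., x_n - m]
    S = A @ A.T                                   # scatter matrix
    lam, U = np.linalg.eigh(S)                    # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]             # keep the m largest
    P = U[:, order].T                             # rows = principal directions
    return mean, P

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))                    # 10-D toy data, 200 samples
mean, P = pca(X, m=3)
Y = P @ (X - mean)                                # 3-D projections
print(Y.shape)                                    # (3, 200)
```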
PCA for Face Recognition
• Training data $D = \{x_1, \dots, x_M\}$
  – Dimension: the pixels of each face image are stacked into a vector of dimension $N$
  – Preprocessing: cropping and normalization
• These faces should lie in a "face" subspace
• Questions:
  – What is the dimension of this subspace?
  – How do we identify this subspace?
  – How do we use it for recognition?
Eigenface
The EigenFace approach: M. Turk and A. Pentland, 1992
An Issue
• In general, $N \gg M$
• However, $S$, the covariance matrix, is $N \times N$!
• Difficulties:
  – $S$ is ill-conditioned: $\operatorname{rank}(S) \ll N$
  – The eigenvalue decomposition of $S$ is expensive to compute when $N$ is large
• Solution?
Solution I:
• Do the eigenvalue decomposition on $A^T A$, which is an $M \times M$ matrix
• $A^T A v = \lambda v$
• $\Rightarrow A A^T A v = \lambda A v$
• To see it clearly: $(A A^T)(A v) = \lambda(A v)$
• i.e., if $v$ is an eigenvector of $A^T A$, then $A v$ is an eigenvector of $A A^T$ corresponding to the same eigenvalue!
• Note: of course, you need to normalize $A v$ to make it a unit vector
Solution II:
• You can simply use the SVD (singular value decomposition)
• $A = [x_1 - m, \dots, x_M - m]$
• $A = U \Sigma V^T$, where
  – $A$: $N \times M$
  – $U$: $N \times M$, $U^T U = I$
  – $\Sigma$: $M \times M$ diagonal
  – $V$: $M \times M$, $V^T V = V V^T = I$
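A minimal sketch of the eigenface computation via the thin SVD, in the spirit of Solution II (illustration; the stacked image matrix and the number of kept components are assumptions):

```python
import numpy as np

def eigenfaces(faces, k):
    """faces: N x M matrix, one vectorized face image per column.

    Returns the mean face and the leading k eigenfaces (columns of U),
    computed with the thin SVD A = U S V^T instead of eigendecomposing
    the huge N x N matrix A A^T.
    """
    mean_face = faces.mean(axis=1, keepdims=True)
    A = faces - mean_face                      # N x M, with M << N
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return mean_face, U[:, :k]                 # columns of U are already unit vectors

def project(x, mean_face, U_k):
    """Coordinates of a face x in the k-dimensional eigenface subspace."""
    return U_k.T @ (x - mean_face)

# Toy example: 64x64 "images" flattened to N = 4096, with M = 40 samples.
rng = np.random.default_rng(0)
faces = rng.normal(size=(64 * 64, 40))
mean_face, U_k = eigenfaces(faces, k=10)
print(project(faces[:, [0]], mean_face, U_k).shape)   # (10, 1)
```

Recognition can then be done by comparing these low-dimensional coordinates, e.g., with a nearest-neighbor rule.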
Fisher Linear Discrimination
• LDA
• PCA+LDA for Face Recognition
When does PCA fail?
(Figure: two class clusters and the PCA projection direction, which mixes the classes)
Linear Discriminant Analysis
• Find an optimal linear mapping $W$
• Capture the major differences between classes and discount irrelevant factors
• In the mapped space, the data are clustered by class
Within/between class scatters
Class means:
$m_1 = \frac{1}{n_1}\sum_{x \in D_1} x, \qquad m_2 = \frac{1}{n_2}\sum_{x \in D_2} x$
The linear transform $y = W^T x$ gives projected means
$\tilde{m}_1 = \frac{1}{n_1}\sum_{x \in D_1} W^T x = W^T m_1, \qquad \tilde{m}_2 = W^T m_2$
Class scatter matrices:
$S_1 = \sum_{x \in D_1}(x - m_1)(x - m_1)^T, \qquad S_2 = \sum_{x \in D_2}(x - m_2)(x - m_2)^T$
Projected scatters:
$\tilde{S}_1 = \sum_{y \in Y_1}(y - \tilde{m}_1)^2 = W^T S_1 W, \qquad \tilde{S}_2 = \sum_{y \in Y_2}(y - \tilde{m}_2)^2 = W^T S_2 W$
Within-class scatter: $S_W = S_1 + S_2$
Between-class scatter: $S_B = (m_1 - m_2)(m_1 - m_2)^T$
Fisher LDA
$J(W) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{S}_1 + \tilde{S}_2} = \frac{W^T(m_1 - m_2)(m_1 - m_2)^T W}{W^T(S_1 + S_2)W} = \frac{W^T S_B W}{W^T S_W W}$
$W^* = \arg\max_W J(W)$
$\max J(W) \;\Leftrightarrow\; S_B w = \lambda S_W w \;\leftarrow\;$ this is a generalized eigenvalue problem
Solution I
• If $S_W$ is not singular:
$S_W^{-1} S_B w = \lambda w$
• You can simply do an eigenvalue decomposition of $S_W^{-1} S_B$
Solution II
• Notice:
  – $S_B w$ is always in the direction of $m_1 - m_2$ (WHY?)
  – We only care about the direction of the projection, not its scale
• We therefore have
$w = S_W^{-1}(m_1 - m_2)$
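A short NumPy sketch of this two-class Fisher direction (illustration; the class sample matrices X1 and X2 are assumptions):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w = S_W^{-1} (m1 - m2), for two classes with one sample per row."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                          # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)      # only the direction matters
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 3.0]], size=200)
X2 = rng.multivariate_normal([2.0, 2.0], [[3.0, 2.0], [2.0, 3.0]], size=200)
w = fisher_direction(X1, X2)
print((X1 @ w).mean().round(2), (X2 @ w).mean().round(2))  # well-separated projections
```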
Multiple Discriminant Analysis
$m_i = \frac{1}{n_i}\sum_{x \in D_i} x, \qquad S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^T, \qquad S_W = \sum_{k=1}^{C} S_k$
With the projection $y = W^T x$:
$\tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = W^T m_i, \qquad \tilde{m} = \frac{1}{n}\sum_{k=1}^{C} n_k \tilde{m}_k$
$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = \sum_{i=1}^{C} n_i(\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T = W^T S_B W$
$W^* = \arg\max_W \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; S_B w_i = \lambda S_W w_i$
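The generalized eigenvalue problem above can be solved directly; a brief sketch using SciPy (illustration only, with made-up scatter matrices):

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(S_B, S_W, m):
    """Solve S_B w = lambda S_W w; keep the m leading eigenvectors as columns of W."""
    lam, V = eigh(S_B, S_W)               # symmetric-definite generalized eigenproblem
    return V[:, np.argsort(lam)[::-1][:m]]

# Made-up scatter matrices for a 4-D feature space with C = 3 classes.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S_W = A @ A.T + 4.0 * np.eye(4)           # within-class scatter, positive definite
B = rng.normal(size=(4, 2))
S_B = B @ B.T                             # between-class scatter, rank <= C - 1
W = mda_projection(S_B, S_W, m=2)
print(W.shape)                            # (4, 2)
```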
Comparing PCA and LDA
(Figure: the projection directions chosen by PCA and by LDA for the same two-class data)
MDA for Face Recognition
(Figure: face images of the same subjects under different lighting conditions)
• PCA does not work well! (why?)
• Solution: PCA + MDA
Independent Component Analysis
• The cross-talk problem
• ICA
Cocktail party
(Figure: two source signals $s_1(t)$, $s_2(t)$ mixed by $A$ into three observed signals $x_1(t)$, $x_2(t)$, $x_3(t)$)
Can you recover $s_1(t)$ and $s_2(t)$ from $x_1(t)$, $x_2(t)$ and $x_3(t)$?
Formulation
$x_j = a_{j1}s_1 + \dots + a_{jn}s_n \quad \forall j, \qquad \text{or} \qquad X = \sum_{i=1}^{n} a_i s_i, \qquad X = AS$
Both $A$ and $S$ are unknown!
Can you recover both $A$ and $S$ from $X$?
The Idea
$Y = W^T X = W^T A S = Z^T S$
• $Y$ is a linear combination of the sources $\{s_i\}$
• Recall the central limit theorem!
• A sum of even two independent random variables is more Gaussian than the original variables
• So $Z^T S$ should be more Gaussian than any single $s_i$
• In other words, $Z^T S$ becomes least Gaussian when it is in fact equal to one of the $\{s_i\}$
• Amazing!
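To make the cross-talk setup concrete, here is a hedged sketch that mixes two toy sources and unmixes them with scikit-learn's FastICA, one standard ICA algorithm (the slides do not prescribe a particular implementation, and all signals below are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))                            # square-wave source
s2 = np.sin(5 * t) + 0.1 * rng.normal(size=t.size)     # noisy sinusoid source
S = np.column_stack([s1, s2])                          # n_samples x 2 sources

A = np.array([[1.0, 0.5],                              # 3 x 2 mixing matrix
              [0.5, 1.0],                              # ("three microphones")
              [0.9, 0.4]])
X = S @ A.T                                            # mixtures x_j = sum_i a_ji s_i

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                           # recovered up to sign/order/scale
# Each estimated component should correlate strongly with one true source.
print(np.abs(np.corrcoef(S_est.T, S.T))[:2, 2:].round(2))
```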
Face Recognition: Challenges
(Figure: face images of the same subject under varying viewpoint)