
ECE 471/571 – Lecture 4
Parametric Estimation
01/27/17
Pattern Classification (course roadmap)

Statistical Approach
  Supervised
    Basic concepts: Bayesian decision rule (MPP, LR, Discriminant)
    Parameter estimation (ML, BL)
    Non-parametric learning (kNN)
    LDF (Perceptron)
    NN (BP)
    Support Vector Machine
    Deep Learning (DL)
  Unsupervised
    Basic concepts: Distance, Agglomerative method
    k-means
    Winner-takes-all
    Kohonen maps
    Mean-shift
Non-Statistical Approach
  Decision-tree
  Syntactic approach

Dimensionality Reduction: FLD, PCA
Performance Evaluation: ROC curve (TP, TN, FN, FP), cross validation
Stochastic Methods: local opt (GD), global opt (SA, GA)
Classifier Fusion: majority voting, NB, BKS
Bayes Decision Rule

Maximum Posterior Probability (MPP):
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$$
For a given x, if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then x belongs to class 1; otherwise, class 2.

Likelihood Ratio:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x)$$
If $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, then decide $\omega_1$, that is
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\,\frac{P(\omega_2)}{P(\omega_1)}$$

Discriminant Function:
$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
The classifier assigns a feature vector x to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$.
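As a minimal sketch of how the discriminant-function form of the rule might be applied (the two 1-D class-conditional Gaussians and the priors below are invented for illustration, not part of the lecture):

    # MPP decision via g_i(x) = ln p(x|w_i) + ln P(w_i), two classes, 1-D feature
    import math

    def gaussian_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

    # Hypothetical class models: p(x|w1), p(x|w2) and priors P(w1), P(w2)
    mu1, sigma1, prior1 = 0.0, 1.0, 0.6
    mu2, sigma2, prior2 = 2.0, 1.0, 0.4

    x = 1.2
    g1 = math.log(gaussian_pdf(x, mu1, sigma1)) + math.log(prior1)
    g2 = math.log(gaussian_pdf(x, mu2, sigma2)) + math.log(prior2)
    print("decide w1" if g1 > g2 else "decide w2")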
Normal/Gaussian Density

[Figure: the univariate Gaussian density and the 68-95-99.7 rule]
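A quick numerical check of the 68-95-99.7 rule, using the identity $P(|X - \mu| < k\sigma) = \operatorname{erf}(k/\sqrt{2})$ for a Gaussian:

    import math
    for k in (1, 2, 3):
        # probability mass within k standard deviations of the mean
        print(k, math.erf(k / math.sqrt(2)))   # ~0.6827, 0.9545, 0.9973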
Multivariate Normal Density

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right]$$

$x$: d-component column vector
$\mu$: d-component mean vector
$\Sigma$: d-by-d covariance matrix
$|\Sigma|$: determinant
$\Sigma^{-1}$: inverse
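A small sketch evaluating this density directly from the formula (the example $\mu$ and $\Sigma$ are made up):

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    print(mvn_pdf(np.array([1.0, -0.5]), mu, Sigma))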
Estimating Normal Densities

Calculate $m_i$ and $\Sigma_i$ for each class $i$ from its $n_i$ training samples:

$$m_i = \begin{bmatrix} m_{i1} \\ \vdots \\ m_{id} \end{bmatrix}, \qquad m_{ij} = \frac{1}{n_i}\sum_{k=1}^{n_i} x_{kj}$$

$$\Sigma_i = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1d} \\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \cdots & \sigma_{dd} \end{bmatrix} = \frac{1}{n_i - 1}\sum_{k=1}^{n_i} (x_k - m_i)(x_k - m_i)^T$$
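A minimal sketch of these estimates in NumPy, assuming X is an $n_i$-by-d array holding one class's training samples as rows (the data here are just placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # placeholder training samples

    mu_hat = X.mean(axis=0)                          # (1/n_i) * sum_k x_k
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / (len(X) - 1)         # 1/(n_i-1) * sum_k (x_k - m_i)(x_k - m_i)^T
    # Same result as np.cov(X, rowvar=False)
    print(mu_hat, Sigma_hat, sep="\n")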
Covariance

For d sets of variates denoted $\{x_1\}, \ldots, \{x_p\}, \ldots, \{x_q\}, \ldots, \{x_d\}$, the covariance $\sigma_{pq} = \operatorname{cov}(x_p, x_q)$ of $x_p$ and $x_q$ is defined by

$$\begin{aligned}
\operatorname{cov}(x_p, x_q) &= E\left[(x_p - \mu_p)(x_q - \mu_q)\right] \\
&= E[x_p x_q] - E[x_p \mu_q] - E[\mu_p x_q] + E[\mu_p \mu_q] \\
&= E[x_p x_q] - \mu_q E[x_p] - \mu_p E[x_q] + \mu_p \mu_q \\
&= E[x_p x_q] - \mu_q \mu_p - \mu_p \mu_q + \mu_p \mu_q \\
&= E[x_p x_q] - \mu_p \mu_q
\end{aligned}$$

When $p = q$,
$$\sigma_{pp} = \operatorname{cov}(x_p, x_p) = E[x_p^2] - (E[x_p])^2 = \sigma_p^2$$
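A quick numerical check of the identity $\sigma_{pq} = E[x_p x_q] - \mu_p \mu_q$ on synthetic, correlated data:

    import numpy as np

    rng = np.random.default_rng(1)
    xp = rng.normal(size=100_000)
    xq = 0.5 * xp + rng.normal(size=100_000)             # correlated with xp

    lhs = np.mean((xp - xp.mean()) * (xq - xq.mean()))   # E[(x_p - mu_p)(x_q - mu_q)]
    rhs = np.mean(xp * xq) - xp.mean() * xq.mean()       # E[x_p x_q] - mu_p mu_q
    print(lhs, rhs)                                      # equal up to floating-point error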
*Why? – Maximum Likelihood Estimation

$D = \{x_1, x_2, \ldots, x_k, \ldots, x_n\}$ is a data set of n training samples.

Compare "likelihood": $p(x \mid \omega_i)$ vs. $p(D \mid \theta)$, where $\theta = \begin{bmatrix} \mu \\ \Sigma \end{bmatrix}$.

Assuming the samples are drawn independently,
$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

$$\hat{\theta} = \arg\max_{\theta}\, l(\theta), \qquad l(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$
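A small sketch comparing $l(\theta) = \sum_k \ln p(x_k \mid \theta)$ at two candidate parameter values for a 1-D Gaussian (the data and the candidates are invented; the candidate closer to the generating parameters scores higher):

    import numpy as np

    rng = np.random.default_rng(2)
    D = rng.normal(loc=3.0, scale=1.5, size=200)     # hypothetical training set

    def log_likelihood(D, mu, var):
        # sum over samples of ln N(x_k; mu, var)
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (D - mu) ** 2 / (2 * var))

    print(log_likelihood(D, mu=3.0, var=2.25))   # near the generating parameters
    print(log_likelihood(D, mu=0.0, var=2.25))   # a poor candidate: lower value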
*Derivation

With $\theta = \begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix} = \begin{bmatrix}\mu \\ \Sigma\end{bmatrix}$ and

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$$

we have

$$p(x_k \mid \theta) = \frac{1}{(2\pi)^{d/2}\,|\theta_2|^{1/2}} \exp\left[-\frac{1}{2}(x_k-\theta_1)^T \theta_2^{-1} (x_k-\theta_1)\right]$$

$$l(\theta) = \sum_{k=1}^{n}\left(-\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\theta_2| - \frac{1}{2}(x_k - \theta_1)^T \theta_2^{-1}(x_k - \theta_1)\right)$$

Set
$$\frac{\partial l}{\partial \theta_1} = 0, \qquad \frac{\partial l}{\partial \theta_2} = 0$$
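As a sanity check on this derivation (a sketch, not part of the lecture), numerically maximizing $l(\theta)$ for a 1-D Gaussian with SciPy's general-purpose optimizer should recover the sample mean and the divide-by-n sample variance:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    D = rng.normal(loc=3.0, scale=1.5, size=500)

    def neg_log_likelihood(theta):
        mu, log_var = theta                      # optimize log-variance to keep it positive
        var = np.exp(log_var)
        return np.sum(0.5 * np.log(2 * np.pi * var) + (D - mu) ** 2 / (2 * var))

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
    print(res.x[0], D.mean())                    # ML mean      ~= sample mean
    print(np.exp(res.x[1]), D.var())             # ML variance  ~= (1/n) sum (x_k - mean)^2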
* Derivative of a Quadratic Form

A matrix A is "positive definite" if $x^T A x > 0$ for all $x \in \mathbb{R}^d$, $x \neq 0$.
$x^T A x$ is also called a "quadratic form".
The derivative of a quadratic form is particularly useful:
$$\frac{d}{dx}\left(x^T A x\right) = (A + A^T)\,x$$
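A quick finite-difference check of this identity for an arbitrary (not necessarily symmetric) A:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(3, 3))
    x = rng.normal(size=3)

    f = lambda v: v @ A @ v                      # the quadratic form x^T A x
    eps = 1e-6
    numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(3)])     # central differences, one per coordinate
    analytic = (A + A.T) @ x
    print(np.allclose(numeric, analytic, atol=1e-5))   # True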
**Why? – Bayesian Estimation

• Maximum likelihood estimation
  – The parameters are fixed.
  – Find the value of $\theta$ that best agrees with or supports the actually observed training samples, i.e., the likelihood $p(D \mid \theta)$ of $\theta$ with respect to the set of samples.
• Bayesian estimation
  – Treat the parameters as random variables themselves.
** The pdf of the parameter ($\mu$) is Gaussian

$$p(\mu \mid D) = \frac{p(D \mid \mu)\,p(\mu)}{C} = \frac{1}{C}\prod_{k=1}^{n} p(x_k \mid \mu)\,p(\mu)$$

$$p(x_k \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right]$$

$$p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]$$
** Derivation

$$p(\mu \mid D) = \frac{1}{C}\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$

$$= \alpha \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right]$$

$$= \alpha \exp\left[-\frac{1}{2}\left(\frac{\sum_{k=1}^{n}x_k^2 - 2\mu\sum_{k=1}^{n}x_k + n\mu^2}{\sigma^2} + \frac{\mu^2 - 2\mu\mu_0 + \mu_0^2}{\sigma_0^2}\right)\right]$$

$$= \beta \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$

This is again Gaussian in $\mu$:
$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]$$
** $\mu_n$ and $\sigma_n$

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\left(\frac{1}{n}\sum_{k=1}^{n}x_k\right) + \left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)\mu_0$$
Our best guess for $\mu$ after observing n samples.

$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
Measures our uncertainty about this guess.

Behavior of Bayesian learning:
• The larger the n, the smaller the $\sigma_n$ – each additional observation decreases our uncertainty about the true value of $\mu$.
• As n approaches infinity, $p(\mu \mid D)$ becomes more and more sharply peaked, approaching a Dirac delta function.
• $\mu_n$ is a linear combination of the sample mean and $\mu_0$.
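A minimal sketch of this update (the prior, the known $\sigma$, and the data below are invented), showing $\mu_n$ moving toward the sample mean and $\sigma_n$ shrinking as n grows:

    import numpy as np

    mu0, sigma0 = 0.0, 2.0          # prior p(mu) = N(mu0, sigma0^2)
    sigma = 1.0                     # known data standard deviation
    rng = np.random.default_rng(5)
    true_mu = 1.5
    x = rng.normal(loc=true_mu, scale=sigma, size=100)

    for n in (1, 10, 100):
        sample_mean = x[:n].mean()
        mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * sample_mean \
             + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
        sigma_n2 = sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2)
        print(n, mu_n, np.sqrt(sigma_n2))   # mu_n -> sample mean, sigma_n -> 0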