Introduction to Machine Learning

Parametric Methods
Name: 李政軒
Parametric Estimation
 X = {x^t}_t where x^t ~ p(x)
 Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient
statistics, using X
 e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Estimation
 Likelihood of θ given the sample X
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
 Log likelihood
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
 Maximum likelihood estimator
θ* = argmax_θ L(θ|X)
Examples: Bernoulli/Multinomial
 Bernoulli: Two states, failure/success, x in {0,1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p_o = ∑_t x^t / N
 Multinomial: K>2 states, x_i in {0,1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
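These MLEs are just sample proportions. A minimal numpy sketch (the toy data and variable names below are assumptions for illustration, not from the slides):

```python
import numpy as np

# Bernoulli sample: x^t in {0, 1}; the MLE is the sample proportion of successes.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0])          # assumed toy data
p_hat = x.mean()                                 # p_o = sum_t x^t / N

# Multinomial sample: each row is a one-hot indicator vector (x_1, ..., x_K).
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])                        # assumed toy data, K = 3
p_vec = X.mean(axis=0)                           # p_i = sum_t x_i^t / N

print(p_hat)   # 0.625
print(p_vec)   # [0.25 0.5  0.25]
```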
Gaussian (Normal) Distribution
 p(x) = N(μ, σ²)
p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]
 MLE for μ and σ²:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
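A minimal sketch of these two estimators (the sample below is assumed); note that the MLE of the variance divides by N, not N − 1:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=100)  # assumed sample

m = x.mean()                     # m  = sum_t x^t / N
s2 = ((x - m) ** 2).mean()       # s² = sum_t (x^t - m)² / N  (biased MLE, divides by N)

# np.var with its default ddof=0 computes the same MLE estimate.
assert np.isclose(s2, np.var(x))
```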
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
Bayes’ Estimator
 Treat θ as a random var with prior p (θ)
 Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X)
 Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
 Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
 Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)
 Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
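As a hedged illustration of the Bayes estimator, consider an assumed setting not spelled out on this slide: x^t ~ N(θ, σ²) with σ known and a Gaussian prior θ ~ N(μ₀, σ₀²). The posterior is then Gaussian, and E[θ|X] is a precision-weighted average of the sample mean and the prior mean:

```python
import numpy as np

# Bayes estimate of a Gaussian mean with known sigma and prior N(mu0, s0²).
# The posterior mean shrinks the sample mean m toward the prior mean mu0.
def bayes_mean_estimate(x, sigma, mu0, s0):
    N, m = len(x), x.mean()
    w = (N / sigma**2) / (N / sigma**2 + 1 / s0**2)   # weight on the data
    return w * m + (1 - w) * mu0

x = np.random.default_rng(1).normal(3.0, 1.0, size=20)     # assumed toy sample
print(bayes_mean_estimate(x, sigma=1.0, mu0=0.0, s0=1.0))  # between 0 and the sample mean
```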
Parametric Classification
gi x   px |C i P C i 
or
gi x   log px |C i   log P C i 
2


1
x  i  
px |C i  
exp 

2
2

2  i
i



1
x   i 2
gi x    log 2  log  i 
 log P C i 
2
2
2 i
Parametric Classification
 Given the sample X = {x^t, r^t}, t = 1, ..., N, where

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

 ML estimates are
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

 Discriminant becomes
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)
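A minimal sketch of these estimates and the resulting discriminant for two one-dimensional Gaussian classes (the data and class parameters below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.5, 70)])
y = np.concatenate([np.zeros(50, int), np.ones(70, int)])   # class labels r^t

def fit_class(x, y, i):
    xi = x[y == i]
    prior = len(xi) / len(x)            # P̂(C_i)
    m = xi.mean()                       # m_i
    s2 = ((xi - m) ** 2).mean()         # s_i² (MLE, divides by the class count)
    return prior, m, s2

def g(x0, prior, m, s2):
    # g_i(x) = -0.5 log 2π - log s_i - (x - m_i)²/(2 s_i²) + log P̂(C_i)
    return -0.5*np.log(2*np.pi) - 0.5*np.log(s2) - (x0 - m)**2/(2*s2) + np.log(prior)

params = [fit_class(x, y, i) for i in (0, 1)]
x0 = 1.4
print(np.argmax([g(x0, *p) for p in params]))   # predicted class for x0
```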
Parametric Classification
 Panels (a) and (b) for two classes when the input is one-dimensional:
variances are equal and the posteriors intersect at one point,
which is the decision threshold.
Parametric Classification
 Panels (a) and (b) for two classes when the input is one-dimensional: variances
are unequal and the posteriors intersect at two points. In (c), the
expected risks are shown for the two classes and for reject with λ = 0.2.
Regression
r  f x   
estimator : gx | 
 ~ N 0, 2 
pr | x  ~ N gx | ,  2 
L  |X   log  px t , r t 
N
t 1
 log  pr t | x t   log  px t 
N
N
t 1
t 1
Regression: From LogL to Error
L(θ|X) = log ∏_{t=1..N} (1 / (√(2π) σ)) exp[ −(r^t − g(x^t|θ))² / (2σ²) ]
       = −N log(√(2π) σ) − (1 / (2σ²)) ∑_{t=1..N} (r^t − g(x^t|θ))²

Maximizing L(θ|X) is therefore equivalent to minimizing the squared error
E(θ|X) = (1/2) ∑_{t=1..N} (r^t − g(x^t|θ))²
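A quick numeric check of this equivalence (toy data, a fixed σ, and the candidate parameters are assumptions): with σ fixed, the candidate that maximizes the Gaussian log-likelihood is the one that minimizes the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
r = 2.0 * x + 1.0 + rng.normal(0, 0.3, 30)
sigma = 0.3

def loglik(w1, w0):
    res = r - (w1 * x + w0)
    return -len(x) * np.log(np.sqrt(2 * np.pi) * sigma) - (res ** 2).sum() / (2 * sigma ** 2)

def sq_err(w1, w0):
    return 0.5 * ((r - (w1 * x + w0)) ** 2).sum()

cands = [(2.0, 1.0), (1.5, 1.2), (2.5, 0.8)]
# The likelihood-best and error-best candidates coincide.
print(max(cands, key=lambda c: loglik(*c)) == min(cands, key=lambda c: sq_err(*c)))  # True
```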
Linear Regression
gx t |w1 ,w0   w1 x t  w0
t
t
r

Nw

w
x

0
1
t
t
r x
t
t
 N
A  
t
x

t
t
 w0  x  w1  x
t
t

t 2
t
t
t

x


r
t  w0 

 t

w

y

2
t t


t 

r
w
t x    1  t x 
w  A 1y
Polynomial Regression
gx |w ,,w ,w ,w   w x 
t k
t
k
2
1 x1

2
1
x
D


N
1
x

1
k
0
x 
x 
1 2
2 2
x 
N 2
2
 w1x  w0
1



r
 x 

 2
2 k
r 
 x  

r



 N
N 2
 x  
r 
1 k
w  D D DT r
T
   w x 
t 2
1
t
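A minimal sketch of this closed-form fit (the order, the data, and the use of numpy's Vandermonde helper are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1.0, 20)

k = 3
D = np.vander(x, k + 1, increasing=True)   # columns: 1, x, x², ..., x^k
w = np.linalg.solve(D.T @ D, D.T @ r)      # same solution as (DᵀD)⁻¹ Dᵀ r, more stable
print(w)                                   # coefficients w_0 ... w_k
```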
Other Error Measures
 Square Error:
E(θ|X) = (1/2) ∑_{t=1..N} (r^t − g(x^t|θ))²
 Relative Square Error:
E(θ|X) = ∑_{t=1..N} (r^t − g(x^t|θ))² / ∑_{t=1..N} (r^t − r̄)²
 Absolute Error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|
 ε-sensitive Error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
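Sketches of these error measures as plain functions (function names and the toy vectors are assumptions):

```python
import numpy as np

def square_error(r, g):
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):
    d = np.abs(r - g)
    return np.sum((d > eps) * (d - eps))   # zero inside the ε-tube

r = np.array([1.0, 2.0, 3.0]); g = np.array([1.1, 1.8, 3.4])
print(square_error(r, g), relative_square_error(r, g),
      absolute_error(r, g), eps_sensitive_error(r, g, 0.25))
```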
Bias and Variance
Expected squared error at x decomposes as
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                         (noise)              (squared error)

Taking expectation over samples X, the squared error term further decomposes as
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                                (bias²)               (variance)
Estimating Bias and Variance
 M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M

Bias²(g) = (1/N) ∑_t [ ḡ(x^t) − f(x^t) ]²
Variance(g) = (1/(N M)) ∑_t ∑_i [ g_i(x^t) − ḡ(x^t) ]²
where ḡ(x) = (1/M) ∑_i g_i(x)
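A hedged sketch of this estimation procedure; the target f(x) = 2 sin(1.5x), the noise level, and the polynomial order are assumptions chosen to echo the example figure described below.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(1.5 * x)
x_eval = np.linspace(0, 5, 50)          # fixed evaluation points x^t (their count plays the role of N)
M, n_train, order = 100, 20, 3

fits = []
for _ in range(M):
    xt = rng.uniform(0, 5, n_train)
    rt = f(xt) + rng.normal(0, 1.0, n_train)
    w = np.polyfit(xt, rt, order)       # least-squares polynomial fit g_i
    fits.append(np.polyval(w, x_eval))  # g_i evaluated on the grid
G = np.array(fits)                      # shape (M, len(x_eval))

g_bar = G.mean(axis=0)                      # ḡ(x^t) = (1/M) Σ_i g_i(x^t)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)   # (1/N) Σ_t (ḡ(x^t) − f(x^t))²
variance = np.mean((G - g_bar) ** 2)        # (1/(N M)) Σ_t Σ_i (g_i(x^t) − ḡ(x^t))²
print(bias2, variance)
```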
Bias/Variance Dilemma
 Example: g_i(x) = 2 has no variance and high bias;
g_i(x) = ∑_t r_i^t / N has lower bias but nonzero variance
 As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
Bias/Variance Dilemma
(a) Function f(x) = 2 sin(1.5x) and one noisy (N(0,1)) dataset sampled from
the function. Five samples are taken, each containing twenty instances.
(b), (c), (d) show five polynomial fits g_i(·) of order 1, 3, and 5. In each
case, the dotted line is the average of the five fits, ḡ(·).
Polynomial Regression
In the same setting as the previous figure, but using one hundred models
instead of five: bias, variance, and error for polynomials of order 1 to 5.
The best fit is the order with minimum error.
Model Selection
 Cross-validation: Measure generalization
accuracy by testing on data unused during
training (see the sketch after this list)
 Regularization: Penalize complex models
E’=error on data + λ model complexity
Akaike’s information criterion (AIC), Bayesian
information criterion (BIC)
 Minimum description length (MDL): Kolmogorov
complexity, shortest description of data
 Structural risk minimization (SRM)
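As referenced in the cross-validation item above, a hedged sketch of k-fold cross-validation for choosing polynomial order (the data, k = 5, and the candidate orders are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 60)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1.0, 60)

def cv_error(order, k=5):
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        w = np.polyfit(x[train], r[train], order)               # fit on k-1 folds
        errs.append(np.mean((r[fold] - np.polyval(w, x[fold])) ** 2))
    return np.mean(errs)                                        # validation MSE

orders = range(1, 8)
print(min(orders, key=cv_error))   # order with the lowest validation error
```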
Model Selection
Best fit, “elbow”
Bayesian Model Selection
 Prior on models, p(model)
pdata|model  pmodel 
pmodel |data 
pdata
 Regularization, when prior favors simpler models
 Bayes, MAP of the posterior, p(model|data)
 Average over a number of models with high posterior
Regression example
Coefficients increase in
magnitude as order
increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
regularization: E(w|X) = (1/2) ∑_{t=1..N} (r^t − g(x^t|w))² + λ ∑_i w_i²