Introduction to Machine Learning
Parametric Methods
Name: 李政軒
Parametric Estimation
X = {x^t}, t = 1, ..., N, where x^t ~ p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Estimation
Likelihood of θ given the sample X
l(θ|X) = p(X|θ) = ∏t p(x^t|θ)
Log likelihood
L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)
Maximum likelihood estimator
θ* = argmaxθ L(θ|X)
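In practice the maximization is often done numerically on the log likelihood. A minimal sketch with NumPy and SciPy, assuming a Gaussian model and synthetic data (all names and values are illustrative):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)       # sample x^t ~ p(x)

def neg_log_likelihood(theta):
    mu, log_sigma = theta                          # log-parametrize so sigma > 0
    return -np.sum(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])  # minimize -L(θ|X)
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                           # close to (2.0, 1.5)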
Examples: Bernoulli/Multinomial
Bernoulli: Two states, failure/success, x ∈ {0, 1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p_o = ∑t x^t / N
Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x_1, x_2, ..., x_K) = ∏i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t)
MLE: p_i = ∑t x_i^t / N
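Both estimates are simple sample proportions. A minimal NumPy sketch with hypothetical samples:

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])        # Bernoulli sample, x^t in {0, 1}
p_hat = x.sum() / len(x)                # MLE: p_o = ∑t x^t / N

X = np.array([[1, 0, 0],                # multinomial indicator vectors x^t
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
p_i_hat = X.sum(axis=0) / X.shape[0]    # MLE: p_i = ∑t x_i^t / N
print(p_hat, p_i_hat)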
Gaussian (Normal) Distribution
p(x) = N(μ, σ²)
p(x) = (1/(√(2π) σ)) exp[−(x − μ)²/(2σ²)]
MLE for μ and σ²:
m = ∑t x^t / N
s² = ∑t (x^t − m)² / N
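These are the sample mean and the (biased, 1/N) sample variance. A minimal NumPy check on synthetic data:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # hypothetical sample

m  = x.sum() / len(x)                           # m  = ∑t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)              # s² = ∑t (x^t − m)² / N
assert np.isclose(s2, np.var(x))                # np.var also uses 1/N (ddof=0)
print(m, s2)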
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
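The decomposition can be checked by Monte Carlo. A minimal sketch, assuming the sample mean as the estimator d of a Gaussian mean θ (all values illustrative):

import numpy as np

rng = np.random.default_rng(2)
theta, N, trials = 5.0, 20, 100_000
# one value of d(X_i) per simulated sample X_i
d = rng.normal(theta, 1.0, size=(trials, N)).mean(axis=1)

mse      = np.mean((d - theta) ** 2)            # r(d, θ)
bias_sq  = (d.mean() - theta) ** 2              # Bias²
variance = np.mean((d - d.mean()) ** 2)         # Variance
print(mse, bias_sq + variance)                  # the two agree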
Bayes’ Estimator
Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a Posteriori (MAP): θ_MAP = argmaxθ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmaxθ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
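The three estimators can be compared in the conjugate Beta–Bernoulli case, where the posterior is Beta(a + k, b + N − k). A minimal sketch; the prior values and sample are assumptions:

import numpy as np

a, b = 2.0, 2.0                                  # assumed Beta(a, b) prior
x = np.array([1, 1, 0, 1, 1, 0, 1, 1])           # hypothetical sample
N, k = len(x), int(x.sum())                      # k = number of successes

theta_ML    = k / N                              # argmax p(X|θ)
theta_MAP   = (a + k - 1) / (a + b + N - 2)      # posterior mode
theta_Bayes = (a + k) / (a + b + N)              # posterior mean E[θ|X]
print(theta_ML, theta_MAP, theta_Bayes)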
Parametric Classification
g_i(x) = p(x|C_i) P(C_i)
or
g_i(x) = log p(x|C_i) + log P(C_i)
For Gaussian class densities:
p(x|C_i) = (1/(√(2π) σ_i)) exp[−(x − μ_i)²/(2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)²/(2σ_i²) + log P(C_i)
Parametric Classification
Given the sample X = {x^t, r^t}, t = 1, ..., N, where
r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i
ML estimates are
P̂(C_i) = ∑t r_i^t / N
m_i = ∑t x^t r_i^t / ∑t r_i^t
s_i² = ∑t (x^t − m_i)² r_i^t / ∑t r_i^t
Discriminant becomes
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)²/(2s_i²) + log P̂(C_i)
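Putting the estimates and the discriminant together for one-dimensional inputs gives a short classifier. A minimal NumPy sketch on a hypothetical two-class sample:

import numpy as np

def fit_class(x, r_i):
    # ML estimates for one class from indicator labels r_i^t in {0, 1}
    P_hat = r_i.sum() / len(r_i)                         # P̂(C_i)
    m_i   = (x * r_i).sum() / r_i.sum()                  # class mean
    s2_i  = (((x - m_i) ** 2) * r_i).sum() / r_i.sum()   # class variance
    return P_hat, m_i, s2_i

def g(x, P_hat, m_i, s2_i):
    # discriminant g_i(x), dropping the shared −(1/2) log 2π term
    return -0.5 * np.log(s2_i) - (x - m_i) ** 2 / (2 * s2_i) + np.log(P_hat)

rng = np.random.default_rng(3)
x  = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
r1 = np.concatenate([np.ones(50), np.zeros(50)])         # indicators for C_1
params = [fit_class(x, r1), fit_class(x, 1 - r1)]
pred = np.argmax([g(x, *p) for p in params], axis=0)     # 0 → C_1, 1 → C_2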
Parametric Classification
(a) and (b): two classes when the input is one-dimensional.
Variances are equal and the posteriors intersect at one point,
which is the decision threshold.
Parametric Classification
(a) and (b): two classes when the input is one-dimensional. Variances
are unequal and the posteriors intersect at two points. In (c), the expected
risks are shown for the two classes and for the reject option with λ = 0.2.
Regression
r = f(x) + ε
estimator: g(x|θ)
ε ~ N(0, σ²)
p(r|x) ~ N(g(x|θ), σ²)
L(θ|X) = log ∏t p(x^t, r^t), t = 1, ..., N
= log ∏t p(r^t|x^t) + log ∏t p(x^t)
Regression: From LogL to Error
L(θ|X) = log ∏t (1/(√(2π) σ)) exp[−(r^t − g(x^t|θ))²/(2σ²)]
= −N log(√(2π) σ) − (1/(2σ²)) ∑t (r^t − g(x^t|θ))²
Maximizing L(θ|X) is therefore equivalent to minimizing the sum of squared errors
E(θ|X) = (1/2) ∑t (r^t − g(x^t|θ))²
Linear Regression
g(x^t|w_1, w_0) = w_1 x^t + w_0
Setting the derivatives of E to zero gives the normal equations:
∑t r^t = N w_0 + w_1 ∑t x^t
∑t r^t x^t = w_0 ∑t x^t + w_1 ∑t (x^t)²
In matrix form, A w = y, where
A = [ N         ∑t x^t
      ∑t x^t    ∑t (x^t)² ]
w = [w_0, w_1]^T
y = [∑t r^t, ∑t r^t x^t]^T
w = A⁻¹ y
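A minimal NumPy sketch that builds A and y from a hypothetical noisy line and solves for w:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)       # true (w1, w0) = (2, 1)

A = np.array([[len(x),        x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)                   # solve A w = y directly
print(w0, w1)                                    # near (1.0, 2.0)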
Polynomial Regression
g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0
With the N × (k+1) design matrix D and target vector r:
D = [ 1   x^1   (x^1)²   ...   (x^1)^k
      1   x^2   (x^2)²   ...   (x^2)^k
      ...
      1   x^N   (x^N)²   ...   (x^N)^k ]
r = [r^1, r^2, ..., r^N]^T
w = (D^T D)⁻¹ D^T r
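A minimal NumPy sketch of the same closed form; the data are hypothetical:

import numpy as np

def poly_fit(x, r, k):
    # w = (D^T D)^-1 D^T r, with D the N x (k+1) design matrix
    D = np.vander(x, k + 1, increasing=True)     # columns 1, x, x², ..., x^k
    return np.linalg.solve(D.T @ D, D.T @ r)     # solve rather than invert

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 30)
w = poly_fit(x, r, k=3)                          # w = [w0, w1, w2, w3]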
Other Error Measures
Square Error:
E(θ|X) = (1/2) ∑t (r^t − g(x^t|θ))²
Relative Square Error:
E(θ|X) = ∑t (r^t − g(x^t|θ))² / ∑t (r^t − r̄)²  (r̄ is the mean of the r^t)
Absolute Error: E(θ|X) = ∑t |r^t − g(x^t|θ)|
ε-sensitive Error:
E(θ|X) = ∑t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
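A minimal NumPy sketch of the four measures as functions of targets r and predictions g:

import numpy as np

def square_error(r, g):
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):
    d = np.abs(r - g)
    return np.sum((d > eps) * (d - eps))          # zero inside the ε-tube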
Bias and Variance
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                     (noise)                (squared error)
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                            (bias²)                 (variance)
Estimating Bias and Variance
M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M
ḡ(x) = (1/M) ∑i g_i(x)
Bias²(g) = (1/N) ∑t [ḡ(x^t) − f(x^t)]²
Variance(g) = (1/(N M)) ∑t ∑i [g_i(x^t) − ḡ(x^t)]²
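A minimal NumPy sketch of these estimates, assuming the f(x) = 2 sin(1.5x) setup used on the following slides (sample sizes and the polynomial order are illustrative):

import numpy as np

rng = np.random.default_rng(6)
f = lambda x: 2 * np.sin(1.5 * x)
x = np.linspace(0, 5, 20)                        # N = 20 evaluation points
M, k = 100, 3                                    # M samples, degree-k fits

G = np.empty((M, len(x)))                        # G[i] = g_i evaluated on x
for i in range(M):
    r = f(x) + rng.normal(0, 1, len(x))          # noisy sample X_i
    G[i] = np.polyval(np.polyfit(x, r, k), x)

g_bar    = G.mean(axis=0)                        # ḡ(x)
bias_sq  = np.mean((g_bar - f(x)) ** 2)          # (1/N) ∑t [ḡ(x^t) − f(x^t)]²
variance = np.mean((G - g_bar) ** 2)             # (1/NM) ∑t ∑i [g_i − ḡ]²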
Bias/Variance Dilemma
Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑t r^t_i / N has lower bias with variance
As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
Bias/Variance Dilemma
(a) Function f(x) = 2 sin(1.5x) and one noisy (N(0,1)) dataset sampled
from the function. Five samples are taken, each containing twenty instances.
(b), (c), (d) are five polynomial fits, g_i(·), of order 1, 3, and 5. For each
case, the dotted line is the average of the five fits, ḡ(·).
Polynomial Regression
In the same setting as the previous slide, but using one hundred models
instead of five: bias, variance, and error for polynomials of order 1 to 5.
The best fit is the one with minimum error.
Model Selection
Cross-validation: Measure generalization accuracy by testing on data unused during training (see the sketch after this list)
Regularization: Penalize complex models
E′ = error on data + λ · model complexity
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
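A minimal sketch of cross-validation for choosing polynomial order, assuming square error as the measure (the data and fold count are hypothetical):

import numpy as np

def cv_error(x, r, k, n_folds=5):
    # mean validation square error of a degree-k polynomial fit
    folds = np.array_split(np.arange(len(x)), n_folds)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        w = np.polyfit(x[train], r[train], k)     # fit on training folds
        g = np.polyval(w, x[fold])                # predict on held-out fold
        errs.append(np.mean((r[fold] - g) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, 60)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 60)
best_k = min(range(1, 6), key=lambda k: cv_error(x, r, k))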
Model Selection
Best fit, “elbow”
Bayesian Model Selection
Prior on models, p(model)
p(model|data) = p(data|model) p(model) / p(data)
Regularization, when prior favors simpler models
Bayes, MAP of the posterior, p(model|data)
Average over a number of models with high posterior
Regression example
Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
regularization: E(w|X) = (1/2) ∑t [r^t − g(x^t|w)]² + λ ∑i w_i², t = 1, ..., N
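For the polynomial case this penalized error has a closed-form minimizer, the ridge solution. A minimal NumPy sketch (λ here absorbs constant factors; the data are hypothetical):

import numpy as np

def ridge_fit(x, r, k, lam):
    # w = (D^T D + λI)^-1 D^T r for the penalized square error
    D = np.vander(x, k + 1, increasing=True)
    return np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)

rng = np.random.default_rng(8)
x = np.linspace(0, 1, 30)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, 30)
w = ridge_fit(x, r, k=5, lam=1e-3)               # shrinks high-order weights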