
Introduction to Machine Learning
Lecture 3
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Bayesian Learning
Bayes’ Formula/Rule
Terminology:

    Pr[Y | X] = Pr[X | Y] Pr[Y] / Pr[X],

where Pr[Y | X] is the posterior probability, Pr[X | Y] the likelihood, Pr[Y] the prior, and Pr[X] the evidence.
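As a quick illustration (not from the slides), a minimal Python sketch of Bayes' rule; the prior, likelihood, and evidence values below are made up:

    # Bayes' rule: posterior = likelihood * prior / evidence.
    prior = 0.01         # Pr[Y]      (hypothetical value)
    likelihood = 0.9     # Pr[X | Y]  (hypothetical value)
    evidence = 0.05      # Pr[X]      (hypothetical value)

    posterior = likelihood * prior / evidence   # Pr[Y | X]
    print(posterior)     # approximately 0.18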
Loss Function
Definition: function L : Y × Y → R+ indicating the penalty for an incorrect prediction.
• L(ŷ, y): loss for prediction of ŷ instead of y.

Examples:
• zero-one loss: standard loss function in classification; L(y, y′) = 1_{y′ ≠ y} for y, y′ ∈ Y.
• non-symmetric losses: e.g., for spam classification; L(ham, spam) ≤ L(spam, ham).
• squared loss: standard loss function in regression; L(y, y′) = (y′ − y)².
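A small Python sketch of these losses (illustration only; the asymmetric spam/ham costs 1.0 and 5.0 are arbitrary placeholders):

    # Zero-one loss: 1 if the prediction differs from the true label, 0 otherwise.
    def zero_one_loss(y_pred, y_true):
        return 0.0 if y_pred == y_true else 1.0

    # Squared loss: standard loss for regression.
    def squared_loss(y_pred, y_true):
        return (y_pred - y_true) ** 2

    # Non-symmetric loss for spam filtering: wrongly flagging ham as spam
    # (cost 5.0) is penalized more than letting a spam message through (cost 1.0).
    def spam_loss(y_pred, y_true):
        if y_pred == y_true:
            return 0.0
        return 5.0 if (y_pred, y_true) == ("spam", "ham") else 1.0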
Classification Problem
Input space X: e.g., set of documents.
• feature vector Φ(x) ∈ R^N associated to x ∈ X.
• notation: feature vector x ∈ R^N.
• example: vector of word counts in document.

Output or target space Y: set of classes; e.g., sport, business, art.

Problem: given x, predict the correct class y ∈ Y associated to x.
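For concreteness, a tiny Python sketch (not from the slides) of the word-count feature vector Φ(x); the four-word vocabulary is a hypothetical stand-in for a real one:

    # Hypothetical fixed vocabulary: one coordinate of the feature vector per word.
    vocabulary = ["game", "score", "market", "painting"]

    def word_count_features(document):
        """Map a document (a string) to its vector of word counts."""
        words = document.lower().split()
        return [words.count(w) for w in vocabulary]

    print(word_count_features("the market rallied and the score was ignored"))
    # [0, 1, 1, 0]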
Bayesian Prediction
Definition: the expected conditional loss of predicting ŷ ∈ Y is

    L[ŷ | x] = Σ_{y ∈ Y} L(ŷ, y) Pr[y | x].

Bayesian decision: predict class minimizing expected conditional loss, that is

    ŷ* = argmin_{ŷ} L[ŷ | x] = argmin_{ŷ} Σ_{y ∈ Y} L(ŷ, y) Pr[y | x].

• zero-one loss: ŷ* = argmax_{y} Pr[y | x].

Maximum a Posteriori (MAP) principle.
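A minimal Python sketch of the Bayesian decision rule (illustration only; the conditional probabilities and the asymmetric loss values are made up):

    # Hypothetical conditional class probabilities Pr[y | x] for a fixed x.
    posterior = {"ham": 0.7, "spam": 0.3}

    # Hypothetical asymmetric loss L(y_pred, y_true).
    loss = {("ham", "ham"): 0.0, ("spam", "spam"): 0.0,
            ("ham", "spam"): 1.0, ("spam", "ham"): 5.0}

    # Expected conditional loss: L[y_pred | x] = sum_y L(y_pred, y) Pr[y | x].
    def expected_conditional_loss(y_pred):
        return sum(loss[(y_pred, y)] * p for y, p in posterior.items())

    # Bayesian decision: argmin over candidate predictions.
    y_star = min(posterior, key=expected_conditional_loss)
    print(y_star)   # "ham" (expected losses: 0.3 vs. 3.5)

With the zero-one loss, the same argmin reduces to argmax_y Pr[y | x], i.e., the MAP prediction.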
Binary Classification - Illustration
[Figure: the posterior probabilities Pr[y1 | x] and Pr[y2 | x] plotted as functions of x, with the vertical axis ranging from 0 to 1.]
Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting
according to the rule

    ŷ = argmax_{y ∈ Y} Pr[y | x].

Equivalently, by the Bayes formula:

    ŷ = argmax_{y ∈ Y} Pr[x | y] Pr[y] / Pr[x] = argmax_{y ∈ Y} Pr[x | y] Pr[y].

How do we determine Pr[x | y] and Pr[y]?
Density estimation problem.
Application - Maximum a Posteriori
Formulation: hypothesis set H.
    ĥ = argmax_{h ∈ H} Pr[h | O] = argmax_{h ∈ H} Pr[O | h] Pr[h] / Pr[O] = argmax_{h ∈ H} Pr[O | h] Pr[h].

Example: determine if a patient has a rare disease, H = {d, nd}, given laboratory test O = {pos, neg}. With Pr[d] = .005, Pr[pos | d] = .98, Pr[neg | nd] = .95, if the test is positive, what should be the diagnosis?

    Pr[pos | d] Pr[d] = .98 × .005 = .0049.
    Pr[pos | nd] Pr[nd] = (1 − .95) × (1 − .005) = .04975 > .0049.

Thus the MAP prediction is nd: no disease, despite the positive test.
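The same computation as a short Python check (the numbers are exactly those of the example above):

    # Rare-disease example: hypotheses H = {d, nd}, observation O = pos.
    prior = {"d": 0.005, "nd": 0.995}
    likelihood_pos = {"d": 0.98, "nd": 1 - 0.95}   # Pr[pos | h]

    # MAP: argmax_h Pr[pos | h] Pr[h]; the evidence Pr[pos] can be ignored.
    scores = {h: likelihood_pos[h] * prior[h] for h in prior}
    print(scores)                          # d: 0.0049, nd: approximately 0.04975
    print(max(scores, key=scores.get))     # 'nd' -> diagnose no disease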
Density Estimation
Data: sample drawn i.i.d. from set X according to
some distribution D,
x1, . . . , xm ∈ X.
Problem: find distribution p out of a set P that
best estimates D .
Note: we will study density estimation
specifically in a future lecture.
Maximum Likelihood
Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is

    Pr[x1, . . . , xm] = ∏_{i=1}^m p(xi).

Principle: select the distribution maximizing the sample probability,

    p̂ = argmax_{p ∈ P} ∏_{i=1}^m p(xi),   or   p̂ = argmax_{p ∈ P} Σ_{i=1}^m log p(xi).
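A small Python sketch of the principle when P is a finite set of candidate distributions (here, three hypothetical Bernoulli biases): compare the log-likelihoods of the sample and keep the argmax.

    import math

    sample = ["H", "T", "T", "H", "H", "H"]   # observed i.i.d. sample
    candidates = [0.25, 0.5, 0.75]            # hypothetical finite set P of Bernoulli parameters

    # Log-likelihood of the sample: sum_i log p(x_i), with p(H) = theta, p(T) = 1 - theta.
    def log_likelihood(theta, sample):
        return sum(math.log(theta if x == "H" else 1 - theta) for x in sample)

    p_hat = max(candidates, key=lambda theta: log_likelihood(theta, sample))
    print(p_hat)   # 0.75: with 4 heads out of 6, the most heads-biased candidate wins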
Example: Bernoulli Trials
Problem: find most likely Bernoulli distribution,
given sequence of coin flips
H, T, T, H, T, H, T, H, H, H, T, T, . . . , H.
Bernoulli distribution: p(H) = θ, p(T ) = 1 − θ.
Likelihood: l(p) = log [θ^{N(H)} (1 − θ)^{N(T)}] = N(H) log θ + N(T) log(1 − θ).

Solution: l is differentiable and concave;

    dl(p)/dθ = N(H)/θ − N(T)/(1 − θ) = 0  ⇔  θ = N(H) / (N(H) + N(T)).
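The closed-form solution in a few lines of Python (the flip sequence is a made-up example):

    flips = ["H", "T", "T", "H", "T", "H", "T", "H", "H", "H", "T", "T"]
    n_heads = flips.count("H")
    n_tails = flips.count("T")

    # Maximum-likelihood estimate of the Bernoulli parameter: theta = N(H) / (N(H) + N(T)).
    theta = n_heads / (n_heads + n_tails)
    print(theta)   # 6 heads out of 12 flips -> 0.5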
Example: Gaussian Distribution
Problem: find most likely Gaussian distribution,
given sequence of real-valued observations
3.18, 2.35, .95, 1.175, . . .
Normal distribution: p(x) = 1/√(2πσ²) exp(−(x − µ)² / (2σ²)).

Likelihood: l(p) = −(m/2) log(2πσ²) − Σ_{i=1}^m (xi − µ)² / (2σ²).

Solution: l is differentiable and concave;

    ∂l(p)/∂µ = 0  ⇔  µ = (1/m) Σ_{i=1}^m xi
    ∂l(p)/∂σ² = 0  ⇔  σ² = (1/m) Σ_{i=1}^m xi² − µ².
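The closed-form estimates as a short Python sketch, applied to the four observations listed on the slide (treated here as the whole sample):

    xs = [3.18, 2.35, 0.95, 1.175]
    m = len(xs)

    # Maximum-likelihood mean: mu = (1/m) sum_i x_i.
    mu = sum(xs) / m

    # Maximum-likelihood variance: sigma^2 = (1/m) sum_i x_i^2 - mu^2 (the biased estimator).
    sigma2 = sum(x * x for x in xs) / m - mu ** 2

    print(mu, sigma2)   # approximately 1.914 and 0.817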
ML Properties
Problems:
• the underlying distribution may not be among those searched.
• overfitting: number of examples too small wrt number of parameters.
• Pr[y] = 0 if class y does not appear in sample! → smoothing techniques.
Additive Smoothing
Definition: the additive or Laplace smoothing for
estimating Pr[y] , y ∈ Y , from a sample of size m is
defined by
    P̂r[y] = (|y| + α) / (m + α |Y|),

where |y| denotes the number of occurrences of class y in the sample.
• α = 0: ML estimator (MLE).
• MLE after adding α to the count of each class.
• Bayesian justification based on Dirichlet prior.
• poor performance for some applications, such as
n-gram language modeling.
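A short Python sketch of the smoothed class-prior estimate (illustration; the toy labels and the choice α = 1, the usual Laplace value, are assumptions):

    from collections import Counter

    labels = ["sport", "sport", "business", "sport"]   # toy sample of class labels
    classes = ["sport", "business", "art"]             # full label set Y
    alpha = 1.0
    m = len(labels)
    counts = Counter(labels)

    # Smoothed estimate: Pr[y] = (|y| + alpha) / (m + alpha * |Y|).
    prior = {y: (counts[y] + alpha) / (m + alpha * len(classes)) for y in classes}
    print(prior)   # "art" gets probability 1/7 > 0 even though it never appears in the sample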
Estimation Problem
Conditional probability: Pr[x | y] = Pr[x1 , . . . , xN | y].
• for large N, number of features, difficult to estimate.
• even if features are Boolean, that is xi ∈ {0, 1}, there are 2^N possible feature vectors!
• may need very large sample.
Naive Bayes
Conditional independence assumption: for
any y ∈ Y ,
Pr[x1 , . . . , xN | y] = Pr[x1 | y] . . . Pr[xN | y].
• given the class, the features are assumed to be
independent.
• strong assumption, typically does not hold.
Example - Document Classification
Features: presence/absence of word xi .
Estimation of Pr[xi | y]: frequency of word xi among
documents labeled with y , or smooth estimate.
Estimation of Pr[y] : frequency of class y in sample.
Classification:

    ŷ = argmax_{y ∈ Y} Pr[y] ∏_{i=1}^N Pr[xi | y].
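A compact Python sketch of this classifier with presence/absence features; the tiny labeled corpus, the vocabulary, and the add-one (α = 1) smoothing of Pr[xi | y] are assumptions made for the illustration:

    import math
    from collections import Counter

    # Hypothetical labeled corpus: each document is the set of words it contains.
    docs = [({"game", "score"}, "sport"),
            ({"market", "stocks"}, "business"),
            ({"score", "win"}, "sport")]
    vocabulary = ["game", "score", "market", "stocks", "win"]
    classes = ["sport", "business"]
    alpha = 1.0

    # Estimate Pr[y] and Pr[x_i = 1 | y] by (smoothed) relative frequencies.
    class_counts = Counter(y for _, y in docs)
    prior = {y: class_counts[y] / len(docs) for y in classes}
    word_given_class = {
        (w, y): (sum(1 for d, c in docs if c == y and w in d) + alpha)
                / (class_counts[y] + 2 * alpha)        # two feature values: present / absent
        for w in vocabulary for y in classes}

    # Prediction: argmax_y  log Pr[y] + sum_i log Pr[x_i | y].
    def predict(document):
        def score(y):
            s = math.log(prior[y])
            for w in vocabulary:
                p = word_given_class[(w, y)]
                s += math.log(p if w in document else 1 - p)
            return s
        return max(classes, key=score)

    print(predict({"score", "game"}))   # "sport"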
Naive Bayes - Binary Classification
Classes: Y = {−1, +1}.
Decision based on the sign of log (Pr[+1 | x] / Pr[−1 | x]); in terms of log-odds ratios:

    log (Pr[+1 | x] / Pr[−1 | x]) = log (Pr[+1] Pr[x | +1]) / (Pr[−1] Pr[x | −1])
        = log (Pr[+1] ∏_{i=1}^N Pr[xi | +1]) / (Pr[−1] ∏_{i=1}^N Pr[xi | −1])
        = log (Pr[+1] / Pr[−1]) + Σ_{i=1}^N log (Pr[xi | +1] / Pr[xi | −1]),

where each term log (Pr[xi | +1] / Pr[xi | −1]) is the contribution of feature/expert i to the decision.
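The decomposition can be checked numerically; a tiny Python sketch with made-up probabilities, where each entry of the list cond is Pr[xi | y] evaluated at the observed value of feature i:

    import math

    # Hypothetical quantities for a fixed observation x with N = 3 features.
    prior = {+1: 0.4, -1: 0.6}
    cond = [{+1: 0.8, -1: 0.3},    # Pr[x_1 | y] at the observed value of x_1
            {+1: 0.5, -1: 0.5},    # Pr[x_2 | y]
            {+1: 0.1, -1: 0.4}]    # Pr[x_3 | y]

    # Prior term plus the per-feature ("expert") contributions.
    log_odds = math.log(prior[+1] / prior[-1]) + sum(
        math.log(c[+1] / c[-1]) for c in cond)
    print(log_odds, "-> predict", +1 if log_odds > 0 else -1)   # negative here -> predict -1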
Naive Bayes = Linear Classifier
Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N ] .
Then, the Naive Bayes classifier is defined by

    x ↦ sgn(w · x + b),

where

    wi = log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])

and

    b = log (Pr[+1] / Pr[−1]) + Σ_{i=1}^N log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]).

Proof: observe that for any i ∈ [1, N],

    log (Pr[xi | +1] / Pr[xi | −1])
        = [log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])] xi
          + log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]).
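A short Python check of the theorem (the class prior and conditional probabilities below are hypothetical): build w and b as above and verify that w · x + b equals the Naive Bayes log-odds for a Boolean feature vector.

    import math

    # Hypothetical model: Pr[y] and Pr[x_i = 1 | y] for N = 3 Boolean features.
    prior = {+1: 0.5, -1: 0.5}
    p1 = [0.8, 0.4, 0.6]   # Pr[x_i = 1 | +1]
    q1 = [0.3, 0.5, 0.7]   # Pr[x_i = 1 | -1]

    # Weights and bias from the theorem.
    w = [math.log(p / q) - math.log((1 - p) / (1 - q)) for p, q in zip(p1, q1)]
    b = math.log(prior[+1] / prior[-1]) + sum(
        math.log((1 - p) / (1 - q)) for p, q in zip(p1, q1))

    # Direct Naive Bayes log-odds for a Boolean feature vector x.
    def naive_bayes_log_odds(x):
        s = math.log(prior[+1] / prior[-1])
        for xi, p, q in zip(x, p1, q1):
            s += math.log((p if xi else 1 - p) / (q if xi else 1 - q))
        return s

    x = [1, 0, 1]
    print(sum(wi * xi for wi, xi in zip(w, x)) + b)   # matches the log-odds below
    print(naive_bayes_log_odds(x))                    # (up to floating-point rounding)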
Summary
Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but, simple and easy to apply; widely used.

Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.