Introduction to Machine Learning
Lecture 3: Bayesian Learning
Mehryar Mohri
Courant Institute and Google Research
[email protected]

Bayes' Formula/Rule
Terminology:
    \Pr[Y \mid X] = \frac{\Pr[X \mid Y] \Pr[Y]}{\Pr[X]},
where \Pr[Y \mid X] is the posterior probability, \Pr[X \mid Y] the likelihood, \Pr[Y] the prior, and \Pr[X] the evidence.

Loss Function
Definition: a function L: Y × Y → R+ indicating the penalty for an incorrect prediction; L(\hat{y}, y) is the loss incurred for predicting \hat{y} instead of y.
Examples:
• zero-one loss: standard loss function in classification; L(y, y') = 1_{y' \neq y} for y, y' ∈ Y.
• non-symmetric losses: e.g., for spam classification, L(ham, spam) ≤ L(spam, ham): predicting ham for a spam message is penalized less than predicting spam for a legitimate message.
• squared loss: standard loss function in regression; L(y, y') = (y' - y)^2.

Classification Problem
• Input space X: e.g., a set of documents.
  • feature vector Φ(x) ∈ R^N associated to x ∈ X.
  • notation: feature vector x ∈ R^N.
  • example: vector of word counts in a document.
• Output or target space Y: set of classes; e.g., sport, business, art.
• Problem: given x, predict the correct class y ∈ Y associated to x.

Bayesian Prediction
Definition: the expected conditional loss of predicting \hat{y} ∈ Y is
    L[\hat{y} \mid x] = \sum_{y \in Y} L(\hat{y}, y) \Pr[y \mid x].
• Bayesian decision: predict the class minimizing the expected conditional loss, that is
    \hat{y}^* = argmin_{\hat{y} \in Y} L[\hat{y} \mid x] = argmin_{\hat{y} \in Y} \sum_{y \in Y} L(\hat{y}, y) \Pr[y \mid x].
• For the zero-one loss, \hat{y}^* = argmax_{\hat{y} \in Y} \Pr[\hat{y} \mid x]: the Maximum a Posteriori (MAP) principle.

Binary Classification - Illustration
(Figure: the posterior probabilities \Pr[y_1 \mid x] and \Pr[y_2 \mid x], with values between 0 and 1, plotted as functions of x.)

Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting according to the rule
    \hat{y} = argmax_{y \in Y} \Pr[y \mid x].
Equivalently, by Bayes' formula,
    \hat{y} = argmax_{y \in Y} \frac{\Pr[x \mid y] \Pr[y]}{\Pr[x]} = argmax_{y \in Y} \Pr[x \mid y] \Pr[y].
How do we determine \Pr[x \mid y] and \Pr[y]? This is a density estimation problem.

Application - Maximum a Posteriori
Formulation: given a hypothesis set H and an observation O,
    \hat{h} = argmax_{h \in H} \Pr[h \mid O] = argmax_{h \in H} \frac{\Pr[O \mid h] \Pr[h]}{\Pr[O]} = argmax_{h \in H} \Pr[O \mid h] \Pr[h].
Example: determine whether a patient has a rare disease, H = {d, nd}, given a laboratory test result O ∈ {pos, neg}. With Pr[d] = .005, Pr[pos | d] = .98, and Pr[neg | nd] = .95, if the test is positive, what should the diagnosis be?
    \Pr[pos \mid d] \Pr[d] = .98 \times .005 = .0049.
    \Pr[pos \mid nd] \Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049.
Thus the MAP diagnosis is nd: no disease, despite the positive test.

Density Estimation
Data: a sample drawn i.i.d. from a set X according to some distribution D,
    x_1, \ldots, x_m \in X.
Problem: find a distribution p out of a set P that best estimates D.
• Note: we will study density estimation specifically in a future lecture.

Maximum Likelihood
Likelihood: the probability of observing the sample under a distribution p ∈ P, which, given the independence assumption, is
    \Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).
Principle: select the distribution maximizing the sample probability,
    \hat{p} = argmax_{p \in P} \prod_{i=1}^{m} p(x_i),
or equivalently
    \hat{p} = argmax_{p \in P} \sum_{i=1}^{m} \log p(x_i).

Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
    H, T, T, H, T, H, T, H, H, H, T, T, \ldots, H.
Bernoulli distribution: p(H) = θ, p(T) = 1 - θ.
Likelihood:
    l(p) = \log\big(\theta^{N(H)} (1 - \theta)^{N(T)}\big) = N(H) \log\theta + N(T) \log(1 - \theta).
Solution: l is differentiable and concave;
    \frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.
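As a quick check of the closed-form solution above, here is a minimal Python sketch (not part of the lecture) that computes the maximum-likelihood estimate of θ for an illustrative coin-flip sequence, both from the count formula and by direct numerical maximization of the log-likelihood.

```python
import math

# Illustrative coin-flip sequence (the prefix shown on the slide).
flips = "HTTHTHTHHHTT"

n_h = flips.count("H")
n_t = flips.count("T")

# Closed-form MLE from the slide: theta = N(H) / (N(H) + N(T)).
theta_mle = n_h / (n_h + n_t)

# Log-likelihood l(theta) = N(H) log(theta) + N(T) log(1 - theta).
def log_likelihood(theta):
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

# Crude numerical check: maximize over a fine grid of theta values in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_mle)   # 0.5 for this sequence (6 heads, 6 tails)
print(theta_grid)  # ~0.5, agrees with the closed form up to grid resolution
```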
Example: Gaussian Distribution
Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
    3.18, 2.35, .95, 1.175, \ldots
Normal distribution:
    p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).
Likelihood:
    l(p) = -\frac{m}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.
Solution: l is differentiable; setting its partial derivatives to zero gives
    \frac{\partial l(p)}{\partial \mu} = 0 \iff \mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
    \frac{\partial l(p)}{\partial \sigma^2} = 0 \iff \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.

ML Properties
Problems:
• the underlying distribution may not be among those searched.
• overfitting: the number of examples may be too small with respect to the number of parameters.
• Pr[y] = 0 if class y does not appear in the sample! This motivates smoothing techniques.

Additive Smoothing
Definition: the additive or Laplace smoothing estimate of Pr[y], y ∈ Y, from a sample of size m is defined by
    \widehat{\Pr}[y] = \frac{|y| + \alpha}{m + \alpha |Y|},
where |y| denotes the number of occurrences of class y in the sample.
• α = 0: the ML estimator (MLE).
• Equivalent to the MLE after adding α to the count of each class.
• Bayesian justification based on a Dirichlet prior.
• Poor performance for some applications, such as n-gram language modeling.

Estimation Problem
Conditional probability:
    \Pr[x \mid y] = \Pr[x_1, \ldots, x_N \mid y].
• For a large number of features N, this is difficult to estimate.
• Even if the features are Boolean, that is x_i ∈ {0, 1}, there are 2^N possible feature vectors!
• A very large sample may be needed.

Naive Bayes
Conditional independence assumption: for any y ∈ Y,
    \Pr[x_1, \ldots, x_N \mid y] = \Pr[x_1 \mid y] \cdots \Pr[x_N \mid y].
• Given the class, the features are assumed to be independent.
• This is a strong assumption that typically does not hold.

Example - Document Classification
Features: presence/absence of word x_i.
Estimation of Pr[x_i | y]: frequency of word x_i among documents labeled with y, or a smoothed estimate.
Estimation of Pr[y]: frequency of class y in the sample.
Classification:
    \hat{y} = argmax_{y \in Y} \Pr[y] \prod_{i=1}^{N} \Pr[x_i \mid y].

Naive Bayes - Binary Classification
Classes: Y = {-1, +1}. The decision is based on the sign of \log\frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]}; in terms of log-odds ratios:
    \log \frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]} = \log \frac{\Pr[+1] \Pr[x \mid +1]}{\Pr[-1] \Pr[x \mid -1]}
        = \log \frac{\Pr[+1] \prod_{i=1}^{N} \Pr[x_i \mid +1]}{\Pr[-1] \prod_{i=1}^{N} \Pr[x_i \mid -1]}
        = \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]},
where each term of the sum is the contribution of feature/expert i to the decision.

Naive Bayes = Linear Classifier
Theorem: assume that x_i ∈ {0, 1} for all i ∈ [1, N]. Then the Naive Bayes classifier is defined by
    x \mapsto \mathrm{sgn}(w \cdot x + b),
where
    w_i = \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}
and
    b = \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.
Proof: observe that for any i ∈ [1, N],
    \log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]} = \left( \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]} \right) x_i + \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.
Summing over i and adding \log \frac{\Pr[+1]}{\Pr[-1]} shows that the log-odds ratio equals w \cdot x + b.
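The theorem is easy to sanity-check numerically. The sketch below (an illustration, not from the lecture; the priors and conditional probabilities are made-up values) computes w and b as in the theorem and verifies that w · x + b equals the directly computed log-odds ratio for every Boolean feature vector.

```python
import itertools
import math

# Hypothetical model parameters for N = 3 Boolean features (illustrative values only).
prior = {+1: 0.3, -1: 0.7}                       # Pr[y]
p1 = {+1: [0.8, 0.4, 0.6], -1: [0.2, 0.5, 0.9]}  # Pr[x_i = 1 | y]
N = 3

# Weights and bias from the theorem.
w = [
    math.log(p1[+1][i] / p1[-1][i])
    - math.log((1 - p1[+1][i]) / (1 - p1[-1][i]))
    for i in range(N)
]
b = math.log(prior[+1] / prior[-1]) + sum(
    math.log((1 - p1[+1][i]) / (1 - p1[-1][i])) for i in range(N)
)

def log_odds(x):
    """Directly computed log(Pr[+1 | x] / Pr[-1 | x]) under the Naive Bayes model."""
    total = math.log(prior[+1] / prior[-1])
    for i in range(N):
        num = p1[+1][i] if x[i] == 1 else 1 - p1[+1][i]
        den = p1[-1][i] if x[i] == 1 else 1 - p1[-1][i]
        total += math.log(num / den)
    return total

# Check that w . x + b matches the log-odds on all 2^N Boolean vectors.
for x in itertools.product([0, 1], repeat=N):
    linear = sum(w[i] * x[i] for i in range(N)) + b
    assert abs(linear - log_odds(x)) < 1e-12, (x, linear, log_odds(x))
print("w:", w, "b:", b, "- linear form matches the log-odds on all inputs")
```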
Summary
Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but simple and easy to apply; widely used.
Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.
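To tie the estimation and classification pieces together, here is a compact end-to-end sketch (an illustration on a hypothetical toy dataset, not code from the lecture): it estimates Pr[y] and Pr[x_i = 1 | y] from a small labeled sample of Boolean feature vectors, applies the same additive smoothing idea with α = 1 to both estimates, and predicts with the argmax rule from the document-classification slide.

```python
from collections import defaultdict
import math

# Hypothetical toy sample: Boolean feature vectors (word present/absent) with labels.
sample = [
    ((1, 0, 1), "sport"),
    ((1, 1, 0), "sport"),
    ((0, 1, 1), "business"),
    ((0, 0, 1), "business"),
    ((0, 1, 0), "business"),
]
classes = ["sport", "business"]
N = 3
alpha = 1.0  # additive (Laplace) smoothing parameter

m = len(sample)
class_count = defaultdict(int)    # |y|: number of examples labeled y
feature_count = defaultdict(int)  # number of y-labeled examples with x_i = 1
for x, y in sample:
    class_count[y] += 1
    for i in range(N):
        feature_count[(y, i)] += x[i]

# Smoothed estimates: Pr[y] = (|y| + alpha) / (m + alpha |Y|), and
# Pr[x_i = 1 | y] = (count + alpha) / (|y| + 2 alpha)  (two values per Boolean feature).
def prior(y):
    return (class_count[y] + alpha) / (m + alpha * len(classes))

def p_feature(i, value, y):
    p1 = (feature_count[(y, i)] + alpha) / (class_count[y] + 2 * alpha)
    return p1 if value == 1 else 1 - p1

def predict(x):
    # argmax_y  log Pr[y] + sum_i log Pr[x_i | y]
    def score(y):
        return math.log(prior(y)) + sum(
            math.log(p_feature(i, x[i], y)) for i in range(N)
        )
    return max(classes, key=score)

print(predict((1, 0, 0)))  # 'sport': feature 0 is strongly associated with 'sport'
print(predict((0, 1, 1)))  # 'business'
```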