Bayesian learning
Javier Béjar (LSI - FIB), Machine Learning, Term 2012/2013

Introduction

Bayesian learning
- The models that we have used so far classify with hard decisions (only one correct answer).
- For some problems it is more interesting to have soft decisions.
- Soft decisions can be represented as probability distributions over the possible decisions.
- A soft decision can always be thresholded to obtain a hard one.

Bayesian inference
- The cornerstone of Bayesian learning is Bayes' theorem, which links hypotheses and observations:

      $P(h \mid E) = \frac{P(h) \cdot P(E \mid h)}{P(E)}$

- This means that, given a sample of data from our problem, we can evaluate a hypothesis (the classification of our data) by computing a set of probabilities.

Bayesian inference - Example
- Assume we want to decide whether to recommend a novel to a friend; the decision will be based on our own opinion of the novel.
- We know that the probability of our friend liking any novel is 60 % (the prior, P(h)).
- We know that our friend has a taste similar to ours (the likelihood, P(E|h)). That assumption is our model of the task, with estimated parameters:
  - The probability of us having a positive opinion when he has a positive opinion is 90 %.
  - The probability of us having a positive opinion when he has a negative opinion is 5 %.
- We liked the novel. Should we recommend it to our friend? In other words, what is the prediction of his opinion, P(h|E)?

Enumerating all the probabilities (positive/negative):

      P(Friend's) = ⟨0.60, 0.40⟩
      P(Ours | Friend's = positive) = ⟨0.90, 0.10⟩
      P(Ours | Friend's = negative) = ⟨0.05, 0.95⟩

Bayes' theorem tells us that:

      $P(\text{Friend's} \mid \text{Ours}) = \frac{P(\text{Friend's}) \cdot P(\text{Ours} \mid \text{Friend's})}{P(\text{Ours})}$

Given that our opinion is positive (the data), and given that the result has to sum to 1 to be a probability distribution:

      $P(\text{Friend's} \mid \text{Ours}{=}\text{pos}) = \langle P(F{=}\text{pos}) \cdot P(O{=}\text{pos} \mid F{=}\text{pos}),\ P(F{=}\text{neg}) \cdot P(O{=}\text{pos} \mid F{=}\text{neg}) \rangle$
      $= \langle 0.6 \times 0.9,\ 0.4 \times 0.05 \rangle = \langle 0.54,\ 0.02 \rangle$, which after normalization is $\langle 0.96,\ 0.04 \rangle$

So it is most probable that our friend is going to like the novel (a small code sketch of this computation appears at the end of this section).

Bayesian learning
We can apply Bayesian inference to perform classification in different ways:
- Consider only the class with the largest a posteriori probability (maximum a posteriori hypothesis, $h_{MAP}$):

      $h_{MAP} = \operatorname*{argmax}_{h \in H} P(h) \cdot P(E \mid h)$

- Assume in addition that all the hypotheses have the same prior probability (maximum likelihood hypothesis, $h_{ML}$):

      $h_{ML} = \operatorname*{argmax}_{h \in H} P(E \mid h)$

Bayesian learning
- The learning goal is to estimate the probability density function (PDF) of the data.
- To estimate a PDF we have to make some assumptions about:
  - the model of the distribution that describes the attributes (continuous, discrete),
  - the model of the distribution that describes the hypotheses,
  - the dependence among the variables (all independent, some independent, ...).
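To make the introduction concrete, here is a minimal Python sketch that computes the normalized posterior for the novel-recommendation example above and then picks the MAP hypothesis. The function and variable names (normalized_posterior, likelihood_ours_pos, h_map) are illustrative choices, not part of the original slides.

    # Minimal sketch of the novel-recommendation example (names are illustrative).

    def normalized_posterior(prior, likelihood):
        """Compute P(h|E) proportional to P(h) * P(E|h), normalized to sum to 1."""
        unnormalized = {h: prior[h] * likelihood[h] for h in prior}
        evidence = sum(unnormalized.values())   # P(E), the normalizing constant
        return {h: p / evidence for h, p in unnormalized.items()}

    # Prior over the friend's opinion: P(Friend's)
    prior = {"pos": 0.60, "neg": 0.40}

    # Likelihood of *our* opinion being positive given the friend's opinion:
    # P(Ours = pos | Friend's = h)
    likelihood_ours_pos = {"pos": 0.90, "neg": 0.05}

    posterior = normalized_posterior(prior, likelihood_ours_pos)
    print(posterior)                            # {'pos': 0.964..., 'neg': 0.035...}

    # MAP decision: the hypothesis with the largest posterior probability
    h_map = max(posterior, key=posterior.get)
    print(h_map)                                # 'pos' -> recommend the novel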
Naive Bayes

Bayesian learning - Naive Bayes
- The simplest approach is to assume that all the attributes are independent given the class (not usually true).
- Under this assumption the PDF of the attributes of the data can be expressed as:

      $P(E \mid h) = \prod_{i \in \text{attr}} P(E_i \mid h)$

- The maximum a posteriori hypothesis can then be transformed into:

      $h_{MAP} = \operatorname*{argmax}_{h \in H} P(h) \times \prod_{i \in \text{attr}} P(E_i \mid h)$

- Each $P(E_i \mid h)$ can be estimated separately from the data.

Naive Bayes - Algorithm

      Algorithm: Naive Bayes
      Input:  E examples, A attributes, H hypotheses/classes
      Output: P(H), P(E_A | H)
      foreach h ∈ H do
          P(h) ← estimate the a priori class probability (E, h)
          foreach a ∈ A do
              P(E_a | h) ← estimate the attribute PDF for the class (E, h, a)
          end
      end

- To predict new examples we only have to compute the MAP hypothesis (a minimal sketch of the whole procedure appears at the end of this section).
- Surprisingly, this approach often works better than more complex methods.

Bayesian learning - Probability estimation (NB)
- For discrete attributes we can estimate $P(E_i \mid h)$ as the frequency of each value of the attribute in the dataset for each class (multinomial distribution).
- For discrete data with small samples this estimation can be problematic (values with zero probability).
- The Laplace correction avoids zero probabilities ($n_c$: frequency of the value in the class, $n$: number of examples, $p$: a priori probability of the value, for instance uniformly distributed, $m$: weighting constant):

      $P(E_i \mid h) = \frac{n_c + m \cdot p}{n + m}$

- For continuous attributes we can estimate $P(E_i \mid h)$ by assuming that they follow a continuous distribution (for instance a Gaussian) and estimating its parameters from the dataset.
- Continuous distributions can also be estimated using locally weighted regression (kernel density estimation): the density at a specific point is a weighted sum of the density of its surroundings:

      $P(E_i = x_j \mid h) = \sum_{x_k \in E_i} K(d(x_j, x_k))$

  where $K(d(x_j, x_k))$ is a decreasing function of the distance (e.g. a Gaussian kernel).

Naive Bayes - Advantages and drawbacks
Advantages:
- It is simple and computationally cheap.
- Its results can be good even when its assumptions are not true.

Drawbacks:
- The dataset is not correctly characterized (we are assuming things that are not true).
- We have to decide which probability distribution represents each attribute.
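As a companion to the algorithm and the Laplace correction above, the following is a minimal Python sketch of Naive Bayes for discrete attributes, using the m-estimate with a uniform a priori probability p. The class name, method names, and the toy data are illustrative assumptions, not code from the slides.

    import math
    from collections import Counter

    class DiscreteNaiveBayes:
        """Minimal Naive Bayes for discrete attributes with the Laplace (m-estimate) correction."""

        def __init__(self, m=1.0):
            self.m = m  # weighting constant of the correction

        def fit(self, X, y):
            n = len(y)
            self.classes = sorted(set(y))
            self.priors = {h: sum(1 for c in y if c == h) / n for h in self.classes}
            # values[i] = observed values of attribute i (used for the uniform prior p)
            self.values = [set(row[i] for row in X) for i in range(len(X[0]))]
            # counts[h][i][v] = number of examples of class h with value v for attribute i
            self.counts = {h: [Counter() for _ in self.values] for h in self.classes}
            self.class_n = Counter(y)
            for row, h in zip(X, y):
                for i, v in enumerate(row):
                    self.counts[h][i][v] += 1
            return self

        def _log_score(self, row, h):
            total = math.log(self.priors[h])
            for i, v in enumerate(row):
                p = 1.0 / len(self.values[i])        # uniform a priori probability of the value
                nc = self.counts[h][i][v]            # frequency of the value in the class
                total += math.log((nc + self.m * p) / (self.class_n[h] + self.m))
            return total

        def predict(self, row):
            # MAP hypothesis: class maximizing P(h) * prod_i P(E_i | h), computed in log space
            return max(self.classes, key=lambda h: self._log_score(row, h))

    # Tiny usage example with made-up data
    X = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
    y = ["no", "yes", "yes", "no"]
    model = DiscreteNaiveBayes(m=1.0).fit(X, y)
    print(model.predict(("sunny", "cool")))          # -> 'yes' for this toy data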
Application

Mushroom identification
- 8124 instances, 22 attributes (all discrete), 2 classes (edible/poisonous)
- Attributes: habitat, odor, cap color, cap shape, ring number, ring type, gill size, gill color, ...
- Validation: 10-fold cross-validation

Mushrooms: Models
- Naive Bayes (multinomial estimation): accuracy 99.5 %
  - Poisonous recall 99.62 % (15 poisonous mushrooms classified as edible)
- Naive Bayes (multinomial estimation) + cost-sensitive learning: accuracy 99.1 %
  - Poisonous recall 99.92 % (3 poisonous mushrooms classified as edible)
- Naive Bayes (multinomial estimation) + cost-sensitive learning: accuracy 83.33 %
  - Poisonous recall 100 % (0 poisonous mushrooms classified as edible)
  - Edible recall 67.82 %
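For reference, a similar experiment can be sketched with scikit-learn's CategoricalNB and 10-fold cross-validation. This is an assumption about tooling for illustration only: the figures above were not necessarily produced this way, the cost-sensitive variants are not reproduced, and the local file name agaricus-lepiota.data is also an assumption (the UCI file has no header row and the class label in the first column).

    # Hedged sketch: 10-fold cross-validation of a categorical Naive Bayes on the
    # UCI mushroom dataset (assumed local copy of the data file).
    import pandas as pd
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import OrdinalEncoder

    data = pd.read_csv("agaricus-lepiota.data", header=None)  # assumed path/name
    y = data.iloc[:, 0]                                        # edible ('e') / poisonous ('p')
    X = OrdinalEncoder().fit_transform(data.iloc[:, 1:])       # map each discrete value to an integer code

    # alpha is the additive (Laplace-style) smoothing of the per-class value counts.
    # With 8124 instances and 10 folds, every attribute value is expected to appear
    # in each training fold, so encoding on the full dataset is unproblematic here.
    model = CategoricalNB(alpha=1.0)
    scores = cross_val_score(model, X, y, cv=10)
    print(scores.mean())   # mean accuracy; expected to be close to the figure reported above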