Bayesian learning

Javier Béjar (LSI - FIB) · Machine Learning · Term 2012/2013

Introduction
The models that we have used so far classify using hard decisions (only one correct answer)
For some problems it is more interesting to have soft decisions
These can be represented using probability distributions over the decisions
A soft decision can always be thresholded to obtain a hard one
Bayesian inference
The cornerstone of Bayesian learning is Bayes' theorem
This theorem links hypotheses and observations:

P(h|E) = P(h) · P(E|h) / P(E)

This means that, given a sample of data from our problem, we can evaluate our hypotheses (the classification of our data) by computing a set of probabilities
Bayesian inference - Example
Let's assume that we want to recommend a novel to a friend; this decision will be based on our own opinion about the novel
We know that the probability of our friend liking any novel is 60 % (P(h))
We know that our friend has similar taste to ours (P(E|h))
That assumption is our model of the task, with estimated parameters:
The probability that we have a positive opinion when our friend has a positive opinion is 90 %
The probability that we have a positive opinion when our friend has a negative opinion is 5 %
We liked the novel; should we recommend it to our friend? (what would our friend's opinion be?) (P(h|E))
Bayesian inference - Example
If we enumerate all the probabilities:
P(Friend's) = ⟨0.60, 0.40⟩ (pos/neg)
P(Ours | Friend's = positive) = ⟨0.90, 0.10⟩ (pos/neg)
P(Ours | Friend's = negative) = ⟨0.05, 0.95⟩ (pos/neg)
Bayesian inference - Example
Bayes' theorem tells us that:

P(Friend's | Ours) = P(Friend's) · P(Ours | Friend's) / P(Ours)

Given that our opinion is positive (the evidence) and that the result has to sum to 1 to be a probability:

P(Friend's | Ours = pos) ∝ ⟨P(F = pos) · P(O = pos | F = pos), P(F = neg) · P(O = pos | F = neg)⟩
                         = ⟨0.6 × 0.9, 0.4 × 0.05⟩ = ⟨0.54, 0.02⟩
                         ≈ ⟨0.96, 0.04⟩ (normalized)

So it is most probable that our friend is going to like the novel
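As a quick check, this computation can be reproduced with a minimal Python sketch (not part of the original slides); the variable names are illustrative only:

    # Posterior for the novel-recommendation example.
    prior = {"pos": 0.6, "neg": 0.4}            # P(Friend's)
    likelihood_pos = {"pos": 0.9, "neg": 0.05}  # P(Ours = pos | Friend's)

    # Unnormalized posterior: P(Friend's) * P(Ours = pos | Friend's)
    unnorm = {h: prior[h] * likelihood_pos[h] for h in prior}
    total = sum(unnorm.values())                # P(Ours = pos), the evidence
    posterior = {h: p / total for h, p in unnorm.items()}

    print(posterior)  # {'pos': 0.964..., 'neg': 0.035...}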
Bayesian learning
We can apply Bayesian inference to perform classification in different ways
Consider only the class with the largest a posteriori probability (maximum a posteriori hypothesis, h_MAP)

h_MAP = argmax_{h ∈ H} P(h) · P(E|h)

If we additionally assume that all the hypotheses have the same prior probability, we obtain the maximum likelihood hypothesis, h_ML

h_ML = argmax_{h ∈ H} P(E|h)
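A minimal sketch of the difference, assuming the probabilities have already been estimated (the values below are illustrative, not from the slides):

    # Hypothetical estimates for two classes.
    prior = {"edible": 0.1, "poisonous": 0.9}        # P(h)
    likelihood = {"edible": 0.3, "poisonous": 0.05}  # P(E | h) for one observed example

    # MAP weights the likelihood by the prior; ML uses the likelihood alone.
    h_map = max(prior, key=lambda h: prior[h] * likelihood[h])  # 'poisonous'
    h_ml = max(likelihood, key=likelihood.get)                  # 'edible'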
Bayesian learning
The learning goal is to estimate the probability density function (PDF) of the data
To estimate a PDF we have to make some assumptions about:
The model of the distribution that describes the attributes (continuous, discrete)
The model of the distribution that describes the hypotheses
The dependence among the variables (all independent, some independent, ...)
Naive Bayes
Bayesian learning - Naive Bayes
The simplest approach is to assume that all the attributes are independent given the class (not usually true)
The PDF for the attributes of the data can then be expressed as:

P(E|h) = ∏_{∀i ∈ attr} P(E_i|h)

The maximum a posteriori hypothesis becomes:

h_MAP = argmax_{h ∈ H} P(h) · ∏_{∀i ∈ attr} P(E_i|h)

We can estimate each P(E_i|h) separately from the data
Naive Bayes - algorithm
Algorithm: Naive Bayes
Input: E examples, A attributes, H hypotheses/classes
Output: P(H), P(E_a|H) for each a ∈ A
foreach h ∈ H do
    P(h) ← Estimate a priori class probability (E, h)
    foreach a ∈ A do
        P(E_a|h) ← Estimate attribute PDF for the class (E, h, a)
    end
end
For predicting new examples we only have to compute the MAP hypothesis
Surprisingly, this approach often works better than more complex methods
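The pseudocode maps quite directly onto code. Below is a minimal Python sketch for discrete attributes, assuming the data comes as a list of (attribute-value dictionary, class) pairs; the names fit and predict are mine, not from the slides, and predict computes the factorized MAP hypothesis from the previous slide:

    from collections import Counter, defaultdict

    def fit(examples):
        """Estimate P(h) and P(E_a | h) by counting (discrete attributes only)."""
        class_counts = Counter(h for _, h in examples)
        prior = {h: c / len(examples) for h, c in class_counts.items()}
        counts = defaultdict(lambda: defaultdict(Counter))  # counts[h][a][v]
        for x, h in examples:
            for a, v in x.items():
                counts[h][a][v] += 1
        likelihood = {h: {a: {v: n / class_counts[h] for v, n in vals.items()}
                          for a, vals in attrs.items()}
                      for h, attrs in counts.items()}
        return prior, likelihood

    def predict(x, prior, likelihood):
        """Return the MAP class: argmax_h P(h) * prod_a P(x_a | h)."""
        def score(h):
            p = prior[h]
            for a, v in x.items():
                p *= likelihood[h].get(a, {}).get(v, 0.0)  # zero counts: see Laplace correction
            return p
        return max(prior, key=score)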
Bayesian learning - Probability estimation (NB)
For discrete attributes we can estimate P(E_i|h) as the frequency of the attribute value in the dataset for each class (multinomial distribution)
For discrete data with small samples, this estimation can be problematic (values with zero probability)
Laplace correction (n_c: frequency of the value in the class, n: number of examples, p: a priori probability (for instance, uniform), m: weighting constant):
P(E_i|h) = (n_c + m · p) / (n + m)
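A one-line sketch of this correction in Python (the function name is mine):

    def laplace_estimate(n_c, n, p, m=1.0):
        """Laplace / m-estimate corrected P(E_i | h): (n_c + m * p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # A value never seen in the class (n_c = 0) no longer gets probability zero.
    print(laplace_estimate(n_c=0, n=20, p=0.25, m=2.0))  # 0.5 / 22 ≈ 0.023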
Bayesian learning - Probability estimation (NB)
For continuous attributes we can estimate P(E_i|h) by assuming that they follow a continuous distribution (for instance, Gaussian) and estimating its parameters from the dataset
Continuous distributions can also be estimated using locally weighted regression (kernel density estimation)
The density at a specific point is a weighted sum of the density of its surroundings:

P(E_i = x_j | h) = ∑_{x_k ∈ E_i} K(d(x_j, x_k))

where K(d(x_j, x_k)) is a decreasing function of the distance (e.g. a Gaussian function)
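A minimal sketch of both options for one continuous attribute, assuming NumPy is available (the bandwidth h below is an illustrative choice, not from the slides):

    import numpy as np

    def gaussian_pdf(x, values):
        """Parametric option: fit a Gaussian to the attribute values of one class."""
        mu, sigma = values.mean(), values.std(ddof=1)
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def kde_pdf(x, values, h=1.0):
        """Kernel density estimate: average of Gaussian kernels centred on each sample."""
        kernels = np.exp(-0.5 * ((x - values) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        return kernels.mean()

    values = np.array([4.9, 5.1, 5.3, 5.8, 6.0])  # attribute values for one class (made up)
    print(gaussian_pdf(5.5, values), kde_pdf(5.5, values, h=0.3))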
Naive Bayes - Advantages and drawbacks
Advantages:
It is simple and computationally cheap
Its results can be good even when its assumptions do not hold
Drawbacks:
The dataset is not correctly characterized (we are assuming things that are not true)
We have to decide which probability distribution represents each attribute
Application
Mushrooms
Mushroom identification
22 attributes (all discrete)
Attributes: habitat, odor, cap color, cap shape, ring number, ring type, gill size, gill color, ...
8124 instances
2 classes (edible/poisonous)
Validation: 10-fold cross validation
Mushrooms: Models
Naive Bayes (multinomial estimation): accuracy 99.5 %
Poisonous recall 99.62 % (15 poisonous classified as edible)
Naive Bayes (multinomial estimation) + cost sensitive: accuracy 99.1 %
Poisonous recall 99.92 % (3 poisonous classified as edible)
Naive Bayes (multinomial estimation) + cost sensitive: accuracy 83.33 %
Poisonous recall 100 % (0 poisonous classified as edible)
Edible recall 67.82 %