Bayesian Learning

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naïve Bayes learner
• Bayesian belief networks

Two Roles for Bayesian Methods

Provide practical learning algorithms:
• Naïve Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provide useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor
Bayes Theorem

  P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

Choosing Hypotheses

Generally want the most probable hypothesis given the training data.

Maximum a posteriori hypothesis hMAP:

  hMAP = argmax_{h∈H} P(h|D)
       = argmax_{h∈H} P(D|h) P(h) / P(D)
       = argmax_{h∈H} P(D|h) P(h)

If we assume P(hi) = P(hj) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis:

  hML = argmax_{hi∈H} P(D|hi)
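As a quick illustration of the two definitions (not from the slides; the priors and likelihoods below are made-up numbers), a few lines of Python that pick hMAP and hML over a three-hypothesis space:

  # Sketch (hypothetical numbers): choosing hMAP and hML from a small finite H.
  priors      = {"h1": 0.7, "h2": 0.2, "h3": 0.1}   # P(h), assumed
  likelihoods = {"h1": 0.3, "h2": 0.8, "h3": 0.1}   # P(D|h), assumed

  # hMAP maximizes P(D|h) P(h); P(D) is a constant and can be ignored.
  h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

  # hML maximizes P(D|h) alone (equivalent to hMAP under a uniform prior).
  h_ml = max(likelihoods, key=likelihoods.get)

  print(h_map, h_ml)   # -> h1 h2: the prior changes which hypothesis wins

Here the strong prior on h1 outweighs h2's higher likelihood, which is exactly the difference between the MAP and ML criteria.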
Bayes Theorem

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.

  P(cancer) =              P(¬cancer) =
  P(+|cancer) =            P(-|cancer) =
  P(+|¬cancer) =           P(-|¬cancer) =

  P(cancer|+) =
  P(¬cancer|+) =

Some Formulas for Probabilities

• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
• Theorem of total probability: if events A1,…,An are mutually exclusive with ∑_{i=1}^{n} P(Ai) = 1, then
  P(B) = ∑_{i=1}^{n} P(B|Ai) P(Ai)
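Filling in the blanks above is left as an exercise on the slide; as a sketch, the rates stated in the problem (98%, 97%, 0.8%) can be plugged straight into Bayes theorem and the theorem of total probability:

  # Sketch: posterior for the lab-test example, using only the stated rates.
  p_cancer     = 0.008                      # P(cancer), 0.8% of the population
  p_not_cancer = 1.0 - p_cancer             # P(¬cancer) = 0.992
  p_pos_given_cancer     = 0.98             # P(+|cancer)
  p_pos_given_not_cancer = 1.0 - 0.97       # P(+|¬cancer) = 0.03

  # Unnormalized posteriors P(+|h) P(h)
  num_cancer     = p_pos_given_cancer * p_cancer           # 0.00784
  num_not_cancer = p_pos_given_not_cancer * p_not_cancer   # 0.02976

  # Normalize by P(+) = sum of the two (theorem of total probability)
  p_pos = num_cancer + num_not_cancer
  print(num_cancer / p_pos)       # P(cancer|+)  ≈ 0.21
  print(num_not_cancer / p_pos)   # P(¬cancer|+) ≈ 0.79

So even with a positive test, the MAP hypothesis is ¬cancer, because the prior P(cancer) is so small.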
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h∈H} P(h|D)

Relation to Concept Learning

Consider our usual concept learning task
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis?
Does FindS output a MAP hypothesis?
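A literal rendering of the brute-force learner, as a sketch; the hypothesis set and the prior and likelihood functions are placeholders you would supply for a concrete task:

  # Sketch of the brute-force MAP learner above. `hypotheses`, `prior`, and
  # `likelihood` are placeholders: prior(h) returns P(h), likelihood(data, h)
  # returns P(D|h).
  def brute_force_map(hypotheses, prior, likelihood, data):
      # Unnormalized posterior P(D|h) P(h) for every h in H
      scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
      p_data = sum(scores.values())                 # P(D) by total probability
      posteriors = {h: s / p_data for h, s in scores.items()}
      # Return the hypothesis with the highest posterior probability
      return max(posteriors, key=posteriors.get)

The normalization by P(D) is unnecessary for picking hMAP, but it makes the loop match step 1 of the slide exactly.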
Relation to Concept Learning

Assume fixed set of instances (x1,…,xm)
Assume D is the set of classifications
  D = (c(x1),…,c(xm))
Choose P(D|h):
• P(D|h) = 1 if h consistent with D
• P(D|h) = 0 otherwise
Choose P(h) to be uniform distribution
• P(h) = 1/|H| for all h in H
Then

  P(h|D) = 1/|VS_{H,D}|   if h is consistent with D
         = 0              otherwise

Learning a Real Valued Function

[Figure: noisy training points (xi, di) scattered around the target function f, with the maximum likelihood hypothesis hML plotted against x and y]

Consider any real-valued target function f.
Training examples (xi, di), where di is noisy training value
• di = f(xi) + ei
• ei is random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean = 0

Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:

  hML = argmin_{h∈H} ∑_{i=1}^{m} (di - h(xi))²

Learning a Real Valued Function

  hML = argmax_{h∈H} p(D|h)
      = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)
      = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{-(1/2)((di - h(xi))/σ)²}

Maximize natural log of this instead …

  hML = argmax_{h∈H} ∑_{i=1}^{m} ln(1/√(2πσ²)) - (1/2)((di - h(xi))/σ)²
      = argmax_{h∈H} ∑_{i=1}^{m} - (1/2)((di - h(xi))/σ)²
      = argmax_{h∈H} ∑_{i=1}^{m} - (di - h(xi))²
      = argmin_{h∈H} ∑_{i=1}^{m} (di - h(xi))²

Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes

  hMDL = argmin_{h∈H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C

Example:
• H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given h
  – Note L_C2(D|h) = 0 if examples classified perfectly by h. Need only describe exceptions
• Hence hMDL trades off tree size for training errors
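The equivalence derived above, that under fixed-variance Gaussian noise the maximum likelihood hypothesis is exactly the least-squared-error hypothesis, is easy to check numerically. The sketch below uses a made-up linear target, noise level, and small candidate hypothesis space; both criteria select the same candidate:

  import numpy as np

  # Sketch: argmax of the Gaussian log-likelihood equals argmin of the SSE.
  # The target f, sigma, and candidate hypotheses are made up for illustration.
  rng = np.random.default_rng(0)
  xs = np.linspace(0, 1, 30)
  ds = 2.0 * xs + 1.0 + rng.normal(0.0, 0.2, size=xs.shape)   # di = f(xi) + ei

  # A small hypothesis space of linear functions h(x) = a*x + b
  candidates = [(a, b) for a in (1.5, 2.0, 2.5) for b in (0.5, 1.0, 1.5)]
  sigma = 0.2

  def log_likelihood(a, b):
      # sum_i ln N(di; h(xi), sigma^2)
      r = ds - (a * xs + b)
      return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (r / sigma)**2)

  def sse(a, b):
      return np.sum((ds - (a * xs + b))**2)

  best_ml  = max(candidates, key=lambda h: log_likelihood(*h))
  best_sse = min(candidates, key=lambda h: sse(*h))
  print(best_ml, best_sse)    # the two criteria pick the same hypothesis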
Minimum Description Length Principle

  hMAP = argmax_{h∈H} P(D|h) P(h)
       = argmax_{h∈H} log2 P(D|h) + log2 P(h)
       = argmin_{h∈H} - log2 P(D|h) - log2 P(h)     (1)

Interesting fact from information theory:
  The optimal (shortest expected length) code for an event with probability p is -log2 p bits.

So interpret (1):
• -log2 P(h) is the length of h under the optimal code
• -log2 P(D|h) is the length of D given h under the optimal code
→ prefer the hypothesis that minimizes
  length(h) + length(misclassifications)

Bayes Optimal Classifier

Bayes optimal classification:

  argmax_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D)

Example:
  P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
  P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
  P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

therefore
  ∑_{hi∈H} P(+|hi) P(hi|D) = .4
  ∑_{hi∈H} P(-|hi) P(hi|D) = .6

and
  argmax_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D) = -

Gibbs Classifier

Bayes optimal classifier provides best result, but can be expensive if many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance

Surprising fact: assume target concepts are drawn at random from H according to priors on H. Then:
  E[errorGibbs] ≤ 2 E[errorBayesOptimal]

Suppose correct, uniform prior distribution over H, then
• Pick any hypothesis from VS, with uniform probability
• Its expected error no worse than twice Bayes optimal

Naïve Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given classification

Successful applications:
• Diagnosis
• Classifying text documents

Naïve Bayes Classifier

Assume target function f: X→V, where each instance x is described by attributes (a1, a2, …, an).
Most probable value of f(x) is:

  vMAP = argmax_{vj∈V} P(vj|a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an|vj) P(vj) / P(a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an|vj) P(vj)

Naïve Bayes assumption:

  P(a1, a2, …, an|vj) = ∏_i P(ai|vj)

which gives the Naïve Bayes classifier:

  vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)

Naïve Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  vNB = argmax_{vj∈V} P̂(vj) ∏_{ai∈x} P̂(ai|vj)
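A compact Python sketch of the pseudocode above, using plain relative-frequency estimates; the representation of examples as (attribute-dict, label) pairs is an assumption for the sketch, not something the slide specifies:

  from collections import Counter, defaultdict

  # Sketch of Naive_Bayes_Learn / Classify_New_Instance for discrete attributes.
  # `examples` is assumed to be a list of (attribute_dict, label) pairs.
  def naive_bayes_learn(examples):
      label_counts = Counter(label for _, label in examples)
      n = len(examples)
      priors = {v: c / n for v, c in label_counts.items()}      # estimates of P(vj)
      cond = defaultdict(Counter)                                # counts for P(ai|vj)
      for attrs, label in examples:
          for attr, value in attrs.items():
              cond[(label, attr)][value] += 1
      return priors, cond, label_counts

  def classify_new_instance(x, priors, cond, label_counts):
      def score(v):
          p = priors[v]
          for attr, value in x.items():
              p *= cond[(v, attr)][value] / label_counts[v]      # relative frequency
          return p
      return max(priors, key=score)

Note that an attribute value never seen with a given class makes one factor, and hence the whole product, zero here; that is exactly the problem the m-estimate on the "Naïve Bayes Subtleties" slide below addresses.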
Naïve Bayes Example

Consider CoolCar again and new instance
  (Color=Blue, Type=SUV, Doors=2, Tires=WhiteW)
Want to compute

  vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)

  P(+)*P(Blue|+)*P(SUV|+)*P(2|+)*P(WhiteW|+) =
    5/14 * 1/5 * 2/5 * 4/5 * 3/5 = 0.0137
  P(-)*P(Blue|-)*P(SUV|-)*P(2|-)*P(WhiteW|-) =
    9/14 * 3/9 * 4/9 * 3/9 * 3/9 = 0.0106

Naïve Bayes Subtleties

1. Conditional independence assumption is often violated

  P(a1, a2, …, an|vj) = ∏_i P(ai|vj)

• … but it works surprisingly well anyway. Note that you do not need the estimated posteriors to be correct; need only that

  argmax_{vj∈V} P̂(vj) ∏_i P̂(ai|vj) = argmax_{vj∈V} P(vj) P(a1, …, an|vj)

• see Domingos & Pazzani (1996) for analysis
• Naïve Bayes posteriors often unrealistically close to 1 or 0
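As a quick check of the CoolCar example, a sketch that reuses only the two products already shown and normalizes them:

  # Sketch: normalize the two Naive Bayes scores from the CoolCar example.
  score_pos = 5/14 * 1/5 * 2/5 * 4/5 * 3/5    # ≈ 0.0137
  score_neg = 9/14 * 3/9 * 4/9 * 3/9 * 3/9    # ≈ 0.0106
  total = score_pos + score_neg
  print(score_pos / total)   # ≈ 0.56, so vNB = +
  print(score_neg / total)   # ≈ 0.44

The normalized scores are only crude posterior estimates, per the subtlety noted above.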
Naïve Bayes Subtleties

2. What if none of the training instances with target value vj have attribute value ai? Then

  P̂(ai|vj) = 0, and …
  P̂(vj) ∏_i P̂(ai|vj) = 0

Typical solution is a Bayesian estimate for P̂(ai|vj):

  P̂(ai|vj) ← (nc + mp) / (n + m)

• n is the number of training examples for which v = vj
• nc is the number of examples for which v = vj and a = ai
• p is the prior estimate for P̂(ai|vj)
• m is the weight given to the prior (i.e., number of "virtual" examples)

Bayesian Belief Networks

Interesting because
• Naïve Bayes assumption of conditional independence is too restrictive
• But it is intractable without some such assumptions…
• Bayesian belief networks describe conditional independence among subsets of variables
• allow combining prior knowledge about (in)dependence among variables with observed training data
• (also called Bayes Nets)
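A sketch of the m-estimate above, as it might replace the raw relative frequency in the earlier Naïve Bayes sketch; the uniform prior p = 1/k is one common choice, not mandated by the slide:

  # Sketch: m-estimate of P(ai|vj), per the formula above.
  #   n_c = count of examples with v = vj and a = ai
  #   n   = count of examples with v = vj
  #   p   = prior estimate for P(ai|vj) (e.g., 1/k for k possible attribute values)
  #   m   = equivalent sample size (number of "virtual" examples)
  def m_estimate(n_c, n, p, m):
      return (n_c + m * p) / (n + m)

  # An attribute value never seen with this class (n_c = 0) no longer
  # forces the whole product to zero.
  print(m_estimate(0, 5, p=0.25, m=4))   # 1/9 ≈ 0.111 instead of 0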
Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀xi, yj, zk) P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)

more compactly we write

  P(X|Y,Z) = P(X|Z)

Example: Thunder is conditionally independent of Rain given Lightning

  P(Thunder|Rain,Lightning) = P(Thunder|Lightning)

Naïve Bayes uses conditional ind. to justify

  P(X,Y|Z) = P(X|Y,Z) P(Y|Z)
           = P(X|Z) P(Y|Z)

Bayesian Belief Network

[Figure: belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, with Storm and BusTourGroup as the parents of Campfire; the conditional probability table for Campfire is shown]

Conditional probability table for Campfire:

         S,B    S,¬B   ¬S,B   ¬S,¬B
  C      0.4    0.1    0.8    0.2
  ¬C     0.6    0.9    0.2    0.8

Network represents a set of conditional independence assumptions
• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors
• Directed acyclic graph
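The Campfire table above can be held as a simple lookup keyed by the values of its parents; a minimal sketch:

  # Sketch: the Campfire CPT above as a lookup keyed by (Storm, BusTourGroup).
  # Each entry is P(Campfire=True | parents); P(Campfire=False | parents) is the complement.
  p_campfire_given_parents = {
      (True,  True):  0.4,
      (True,  False): 0.1,
      (False, True):  0.8,
      (False, False): 0.2,
  }

  def p_campfire(value, storm, bus_tour_group):
      p_true = p_campfire_given_parents[(storm, bus_tour_group)]
      return p_true if value else 1.0 - p_true

  print(p_campfire(True, storm=True, bus_tour_group=False))   # 0.1, from the table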
Bayesian Belief Network

• Represents joint probability distribution over all variables
• e.g., P(Storm, BusTourGroup, …, ForestFire)
• in general,

  P(y1, …, yn) = ∏_{i=1}^{n} P(yi | Parents(Yi))

  where Parents(Yi) denotes the immediate predecessors of Yi in the graph
• so, the joint distribution is fully defined by the graph, plus the P(yi|Parents(Yi))

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
• Bayes net contains all information needed
• If only one variable with unknown value, easy to infer it
• In general case, problem is NP-hard

In practice, can succeed in many cases
• Exact inference methods work well for some network structures
• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Learning of Bayesian Networks

Several variants of this learning task
• Network structure might be known or unknown
• Training examples might provide values of all network variables, or just some

If structure known and we observe all variables
• Then it is as easy as training a Naïve Bayes classifier

Learning Bayes Net

Suppose structure known, variables partially observable
  e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire, …
• Similar to training neural network with hidden units
• In fact, can learn network conditional probability tables using gradient ascent!
• Converge to network h that (locally) maximizes P(D|h)

Gradient Ascent for Bayes Nets

Let wijk denote one entry in the conditional probability table for variable Yi in the network

  wijk = P(Yi = yij | Parents(Yi) = the list uik of values)

e.g., if Yi = Campfire, then uik might be (Storm=T, BusTourGroup=F)

Perform gradient ascent by repeatedly
1. Update all wijk using training data D

  wijk ← wijk + η ∑_{d∈D} Ph(yij, uik | d) / wijk

2. Then renormalize the wijk to assure

  ∑_j wijk = 1,  0 ≤ wijk ≤ 1

Summary of Bayes Belief Networks

• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area
  – Extend from Boolean to real-valued variables
  – Parameterized distributions instead of tables
  – Extend to first-order instead of propositional systems
  – More effective inference methods
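A hedged sketch of the update-and-renormalize loop above for a single table. The routine that computes Ph(yij, uik | d), the posterior over the entry's variable and parent values given a training case, is left as a placeholder since it requires an inference procedure over the current network:

  # Sketch: one gradient-ascent step on a single CPT, following the two steps above.
  # `cpt[k][j]` holds w_ijk = P(Yi = y_ij | parent configuration u_ik).
  # `posterior(j, k, d)` is a placeholder for Ph(y_ij, u_ik | d).
  def gradient_ascent_step(cpt, data, posterior, eta=0.01):
      # 1. Update every w_ijk using the training data D
      for k, row in enumerate(cpt):
          for j, w in enumerate(row):
              grad = sum(posterior(j, k, d) / w for d in data)
              row[j] = w + eta * grad
      # 2. Renormalize each row so the entries for a parent configuration sum to 1
      for row in cpt:
          total = sum(row)
          for j in range(len(row)):
              row[j] /= total
      return cpt

Since the raw update only increases positive entries, dividing each row by its sum keeps every wijk in [0, 1] and summing to 1, as required by step 2.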