ch6 - COW :: Ceng

Chapter 6
Bayesian Learning
lecture slides of Raymond J. Mooney, University of Texas at Austin
1
Outline
2
Two Roles of Bayesian Methods:
3
Two Roles of Bayesian Methods:
4
Axioms of Probability Theory
• All probabilities between 0 and 1
0  P( A)  1
• True proposition has probability 1, false has
probability 0.
P(true) = 1
P(false) = 0.
• The probability of disjunction is:
P( A  B)  P( A)  P( B)  P( A  B)
A
A B
B
5
Conditional Probability
• P(A | B) is the probability of A given B
• Assumes that B is all and only information
known.
• Defined by:
P( A  B)
P( A | B) 
P( B)
A
A B
B
6
Independence
• A and B are independent iff:
P( A | B)  P( A)
P( B | A)  P( B)
These two constraints are logically equivalent
• Therefore, if A and B are independent:
P( A  B)
P( A | B) 
 P( A)
P( B)
P( A  B)  P( A) P( B)
7
Joint Distribution
• The joint probability distribution for a set of random variables,
X1,…,Xn gives the probability of every combination of values (an ndimensional array with vn values if all variables are discrete with v
values, all vn values must sum to 1): P(X1,…,Xn)
negative
positive
circle
square
red
0.20
0.02
blue
0.02
0.01
circle
square
red
0.05
0.30
blue
0.20
0.20
• The probability of all possible conjunctions (assignments of values to
some subset of variables) can be calculated by summing the
appropriate subset of values from the joint distribution.
P(red  circle )  0.20  0.05  0.25
P(red )  0.20  0.02  0.05  0.3  0.57
• Therefore, all conditional probabilities can also be calculated.
P( positive | red  circle ) 
P( positive  red  circle ) 0.20

 0.80
P(red  circle )
0.25
8
Bayes Theorem
P( E | H ) P( H )
P( H | E ) 
P( E )
Simple proof from definition of conditional probability:
P( H  E )
P( H | E ) 
P( E )
(Def. cond. prob.)
P( H  E )
(Def. cond. prob.)
P( E | H ) 
P( H )
P( H  E )  P( E | H ) P( H )
QED: P( H | E ) 
P( E | H ) P( H )
P( E )
9
10
11
12
13
14
P(cancer) = .008
P(¬cancer) = .992
P(+ | cancer)=.98 P(- | cancer)=.02
P(+ | ¬cancer)=.03 P(- | ¬cancer)=.97
Two alternative hypotheses: cancer, ¬cancer
P(+ | cancer) p(cancer) = (.98)(.008) = .0078
P(+ | ¬cancer) p(¬cancer) = (.03)(.992) = .0298
hMAP = ¬cancer
P(cancer | +) = 0078 / (.0078 + .0298) = .21
P(¬cancer | +) = 1 - P(cancer | +) = .79
15
Provides a standard against which we may compare the
performance of other learning algorithms
16
17
18
•Every consistent hypothesis has a posterior probability 1 / |VSH,D|
•Every inconsistent hypothesis has a posterior probability 0
Thıus, every consistent hypothesis is a MAP learner.
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Naïve Bayes Generative Model
neg
pos pos
pos neg
pos neg
Category
med
sm lg
med
lg lg sm
sm med
red
blue
red grn red
red blue
red
circ
tri tricirc
circ circ
circ sqr
lg
sm
med med
sm lglg
sm
red
blue
grn grn
red blue
blue grn
circ
sqr
tri
circ
circ tri sqr
sqr tri
Size
Color
Shape
Size
Color
Shape
Positive
Negative
35
Naïve Bayes Inference Problem
lg red circ
??
??
neg
pos pos
pos neg
pos neg
Category
med
sm lg
med
lg lg sm
sm med
red
blue
red grn red
red blue
red
circ
tri tricirc
circ circ
circ sqr
lg
sm
med med
sm lglg
sm
red
blue
grn grn
red blue
blue grn
circ
sqr
tri
circ
circ tri sqr
sqr tri
Size
Color
Shape
Size
Color
Shape
Positive
Negative
36
Naïve Bayesian Categorization
• If we assume features of an instance are independent given
the category (conditionally independent).
n
P( X | Y )  P( X 1 , X 2 , X n | Y )   P( X i | Y )
i 1
• Therefore, we then only need to know P(Xi | Y) for each
possible pair of a feature-value and a category.
• If Y and all Xi and binary, this requires specifying only 2n
parameters:
– P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
– P(Xi=false | Y) = 1 – P(Xi=true | Y)
• Compared to specifying 2n parameters without any
independence assumptions.
37
Naïve Bayes Example
Probability
positive
negative
P(Y)
0.5
0.5
P(small | Y)
0.4
0.4
P(medium | Y)
0.1
0.2
P(large | Y)
0.5
0.4
P(red | Y)
0.9
0.3
P(blue | Y)
0.05
0.3
P(green | Y)
0.05
0.4
P(square | Y)
0.05
0.4
P(triangle | Y)
0.05
0.3
P(circle | Y)
0.9
0.3
Test Instance:
<medium ,red, circle>
38
Naïve Bayes Example
Probability
positive
negative
P(Y)
0.5
0.5
P(medium | Y)
0.1
0.2
P(red | Y)
0.9
0.3
P(circle | Y)
0.9
0.3
Test Instance:
<medium ,red, circle>
P(positive | X) = P(positive)*P(medium | positive)*P(red | positive)*P(circle | positive) / P(X)
0.5
*
0.1
*
0.9
*
0.9
= 0.0405 / P(X) = 0.0405 / 0.0495 = 0.8181
P(negative | X) = P(negative)*P(medium | negative)*P(red | negative)*P(circle | negative) / P(X)
0.5
*
0.2
*
0.3
* 0.3
= 0.009 / P(X) = 0.009 / 0.0495 = 0.1818
P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1
P(X) = (0.0405 + 0.009) = 0.0495
39
Estimating Probabilities
• Normally, probabilities are estimated based on observed
frequencies in the training data.
• If D contains nk examples in category yk, and nijk of these nk
examples have the jth value for feature Xi, xij, then:
P( X i  xij | Y  yk ) 
nijk
nk
• However, estimating such probabilities from small training
sets is error-prone.
• If due only to chance, a rare feature, Xi, is always false in
the training data, yk :P(Xi=true | Y=yk) = 0.
• If Xi=true then occurs in a test example, X, the result is that
yk: P(X | Y=yk) = 0 and yk: P(Y=yk | X) = 0
40
Probability Estimation Example
Ex
Size
Color
Shape
Category
1
small
red
circle
positive
2
large
red
circle
positive
3
small
red
triangle
negitive
4
large
blue
circle
negitive
Test Instance X:
<medium, red, circle>
Probability
positive
negative
P(Y)
0.5
0.5
P(small | Y)
0.5
0.5
P(medium | Y)
0.0
0.0
P(large | Y)
0.5
0.5
P(red | Y)
1.0
0.5
P(blue | Y)
0.0
0.5
P(green | Y)
0.0
0.0
P(square | Y)
0.0
0.0
P(triangle | Y)
0.0
0.5
P(circle | Y)
1.0
0.5
P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 = 0
P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 = 0
41
Smoothing
• To account for estimation from small samples,
probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that
each feature is given a prior probability, p, that is
assumed to have been previously observed in a
“virtual” sample of size m.
P( X i  xij | Y  yk ) 
nijk  mp
nk  m
• For binary features, p is simply assumed to be 0.5.
42
Laplace Smothing Example
• Assume training set contains 10 positive examples:
– 4: small
– 0: medium
– 6: large
• Estimate parameters as follows (if m=1, p=1/3)
–
–
–
–
P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
P(small or medium or large | positive) =
1.0
43
Continuous Attributes
• If Xi is a continuous feature rather than a discrete one, need
another way to calculate P(Xi | Y).
• Assume that Xi has a Gaussian distribution whose mean
and variance depends on Y.
• During training, for each combination of a continuous
feature Xi and a class value for Y, yk, estimate a mean, μik ,
and standard deviation σik based on the values of feature Xi
in class yk in the training data.
• During testing, estimate P(Xi | Y=yk) for a given example,
using the Gaussian distribution defined by μik and σik .
P ( X i | Y  yk ) 
1
 ik
  ( X i  ik ) 2 

exp 
2

2

2
ik


44
Comments on Naïve Bayes
• Tends to work well despite strong assumption of
conditional independence.
• Experiments show it to be quite competitive with other
classification methods on standard UCI datasets.
• Although it does not produce accurate probability
estimates when its independence assumptions are violated,
it may still pick the correct maximum-probability class in
many cases.
– Able to learn conjunctive concepts in any case
• Does not perform any search of the hypothesis space.
Directly constructs a hypothesis from parameter estimates
that are easily calculated from the training data.
– Strong bias
• Not guarantee consistency with training data.
• Typically handles noise well since it does not even focus
on completely fitting the training data.
45
46
47
48
49
50
51
52
53
Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences
Earthquake
Burglary
Alarm
JohnCalls
MaryCalls
54
Conditional Probability Tables
• Each node has a conditional probability table (CPT) that
gives the probability of each of its values given every possible
combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior
probabilities.
P(B)
.001
P(E)
Earthquake
Burglary
Alarm
A
P(J)
T
.90
F
.05
JohnCalls
B
E
P(A)
T
T
.95
T
F
.94
F
T
.29
F
F
.001
MaryCalls
.002
A
P(M)
T
.70
F
.01
55
CPT Comments
• Probability of false not given since rows
must add to 1.
• Example requires 10 parameters rather than
25–1=31 for specifying the full joint
distribution.
• Number of parameters in the CPT for a
node is exponential in the number of parents
(fan-in).
56
Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines a joint
distribution.
n
P( x1 , x2 ,...xn )   P( xi | Parents ( X i ))
i 1
• Example
P( J  M  A  B  E )
 P( J | A) P( M | A) P( A | B  E ) P(B) P(E )
 0.9  0.7  0.001 0.999  0.998  0.00062
• Therefore an inefficient approach to inference is:
– 1) Compute the joint distribution using this equation.
– 2) Compute any desired conditional probability using
the joint distribution.
57
Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net
Y
X1
X2
…
Xn
• Priors P(Y) and conditionals P(Xi|Y) for
Naïve Bayes provide CPTs for the network.
58
Bayes Net Inference
• Given known values for some evidence variables,
determine the posterior probability of some query
variables.
• Example: Given that John calls, what is the
probability that there is a Burglary?
???
Earthquake
Burglary
Alarm
JohnCalls
MaryCalls
John calls 90% of the time there
is an Alarm and the Alarm detects
94% of Burglaries so people
generally think it should be fairly high.
However, this ignores the prior
probability of John calling.
59
Bayes Net Inference
• Example: Given that John calls, what is the
probability that there is a Burglary?
P(B)
???
.001
Earthquake
Burglary
Alarm
JohnCalls
A
P(J)
T
.90
F
.05
MaryCalls
John also calls 5% of the time when there
is no Alarm. So over 1,000 days we
expect 1 Burglary and John will probably
call. However, he will also call with a
false report 50 times on average. So the
call is about 50 times more likely a false
report: P(Burglary | JohnCalls) ≈ 0.02
60
Bayes Net Inference
• Example: Given that John calls, what is the
probability that there is a Burglary?
P(B)
???
.001
Earthquake
Burglary
Alarm
JohnCalls
A
P(J)
T
.90
F
.05
MaryCalls
Actual probability of Burglary is 0.016
since the alarm is not perfect (an
Earthquake could have set it off or it
could have gone off on its own). On the
other side, even if there was not an
alarm and John called incorrectly, there
could have been an undetected Burglary
anyway, but this is unlikely.
61
Complexity of Bayes Net Inference
• In general, the problem of Bayes Net inference is
NP-hard (exponential in the size of the graph).
• For singly-connected networks or polytrees in
which there are no undirected loops, there are lineartime algorithms based on belief propagation.
– Each node sends local evidence messages to their children
and parents.
– Each node updates belief in each of its possible values
based on incoming messages from it neighbors and
propagates evidence on to its neighbors.
• There are approximations to inference for general
networks based on loopy belief propagation that
iteratively refines probabilities that converge to
accurate values in the limit.
62
63
64
65
66
67
68
69
70
71
72
73