
CPSC 7373: Artificial Intelligence
Lecture 5: Probabilistic Inference
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
Overview and Example
[Network diagram: B → A ← E, A → J, A → M. In this question B and E are EVIDENCE nodes, A is HIDDEN, and J and M are QUERY nodes.]
The alarm (A) might go off because of a Burglary (B) and/or an Earthquake (E). When the alarm (A) goes off, John (J) and/or Mary (M) may call to report it.
Possible questions:
• Given evidence about B or E, what is the probability that J or M will call?
Answer to this type of question:
• Posterior distribution: P(Q1, Q2, … | E1=e1, E2=e2)
• It's the probability distribution of one or
more query variables given the values of the
evidence variables.
Overview and Example
[Network diagram: B → A ← E, A → J, A → M, with EVIDENCE, HIDDEN, and QUERY nodes as before.]
The alarm (A) might go off because of a Burglary (B) and/or an Earthquake (E). When the alarm (A) goes off, John (J) and/or Mary (M) may call to report it.
Possible questions:
• Out of all the possible values for all the
query variables, which combination of
values has the highest probability?
Answer to these questions:
• argmax_q P(Q1=q1, Q2=q2, … | E1=e1, …)
• i.e., which assignment of values to the Q variables is most probable given the evidence values?
Overview and Example
[Network diagram: B → A ← E, A → J, A → M]
Imagine the situation where Mary has called to
report that the alarm is going off, and we want
to know whether or not there has been a
burglary. For each node: is it an evidence node, a hidden node, or a query node?
Overview and Example
[Network diagram: B → A ← E, A → J, A → M]
Imagine the situation where Mary has called to
report that the alarm is going off, and we want
to know whether or not there has been a
burglary. For each node: is it an evidence node, a hidden node, or a query node?
Evidence: M
Query: B
Hidden: E, A, J
Inference through enumeration
[Network diagram: B → A ← E, A → J, A → M]
P(+b|+j, +m) = ???
Imagine the situation where both John and
Mary have called to report that the alarm is
going off, and we want to know the probability
of a burglary.
Definition:
Conditional probability:
P(Q|E) = P(Q, E) / P(E)
Inference through enumeration
P(+b|+j, +m) = ???
= P(+b, +j, +m) / P(+j, +m)
P(+b, +j, +m)
  = Σ_e Σ_a P(+b, +j, +m, e, a)
  = Σ_e Σ_a P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)

Let f(e, a) = P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a); then
P(+b, +j, +m) = f(+e, +a) + f(+e, ¬a) + f(¬e, +a) + f(¬e, ¬a)
Definition:
Conditional probability:
P(Q|E) = P(Q, E) / P(E)
Inference through enumeration
[Network diagram: B → A ← E, A → J, A → M, with the CPTs below]

P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)
Given +e and +a, what is the value of each term?

B: P(B)
  +b  0.001
  ¬b  0.999

E: P(E)
  +e  0.002
  ¬e  0.998

A: P(A|B,E)
  +b  +e  +a  0.95
  +b  +e  ¬a  0.05
  +b  ¬e  +a  0.94
  +b  ¬e  ¬a  0.06
  ¬b  +e  +a  0.29
  ¬b  +e  ¬a  0.71
  ¬b  ¬e  +a  0.001
  ¬b  ¬e  ¬a  0.999

J: P(J|A)
  +a  +j  0.9
  +a  ¬j  0.1
  ¬a  +j  0.05
  ¬a  ¬j  0.95

M: P(M|A)
  +a  +m  0.7
  +a  ¬m  0.3
  ¬a  +m  0.01
  ¬a  ¬m  0.99
Inference through enumeration
f(e, a) = P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)

P(+b, +j, +m):
          P(+b)   P(e)    P(a|+b,e)  P(+j|a)  P(+m|a)   product
  +e, +a  0.001   0.002   0.95       0.9      0.7       0.000001197
  +e, ¬a  0.001   0.002   0.05       0.05     0.01      5e-11
  ¬e, +a  0.001   0.998   0.94       0.9      0.7       0.0005910156
  ¬e, ¬a  0.001   0.998   0.06       0.05     0.01      2.994e-8

P(+b, +j, +m) ≈ 0.00059224
Inference through enumeration
P(b) P(e) P(a | b, e) P(+j | a) P(+m | a), summed over b, e, and a:

P(+j, +m):
              P(b)    P(e)    P(a|b,e)  P(+j|a)  P(+m|a)   product
  +e, +a, +b  0.001   0.002   0.95      0.9      0.7       0.000001197
  +e, ¬a, +b  0.001   0.002   0.05      0.05     0.01      5e-11
  ¬e, +a, +b  0.001   0.998   0.94      0.9      0.7       0.0005910156
  ¬e, ¬a, +b  0.001   0.998   0.06      0.05     0.01      2.994e-8
  +e, +a, ¬b  0.999   0.002   0.29      0.9      0.7       0.0003650346
  +e, ¬a, ¬b  0.999   0.002   0.71      0.05     0.01      7.0929e-7
  ¬e, +a, ¬b  0.999   0.998   0.001     0.9      0.7       0.0006281113
  ¬e, ¬a, ¬b  0.999   0.998   0.999     0.05     0.01      0.0004980025

P(+j, +m) ≈ 0.0020841
Inference through enumeration
[Network diagram: B → A ← E, A → J, A → M]

P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m)
              ≈ 0.00059224 / 0.0020841
              ≈ 0.284

Definition:
Conditional probability:
P(Q|E) = P(Q, E) / P(E)
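A minimal Python sketch (not part of the original slides) that reproduces this enumeration with the CPT values above; the P(+a | ¬b, ¬e) = 0.001 entry is the standard value for this textbook network, and all names are illustrative:

```python
from itertools import product

# Inference by enumeration for the burglary network, using the CPTs above.
P_b = 0.001                                      # P(+b)
P_e = 0.002                                      # P(+e)
P_a = {(True, True): 0.95, (True, False): 0.94,  # P(+a | b, e)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.9, False: 0.05}                   # P(+j | a)
P_m = {True: 0.7, False: 0.01}                   # P(+m | a)

def pr(p_true, value):
    """P(X = value) for a Boolean X with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Full joint P(b, e, a, j, m) as a product of CPT entries."""
    return (pr(P_b, b) * pr(P_e, e) * pr(P_a[(b, e)], a) *
            pr(P_j[a], j) * pr(P_m[a], m))

# Numerator: sum the hidden variables E and A out of P(+b, +j, +m, e, a).
num = sum(joint(True, e, a, True, True)
          for e, a in product([True, False], repeat=2))
# Denominator: P(+j, +m) also sums over B.
den = sum(joint(b, e, a, True, True)
          for b, e, a in product([True, False], repeat=3))

print(num, den, num / den)   # ~0.000592, ~0.00208, ~0.284
```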
Enumeration
• We assumed binary events/Boolean variables.
• Only 5 variables:
  – 2^5 = 32 rows in the full joint probability table
• Practically, what if we have a large network?
Example: Car-diagnosis
Initial evidence: engine won't start
[Network diagram over: battery age, battery dead, alternator broken, fanbelt broken, no charging, battery flat, lights, oil light, gas gauge, no oil, no gas, fuel line blocked, starter broken, engine won't start]
Testable variables (thin ovals), diagnosis variables (thick ovals);
hidden variables (shaded) ensure sparse structure, reduce parameters.
Example: Car insurance
Predict claim costs (medical, liability, property)
given data on application form (other unshaded
nodes)
[Network diagram over: Age, SocioEcon, GoodStudent, RiskAversion, SeniorTrain, ExtraCar, Mileage, VehicleYear, MakeModel, DrivingSkill, DrivingHist, DrivQuality, Antilock, Airbag, Ruggedness, CarValue, HomeBase, AntiTheft, Accident, Theft, OwnDamage, Cushioning, MedicalCost, LiabilityCost, PropertyCost, OwnCost, OtherCost]
If all 27 variables were Boolean: 2^27 rows in the full joint table (and in reality they are NOT Boolean).
Speed Up Enumeration
P(+b, +j, +m)
  = Σ_e Σ_a P(+b, +j, +m, e, a)
  = Σ_e Σ_a P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)

Pulling out terms:
  = P(+b) Σ_e Σ_a P(e) P(a | +b, e) P(+j | a) P(+m | a)
  = P(+b) Σ_e P(e) Σ_a P(a | +b, e) P(+j | a) P(+m | a)
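A short sketch of the same computation with the terms pulled out, using the slide's CPT values written inline (names are illustrative, not part of the original slides):

```python
# "Pulling out terms": P(+b) is moved outside both sums and P(e) outside
# the inner sum, so each factor is evaluated only where its variables vary.
P_e = {True: 0.002, False: 0.998}                       # P(e)
P_a = {(True, True): 0.95, (True, False): 0.05,         # P(a | +b, e),
       (False, True): 0.94, (False, False): 0.06}       # keyed by (e, a)
P_j = {True: 0.9, False: 0.05}                          # P(+j | a)
P_m = {True: 0.7, False: 0.01}                          # P(+m | a)

p_bjm = 0.001 * sum(                                    # P(+b) pulled out
    P_e[e] * sum(P_a[(e, a)] * P_j[a] * P_m[a]          # P(e) pulled out of the a-sum
                 for a in (True, False))
    for e in (True, False))
print(p_bjm)   # ~0.000592, the same value with fewer multiplications
```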
Speed up enumeration
• Maximize independence
  – The structure of the Bayes network determines how efficiently we can calculate the probability values.
  – A linear chain X1 → X2 → … → Xn can be handled in O(n), whereas a fully connected network over X1, …, Xn requires O(2^n).
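As a rough illustration (assuming Boolean variables, not from the original slides), one way to see the O(n) versus O(2^n) gap is to count the CPT entries each structure needs:

```python
# Count the independent numbers needed to specify each network of n
# Boolean variables.
n = 10
chain = 1 + 2 * (n - 1)                            # X1 plus one 2-row CPT per later node
fully_connected = sum(2 ** i for i in range(n))    # node i has i parents -> 2**i entries
print(chain, fully_connected)                      # 19 vs 1023
```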
Bayesian networks: definition
• A simple, graphical notation for conditional
independence assertions and hence for compact
specification of full joint distributions
• Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link = "directly influences")
– a conditional distribution for each node given its parents:
P(Xi|Parents(Xi))
• In the simplest case, conditional distribution
represented as a conditional probability table (CPT)
giving the distribution over Xi for each combination of
parent values
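As an illustration only, this syntax can be written down as a plain data structure, one entry per node with its parents and a CPT mapping parent values to P(node = true | parents). The layout below is not a standard library format, and the 0.001 entry for (¬b, ¬e) is the standard textbook value:

```python
# The burglary network as nodes + parents + CPTs.
burglary_net = {
    'B': {'parents': [], 'cpt': {(): 0.001}},
    'E': {'parents': [], 'cpt': {(): 0.002}},
    'A': {'parents': ['B', 'E'],
          'cpt': {(True, True): 0.95, (True, False): 0.94,
                  (False, True): 0.29, (False, False): 0.001}},
    'J': {'parents': ['A'], 'cpt': {(True,): 0.9, (False,): 0.05}},
    'M': {'parents': ['A'], 'cpt': {(True,): 0.7, (False,): 0.01}},
}
```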
Constructing Bayesian Networks
[Network diagram: B → A ← E, A → J, A → M]
The alarm (A) might go off because of a Burglary (B) and/or an Earthquake (E). When the alarm (A) goes off, John (J) and/or Mary (M) may call to report it.
Suppose we choose the ordering M, J, A, B, E
• Dependent or independent?
  – P(J|M) = P(J)?
  – P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?
  – P(B|A,J,M) = P(B|A)?  P(B|A,J,M) = P(B)?
  – P(E|B,A,J,M) = P(E|A)?  P(E|B,A,J,M) = P(E|A,B)?
• Deciding conditional independence is
hard in non-causal directions
• (Causal models and conditional
independence seem hardwired for
humans!)
• Assessing conditional probabilities is
hard in non-causal directions
• Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
Variable Elimination
• Variable elimination: carry out summations
right-to-left, storing intermediate results
(factors) to avoid re-computation
P(B | j, m)
  = α Σ_e Σ_a P(B, j, m, e, a)
  = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a f_A(a, B, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) f_ĀJM(B, e)      (sum out A)
  = α P(B) f_ĒĀJM(B)                 (sum out E)
  = α f_B(B) × f_ĒĀJM(B)
Variable Elimination
• Variable elimination:
  Σ_x f1 × ⋯ × fk = f1 × ⋯ × fi × Σ_x (fi+1 × ⋯ × fk) = f1 × ⋯ × fi × f_X̄
  (assuming f1, …, fi do not depend on x)
  – Summing out a variable from a product of factors:
    • move any constant factors outside the summation
    • add up submatrices in the pointwise product of the remaining factors
  – Still NP-complete, but faster than enumeration

Pointwise product of factors f1 and f2:
  f1(a, b) × f2(b, c) = f(a, b, c)
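A minimal sketch of these two factor operations for Boolean variables; the (variables, table) representation and the function names are illustrative, not a standard library API:

```python
from itertools import product

# A factor is a pair (variables, table) where table maps a tuple of values
# (one per variable, in order) to a number.

def pointwise_product(f1, f2):
    """f1(A,B) x f2(B,C) -> f(A,B,C): multiply entries that agree on shared variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product([True, False], repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        table[values] = (t1[tuple(row[v] for v in vars1)] *
                         t2[tuple(row[v] for v in vars2)])
    return out_vars, table

def sum_out(var, f):
    """Sum a variable out of a factor, adding up entries that differ only in var."""
    variables, table = f
    i = variables.index(var)
    out_vars = variables[:i] + variables[i + 1:]
    out_table = {}
    for values, p in table.items():
        key = values[:i] + values[i + 1:]
        out_table[key] = out_table.get(key, 0.0) + p
    return out_vars, out_table
```

With these two operations, eliminating a variable is just a pointwise product of the factors that mention it followed by summing it out, as in the R, T, L example below.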
Variable Elimination
[Network: R → T → L]

P(R):
  +r  0.1
  ¬r  0.9

P(T|R):
  +r  +t  0.8
  +r  ¬t  0.2
  ¬r  +t  0.1
  ¬r  ¬t  0.9

P(L|T):
  +t  +l  0.3
  +t  ¬l  0.7
  ¬t  +l  0.1
  ¬t  ¬l  0.9

P(+l) = Σ_r Σ_t P(r) P(t | r) P(+l | t)

1) Joining factors: P(R, T)
  +r  +t  0.08
  +r  ¬t  0.02
  ¬r  +t  0.09
  ¬r  ¬t  0.81
Variable Elimination
[After joining R and T: network (R,T) → L]

P(R, T):
  +r  +t  0.08
  +r  ¬t  0.02
  ¬r  +t  0.09
  ¬r  ¬t  0.81

P(L|T):
  +t  +l  0.3
  +t  ¬l  0.7
  ¬t  +l  0.1
  ¬t  ¬l  0.9

2) Marginalize on the variable R to get a table over just T: P(R, T) → P(T)
  +t  ??
  ¬t  ??
Variable Elimination
[After joining R and T: network (R,T) → L]

P(R, T):
  +r  +t  0.08
  +r  ¬t  0.02
  ¬r  +t  0.09
  ¬r  ¬t  0.81

P(L|T):
  +t  +l  0.3
  +t  ¬l  0.7
  ¬t  +l  0.1
  ¬t  ¬l  0.9

2) Marginalize on the variable R to get a table over just T: P(R, T) → P(T)
  +t  0.17
  ¬t  0.83
Variable Elimination
[After eliminating R: network T → L]

P(T):
  +t  0.17
  ¬t  0.83

P(L|T):
  +t  +l  0.3
  +t  ¬l  0.7
  ¬t  +l  0.1
  ¬t  ¬l  0.9

3) Joint probability P(T, L):
  +t  +l  ??
  +t  ¬l  ??
  ¬t  +l  ??
  ¬t  ¬l  ??
Variable Elimination
[After eliminating R: network T → L]

P(T):
  +t  0.17
  ¬t  0.83

P(L|T):
  +t  +l  0.3
  +t  ¬l  0.7
  ¬t  +l  0.1
  ¬t  ¬l  0.9

3) Joint probability P(T, L):
  +t  +l  0.051
  +t  ¬l  0.119
  ¬t  +l  0.083
  ¬t  ¬l  0.747
Variable Elimination
[After joining T and L: a single factor over (T, L)]

P(T, L):
  +t  +l  0.051
  +t  ¬l  0.119
  ¬t  +l  0.083
  ¬t  ¬l  0.747

4) P(L):
  +l  ??
  ¬l  ??
Variable Elimination
[After joining T and L: a single factor over (T, L)]

P(T, L):
  +t  +l  0.051
  +t  ¬l  0.119
  ¬t  +l  0.083
  ¬t  ¬l  0.747

4) P(L):
  +l  0.134
  ¬l  0.866
Choice of ordering is important!
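A small sketch of the four steps above in Python, with the slide's numbers written inline ('-' stands for the slides' ¬; the names are illustrative):

```python
# Variable elimination on R -> T -> L: join P(R) and P(T|R) into P(R,T),
# sum out R, join with P(L|T), then sum out T.
P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}
P_L_given_T = {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
               ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}

# 1) Join: P(R, T) = P(R) * P(T | R)
P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
        for r in P_R for t in ('+t', '-t')}
# {('+r','+t'): 0.08, ('+r','-t'): 0.02, ('-r','+t'): 0.09, ('-r','-t'): 0.81}

# 2) Sum out R: P(T)
P_T = {t: sum(P_RT[(r, t)] for r in P_R) for t in ('+t', '-t')}
# {'+t': 0.17, '-t': 0.83}

# 3) Join: P(T, L) = P(T) * P(L | T)
P_TL = {(t, l): P_T[t] * P_L_given_T[(t, l)]
        for t in P_T for l in ('+l', '-l')}
# {('+t','+l'): 0.051, ('+t','-l'): 0.119, ('-t','+l'): 0.083, ('-t','-l'): 0.747}

# 4) Sum out T: P(L)
P_L = {l: sum(P_TL[(t, l)] for t in P_T) for l in ('+l', '-l')}
print(P_L)   # {'+l': 0.134, '-l': 0.866}
```

The commented intermediate values match the tables on the slides above.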
Approximate Inference: Sampling
• Joint probability of heads and tails for a 1-cent coin and a 5-cent coin.
• Advantages:
– Computationally easier.
– Works even without CPTs.
  1 cent   5 cent
  H        H
  H        T
  T        H
  T        T
Sampling Example
[Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass]

Cloudy: P(C)
  +c  0.5
  ¬c  0.5

Sprinkler: P(S|C)
  +c  +s  0.1
  +c  ¬s  0.9
  ¬c  +s  0.5
  ¬c  ¬s  0.5

Rain: P(R|C)
  +c  +r  0.8
  +c  ¬r  0.2
  ¬c  +r  0.2
  ¬c  ¬r  0.8

WetGrass: P(W|S,R)
  +s  +r  +w  0.99
  +s  +r  ¬w  0.01
  +s  ¬r  +w  0.90
  +s  ¬r  ¬w  0.10
  ¬s  +r  +w  0.90
  ¬s  +r  ¬w  0.10
  ¬s  ¬r  +w  0.01
  ¬s  ¬r  ¬w  0.99

Samples: +c, ¬s, +r
• Sampling is consistent if we want to compute the full
joint probability of the network or individual variables.
• What about conditional probability? P(w|¬c)
• Rejection sampling: reject the samples that do not match the evidence we are conditioning on.
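A sketch of prior sampling and rejection sampling for the sprinkler network above, estimating P(+w | ¬c); the helper names are illustrative (not from the original slides):

```python
import random

# Prior sampling and rejection sampling on the sprinkler network
# (C -> S, C -> R; S, R -> W), using the CPTs from the slide.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(+s | c)
P_R = {True: 0.8, False: 0.2}                    # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,  # P(+w | s, r)
       (False, True): 0.90, (False, False): 0.01}

def flip(p):
    return random.random() < p

def prior_sample():
    """Sample (c, s, r, w) in topological order from the CPTs."""
    c = flip(P_C)
    s = flip(P_S[c])
    r = flip(P_R[c])
    w = flip(P_W[(s, r)])
    return c, s, r, w

def rejection_sample(query, evidence, n=100_000):
    """Estimate P(query | evidence); both are predicates on a sample.
    Samples that do not match the evidence are thrown away."""
    kept = matched = 0
    for _ in range(n):
        sample = prior_sample()
        if evidence(sample):
            kept += 1
            matched += query(sample)
    return matched / kept if kept else float('nan')

# P(+w | ¬c): keep only samples with c == False, count those with w == True.
print(rejection_sample(query=lambda smp: smp[3], evidence=lambda smp: not smp[0]))
```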
Rejection sampling
• Too many rejected samples make it inefficient.
  – Alternative: likelihood weighting, which fixes the evidence variables instead of sampling them; unweighted, those samples would give inconsistent estimates, so each sample is weighted.
Likelihood weighting
[Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass]

Query: P(R | +s, +w)

Cloudy: P(C)
  +c  0.5
  ¬c  0.5

Sprinkler: P(S|C)
  +c  +s  0.1
  +c  ¬s  0.9
  ¬c  +s  0.5
  ¬c  ¬s  0.5

Rain: P(R|C)
  +c  +r  0.8
  +c  ¬r  0.2
  ¬c  +r  0.2
  ¬c  ¬r  0.8

WetGrass: P(W|S,R)
  +s  +r  +w  0.99
  +s  +r  ¬w  0.01
  +s  ¬r  +w  0.90
  +s  ¬r  ¬w  0.10
  ¬s  +r  +w  0.90
  ¬s  +r  ¬w  0.10
  ¬s  ¬r  +w  0.01
  ¬s  ¬r  ¬w  0.99

Weight samples: sample +c, fix S = +s (weight × 0.1), sample +r, fix W = +w (weight × 0.99)
  weight = 0.1 × 0.99 for the sample (+c, +s, +r, +w)

P(C|+s, +r) ??
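A sketch of likelihood weighting for the query P(R | +s, +w) on the same network; the evidence variables are clamped and each sample carries a weight, exactly as in the worked sample above (names are illustrative):

```python
import random

# Likelihood weighting for P(R | +s, +w) on the sprinkler network.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(+s | c)
P_R = {True: 0.8, False: 0.2}                    # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,  # P(+w | s, r)
       (False, True): 0.90, (False, False): 0.01}

def flip(p):
    return random.random() < p

def weighted_sample():
    """One sample with S and W clamped to +s and +w; returns (r, weight)."""
    weight = 1.0
    c = flip(P_C)             # C is sampled normally
    s = True                  # evidence: S = +s
    weight *= P_S[c]          # ... so multiply in P(+s | c)
    r = flip(P_R[c])          # R is sampled normally
    w = True                  # evidence: W = +w
    weight *= P_W[(s, r)]     # ... so multiply in P(+w | s, r)
    return r, weight

def likelihood_weighting(n=100_000):
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample()
        den += w
        if r:
            num += w
    return num / den          # estimate of P(+r | +s, +w)

print(likelihood_weighting())
```

For these CPTs the exact value works out to roughly 0.32, so the estimate should settle near that.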
Gibbs Sampling
• Markov Chain Monte Carlo (MCMC)
– Sample one variable at a time, conditioned on all the others.
Example sequence of states (one variable is resampled at each step):
  +c  +s  ¬r  ¬w
  +c  ¬s  ¬r  ¬w
  +c  ¬s  +r  ¬w
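An illustrative Gibbs sampling sketch (not from the original slides), reusing the sprinkler network and the query P(R | +s, +w): the evidence stays clamped while C and R are resampled one at a time from their conditional given everything else, computed here from ratios of the joint:

```python
import random

# Gibbs sampling (MCMC) for P(+r | +s, +w) on the sprinkler network.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(+s | c)
P_R = {True: 0.8, False: 0.2}                    # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,  # P(+w | s, r)
       (False, True): 0.90, (False, False): 0.01}

def pr(p_true, v):
    return p_true if v else 1.0 - p_true

def joint(c, s, r, w):
    return pr(P_C, c) * pr(P_S[c], s) * pr(P_R[c], r) * pr(P_W[(s, r)], w)

def gibbs(n=100_000, burn_in=1_000):
    s, w = True, True            # evidence: +s, +w (never resampled)
    c, r = True, True            # arbitrary initial values for the other variables
    count = total = 0
    for step in range(n + burn_in):
        # Resample C from P(C | s, r, w), computed as a ratio of joints.
        p_c = joint(True, s, r, w) / (joint(True, s, r, w) + joint(False, s, r, w))
        c = random.random() < p_c
        # Resample R from P(R | c, s, w) the same way.
        p_r = joint(c, s, True, w) / (joint(c, s, True, w) + joint(c, s, False, w))
        r = random.random() < p_r
        if step >= burn_in:
            total += 1
            count += r
    return count / total         # estimate of P(+r | +s, +w)

print(gibbs())
```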
Monty Hall Problem
• Suppose you're on a game show, and you're given the choice of
three doors: Behind one door is a car; behind the others, goats. You
pick a door, say No. 2 [but the door is not opened], and the host,
who knows what's behind the doors, opens another door, say No. 1,
which has a goat. He then says to you, "Do you want to pick door
No. 3?" Is it to your advantage to switch your choice?
P(C=3|S=2) = ??
P(C=3|H=1,S=2) = ??
Monty Hall Problem
• Suppose you're on a game show, and you're given the choice of
three doors: Behind one door is a car; behind the others, goats. You
pick a door, say No. 2 [but the door is not opened], and the host,
who knows what's behind the doors, opens another door, say No. 1,
which has a goat. He then says to you, "Do you want to pick door
No. 3?" Is it to your advantage to switch your choice?
P(C=3|S=2) = 1/3
P(C=3|H=1,S=2) = 2/3
Why???
Monty Hall Problem
• P(C=3 | H=1, S=2)
  = P(H=1 | C=3, S=2) P(C=3 | S=2) / Σ_i P(H=1 | C=i, S=2) P(C=i | S=2)
  = (1 × 1/3) / (0 × 1/3 + 1/2 × 1/3 + 1 × 1/3) = (1/3) / (1/2) = 2/3
• P(C=1|S=2) = P(C=2|S=2) = P(C=3|S=2) = 1/3
• The host never opens the selected door or the door hiding the car: P(H=1|C=1,S=2) = 0, P(H=1|C=2,S=2) = 1/2, P(H=1|C=3,S=2) = 1.