
Reasoning Under Uncertainty
Artificial Intelligence
Chapter 9
Part 2
Reasoning
Notation

Random variable (RV): a variable (uppercase) that takes on values (lowercase) from a domain of mutually exclusive and exhaustive values
A=a: a proposition, world state, event, effect, etc.
– abbreviate: P(A=true) to P(a)
– abbreviate: P(A=false) to P(¬a)
– abbreviate: P(A=value) to P(value)
Atomic event: a complete specification of the state of the world about which the agent is uncertain
Notation

P(a): prior probability of RV A=a, which is the degree of belief in proposition a in the absence of any other relevant information
P(a|e): conditional probability of RV A=a given E=e, which is the degree of belief in proposition a when all that is known is evidence e
P(A): probability distribution, i.e. the set of P(ai) for all i
Joint probabilities are for conjunctions of propositions
Reasoning under Uncertainty

Rather than reasoning about the truth or falsity of a proposition, instead reason about the belief that a proposition is true.
Use a knowledge base of known probabilities to determine probabilities for query propositions.
Reasoning under Uncertainty using Full Joint Distributions

Assume a simplified Clue game having two characters, two weapons and two rooms:

    Who    What   Where     Probability
    plum   rope   hall      1/8
    plum   rope   kitchen   1/8
    plum   pipe   hall      1/8
    plum   pipe   kitchen   1/8
    green  rope   hall      1/8
    green  rope   kitchen   1/8
    green  pipe   hall      1/8
    green  pipe   kitchen   1/8

each row is an atomic event
– one of these must be true
– the list must be mutually exclusive
– the list must be exhaustive
each is equally likely, so the prior probability for each is 1/8
– e.g. P(plum, rope, hall) = 1/8
Σ P(atomic_eventi) = 1, since each RV's domain is exhaustive and mutually exclusive
Determining Marginal Probabilities using Full Joint Distributions

The probability of any proposition a is equal to the sum of the probabilities of the atomic events in which it holds; this set of events is called e(a).
    P(a) = Σ P(ei)    where ei is an element of e(a)
– i.e. a is the disjunction of the atomic events in the set e(a)
– recall this property of atomic events: any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of that proposition
Determining Marginal Probabilities using Full Joint Distributions

Assume a simplified Clue game having two characters, two weapons and two rooms:

    Who    What   Where     Probability
    plum   rope   hall      1/8
    plum   rope   kitchen   1/8
    plum   pipe   hall      1/8
    plum   pipe   kitchen   1/8
    green  rope   hall      1/8
    green  rope   kitchen   1/8
    green  pipe   hall      1/8
    green  pipe   kitchen   1/8

P(a) = Σ P(ei)    where ei is an element of e(a)

P(plum) = 1/8 + 1/8 + 1/8 + 1/8 = 1/2

– when obtained in this manner it is called a marginal probability
– it can be just a prior probability (shown) or more complex (next)
– this process is called marginalization or summing out
Reasoning under Uncertainty using Full Joint Distributions

Assume a simplified Clue game having two characters, two weapons and two rooms, with the same table of eight atomic events, each with probability 1/8:

P(green, pipe) = 1/8 + 1/8 = 1/4
P(rope, hall) = 1/8 + 1/8 = 1/4
P(rope ∨ hall) = 1/8 + 1/8 + 1/8 + 1/8 + 1/8 + 1/8 = 3/4
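These table queries are mechanical, so they are easy to code. A minimal sketch (assumed, not from the slides), using only the eight atomic events and their 1/8 probabilities from the table above; a proposition is any test over an atomic event:

    # querying the simplified-Clue full joint distribution by summing the
    # probabilities of the atomic events where a proposition holds
    from itertools import product

    fjd = {(who, what, where): 1/8
           for who, what, where in product(["plum", "green"],
                                            ["rope", "pipe"],
                                            ["hall", "kitchen"])}

    def prob(holds):
        """P(proposition) = sum of P(ei) over atomic events ei where it holds."""
        return sum(p for event, p in fjd.items() if holds(event))

    print(prob(lambda e: e[0] == "plum"))                      # P(plum) = 0.5
    print(prob(lambda e: e[0] == "green" and e[1] == "pipe"))  # P(green, pipe) = 0.25
    print(prob(lambda e: e[1] == "rope" or e[2] == "hall"))    # P(rope v hall) = 0.75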
Independence

Using the game Clue for an example is uninteresting! Why?
– Because the random variables Who, What, Where are independent.
– Does picking the murderer from the deck of cards affect which weapon is chosen? The location? No! Each is randomly selected.
Independence

Unconditional (absolute) independence: the RVs have no effect on each other's probabilities
1. P(X|Y) = P(X)
2. P(Y|X) = P(Y)
3. P(X,Y) = P(X) P(Y)

Example (full Clue: 6 characters, 6 weapons, 9 rooms, so 324 atomic events):
P(green | hall) = P(green, hall) / P(hall) = (6/324) / (1/9) = 1/6 = P(green)
P(hall | green) = P(hall) = 1/9
P(green, hall) = P(green) P(hall) = 1/54

We need a more interesting example!
Independence

Conditional independence: RVs X and Y both depend on another RV Z, but are independent of each other given Z
1. P(X|Y,Z) = P(X|Z)
2. P(Y|X,Z) = P(Y|Z)
3. P(X,Y|Z) = P(X|Z) P(Y|Z)

Idea: sneezing (x) and itchy eyes (y) are both directly caused by hayfever (z), but neither sneezing nor itchy eyes has a direct effect on the other.
Reasoning under Uncertainty using Full Joint Distributions

Assume three boolean RVs, Hayfever (HF), Sneeze (SN), ItchyEyes (IE), and fictional probabilities:

    HF     SN     IE     Probability
    false  false  false  0.5
    false  false  true   0.09
    false  true   false  0.1
    false  true   true   0.1
    true   false  false  0.01
    true   false  true   0.06
    true   true   false  0.04
    true   true   true   0.1

P(a) = Σ P(ei)    where ei is an element of e(a)

P(sn) = 0.1 + 0.1 + 0.04 + 0.1 = 0.34
P(hf) = 0.01 + 0.06 + 0.04 + 0.1 = 0.21
P(sn, ie) = 0.1 + 0.1 = 0.20
P(hf, sn) = 0.04 + 0.1 = 0.14
Reasoning under Uncertainty using Full Joint Distributions

Using the same table of fictional probabilities for HF, SN and IE:

P(a|e) = P(a, e) / P(e)

P(hf | sn) = P(hf, sn) / P(sn) = 0.14 / 0.34 ≈ 0.41
P(hf | ie) = P(hf, ie) / P(ie) = 0.16 / 0.35 ≈ 0.46
Reasoning under Uncertainty using Full Joint Distributions

Using the same table of fictional probabilities for HF, SN and IE:

P(a|e) = P(a, e) / P(e)

Instead of computing P(e), we could use normalization:
P(hf | sn) = 0.14 / P(sn)
also compute:
P(¬hf | sn) = 0.20 / P(sn)
since P(hf | sn) + P(¬hf | sn) = 1,
substituting and solving gives P(sn) = 0.34 !
Combining Multiple Evidence

As evidence describing the state of the world is accumulated, we'd like to be able to easily update the degree of belief in a conclusion.

Using the full joint probability distribution table:
P(v1,...,vk | vk+1,...,vn) = P(V1=v1,...,Vn=vn) / P(Vk+1=vk+1,...,Vn=vn)
1. sum all entries in the table where V1=v1, ..., Vn=vn
2. divide by the sum of all entries in the table corresponding to the evidence, where Vk+1=vk+1, ..., Vn=vn
Combining Multiple Evidence using Full Joint Distributions

Using the same table of fictional probabilities for HF, SN and IE:

P(a | b, c) = P(a, b, c) / P(b, c)    as described on the prior slide

P(hf | sn, ie) = P(hf, sn, ie) / P(sn, ie)
               = 0.10 / (0.1 + 0.1)
               = 0.5
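A minimal sketch (assumed, not from the slides) of these conditional queries over the fictional HF/SN/IE joint distribution, using P(a|e) = P(a, e) / P(e):

    # conditional queries over the HF/SN/IE full joint distribution
    fjd = {  # (hf, sn, ie): probability, copied from the table above
        (False, False, False): 0.50, (False, False, True): 0.09,
        (False, True,  False): 0.10, (False, True,  True): 0.10,
        (True,  False, False): 0.01, (True,  False, True): 0.06,
        (True,  True,  False): 0.04, (True,  True,  True): 0.10,
    }

    def prob(holds):
        """Sum the probabilities of the atomic events where the proposition holds."""
        return sum(p for event, p in fjd.items() if holds(event))

    def cond(query, evidence):
        """P(query | evidence) = P(query, evidence) / P(evidence)."""
        return prob(lambda e: query(e) and evidence(e)) / prob(evidence)

    hf, sn, ie = (lambda e: e[0]), (lambda e: e[1]), (lambda e: e[2])
    print(cond(hf, sn))                         # P(hf|sn)    = 0.14/0.34 ~ 0.41
    print(cond(hf, lambda e: sn(e) and ie(e)))  # P(hf|sn,ie) = 0.10/0.20 = 0.5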
Combining Multiple Evidence (cont.)

FJDT techniques are intractable in general because the table size grows exponentially.
Independence assertions can help reduce the size of the domain and the complexity of the inference problem.
Independence assertions are usually based on knowledge of the domain, enabling the FJD table to be factored into separate joint distribution tables.
– it's a good thing that problem domains contain independent RVs
– but typically the subsets of dependent RVs are quite large
Probability Rules for Multi-valued Variables

Summing Out: P(Y) = Σz P(Y, z)
    sum over all values z of RV Z
Conditioning: P(Y) = Σz P(Y|z) P(z)
    sum over all values z of RV Z
Product Rule: P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
Chain Rule: P(X, Y, Z) = P(X|Y, Z) P(Y|Z) P(Z)
– this is a generalization of the product rule with Y = Y,Z
– the order of the RVs doesn't matter, i.e. any order gives the same result
Conditionalized Chain Rule: (let Y = A|B)
P(X, A|B) = P(X|A, B) P(A|B)    (order doesn't matter)
          = P(A|X, B) P(X|B)
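For completeness, a standard identity not written out on the slide: the chain rule extends to any number of RVs by repeated application of the product rule.

    P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1}, \ldots, X_1)

    \text{e.g. } P(W, X, Y, Z) = P(W \mid X, Y, Z)\, P(X \mid Y, Z)\, P(Y \mid Z)\, P(Z)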
Bayes' Rule

Bayes' Rule: P(b|a) = (P(a|b) P(b)) / P(a)
– derived from P(a ∧ b) = P(b|a) P(a) = P(a|b) P(b)
– just divide both sides of the equation by P(a)
– it is the basis of AI systems using probabilistic reasoning

For example:
– a = happy, b = sun: P(happy|sun) = 0.95, P(sun) = 0.5, P(happy) = 0.75
  P(sun|happy) = (0.95 * 0.5) / 0.75 ≈ 0.63
– a = sneeze, b = fall: P(sneeze|fall) = 0.85, P(fall) = 0.25, P(sneeze) = 0.3
  P(fall|sneeze) = (0.85 * 0.25) / 0.3 ≈ 0.71
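A minimal sketch (assumed, not from the slides): Bayes' rule as a one-line helper, checked against the two examples above.

    def bayes(p_a_given_b, p_b, p_a):
        """P(b|a) = P(a|b) P(b) / P(a)."""
        return p_a_given_b * p_b / p_a

    print(bayes(0.95, 0.5, 0.75))   # P(sun|happy)   ~ 0.63
    print(bayes(0.85, 0.25, 0.3))   # P(fall|sneeze) ~ 0.71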
Bayes' Rule

P(b|a) = (P(a|b) P(b)) / P(a)

What's the benefit of being able to calculate P(b|a) from the three probabilities on the right?

Usefulness of Bayes' Rule:
– many problems have good estimates of the probabilities on the right
– P(b|a) is needed to identify a cause, a classification, a diagnosis, etc.
– a typical use is to calculate diagnostic knowledge from causal knowledge
Bayes' Rule

Causal knowledge: from causes to effects
– e.g. P(sneeze|cold), the probability of the effect sneeze given the cause common cold
– this probability the doctor obtains from experience treating patients and understanding the disease process
Diagnostic knowledge: from effects to causes
– e.g. P(cold|sneeze), the probability of the cause common cold given the effect sneeze
– knowing this probability helps a doctor make a disease diagnosis based on a patient's symptoms
– diagnostic knowledge is more fragile than causal knowledge, since it can change significantly over time given variations in the rate of occurrence of its causes (due to epidemics, etc.)
Bayes' Rule

Using Bayes' Rule with causal knowledge:
– we want to determine diagnostic knowledge (diagnostic reasoning) that is difficult to obtain from a general population
– e.g. the symptom is s = stiffNeck, the disease is m = meningitis
  P(s|m) = 1/2                    the causal knowledge
  P(m) = 1/50000, P(s) = 1/20     prior probabilities
  P(m|s) = ?                      the desired diagnostic knowledge
  P(m|s) = (1/2 * 1/50000) / (1/20) = 1/5000
– the doctor can now use P(m|s) to guide diagnosis
Combining Multiple Evidence using Bayes' Rule

How do you update the conditional probability of Y given two pieces of evidence A and B?
General Bayes' Rule for multi-valued RVs: P(Y|X) = (P(X|Y) P(Y)) / P(X)
Let X = A,B:
P(Y|A,B) = (P(A,B|Y) P(Y)) / P(A,B)
         = (P(Y) P(B|A,Y) P(A|Y)) / (P(B|A) P(A))        conditionalized chain rule and product rule
         = P(Y) * (P(A|Y)/P(A)) * (P(B|A,Y)/P(B|A))

Problems:
– P(B|A,Y) is generally hard to compute or obtain
– it doesn't scale well for n evidence RVs; the table size grows O(2^n)
Combining Multiple Evidence using Bayes' Rule

These problems can be circumvented:
If A and B are conditionally independent given Y, then P(A,B|Y) = P(A|Y) P(B|Y), and for P(A,B) use the product rule:
– P(Y|A,B) = (P(Y) P(A,B|Y)) / P(A,B)                    Bayes' Rule with multiple evidence
– P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B|A))
– no joint probabilities are needed, so the representation grows O(n)
If A is unconditionally independent of B, then P(A,B|Y) = P(A|Y) P(B|Y) and P(A,B) = P(A) P(B):
– P(Y|A,B) = (P(Y) P(A,B|Y)) / P(A,B)                    Bayes' Rule with multiple evidence
– P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B))
This equation is used to define a naïve Bayes classifier.
Combining Multiple Evidence using Bayes' Rule

Example: What is the likelihood that a patient has sclerosing cholangitis?
– doctor's initial belief: P(sc) = 1/1,000,000
– examination reveals jaundice: P(j) = 1/10,000, P(j|sc) = 1/5
– doctor's belief given the test result: P(sc|j) = P(sc) P(j|sc) / P(j) = 2/1000
– tests reveal fibrosis of the bile ducts: P(f|sc) = 4/5, P(f) = 1/100
– the doctor naïvely assumes jaundice and fibrosis are independent
– using P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B)):
  P(sc|j,f) = P(sc) * (P(j|sc)/P(j)) * (P(f|sc)/P(f))
– the doctor's belief now rises: P(sc|j,f) = 16/100
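A minimal sketch (assumed, not from the slides) reproducing the doctor's two belief updates with the naive multiple-evidence form of Bayes' rule:

    p_sc = 1 / 1_000_000             # prior belief in sclerosing cholangitis
    p_j, p_j_sc = 1 / 10_000, 1 / 5  # jaundice: prior and likelihood given sc
    p_f, p_f_sc = 1 / 100, 4 / 5     # fibrosis: prior and likelihood given sc

    p_sc_j  = p_sc * (p_j_sc / p_j)                     # = 0.002 (2/1000)
    p_sc_jf = p_sc * (p_j_sc / p_j) * (p_f_sc / p_f)    # = 0.16  (16/100)
    print(p_sc_j, p_sc_jf)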
Naïve Bayes Classifier

A naïve Bayes classifier is
– used where a single class is based on a number of features, or where a single cause influences a number of effects
– based on P(Y|A,B) = P(Y) * (P(A|Y)/P(A)) * (P(B|Y)/P(B))
Given an RV C
– whose domain is the possible classifications, say {c1, c2, c3}
– it classifies an input example with features F1, …, Fn
Compute:
– P(c1|F1, …, Fn), P(c2|F1, …, Fn), P(c3|F1, …, Fn)
– naïvely assume the features are independent
Choose the value of C that gives the maximum probability.
It works surprisingly well in practice, even when the independence assumptions aren't true.
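A minimal sketch (assumed): a naïve Bayes classifier that picks the class maximizing P(c) * Π P(fi|c); the class names, priors and likelihoods below are made up for illustration.

    priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}     # P(c) for each class
    likelihoods = {                                # P(feature observed | class)
        "c1": {"F1": 0.9, "F2": 0.2},
        "c2": {"F1": 0.4, "F2": 0.7},
        "c3": {"F1": 0.1, "F2": 0.1},
    }

    def classify(observed_features):
        """Return the class maximizing P(c) * prod of P(f|c) for the observed features."""
        scores = {}
        for c, prior in priors.items():
            score = prior
            for f in observed_features:
                score *= likelihoods[c][f]
            scores[c] = score
        # dividing every score by P(F1,...,Fn) wouldn't change the argmax,
        # so the feature priors can be ignored when only the best class is needed
        return max(scores, key=scores.get)

    print(classify(["F1", "F2"]))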
Bayesian Networks

AKA: Bayes Nets, Belief Nets, Causal Nets, etc.
Encode the full joint probability distribution (FJPD) for the set of RVs defining a problem domain.
Use a space-efficient data structure by exploiting:
– the fact that dependencies between RVs are generally local
– which results in lots of conditionally independent RVs
Capture both qualitative and quantitative relationships between RVs.
Bayesian Networks

Can be used to compute any value in the FJPD.
Can be used to reason:
– predictive/causal reasoning: forward (top-down) from causes to effects
– diagnostic reasoning: backward (bottom-up) from effects to causes
Bayesian Network Representation

A Bayesian network is an augmented DAG (i.e. directed, acyclic graph).
Represented by (V, E) where
– V is a set of vertices
– E is a set of directed edges joining vertices, with no loops
Each vertex contains:
– the RV's name
– either a prior probability distribution or a conditional probability distribution table (CDT) that quantifies the effects of the parents on this RV
Each directed arc:
– is from a cause (parent) to its immediate effects (children)
– represents a direct causal relationship between RVs
Bayesian Network Representation

Example: in class
– each row in a conditional probability table must sum to 1
– columns don't need to sum to 1
– values are obtained from experts

The number of probabilities required is typically far fewer than the number required for a FJDT.
The quantitative information is usually given by an expert or determined empirically from data.
Conditional Independence

Assume effects are conditionally independent of each other given their common cause.
The net is constructed so that, given its parents, a node is conditionally independent of its nondescendant RVs in the net:
P(X1=x1, ..., Xn=xn) = P(x1 | parents(X1)) * ... * P(xn | parents(Xn))
Note that the full joint probability distribution isn't needed; we only need conditionals relative to the parent RVs.
Algorithm for Constructing Bayesian Networks

1. Choose a set of relevant random variables
2. Choose an ordering for them
3. Assume they're X1 .. Xm where X1 is first, X2 is second, etc.
4. For i = 1 to m
   a. add a new node for Xi to the network
   b. set Parents(Xi) to be a minimal subset of {X1 .. Xi-1} such that we have conditional independence of Xi and all other members of {X1 .. Xi-1} given Parents(Xi)
   c. add a directed arc from each node in Parents(Xi) to Xi
   d. non-root nodes: define a conditional probability table P(Xi = x | combinations of Parents(Xi));
      root nodes: define a prior probability distribution at Xi: P(Xi)
Algorithm for Constructing Bayesian Networks

For a given set of random variables (RVs) there is not, in general, a unique Bayesian net, but all of them represent the same information.
For the best net, topologically sort the RVs in step 2:
– each RV comes before all of its children
– the first nodes are roots, then the nodes they directly influence
The best Bayesian network for a problem has:
– the fewest probabilities and arcs
– probabilities for the CDT that are easy to determine
The algorithm won't construct a net that violates the rules of probability.
Computing Joint Probabilities using a Bayesian Network

1. Use the product rule
2. Simplify using independence

For example (figure: a four-node network where A and B are roots and both are parents of C, and C is the parent of D):

Compute P(a,b,c,d) = P(d,c,b,a)
order the RVs in the joint probability bottom up: D, C, B, A
= P(d|c,b,a) P(c,b,a)             product rule on P(d,c,b,a)
= P(d|c) P(c,b,a)                 conditional independence of D given C
= P(d|c) P(c|b,a) P(b,a)          product rule on P(c,b,a)
= P(d|c) P(c|b,a) P(b|a) P(a)     product rule on P(b,a)
= P(d|c) P(c|b,a) P(b) P(a)       independence of B and A given no evidence
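To make the final lookup step concrete, a minimal sketch (assumed) with made-up CPT values for this four-node network; P(a,b,c,d) is just the product of each node's CPT entry given its parents:

    p_a, p_b = 0.3, 0.6                                    # priors for roots A, B
    p_c_given = {(True, True): 0.9, (True, False): 0.5,    # P(C=true | A, B)
                 (False, True): 0.4, (False, False): 0.1}
    p_d_given = {True: 0.7, False: 0.2}                    # P(D=true | C)

    def joint(a, b, c, d):
        """P(a,b,c,d) = P(a) P(b) P(c|a,b) P(d|c), one CPT lookup per node."""
        pa = p_a if a else 1 - p_a
        pb = p_b if b else 1 - p_b
        pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
        pd = p_d_given[c] if d else 1 - p_d_given[c]
        return pa * pb * pc * pd

    print(joint(True, True, True, True))   # P(a,b,c,d)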
Computing Joint Probabilities using a Bayesian Network

Any entry in the full joint distribution table (i.e. any atomic event) can be computed!
P(v1,...,vn) = Π P(vi | Parents(Vi))    over i from 1 to n

(figure: a 14-node network with roots A, B, C, F, H; D has parents A,B; E has parents B,C; G has parents D,E; K has parents F,G; L has parents G,H; M and N have parent K; O has parents K,L; P has parent L)

e.g. given boolean RVs, what is P(a,..,h,k,..,p)?
= P(a) P(b) P(c) P(d|a,b) P(e|b,c) P(f) P(g|d,e) P(h)
  * P(k|f,g) P(l|g,h) P(m|k) P(n|k) P(o|k,l) P(p|l)

Note this is fast, i.e. linear in the number of nodes in the net!
Computing Joint Probabilities using a Bayesian Network

How is any joint probability computed? Sum the relevant joint probabilities (using the four-node network with A and B as parents of C, and C as the parent of D):

e.g. compute P(a,b)
= P(a,b,c,d) + P(a,b,c,¬d) + P(a,b,¬c,d) + P(a,b,¬c,¬d)

e.g. compute P(c)
= P(a,b,c,d) + P(a,b,c,¬d) + P(a,¬b,c,d) + P(a,¬b,c,¬d) +
  P(¬a,b,c,d) + P(¬a,b,c,¬d) + P(¬a,¬b,c,d) + P(¬a,¬b,c,¬d)

A BN can answer any query (i.e. probability) about the domain by summing the relevant joint probabilities.
Enumeration can require many computations!
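A minimal sketch (assumed) of this enumeration, reusing the same made-up CPTs for the four-node network as in the earlier sketch:

    from itertools import product

    p_a, p_b = 0.3, 0.6
    p_c_given = {(True, True): 0.9, (True, False): 0.5,    # P(C=true | A, B)
                 (False, True): 0.4, (False, False): 0.1}
    p_d_given = {True: 0.7, False: 0.2}                    # P(D=true | C)

    def joint(a, b, c, d):
        pa = p_a if a else 1 - p_a
        pb = p_b if b else 1 - p_b
        pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
        pd = p_d_given[c] if d else 1 - p_d_given[c]
        return pa * pb * pc * pd

    def prob(holds):
        """Sum joint probabilities over every assignment where the proposition holds."""
        return sum(joint(a, b, c, d)
                   for a, b, c, d in product([True, False], repeat=4)
                   if holds(a, b, c, d))

    print(prob(lambda a, b, c, d: a and b))   # P(a,b): sums over C and D
    print(prob(lambda a, b, c, d: c))         # P(c):   sums over A, B and D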
Computing Conditional Probabilities using a Bayesian Network

The basic task of a probabilistic system is to compute conditional probabilities.
Any conditional probability can be computed:
P(v1,...,vk | vk+1,...,vn) = P(V1=v1,...,Vn=vn) / P(Vk+1=vk+1,...,Vn=vn)
The key problem is that the technique of enumerating joint probabilities can make the computations intractable (exponential in the number of RVs).
Computing Conditional Probabilities using a Bayesian Network

These computations generally rely on the simplifications resulting from the independence of the RVs.
Every variable that isn't an ancestor of a query variable or an evidence variable is irrelevant to the query.
Which ancestors are irrelevant?
Independence in a Bayesian Network

Given a Bayesian network, how is independence established?
(figure: the same 14-node network as before)

1. A node is conditionally independent (CI) of its non-descendants, given its parents.
– e.g. Given D and E, G is CI of ?    A, B, C, F, H
– e.g. Given F and G, K is CI of ?    A, B, C, D, E, H, L, P
Independence in a Bayesian Network

Given a Bayesian network, how is independence established?
(figure: the same 14-node network as before)

2. A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents, which together are called its Markov blanket.
– e.g. What is the Markov blanket for G?    D, E, F, H, K, L
– Given this blanket, G is CI of ?    A, B, C, M, N, O, P
– What about absolute independence?
Computing Conditional Probabilities using a Bayesian Network

The general algorithm for computing conditional probabilities is complicated.
It is easy if the query involves nodes that are directly connected to each other.
(the examples below assume boolean RVs)

Simple causal inference: P(E|C)
– conditional probability distribution of effect E given cause C as evidence
– reasoning in the same direction as the arc, e.g. disease to symptom
Simple diagnostic inference: P(Q|E)
– conditional probability distribution of query Q given effect E as evidence
– reasoning in the direction opposite to the arc, e.g. symptom to disease
Computing Conditional Probabilities: Causal (Top-Down) Inference

Compute P(e|c), the conditional probability of effect E=e given cause C=c as evidence;
assume arcs exist to E from C and from a second cause C2 (figure: C → E ← C2).

1. Rewrite the conditional probability of e in terms of e and all of its parents (that aren't evidence), given the evidence c
2. Re-express each joint probability back to the probability of e given all of its parents
3. Simplify using independence, and look up the required values in the Bayesian network
Computing Conditional Probabilities: Causal (Top-Down) Inference

Compute P(e|c)
1. = P(e,c) / P(c)                                   product rule
   = (P(e,c2,c) + P(e,¬c2,c)) / P(c)                 marginalizing over C2
   = P(e,c2,c) / P(c) + P(e,¬c2,c) / P(c)            algebra
   = P(e,c2|c) + P(e,¬c2|c)                          product rule, e.g. X = e,c2
2. = P(e|c2,c) P(c2|c) + P(e|¬c2,c) P(¬c2|c)         conditionalized chain rule
3. Simplify given that C and C2 are independent:
   P(c2|c) = P(c2) and P(¬c2|c) = P(¬c2)
   = P(e|c2,c) P(c2) + P(e|¬c2,c) P(¬c2)             algebra
   Now look up the values to finish the computation.
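A minimal sketch (assumed) of the final lookup step, with made-up CPT values for the two-cause network above:

    # causal inference: P(e|c) = P(e|c,c2) P(c2) + P(e|c,~c2) P(~c2)
    p_c2 = 0.4                          # prior P(c2); C and C2 are independent
    p_e_given = {(True, True): 0.9,     # P(e |  c,  c2)
                 (True, False): 0.6,    # P(e |  c, ~c2)
                 (False, True): 0.3,    # P(e | ~c,  c2)
                 (False, False): 0.05}  # P(e | ~c, ~c2)

    p_e_given_c = (p_e_given[(True, True)] * p_c2
                   + p_e_given[(True, False)] * (1 - p_c2))
    print(p_e_given_c)   # = 0.9*0.4 + 0.6*0.6 = 0.72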
Computing Conditional Probabilities: Diagnostic (Bottom-Up) Inference

Compute P(c|e), the conditional probability of cause C=c given effect E=e as evidence;
assume an arc exists from C to E.
Idea: convert to causal inference using Bayes' rule.

1. Use Bayes' rule: P(c|e) = P(e|c) P(c) / P(e)
2. Compute P(e|c) using the causal inference method
3. Look up the value of P(c) in the Bayesian net
4. Use normalization to avoid computing P(e)
– this requires computing P(¬c|e)
– using steps 1 – 3 above
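A minimal sketch (assumed) of diagnostic inference via Bayes' rule plus normalization, continuing the made-up numbers from the causal-inference sketch:

    p_c, p_c2 = 0.2, 0.4
    p_e_given = {(True, True): 0.9, (True, False): 0.6,
                 (False, True): 0.3, (False, False): 0.05}   # P(e | C, C2)

    def causal(c):
        """P(e|c) or P(e|~c), summing out C2 (C and C2 are independent)."""
        return p_e_given[(c, True)] * p_c2 + p_e_given[(c, False)] * (1 - p_c2)

    # unnormalized Bayes' rule numerators for c and ~c; P(e) is never computed
    num_c    = causal(True)  * p_c
    num_notc = causal(False) * (1 - p_c)
    p_c_given_e = num_c / (num_c + num_notc)   # normalize: the two must sum to 1
    print(p_c_given_e)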
Summary: the Good News

Bayesian nets are the bread and butter of the AI-uncertainty community (like resolution for AI-logic).
Bayesian nets are a compact representation:
– they don't require exponential storage to hold all of the info in the full joint probability distribution (FJPD) table
– they are a decomposed representation of the FJPD table
– the conditional probability distribution tables in non-root nodes are only exponential in the maximum number of parents of any node
Bayesian nets are fast at computing joint probabilities:
– P(V1, ..., Vk), i.e. the prior probability of V1, ..., Vk
– computing the probability of an atomic event can be done in time linear in the number of nodes in the net
Summary: the Bad News

Conditional probabilities can also be computed: P(Q|E1, ..., Ek), the posterior probability of query Q given multiple pieces of evidence E1, ..., Ek
– this requires enumerating all of the matching entries, which takes time exponential in the number of variables
– in special cases it can be done faster, in polynomial time or less; e.g. a polytree (a net structured like a tree) allows linear-time inference
In general, inference in Bayesian networks (BNs) is NP-hard
– but BNs are well studied, so there exist many efficient exact solution methods as well as a variety of approximation techniques