A-02-Bayes-Nets

Bell & Coins Example
[Figure: DAG with nodes Coin1 and Coin2, each with an edge into Bell]
Consider a bell that rings with high probability only when the
outcomes of two coins are equal. This situation is well
represented by a directed acyclic graph with the probability
distribution:
P(c1, c2, b) = P(c1) P(c2) P(b | c1, c2)   (8 assignments)
             = P(c1) P(c2 | c1) P(b | c1, c2)
Only assumption: the coins are marginally independent, i.e., I(Coin2, {}, Coin1).
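As a small illustration (the numbers are invented: fair coins and a bell that rings with probability 0.9 exactly when the outcomes match), the eight entries of the joint can be tabulated directly from this factorization:

```python
# A minimal sketch of the Bell & Coins network; the CPT values are illustrative.
from itertools import product

P_c1 = {"H": 0.5, "T": 0.5}              # P(Coin1)
P_c2 = {"H": 0.5, "T": 0.5}              # P(Coin2), marginally independent of Coin1

def P_bell(rings, c1, c2):               # P(Bell | Coin1, Coin2)
    p = 0.9 if c1 == c2 else 0.1
    return p if rings else 1 - p

# Joint via P(c1, c2, b) = P(c1) P(c2) P(b | c1, c2): 8 assignments in total.
joint = {(c1, c2, b): P_c1[c1] * P_c2[c2] * P_bell(b, c1, c2)
         for c1, c2, b in product("HT", "HT", (True, False))}

assert len(joint) == 8 and abs(sum(joint.values()) - 1.0) < 1e-9
```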
Bayesian Networks
(Directed Acyclic Graphical Models)
Definition: A Bayesian Network is a pair (P, D) where P is
a probability distribution over variables X1, …, Xn and D is
a directed acyclic graph over vertices X1, …, Xn with the
relationship:
P(x1, …, xn) = ∏_{i=1}^{n} P(xi | pai)
for all values xi of Xi, where pai is the set of parents of vertex Xi in D.
(When Xi is a source vertex, pai is the empty set of parents
and P(xi | pai) reduces to P(xi).)
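A minimal sketch of this factorization in code, assuming the DAG is given as a parent map and each CPT as a callable (the names and representation are illustrative, not from the slides):

```python
# Evaluate P(x1,...,xn) = prod_i P(xi | pa_i) for one full assignment.
def joint_probability(assignment, parents, cpt):
    """assignment: {variable: value}; parents: {variable: tuple of parents};
    cpt[X](x, pa_values) must return P(X = x | parents = pa_values)."""
    p = 1.0
    for X, x in assignment.items():
        pa_values = {U: assignment[U] for U in parents.get(X, ())}
        p *= cpt[X](x, pa_values)   # reduces to P(x) when X has no parents
    return p
```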
Bayesian Networks (BNs)
Examples: modeling diseases, symptoms, and risk factors
One variable for all diseases (values are diseases)
(Draw it)
One variable per disease (values are True/False)
(Draw it)
Naïve Bayesian Networks versus Bipartite BNs
(details)
Adding Risk Factors
(Draw it)
Naïve Bayes for Diseases & Symptoms
[Figure: naïve Bayes DAG with root D and children s1, s2, …, sk]
Values of D are {D1, …, Dm} and represent non-overlapping diseases.
Values of each symptom si represent symptom severity: {0, 1, 2}.
P(d, s1, …, sk) = P(d) ∏_{i=1}^{k} P(si | d)
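A sketch of diagnosis with this model (the two diseases, two symptoms, and all probabilities below are invented for illustration); the posterior is P(d | s1, …, sk) ∝ P(d) ∏_{i} P(si | d):

```python
# Naive Bayes posterior over diseases given observed symptom severities (0/1/2).
P_d = {"D1": 0.7, "D2": 0.3}                                   # prior P(d)
P_s_given_d = [                                                # P(si | d) per symptom
    {"D1": [0.60, 0.30, 0.10], "D2": [0.10, 0.40, 0.50]},
    {"D1": [0.80, 0.15, 0.05], "D2": [0.30, 0.40, 0.30]},
]

def posterior(severities):
    scores = dict(P_d)
    for i, s in enumerate(severities):
        for d in scores:
            scores[d] *= P_s_given_d[i][d][s]
    total = sum(scores.values())
    return {d: v / total for d, v in scores.items()}

print(posterior([2, 1]))   # e.g. observed severities s1 = 2, s2 = 1
```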
Risks, Diseases, Symptoms
[Figure: three-layer DAG; risk factors R1, R2, … point into diseases D1, …, Dn, which point into symptoms s1, …, sk]
Values of each disease Di represent existence/non-existence: {True, False}.
Values of each symptom si represent symptom severity: {0, 1, 2}.
Natural Direction of Information
Consider our example of a hypothesis H = h indicating a disease and
a set of symptoms such as fever, blood pressure, pain, etc., marked by
S1 = s1, S2 = s2, S3 = s3, or in short E = e.
We assume the probabilities P(e | h) are given (or inferred from data)
and compute P(h | e).
The natural order in which to specify a BN usually follows causation or a
time line. Nothing in the definition dictates this choice, but it normally
makes more of the independence assumptions explicit.
Example: the atomic clock + watches naïve Bayes model.
The “Visit-to-Asia” Example
[Figure: the Asia network. Visit to Asia → Tuberculosis; Smoking → Lung Cancer and Bronchitis; Tuberculosis and Lung Cancer → Abnormality in Chest; Abnormality in Chest → X-Ray; Abnormality in Chest and Bronchitis → Dyspnea]
What are the relevant variables and their dependencies?
An abstract example
P(v, s, t, l, b, a, x, d) = P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
Using the order V, S, T, L, B, A, X, D, we have a boundary basis:
• I(S, {}, V)
• I(T, V, S)
• I(L, S, {T, V})
• …
• I(X, A, {V, S, T, L, B, D})
Does I({X, D}, A, V) also hold in P?
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
Independence in Bayesian networks
P(x1, …, xn) = ∏_{i=1}^{n} P(xi | pai)    (*)
P(x1, …, xn) = ∏_{i=1}^{n} P(xi | x1, …, xi-1)
Equating the two factorizations term by term gives the independence assumptions
I(Xi, pai, {X1, …, Xi-1} \ pai).
This set of n independence assumptions is called Basis(D).
Equation (*) uses a specific topological order d of the vertices,
namely one in which, for every Xi, the parents of Xi appear before Xi in
this order.
However, all choices of a topological order are equivalent.
Check, for example, the topological orders of V, S, T, L in the visit-to-Asia example.
Directed and Undirected Chain Example
[Figure: directed chain X1 → X2 → X3 → X4]
P(x1, x2, x3, x4) = P(x1) P(x2 | x1) P(x3 | x2) P(x4 | x3)
Assumptions: I(X3, X2, X1), I(X4, X3, {X1, X2})
[Figure: undirected chain X1 - X2 - X3 - X4]
P(x1, x2, x3, x4) = P(x1, x2) P(x3 | x2) P(x4 | x3)
Markov network: the joint distribution is a product of functions over the three
maximal cliques {X1, X2}, {X2, X3}, {X3, X4}, with normalizing factor K = 1.
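To make the correspondence concrete, here is a sketch (binary variables, invented CPT numbers) where the clique functions are g1(x1, x2) = P(x1) P(x2 | x1), g2(x2, x3) = P(x3 | x2), g3(x3, x4) = P(x4 | x3); their product reproduces the directed-chain joint with K = 1:

```python
# A sketch for a binary chain X1 -> X2 -> X3 -> X4 with illustrative CPTs.
from itertools import product

P_x1 = [0.6, 0.4]                      # P(X1)
P_x2 = [[0.7, 0.3], [0.2, 0.8]]        # P(X2 | X1)
P_x3 = [[0.9, 0.1], [0.4, 0.6]]        # P(X3 | X2)
P_x4 = [[0.5, 0.5], [0.1, 0.9]]        # P(X4 | X3)

g1 = lambda x1, x2: P_x1[x1] * P_x2[x1][x2]   # clique {X1, X2}
g2 = lambda x2, x3: P_x3[x2][x3]              # clique {X2, X3}
g3 = lambda x3, x4: P_x4[x3][x4]              # clique {X3, X4}

for x1, x2, x3, x4 in product([0, 1], repeat=4):
    directed = P_x1[x1] * P_x2[x1][x2] * P_x3[x2][x3] * P_x4[x3][x4]
    undirected = g1(x1, x2) * g2(x2, x3) * g3(x3, x4)   # K = 1
    assert abs(directed - undirected) < 1e-12
```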
Reminder: Markov Networks
(Undirected Graphical Models)
Definition: A Markov Network is a pair (P, G) where P is a
probability distribution over variables X1, …, Xn and G is an
undirected graph over vertices X1, …, Xn with the
relationship:
P(x1, …, xn) = K ∏_{j=1}^{m} gj(Cj)
for all values xi of Xi, where C1, …, Cm are the maximal cliques of G and
the gj are functions that assign a non-negative number to every value
combination of the variables in Cj.
Markov Networks and Bayesian networks represent the
same set of independence assumptions only on Chordal
graphs (such as chains and trees).
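As a sketch of the definition (binary variables and arbitrary non-negative clique functions, all invented for illustration), the constant K is simply one over the sum of the unnormalized clique product:

```python
# A sketch: normalizing a product of non-negative clique functions.
from itertools import product

variables = ["X1", "X2", "X3", "X4"]
cliques = [                                   # maximal cliques of a chain X1-X2-X3-X4
    (("X1", "X2"), lambda a: 2.0 if a["X1"] == a["X2"] else 0.5),
    (("X2", "X3"), lambda a: 1.0 + a["X2"] + a["X3"]),
    (("X3", "X4"), lambda a: 3.0 if a["X3"] != a["X4"] else 1.0),
]

def unnormalized(assignment):
    p = 1.0
    for _, g in cliques:
        p *= g(assignment)
    return p

Z = sum(unnormalized(dict(zip(variables, values)))
        for values in product([0, 1], repeat=len(variables)))
K = 1.0 / Z                                   # P(x1,...,xn) = K * prod_j g_j(C_j)

def P(assignment):
    return K * unnormalized(assignment)

print(P({"X1": 0, "X2": 1, "X3": 1, "X4": 0}))
```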
From Separation in UGs
To d-Separation in DAGs
Paths
• Intuition: dependency must “flow” along paths in the graph
• A path is a sequence of neighboring variables
Examples: X - A - D - B,  A - L - S - B
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
Path blockage
• Every path is classified given the evidence:
  • active: creates a dependency between the end nodes
  • blocked: does not create a dependency between the end nodes
Evidence means the assignment of a value to a subset of nodes.
Path Blockage
Three cases:
• Common cause (L ← S → B): the path is blocked when S is given as evidence, and active otherwise.
[Figure: S with children L and B, shown once with S observed (blocked) and once with S unobserved (active)]
Path Blockage
Three cases:
• Common cause
• Intermediate cause (S → L → A): the path is blocked when L is given as evidence, and active otherwise.
[Figure: chain S → L → A, shown once with L observed (blocked) and once with L unobserved (active)]
Path Blockage
Three cases:
• Common cause
• Intermediate cause
• Common effect (T → A ← L): the path is blocked when neither A nor any of its descendants (e.g., X) is given as evidence, and active otherwise.
[Figure: collider T → A ← L with descendant X, shown blocked when A and X are unobserved, and active when either A or X is observed]
Definition of Path Blockage
Definition: A path is active, given evidence Z, if
• whenever the path contains the configuration T → A ← L (a common effect), either A or one of its descendants is in Z, and
• no other node on the path is in Z.
Definition: A path is blocked, given evidence Z, if it is not active.
Definition: X is d-separated from Y, given Z, if all paths between
a node in X and a node in Y are blocked, given Z. Denoted ID(X, Z, Y).
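A sketch of this definition in code, for small graphs such as the Asia network: enumerate the undirected paths between X and Y and apply the two conditions above to each (the graph is given as a parent map; the function and variable names are ours, not from the slides):

```python
from itertools import product

def is_d_separated(dag, X, Y, Z):
    """dag: {node: set of parents}.  X, Y, Z: sets of nodes.
    Returns True iff every path between X and Y is blocked given Z."""
    parents = {v: set(ps) for v, ps in dag.items()}
    children = {v: set() for v in dag}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    neighbors = {v: parents[v] | children[v] for v in dag}

    def descendants(v):
        seen, stack = set(), [v]
        while stack:
            for c in children[stack.pop()] - seen:
                seen.add(c)
                stack.append(c)
        return seen

    def active(path):
        for i in range(1, len(path) - 1):
            prev, mid, nxt = path[i - 1], path[i], path[i + 1]
            if prev in parents[mid] and nxt in parents[mid]:   # common effect
                if mid not in Z and not (descendants(mid) & Z):
                    return False                               # collider blocks
            elif mid in Z:                                     # chain or common cause
                return False
        return True

    def active_path_exists(src, dst):
        stack = [[src]]                    # DFS over simple paths in the skeleton
        while stack:
            path = stack.pop()
            if path[-1] == dst:
                if active(path):
                    return True
                continue
            for nb in neighbors[path[-1]] - set(path):
                stack.append(path + [nb])
        return False

    return not any(active_path_exists(x, y) for x, y in product(X, Y))
```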
Example
• ID(T, S | {}) = yes
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
Example
• ID(T, S | {}) = yes
• ID(T, S | D) = no
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
Example
• ID(T, S | {}) = yes
• ID(T, S | D) = no
• ID(T, S | {D, L, B}) = yes
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
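These three queries can be reproduced with the d-separation sketch above (the asia parent map below is ours, following the edges of the figure):

```python
asia = {"V": set(), "S": set(), "T": {"V"}, "L": {"S"}, "B": {"S"},
        "A": {"T", "L"}, "X": {"A"}, "D": {"A", "B"}}

print(is_d_separated(asia, {"T"}, {"S"}, set()))            # True:  ID(T, S | {})
print(is_d_separated(asia, {"T"}, {"S"}, {"D"}))            # False: ID(T, S | D)
print(is_d_separated(asia, {"T"}, {"S"}, {"D", "L", "B"}))  # True:  ID(T, S | {D, L, B})
```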
Example
Using the order V, S, T, L, B, A, X, D, we get from the boundary basis:
• ID(S, {}, V)
• ID(T, V, S)
• ID(L, S, {T, V})
• …
• ID(X, A, {V, S, T, L, B, D})
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
Main results on d-separation
The definition of ID(X, Z, Y) is such that:
Soundness [Theorem 9]: ID(X, Z, Y) = yes
implies IP(X, Z, Y) follows from Basis(D).
Completeness [Theorem 10]: ID(X, Z, Y) = no
implies IP(X, Z, Y) does not follow from Basis(D).
Revisiting Example II
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D]
So does IP({X, D}, A, V) hold?
It is enough to check d-separation!
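The question can be answered mechanically with the is_d_separated sketch and the asia map above (observing A activates the collider T → A ← L, which opens the path V → T → A ← L ← S → B → D to D):

```python
# Is {X, D} d-separated from V given A in the Asia network?
print(is_d_separated(asia, {"X", "D"}, {"V"}, {"A"}))
# False: an active path through the collider at A reaches D,
# so IP({X, D}, A, V) is not implied by Basis(D).
```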
Local distributions: asymmetric independence
[Figure: Tuberculosis (Yes/No) and Lung Cancer (Yes/No) both point into Abnormality in Chest (Yes/No), which has the local distribution p(A | T, L)]
Table:
p(A=y | L=n, T=n) = 0.02
p(A=y | L=n, T=y) = 0.60
p(A=y | L=y, T=n) = 0.99
p(A=y | L=y, T=y) = 0.99
Note the asymmetry: once L = y, the probability of A no longer depends on T (0.99 in both rows), whereas for L = n it does; the independence holds only for particular values of the conditioning variables.
Claim 1: Each variable Xi in a Bayesian network is
conditionally independent of all its non-descendants, given
its parents.
Proof: By the definition of d-separation, each variable Xi is
d-separated from all its non-descendants, given its parents.
Namely, ID(Xi, pai, non-descendsi).
By soundness it follows that IP(Xi, pai, non-descendsi) holds as well.
Independence in Bayesian networks
P(x1, …, xn) = ∏_{i=1}^{n} P(xi | x1, …, xi-1) = ∏_{i=1}^{n} P(xi | pai)    (*)
Does this imply, for every topological order d,
P(xd(1), …, xd(n)) = ∏_{i=1}^{n} P(xd(i) | pad(i)) = ∏_{i=1}^{n} P(xd(i) | xd(1), …, xd(i-1)) ?
Claim 2: Each topological order in a BN entails the
same set of independence assumptions.
Proof: By the definition of d-separation, ID(Xi, pai, non-descendsi).
Consequently, for each topological order d on {1, …, n}, it
follows from soundness (Theorem 9) that
IP(Xd(i), pad(i), non-descendsd(i)) holds as well.
(This part just repeats the proof of Claim 1.)
By the decomposition property of conditional independence,
IP(Xd(i), pad(i), S) holds for every S
that is a subset of non-descendsd(i).
In particular, Xd(i) is independent, given its parents, of
S = {all variables before Xd(i) in an arbitrary topological order d}.
Extension of the Markov Chain Property
For Markov chains we assume Basis(D):
IP(Xi, Xi-1, {X1, …, Xi-2})
Due to the soundness of d-separation (Theorem 9) we also get:
IP(Xi, {Xi-1, Xi+1}, {X1, …, Xi-2, Xi+2, …, Xn})
Definition: A Markov blanket of a variable X w.r.t. a probability
distribution P is a set of variables B(X) such that
IP(X, B(X), all-other-variables).
Chain example: B(Xi) = {Xi-1, Xi+1}
Markov Blankets in BNs
Claim: In a Bayesian network, a Markov blanket of X is obtained by taking
X's parents, X's children, and the other parents of X's children.
Proof: Consequence of the soundness of d-separation (Theorem 9).
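A sketch of reading the blanket off the graph, using the same parent-map representation as before (repeated here so the snippet is self-contained):

```python
def markov_blanket(dag, x):
    """dag: {node: set of parents}. Parents, children, and co-parents of x."""
    parents = set(dag[x])
    children = {v for v, ps in dag.items() if x in ps}
    co_parents = {p for c in children for p in dag[c]} - {x}
    return parents | children | co_parents

asia = {"V": set(), "S": set(), "T": {"V"}, "L": {"S"}, "B": {"S"},
        "A": {"T", "L"}, "X": {"A"}, "D": {"A", "B"}}
print(markov_blanket(asia, "A"))   # the blanket {T, L, X, D, B} (set order may vary)
```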