Day 7 - Department of Computer Science

CS 489 - Machine Learning
- Bayesian networks
Instructor: Renzhi Cao
Computer Science Department
Pacific Lutheran University
Spring 2017
Special appreciation to Ian Goodfellow, Yoshua Bengio, Aaron Courville, Michael Nielsen, Andrew Ng, Katie Malone, Sebastian Thrun,
Ethem Alpaydin, Christopher Bishop, Geoffrey Hinton, Tom Mitchell.
Bayesian networks
• The Naive Bayes assumption of conditional independence is too restrictive
• But learning and inference over the full joint distribution are intractable without some independence assumptions
• A Bayesian network describes conditional independence among subsets of variables
• It allows combining prior knowledge about independence among variables with observed training data
Bayesian networks
Definition: X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the
value of Y given the value of Z; that is, if:
(∀xi,yj,zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
Example: two coins, a regular coin and a fake two-tailed coin (P(h) = 0). Choose a coin at random and toss it twice. Define:
A = first coin toss results in heads
B = second coin toss results in heads
C = the regular coin has been selected
A and B are dependent: observing A tells us the regular coin was selected, which changes the probability of B.
Given C (the regular coin has been selected), A and B are independent:
P(A | B, C) = P(A | C)
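This can be checked by simulation. Below is a minimal Monte Carlo sketch (hypothetical code, not from the lecture) estimating the relevant probabilities; P(A|B) differs from P(A), while P(A|B,C) matches P(A|C):

```python
# Minimal Monte Carlo sketch of the two-coin example (hypothetical code).
import random

random.seed(0)
N = 100_000
trials = []
for _ in range(N):
    regular = random.random() < 0.5        # C: the regular coin is selected
    p_head = 0.5 if regular else 0.0       # the fake coin is two-tailed, P(h) = 0
    a = random.random() < p_head           # A: first toss is heads
    b = random.random() < p_head           # B: second toss is heads
    trials.append((a, b, regular))

def prob(event, given=lambda t: True):
    """Estimate P(event | given) by relative frequency."""
    sel = [t for t in trials if given(t)]
    return sum(event(t) for t in sel) / len(sel)

print("P(A)     =", prob(lambda t: t[0]))                           # ~0.25
print("P(A|B)   =", prob(lambda t: t[0], lambda t: t[1]))           # ~0.50: dependent
print("P(A|C)   =", prob(lambda t: t[0], lambda t: t[2]))           # ~0.50
print("P(A|B,C) =", prob(lambda t: t[0], lambda t: t[1] and t[2]))  # ~0.50: independent given C
```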
Bayesian networks
• A simple, graphical notation for conditional independence
assertions, and it specifies a joint distribution in a
structured form.
• Represent dependence/independence via a directed
graph
– a set of nodes, one per random variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
p(X1, X2, ..., XN) = Π_i p(Xi | Pa(Xi))
where the left-hand side is the full joint distribution, the right-hand side is the graph-structured factorization, and Pa(Xi) denotes the immediate parents of Xi in the graph.
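As a concrete illustration of this factorization, the sketch below (hypothetical network and numbers, not from the lecture) computes the joint probability of one assignment by multiplying one CPT entry per node, for the chain A → B → C:

```python
# Sketch: the factored joint p(X1,...,XN) = prod_i p(Xi | Pa(Xi)).
# Hypothetical network A -> B -> C with binary variables.
parents = {"A": [], "B": ["A"], "C": ["B"]}

# Each CPT maps (tuple of parent values) -> P(Xi = 1 | parents).
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.7},
}

def joint(assignment):
    """P(assignment) as the product of one CPT entry per node."""
    p = 1.0
    for x, pa in parents.items():
        pa_vals = tuple(assignment[v] for v in pa)
        p1 = cpt[x][pa_vals]
        p *= p1 if assignment[x] == 1 else 1.0 - p1
    return p

print(joint({"A": 1, "B": 1, "C": 0}))  # 0.3 * 0.9 * 0.3 = 0.081
```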
Simple practice
• What is P(A,B,C)?
[Graph: A → C ← B]
p(A,B,C) = p(C|A,B) p(A) p(B)
"Explaining away" effect: given C, observing A makes B less likely (e.g., the earthquake/burglary/alarm example). A and B are (marginally) independent, but become dependent once C is known.
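The effect can be verified numerically. The sketch below assumes illustrative probabilities (loosely in the style of the classic alarm example; not values from the lecture) and computes both conditionals by brute-force enumeration:

```python
# Numeric check of "explaining away" with assumed probabilities.
# A = earthquake, B = burglary, C = alarm, with A -> C <- B.
from itertools import product

pA, pB = 0.02, 0.01                    # P(earthquake), P(burglary)
pC = {(0, 0): 0.001, (0, 1): 0.94,     # P(C=1 | A, B)
      (1, 0): 0.29,  (1, 1): 0.95}

def joint(a, b, c):
    p = (pA if a else 1 - pA) * (pB if b else 1 - pB)
    return p * (pC[(a, b)] if c else 1 - pC[(a, b)])

def cond(query, evidence):
    """P(query | evidence) by enumeration over all (a, b, c)."""
    sel = [t for t in product((0, 1), repeat=3) if evidence(t)]
    den = sum(joint(*t) for t in sel)
    num = sum(joint(*t) for t in sel if query(t))
    return num / den

# Observing the alarm raises P(burglary); additionally observing an
# earthquake "explains away" the alarm and lowers P(burglary) again.
print(cond(lambda t: t[1] == 1, lambda t: t[2] == 1))                # P(B=1|C=1)     ~0.58
print(cond(lambda t: t[1] == 1, lambda t: t[2] == 1 and t[0] == 1))  # P(B=1|C=1,A=1) ~0.03
```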
Simple practice
• What is P(A,B,C)?
[Graph: A, B, C with no edges]
p(A,B,C) = p(A) p(B) p(C)
Simple practice
• What is P(A,B,C)?
[Graph: B ← A → C]
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A; e.g., A is a disease, and we model B and C as conditionally independent symptoms given A.
Simple practice
• What is P(A,B,C)?
[Graph: A → B → C]
p(A,B,C) = p(C|B) p(B|A) p(A)
Markov dependence
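A quick enumeration (hypothetical numbers) confirms the Markov property here: once B is known, C no longer depends on A, i.e., P(C | A, B) = P(C | B):

```python
# Conditional independence check for the chain A -> B -> C (assumed numbers).
pA = 0.3
pB_given_A = {0: 0.2, 1: 0.9}   # P(B=1 | A)
pC_given_B = {0: 0.1, 1: 0.7}   # P(C=1 | B)

def joint(a, b, c):
    p = pA if a else 1 - pA
    p *= pB_given_A[a] if b else 1 - pB_given_A[a]
    p *= pC_given_B[b] if c else 1 - pC_given_B[b]
    return p

for b in (0, 1):
    for a in (0, 1):
        num = joint(a, b, 1)
        den = joint(a, b, 0) + joint(a, b, 1)
        print(f"P(C=1 | A={a}, B={b}) = {num/den:.3f}")  # same value for both a
```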
Bayesian networks
Properties of Bayesian network:
• Requires that the graph is acyclic (no directed cycles)
• Two components
– The graph structure (conditional independence
assumptions)
– The numerical probabilities (for each variable given its
parents)
Bayesian networks
What is the relationship between a Bayesian network and Naive Bayes?
• Naive Bayes is a special case of a Bayesian network: the class variable is the single parent of every attribute node.
Bayesian networks
Why do we favor Bayesian networks?
• Representation cost:
– In the previous example we have five binary variables: F, A, S, H, N, so the full joint needs 2^5 − 1 = 31 probability statements. With a BN, we only need 2 + 2 + 8 + 4 + 4 = 20 (see the sketch after this list).
• Efficient learning computation
• Incorporation of domain knowledge
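The arithmetic can be reproduced by counting CPT entries. The structure below is assumed for illustration (F → S ← A, S → H, S → N, all variables binary); as on the slide, both P(X=0|...) and P(X=1|...) are counted for each parent configuration:

```python
# Parameter counting for binary variables (sketch; the F/A/S/H/N structure
# here is an assumption for illustration, not given on this slide).
parents = {"F": [], "A": [], "S": ["F", "A"], "H": ["S"], "N": ["S"]}

# Each node's CPT has 2 * 2^(number of parents) entries.
cpt_sizes = {x: 2 * 2 ** len(pa) for x, pa in parents.items()}
print(cpt_sizes)                # {'F': 2, 'A': 2, 'S': 8, 'H': 4, 'N': 4}
print(sum(cpt_sizes.values()))  # 20, versus 2**5 - 1 = 31 for the full joint
```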
Bayesian network learning
There are several cases:
• Network structure is known or unknown
• Variable values might be fully observed or partly observed
The parameters are the conditional probability tables, like those we calculated for the Naive Bayes algorithm.
What to do?
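In the simplest case (known structure, fully observed data), each CPT can be estimated by relative frequencies, just as in Naive Bayes. A minimal sketch with hypothetical data for a single edge A → B:

```python
# Maximum-likelihood CPT estimation from fully observed data
# (sketch with assumed data; network fragment A -> B).
from collections import Counter

data = [  # fully observed (a, b) samples, hypothetical
    (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0),
]
counts = Counter(data)
for a in (0, 1):
    n_a = counts[(a, 0)] + counts[(a, 1)]
    print(f"P(B=1 | A={a}) = {counts[(a, 1)] / n_a:.3f}")  # 0.250 and 0.750
```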