Lecture 1: Introduction to Bayesian Networks

Bayesian Network
CVPR Winter seminar
Jaemin Kim
Outline
» Concepts in Probability
• Probability
• Random variables
• Basic properties (Bayes rule)
• Bayesian Networks
• Inference
• Decision making
• Learning networks from data
• Reasoning over time
• Applications
Probabilities

Probability distribution P(X | ξ)
• X is a random variable
  • Discrete
  • Continuous
• ξ is the background state of information
Discrete Random Variables

Finite set of possible outcomes:

X ∈ {x1, x2, x3, …, xn}

P(xi) ≥ 0,   Σ_{i=1}^{n} P(xi) = 1

X binary: P(x) + P(x̄) = 1

[Figure: bar chart of an example distribution over outcomes X1–X4]
Continuous Random Variables

Probability distribution (density function) over
continuous values
X ∈ [0, 10]

P(x) ≥ 0,   ∫_0^10 P(x) dx = 1

P(5 ≤ x ≤ 7) = ∫_5^7 P(x) dx

[Figure: density P(x) over x, with the area between x = 5 and x = 7 shaded]
More Probabilities
• Joint
  P(x, y) = P(X = x ∧ Y = y)
  • Probability that both X=x and Y=y
• Conditional
  P(x | y) = P(X = x | Y = y)
  • Probability that X=x given we know that Y=y
Rules of Probability
• Product Rule
  P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
• Marginalization
  P(Y) = Σ_{i=1}^{n} P(Y, xi)
  X binary: P(Y) = P(Y, x) + P(Y, x̄)
Bayes Rule
P(H, E) = P(H | E) P(E) = P(E | H) P(H)

P(H | E) = P(E | H) P(H) / P(E)
Graphical Model
Purpose:
• Infer information about a particular variable (its probability distribution) from information about other, correlated variables.
Definition:
• A collection of variables (nodes) with a set of dependencies (edges) between the variables, and a set of probability distribution functions for each variable
• A Bayesian network is a special type of graphical model whose graph is a directed acyclic graph (DAG)
Bayesian Networks
A Graph
− nodes represent the random variables
− directed edges (arrows) between pairs of nodes
− it must be a Directed Acyclic Graph (DAG)
− the graph represents relationships between variables
Conditional probability specifications
− the conditional probability distribution (CPD) of each variable
given its parents
− for a discrete variable: a conditional probability table (CPT)
Bayesian Networks (Belief Networks)
A Graph
− directed edges (arrows) between pairs of nodes
− causality: A “causes” B
− used in the AI and statistics communities
Markov Random Fields (MRF)
A Graph
− undirected edges between pairs of nodes
− a simple definition of independence:
  two sets of nodes A and B are conditionally independent given a third set C
  if every path between a node in A and a node in B passes through C
− used in the physics and vision communities
Bayesian Networks
Bayesian networks
• Basics
• Structured representation
• Conditional independence
• Naïve Bayes model
• Independence facts
Bayesian networks
Smoking node: S ∈ {no, light, heavy}
Cancer node:  C ∈ {none, benign, malignant}

P(S):
  P(S=no)     0.80
  P(S=light)  0.15
  P(S=heavy)  0.05

P(C|S):
  Smoking=   P(C=none)   P(C=benign)   P(C=malig)
  no         0.96        0.03          0.01
  light      0.88        0.08          0.04
  heavy      0.60        0.25          0.15
Product Rule
P(C,S) = P(C|S) P(S)

  S \ C    none    benign   malignant
  no       0.768   0.024    0.008
  light    0.132   0.012    0.006
  heavy    0.035   0.010    0.005

P(C=none ∧ S=no) = P(C=none | S=no) P(S=no) = 0.96 × 0.80 = 0.768
Marginalization
S C none benign malig
total
0.768
0.024 0.008
.80
no
0.132
0.012 0.006
.15
light
0.035
0.010 0.005
.05
heavy
total 0.935 0.046 0.019
P(Smoke)
P(Cancer)
P(S=no) = P(S=no ^ C=no) + P(S=no ^ C=be)
+ P(S=no & C=mal)
P(C=mal) = P(C=mal ^ S=no) + P(C=mal ^ S=light) + P(C=mal | S=heavy)
17
Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C) = P(C, S) / P(C)

P(S|C) as fractions:
  S \ C    none          benign        malignant
  no       0.768/.935    0.024/.046    0.008/.019
  light    0.132/.935    0.012/.046    0.006/.019
  heavy    0.035/.935    0.010/.046    0.005/.019

P(S|C):
  Cancer=     P(S=no)   P(S=light)   P(S=heavy)
  none        0.821     0.141        0.037
  benign      0.522     0.261        0.217
  malignant   0.421     0.316        0.263
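The product-rule, marginalization, and Bayes-rule steps above are easy to check mechanically. Below is a minimal Python sketch (the dictionary layout and variable names are my own, not from the slides) that starts from the joint table P(C, S) and recovers P(Smoke), P(Cancer), and P(S | C=malignant):

```python
# Joint table P(C, S) copied from the slide; keys and layout are illustrative.
joint = {
    ("no",    "none"): 0.768, ("no",    "benign"): 0.024, ("no",    "malignant"): 0.008,
    ("light", "none"): 0.132, ("light", "benign"): 0.012, ("light", "malignant"): 0.006,
    ("heavy", "none"): 0.035, ("heavy", "benign"): 0.010, ("heavy", "malignant"): 0.005,
}
smoking = ("no", "light", "heavy")
cancer = ("none", "benign", "malignant")

# Marginalization: P(S) and P(C)
p_s = {s: sum(joint[(s, c)] for c in cancer) for s in smoking}
p_c = {c: sum(joint[(s, c)] for s in smoking) for c in cancer}

# Conditioning (Bayes rule): P(S | C) = P(S, C) / P(C)
p_s_given_c = {c: {s: joint[(s, c)] / p_c[c] for s in smoking} for c in cancer}

print(p_s)                       # ≈ {'no': 0.80, 'light': 0.15, 'heavy': 0.05}
print(p_c)                       # ≈ {'none': 0.935, 'benign': 0.046, 'malignant': 0.019}
print(p_s_given_c["malignant"])  # ≈ {'no': 0.421, 'light': 0.316, 'heavy': 0.263}
```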
A Bayesian Network
[Figure: the example network. Age → Exposure to Toxics, Age → Smoking, Gender → Smoking, Exposure to Toxics → Cancer, Smoking → Cancer, Cancer → Serum Calcium, Cancer → Lung Tumor]
Problems with Large Instances
• The joint probability distribution, P(A,G,E,S,C,L,SC)
  For seven binary variables there are 2^7 = 128 values in the joint distribution (for 100 variables there are over 10^30 values).
  How are these values to be obtained?
• Inference
  Obtaining posterior distributions once some evidence is available requires summation over an exponential number of terms, e.g. 2^2 terms in the calculation of
  P(s1, f1, x1) = Σ_{b,l} P(b, s1, f1, x1, l)
  which increases to 2^97 if there are 100 variables.
Independence
Age
Gender
Age and Gender are
independent.
P(A,G) = P(G) P(A)
P(A|G) = P(A)    (A ⊥ G)
P(G|A) = P(G)    (G ⊥ A)
P(A,G) = P(G|A) P(A) = P(G) P(A)
P(A,G) = P(A|G) P(G) = P(A) P(G)
Conditional Independence
Age
Gender
Cancer is independent
of Age and Gender
given Smoking.
Smoking
P(C|A,G,S) = P(C|S)     (C ⊥ A,G | S)

Conditioning on (Smoking=heavy) constrains the probability distribution of Age and Gender.
Conditioning on (Smoking=heavy) constrains the probability distribution of Cancer.
Given (Smoking=heavy), Cancer is independent of Age and Gender.
More Conditional Independence:
Naïve Bayes
Serum Calcium and Lung
Tumor are dependent
Cancer
Serum
Calcium
Serum Calcium is
independent of Lung Tumor,
given Cancer
Lung
Tumor
P(L|SC,C) = P(L|C)
More Conditional Independence:
Explaining Away
Exposure
to Toxics
Smoking
Exposure to Toxics and
Smoking are independent
E ⊥ S
Cancer
Exposure to Toxics is
dependent on Smoking, given
Cancer
P(E = heavy | C = malignant) >
P(E = heavy | C = malignant, S=heavy)
More Conditional Independence:
Explaining Away
Exposure
to Toxics
Smoking
Exposure
to Toxics
Smoking
Cancer
Cancer
Exposure to Toxics is
dependent on Smoking,
given Cancer
Moralize the graph.
Put it all together
P(A, G, E, S, C, L, SC) =
  P(A) · P(G) · P(E | A) · P(S | A, G) · P(C | E, S) · P(SC | C) · P(L | C)

[Figure: the Age / Gender / Exposure to Toxics / Smoking / Cancer / Serum Calcium / Lung Tumor network with each factor attached to its node]
General Product (Chain) Rule
for Bayesian Networks
P(X1, X2, …, Xn) = Π_{i=1}^{n} P(Xi | Pa_i)

Pa_i = parents(Xi)
Conditional Independence
A variable (node) is conditionally independent of its nondescendants given its parents.
Age
Gender
Exposure
to Toxics
Smoking
Cancer
Serum
Calcium
Lung
Tumor
Non-Descendants
Parents
Cancer is independent of
Age and Gender given
Exposure to Toxics and
Smoking.
Descendants
Another non-descendant
Age
Gender
Exposure
to Toxics
Smoking
Cancer is independent
of Diet given Exposure
to Toxics and Smoking.
Cancer
Diet
Serum
Calcium
Lung
Tumor
Representing the Joint Distribution
In general, for a network with nodes X1, X2, …, Xn,

P(x1, x2, …, xn) = Π_{i=1}^{n} P(xi | pa(xi))

An enormous saving can be made in the number of values required to specify the joint distribution.
To determine the joint distribution directly for n binary variables, 2^n − 1 values are required.
For a BN with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required.
An Example
P(s1, b2, l1, f1, x1) = ?
P(s1)=0.2
Smoking history
P(b1|s1)=0.25
P(b1|s2)=0.05
P(l1|s1)=0.003
P(l1|s2)=0.00005
Bronchitis
Lung Cancer
Fatigue
P(f1|b1,l1)=0.75
P(f1|b1,l2)=0.10
P(f1|b2,l1)=0.5
P(f1|b2,l2)=0.05
X-ray
P(x1|l1)=0.6
P(x1|l2)=0.02
Solution
Note that our joint distribution over the 5 variables can always be written, by the chain rule, as
P(s, b, l, f, x) = P(s) P(b | s) P(l | b, s) P(f | b, s, l) P(x | b, s, l, f)
The conditional independencies encoded by the graph give
P(l | b, s) = P(l | s),   P(f | b, s, l) = P(f | b, l),   P(x | b, s, l, f) = P(x | l)
Consequently the joint probability distribution can be expressed as
P(s, b, l, f, x) = P(s) P(b | s) P(l | s) P(f | b, l) P(x | l)
For example, the probability that someone has a smoking history and lung cancer but not bronchitis, suffers from fatigue, and tests positive in an X-ray test is
P(s1, b2, l1, f1, x1) = 0.2 × 0.75 × 0.003 × 0.5 × 0.6 = 0.000135
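As a sanity check, the factored joint above can be evaluated directly in a few lines of Python. This is only a sketch: the CPT values are the ones given on the slide, but the dictionary layout and the helper name joint are illustrative.

```python
# Evaluate P(s, b, l, f, x) = P(s) P(b|s) P(l|s) P(f|b,l) P(x|l) for the example network.
p_s1 = 0.2
p_b1_given_s = {"s1": 0.25, "s2": 0.05}
p_l1_given_s = {"s1": 0.003, "s2": 0.00005}
p_f1_given_bl = {("b1", "l1"): 0.75, ("b1", "l2"): 0.10,
                 ("b2", "l1"): 0.50, ("b2", "l2"): 0.05}
p_x1_given_l = {"l1": 0.6, "l2": 0.02}

def joint(s, b, l, f, x):
    """Joint probability for one assignment; index 2 means the negated value."""
    ps = p_s1 if s == "s1" else 1 - p_s1
    pb = p_b1_given_s[s] if b == "b1" else 1 - p_b1_given_s[s]
    pl = p_l1_given_s[s] if l == "l1" else 1 - p_l1_given_s[s]
    pf = p_f1_given_bl[(b, l)] if f == "f1" else 1 - p_f1_given_bl[(b, l)]
    px = p_x1_given_l[l] if x == "x1" else 1 - p_x1_given_l[l]
    return ps * pb * pl * pf * px

print(joint("s1", "b2", "l1", "f1", "x1"))  # ≈ 0.000135
```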
Independence and Graph Separation
• Given a set of observations, is one set of variables
dependent on another set?
• Observing effects can induce dependencies.
• d-separation (Pearl 1988) allows us to check
conditional independence graphically.
Bayesian networks
• Additional structure
  • Nodes as functions
  • Causal independence
  • Context-specific dependencies
  • Continuous variables
  • Hierarchy and model construction
Nodes as functions
• A BN node is a conditional distribution function
  • its parent values are the inputs
  • its output is a distribution over its values

CPT for node X with parents A and B, X ∈ {lo, med, hi}:

          lo    med   hi
  a b     0.1   0.3   0.6
  a b̄     0.4   0.2   0.4
  ā b     0.7   0.1   0.2
  ā b̄     0.5   0.3   0.2

e.g., for parent values (ā, b) the node outputs the distribution lo: 0.7, med: 0.1, hi: 0.2.
Nodes as functions
The node X can be any type of function from Val(A, B) to distributions over Val(X).
[Figure: parents A and B feeding node X, which outputs the distribution lo: 0.7, med: 0.1, hi: 0.2]
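The node-as-function view is literally a lookup from parent values to a distribution. A tiny Python sketch, using the CPT values as reconstructed above (the function name and value encoding are my own):

```python
# Node X as a function: parent values in, distribution over X's values out.
CPT_X = {  # (a, b) -> distribution over X in {lo, med, hi}
    (True,  True):  {"lo": 0.1, "med": 0.3, "hi": 0.6},
    (True,  False): {"lo": 0.4, "med": 0.2, "hi": 0.4},
    (False, True):  {"lo": 0.7, "med": 0.1, "hi": 0.2},
    (False, False): {"lo": 0.5, "med": 0.3, "hi": 0.2},
}

def node_x(a: bool, b: bool) -> dict:
    """Return the distribution over Val(X) given the parent values (a, b)."""
    return CPT_X[(a, b)]

print(node_x(False, True))  # {'lo': 0.7, 'med': 0.1, 'hi': 0.2}
```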
Continuous variables
[Figure: Outdoor Temperature (e.g. 97°) and A/C Setting (e.g. hi) are the parents of Indoor Temperature]
Here the node is a function from Val(A, B) to density functions over Val(X).
[Figure: density P(x) over indoor temperature x]
Gaussian (normal) distributions
P(x) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

N(μ, σ)
[Figure: Gaussian curves with different means and different variances]
Gaussian networks
X ~ N(μ, σ_X²)
Y ~ N(a·x + b, σ_Y²)
[Figure: X → Y]

Each variable is a linear function of its parents, with Gaussian noise.

Joint probability density functions:
[Figure: contour plots of the joint density over (X, Y)]
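To see the linear Gaussian semantics concretely, here is a small sampling sketch (the parameter values for a, b and the standard deviations are illustrative, not from the slides); the check at the end uses the fact that for this model Cov(X, Y) = a · σ_X²:

```python
# Sample from the two-node linear Gaussian network X -> Y described above.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_x = 0.0, 1.0         # X ~ N(mu, sigma_x^2)
a, b, sigma_y = 2.0, 1.0, 0.5  # Y | X=x ~ N(a*x + b, sigma_y^2)

x = rng.normal(mu, sigma_x, size=10_000)
y = rng.normal(a * x + b, sigma_y)   # one Y sample per X sample

# The joint (X, Y) is itself Gaussian; Cov(X, Y) = a * sigma_x^2, Var(Y) = a^2*sigma_x^2 + sigma_y^2
print(np.cov(x, y))  # empirical covariance, close to [[1.0, 2.0], [2.0, 4.25]]
```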
Composing functions
• Recall: a BN node is a function
• We can compose functions to get more
complex functions.
• The result: A hierarchically structured BN.
• Since functions can be called more than once,
we can reuse a BN model fragment in
multiple contexts.
[Figure: hierarchically structured car model built from reusable fragments: Car (Owner with Age and Income, Maintenance, Original-value, Age, Mileage), Brakes (Braking-power), Engine (Engine Power), Tires (RF-Tire, LF-Tire, Pressure, Traction), and Fuel-efficiency]
Bayesian Networks
• Knowledge acquisition
• Variables
• Structure
• Numbers
What is a variable?
• Collectively exhaustive, mutually exclusive values:
  x1 ∨ x2 ∨ x3 ∨ x4
  ¬(xi ∧ xj) for i ≠ j
• Example values: "Error Occurred" vs. "No Error"
Values versus Probabilities
Risk of Smoking
Smoking
Clarity Test: Knowable in Principle
• Weather {Sunny, Cloudy, Rain, Snow}
• Gasoline: Cents per gallon
• Temperature { ≥ 100°F, < 100°F}
• User needs help on Excel Charting {Yes, No}
• User's personality {dominant, submissive}
Structuring
Age
Gender
Exposure
to Toxics
Smoking
Cancer
Lung
Tumor
Network structure corresponding
to “causality” is usually good.
Genetic
Damage
Extending the conversation.
Course Contents
• Concepts in Probability
• Bayesian Networks
» Inference
• Decision making
• Learning networks from data
• Reasoning over time
• Applications
Inference
• Patterns of reasoning
• Basic inference
• Exact inference
• Exploiting structure
• Approximate inference
Predictive Inference
Age
Gender
Exposure
to Toxics
How likely are elderly males
to get malignant cancer?
Smoking
Cancer
Serum
Calcium
P(C=malignant | Age>60, Gender= male)
Lung
Tumor
Combined
Age
Gender
Exposure
to Toxics
Smoking
Cancer
Serum
Calcium
How likely is an elderly
male patient with high
Serum Calcium to have
malignant cancer?
P(C=malignant | Age>60,
Gender= male, Serum Calcium = high)
Lung
Tumor
Explaining away
Age
Gender
Exposure
to Toxics
Smoking
Cancer
Serum
Calcium
Lung
Tumor
• If we see a lung tumor, the
probability of heavy
smoking and of exposure to
toxics both go up.
• If we then observe heavy
smoking, the probability of
exposure to toxics goes
back down.
Inference in Belief Networks
• Find P(Q=q|E= e)
• Q the query variable
• E set of evidence variables
P(q | e) = P(q, e) / P(e)

X1, …, Xn are the network variables except Q, E:
P(q, e) = Σ_{x1,…,xn} P(q, e, x1, …, xn)
Basic Inference
A → B → C

P(b) = Σ_a P(a, b) = Σ_a P(b | a) P(a)
P(c) = Σ_b P(c | b) P(b)

P(c) = Σ_{b,a} P(a, b, c) = Σ_{b,a} P(c | b) P(b | a) P(a)
     = Σ_b P(c | b) [ Σ_a P(b | a) P(a) ]      (the bracketed term is P(b))
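This chain computation is one summation per node once the CPTs are in hand. A minimal Python sketch with illustrative binary CPTs (none of these numbers come from the slides):

```python
# Chain inference A -> B -> C: compute P(B), then P(C).
p_a = {True: 0.3, False: 0.7}                              # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},              # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},              # P(C | B)
               False: {True: 0.05, False: 0.95}}

# P(b) = sum_a P(b | a) P(a)
p_b = {b: sum(p_b_given_a[a][b] * p_a[a] for a in p_a) for b in (True, False)}

# P(c) = sum_b P(c | b) P(b)
p_c = {c: sum(p_c_given_b[b][c] * p_b[b] for b in p_b) for c in (True, False)}

print(p_b)  # ≈ {True: 0.31, False: 0.69}
print(p_c)  # ≈ {True: 0.1895, False: 0.8105}
```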
Inference in trees
[Figure: Y1 and Y2 are the parents of X]

P(x) = Σ_{y1, y2} P(x | y1, y2) P(y1, y2)
because of independence of Y1, Y2:
     = Σ_{y1, y2} P(x | y1, y2) P(y1) P(y2)
Polytrees
• A network is singly connected (a polytree) if
it contains no undirected loops.
Theorem: Inference in a singly connected
network can be done in linear time*.
Main idea: in variable elimination, need only maintain
distributions over single nodes.
* in network size including table sizes.
The problem with loops
[Figure: Cloudy → Rain, Cloudy → Sprinkler; Rain and Sprinkler → Grass-wet]

P(c) = 0.5
P(r | c) = 0.99,   P(r | c̄) = 0.01
P(s | c) = 0.01,   P(s | c̄) = 0.99
Grass-wet is a deterministic OR of Rain and Sprinkler.

The grass is dry only if there is no rain and no sprinklers:
P(ḡ) = P(r̄, s̄) ≈ 0
The problem with loops contd.
P(ḡ) = P(ḡ | r, s) P(r, s) + P(ḡ | r, s̄) P(r, s̄)
     + P(ḡ | r̄, s) P(r̄, s) + P(ḡ | r̄, s̄) P(r̄, s̄)

The first three conditional terms are 0 and the last is 1, so

P(ḡ) = P(r̄, s̄) ≈ 0

But treating Rain and Sprinkler as independent gives
P(ḡ) = P(r̄) P(s̄) ≈ 0.5 · 0.5 = 0.25   (the problem)
Variable elimination
A → B → C

P(c) = Σ_b P(c | b) Σ_a P(b | a) P(a)

P(A) × P(B | A) → P(B, A) → (sum out A) → P(B)
P(B) × P(C | B) → P(C, B) → (sum out B) → P(C)
Inference as variable elimination
• A factor over X is a function from val(X) to
numbers in [0,1]:
• A CPT is a factor
• A joint distribution is also a factor
• BN inference:
• factors are multiplied to give new ones
• variables in factors summed out
• A variable can be summed out as soon as all
factors mentioning it have been multiplied.
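Here is a minimal sketch of the two factor operations described above, multiplication and summing out, using plain dictionaries keyed by assignments (the representation and the example numbers are illustrative, not from the slides):

```python
from itertools import product

def multiply(f1, vars1, f2, vars2):
    """Multiply two factors; returns (factor, variables) over the union of their variables."""
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    out = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        out[assign] = f1[tuple(a[v] for v in vars1)] * f2[tuple(a[v] for v in vars2)]
    return out, out_vars

def sum_out(f, vars_, var):
    """Sum a variable out of a factor."""
    keep = tuple(v for v in vars_ if v != var)
    out = {}
    for assign, value in f.items():
        a = dict(zip(vars_, assign))
        key = tuple(a[v] for v in keep)
        out[key] = out.get(key, 0.0) + value
    return out, keep

# Example: P(B) = sum_A P(B | A) P(A), with illustrative numbers
p_a = ({(True,): 0.3, (False,): 0.7}, ("A",))
p_b_given_a = ({(True, True): 0.8, (True, False): 0.2,
                (False, True): 0.1, (False, False): 0.9}, ("A", "B"))
joint, joint_vars = multiply(p_b_given_a[0], p_b_given_a[1], p_a[0], p_a[1])
p_b, _ = sum_out(joint, joint_vars, "A")
print(p_b)  # ≈ {(True,): 0.31, (False,): 0.69}
```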
Variable Elimination with loops
[Figure: variable elimination for P(L) in the Age / Gender / Exposure to Toxics / Smoking / Cancer network]

P(A) × P(G) × P(S | A,G) → P(A,G,S) → (sum out G) → P(A,S)
P(A,S) × P(E | A)        → P(A,E,S) → (sum out A) → P(E,S)
P(E,S) × P(C | E,S)      → P(E,S,C) → (sum out E,S) → P(C)
P(C) × P(L | C)          → P(C,L)   → (sum out C) → P(L)

Complexity is exponential in the size of the factors.
Inference in BNs and Junction Tree
The main point of BNs is to enable probabilistic inference to be performed.
Inference is the task of computing the probability of each value of a node in a BN when other variables' values are known.
The general idea is to do inference by representing the joint probability distribution on an undirected graph called the junction tree.
The junction tree has the following characteristics:
• it is an undirected tree whose nodes are clusters of variables
• given two clusters, C1 and C2, every node on the path between them contains their intersection C1 ∩ C2
• a separator, S, is associated with each edge and contains the variables in the intersection between the neighbouring clusters

Example: ABC -[BC]- BCD -[CD]- CDE
Inference in BNs
1. Moralize the Bayesian network
2. Triangulate the moralized graph
3. Let the cliques of the triangulated graph be the nodes of a tree, and construct the junction tree
4. Perform belief propagation over the junction tree to do inference
Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG
Consider BN in our example
[Figure: the DAG over S, B, L, F, X and its moral graph]
Moral graph: marry parents and remove arrows.
Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph
An undirected graph is triangulated if every cycle of length greater than 3
possesses a chord
[Figure: the triangulated moral graph over S, B, L, F, X]
Constructing the Junction Tree (3)
Step 3. Identify the Cliques
A clique is a subset of nodes which is complete (i.e. there is an edge
between every pair of nodes) and maximal.
[Figure: the triangulated graph over S, B, L, F, X]
Cliques: {B,S,L}, {B,L,F}, {L,X}
Constructing the Junction Tree (4)
Step 4. Build Junction Tree
The cliques should be ordered (C1, C2, …, Ck) so that they possess the running intersection property: for all 1 < j ≤ k, there is an i < j such that
Cj ∩ (C1 ∪ … ∪ Cj−1) ⊆ Ci.
To build the junction tree, choose one such i for each j and add an edge between Cj and Ci.

Cliques: {B,S,L}, {B,L,F}, {L,X}
Junction tree: BSL -[BL]- BLF -[L]- LX
Potentials Initialization
To initialize the potential functions:
1. set all potentials to unity
2. for each variable, Xi, select one node in the junction tree (i.e. one clique)
containing both that variable and its parents, pa(Xi), in the original DAG
3. multiply the potential by P(xi|pa(xi))
ϕ_BSL = P(b | s) P(l | s) P(s)
ϕ_BLF = P(f | b, l)
ϕ_LX = P(x | l)

[Figure: junction tree BSL -[BL]- BLF -[L]- LX with these potentials attached to the cliques]
Potential Representation
The joint probability distribution can now be represented in terms of
potential functions, ϕ, defined on each clique and each separator of the
junction tree. The joint distribution is given by
P(x) = Π_{c∈C} ϕ_c(x_c) / Π_{s∈S} ϕ_s(x_s)
The idea is to transform one representation of the joint distribution to
another in which for each clique, C, the potential function gives the
marginal distribution for the variables in C, i.e.
ϕ_c(x_c) = P(x_c)
This will also apply for the separators, S.
Triangulation
Given a numbered graph, proceed from node n down to node 1:
• Determine the lower-numbered nodes which are adjacent to the current node, including those which may have been made adjacent to this node earlier in the algorithm.
• Connect these nodes to each other.
Triangulation
Numbering the nodes:
• Arbitrarily number the nodes, or
• Maximum cardinality search:
  • Give any node the number 1
  • For each subsequent number, pick a new unnumbered node that neighbors the most already-numbered nodes
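A minimal sketch of maximum cardinality search follows; the adjacency sets encode the moral graph of the S/B/L/F/X example from the earlier slides, and ties are broken alphabetically only to make the output reproducible:

```python
def max_cardinality_search(adj):
    """Return a node ordering: at each step pick the unnumbered node with the
    most already-numbered neighbors (ties broken alphabetically)."""
    order = []
    unnumbered = set(adj)
    while unnumbered:
        best = max(sorted(unnumbered), key=lambda v: len(adj[v] & set(order)))
        order.append(best)
        unnumbered.remove(best)
    return order

# Moral graph of the example: edges S-B, S-L, B-L, B-F, L-F, L-X
adj = {
    "S": {"B", "L"},
    "B": {"S", "L", "F"},
    "L": {"S", "B", "F", "X"},
    "F": {"B", "L"},
    "X": {"L"},
}
print(max_cardinality_search(adj))  # ['B', 'F', 'L', 'S', 'X'] with this tie-breaking
```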
Triangulation
[Figure: a BN, its moralized graph, and the triangulation produced from an arbitrary numbering of the eight nodes]
Triangulation
[Figure: the same graph numbered by maximum cardinality search and the resulting triangulation]
Course Contents
• Concepts in Probability
• Bayesian Networks
• Inference
» Decision making
• Learning networks from data
• Reasoning over time
• Applications
Decision making
• Decision: an irrevocable allocation of domain resources
• Decisions should be made so as to maximize expected utility.
• View decision making in terms of:
• Beliefs/Uncertainties
• Alternatives/Decisions
• Objectives/Utilities
Course Contents
• Concepts in Probability
• Bayesian Networks
• Inference
• Decision making
» Learning networks from data
• Reasoning over time
• Applications
Learning networks from data
• The learning task
• Parameter learning
• Fully observable
• Partially observable
• Structure learning
• Hidden variables
The learning task
Input: training data, a set of cases each assigning values to B, E, A, C, N
Output: a BN over Burglary, Earthquake, Alarm, Call, Newscast that models the data
• Input: fully or partially observable data cases?
• Output: parameters or also structure?
Parameter learning: one variable
• Unfamiliar coin:
  let q = bias of coin (long-run fraction of heads)
• If q is known (given), then
  P(X = heads | q) = q
• Different coin tosses are independent given q:
  P(X1, …, Xn | q) = q^h (1−q)^t      (h heads, t tails)
Maximum likelihood
• Input: a set of previous coin tosses
  X1, …, Xn = {H, T, H, H, H, T, T, H, …, H} with h heads, t tails
• Goal: estimate q
• The likelihood: P(X1, …, Xn | q) = q^h (1−q)^t
• The maximum likelihood solution is:
  q* = h / (h + t)
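A quick numerical check of q* = h / (h + t) on an illustrative toss sequence (the data below is not from the slides):

```python
# Closed-form ML estimate for a coin, checked against a grid search.
tosses = "HTHHHTTH"           # example data: 5 heads, 3 tails
h = tosses.count("H")
t = tosses.count("T")

def likelihood(q):
    return q**h * (1 - q)**t

q_star = h / (h + t)          # q* = h / (h + t)

grid = [i / 1000 for i in range(1, 1000)]
q_grid = max(grid, key=likelihood)   # numeric maximizer of the likelihood
print(q_star, q_grid)         # 0.625 0.625
```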
Conditioning on data
[Figure: prior P(q); data D (e.g. 1 head, 1 tail); posterior P(q | D)]

P(q | D) ∝ P(q) P(D | q) = P(q) q^h (1−q)^t      (h heads, t tails)
Conditioning on data
Good parameter distribution:
Beta(α_h, α_t) ∝ q^(α_h − 1) (1 − q)^(α_t − 1)
* The Dirichlet distribution generalizes the Beta to non-binary variables.
General parameter learning
• A multi-variable BN is composed of several independent parameters ("coins").
  Example: the network A → B has three parameters: qA, qB|a, qB|ā
• The same techniques as in the one-variable case can be used to learn each one separately.
  The max likelihood estimate of qB|a would be:
  q*B|a = (#data cases with b, a) / (#data cases with a)
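The counting estimate is easy to state in code. A small sketch for the two-node network A → B with illustrative data cases (nothing here comes from the slides):

```python
# Counting-based maximum likelihood estimates for A -> B.
data = [  # each case assigns a value to A and B
    {"A": True, "B": True}, {"A": True, "B": False}, {"A": True, "B": True},
    {"A": False, "B": False}, {"A": False, "B": False}, {"A": False, "B": True},
]

n_a = sum(case["A"] for case in data)
q_a = n_a / len(data)                                  # q*_A

n_ab = sum(case["A"] and case["B"] for case in data)
q_b_given_a = n_ab / n_a                               # q*_B|a = #(b, a) / #(a)

n_nota = len(data) - n_a
n_notab = sum((not case["A"]) and case["B"] for case in data)
q_b_given_nota = n_notab / n_nota                      # q*_B|ā

print(q_a, q_b_given_a, q_b_given_nota)                # 0.5, ≈0.667, ≈0.333
```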
Partially observable data
[Figure: the Burglary / Earthquake / Alarm / Call / Newscast network; training cases with missing values, e.g. (b, ?, a, c, ?)]

• Fill in missing data with "expected" values
  • expected = distribution over possible values
  • use a "best guess" BN to estimate the distribution
Intuition
• In the fully observable case:
  q*_n|e = (#data cases with n, e) / (#data cases with e) = Σ_j I(n, e | d_j) / Σ_j I(e | d_j)
  where I(e | d_j) = 1 if E=e in data case d_j, 0 otherwise
• In the partially observable case, I is unknown.
  The best estimate for I is: Î(n, e | d_j) = P_{q*}(n, e | d_j)
  Problem: q* is unknown.
Expectation Maximization (EM)
Repeat:
• Expectation (E) step
  Use the current parameters q to estimate the filled-in data:
  Î(n, e | d_j) = P_q(n, e | d_j)
• Maximization (M) step
  Use the filled-in data to do maximum likelihood estimation:
  q̃_n|e = Σ_j Î(n, e | d_j) / Σ_j Î(e | d_j)
• Set q := q̃
until convergence.
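A minimal EM sketch for the simplest partially observable case matching the slide: a two-node network E → N (think Earthquake → Newscast) where E is missing in some cases. All data and initial values below are illustrative; the E step fills in the expected indicator for E, and the M step re-estimates the parameters from the weighted counts:

```python
# EM for P(e), P(n|e), P(n|ē) with E sometimes unobserved.
cases = [  # each case: (E, N); None means E is unobserved
    (True, True), (True, True), (False, False), (False, False),
    (None, True), (None, False), (None, True), (False, True),
]

q_e, q_n_e, q_n_note = 0.5, 0.5, 0.5      # initial guesses

for _ in range(50):                        # iterate E and M steps
    # E step: expected indicator that E=True in each case
    w_e = []
    for e, n in cases:
        if e is not None:
            w_e.append(1.0 if e else 0.0)
        else:                              # P(e | n) by Bayes rule with current parameters
            p_n_given = q_n_e if n else 1 - q_n_e
            p_n_given_not = q_n_note if n else 1 - q_n_note
            num = p_n_given * q_e
            w_e.append(num / (num + p_n_given_not * (1 - q_e)))
    # M step: weighted maximum likelihood counts
    n_e = sum(w_e)
    q_e = n_e / len(cases)
    q_n_e = sum(w for (e, n), w in zip(cases, w_e) if n) / n_e
    q_n_note = sum((1 - w) for (e, n), w in zip(cases, w_e) if n) / (len(cases) - n_e)

print(q_e, q_n_e, q_n_note)  # converged estimates of P(e), P(n|e), P(n|ē)
```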
Structure learning
Goal:
find “good” BN structure (relative to data)
Solution:
do heuristic search over space of network
structures.
Search space
Space = network structures
Operators = add/reverse/delete edges
Heuristic search
Use scoring function to do heuristic search (any algorithm).
Greedy hill-climbing with randomness works pretty well.
Scoring
• Fill in parameters using the previous techniques & score the completed networks.
• One possibility for the score:
  likelihood function: Score(B) = P(data | B)

Example: X, Y independent coin tosses
typical data = (27 h-h, 22 h-t, 25 t-h, 26 t-t)
Maximum likelihood network structure: X → Y
The max-likelihood network is typically fully connected.
This is not surprising: maximum likelihood always overfits…
Better scoring functions
• MDL formulation: balance fit to data and model complexity (# of parameters)
  Score(B) = P(data | B) − model complexity
• Full Bayesian formulation
  • prior on network structures & parameters
  • more parameters → higher-dimensional space
  • get the balance effect as a byproduct*
* With a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.
Hidden variables
• There may be interesting variables that we never get to observe:
  • topic of a document in information retrieval;
  • user's current task in an online help system.
• Our learning algorithm should
  • hypothesize the existence of such variables;
  • learn an appropriate state space for them.
[Figure: data over E1, E2, E3: randomly scattered data vs. actual (clustered) data]
Bayesian clustering (Autoclass)
naïve Bayes model: [Figure: hidden Class variable with children E1, E2, …, En]

• The (hypothetical) class variable is never observed
• If we know that there are k classes, just run EM
• Learned classes = clusters
• Bayesian analysis allows us to choose k, trading off fit to data against model complexity

[Figure: resulting cluster distributions over E1, E2, E3]
Detecting hidden variables
• Unexpected correlations ⇒ hidden variables.

[Figure: Hypothesized model: Cholesterolemia with children Test1, Test2, Test3; Data model: the tests are directly correlated; "Correct" model: Cholesterolemia and a hidden Hypothyroid variable are both parents of Test1, Test2, Test3]
Course Contents
• Concepts in Probability
• Bayesian Networks
• Inference
• Decision making
• Learning networks from data
» Reasoning over time
• Applications
Reasoning over time
• Dynamic Bayesian networks
• Hidden Markov models
• Decision-theoretic planning
  • Markov decision problems
  • Structured representation of actions
  • The qualification problem & the frame problem
  • Causality (and the frame problem revisited)
Dynamic environments
State(t) → State(t+1) → State(t+2) → …

• Markov property:
  • the past is independent of the future given the current state;
  • a conditional independence assumption;
  • implied by the fact that there are no arcs t → t+2.
Dynamic Bayesian networks

State described via random variables.
Drunk(t)
Drunk(t+1)
Drunk(t+2)
Velocity(t)
Velocity(t+1)
Velocity(t+2)
...
Position(t)
Position(t+1)
Position(t+2)
Weather(t)
Weather(t+1)
Weather(t+2)
Hidden Markov model
• An HMM is a simple model for a partially observable stochastic domain.

[Figure: State(t) → State(t+1) (state transition model); State(t) → Obs(t) and State(t+1) → Obs(t+1) (observation model)]
Hidden Markov model
Partially observable stochastic environment:
• Mobile robots:
  • states = location
  • observations = sensor input
• Speech recognition:
  • states = phonemes
  • observations = acoustic signal
• Biological sequencing:
  • states = protein structure
  • observations = amino acids

[Figure: example state-transition diagram with probabilities 0.8, 0.15, 0.05]
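Filtering in an HMM, computing P(State(t) | Obs(1..t)), is a short recursion of predict and update steps. A minimal Python sketch with an illustrative two-state weather model (none of these numbers come from the slides):

```python
# Forward filtering for a two-state, two-observation HMM.
states = ("rainy", "sunny")
p_init = {"rainy": 0.5, "sunny": 0.5}                      # P(State(0))
p_trans = {"rainy": {"rainy": 0.7, "sunny": 0.3},          # P(State(t+1) | State(t))
           "sunny": {"rainy": 0.2, "sunny": 0.8}}
p_obs = {"rainy": {"umbrella": 0.9, "no_umbrella": 0.1},   # P(Obs(t) | State(t))
         "sunny": {"umbrella": 0.2, "no_umbrella": 0.8}}

def forward(observations):
    """Return P(State(t) | Obs(1..t)) for the final time step t."""
    belief = dict(p_init)
    for obs in observations:
        # predict: P(State(t) | Obs(1..t-1)) = sum_s P(State(t) | s) * belief(s)
        predicted = {s2: sum(p_trans[s1][s2] * belief[s1] for s1 in states) for s2 in states}
        # update: multiply by the observation likelihood and renormalize
        unnorm = {s: p_obs[s][obs] * predicted[s] for s in states}
        z = sum(unnorm.values())
        belief = {s: v / z for s, v in unnorm.items()}
    return belief

print(forward(["umbrella", "umbrella", "no_umbrella"]))
```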
Acting under uncertainty
Markov Decision Problem (MDP)
[Figure: State(t) → State(t+1) → State(t+2); Action(t) and Action(t+1) influence the transitions (action model); Reward(t) and Reward(t+1) depend on the states; the agent observes the state]

• Overall utility = sum of momentary rewards.
• Allows a rich preference model, e.g. rewards corresponding to "get to goal asap":
  +100 for goal states, −1 for other states
Partially observable MDPs
[Figure: as in the MDP, but the agent observes Obs(t), Obs(t+1), … rather than the state; each Obs depends on the corresponding State]
The optimal action at time t depends on the
entire history of previous observations.
Instead, a distribution over State(t) suffices.
Structured representation
[Figure: two-slice networks for the actions Move and Turn over Position, Direction, and Holding at times t and t+1, with preconditions and effects marked]
Probabilistic action model
• allows for exceptions & qualifications;
• persistence arcs: a solution to the frame problem.
Applications
• Medical expert systems
  • Pathfinder
  • Parenting MSN
• Fault diagnosis
  • Ricoh FIXIT
  • Decision-theoretic troubleshooting
• Vista
• Collaborative filtering