
Based on J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan, T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; and N. Friedman & D. Koller's NIPS'99 tutorial.
Graphical Models
- Learning -
Structure Learning

Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany

[Background figure: the ALARM monitoring Bayesian network, with nodes such as MINVOLSET, VENTMACH, VENTLUNG, INTUBATION, SHUNT, SAO2, CATECHOL, HR, HREKG, HRSAT, HRBP, and BP]
Learning With Bayesian Networks

• Fixed structure (A → B): parameter estimation
  – Fully observed: easiest problem, reduces to counting
  – Partially observed: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables, unknown arcs (A →? B): structure learning
  – Fully observed: selection of arcs; new domains with no domain expert; data mining
  – Partially observed: encompasses the difficult subproblems; "only" Structural EM is known
• Hidden variables (A →? B, with hidden H): scientific discovery
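For the easiest cell of this table, fixed structure with fully observed data, parameter estimation really is just counting. Below is a minimal sketch (illustrative, not from the slides; the `estimate_cpt` helper and the toy records are made up):

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from complete data,
    obtained purely by counting. `data` is a list of dicts mapping
    variable names to values."""
    joint = Counter((tuple(d[p] for p in parents), d[child]) for d in data)
    marginal = Counter(tuple(d[p] for p in parents) for d in data)
    return {(pa, val): n / marginal[pa] for (pa, val), n in joint.items()}

# Example: estimate P(A | E, B) from four complete records
data = [{"E": "Y", "B": "N", "A": "N"},
        {"E": "Y", "B": "N", "A": "Y"},
        {"E": "N", "B": "N", "A": "Y"},
        {"E": "N", "B": "Y", "A": "Y"}]
cpt = estimate_cpt(data, "A", ("E", "B"))
# cpt[(("Y", "N"), "Y")] == 0.5
```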
Why Struggle for Accurate Structure?

[Figure: true network (Earthquake and Burglary are parents of Alarm Set, which is the parent of Sound), shown next to a variant missing an arc and a variant with an added arc]

Missing an arc:
• Cannot be compensated for by fitting parameters
• Wrong assumptions about domain structure

Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about domain structure
Unknown Structure, (In)complete Data

Complete data:
• Network structure is not specified
  – Learner needs to select arcs & estimate parameters
• Data does not contain missing values

  Data (E, B, A): <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>

  Learning algorithm: select the arcs (e.g. E → A ← B) and fill in the
  CPT P(A | E,B), whose entries are initially unknown ("?"), e.g.:

    E   B    P(a | E,B)   P(¬a | E,B)
    e   b    .9           .1
    e   ¬b   .7           .3
    ¬e  b    .8           .2
    ¬e  ¬b   .99          .01

Incomplete data:
• Network structure is not specified
• Data contains missing values
  – Need to consider assignments to missing values

  Data (E, B, A): <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>
Score-based Learning

Define a scoring function that evaluates how well a structure matches the data:

  Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>

  [Figure: several candidate structures over E, B, A, each assigned a score]

Search for a structure that maximizes the score.
Structure Search as Optimization

Input:
– Training data
– Scoring function
– Set of possible structures

Output:
– A network that maximizes the score
Heuristic Search

• Define a search space:
  – search states are possible structures
  – operators make small changes to structure
• Traverse space looking for high-scoring structures
• Search techniques:
  – Greedy hill-climbing
  – Best-first search
  – Simulated annealing
  – ...

Theorem: Finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.
Typically: Local Search

• Start with a given network
  – empty network, best tree, a random network
• At each iteration
  – Evaluate all possible changes (add, delete, or reverse an arc)
  – Apply change based on score
• Stop when no modification improves score

[Figure: example search over networks on S, C, E, D, showing successive one-arc changes]
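To make this loop concrete, here is a minimal greedy hill-climbing sketch (illustrative only, not the lecturers' implementation). It assumes a user-supplied `score` function where higher is better, e.g. a negated MDL score, and represents a structure as a frozenset of (parent, child) arcs:

```python
from itertools import permutations

def neighbors(dag, nodes):
    """All structures one arc-change away: add, delete, or reverse an arc."""
    for u, v in permutations(nodes, 2):
        if (u, v) in dag:
            yield dag - {(u, v)}               # delete the arc
            yield (dag - {(u, v)}) | {(v, u)}  # reverse the arc
        elif (v, u) not in dag:
            yield dag | {(u, v)}               # add the arc

def is_acyclic(dag, nodes):
    """True if the arc set has no directed cycle (Kahn's algorithm)."""
    indeg = {n: 0 for n in nodes}
    for _, v in dag:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for u, v in dag:
            if u == n:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == len(nodes)

def hill_climb(nodes, score, start=frozenset()):
    """Apply the best one-arc change until no change improves the score."""
    current, current_score = start, score(start)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current, nodes):
            if not is_acyclic(cand, nodes):
                continue
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:   # local maximum (or plateau): stop
            return current
        current, current_score = best, best_score
```

Starting from the empty network corresponds to the first bullet above; a random network or the best tree could be passed via `start` instead.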
Typically: Local Search

If data is complete:
To update the score after a local change, only the families (a node together with its parents) that changed need to be re-scored, by counting.

If data is incomplete:
To update the score after a local change, the parameter estimation algorithm (EM) must be rerun.

[Figure: local one-arc changes to the example network over S, C, E, D]
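The complete-data case works because the score decomposes into per-family terms, so a one-arc change only invalidates the terms of the affected children. A small caching sketch (illustrative; `family_score` is an assumed count-based scoring routine, not defined in the slides):

```python
class CachedScore:
    """Decomposable score with per-family caching: after a local change,
    only families whose parent set changed are re-scored."""

    def __init__(self, data, family_score):
        self.data = data
        self.family_score = family_score   # (node, parents, data) -> float
        self.cache = {}                    # (node, parents) -> cached score

    def score(self, structure):
        """`structure` maps each node to a frozenset of its parents."""
        total = 0.0
        for node, parents in structure.items():
            key = (node, parents)
            if key not in self.cache:      # unchanged families hit the cache
                self.cache[key] = self.family_score(node, parents, self.data)
            total += self.cache[key]
        return total
```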
Local Search in Practice

• Local search can get stuck in:
  – Local maxima:
    • All one-edge changes reduce the score
  – Plateaux:
    • Some one-edge changes leave the score unchanged
• Standard heuristics can escape both
  – Random restarts
  – TABU search
  – Simulated annealing
Local Search in Practice

• Using LL as score, adding arcs always helps
  – Max score attained by fully connected network
  – Overfitting: a bad idea...
• Minimum Description Length:

  MDL(BN | D) = -log P(D | Θ, G) + (log N / 2) · |Θ|

  (the first term is DL(Data | Model), the second is DL(Model))

  – Learning ⇔ data compression
• Other scores: BIC (Bayesian Information Criterion), Bayesian score (BDe)

  Example data with missing values:
  <9.7 0.6 8 14 18>
  <0.2 1.3 5 ?? ??>
  <1.3 2.8 ?? 0 1>
  <?? 5.6 0 10 ??>
  ...
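As an illustration (not from the slides), a minimal sketch of computing this MDL score from complete data. Binary variables and maximum-likelihood parameters are simplifying assumptions; `mdl_score` and its inputs are made-up names:

```python
import math
from collections import Counter

def mdl_score(data, parents):
    """MDL(BN | D) = -log P(D | theta, G) + (log N / 2) * |theta|.
    `data`: list of dicts {variable: value}; `parents`: dict mapping each
    variable to a tuple of its parents. Lower MDL is better, so a structure
    search would minimize this value (or maximize its negation)."""
    N = len(data)
    loglik = 0.0
    n_params = 0
    for var, pa in parents.items():
        joint = Counter((tuple(d[p] for p in pa), d[var]) for d in data)
        pa_counts = Counter(tuple(d[p] for p in pa) for d in data)
        # DL(Data | Model): ML log-likelihood contribution of this family
        for (pa_val, _), n in joint.items():
            loglik += n * math.log(n / pa_counts[pa_val])
        # DL(Model): one free parameter per parent configuration (binary child)
        n_params += len(pa_counts)
    return -loglik + (math.log(N) / 2) * n_params
```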
Local Search in Practice

• Perform EM for each candidate graph

[Figure: candidate graphs G1, G2, G3, G4, ..., Gn; for each, parametric optimization (EM) in parameter space converges to a local maximum]

Computationally expensive:
– Parameter optimization via EM is non-trivial
– Need to perform EM for all candidate structures
– Spend time even on poor candidates
– In practice, only a few candidates are considered
Structural EM
[Friedman et al. 98]

Recall, with complete data we had: decomposition ⇒ efficient search.

Idea:
• Instead of optimizing the real score...
• Find a decomposable alternative score
• Such that maximizing the new score ⇒ improvement in the real score
Structural EM
[Friedman et al. 98]

Outline:
• Perform search in (Structure, Parameters) space
• At each iteration, use current model for finding either:
  – Better scoring parameters: "parametric" EM step, or
  – Better scoring structure: "structural" EM step

Idea:
• Use current model to help evaluate new structures
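A schematic sketch of this loop (illustrative only, not Friedman et al.'s exact algorithm; `expected_counts`, `best_structure`, and `mle_parameters` are assumed user-supplied routines, and for simplicity both steps are taken in every iteration):

```python
def structural_em(model, data, expected_counts, best_structure, mle_parameters,
                  max_iters=100):
    """Sketch of Structural EM: alternate parameter and structure improvement,
    both evaluated against expected counts from the current model.
    `model` is a (structure, parameters) pair."""
    for _ in range(max_iters):
        counts = expected_counts(model, data)        # E-step: complete the data
        structure = best_structure(counts)           # "structural" EM step
        params = mle_parameters(structure, counts)   # "parametric" EM step
        if (structure, params) == model:             # nothing changed: converged
            break
        model = (structure, params)
    return model
```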
Structural EM
[Friedman et al. 98]

Reiterate:

Training data → computation of expected counts for the current model, e.g.
  N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
and, for candidate structures, also e.g.
  N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H)

→ Score & parameterize the candidates

[Figure: current network X1, X2, X3 → H → Y1, Y2, Y3, together with modified candidate structures over the same variables]
Structure Learning: incomplete data

Data (S, X, D, C, B):
  <? 0 1 0 1>
  <1 1 ? 0 1>
  <0 0 0 ? ?>
  <? ? 0 ? 1>
  ...

[Figure: current model, a network over E, B, A]

EM algorithm: iterate until convergence
• Expectation: inference in the current model, e.g. P(S | X=0, D=1, C=0, B=1), yields expected counts:
    S X D C B
    1 0 1 0 1
    1 1 1 0 1
    0 0 0 0 0
    1 0 0 0 1
    ...
• Maximization: parameters
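The expectation step above "completes" the data: each record with missing values is spread over its possible completions, weighted by the current model. A minimal sketch (illustrative; `joint_p` is an assumed callable giving the current model's joint probability, and brute-force enumeration stands in for proper BN inference):

```python
from itertools import product

def expected_counts(records, joint_p, variables):
    """E-step sketch: distribute each incomplete record (missing values = None)
    over all completions, in proportion to the current model's joint
    probability. Returns fractional counts per full assignment."""
    counts = {}
    for rec in records:
        missing = [v for v in variables if rec[v] is None]
        completions = []
        for vals in product([0, 1], repeat=len(missing)):
            full = dict(rec)
            full.update(zip(missing, vals))
            completions.append((full, joint_p(full)))
        z = sum(p for _, p in completions)  # normalizer; assumes z > 0
        for full, p in completions:
            key = tuple(full[v] for v in variables)
            counts[key] = counts.get(key, 0.0) + p / z
    return counts
```

The maximization step then re-estimates the CPTs from these fractional counts exactly as in the complete-data (counting) case.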
Structure Learning: incomplete data

Data (S, X, D, C, B):
  <? 0 1 0 1>
  <1 1 ? 0 1>
  <0 0 0 ? ?>
  <? ? 0 ? 1>
  ...

[Figure: current model and candidate structures over E, B, A]

SEM algorithm: iterate until convergence
• Expectation: inference in the current model, e.g. P(S | X=0, D=1, C=0, B=1), yields expected counts:
    S X D C B
    1 0 1 0 1
    1 1 1 0 1
    0 0 0 0 0
    1 0 0 0 1
    ...
• Maximization: parameters
• Maximization: structure
Structure Learning: Summary

• Expert knowledge + learning from data
• Structure learning involves parameter estimation (e.g. EM)
• Optimization w/ score functions
  – likelihood + complexity penalty = MDL
• Local traversal of the space of possible structures:
  – add, reverse, delete (single) arcs
• Speed-up: Structural EM
  – Score candidates w.r.t. current best model