Statistical Learning
CSE 473
Spring 2004

Today
• Parameter Estimation:
  • Maximum Likelihood (ML)
  • Maximum A Posteriori (MAP)
  • Bayesian
  • Continuous case
• Learning Parameters for a Bayesian Network
  • Naive Bayes
  • Maximum Likelihood estimates
  • Priors
• Learning Structure of Bayesian Networks
Coin Flip
Which coin will I use?
C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3          P(C2) = 1/3          P(C3) = 1/3

Prior: Probability of a hypothesis before we make any observations
Uniform Prior: All hypotheses are equally likely before we make any observations
Experiment 1: Heads
Which coin did I use?
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

P(C1|H) = P(H|C1)P(C1) / P(H),  where  P(H) = Σ_{i=1..3} P(H|Ci)P(Ci)

P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.600

Posterior: Probability of a hypothesis given data

C1: P(H|C1) = 0.1, P(C1) = 1/3    C2: P(H|C2) = 0.5, P(C2) = 1/3    C3: P(H|C3) = 0.9, P(C3) = 1/3
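A minimal sketch of this posterior computation, assuming only the three coins and the uniform prior above (the variable names are my own):

```python
# Posterior over the three coins after observing a single Heads (uniform prior).
likelihood_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}   # P(H | Ci)
prior = {"C1": 1/3, "C2": 1/3, "C3": 1/3}              # P(Ci)

# P(H) = sum_i P(H | Ci) P(Ci)
p_heads = sum(likelihood_heads[c] * prior[c] for c in prior)

# P(Ci | H) = P(H | Ci) P(Ci) / P(H)
posterior = {c: likelihood_heads[c] * prior[c] / p_heads for c in prior}
print(posterior)   # ≈ {'C1': 0.066, 'C2': 0.333, 'C3': 0.6}
```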
Experiment 2: Tails
Which coin did I use?
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

P(C1|HT) = αP(HT|C1)P(C1) = αP(H|C1)P(T|C1)P(C1)

P(C1|HT) = 0.21    P(C2|HT) = 0.58    P(C3|HT) = 0.21

C1: P(H|C1) = 0.1, P(C1) = 1/3    C2: P(H|C2) = 0.5, P(C2) = 1/3    C3: P(H|C3) = 0.9, P(C3) = 1/3
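A sketch of the same update for the HT sequence, using unnormalized products and a single normalization at the end (α is just one over the sum; the variable names are mine):

```python
# Posterior over the coins after observing Heads then Tails:
# P(Ci | HT) = alpha * P(H | Ci) * P(T | Ci) * P(Ci),  with P(T | Ci) = 1 - P(H | Ci)
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
prior = {"C1": 1/3, "C2": 1/3, "C3": 1/3}

unnormalized = {c: p_heads[c] * (1 - p_heads[c]) * prior[c] for c in prior}
alpha = 1 / sum(unnormalized.values())
posterior = {c: alpha * v for c, v in unnormalized.items()}
print(posterior)   # ≈ {'C1': 0.21, 'C2': 0.58, 'C3': 0.21}
```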
Your Estimate?
What is the probability of heads after two experiments?
P(C1|HT) = 0.21    P(C2|HT) = 0.58    P(C3|HT) = 0.21
Most likely coin: C2
Best estimate for P(H): P(H|C2) = 0.5

Maximum Likelihood Estimate: The best hypothesis that fits the observed data, assuming a uniform prior

Using Prior Knowledge
• Should we always use a uniform prior?
• Background knowledge:
  • Heads => you go first in Abalone against the TA
  • TAs are nice people
  • => the TA is more likely to use a coin biased in your favor
C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
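The Maximum Likelihood choice itself is just an argmax over that posterior; a tiny sketch reusing the uniform-prior numbers above:

```python
# ML estimate: the single most likely hypothesis under the uniform prior
# (equivalently, the hypothesis with the highest likelihood of the data).
posterior_uniform = {"C1": 0.21, "C2": 0.58, "C3": 0.21}   # P(Ci | HT) from above
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}

ml_coin = max(posterior_uniform, key=posterior_uniform.get)
print(ml_coin, p_heads[ml_coin])   # C2 0.5  -> the ML estimate of P(H) is 0.5
```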
Using Prior Knowledge
We can encode it in the prior:
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70
(Likelihoods unchanged: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9)

Experiment 1: Heads
Which coin did I use?
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?
P(C1|H) = αP(H|C1)P(C1)
Experiment 1: Heads
Which coin did I use?
P(C1|H) = 0.006    P(C2|H) = 0.165    P(C3|H) = 0.829
(For comparison, the posterior with the uniform prior after Exp 1 was: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600)

Experiment 2: Tails
Which coin did I use?
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?
P(C1|HT) = αP(HT|C1)P(C1) = αP(H|C1)P(T|C1)P(C1)
C1: P(H|C1) = 0.1, P(C1) = 0.05    C2: P(H|C2) = 0.5, P(C2) = 0.25    C3: P(H|C3) = 0.9, P(C3) = 0.70
Experiment 2: Tails
Which coin did I use?
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485
P(C1|HT) = αP(HT|C1)P(C1) = αP(H|C1)P(T|C1)P(C1)
C1: P(H|C1) = 0.1, P(C1) = 0.05    C2: P(H|C2) = 0.5, P(C2) = 0.25    C3: P(H|C3) = 0.9, P(C3) = 0.70
Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: C3
Best estimate for P(H): P(H|C3) = 0.9

Maximum A Posteriori (MAP) Estimate: The best hypothesis that fits the observed data, assuming a non-uniform prior
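A sketch of the MAP computation; it is the same posterior calculation as before, only with the informed prior from these slides:

```python
# MAP estimate: same posterior computation as before, but with the non-uniform prior.
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
prior = {"C1": 0.05, "C2": 0.25, "C3": 0.70}

unnormalized = {c: p_heads[c] * (1 - p_heads[c]) * prior[c] for c in prior}   # data = HT
total = sum(unnormalized.values())
posterior = {c: v / total for c, v in unnormalized.items()}

map_coin = max(posterior, key=posterior.get)
print(posterior)                    # ≈ {'C1': 0.035, 'C2': 0.481, 'C3': 0.485}
print(map_coin, p_heads[map_coin])  # C3 0.9  -> the MAP estimate of P(H) is 0.9
```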
Did We Do The Right Thing?
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485
C2 and C3 are almost equally likely
C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
A Better Estimate: Bayesian Estimate
Bayesian Estimate: Minimizes prediction error, given data and (generally) assuming a non-uniform prior
Recall: P(H) = Σ_{i=1..3} P(H|Ci)P(Ci)
Using the posterior P(Ci|HT) as the weights: P(H) = Σ_{i=1..3} P(H|Ci)P(Ci|HT) = 0.680
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485
C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
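A sketch of the Bayesian estimate: rather than committing to a single coin, weight each coin's P(H) by its posterior:

```python
# Bayesian estimate of P(H): a posterior-weighted average over all hypotheses.
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
posterior = {"C1": 0.035, "C2": 0.481, "C3": 0.485}   # P(Ci | HT) from the slide

p_h = sum(p_heads[c] * posterior[c] for c in posterior)
print(round(p_h, 2))   # ≈ 0.68
```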
Comparison
• ML (Maximum Likelihood):
  P(H) = 0.5     after 10 experiments (HTH^8): P(H) = 0.9
• MAP (Maximum A Posteriori):
  P(H) = 0.9     after 10 experiments (HTH^8): P(H) = 0.9
• Bayesian:
  P(H) = 0.68    after 10 experiments (HTH^8): P(H) = 0.9
Comparison
• ML (Maximum Likelihood):
  • Easy to compute
• MAP (Maximum A Posteriori):
  • Still easy to compute
  • Incorporates prior knowledge
• Bayesian:
  • Minimizes error => great when data is scarce
  • Potentially much harder to compute
Summary For Now
• Prior: Probability of a hypothesis before we see any data
• Uniform Prior: A prior that makes all hypotheses equally likely
• Posterior: Probability of a hypothesis after we have seen some data
• Likelihood: Probability of the data given a hypothesis
Summary For Now
                                  Prior      Hypothesis
Maximum Likelihood Estimate       Uniform    The most likely
Maximum A Posteriori Estimate     Any        The most likely
Bayesian Estimate                 Any        Weighted combination

Continuous Case
• In the previous example, we chose from a discrete set of three coins
• In general, we have to pick from a continuous distribution of biased coins
Continuous Case
[Plot: a prior density over the coin's bias P(H), on the interval 0 to 1]

Continuous Case
Prior
[Plots: a uniform prior vs. a prior encoding background knowledge]

Continuous Case
Exp 1: Heads    Exp 2: Tails
[Plots: the posterior after each experiment, with the uniform prior and with the background-knowledge prior]

Continuous Case
Posterior after 2 experiments:
[Plots: posterior w/ uniform prior and with background knowledge, with the ML Estimate, MAP Estimate, and Bayesian Estimate marked]
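The slides do not name the functional form of the continuous prior; a standard choice is a Beta(a, b) density over the bias θ = P(H), which remains a Beta after each flip. A sketch under that assumption (the prior parameters below are illustrative, not taken from the slides):

```python
# Continuous case with an assumed Beta(a, b) prior over the bias theta = P(H).
# After observing h heads and t tails, the posterior is Beta(a + h, b + t).
def estimates(a, b, h, t):
    """Return (ML, MAP, Bayesian-mean) estimates of P(H)."""
    ml = h / (h + t) if h + t > 0 else 0.5        # ignores the prior
    map_ = (h + a - 1) / (h + t + a + b - 2)      # posterior mode (for a + h, b + t > 1)
    bayes = (h + a) / (h + t + a + b)             # posterior mean
    return ml, map_, bayes

print(estimates(a=1, b=1, h=1, t=1))   # uniform prior (Beta(1,1)), data = HT
print(estimates(a=5, b=2, h=1, t=1))   # an illustrative prior tilted toward heads
```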
After 10 Experiments...
Posterior:
[Plots: posterior w/ uniform prior and with background knowledge, with the ML, MAP, and Bayesian Estimates marked]

After 100 Experiments...
Posterior:
[Plots: posterior w/ uniform prior and with background knowledge, with the ML, MAP, and Bayesian Estimates marked]
Parameter Estimation and Bayesian Networks

Observations of the network variables:
E  B  R  A  J  M
T  F  T  T  F  T
F  F  F  F  F  T
F  T  F  T  T  T
F  F  F  T  T  T
F  T  F  F  F  F
...

We have: the Bayes Net structure and observations
We need: the Bayes Net parameters

P(B) = ?
Prior + data = posterior over the parameter
[Plots: a prior density over P(B) and the posterior after combining it with the data]
Now compute either the MAP or the Bayesian estimate

P(A|E,B) = ?
P(A|E,¬B) = ?
P(A|¬E,B) = ?
P(A|¬E,¬B) = ?
Prior + data = posterior over each parameter
[Plots: prior and posterior densities for the CPT entries]
Now compute either the MAP or the Bayesian estimate
You know the drill...
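A sketch of the Maximum Likelihood answer to these questions: each CPT entry is just a relative frequency in the data. The five rows are the example rows above; the helper function and the 0/1 encoding are my own:

```python
# ML estimates of Bayes-net parameters: relative frequencies in the data.
rows = [  # (E, B, R, A, J, M) from the table above, with T=1 and F=0
    dict(E=1, B=0, R=1, A=1, J=0, M=1),
    dict(E=0, B=0, R=0, A=0, J=0, M=1),
    dict(E=0, B=1, R=0, A=1, J=1, M=1),
    dict(E=0, B=0, R=0, A=1, J=1, M=1),
    dict(E=0, B=1, R=0, A=0, J=0, M=0),
]

def ml(var, data, **parents):
    """ML estimate of P(var = T | parents): matching counts / parent-configuration counts."""
    matching = [r for r in data if all(r[p] == v for p, v in parents.items())]
    if not matching:
        return None   # no data for this parent configuration
    return sum(r[var] for r in matching) / len(matching)

print(ml("B", rows))             # P(B)        = 2/5
print(ml("A", rows, E=0, B=1))   # P(A | ¬E, B)
print(ml("A", rows, E=0, B=0))   # P(A | ¬E, ¬B)
```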
Naive Bayes
[Network: a single root node Spam with children Apple, Bayes, FREE, sender = Krzysztof, ...]
• A Bayes Net where all nodes are children of a single root node
• Why?
  • Expressive and accurate? No
  • Easy to learn? Yes
  • Useful? Sometimes
Inference In Naive Bayes
P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
(and similar CPTs for the remaining features, e.g. sender = Krzysztof)

• Goal: given evidence (the words in an email), decide whether the email is spam
E = {A, ¬B, F, ¬K, ...}

P(S|E) = P(E|S)P(S) / P(E) = P(A,¬B,F,¬K,...|S)P(S) / P(A,¬B,F,¬K,...)

Independence to the rescue! Given S, the features are independent, so the numerator factors:

P(S|E) = P(A|S)P(¬B|S)P(F|S)P(¬K|S)P(...|S)P(S) / P(A,¬B,F,¬K,...)

The denominator is the same for S and ¬S, so we only need:

P(S|E) ∝ P(A|S)P(¬B|S)P(F|S)P(¬K|S)P(...|S)P(S)
P(¬S|E) ∝ P(A|¬S)P(¬B|¬S)P(F|¬S)P(¬K|¬S)P(...|¬S)P(¬S)

Spam if P(S|E) > P(¬S|E)
But...
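A sketch of this decision rule using the CPTs shown on these slides for Apple, Bayes, and FREE; the remaining features and the example email are illustrative:

```python
# Naive Bayes spam decision: compare the two unnormalized posteriors.
p_spam = 0.6
cpt = {                        # (P(word present | S), P(word present | ¬S))
    "apple": (0.01, 0.1),
    "bayes": (0.001, 0.01),
    "free":  (0.8, 0.05),
}

def score(present, spam):
    """Unnormalized P(S|E) or P(¬S|E) for an email described by which words are present."""
    s = p_spam if spam else 1 - p_spam
    for word, (p_s, p_not_s) in cpt.items():
        p = p_s if spam else p_not_s
        s *= p if present[word] else (1 - p)
    return s

email = {"apple": True, "bayes": False, "free": True}   # E = {A, ¬B, F, ...}
print("spam" if score(email, True) > score(email, False) else "not spam")
```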
Parameter Estimation Revisited
[Network: Spam (root, with P(S) = θ) and children Apple, Bayes, FREE, sender = Krzysztof, ...]

• Can we calculate the Maximum Likelihood estimate of θ easily?
• Data: a collection of emails, each labeled spam or not spam
[Plots: a prior over θ, and P(data|hypothesis) as a function of θ with the Max Likelihood estimate marked]

Looking for the maximum of a function:
- find the derivative
- set it to zero

• What function are we maximizing? P(data|hypothesis)
• hypothesis = h_θ (one hypothesis for each value of θ)
• P(data|h_θ) = P(email_1|h_θ)P(email_2|h_θ)P(email_3|h_θ)P(email_4|h_θ)
             = θ(1-θ)(1-θ)θ    (here the four emails are labeled spam, not spam, not spam, spam)
             = θ^#S (1-θ)^#¬S    (#S = number of spam emails, #¬S = number of non-spam emails)

• To find the θ that maximizes θ^#S (1-θ)^#¬S, we take the derivative of the function and set it to 0. And we get:
• P(S) = θ = #S / (#S + #¬S)
• You knew it already, right?
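The differentiation step the slide compresses, written out in log-likelihood form (#S and #¬S are the spam and non-spam counts):

```latex
L(\theta) = \theta^{\#S}(1-\theta)^{\#\neg S}
\qquad\Rightarrow\qquad
\log L(\theta) = \#S \,\log\theta + \#\neg S \,\log(1-\theta)

\frac{d}{d\theta}\log L(\theta) = \frac{\#S}{\theta} - \frac{\#\neg S}{1-\theta} = 0
\;\Rightarrow\; \#S\,(1-\theta) = \#\neg S\,\theta
\;\Rightarrow\; \theta = \frac{\#S}{\#S + \#\neg S}
```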
Problems With Small Samples
• What happens if in your training data apples are not mentioned in any spam message?
• P(A|S) = 0
• Why is it bad? The single zero wipes out the whole product, no matter what the other features say:
  P(S|E) ∝ P(A|S)P(¬B|S)P(F|S)P(¬K|S)P(...|S)P(S) = 0

Smoothing
• Smoothing is used when samples are small
• Add-one smoothing is the simplest smoothing method: just add 1 to every count!
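A sketch of add-one smoothing for a single CPT entry (the counts in the example are made up):

```python
# Add-one (Laplace) smoothing: add 1 to every count so no estimate is exactly zero.
def smoothed(count_word_and_class, count_class, num_values=2):
    """P(word | class) with add-one smoothing over num_values possible feature values."""
    return (count_word_and_class + 1) / (count_class + num_values)

# "apple" never appears in any of 100 spam messages in the training data:
print(smoothed(0, 100))    # 1/102 ≈ 0.0098 instead of 0
```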
Priors!
• Recall that P(S) = #S / (#S + #¬S)
• If we have a slight hunch that P(S) ≈ p:
  P(S) = (#S + p) / (#S + #¬S + 1)
• If we have a big hunch that P(S) ≈ p:
  P(S) = (#S + mp) / (#S + #¬S + m),  where m can be any number > 0
• Note that if m = 10 in the above, it is like saying "I have seen 10 samples that make me believe that P(S) = p"
• Hence, m is referred to as the equivalent sample size
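The same formula as a sketch in code; m and p are the knobs described above, and the counts are made up:

```python
# Prior-informed estimate of P(S): m acts as an equivalent sample size for the hunch p.
def p_spam(num_spam, num_not_spam, p=0.5, m=1):
    return (num_spam + m * p) / (num_spam + num_not_spam + m)

print(p_spam(3, 7))               # slight hunch that P(S) ≈ 0.5 (m = 1)
print(p_spam(3, 7, p=0.5, m=10))  # big hunch: like 10 extra samples at P(S) = 0.5
```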
Priors!
P(S) = (#S + mp) / (#S + #¬S + m)
• Where should p come from?
• No prior knowledge => p = 0.5
• If you build a personalized spam filter, you can use p = P(S) from somebody else's filter!

Inference in Naive Bayes Revisited
• Recall that
  P(S|E) ∝ P(A|S)P(¬B|S)P(F|S)P(¬K|S)P(...|S)P(S)
• Is there any potential for trouble here?
• We are multiplying lots of small numbers together => danger of underflow!
• Solution? Use logs!

Inference in Naive Bayes Revisited
log(P(S|E)) ∝ log(P(A|S)P(¬B|S)P(F|S)P(¬K|S)P(...|S)P(S))
            = log(P(A|S)) + log(P(¬B|S)) + log(P(F|S)) + log(P(¬K|S)) + log(P(...|S)) + log(P(S))
• Now we add "regular" numbers -- little danger of over- or underflow errors!
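A sketch of the log-space comparison, reusing the illustrative CPT values from the earlier sketch:

```python
import math

# Compare log-scores instead of products of many small numbers to avoid underflow.
def log_score(terms, prior):
    """log of the unnormalized posterior: log P(class) + sum of log P(feature | class)."""
    return math.log(prior) + sum(math.log(t) for t in terms)

# P(A|S), P(¬B|S), P(F|S), ... for one email (illustrative values), and the same terms for ¬S
spam_terms     = [0.01, 0.999, 0.8]
not_spam_terms = [0.1, 0.99, 0.05]

is_spam = log_score(spam_terms, 0.6) > log_score(not_spam_terms, 0.4)
print("spam" if is_spam else "not spam")
```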
Learning The Structure of Bayesian Networks
• General idea: look at all possible network structures and pick the one that fits the observed data best
• Impossibly slow: exponential number of networks, and for each we have to learn parameters, too!
• What do we do if searching the space exhaustively is too expensive?

Learning The Structure of Bayesian Networks
• Local search! (a code sketch appears at the end of this section)
• Start with some network structure
• Try to make a change (add, delete, or reverse an edge)
• See if the new network is any better

Learning The Structure of Bayesian Networks
• What network structure should we start with?
• Random with uniform prior?
• Networks that reflect our (or experts') knowledge of the field?

Learning The Structure of Bayesian Networks
[Diagram: a prior network + equivalent sample size, combined with data (rows of observed values for x1, x2, x3, ...), yields improved network(s) over x1 ... x9]

Learning The Structure of Bayesian Networks
• We have just described how to get an ML or MAP estimate of the structure of a Bayes Net
• What would the Bayes estimate look like?
  • Find all possible networks
  • Calculate their posteriors
  • When doing inference: the result is a weighted combination of all networks!
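A hedged sketch of the greedy local-search loop described above; `score`, `neighbors`, and the network representation are placeholders rather than anything specified in the slides:

```python
# Greedy local search over network structures (hill-climbing sketch).
# `score(net, data)` would fit parameters and measure fit (e.g., a penalized likelihood);
# `neighbors(net)` would yield structures differing by one added, deleted, or reversed edge.
def learn_structure(initial_net, data, score, neighbors, max_steps=100):
    current = initial_net
    current_score = score(current, data)
    for _ in range(max_steps):
        best, best_score = None, current_score
        for candidate in neighbors(current):
            s = score(candidate, data)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:              # no neighbor improves the score: local optimum
            return current
        current, current_score = best, best_score
    return current
```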