Evaluating Bayesian Networks by Sampling with Simplified
Assumptions
Saaid Baraty*, Dan A. Simovici*

*University of Massachusetts Boston, Computer Science Department, Boston, Massachusetts 02125
e-mail: {sbaraty,dsim}@cs.umb.edu
Abstract. The most common fitness evaluation for Bayesian networks in the presence of data is the Cooper-Herskovits criterion. This technique involves massive amounts of data and, therefore, expensive computations. We propose a cheaper alternative evaluation method using simplified assumptions which produces evaluations that are strongly correlated with the Cooper-Herskovits criterion.
1 Introduction
We investigate the problem of constructing a Bayesian network for a composite phenomenon U = {u1 , u2 , . . . , un } where ui for 1 ≤ i ≤ n are discrete random variables
representing the state assignment of the attributes of U. To accomplish this, we start from
a data multiset D = {t1 , t2 , . . . , tm } where an n-ary tuple ti is an instance of the event U. We
refer to this multiset as the evidence data set (data set for short).
A number of assumptions are necessary for deriving a measure for evaluating the fitness of
a Bayesian network structure for a training data set. Stronger hypotheses make the evaluation
more manageable. On the other hand, a model obtained under weaker assumptions is more capable of conforming to the true underlying distribution of the problem at hand.
Let G = (U, E) be a directed acyclic graph having U as its set of vertices and E as its set
of edges, which captures the direct probabilistic dependencies among these variables. Let Θ be
the collection of parameters which quantifies the joint probability distribution of U as specified
by G. We denote the set of possible assignments of a random variable u_i by Dom(u_i) = {u_i^1, ..., u_i^{r_i}}. The notion of domain can be extended to sets of variables V using the Cartesian product. If the set of parent nodes of u_i is Par_G(u_i), then Dom(Par_G(u_i)) = {U_i^1, ..., U_i^{q_i}}.
The set of non-descendants of ui , ndG (ui ) is the set of all nodes in U excluding ui and all its
descendants. When it is clear from the context we drop the subscript G. The pair B = (G, Θ)
satisfies the local Markov condition if PB (ui |nd(ui )) = PB (ui |Par(ui )) for 1 ≤ i ≤ n, where
P_B is the probability distribution on U specified by B. The model B is a Bayesian network if it satisfies the local Markov condition. By the chain rule we have
$$P_B(u_1, u_2, \dots, u_n) = \prod_{i=1}^{n} P_B(u_i \mid Par(u_i)).$$
Therefore, if we let θ_{ijk} = P(u_i = u_i^k | Par(u_i) = U_i^j) and θ_{ij·} = (θ_{ij1}, ..., θ_{ijr_i}) for 1 ≤ i ≤ n, 1 ≤ j ≤ q_i and 1 ≤ k ≤ r_i, then the joint probability distribution on U is specified by Θ = {θ_{ij·} | 1 ≤ i ≤ n and 1 ≤ j ≤ q_i}.
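As an illustration of the chain-rule factorization, the sketch below evaluates the joint probability of a full assignment for a toy three-node network; the node names, structure and CPT values are made-up assumptions, not taken from the paper.

```python
# toy structure: rain -> wet <- sprinkler (names and CPT values are made up)
parents = {"rain": (), "sprinkler": (), "wet": ("rain", "sprinkler")}

# theta[node][parent_state] is the distribution theta_{ij.} over Dom(node)
theta = {
    "rain":      {(): {0: 0.8, 1: 0.2}},
    "sprinkler": {(): {0: 0.7, 1: 0.3}},
    "wet":       {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.10, 1: 0.90},
                  (1, 0): {0: 0.20, 1: 0.80}, (1, 1): {0: 0.01, 1: 0.99}},
}

def joint_probability(assignment):
    """Chain-rule product prod_i P_B(u_i | Par(u_i)) for a full assignment."""
    p = 1.0
    for node, pars in parents.items():
        p *= theta[node][tuple(assignment[q] for q in pars)][assignment[node]]
    return p

# the joint distribution sums to 1 over Dom(U)
total = sum(joint_probability({"rain": r, "sprinkler": s, "wet": w})
            for r in (0, 1) for s in (0, 1) for w in (0, 1))
```

The nested-dictionary layout mirrors the parameters θ_{ij·}: one distribution per node per joint parent state.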
2 A Posterior-based Score with a Reduced Assumption Set
Cooper and Herskovits introduced the probability P(G|D) as a measure for assessing the fitness of G as a probabilistic model of D. Since P(D) is constant across different networks,
we can work with P (G, D). Let ΩG be the space of all probability distributions Θ for the
structure G. Then,
$$P(G, D) = \int_{\Omega_G(\Theta)} P(D \mid \Theta, G)\, f(\Theta \mid G)\, P(G)\, d\Theta. \tag{1}$$
Recall that Θ is a collection of distributions θ_{ij·} = (θ_{ij1}, ..., θ_{ij(r_i−1)}, 1 − Σ_{k=1}^{r_i−1} θ_{ijk}) for all i and j. The vectors θ_{ij·} for any (i, j) ∈ [1..n] × [1..q_i] must satisfy Σ_{k=1}^{r_i−1} θ_{ijk} ≤ 1 and θ_{ijk} ≥ 0 for all k. Also, Θ itself, the collection of these random vector variables, can be
treated as a random variable. P (D|Θ, G) is the conditional probability function of data given
(G, Θ), f (Θ|G) is the conditional density function of Θ given structure G, and P (G) is the
prior probability function of structure G. To evaluate this integral a number of assumptions
were introduced by Cooper and Herskovits (1993). The data independence assumption asserts that the tuples of D are independent given the network structure. The local and global independence (LGI) assumption requires that θ_{ij·} is conditionally independent of θ_{i′j′·} for all (i, j) ≠ (i′, j′) given
the structure. Based on the LGI assumption, Ω_G(Θ), the space of possible collections Θ, can be written as
$$\Omega_G(\Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \Big\{ (\theta_{ij1}, \dots, \theta_{ij(r_i-1)}) \in \mathbb{R}^{r_i-1} \;\Big|\; \sum_{k=1}^{r_i-1} \theta_{ijk} \le 1 \text{ and } \theta_{ij1}, \dots, \theta_{ij(r_i-1)} \ge 0 \Big\},$$
and we have f(Θ|G) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} g(θ_{ij·}|G) due to the LGI assumption. Cooper and Herskovits (1993) replace f by the above product in Equality (1). Also, they assume the
distribution g(θij· |G) for each i and j is uniform. We refer to this assumption as second order
uniform probability (SOUP ). Heckerman et al. (1995) introduce the BDe metric which is
a posterior-based measure similar to CH metric. They use the LGI assumption and three
other assumptions: the second order Dirichlet probability (SODP ) (suggested but not used
in Cooper and Herskovits (1993)), the parameter modularity and the multinomial sample (
MS) assumption. SODP is a generalization of the SOUP assumption which states that P(θ_{ij·}|G)
follows a Dirichlet distribution for all i and j. The multinomial sample assumption asserts
that if we define the ordered set D_l = {t_1, ..., t_{l−1}}, then
$$P\big(t_l[u_i] = u_i^k \,\big|\, t_l[u_1, \dots, u_{i-1}] = (u_1^{v_1}, \dots, u_{i-1}^{v_{i-1}}),\, D_l,\, (G, \Theta)\big) = \theta_{ijk},$$
where t[V] denotes the restriction of tuple t ∈ D to V ⊆ U, the state assignment Par_G(u_i) = U_i^j is consistent with t_l[u_1, ..., u_{i−1}] = (u_1^{v_1}, ..., u_{i−1}^{v_{i−1}}), and θ_{ijk} ∈ Θ.
Later, the SODP assumption was replaced with two other assumptions, likelihood equivalence and structure possibility, which imply the SODP assumption. Note that every probability function g(θ_{ij·}|G) follows a Dirichlet distribution which requires r_i parameters. Thus, for each BNS G we need to specify Σ_{i=1}^{n} q_i r_i parameters, and this makes the approach impractical. To overcome this difficulty, Heckerman et al. (1995) encoded the prior knowledge into a single Bayesian network referred to as a prior network, B_pr = (G_pr, Θ_pr). Then, they set the Dirichlet parameter corresponding to the probability distribution component θ_{ijk} to α_{ijk} = N′ · P_{B_pr}(u_i = u_i^k, Par_{G_pr}(u_i) = U_i^j), where N′ is a user-given parameter which they refer to as the equivalent sample size. The choice of the value of N′ and of the collection Θ_pr without observing data is arbitrary. We use sampling, which enables us to let the data shape the distribution of the posterior probability on the vectors θ_{ij·}.
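A minimal sketch of how the BDe prior parameters α_ijk could be derived from a prior network, assuming a single binary node with one binary parent; the joint probabilities and the value of N′ below are made-up numbers, not from the paper.

```python
# equivalent sample size N' (user-chosen) and prior-network joint
# probabilities: made-up illustrative values
N_prime = 10.0

# P_Bpr(u_i = k, Par_Gpr(u_i) = j) for a binary node with one binary parent
joint_prior = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# alpha_ijk = N' * P_Bpr(u_i = u_i^k, Par_Gpr(u_i) = U_i^j)
alpha = {jk: N_prime * p for jk, p in joint_prior.items()}
```

Since the joint probabilities sum to 1 over all (j, k), the α_ijk sum to N′, which is exactly how the equivalent sample size controls the total weight of the prior.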
In the evaluation of the prior P(G), Cooper and Herskovits (1993) assumed a uniform prior distribution. This and other assumptions are based on parameters that need to be arbitrarily
specified. Sampling enables us to use data as a substitute for strong assumptions or domain
knowledge in determining the parameters of the second order probability distribution and the
prior probability P (G).
Let S1 and S2 be two disjoint samples from D. We evaluate P (G|S1 , S2 ) as a measure of
fitness of BN structure G. Since P (S1 , S2 ) does not depend on the specific BNS we can drop
it and instead compute P(G, S1, S2). Note that by the chain rule, P(G, S1, S2) = P(S1|G, S2) ·
P (G|S2 ) · P (S2 ). If we sample consistently across different structures, then P (S2 ) is constant
and can be dropped. Therefore, we adopt P (S1 |G, S2 ) · P (G|S2 ) as a relative measure of
fitness of structures for a data set D. If we repeat the process of sampling, we can extend our measure to
$$\left( \prod_{q=1}^{k} P(S_{2q-1} \mid G, S_{2q}) \cdot P(G \mid S_{2q}) \right)^{\frac{1}{k}},$$
where S_1, S_2, ..., S_{2k} are samples from D such that S_{2q−1} ∩ S_{2q} = ∅ for each q. We refer to this measure as the k-sample validation of structure G for data set D and denote it by SAMP_k(G, D).
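The repeated drawing of disjoint sample pairs can be sketched as follows; the pair size, seed and toy data are illustrative assumptions. Disjointness is enforced at the index level, since D is a multiset and distinct positions may hold equal tuples.

```python
import random

def sample_pairs(D, k, size, seed=0):
    """Draw k pairs (S_{2q-1}, S_{2q}) of equal-size, index-disjoint samples of D."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        idx = rng.sample(range(len(D)), 2 * size)     # 2*size distinct positions
        pairs.append(([D[i] for i in idx[:size]],     # S_{2q-1}
                      [D[i] for i in idx[size:]]))    # S_{2q}, disjoint from S_{2q-1}
    return pairs

# a toy data multiset of 100 two-attribute tuples
D = [(r, w) for r in (0, 1) for w in (0, 1) for _ in range(25)]
pairs = sample_pairs(D, k=3, size=20)
```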
Let S = {t_1, ..., t_a} and S′ be two disjoint samples of D. The first term of SAMP_k(G, D) can be written as
$$P(S \mid G, S') = \int_{\Omega_G(\Theta)} P(S \mid \Theta, G, S')\, f(\Theta \mid G, S')\, d\Theta. \tag{2}$$
Let d = (u_1, ..., u_n) be a topological order of the nodes of G which represents expert prior knowledge of the domain. Denote by n_S(t) the number of occurrences of tuple t in S, and let γ_{ijk}(S) = |{t ∈ S | t[{u_i}] = u_i^k ∧ t[Par(u_i)] = U_i^j}| and γ_{ij·}(S) = Σ_{k=1}^{r_i} γ_{ijk}(S). Since the attributes of D are discrete, we have
$$P(S \mid B) = \prod_{l=1}^{a} P(t_l \mid S_l, \Theta, G) = \prod_{l=1}^{a} \prod_{i=1}^{n} P(u_i = t_l[u_i] \mid U_i = t_l[U_i], S_l, \Theta, G) = \prod_{l=1}^{a} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{r=1}^{r_i} \theta_{ijr}^{\lambda_{lijr}},$$
where the first equality is by the chain rule and S_l = (t_1, ..., t_{l−1}), the second equality follows from the MS assumption with U_i = (u_1, ..., u_{i−1}), and λ_{lijr} = 1 if t_l[u_i] = u_i^r ∈ Dom(u_i) and t_l[Par_G(u_i)] = U_i^j ∈ Dom(Par_G(u_i)), and λ_{lijr} = 0 otherwise. Since Σ_{l=1}^{a} λ_{lijr} = γ_{ijr}(S), we have
$$P(S \mid \Theta, G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{r=1}^{r_i} \theta_{ijr}^{\gamma_{ijr}(S)}. \tag{3}$$
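The counts γ_ijk(S) that drive Equality (3) can be computed in a single pass over the sample; representing tuples as dictionaries and the two-node structure below are illustrative assumptions.

```python
from collections import Counter

def gamma_counts(sample, child, parents):
    """Map (parent_state, child_value) -> gamma_ijk(S)."""
    return Counter((tuple(t[q] for q in parents), t[child]) for t in sample)

# tuples represented as dicts; the structure rain -> wet is made up
S = [{"rain": 1, "wet": 1}, {"rain": 1, "wet": 1},
     {"rain": 0, "wet": 0}, {"rain": 0, "wet": 1}]
g = gamma_counts(S, "wet", ("rain",))   # e.g. g[((1,), 1)] counts rain=1, wet=1
```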
Then,
$$P(S \mid \Theta, G, S') = \frac{P(S \cup S' \mid \Theta, G)}{P(S' \mid \Theta, G)} = \frac{\prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{r=1}^{r_i} \theta_{ijr}^{\gamma_{ijr}(S \cup S')}}{\prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{r=1}^{r_i} \theta_{ijr}^{\gamma_{ijr}(S')}} = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{r=1}^{r_i} \theta_{ijr}^{\gamma_{ijr}(S)}. \tag{4}$$
For the second term of the right-hand side of Equality (2) we have
$$f(\Theta \mid S', G) = \frac{P(S' \mid \Theta, G)\, f(\Theta \mid G)}{\int_{\Omega_G(\Theta)} P(S' \mid \Theta, G)\, f(\Theta \mid G)\, d\Theta}. \tag{5}$$
We assume the SOUP hypothesis and set each g(θ_{ij·}|G) = (r_i − 1)!. The posterior probability of Θ is conditioned on G in the presence of sample S′, as shown in Equality (2). This approach differs from the one used in Cooper and Herskovits (1993), where the SOUP hypothesis is applied directly without the intervention of sample data. Then, we have
$$\int_{\Omega_G(\Theta)} P(S' \mid \Theta, G)\, f(\Theta \mid G)\, d\Theta = \prod_{i=1}^{n} \prod_{j=1}^{q_i} (r_i - 1)! \cdot \frac{\prod_{r=1}^{r_i} \gamma_{ijr}(S')!}{(\gamma_{ij\cdot}(S') + r_i - 1)!},$$
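As a numeric sanity check of this closed form, the sketch below evaluates one (i, j) factor both with exact factorials and with log-Gamma (the counts are made-up); the two routes agree.

```python
from math import factorial, lgamma, exp

def prod_fact(gammas):
    p = 1
    for g in gammas:
        p *= factorial(g)
    return p

def factor_factorial(gammas):
    """(r_i - 1)! * prod_r gamma_r! / (gamma_dot + r_i - 1)! via exact integers."""
    r = len(gammas)
    return factorial(r - 1) * prod_fact(gammas) / factorial(sum(gammas) + r - 1)

def factor_lgamma(gammas):
    """The same factor computed with log-Gamma, safe for large counts."""
    r = len(gammas)
    val = (lgamma(r) + sum(lgamma(g + 1) for g in gammas)
           - lgamma(sum(gammas) + r))
    return exp(val)

value = factor_factorial((3, 1, 2))  # counts gamma_ij1(S'), gamma_ij2(S'), gamma_ij3(S')
```

For the counts (3, 1, 2) the factor is 2! · 3!·1!·2! / 8! = 24/40320 = 1/1680.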
from Equality (3), the SOUP and LGI assumptions, and a result from Jeffreys and Jeffreys (1988) (see pages 468–470 of that reference). Thus, from the previous equalities and from (3) and (5) we have
$$f(\Theta \mid S', G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \Gamma(\gamma_{ij\cdot}(S') + r_i) \prod_{r=1}^{r_i} \frac{\theta_{ijr}^{\gamma_{ijr}(S')}}{\Gamma(\gamma_{ijr}(S') + 1)}, \tag{6}$$
where Γ is Euler's Gamma function. Combining Equalities (2), (4) and (6) we obtain
$$P(S \mid G, S') = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\gamma_{ij\cdot}(S') + r_i)}{\Gamma(\gamma_{ij\cdot}(S \cup S') + r_i)} \cdot \prod_{r=1}^{r_i} \frac{\Gamma(\gamma_{ijr}(S \cup S') + 1)}{\Gamma(\gamma_{ijr}(S') + 1)}.$$
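This closed form is best evaluated in log space to avoid overflow on large counts. The sketch below handles a single (i, j) pair with made-up counts; since S and S′ are disjoint, γ_ijr(S ∪ S′) = γ_ijr(S) + γ_ijr(S′).

```python
from math import lgamma

def log_p_factor(gamma_s, gamma_sp):
    """One (i, j) factor of log P(S | G, S'); gamma_s, gamma_sp are per-value counts."""
    r_i = len(gamma_s)
    union = [a + b for a, b in zip(gamma_s, gamma_sp)]  # gamma_ijr(S u S')
    out = lgamma(sum(gamma_sp) + r_i) - lgamma(sum(union) + r_i)
    for u, sp in zip(union, gamma_sp):
        out += lgamma(u + 1) - lgamma(sp + 1)
    return out

val = log_p_factor(gamma_s=[5, 2, 3], gamma_sp=[4, 4, 2])  # made-up counts
```

When S is empty, γ_ijr(S ∪ S′) = γ_ijr(S′) and the factor reduces to 1 (log 0), as expected for P(∅ | G, S′).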
To approximate the quantity P(G|S) we use a slight variation of a measure called the distribution distortion, introduced in Baraty and Simovici (2009). Here we want to evaluate the conditional independence captured by the local Markov condition according to data, that is, we want to assess to what degree the condition f_S(u_i|nd(u_i)) = f_S(u_i|Par(u_i)) holds for 1 ≤ i ≤ n, where f_S is the frequency function relative to sample S ⊆ D. To achieve this, we
measure the divergence of the set of probability distributions fS (ui |nd(ui ) = U ) from the set
of probability distributions fS (ui |Par(ui ) = U [Par(ui )]) for all i and U ∈ Dom(nd(ui )).
Definition 2.1 The local Markov divergence of the fork structure at node u_i of G according to sample S, denoted by LMD^G_S(u_i), is the number
$$\sum_{U} f_S(nd(u_i) = U) \cdot KL\big[f_S(u_i \mid nd(u_i) = U),\, f_S(u_i \mid Par(u_i) = U[Par(u_i)])\big],$$
where the sum extends over all U ∈ Dom(nd(u_i)). Here KL[p, q] is the Kullback-Leibler divergence between the probability distributions p = (p_1, ..., p_n) and q = (q_1, ..., q_n).
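The Kullback-Leibler divergence used in Definition 2.1 can be sketched as follows; the distributions are made-up, and by the usual convention terms with p_i = 0 contribute 0.

```python
from math import log

def kl_divergence(p, q):
    """KL[p, q] = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

d = kl_divergence((0.5, 0.5), (0.9, 0.1))  # made-up distributions
```

KL[p, q] is nonnegative and equals 0 exactly when p = q, which is what makes it a usable divergence for the local Markov condition.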
Let H_S(π^u) be the Shannon entropy of the set S partitioned according to the values of u, and let H_S(π^u | π^W) be the conditional Shannon entropy of the set S partitioned according to the values of u, conditioned by the partition of S according to the assignments of the set of attributes W (see Baraty and Simovici (2009)).
Theorem 2.2 For 1 ≤ i ≤ n we have LMD^G_S(u_i) = H_S(π^{u_i} | π^{Par(u_i)}) − H_S(π^{u_i} | π^{nd(u_i)}).

Theorem 2.3 LMD^G_S(u_i) = 0 if and only if f_S(u_i|nd(u_i)) = f_S(u_i|Par(u_i)).
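Theorem 2.2 can be checked numerically on a tiny made-up sample in which c = a XOR b, with Par(c) = {a} and nd(c) = {a, b}: the parent alone carries no information about c, so the divergence attains its upper bound H_S(π^c) = 1 bit.

```python
from collections import Counter
from math import log2

def cond_entropy(sample, target, given):
    """H_S(pi^target | pi^given) over a list of dict-valued tuples."""
    joint = Counter((tuple(t[g] for g in given), t[target]) for t in sample)
    marg = Counter(tuple(t[g] for g in given) for t in sample)
    n = len(sample)
    return -sum(c / n * log2(c / marg[ctx]) for (ctx, _), c in joint.items())

# c = a XOR b: Par(c) = {a} predicts nothing, nd(c) = {a, b} predicts c exactly
S = [{"a": 0, "b": 0, "c": 0}, {"a": 0, "b": 1, "c": 1},
     {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 0}]
lmd = cond_entropy(S, "c", ("a",)) - cond_entropy(S, "c", ("a", "b"))
```

Here H_S(π^c | π^a) = 1 bit and H_S(π^c | π^{a,b}) = 0, so LMD = 1, matching the extreme case discussed below.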
Theorem 2.2 implies that 0 ≤ LMD^G_S(u_i) ≤ H_S(π^{u_i}). By Theorem 2.3, the smaller the value of LMD^G_S(u_i), the closer the fork structure at node u_i is to satisfying the local Markov condition according to sample S. On the other hand, the closer LMD^G_S(u_i) is to H_S(π^{u_i}), the more divergent the two probability distributions f_S(u_i|nd(u_i) = U) and f_S(u_i|Par(u_i) = U[Par(u_i)]) are for every U ∈ Dom(nd(u_i)). When LMD^G_S(u_i) = H_S(π^{u_i}), we have H_S(π^{u_i} | π^{Par(u_i)}) = H_S(π^{u_i}) and H_S(π^{u_i} | π^{nd(u_i)}) = 0. This means that the set Par(u_i) has no predictive capability at all at the node u_i, while the set nd(u_i) has perfect predictive capability on u_i. Let BNS(U) be the set of all possible Bayesian network structures on the set of attributes U. Define P(G|S) as
$$P(G \mid S) = \frac{\sum_{i=1}^{n} \big( H_S(\pi^{u_i}) - LMD^G_S(u_i) \big)}{\sum_{G' \in BNS(U)} \sum_{i=1}^{n} \big( H_S(\pi^{u_i}) - LMD^{G'}_S(u_i) \big)}.$$
Using the previous evaluations, SAMP_k(G, D) can be written as
$$\left( \prod_{q=1}^{k} \frac{\sum_{s=1}^{n} \big( H_{S_{2q}}(\pi^{u_s}) - LMD^G_{S_{2q}}(u_s) \big)}{\sum_{G' \in BNS(U)} \sum_{s=1}^{n} \big( H_{S_{2q}}(\pi^{u_s}) - LMD^{G'}_{S_{2q}}(u_s) \big)} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\gamma_{ij\cdot}(S_{2q}) + r_i)}{\Gamma(\gamma_{ij\cdot}(S_{2q-1} \cup S_{2q}) + r_i)} \prod_{r=1}^{r_i} \frac{\Gamma(\gamma_{ijr}(S_{2q-1} \cup S_{2q}) + 1)}{\Gamma(\gamma_{ijr}(S_{2q}) + 1)} \right)^{\frac{1}{k}}.$$
If we consistently sample the data across different structures, we can drop the entities that are constant with respect to the BNS G and, setting $P_G^{S_{2q}} = \sum_{s=1}^{n} \big( H_{S_{2q}}(\pi^{u_s}) - LMD^G_{S_{2q}}(u_s) \big)$, we set
$$SAMP_k(G, D) = \left( \prod_{q=1}^{k} P_G^{S_{2q}} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\gamma_{ij\cdot}(S_{2q}) + r_i)}{\Gamma(\gamma_{ij\cdot}(S_{2q-1} \cup S_{2q}) + r_i)} \prod_{r=1}^{r_i} \frac{\Gamma(\gamma_{ijr}(S_{2q-1} \cup S_{2q}) + 1)}{\Gamma(\gamma_{ijr}(S_{2q}) + 1)} \right)^{\frac{1}{k}}.$$
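In log space, the score above is simply the average over the k sample pairs of the two log factors. The sketch below only shows this composition step; the numeric inputs are placeholders standing in for the Gamma-function factor and the entropy-based factor, not values computed from real data.

```python
from math import fsum

def log_samp_k(log_p_pairs, log_priors):
    """(1/k) * sum_q [ log P(S_{2q-1} | G, S_{2q}) + log P_G^{S_{2q}} ]."""
    k = len(log_p_pairs)
    assert len(log_priors) == k and k > 0
    return fsum(lp + lg for lp, lg in zip(log_p_pairs, log_priors)) / k

# placeholder per-pair values for k = 3 sample pairs
score = log_samp_k([-11.4, -10.9, -12.1], [2.3, 2.1, 2.4])
```

Working with logarithms keeps the comparison between structures numerically stable even when the Gamma-function products underflow.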
3 Experimental Results and Conclusions
We conducted experiments on three well-known structures G_AM, G_CAR and G_NC for the domains Alarm, Car Diagnosis2 and Neapolitan Cancer, with 37, 18 and 5 nodes respectively. For the first two structures we randomly generated the corresponding probability tables, Θ_AM and Θ_CAR. Then, based on the probability distributions defined by (G_AM, Θ_AM) and (G_CAR, Θ_CAR), we generated data sets of sizes 80000 and 100000 respectively. For G_NC we used its corresponding data set from the literature, of size 7565 and with no missing values.

For each data set we randomly generated a number of structures of different complexities. The number of edges of these structures ranged over 1–10, 12–108 and 12–330 for the NC, CAR and AM data sets respectively.
Figures 1(a), 1(b) and 1(c) show very strong correlations between the CH score and the SAMP score for various values of k. The derived measure is cheaper to compute, since it works with samples much smaller than the entire data set.
We introduced a measure based on posterior probability for assessing the fitness of a Bayesian network structure for data. The conclusion of this work is that our sampling-based scoring is a viable and much cheaper alternative to the CH score. The fact that we use sampling to reduce the set of assumptions and still obtain a very strong correlation between the two measures confirms that SOUP and the uniform distribution on P(G) are safe assumptions and do not distort the search.
FIG. 1 – (a) Alarm data, (b) Car Diagnosis2 data, (c) Neapolitan Cancer data: correlations between log(SAMP_k(G, D)) and log(CH); (d) time comparison diagram: time in ms needed for computing the log(SAMP_1(G, CAR)) and CH scores.
References
Baraty, S. and D. A. Simovici (2009). Edge evaluation in Bayesian network structures. In
Proceedings of the 8th Australian Data Mining Conference, Melbourne, pp. 193–201.
Cooper, G. F. and E. Herskovits (1993). A Bayesian method for the induction of probabilistic
networks from data. Technical Report KSL-91-02, Stanford University, Knowledge System
Laboratory.
Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian networks: The
combination of knowledge and statistical data. In Machine Learning, pp. 197–243.
Jeffreys, H. and B. S. Jeffreys (1988). Dirichlet Integrals. Cambridge, UK: Cambridge University Press.
Résumé
The best-known qualitative evaluation of Bayesian networks in the presence of data is the Cooper-Herskovits criterion. This technique involves massive amounts of data and, consequently, numerous computations. We propose a more efficient evaluation method using simplified assumptions which produces evaluations strongly correlated with the Cooper-Herskovits criterion.