Bayesian Modeling with Strong vs. Weak Assumptions
in the Domain of Skills Assessment
Michel C. Desmarais, Peyman Meshkinfam, Michel Gagnon
Computer Engineering
École Polytechnique de Montréal
Montreal, Canada H3T 2B1
Abstract
Approaches such as Bayesian networks (BN) are considered highly powerful modeling and inference techniques because they make few assumptions and can represent complex relationships among variables with efficiency and parsimony. They can also be learned from training data, and they generally lend themselves to a variety of sound and efficient inference computations. However, in spite of these qualities, BN may not always be the most advantageous technique in comparison to simpler techniques that make stronger assumptions. We investigate this issue in the domain of skills modeling and assessment, where BN have received a wealth of attention. Vomlel's (2004) BN model of basic arithmetic skills is compared, on the basis of predictive accuracy, to a simple Bayes posterior probability update approach under strong independence assumptions, named POKS (Desmarais, Maluf, & Liu, 1995). The results of simulation experiments show that the BN yields better accuracy for predicting concept mastery, but POKS is better at predicting question mastery. We conjecture possible explanations for these findings by analyzing the specifics of the domain model, in particular its closure under union and intersection, and discuss their implications.
1 Introduction
Bayesian modeling with joint conditional probabilities is conceptually and computationally the most straightforward means of computing posterior probabilities. It makes few assumptions and offers good reliability given sufficient data. However, because the number of joint conditional probabilities grows exponentially with the number of variables, this approach quickly becomes impractical: the amount of data required to calibrate the model is too large. Consequently, reducing the number of conditional probabilities to estimate to a useful minimum is a fundamental goal in practice.
Bayesian networks (BN) address this issue by modeling only the relevant conditional probabilities out of the full joint distribution. They can represent complex relationships among variables with efficiency and parsimony. Because they explicitly model dependencies and independencies, they make few assumptions and generally lend themselves to a variety of sound and efficient inference computations. Bayesian networks limit their assumptions to the conditional independence of a node, given its parents, from all of the node's non-descendants (see Neapolitan, 2004, for a good introduction). Moreover, they can be learned from training data.
However, in spite of these qualities, BN may not always be the most advantageous technique in comparison to simpler techniques that make stronger assumptions. Just as BN derive their usefulness in a wide range of application contexts from offering parsimonious models for Bayesian modeling, Bayes models with stronger independence assumptions offer even more parsimonious representations than do BN. However, they impose further assumptions on the domain model that can lead to invalid inferences.
We investigate the tradeoff between model parsimony and predictive accuracy by comparing a BN model with a simple model based on the application of Bayes rule under strong independence assumptions. The domain modeled is the mastery of individual skills. Bayesian models have been used by a number of researchers in the domain of skills assessment and user modeling, such as Conati, Gertner, and VanLehn (2002), Mislevy, Almond, Yan, and Steinberg (1999), Millán, Trella, Pérez-de-la-Cruz, and Conejo (2000), and Vomlel (2004), to name but a few.
We use a BN developed by Vomlel (2004), composed of 20 question items and 19 concept skills, as the comparison point for the BN approach (see figure 1). Using the same data as Vomlel, we applied the Bayesian method named POKS (Desmarais et al., 1995) and compare their respective performance.

Figure 1: Bayesian network from Vomlel (2004). The model contains 20 question item nodes, represented by leaf nodes. Other nodes represent concepts or misconceptions, and hidden task nodes (labeled Tnn and drawn with dotted-line contours). See text for details.
2 Vomlel's Bayesian network
Figure 1 illustrates the Bayesian network structure that we use in this study. It contains a set of concepts and question items that take binary values (mastered or not mastered). Question nodes are leaf nodes in the structure and are labeled Xnn. Concept nodes are the non-leaf nodes with oval and rectangular shapes. There are several types of concept nodes. Nodes with labels starting with "M" are in fact misconceptions, whereas the oval concept nodes represent skills in the domain of fraction arithmetic. For example, the AD concept node (near the top right of figure 1's hierarchy) represents the skill of adding two fractions with a common denominator (e.g., 1/7 + 2/7 = 3/7) and concept node CD (left of node AD) is the skill of finding a common denominator (e.g., 1/2, 2/3 → 3/6, 4/6). Nodes with a dotted contour are hidden nodes: they are never directly assessed, whereas all other nodes are. Two concepts are hidden nodes (HV1 and CP). Tasks are also hidden nodes, as they allow more than one question item for a single task. For example, task T8 is linked to two question items.
Figure 1's network structure is adapted from Vomlel (2004). It was initially defined through a constraint-based learning algorithm, the Hugin PC algorithm (HUGIN Expert, 2002), which is based on a series of conditional independence tests. The structure was later inspected by domain experts for adjustments.

Vomlel (2004) tested a number of variants of figure 1's structure. The structure reported here was one of the best performers; it is described in more detail in Vomlel (2004).
3 The POKS Bayes modeling framework
Bayesian Networks, especially those with a hierarchical
structure, or that can be transformed into one, need not
make strong assumptions. The inference algorithms for
such structures can be exact given the BN structure’s independence assumptions, and these assumptions are themselves subject to statistical test by the learning algorithm.
Taking an almost opposite stance, the POKS approach makes strong independence assumptions, namely that the evidence variables are conditionally independent given the hypothesis H:

P(E1, ..., En | H) = ∏i P(Ei | H)    (1)
Given the independence assumption of equation (1), the probability update of H can be written in the following posterior odds form:

O(H | E1, E2, ..., En) = ∏i O(H | Ei)    (2)
where O(H | Ei) represents the odds of H given evidence Ei, with the usual odds semantics O(H | Ei) = P(H | Ei) / (1 − P(H | Ei)). This allows us to use Bayes' Theorem in its version based on odds and likelihood algebra:

O(H | E) = LS × O(H)    (3)

O(H | ¬E) = LN × O(H)    (4)

where LS and LN are respectively the likelihood of sufficiency and the likelihood of necessity:

LS = P(E | H) / P(E | ¬H)    (5)

LN = P(¬E | H) / P(¬E | ¬H)    (6)
All conditional probabilities, odds, and likelihood estimates are derived from Vomlel’s data set.
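To make the update concrete, here is a minimal Python sketch of equations (1) and (2). The function and the toy numbers are ours, not part of the original POKS implementation; the conditional probabilities P(H|Ei) are assumed to have been estimated from the data set as described above.

def posterior_probability(cond_ps):
    """P(H | E1,...,En) computed through the posterior odds form of
    equation (2), under the independence assumption of equation (1).
    cond_ps holds the individual P(H|Ei) estimated from data."""
    o = 1.0
    for p in cond_ps:
        o *= p / (1.0 - p)      # multiply the odds O(H|Ei)
    return o / (1.0 + o)        # convert the odds back to a probability

# Toy example: three observed evidence items individually give
# P(H|Ei) of 0.7, 0.6 and 0.8; the combined posterior is higher.
print(posterior_probability([0.7, 0.6, 0.8]))   # ~0.93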
4 POKS Network induction
The evidence nodes in equation (2) are determined by the topology of the POKS network. Determining which nodes are linked together is based on the POKS network induction algorithm (Desmarais et al., 1995). This algorithm is applied to Vomlel's (2004) data, which also served for the construction of figure 1's BN.

The POKS induction algorithm relies on a pairwise analysis of item-to-item relationships. The analysis attempts to identify the order in which we master knowledge items and is inspired by the knowledge spaces theory of Falmagne, Koppen, Villano, Doignon, and Johannesen (1990). This theory states that skill acquisition order can be modeled by an AND/OR graph. For our purpose, we impose a stronger assumption: that the skill acquisition order can be modeled by a directed acyclic graph, or DAG (see the example in figure 4 in the Discussion below). This assumption allows us to limit our network induction algorithm to a pairwise analysis. We will return to this question, as it has implications for the performance comparison of the BN vs. models with strong independence assumptions.
The test to establish a relation A → B consists of three conditions, each verified by a statistical test:

P(B | A) ≥ pc    (7)

P(¬A | ¬B) ≥ pc    (8)

P(B | A) ≠ P(B)    (9)

Conditions (7) and (8) are verified by a binomial test with two parameters: pc, the minimal conditional probability of equations (7) and (8), and αb, the alpha error tolerance level. For this study, both parameters are set at 0.5. Condition (9) is an independence test verified through a χ² statistic with an alpha error αc < 0.5. The high alpha error values maximize the number of relations obtained.
There is no knowledge engineering effort involved in building the relations among question items in POKS. Relations
are obtained with an algorithm based on statistical tests
over the above three conditions.
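As an illustration, the sketch below applies the three tests to a pair of binary response vectors using standard SciPy tests. The helper name poks_relation and the exact test procedures are our assumptions; the original POKS implementation may differ in detail.

import numpy as np
from scipy.stats import binomtest, chi2_contingency

def poks_relation(a, b, pc=0.5, alpha_b=0.5, alpha_c=0.5):
    """Test whether the data supports a relation A -> B, given binary
    (0/1) success vectors a and b over the same respondents."""
    a, b = np.asarray(a), np.asarray(b)
    n_a, n_not_b = int(a.sum()), int((1 - b).sum())
    if n_a == 0 or n_not_b == 0:
        return False

    # Condition (7): P(B|A) >= pc, binomial test over cases where A = 1.
    c7 = binomtest(int(b[a == 1].sum()), n_a, pc,
                   alternative='greater').pvalue <= alpha_b

    # Condition (8): P(not-A | not-B) >= pc, over cases where B = 0.
    c8 = binomtest(int((1 - a)[b == 0].sum()), n_not_b, pc,
                   alternative='greater').pvalue <= alpha_b

    # Condition (9): A and B are not independent (chi-square test).
    table = [[((a == 1) & (b == 1)).sum(), ((a == 1) & (b == 0)).sum()],
             [((a == 0) & (b == 1)).sum(), ((a == 0) & (b == 0)).sum()]]
    c9 = chi2_contingency(table)[1] <= alpha_c

    return c7 and c8 and c9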
Although a network of relations is obtained with the POKS induction algorithm, we do not propagate evidence within the network from an observed node beyond its directly connected nodes¹. In other words, if we have A → B and B → C, no probability update is performed over C upon the observation of A, unless a link A → C is also derived from the data. Experimental results not reported here show that performance is very close whether we propagate evidence using the POKS scheme in Desmarais et al. (1995) or do not propagate and rely solely on direct links for probability updates.

¹ Limiting propagation to direct neighbours does not correspond to the algorithm described in Desmarais et al. (1995), where propagation is performed in accordance with the algorithm described in Neapolitan (1998). The choice not to propagate further in this study is made in order to use the simplest model possible.
5 Logistic regression with the POKS model
As mentioned, the POKS network builds relations among observable question items; there are no hidden nodes in the network. However, to infer the mastery of concepts from the assessment of observable question item nodes, we need to add links from question items to concepts.

Logistic regression models are used to link concept nodes to question items. For each concept node in figure 1, a logistic regression model is built with the observable question items it is directly or indirectly linked to. For example, concept node CL (left side of figure 1) is linked to question items X1, X2, and X13 through a logistic regression model. The model's parameters are again estimated from Vomlel's data, where concept node mastery was independently assessed by experts (see below).
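A minimal sketch of this linkage, here with scikit-learn's LogisticRegression (our choice of library; the paper does not specify an implementation), with random stand-ins for the real response and expert-assessment data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per student, one column per linked question item (e.g., X1,
# X2 and X13 for concept CL); y is the expert-assessed mastery of the
# concept (0/1). In the paper both come from Vomlel's data; here they
# are random stand-ins so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(149, 3))
y = (X.sum(axis=1) >= 2).astype(int)        # toy mastery rule

model = LogisticRegression().fit(X, y)

# P(concept mastered | observed item responses) for a new student.
print(model.predict_proba([[1, 0, 1]])[0, 1])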
6 Methodology and data
The experiments are conducted over a dataset composed of 20 question items and 19 concept nodes. This data was graciously provided to us by Jiří Vomlel. The 20 questions were administered to 149 high school students. Concept nodes are not directly observed, but experts analyzed each individual's test answers to determine the mastery of each concept (except for the two hidden concepts in figure 1).

This data was used by Vomlel to build figure 1's Bayesian network model and to conduct simulations assessing predictive accuracy. We use the same data for the POKS simulations. Akin to Vomlel (2004), and in order to avoid over-calibration, model construction and calibration are done on the full data set, minus the data case for which a simulation is performed. This implies, for example, that 149 different POKS models are built for the simulations, one for each simulation data case.

The number of relations obtained for POKS with alpha error αb = 0.5 and minimal conditional probability pc = 0.5 (see section 4) is 206 on average with Vomlel's data of 20 items and 149 data cases.

7 Entropy reduction

For both the BN and POKS approaches, the order of questions is adaptive and determined by entropy minimization. The entropy of a single item Xi is defined by the usual formula:

H(Xi) = −[P(Xi) log(P(Xi)) + Q(Xi) log(Q(Xi))]

where Q(X) = 1 − P(X). The entropy of the whole test is the sum of the individual item entropies:

HT = Σi=1..k H(Xi)

If all items' probabilities are close to 0 or 1, the value of HT will be small and there will be little uncertainty about the examinee's ability score. We minimize uncertainty by choosing the item with the lowest expected value of test entropy. This value is given by:

Ei(H′T) = P(Xi) H′T(Xi = 1) + Q(Xi) H′T(Xi = 0)

where H′T(Xi = 1) is the test entropy after the examinee answers item i correctly and H′T(Xi = 0) is the test entropy after a wrong answer.
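The criterion is easy to express in code. In the sketch below, predict_after(i, v) is a hypothetical callback that returns all item probabilities after a hypothetical answer v on item i, standing in for whichever model (BN or POKS) is doing the inference:

import numpy as np

def test_entropy(ps):
    """H_T: sum of the individual item entropies H(Xi)."""
    ps = np.clip(np.asarray(ps, dtype=float), 1e-12, 1 - 1e-12)
    return float(-np.sum(ps * np.log(ps) + (1 - ps) * np.log(1 - ps)))

def next_item(ps, candidates, predict_after):
    """Pick the candidate item minimizing the expected test entropy
    E_i(H'_T) = P(Xi) H'_T(Xi=1) + Q(Xi) H'_T(Xi=0)."""
    def expected(i):
        h1 = test_entropy(predict_after(i, 1))   # entropy after a success
        h0 = test_entropy(predict_after(i, 0))   # entropy after a failure
        return ps[i] * h1 + (1 - ps[i]) * h0
    return min(candidates, key=expected)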
8 Simulation process

Two types of performance assessment are conducted. The question predictive accuracy assessment provides an estimate of the proportion of questions that are correctly predicted as succeeded or failed. We compare the model's predictions to the examinee's actual answers as the model is fed from 0 to all 20 question items. Of course, such performance ends at 20 items with a 100% correct "estimate". The concept predictive accuracy assessment proceeds with the same process of feeding the model with observed questions, but measures the accuracy of concept prediction. Recall that concepts are not directly observed but were independently assessed by experts.
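In outline, one simulation run then proceeds as follows. This sketch is our reading of the procedure, not the authors' code: predict, predict_after, and observe are hypothetical stand-ins for the model-specific operations (BN or POKS), and next_item is the selection routine sketched in section 7.

def simulate(examinee, n_items, predict, predict_after, observe):
    """One leave-one-out simulation run. examinee is the list of true
    0/1 answers; predict() returns the current list of P(success) for
    all items; predict_after(i, v) returns that list after a
    hypothetical answer v on item i; observe(i, v) commits an
    observation to the model."""
    remaining = list(range(n_items))
    accuracies = []
    while True:
        preds = predict()
        hits = sum((preds[i] >= 0.5) == bool(examinee[i])
                   for i in range(n_items))
        accuracies.append(hits / n_items)
        if not remaining:
            return accuracies      # ends at 100% once all items are seen
        i = next_item(preds, remaining, predict_after)  # entropy criterion
        observe(i, examinee[i])
        remaining.remove(i)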
9 Question predictive accuracy

Figure 2 illustrates the performance of both the POKS and BN approaches for predicting question item successes. Both lines are averages of the 149 data cases.

Figure 2: Question predictive accuracy (prediction score, 0.70 to 1.00, as a function of the number of items administered, 0 to 20; curves for BN, POKS, and the fixed sequence).
A third curve (dotted line) provides the performance of a non-adaptive, fixed question sequence, where all examinees get the same question sequence regardless of their previous answers. The fixed sequence orders items by their initial entropies: it starts with the items whose average success rate is closest to 0.5 and finishes with the items whose success rate is closest to 0 or 1.

The simulation results show that the POKS technique is able to predict answers to questions a few percentage points more accurately than the BN. The gain is not very strong, but it is systematic and more pronounced after the fifth question, especially relative to the number of items not yet observed. In fact, the BN does not perform systematically better than the fixed question sequence.
10 Concept predictive accuracy
Figure 3 reports the performance of each technique for predicting concept mastery. Recall that mastery of concepts was assessed independently by experts. Concept nodes in the network are not observed during the simulation, thus prediction does not reach 100% accuracy as it does in the question predictive simulation.

For this experiment, the performance of the BN model is clearly stronger than that of POKS. The BN approach quickly climbs from 74% correct to about 90% correct within only 5 items observed, and it stabilizes at close to 92% after a couple more items. The POKS performance is about 5% weaker after the first item, and this difference remains almost stable throughout the remaining items observed.
Figure 3: Concept predictive accuracy (prediction score, 0.70 to 1.00, as a function of the number of items administered, 0 to 20; curves for BN and POKS).

Figure 4: A simple knowledge space composed of 4 fraction arithmetic items ({a, b, c, d}, e.g., 2 × 1/2 = 1) and with a partial order that constrains the possible knowledge states to {∅, {d}, {d, b}, {d, c}, {d, b, c}, {d, b, c, a}}.
However, the fact that POKS is weaker even after all 20 items are administered indicates that the logistic regression model used does not perform as well as the BN. It suggests that the observed performance gap is mostly attributable to the logistic regression component rather than to the POKS inferences.
An obvious follow-up experiment would be to verify whether combining inferences from POKS and the BN, instead of using the logistic regression model, would improve over the BN's current performance at the concept level. Unfortunately, the two systems are not integrated and we could not conduct this experiment in time for this publication.

However, the question accuracy results clearly suggest that if concepts were assessed with the classical test construction technique, by which a teacher breaks down a subject matter into a set of more specific topics and assesses the mastery of each topic by one or more test items (possibly with a weighted mean), then concept assessment accuracy would be improved by POKS.
11 Discussion

Why does a simple Bayesian posterior update scheme, resting on strong independence assumptions and relying on no hidden nodes or knowledge-engineered structure, perform better for question accuracy than a BN? Unfortunately, we do not have a clear answer to this question. However, we note that such a finding may not be exceptional, as other studies have also concluded that simpler approaches relying upon stronger assumptions often outperform more sophisticated approaches under certain conditions (see, e.g., Domingos & Pazzani, 1997). Which are these possible conditions here? We conjecture a few observations and partial explanations in the following paragraphs and hope that they can help bring some light.

First, we note that other experiments over two other knowledge domains (knowledge of UNIX shell commands and mastery of written French) have also shown that the POKS approach performs at least as well as standard techniques in Computer Adaptive Testing, namely Item Response Theory (IRT) (Desmarais & Pu, 2005).

Let us also rule out the explanation that the BN used in this study is ill-structured and underperforming, since it did perform well at the concept prediction level, better than POKS with logistic regression in fact. This observation also makes less likely the explanation that, because POKS uses binary relations only, it needs a smaller sample size than the more complex n-ary relations found in the BN; good performance at the concept level suggests that 149 data cases is probably enough. It appears that, when it comes to concept predictive accuracy, Vomlel's BN can use relations among concepts to effectively predict concept mastery, in spite of its relatively poorer ability to predict question mastery from concept mastery.
Another explanation stems from the fact that Vomlel's BN does not build links directly amongst the question items themselves. This practice is typical of all BN used in the knowledge assessment and user modeling research literature. It also makes good sense, since question items and assessment tests have a short life span and frequent updates: the knowledge engineering effort required to build a BN among test items would prove inefficient, unless the process can be fully automated as in POKS. Nevertheless, by not directly linking question items among themselves, it is conceivable that the predictions miss valuable information that POKS exploited. It is also possible that by relying on question items to infer concept mastery and, in turn, predict question mastery, the evidence propagated loses weight and gathers noise. This could explain why direct links between question items are more effective, in spite of the strong assumptions used for building these links.
Another, potentially more interesting hint at why POKS did relatively well with question items lies in the structural properties of this domain. These properties are best understood by looking back at the theory of knowledge spaces we referred to in section 4. This theory is well known in mathematical psychology and states that knowledge items are mastered in a constrained order. For example, we learn to solve figure 4's problems in an order that complies with the arrows. It follows from this structure that if one succeeds item (c), it is likely she will also succeed item (d). Conversely, if she fails item (c), she will likely fail item (a). However, item (c) does not significantly inform us about item (b). This structure defines the following possible knowledge states: {∅, {d}, {d, c}, {d, b}, {d, b, c}, {a, b, c, d}}. Other knowledge states are deemed impossible (or unlikely in a probabilistic framework).
Formally, Falmagne and his colleagues argue that if the knowledge space of individual knowledge states is closed under union and intersection, then the set of all possible knowledge states can be represented by a directed acyclic graph (DAG)², such as the one in figure 4.
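For illustration, closure can be checked mechanically for figure 4's small space. This sketch is ours, not from the paper (and, per footnote 2, closure under intersection holds for this particular space but not for knowledge spaces in general):

from itertools import combinations

# Figure 4's knowledge states, as sets of mastered items.
states = [set(), {'d'}, {'d', 'b'}, {'d', 'c'}, {'d', 'b', 'c'},
          {'d', 'b', 'c', 'a'}]

def closed_under(op, states):
    """Check that applying op (union or intersection) to every pair
    of states yields a state that is itself in the space."""
    return all(op(s, t) in states for s, t in combinations(states, 2))

print(closed_under(set.union, states))           # True
print(closed_under(set.intersection, states))    # True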
This closure implies that, given a relation A → B, the absolute frequency of people who master knowledge item A can be no greater than the frequency of those who master B. This conclusion does not hold for general BN. For example, assume figure 4's structure carries the following interpretation (a BN taken from Neapolitan, 2004):
(a) smoking history
(b) bronchitis
(c) lung cancer
(d) fatigue
It is clear that smoking history (a) could be a much more frequent state than lung cancer (c) and bronchitis (b). It is also obvious that, whereas the occurrence of lung cancer could decrease the probability of bronchitis by discounting that latter cause as a plausible explanation for fatigue, discounting does not play a role in knowledge structures (e.g., observing figure 4's item (c) does not decrease the probability of (b); on the contrary, it could increase it).
In short, many interactions found in general BN do not occur in knowledge structures. We conjecture that the reduction in the space of possibilities that characterizes the domain we modeled in this experiment, namely the closure under union and intersection in knowledge spaces, warrants the use of strong independence assumptions in Bayesian modeling. It allows the domain to be modeled through a pairwise analysis of variable relations, thereby considerably reducing the computational complexity and the required size of the learning data set.

² In fact, Falmagne and colleagues show that the set of all knowledge states is closed under union only, not under intersection, and that an AND/OR graph is the proper structure. For our purpose, we make the assumption/approximation that it is closed under union and intersection and that a DAG is a proper representation of the ordering.
This last explanation is interesting because it links a network's structural properties (closure under union and intersection) to the level of assumption violation we can expect. However, we must emphasize that this explanation is speculative and not directly supported by empirical evidence from the current experiment. Further investigation is required to support this claim.
12 Acknowledgements
We are grateful to Jiří Vomlel for giving us valuable feedback on an early draft of the paper and for providing the data used in this experiment. This work has been supported by the National Research Council of Canada.
13 References
Conati, C., Gertner, A., & VanLehn, K. (2002). Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), 371–417.

Desmarais, M. C., Maluf, A., & Liu, J. (1995). User-expertise modeling with empirically derived probabilistic implication networks. User Modeling and User-Adapted Interaction, 5(3-4), 283–315.

Desmarais, M. C., & Pu, X. (2005). Computer adaptive testing: Comparison of a probabilistic network approach with item response theory. Proceedings of the 10th International Conference on User Modeling (UM'2005) (to appear). Edinburgh.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.

Falmagne, J.-C., Koppen, M., Villano, M., Doignon, J.-P., & Johannesen, L. (1990). Introduction to knowledge spaces: How to build, test, and search them. Psychological Review, 97, 201–224.

HUGIN Expert (2002). HUGIN Explorer, ver. 6.0, computer software (Technical report). http://www.hugin.com.

Millán, E., Trella, M., Pérez-de-la-Cruz, J.-L., & Conejo, R. (2000). Using Bayesian networks in computerized adaptive tests. In M. Ortega & J. Bravo (Eds.), Computers and Education in the 21st Century (pp. 217–228). Kluwer.

Mislevy, R. J., Almond, R. G., Yan, D., & Steinberg, L. S. (1999). Bayes nets in educational assessment: Where the numbers come from. In K. B. Laskey & H. Prade (Eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99) (pp. 437–446). San Francisco, CA: Morgan Kaufmann.

Neapolitan, R. E. (1998). Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York: John Wiley & Sons.

Neapolitan, R. E. (2004). Learning Bayesian Networks. New Jersey: Prentice Hall.

Vomlel, J. (2004). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(Supplementary Issue 1), 83–100.