
Learning with TETRAD IV to build a Bayesian network for the
study of depression in elderly people
Pilar Fuster-Parra
Departament Matemàtiques i Informàtica
Universitat de les Illes Balears, Spain
[email protected]
Carmen Moret-Tatay
Cátedra Energesis de Tecnologı́a Interdisciplinar
Universidad Católica de Valencia, Spain
[email protected]
Esperanza Navarro-Pardo
Departament Psicologia Evolutiva i de la Educació
Universitat de València, Spain
[email protected]
Pedro Fernández-de-Córdoba Castellá
Institut Universitari de Matemàtica Pura i Aplicada
Universitat Politècnica de València, Spain
[email protected]
Abstract
This paper describes the process of learning with TETRAD IV from a complete database composed of twenty-two variables (characteristics) and sixty-six observations (cases), obtained from several validated psychological tests administered to a specific population: the elderly. Both the structure and the parameters were learned with TETRAD IV. A Bayesian network was then implemented in Elvira.
1 Introduction
Discovering causal relationships from observational data is an important problem in empirical science (Druzdzel and Glymour, 1995; Claasen and Heskes, 2010). The purpose of the present article is to describe the process of learning with TETRAD IV to develop a Bayesian network (BN) in the domain of depression in elderly people from a database with complete observations. The BN can then be used to evaluate and investigate depression (Butz et al., 2009). BNs lie at the intersection of artificial intelligence, statistics, and probability (Pearl, 2000) and constitute a representation formalism strongly related to knowledge discovery and data mining (Heckerman, 1997). BNs are a kind of probabilistic graphical model (PGM) (Larrañaga and Moral, 2011), which combines graph theory (to help represent and solve complex problems) and probability theory (as a way of representing uncertainty). A PGM is defined by a graph whose nodes represent random variables and whose arcs represent dependencies among those variables. The graphical structure captures the compositional structure of the causal relations and general aspects of all probability distributions that factorize according to that structure (Glymour, 2003). A PGM is called a BN when the graph connecting its variables is a directed acyclic graph (DAG). A BN may or may not be causal, depending on whether the arcs in the DAG represent direct causal influences. When the arcs represent causal influences we speak of causal BNs (Spirtes et al., 1993). The assumption that arcs represent causality is very useful in the elicitation process (Buntine, 1996).
In Section 2 we analyze the problems associated with depression in the elderly. In Section 3 we describe the process of learning the structure from data using TETRAD IV. In Section 4 we present the process of learning the parameters. In Section 5 we present the use of an approximate updater. In Section 6 we show that the depression variable correctly classifies 80.30% of the cases under study. In Section 7 we present a BN to study depression in the elderly, implemented in Elvira. Finally, some conclusions and further remarks are given in Section 8.
2 Depression in the elderly

Depressive symptomatology is especially relevant to psychology and mental health: it is the fourth leading cause of disability in Spain and is expected to become the second leading cause within the next five years (Spanish Ministry of Health, 2007). In the elderly, 4 out of 10 doctor's appointments are associated with this pathology. It is not yet possible to establish the cause of depression. We therefore considered it interesting to explore which variables may influence depressive symptoms and which are influenced by them.

Although the global manual of mental illness created by the APA (American Psychiatric Association, 2002) does not set specific criteria for the elderly, pathoplasty differs at this age: depression usually appears in an atypical and unspecific way, accompanied by other pathologies that are often characteristic of this age group (Jiménez et al., 2006). Thus, we followed the DSM-IV-TR criteria, which are valid for all age groups. We also included other variables that, according to the literature, may be related to depressive symptoms in the elderly (Lang et al., 2010), such as age, gender, marital status, presence of cognitive impairment, level of physical dependence, coping, and psychological well-being. At the same time, we tried to examine the relationships among these variables.

3 Learning the structure from data

The difficulty of discovering the causal structure increases with the number of variables (Sucar and Martínez-Arroyo, 1998; Cheng et al., 2002): for each pair of variables X and Y there are four possible kinds of connections, X → Y, X ← Y, X ↔ Y, and no connection between X and Y. The number of distinct possible causal arrangements of n variables is therefore 4 raised to the number of pairs of variables (Glymour, 2003). With twenty-two variables there are 4^(21+20+···+2+1) = 4^231 ≈ 1.190853 × 10^139 possible arrangements of connections between variables.
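This count is easy to verify directly; a minimal Python sketch:

```python
import math

n_vars = 22
n_pairs = math.comb(n_vars, 2)     # 21 + 20 + ... + 1 = 231 pairs
n_arrangements = 4 ** n_pairs      # 4 connection types per pair

print(n_pairs)                     # 231
print(f"{n_arrangements:.6e}")     # 1.190853e+139
```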
Figure 1: Pattern obtained when the GES algorithm performs the search on data.
Basically, there are two approaches to structure learning (Jensen and Nielsen, 2007): i) search-and-score structure learning and ii) constraint-based structure learning. Search-and-score algorithms assign a number (score) to each BN structure, and we look for the structure with the highest score. Constraint-based algorithms perform a set of conditional independence tests on the data (Margaritis, 2003); based on these tests an undirected graph is generated, which is then converted into a BN using additional independence tests. PC is the best-known example of a constraint-based learning algorithm. For discrete or categorical variables, PC uses a chi-square test of independence or conditional independence (Spirtes et al., 1993).
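For readers who wish to experiment outside TETRAD IV, both families of algorithms are available in open-source libraries. The sketch below uses the pgmpy package (our choice for illustration; not the software used in this study) to run a constraint-based and a score-based search on a categorical data set. The file name is hypothetical, and since pgmpy has no GES implementation, greedy hill climbing stands in for the score-based search:

```python
import pandas as pd
from pgmpy.estimators import PC, HillClimbSearch, BicScore

# Hypothetical file: a categorical data set, one column per variable.
data = pd.read_csv("elderly_depression.csv")

# Constraint-based search: conditional independence tests (chi-square),
# analogous to the PC algorithm in TETRAD IV.
pc = PC(data)
pattern = pc.estimate(ci_test="chi_square")
print(sorted(pattern.edges()))

# Search-and-score: greedy hill climbing over structures scored with BIC.
hc = HillClimbSearch(data)
best_model = hc.estimate(scoring_method=BicScore(data))
print(sorted(best_model.edges()))
```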
In order to obtain the DAG we used TETRAD IV (Scheines et al., 1998).
Table 2: GES algorithm applied on data (model 1), applied on data with prior knowledge (model 2), and with prior knowledge obtained from PC, PCD, CPC, JPC, and JCPC (model 3, model 4, model 5, model 6, and model 7, respectively).

Model  Search  df   χ²        p-value
1      GES     200  195.3489  0.5796
2      GES     201  201.0426  0.4859
3      GES     199  192.754   0.6114
4      GES     199  192.754   0.6114
5      GES     200  198.1167  0.5244
6      GES     199  194.5962  0.5749
7      GES     199  194.5962  0.5749
Figure 2: Pattern obtained when the GES algorithm performs the search on data with prior knowledge.
Table 1: Search algorithm results on data.

Search     df   χ²        p-value
PC         224  402.3027  0
PCPattern  214  297.9925  0.0001
PCD        224  402.3027  0
CPC        214  283.286   0.0011
JPC        226  396.2216  0
JCPC       226  396.2216  0
GES        200  195.3489  0.5796
The software is available as freeware in the TETRAD IV suite of algorithms at www.phil.cmu.edu/projects/tetrad. Several search algorithms integrated in TETRAD IV were tested on the data (PC, PCPattern, PCD, CPC, JPC, JCPC, and GES), and we found GES (greedy equivalence search) (Chickering, 2002) to be the one best behaved with respect to our data (see Table 1). GES was the only algorithm with which we obtained a completely connected graph, so we chose it.
The search process yields a pattern in which most of the arc directions are defined; that is, we obtained a partially directed graph G. Only a few connections between variables remained undirected: maritalstatus – sex and brcsme – caerep – visionproblems. It is also possible to obtain a DAG from this pattern; the connections that were
left undirected are marked with yellow arrows in Figure 1. The resulting graph G′ extends the graph G because i) G and G′ have the same adjacencies and ii) if A → B is in G then A → B is in G′. G′ is a consistent DAG extension of G because i) G′ extends G, ii) G′ is a directed acyclic graph, and iii) pattern(G) = pattern(G′) (Meek, 2003). The model obtained in this way (from the original data alone) we called model 1. We then performed another search with the GES algorithm, taking the data together with prior knowledge (Heckerman et al., 1995), which was included as tiers ordering the variables maritalstatus, sex, brcsme, caerep, and visionproblems. The order introduced was the same as in the first search, i.e., the yellow directions marked in Figure 1. Another DAG was obtained, which we called model 2 (see Figure 2). The only difference between the two models is that in model 2 the arc connecting caeafn with worklevel (present in model 1) has been removed; as a consequence, the degrees of freedom increased by one (Glymour et al., 1986). We also considered a possible model 3, obtained by applying the GES algorithm to the data together with prior knowledge (the connections obtained by the PC algorithm, encoded as tiers). We proceeded likewise with PCD, CPC, JPC, and JCPC, calling the resulting models model 4, model 5, model 6, and model 7, respectively. A summary of the results is given in Table 2. To compare the models of Table 2 we proceeded as in (Bollen, 1989) and computed the differences of the χ² statistics and the corresponding p-values, obtaining Table 3.

Table 3: Comparisons among GES models.

Model comparison   χ²      df  p-value
model 2 versus 1   5.6937  1   0.9829740
model 2 versus 3   8.2886  2   0.9841455
model 1 versus 3   2.5949  1   0.8927918
model 2 versus 4   8.2886  2   0.9841455
model 1 versus 4   2.5949  1   0.8927918
model 2 versus 6   6.4464  2   0.9601726
model 1 versus 6   0.7527  1   0.6143773
model 5 versus 6   3.5205  1   0.9393858
model 2 versus 5   2.5949  1   0.8927918
model 5 versus 4   5.3627  1   0.9794281
model 5 versus 3   5.3627  1   0.9794281
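Each row of Table 3 is a nested-model χ² difference test: the χ² statistics and the degrees of freedom of the two models are subtracted, and the difference is referred to a χ² distribution. A minimal sketch of the computation (note that the values reported in Table 3 correspond to the cumulative probability of the difference, not the upper-tail p-value):

```python
from scipy.stats import chi2

def compare_models(chi2_a, df_a, chi2_b, df_b):
    """Chi-square difference test between two nested models."""
    diff = abs(chi2_a - chi2_b)
    df = abs(df_a - df_b)
    return diff, df, chi2.cdf(diff, df)

# Model 2 (chi2 = 201.0426, df = 201) versus model 1 (chi2 = 195.3489,
# df = 200): yields diff = 5.6937, df = 1, and cumulative probability
# 0.98297, as in the first row of Table 3.
print(compare_models(201.0426, 201, 195.3489, 200))
```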
The model comparisons single out model 2 as the model of choice. Model 2 was compared to an unconstrained model (model 1) and to several constrained models (models 3, 4, and 5). Model 7 was not considered in the comparison because it gave the same results as model 6. Among the small number of models considered, model 2 has strong statistical support; however, many other models could be constructed.
4 Learning the parameters
In a BN the conditional probability distributions are called the parameters. A prior probability, in a particular model, is the probability of some event before the probability of that event is updated, within the framework of that model, using new information. A posterior probability is the probability of an event after its prior probability has been updated, within the framework of some model, based on new information. The parameters were obtained using the estimator box of TETRAD IV. A Dirichlet estimation was performed, which estimates a Bayes instantiated model using a Dirichlet distribution for each category. The probability of each value of a variable (conditional on the values of the variable's parents) is estimated by adding, for each configuration of the parents, the number of cases in which the variable takes that value to a prior pseudocount (1 by default; see Table 4), and then dividing by the total number of cases, in the prior and in the data, with that configuration of parent variables (see Table 5).
Table 4: Dirichlet parameters (pseudocounts) for the variable a = age conditional on the combinations of its parent values (total pseudocount per row shown as t.c.), where s and b denote the variables selfratedhealth and barthelindex, respectively, and d. = dependent, i. = independent.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  1.0000    1.0000    2.0000
bad   i.  1.0000    1.0000    2.0000
good  d.  1.0000    1.0000    2.0000
good  i.  1.0000    1.0000    2.0000
Table 5: Expected values of the prior probabilities for the variable a = age conditional on the combinations of its parent values (total pseudocount per row shown as t.c.), where s and b denote the variables selfratedhealth and barthelindex, respectively.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  0.5000    0.5000    2.0000
bad   i.  0.5000    0.5000    2.0000
good  d.  0.5000    0.5000    2.0000
good  i.  0.5000    0.5000    2.0000
Using a Dirichlet distribution with parameters α1, α2, ..., αk, the prior probabilities are (Neapolitan, 2004):
p(x1) = α1/m,   p(x2) = α2/m,   ...,   p(xk) = αk/m        (1)
where m = α1 + ... + αk. After seeing that x1 occurs s1 times, x2 occurs s2 times, ..., and xk occurs sk times in n = s1 + ... + sk trials, the posterior probabilities are as follows:
probabilities are as follows:
α1 + s1
m+n
α2 + s2
P (x2 | s1 . . . sk ) =
m+n
..
.
αk + sk
P (xk | s1 . . . sk ) =
m+n
P (x1 | s1 . . . sk ) =
(2)
Table 6: Expected values of the posterior probabilities for the variable a = age conditional on the combinations of its parent values (total count per row shown as t.c.), where s and b denote the variables selfratedhealth and barthelindex, respectively.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  0.2000    0.8000    5.0000
bad   i.  0.6667    0.3333    12.0000
good  d.  0.2500    0.7500    4.0000
good  i.  0.3774    0.6226    53.0000
The numbers α1, α2, ..., αk are usually obtained from experience, as if we had already seen the first outcome occur α1 times, the second outcome occur α2 times, ..., and the last outcome occur αk times. These αi are the pseudocounts of TETRAD IV. Several possibilities were considered: i) α1 = α2 = ... = αk = 1, for the case in which we have no knowledge at all concerning the relative frequencies; ii) α1 = α2 = ... = αk < 1, when it is believed that the relative frequency of the i-th value is around αi/N; iii) α1 = α2 = ... = αk > 1, when someone wants to impose his or her relative frequencies on the system. We obtained better classification results using a pseudocount of 1 (case i) than in the other cases. Another possibility was considered, expressing prior indifference with a prior equivalent sample size (Neapolitan, 2004): the variables in our problem have two or three values, so we assigned a pseudocount of 1.5 to each value of the two-valued variables and a pseudocount of 1 to each value of the three-valued variables. The resulting probabilities were slightly different, but again, for example, the depression variable classified worse. From the pseudocounts, TETRAD IV determines the conditional probability of a category (see Table 6): the estimate is obtained by taking the pseudocount of a category and dividing it by the total count for its row.
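The prior and posterior computations of Eqs. (1) and (2) are easy to reproduce. The sketch below is ours (not TETRAD IV output); the pseudocounts come from Table 4, and the data counts are inferred from the first row of Table 6 (s = bad, b = d., total count 5):

```python
def dirichlet_posterior(alphas, counts):
    """Posterior expected probabilities, Eq. (2): (alpha_i + s_i) / (m + n)."""
    m = sum(alphas)   # total pseudocount
    n = sum(counts)   # number of observed cases with this parent configuration
    return [(a + s) / (m + n) for a, s in zip(alphas, counts)]

# Variable a = age with parents s = bad, b = d. (first row of Table 6):
# pseudocounts of 1 for each of the two categories (Table 4), and the data
# contain 0 cases with a <= 76.5 and 3 cases with a > 76.5.
alphas = [1, 1]
counts = [0, 3]

print(dirichlet_posterior(alphas, counts))   # [0.2, 0.8], total count 5
```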
5 Updater
An approximate updater has been implemented. The updater of TETRAD IV allows us to update the values of other parameters given a set of predetermined values of parameters in that model. There is also the possibility of using an exact updater, but the approximate one is very fast. In Figures 3 and 4 some evidence has been given to the updater: sex = woman, age > 76.5, maritalstatus = widow, worklevel = medium, deafprob = yes, caefsp = presence, caeafn = presence, caerlg = presence. The difference between Figures 3 and 4 is the variable brcsme, set to no in Figure 3 and to yes in Figure 4; with the updater we obtained a clear variation in the depression variable. The probability of depression = yes increases to 0.75 with brcsme set to no (see Figure 3), and decreases to 0.125 when brcsme is set to yes. The blue lines, and the values listed across from them, indicate the probability that the variable takes on the given value in the input instantiated model. The red lines indicate the probability that the variable depression takes on the given value, given the evidence that we added to the updater.

Figure 3: Updater with some evidence, worklevel = medium, brcsme = no.

Figure 4: Updater with some evidence, worklevel = medium, brcsme = yes.
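The same kind of evidence propagation can be reproduced outside TETRAD IV with a general-purpose BN library. Below is a minimal sketch using pgmpy (our illustration, not the paper's tool; class names vary slightly across pgmpy versions), with a two-node toy network whose CPT numbers are made up to mimic the behaviour reported above:

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Toy network: brcsme -> depression, with illustrative (made-up) CPTs.
model = BayesianNetwork([("brcsme", "depression")])
cpd_b = TabularCPD("brcsme", 2, [[0.5], [0.5]],
                   state_names={"brcsme": ["no", "yes"]})
cpd_d = TabularCPD("depression", 2,
                   [[0.75, 0.125],    # depression = yes
                    [0.25, 0.875]],   # depression = no
                   evidence=["brcsme"], evidence_card=[2],
                   state_names={"depression": ["yes", "no"],
                                "brcsme": ["no", "yes"]})
model.add_cpds(cpd_b, cpd_d)

# Exact inference: query depression given evidence on brcsme.
infer = VariableElimination(model)
print(infer.query(variables=["depression"], evidence={"brcsme": "no"}))
```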
6 Validation

TETRAD IV takes as input a categorical data set and a Bayes instantiated model and, for a given variable, estimates that variable's value in each case. Figure 5 shows that 80.30% of the cases are correctly classified with respect to the depression variable. Figure 6 shows the Receiver Operating Characteristic (ROC) curve, also known as the relative operating curve, a graphical plot of the sensitivity, or true positive rate, versus the false positive rate. The ROC curve falls within the 1 × 1 square, and the area under it is used as an indicator of predictive quality. The area under the curve (AUC) is defined as the probability of correctly classifying a randomly selected pair of cases (one positive and one negative).

In Figure 5 we can see the contingency table, or confusion matrix, where TN = 47 (true negatives) is the number of correct predictions that an instance is negative, FP = 2 (false positives) is the number of incorrect predictions that an instance is positive, FN = 11 (false negatives) is the number of incorrect predictions that an instance is negative, and TP = 6 (true positives) is the number of correct predictions that an instance is positive. Thus, TNR = 47/(47+2) (true negative rate, the proportion of negative cases that were classified correctly), TPR = 6/(11+6) (true positive rate, the proportion of positive cases that were correctly identified), FPR = 2/(47+2) (false positive rate, the proportion of negative cases that were incorrectly classified as positive), and FNR = 11/(11+6) (false negative rate, the proportion of positive cases that were incorrectly classified as negative). The precision P, the proportion of predicted positive cases that were correct, is P = 6/(2+6) = 0.75, and the accuracy for the depression variable is 80.30% (ACC = (TP+TN)/(TP+FN+FP+TN) = 53/66 = 0.8030). From these figures we can see that the BN provides a computationally efficient prediction system for the study of depression.

Figure 5: Percentage correctly classified for depression = no.

Figure 6: ROC plot for depression = no.
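These rates follow directly from the confusion matrix; a minimal sketch of the computation (counts taken from Figure 5):

```python
# Confusion matrix for the depression variable (Figure 5).
TN, FP, FN, TP = 47, 2, 11, 6

TNR = TN / (TN + FP)                          # specificity: 47/49
TPR = TP / (TP + FN)                          # sensitivity: 6/17
FPR = FP / (TN + FP)                          # 2/49
FNR = FN / (TP + FN)                          # 11/17
precision = TP / (TP + FP)                    # 6/8 = 0.75
accuracy = (TP + TN) / (TP + TN + FP + FN)    # 53/66 = 0.8030...

print(f"precision={precision:.2f}, accuracy={accuracy:.4f}")
```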
7 Bayesian network for depression

TETRAD IV was developed to make it possible to build a Bayesian network (Spirtes et al., 1993), which can then be used as an expert system derived from a database. However, TETRAD IV is not a Bayesian network program: it offers no graphical interface showing the graphical and the quantitative structure at the same time, and it cannot work with missing data, as Bayesian network tools usually do.

From the parameters and structure obtained with TETRAD IV, a Bayesian network has been implemented using Elvira, a tool for building and evaluating graphical probabilistic models; the software can be obtained for free at http://www.ia.uned.es/elvira (see Figure 7).

Figure 7: Bayesian network with 22 nodes implemented in Elvira.

The independencies in the graph are translated to the probabilistic model. The joint probability distribution of the BN of Figure 7 requires the specification of 22 conditional probability tables, one for each variable conditioned on its set of parents. The joint probability distribution factorizes as a product of conditional distributions, with the dependency/independency structure given by the DAG:

P(x1, ..., x22) = ∏_{i=1}^{22} P(xi | Pa(xi))               (3)

Based on the magnitude and the sign of the influences (Lacave et al., 2011), the networks implemented in Elvira offer automatic coloring of links, as can be seen in Figure 8. Most of the connections are red, i.e., positive links; undefined influences are shown in purple and negative ones in blue. The coloring and the width of the links help to detect wrong influences. There are only two exceptions in the network for depression: wais3pe → cognitiveimpairment and caeeea → depression, for obvious reasons.

Figure 8: Bayesian network with the automatic link-coloring feature of Elvira.
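Equation (3) is straightforward to evaluate once the CPTs are specified. The following toy sketch (our own, over three hypothetical nodes with made-up probabilities, not the Elvira network) multiplies one CPT entry per node to obtain a joint probability:

```python
# Toy factorization over three nodes: sex -> depression <- brcsme,
# with illustrative (made-up) CPT numbers.
parents = {"sex": [], "brcsme": [], "depression": ["sex", "brcsme"]}
cpt = {
    "sex":        {(): {"woman": 0.6, "man": 0.4}},
    "brcsme":     {(): {"yes": 0.5, "no": 0.5}},
    "depression": {("woman", "no"):  {"yes": 0.75,  "no": 0.25},
                   ("woman", "yes"): {"yes": 0.125, "no": 0.875},
                   ("man", "no"):    {"yes": 0.4,   "no": 0.6},
                   ("man", "yes"):   {"yes": 0.1,   "no": 0.9}},
}

def joint(config):
    """P(x1,...,xn) = prod_i P(xi | Pa(xi)), as in Eq. (3)."""
    p = 1.0
    for var, pa in parents.items():
        key = tuple(config[q] for q in pa)
        p *= cpt[var][key][config[var]]
    return p

print(joint({"sex": "woman", "brcsme": "no", "depression": "yes"}))
# 0.6 * 0.5 * 0.75 = 0.225
```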
8 Concluding remarks
The structure and the parameters have been obtained from complete data using the software TETRAD IV. ROC curves and the confusion matrix have been taken into account. From the results obtained, a Bayesian network has been implemented using the software Elvira.

Further work is oriented to the study and evaluation of depression in the elderly using the implemented BN, which can be used to calculate new probabilities when new information (evidence) is introduced, so that further interesting conclusions can be drawn.
Acknowledgments
This work was financially supported by the
UV-INV-AE11-39878 project and the MICINN
TIN2009-12359 project ArtBioCom.
References
American Psychiatric Association (APA) (2002). Manual diagnóstico y estadístico de los trastornos mentales. Texto revisado (DSM-IV-TR). Barcelona: Masson.
Bollen, K.A. (1989). Structural equations with latent variables. Wiley series in probability and
mathematical statistics.
Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(2):195–210.
Butz, C. J., Hua, S., Chen, J. and Yao, H. (2009).
A simple graphical approach for understanding
probabilistic inference in Bayesian networks. Information Sciences, 179, 699-716.
Claasen, T. and Heskes, T. (2010). Learning causal
network structure from multiple (in)dependence
models. Proceedings of the 5th European Workshop on Probabilistic Graphical Models, 81-89,
Helsinki, Finland.
Cheng, J., Greiner, R., Kelly, J., Bell,D. and Liu, W.
(2002). Learning Bayesian networks from data: an
information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D. (2002). Optimal structure identification with greedy search. Journal of Machine
Learning Research, 3(3), 507–554.
Druzdzel, M.J. and Glymour, C. (1995). What do
college ranking data tell us about student retention: causal discovery in action. Intelligent Information Systems IV. Proceedings of the workshop
held in Poland.
Glymour, C. (2003). The mind’s arrows: Bayes nets
and graphical causal models in psychology. MIT
Press, New York.
Glymour, C., Scheines, R., Spirtes, P. and Kelly,
K. (1986). Discovering causal structure. Technical
report CMU-PHIL-1.
Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1:79–119.
Heckerman, D., Geiger, D. and Chickering, D.M.
(1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine
Learning, 20, 197–243.
Jensen, F.V. and Nielsen, T.D. (2007). Bayesian networks and decision graphs. Information Science &
Statistics. Springer.
Jiménez, A.M., Gálvez Sánchez, N. and Esteban Sáiz, R. (2006). Depresión y ansiedad. In Sociedad Española de Geriatría y Gerontología (SEGG), Tratado de Geriatría para Residentes, pp. 235–239.
Lacave, C., Luque, M. and Díez, F.J. (2011). Explanation of Bayesian networks and influence diagrams in Elvira. Journal of Latex Class Files, vol. 1, 11, 1511–1528.
Lang, G., Resch, K., Hofer, K., Braddick, F. and Gabilondo, A. (2010). La salud mental y el bienestar de las personas mayores. Hacerlo posible. Madrid: IMSERSO.
Larrañaga, P. and Moral, S. (2011). Probabilistic
graphical models in artificial intelligence. Applied
Soft Computing, 1511-1528.
Margaritis, D. (2003). Learning Bayesian network model structure from data. PhD thesis, Technical Report CMU-CS-03-153, Carnegie Mellon University.
Meek, Ch. (2003). Causal inference and causal explanation with background knowledge, pp. 403–410.
Spanish Ministry of Health (Ministerio de Sanidad).
(2007). Estrategia en salud mental del sistema
nacional de salud, 2006. España: Ministerio de
Sanidad y Consumo.
Neapolitan, R.E. (2004). Learning Bayesian networks. Prentice Hall.
Pearl, J. (2000). Causality. Models, reasoning and
inference. Cambridge university press.
Sucar, L.E. and Martı́nez-Arroyo, M. (1998). Interactive structural learning of Bayesian networks.
Expert Systems with Applications, 15, 325–332.
Scheines, R., Spirtes, P., Glymour, C., Meek, C. and Richardson, T. (1998). The TETRAD project: constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117.
Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, prediction and search. Springer-Verlag.