Learning with TETRAD IV to build a Bayesian network for the study of depression in elderly people

Pilar Fuster-Parra, Departament Matemàtiques i Informàtica, Universitat de les Illes Balears, Spain, [email protected]
Carmen Moret-Tatay, Cátedra Energesis de Tecnología Interdisciplinar, Universidad Católica de Valencia, Spain, [email protected]
Esperanza Navarro-Pardo, Departament Psicologia Evolutiva i de la Educació, Universitat de València, Spain, [email protected]
Pedro Fernández-de-Córdoba Castellá, Institut Universitari de Matemàtica Pura i Aplicada, Universitat Politècnica de València, Spain, [email protected]

Abstract

This paper describes the process of learning with TETRAD IV from a complete database composed of twenty-two variables (characteristics) and sixty-six observations (cases), obtained from different validated psychological tests applied to a specific population: the elderly. Both the structure and the parameters were learned with TETRAD IV. A Bayesian network was then implemented in Elvira.

1 Introduction

Discovering causal relationships from observational data is an important problem in empirical science (Druzdzel and Glymour, 1995; Claasen and Heskes, 2010). The purpose of the present article is to describe the process of learning with TETRAD IV to develop a Bayesian network (BN) in the domain of depression in elderly people, from a database with complete observations. The BN can then be used to evaluate and investigate depression (Butz et al., 2009).

BNs lie at the intersection of artificial intelligence, statistics and probability (Pearl, 2000) and constitute a representation formalism strongly related to knowledge discovery and data mining (Heckerman, 1997). BNs are a kind of probabilistic graphical model (PGM) (Larrañaga and Moral, 2011), combining graph theory (to help the representation and resolution of complex problems) and probability theory (as a way of representing uncertainty). A PGM is defined by a graph in which nodes represent random variables and arcs represent dependencies among those variables. The graphical structure captures the compositional structure of the causal relations and general aspects of all probability distributions that factorize according to that structure (Glymour, 2003). A PGM is called a BN when the graph connecting its variables is a directed acyclic graph (DAG). A BN may or may not be causal, depending on whether the arcs in the DAG represent direct causal influences. If the arcs represent causal influences we speak of causal BNs (Spirtes et al., 1993). The assumption that arcs represent causality is very useful in the elicitation process (Buntine, 1996).

In Section 2 we analyze the problems associated with depression in the elderly. In Section 3 we describe the process of learning the structure from data using TETRAD IV. In Section 4 we present the process of learning the parameters, and in Section 5 the use of an approximate updater. In Section 6 we show that the depression variable correctly classifies 80.30% of the cases under study. In Section 7 we present a BN for the study of depression in the elderly, implemented in Elvira. Finally, some conclusions and further remarks are given in Section 8.
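To make the factorization idea concrete before turning to the study itself, the following minimal Python sketch (an illustration, not part of the original work; the network, names and numbers are invented) encodes a three-node BN A → B, A → C and evaluates its joint distribution as a product of one conditional probability table per node:

```python
# Toy Bayesian network over binary variables: A -> B, A -> C.
# One conditional probability table (CPT) per node; values are invented.
p_a = {0: 0.7, 1: 0.3}                                    # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # P(B | A), indexed [a][b]
p_c_given_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # P(C | A), indexed [a][c]

def joint(a, b, c):
    """Joint probability factorized along the DAG: P(a, b, c) = P(a) P(b|a) P(c|a)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

print(joint(1, 0, 1))  # 0.3 * 0.4 * 0.5 = 0.06
# The factorization defines a proper distribution: the joint sums to 1.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0 (up to rounding)
```

The same scheme, with 22 nodes and one CPT per variable, underlies the network built in the remainder of the paper.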
2 Depression in the elderly

Depressive symptomatology is especially relevant to psychology and mental health: it is the fourth leading cause of disability in Spain and is expected to become the second within the next five years (Spanish Ministry of Health, 2007). In the elderly, four out of ten doctor's appointments are associated with this pathology. It is not yet possible to establish the cause of depression, so we considered it interesting to explore which variables may influence depressive symptoms and which are influenced by them. Although the diagnostic manual of mental disorders published by the APA (American Psychiatric Association, 2002) does not set specific margins for the elderly, the pathoplasty differs: in this age group depression usually appears in an atypical and unspecific way, accompanied by other pathologies that are often characteristic of this age (Jiménez et al., 2006). We therefore followed the DSM-IV-TR criteria, which are valid for all age groups. We also included other variables that, according to the literature, may be related to depressive symptoms in the elderly (Lang et al., 2010), such as age, gender, marital status, presence of cognitive impairment, level of physical dependence, coping and psychological well-being, and at the same time we tried to examine the relationships between them.

3 Learning the structure from data

The problem of discovering the causal structure grows rapidly with the number of variables (Sucar and Martínez-Arroyo, 1998; Cheng et al., 2002): for each pair of variables X and Y there are four possible kinds of connection, X → Y, X ← Y, X ↔ Y, and no connection. The number of distinct possible causal arrangements of n variables is therefore 4 raised to the number of pairs of variables (Glymour, 2003). With twenty-two variables there are 21 + 20 + ... + 1 = 231 pairs, giving 4^231 ≈ 1.190853 × 10^139 possible connection patterns between the variables.

Basically, there are two approaches to structure learning (Jensen and Nielsen, 2007): i) search-and-score structure learning and ii) constraint-based structure learning. Search-and-score algorithms assign a number (score) to each BN structure and look for the structure with the highest score. Constraint-based algorithms perform a set of conditional independence tests on the data (Margaritis, 2003); based on these tests an undirected graph is generated, which is then converted into a BN using additional independence tests. The best-known constraint-based learning algorithm is PC. For discrete or categorical variables, PC uses a chi-square test of independence or of conditional independence (Spirtes et al., 1993); a minimal sketch of such a test is shown below.
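The sketch below is illustrative only: it uses synthetic data and pools per-stratum chi-square statistics, one simple variant of the conditional independence tests that PC relies on; scipy is assumed available.

```python
# Chi-square test of conditional independence X _|_ Y | Z on synthetic data,
# the kind of test used by constraint-based algorithms such as PC.
import numpy as np
from scipy.stats import chi2, chi2_contingency

rng = np.random.default_rng(0)
n = 500
z = rng.integers(0, 2, n)                         # binary conditioning variable
x = (rng.random(n) < 0.3 + 0.4 * z).astype(int)   # X depends on Z
y = (rng.random(n) < 0.2 + 0.5 * z).astype(int)   # Y depends on Z but not on X

# Pool the chi-square statistic and degrees of freedom over the strata of Z.
stat, dof = 0.0, 0
for z_val in (0, 1):
    table = np.zeros((2, 2))
    for xi, yi in zip(x[z == z_val], y[z == z_val]):
        table[xi, yi] += 1
    res = chi2_contingency(table)
    stat += res[0]   # chi-square statistic of this stratum
    dof += res[2]    # degrees of freedom of this stratum

p_value = chi2.sf(stat, dof)  # large p: independence of X and Y given Z is not rejected
print(f"chi2 = {stat:.3f}, df = {dof}, p = {p_value:.3f}")
```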
In order to obtain the DAG we used TETRAD IV (Scheines et al., 1998). The software is available as freeware in the TETRAD IV suite of algorithms at www.phil.cmu.edu/projects/tetrad. Several search algorithms integrated in TETRAD IV were tested on the data (PC, PCPattern, PCD, CPC, JPC, JCPC and GES), and we found GES (greedy equivalence search) (Chickering, 2002) to be the best behaved with respect to our data (see Table 1); it was also the only algorithm with which we obtained a completely connected graph, so we chose it.

Table 1: Search algorithm results on the data.

Search     df    χ²        p-value
PC         224   402.3027  0
PCPattern  214   297.9925  0.0001
PCD        224   402.3027  0
CPC        214   283.286   0.0011
JPC        226   396.2216  0
JCPC       226   396.2216  0
GES        200   195.3489  0.5796

The search process yields a pattern in which most of the arc directions are defined, that is, a partially directed graph G (see Figure 1). The only connections left without a direction were maritalstatus – sex and brcsme – caerep – visionproblems.

Figure 1: Pattern obtained when the GES algorithm performs the search on the data.

It is also possible to obtain a DAG from this pattern: the connections whose direction was not defined in Figure 1 are marked with yellow arrows. The resulting graph G' extends G because i) G and G' have the same adjacencies and ii) if A → B is in G then A → B is in G'. Moreover, G' is a consistent DAG extension of G because i) G' extends G, ii) G' is a directed acyclic graph and iii) pattern(G) = pattern(G') (Meek, 1995). The model obtained in this way (from the original data only) we call model 1.

We then performed another search with the GES algorithm, taking the data together with prior knowledge (Heckerman et al., 1995), which was entered as tiers ordering the variables maritalstatus, sex, brcsme, caerep and visionproblems. The order introduced was the same as in the first search, i.e., the yellow directions marked in Figure 1. Another DAG was obtained, which we call model 2 (see Figure 2). The only difference between the two models is that the arc connecting caeafn with worklevel (present in model 1) has been removed in model 2; as a consequence, the degrees of freedom increase by one unit (Glymour et al., 1986).

Figure 2: Pattern obtained when the GES algorithm performs the search on the data with prior knowledge.

We also considered a model 3, obtained by applying GES to the data with prior knowledge given by the connections obtained with the PC algorithm, entered as tiers. We proceeded in the same way with PCD, CPC, JPC and JCPC, obtaining model 4, model 5, model 6 and model 7 respectively. A summary of the results is given in Table 2.

Table 2: GES applied to the data (model 1), to the data with prior knowledge (model 2), and with prior knowledge obtained from PC, PCD, CPC, JPC and JCPC (models 3, 4, 5, 6 and 7 respectively).

Model  Search  df   χ²        p-value
1      GES     200  195.3489  0.5796
2      GES     201  201.0426  0.4859
3      GES     199  192.754   0.6114
4      GES     199  192.754   0.6114
5      GES     200  198.1167  0.5244
6      GES     199  194.5962  0.5749
7      GES     199  194.5962  0.5749

In order to compare the models of Table 2 we proceeded as in (Bollen, 1989), computing the differences of the χ² statistics and the corresponding p-values; the results are shown in Table 3, and the computation is sketched below.

Table 3: Comparisons among the GES models.

Model comparison    χ²      df  p-value
model 2 versus 1    5.6937  1   0.9829740
model 2 versus 3    8.2886  2   0.9841455
model 1 versus 3    2.5949  1   0.8927918
model 2 versus 4    8.2886  2   0.9841455
model 1 versus 4    2.5949  1   0.8927918
model 2 versus 6    6.4464  2   0.9601726
model 1 versus 6    0.7527  1   0.6143773
model 5 versus 6    3.5205  1   0.9393858
model 2 versus 5    2.5949  1   0.8927918
model 5 versus 4    5.3627  1   0.9794281
model 5 versus 3    5.3627  1   0.9794281

The model comparisons single out model 2 as the model of choice. Model 2 was compared to an unconstrained model (model 1) and to several constrained models (models 3, 4 and 5); model 7 was not considered in the comparison because it gave the same results as model 6. Among the small number of models considered, model 2 has strong statistical support, although many other models could be constructed.
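The arithmetic behind Table 3 is a one-liner in Python (a sketch, assuming, as the table's values indicate, that the reported p-value is the chi-square CDF of the difference statistic rather than the usual upper tail):

```python
# Reproduce the first row of Table 3: model 2 versus model 1.
from scipy.stats import chi2

chi2_m2, df_m2 = 201.0426, 201  # model 2 (Table 2)
chi2_m1, df_m1 = 195.3489, 200  # model 1 (Table 2)

d_chi2 = chi2_m2 - chi2_m1      # 5.6937
d_df = df_m2 - df_m1            # 1

print(round(chi2.cdf(d_chi2, d_df), 7))  # ~0.982974, as reported in Table 3
print(round(chi2.sf(d_chi2, d_df), 7))   # ~0.017026, the usual upper-tail p-value
```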
4 Learning the parameters

In a BN the conditional probability distributions are called the parameters. A prior probability, in a particular model, is the probability of some event before the probability of that event is updated, within the framework of that model, using new information. A posterior probability is the probability of an event after its prior probability has been updated, within the framework of some model, on the basis of new information.

The parameters were obtained using the estimator box of TETRAD IV. A Dirichlet estimation was performed, which estimates a Bayes instantiated model using a Dirichlet distribution for each category. The probability of each value of a variable (conditional on the values of the variable's parents) is estimated by adding, to a prior pseudocount (which is 1 by default; see Table 4), the number of cases in which the variable takes that value for each parent configuration, and then dividing by the total number of cases in the prior and in the data with that configuration of parent values (see Table 5).

Table 4: Dirichlet parameters (pseudocounts) for the variable a = age conditional on the combinations of its parent values, with the total pseudocount (t.c.) of each row shown; s and b denote the variables selfratedhealth and barthelindex respectively, and d. = dependent, i. = independent.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  1.0000    1.0000    2.0000
bad   i.  1.0000    1.0000    2.0000
good  d.  1.0000    1.0000    2.0000
good  i.  1.0000    1.0000    2.0000

Table 5: Expected values of the prior probabilities for the variable a = age conditional on the combinations of its parent values (with the total pseudocount of each row shown), where s and b denote the variables selfratedhealth and barthelindex respectively.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  0.5000    0.5000    2.0000
bad   i.  0.5000    0.5000    2.0000
good  d.  0.5000    0.5000    2.0000
good  i.  0.5000    0.5000    2.0000

Using a Dirichlet distribution with parameters α_1, α_2, ..., α_k, the prior probabilities are (Neapolitan, 2004):

    p(x_1) = α_1 / m,   p(x_2) = α_2 / m,   ...,   p(x_k) = α_k / m,        (1)

where m = α_1 + ... + α_k. After seeing x_1 occur s_1 times, x_2 occur s_2 times, ..., and x_k occur s_k times in n = s_1 + ... + s_k trials, the posterior probabilities are:

    P(x_i | s_1, ..., s_k) = (α_i + s_i) / (m + n),   i = 1, ..., k.        (2)

The numbers α_1, α_2, ..., α_k usually encode our experience of having seen the first outcome occur α_1 times, the second outcome occur α_2 times, ..., and the last outcome occur α_k times. These α_i are the pseudocounts of TETRAD IV. Several possibilities were considered: i) α_1 = α_2 = ... = α_k = 1, when we have no knowledge at all concerning the relative frequencies; ii) α_1 = α_2 = ... = α_k < 1, when it is believed that the relative frequency of the i-th value is around α_i/N; iii) α_1 = α_2 = ... = α_k > 1, when someone wants to impose his or her relative frequencies on the system. We obtained better classification results with pseudocount = 1 (case i) than in the other cases. Another possibility was to express prior indifference with a prior equivalent sample size (Neapolitan, 2004): the variables in our problem have two or three values, so variables with two values were assigned a pseudocount of 1.5 per value and variables with three values a pseudocount of 1. The probabilities were slightly different, but again the depression variable, for example, classified worse.

From the pseudocounts, TETRAD IV determines the conditional probability of a category by taking the pseudocount of the category, plus the observed count, and dividing by the total count for its row (see Table 6).

Table 6: Expected values of the posterior probabilities for the variable a = age conditional on the combinations of its parent values, where s and b denote the variables selfratedhealth and barthelindex respectively.

s     b   a ≤ 76.5  a > 76.5  t.c.
bad   d.  0.2000    0.8000    5.0000
bad   i.  0.6667    0.3333    12.0000
good  d.  0.2500    0.7500    4.0000
good  i.  0.3774    0.6226    53.0000
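As a sanity check, the Dirichlet update of equation (2) can be reproduced in a few lines. The counts below are hypothetical, inferred to be consistent with the last row of Table 6 (pseudocounts of 1 and a row total of 53), not taken from the study's raw data:

```python
# Dirichlet posterior of equation (2): P(x_i | data) = (alpha_i + s_i) / (m + n).
def dirichlet_posterior(alphas, counts):
    m = sum(alphas)   # total pseudocount (prior equivalent sample size)
    n = sum(counts)   # number of observed cases for this parent configuration
    return [(a + s) / (m + n) for a, s in zip(alphas, counts)]

alphas = [1, 1]       # TETRAD IV default pseudocount per category (Table 4)
counts = [19, 32]     # hypothetical counts: 2 + 51 = 53, matching the row total

print([round(p, 4) for p in dirichlet_posterior(alphas, counts)])
# -> [0.3774, 0.6226], the last row of Table 6 (s = good, b = independent)
```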
5 Updater

An approximate updater has been implemented. The updater of TETRAD IV allows us to update the values of the other parameters given a set of predetermined parameter values in the model. There is also the possibility of using an exact updater, but the approximate one is very fast. In Figures 3 and 4 some evidence has been given to the updater: sex = woman, age > 76.5, maritalstatus = widow, worklevel = medium, deafprob = yes, caefsp = presence, caeafn = presence, caerlg = presence. The difference between the two figures is the variable brcsme: brcsme = no in Figure 3 and brcsme = yes in Figure 4. With the updater we obtained a clear variation in the depression variable: the probability of depression = yes increases to 0.75 when brcsme is set to no (see Figure 3), and decreases to 0.125 when brcsme is set to yes (see Figure 4).

Figure 3: Updater with some evidence, worklevel = medium, brcsme = no.

Figure 4: Updater with some evidence, worklevel = medium, brcsme = yes.

The blue lines, and the values listed across from them, indicate the probability that the variable takes on the given value in the input instantiated model. The red lines indicate the probability that the depression variable takes on the given value, given the evidence that we added to the updater.

6 Validation

TETRAD IV takes as input a categorical data set and a Bayes instantiated model, and for a given variable it estimates that variable's value in each case. Figure 5 shows that 80.30% of the cases are correctly classified with respect to the depression variable. Figure 6 shows the Receiver Operating Characteristic (ROC), also known as the relative operating curve: a plot of the sensitivity, or true positive rate, versus the false positive rate. The ROC curve falls within the 1 × 1 quadrant, and the area under it is used as a predictive indicator of goodness: the area under the curve (AUC) is defined as the probability of correctly classifying a randomly selected pair of cases (one positive and one negative).

Figure 5: Percentage correctly classified for depression = no.

Figure 6: ROC plot for depression = no.

In Figure 5 we can see the contingency table or confusion matrix, where TN = 47 (true negatives) is the number of correct predictions that an instance is negative, FP = 2 (false positives) is the number of incorrect predictions that an instance is positive, FN = 11 (false negatives) is the number of incorrect predictions that an instance is negative, and TP = 6 (true positives) is the number of correct predictions that an instance is positive. Thus TNR = 47/(47+2) (true negative rate, the proportion of negative cases that were classified correctly), TPR = 6/(11+6) (true positive rate, the proportion of positive cases that were correctly identified), FPR = 2/(47+2) (false positive rate, the proportion of negative cases that were incorrectly classified as positive), and FNR = 11/(11+6) (false negative rate, the proportion of positive cases that were incorrectly classified as negative). The precision P is the proportion of predicted positive cases that were correct: P = 6/(2+6) = 0.75. The accuracy for the depression variable is 80.30%: ACC = (TP+TN)/(TP+FN+FP+TN) = 53/66 = 0.8030. From these figures we can see how the BN provides a computationally efficient prediction system for the study of depression.
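All of these rates follow directly from the four confusion-matrix counts; a minimal Python check using the counts reported above:

```python
# Classification metrics for the depression variable, from the
# confusion matrix reported in Figure 5: TN = 47, FP = 2, FN = 11, TP = 6.
TN, FP, FN, TP = 47, 2, 11, 6

tnr = TN / (TN + FP)                         # true negative rate   ~0.959
tpr = TP / (TP + FN)                         # true positive rate   ~0.353 (sensitivity)
fpr = FP / (TN + FP)                         # false positive rate  ~0.041
fnr = FN / (TP + FN)                         # false negative rate  ~0.647
precision = TP / (TP + FP)                   # 6 / 8 = 0.75
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 53 / 66 ~ 0.8030

print(f"TNR={tnr:.3f} TPR={tpr:.3f} FPR={fpr:.3f} FNR={fnr:.3f}")
print(f"precision={precision:.2f} accuracy={accuracy:.4f}")
```

Note the contrast between the high accuracy and the low sensitivity (TPR ≈ 0.35): only 17 of the 66 cases are positive, so the overall accuracy is driven mostly by the negatives.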
7 Bayesian network for depression

TETRAD IV was developed to build Bayesian networks (Spirtes et al., 1993), which can then be used as expert systems derived from a database. However, TETRAD IV is not a Bayesian network program as such: it does not provide a graphical interface showing the graphical and the quantitative structure at the same time, and it cannot work with missing data, as Bayesian network tools usually can. From the parameters and the structure obtained by TETRAD IV, a Bayesian network has been implemented using Elvira, a tool for building and evaluating graphical probabilistic models whose software can be obtained for free at http://www.ia.uned.es/elvira (see Figure 7).

Figure 7: Bayesian network with 22 nodes implemented in Elvira.

The independencies of the graph are translated into the probabilistic model. The joint probability distribution of the BN of Figure 7 requires the specification of 22 conditional probability tables, one for each variable conditioned on its parent set. The joint probability distribution factorizes as a product of conditional distributions and encodes the dependency/independency structure of the DAG:

    P(x_1, ..., x_22) = ∏_{i=1}^{22} P(x_i | Pa(x_i))        (3)

Based on the magnitude and the sign of the influences (Lacave et al., 2011), the networks implemented in Elvira offer automatic coloring of links, as can be seen in Figure 8. Most of the connections are red (positive) links; undefined connections are shown in purple and negative ones in blue. The coloring and the width of the links help to detect wrong influences. There are only two exceptions in the network for depression: wais3pe → cognitiveimpairment and caeeea → depression, for obvious reasons.

Figure 8: Bayesian network with the automatic link-coloring feature of Elvira.

8 Concluding remarks

The structure and the parameters have been obtained from complete data using the software TETRAD IV; ROC curves and the confusion matrix have been taken into account. From the results obtained, a Bayesian network has been implemented using the software Elvira. Further work is oriented towards the study and evaluation of depression in the elderly using the implemented BN, which can be used to calculate new probabilities when new information (evidence) is introduced, from which some interesting conclusions could be drawn.

Acknowledgments

This work was financially supported by the UV-INV-AE11-39878 project and the MICINN TIN2009-12359 project ArtBioCom.

References

American Psychiatric Association (APA) (2002). Manual diagnóstico y estadístico de los trastornos mentales. Texto revisado (DSM-IV-TR). Barcelona: Masson.

Bollen, K.A. (1989). Structural equations with latent variables. Wiley Series in Probability and Mathematical Statistics.

Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(2):195–210.

Butz, C.J., Hua, S., Chen, J. and Yao, H. (2009). A simple graphical approach for understanding probabilistic inference in Bayesian networks. Information Sciences, 179:699–716.

Claasen, T. and Heskes, T. (2010). Learning causal network structure from multiple (in)dependence models. In Proceedings of the 5th European Workshop on Probabilistic Graphical Models, 81–89, Helsinki, Finland.

Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 137:43–90.

Chickering, D. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(3):507–554.
Druzdzel, M.J. and Glymour, C. (1995). What do college ranking data tell us about student retention: causal discovery in action. In Intelligent Information Systems IV: Proceedings of the Workshop held in Poland.

Glymour, C. (2003). The mind's arrows: Bayes nets and graphical causal models in psychology. MIT Press.

Glymour, C., Scheines, R., Spirtes, P. and Kelly, K. (1986). Discovering causal structure. Technical Report CMU-PHIL-1.

Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1:79–119.

Heckerman, D., Geiger, D. and Chickering, D.M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197–243.

Jensen, F.V. and Nielsen, T.D. (2007). Bayesian networks and decision graphs. Information Science & Statistics. Springer.

Jiménez, A.M., Gálvez Sánchez, N. and Esteban Sáiz, R. (2006). Depresión y ansiedad. In Sociedad Española de Geriatría y Gerontología (SEGG), Tratado de Geriatría para Residentes, pp. 235–239.

Lacave, C., Luque, M. and Díez, F.J. (2011). Explanation of Bayesian networks and influence diagrams in Elvira. Journal of LaTeX Class Files, 1(11):1511–1528.

Lang, G., Resch, K., Hofer, K., Braddick, F. and Gabilondo, A. (2010). La salud mental y el bienestar de las personas mayores. Hacerlo posible. Madrid: IMSERSO.

Larrañaga, P. and Moral, S. (2011). Probabilistic graphical models in artificial intelligence. Applied Soft Computing, 1511–1528.

Margaritis, D. (2003). Learning Bayesian network model structure from data. PhD thesis, Carnegie Mellon University, Technical Report CMU-CS-03-153.

Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 403–410.

Spanish Ministry of Health (Ministerio de Sanidad) (2007). Estrategia en salud mental del sistema nacional de salud, 2006. España: Ministerio de Sanidad y Consumo.

Neapolitan, R.E. (2004). Learning Bayesian networks. Prentice Hall.

Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge University Press.

Sucar, L.E. and Martínez-Arroyo, M. (1998). Interactive structural learning of Bayesian networks. Expert Systems with Applications, 15:325–332.

Scheines, R., Spirtes, P., Glymour, C., Meek, C. and Richardson, T. (1998). The TETRAD project: constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117.

Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, prediction and search. Springer-Verlag.