Boston University
OpenBU, http://open.bu.edu
Cognitive & Neural Systems, CAS/CNS Technical Reports
1998-06

Outcome Prediction for Unipolar Depression
Rubin, Mark A.
Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems
http://hdl.handle.net/2144/2354
Boston University

Mark A. Rubin, Michael A. Cohen, Joanne S. Luciano, and Jacqueline A. Samson
June 1998
Technical Report CAS/CNS-98-020

Permission to copy without fee all or part of this material is granted provided that: 1. the copies are not made or distributed for direct commercial advantage; 2. the report title, author, document number, and release date appear, and notice is given that copying is by permission of the BOSTON UNIVERSITY CENTER FOR ADAPTIVE SYSTEMS AND DEPARTMENT OF COGNITIVE AND NEURAL SYSTEMS. To copy otherwise, or to republish, requires a fee and/or special permission.

Copyright © 1998 Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems, 677 Beacon Street, Boston, MA 02215

Mark A. Rubin (1,a), Michael A. Cohen (1,b), Joanne S. Luciano (2,c), Jacqueline A. Samson (3)

(1) Department of Cognitive and Neural Systems, Boston University, 677 Beacon Street, Boston, MA 02215 USA. (a) rubin@cns.bu.edu, (b) mike@cns.bu.edu
(2) Neural Systems Group, Psychiatry Department, Harvard Medical School, Massachusetts General Hospital, 149 13th Street, Charlestown, MA 02129 USA. (c) jluciano@nmr.mgh.harvard.edu
(3) Harvard Medical School, Depression Research Facility, McLean Hospital, Belmont, MA 02178 USA

Abstract

Although effective drug and non-drug treatments for unipolar depressive illness exist, different individuals respond differently to different treatments. It is not uncommon for a given patient to be switched several times from one treatment to another until an effective remedy for that particular patient is found.
This process is costly in terms of time, money and suffering. It is thus desirable to determine at the outset the likely response of a patient to the available treatments, so that the optimal one can be selected. Although prior attempts at outcome prediction with linear regression models have failed, recent work on this problem has indicated that the nonlinear predictive techniques of backpropagation and quadratic regression can account for a significant proportion of the variance in the data. The present research applies the nonlinear predictive technique of kernel regression to this problem, and employs cross-validation to test the ability of the resulting model to extract, from extremely noisy clinical data, information with predictive value. The importance of comparison with a suitable null hypothesis is illustrated.

1 Introduction

Depression, a serious and sometimes fatal illness, was estimated to affect six percent of the US population in 1990, at an economic cost of 44 billion dollars (Greenberg et al., 1993). A more recent study finds the prevalence of depression to be at an even higher rate, ten percent of the US population in a given year (Kessler et al., 1994).

Since different patients respond differently to different treatments, an efficient approach to treatment is to select at the outset the optimal treatment for a given patient out of several potential treatments (Luciano et al., 1994; Luciano, 1996), and to do so using information readily available to the clinician at intake, such as scores on the Hamilton Depression Rating Scale (Hamilton, 1960). Such optimal treatment selection is possible provided that, from this information, it is possible to predict with sufficient accuracy the respective outcomes of treatment with several modalities and then select the treatment with the best predicted outcome.
Although prior attempts at outcome prediction with linear regression models have failed (Joyce and Paykel, 1989), recent work on this problem utilizing the multilayer perceptron and quadratic regression techniques (Luciano, 1996; Luciano, Cohen and Samson, 1996, 1997) has indicated that these nonlinear predictive models can account for a significant proportion of the variance in the data. The present work applies the nonlinear technique of kernel regression to the prediction of treatment outcome for depression, and utilizes cross-validation to evaluate its effectiveness at this task.

2 The data and the variables

The data employed in this study were obtained from clinical research studies on depression that were previously conducted at the Depression Research Facility, McLean Hospital, in Belmont, Massachusetts by Dr. Jacqueline A. Samson, Dr. Joseph J. Schildkraut, Dr. John J. Mooney, and Dr. Alan Schatzberg. Data from the treatment of 99 patients are used. Of these patients, 49 were treated with the drug desipramine, 37 with the drug fluoxetine, and 13 with cognitive-behavior therapy. Differences in response rates between the three groups are not statistically significant (Luciano, 1996).

The Hamilton Depression Rating Scale consists of 21 items, quantifying such symptoms as depressed mood, feelings of guilt, sleep disturbance, etc. Each item takes an integer value in either a 0-2 or a 0-4 range; the "severity," the sum of all 21 Hamilton scores, ranges from 0-65. Based on clinical experience, the initial values of seven "symptom factors" (Hamilton items 1, 4, and 7, and the pairwise averages of items 2 and 3, 5 and 6, and 8 and 13) as well as the initial severity are taken to describe the initial state of the patient. A three-component binary-valued "treatment flag" is appended to this set of variables to represent the treatment given the patient. The outcome of the treatment, the quantity to be predicted by the trained
network, is taken to be the fractional change in severity.

3 Backpropagation and quadratic regression models

Multiple linear regression, backpropagation, and quadratic regression models were fitted to the data of the 99 patients, and the proportion of variance accounted for by the models, Pearson's $r^2$, was used to evaluate their performance. Multiple linear regression accounted for 14% of the variance; based on the F-statistic, this was not significant. The nonlinear methods of backpropagation (multilayer perceptron) and quadratic regression accounted, respectively, for 44% and 41% of the variance. These results were significant at a level of p < 0.0005, determined by a maximum likelihood method of Rao (1973, pp. 420-422) as adapted by Cohen (1995). For details, see Luciano (1996).

4 Cross-validation and kernel regression (GRNN)

Although the nonlinear models described above seem to predict outcome in a statistically significant manner, the small size of the data set, as well as the lack of separation between training and test data, still leaves open the possibility that, in spite of the best efforts to avoid it, overfitting has occurred. The small size of the data set also precludes the option of training the models on a subset of the data and testing on the rest. We are thus led naturally to the method of leave-one-out cross-validation (Efron and Tibshirani, 1993; Masters, 1993). That is, we train 99 different predictive models, each one on 98 out of the 99 patients, with a different patient left out each time. Each model is then used to predict the outcome only of the patient whose data was left out in the training of that model. In this way, information concerning the patient whose treatment outcome is to be predicted is never used in the training of the model doing the predicting, thus providing a completely "blind" test of the model's predictive power.
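The leave-one-out loop just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`leave_one_out_predictions`, `fit_linear`, `predict_linear`) are hypothetical, the least-squares model is a placeholder rather than one of the models used in the report, and the data are synthetic.

```python
import numpy as np

def leave_one_out_predictions(X, y, fit, predict):
    """Return one held-out prediction per row of X."""
    n = len(y)
    preds = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j          # mask that drops patient j
        model = fit(X[keep], y[keep])     # train on the other n - 1 patients
        preds[j] = predict(model, X[j])   # predict only the held-out patient
    return preds

# Placeholder model (an assumption): ordinary least squares with an intercept.
def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, x):
    return coef[0] + x @ coef[1:]

# Synthetic, noise-free stand-in data; not the clinical data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
loo = leave_one_out_predictions(X, y, fit_linear, predict_linear)
```

Because each prediction comes from a model that never saw the held-out row, any agreement between `loo` and `y` reflects structure shared across patients rather than memorization of a single case.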
(By leaving out only a single patient at a time we construct models which should have nearly as much predictive power as a model trained on the entire data set.)

Clearly this sort of cross-validation is a time-consuming procedure. On the other hand, since the cross-validation serves to guard against overfitting, we are free to choose a predictive model which can be trained and tested relatively quickly, even if it involves a relatively large number of parameters.

The model we use is the "general regression neural network" (GRNN) (Specht, 1990; Schiøler and Hartmann, 1992). This model derives from the Parzen window estimator for probability density, modified to perform function estimation. The GRNN associates to each vector in its training set a Gaussian kernel. In the variant of GRNN employed here (Masters, 1995), the covariance matrices of the kernels are diagonal and the same for all training vectors. The GRNN prediction for the outcome $O$ of a patient described by the $n$-component input vector $\vec{v}$ (symptom factors, initial severity and treatment flag), based on training with $N_{trn}$ patients, is

$$O = \frac{\sum_{j=1}^{N_{trn}} O_j K_j}{\sum_{j=1}^{N_{trn}} K_j} \qquad (1)$$

where

$$K_j = \exp\left(-(\vec{v}_j - \vec{v})^T \Sigma^{-1} (\vec{v}_j - \vec{v})/2\right), \qquad (2)$$

$$\Sigma = \mathrm{diag}(\nu_1, \nu_2, \ldots, \nu_n), \qquad (3)$$

and $O_j$ and $\vec{v}_j$ are, respectively, the outcome and input vector associated with the $j$th training patient. The values of the $\nu_i$'s are determined so as to minimize an error criterion which itself is the average square leave-one-out predictive error over the training set. Despite the fact that finding the optimal values involves a conjugate-gradient minimization, GRNN still provides a great savings in time over backpropagation.

5 Performance evaluation

We evaluate the predictive performance of the GRNN using two measures of the proportion of variance in the data accounted for by the model.
The first of these measures is

$$R^2 = 1 - \frac{\sum_j (a_j - p_j)^2}{\sum_j (a_j - \bar{a})^2}. \qquad (4)$$

Here $a_j$ is the actual outcome of the $j$th test patient, $p_j$ is the prediction of the model for the outcome of the $j$th test patient, and $\bar{a}$ is the average actual outcome. This measure is applicable to linear and nonlinear models; it is generally close to Pearson's $r^2$ and is equal to it in the case of fitting with a linear model.

The measure in equation (4) above is subject to the criticism that it "rewards" the model simply for learning the average outcome of those patients receiving a given treatment. We therefore also employ an alternative measure,

$$\tilde{R}^2 = 1 - \frac{\sum_j (a_j - p_j)^2}{\sum_j (a_j - \bar{a}^{[j]})^2}. \qquad (5)$$

Here $\bar{a}^{[j]}$ denotes the average outcome of all those patients who received the same treatment as received by patient $j$. As will be seen, the two measures gave similar results; those for $\tilde{R}^2$ are given in parentheses below.

Note that when cross-validation is employed, each prediction $p_j$ is actually being made by a different predictive model; i.e., the one trained on data from which patient $j$ was excluded.

6 Application of GRNN to depression data

We first train GRNN on the entire 99-patient data set, using the seven symptom factors, initial severity and treatment flag as the 11-component input vector. Both the input variables and the output variable (final severity) are standardized (converted to z-scores). In addition, the outcome of treatment to be predicted by the model is taken to be the actual final severity, rather than the fractional change in severity used in the previous studies. (Results without these data-preprocessing modifications are inferior to those reported below.) The proportion of variance which the model accounts for, without cross-validation, is 57% (55%). However, when leave-one-out cross-validation is employed, $R^2$ ($\tilde{R}^2$) drops to 9% (5%); much smaller, to be sure, but still nonzero.
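As a rough illustration of the pipeline in Sections 4 to 6, the sketch below implements the kernel-weighted prediction of equations (1) to (3), wraps it in leave-one-out cross-validation, and computes the two proportion-of-variance measures. Several things here are assumptions rather than the report's method: the measures are taken to have the usual one-minus-residual-ratio form, the kernel variances (called `nu` here) are fixed at 1 instead of being tuned by conjugate-gradient minimization, all function names are hypothetical, and the data are random numbers rather than patient records.

```python
import numpy as np

def grnn_predict(X_train, y_train, x, nu):
    """Kernel-weighted average of training outcomes (equation (1))."""
    diff = X_train - x                               # v_j - v for each training vector
    K = np.exp(-0.5 * np.sum(diff**2 / nu, axis=1))  # Gaussian kernel, shared diagonal covariance
    return np.sum(K * y_train) / np.sum(K)

def loo_grnn(X, y, nu):
    """Leave-one-out predictions: patient j is excluded when predicting patient j."""
    n = len(y)
    preds = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j
        preds[j] = grnn_predict(X[keep], y[keep], X[j], nu)
    return preds

def r2_overall(a, p):
    """One minus squared error over variation about the overall mean outcome."""
    return 1.0 - np.sum((a - p)**2) / np.sum((a - a.mean())**2)

def r2_within_treatment(a, p, treatment):
    """Same ratio, but the baseline is the mean outcome of each patient's own treatment group."""
    group_mean = np.empty_like(a)
    for t in np.unique(treatment):
        mask = treatment == t
        group_mean[mask] = a[mask].mean()
    return 1.0 - np.sum((a - p)**2) / np.sum((a - group_mean)**2)

# Synthetic stand-in: 99 rows, 8 continuous inputs plus a 3-component treatment flag.
rng = np.random.default_rng(0)
n = 99
treatment = rng.integers(0, 3, size=n)
X = np.column_stack([rng.normal(size=(n, 8)), np.eye(3)[treatment]])
y = rng.normal(size=n)                       # outcome unrelated to the inputs
X = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score the inputs
y = (y - y.mean()) / y.std()                 # z-score the outcome
nu = np.ones(X.shape[1])                     # fixed kernel variances (an assumption)
p = loo_grnn(X, y, nu)
print(r2_overall(y, p), r2_within_treatment(y, p, treatment))
```

Both measures can be negative when the cross-validated predictions are worse than the corresponding baseline mean, which is why they are not interchangeable with a squared correlation coefficient.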
To assess the significance of this small level of predictive power, we compare it with the proportion of variance obtained under the null hypothesis that the data in fact contains no information of any predictive value. Specifically, we train and test GRNN, with cross-validation, on randomly-constructed "data sets." The first random data set consists of values for each variable chosen uniformly over that variable's allowed range. The cross-validated proportion of variance is negative, suggesting that the small positive effect found in the actual data is meaningful.

We next construct a random data set in which the value of each variable has a Gaussian distribution, with a mean and variance determined by the mean and variance found in the real data set. When $R^2$ ($\tilde{R}^2$) is computed, with cross-validation, for GRNN trained and tested on this data set, the result is 8% (4%). Since the proportion of variance accounted for by the model trained on this random data is so close to that accounted for with the actual data, we cannot consider the figure of 9% (5%) proportion of variance on the actual data to be meaningful.

7 Discussion

The above results indicate that, to the extent that the data set we are using contains information of predictive value, GRNN is not extracting it. Perhaps of even greater interest is the observation that cross-validation by itself is not necessarily sufficient to unmask an overfit disguised as an apparently positive result. As in any test of an hypothesis, comparison with a null hypothesis must be performed. Furthermore, different null hypotheses may give different results, and it is critical to devise ones which are similar enough in structure to the alternative hypothesis to provide sufficiently stringent tests. The negative GRNN results make even more imperative the evaluation of the previous backpropagation and quadratic regression findings in a similar manner.
These tests, on the same and extended data sets, are currently in progress.

Acknowledgements

This research was supported in part by a grant from the Office of Naval Research, ONR N00014-95-1-0409 (M. A. R.).

References

Cohen, M. A. (1995) Unpublished.

Greenberg, P. E., Stiglin, L. E., Finkelstein, S. N. and Berndt, E. R. (1993) The economic burden of depression in 1990. Journal of Clinical Psychiatry, 54, pp. 405-418.

Hamilton, M. (1960) A rating scale for depression. Journal of Neurological and Neurosurgical Psychiatry, 23, pp. 56-62.

Joyce, P. M. and Paykel, E. (1989) Predictors of drug response in depression. Archives of General Psychiatry, 46, pp. 89-99.

Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshleman, S., Wittchen, H.-U. and Kendler, K. S. (1994) Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States: Results from the national comorbidity study. Archives of General Psychiatry, 51, pp. 8-19.

Luciano, J. S. (1996) Neural network modeling of unipolar depression: Patterns of recovery and prediction outcome. PhD thesis, Department of Cognitive and Neural Systems, Boston University, Boston, MA.

Luciano, J. S., Cohen, M. A. and Samson, J. A. (1996) Neural network modeling of unipolar depression, in Reggia, J., Ruppin, E. and Berndt, R., Neural Modeling of Brain and Cognitive Disorders, World Scientific Publishing Company, pp. 469-483.

Luciano, J. S., Cohen, M. A. and Samson, J. A. (1997) Predictive medicine: initial symptoms may determine outcome in clinically treated depressions, in ICNN '97: Proceedings of the International Conference on Neural Networks, Houston, June 1997, vol. I, pp. 71-80.

Luciano, J. S., Cohen, M. A., Samson, J. A. and Hagan, P. (1994) A prototype and nonlinear neural network approach to subtyping depressive illness by initial input data and treatment outcome, in Proceedings of the Irish Neural Network Conference 1994, Dublin, Ireland.

Masters, T. (1993) Practical Neural Network Recipes in C++, Academic Press, Inc.

Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, John Wiley and Sons, Inc.

Rao, C. R. (1973) Linear Statistical Inference and its Applications, 2nd ed. Wiley-Interscience.

Schiøler, H., and Hartmann, U. (1992) Mapping neural network derived from the Parzen window estimator, Neural Networks, 5, pp. 903-909.

Specht, D. (1990) A general regression neural network, IEEE Transactions on Neural Networks, 2, pp. 568-576.