Outcome Prediction for Unipolar Depression

Boston University
OpenBU
http://open.bu.edu
Cognitive & Neural Systems
CAS/CNS Technical Reports
1998-06
Rubin, Mark A.
Boston University Center for Adaptive Systems and Department of Cognitive and
Neural Systems
http://hdl.handle.net/2144/2354
Mark A. Rubin, Michael A. Cohen, Joanne S. Luciano,
and Jacqueline A. Samson
June 1998
Technical Report CAS/CNS-98-020
Permission to copy without fee all or part of this material is granted provided that: 1. The copies are not
made or distributed for direct commercial advantage; 2. the report title, author, document number, and
release date appear, and notice is given that copying is by permission of the BOSTON UNIVERSITY
CENTER FOR ADAPTIVE SYSTEMS AND DEPARTMENT OF COGNITIVE AND NEURAL
SYSTEMS. To copy otherwise, or to republish, requires a fee and/or special permission.
Copyright © 1998
Boston University Center for Adaptive Systems and
Department of Cognitive and Neural Systems
677 Beacon Street
Boston, MA 02215
Mark A. Rubin,1,a Michael A. Cohen,1,b Joanne S. Luciano,2,c Jacqueline A. Samson3
1 Department of Cognitive and Neural Systems, Boston University
677 Beacon Street, Boston, MA 02215 USA
a rubin@cns.bu.edu, b mike@cns.bu.edu
2 Neural Systems Group, Psychiatry Department, Harvard Medical School
Massachusetts General Hospital, 149 13th Street, Charlestown, MA 02129 USA
c jluciano@nmr.mgh.harvard.edu
3 Harvard Medical School, Depression Research Facility, McLean Hospital
Belmont, MA 02178 USA
Abstract
Although effective drug and non-drug treatments for unipolar depressive illness
exist, different individuals respond differently to different treatments. It is not
uncommon for a given patient to be switched several times from one treatment
to another until an effective remedy for that particular patient is found. This
process is costly in terms of time, money and suffering. It is thus desirable
to determine at the outset the likely response of a patient to the available
treatments, so that the optimal one can be selected. Although prior attempts
at outcome prediction with linear regression models have failed, recent work
on this problem has indicated that the nonlinear predictive techniques of
backpropagation and quadratic regression can account for a significant proportion of
the variance in the data. The present research applies the nonlinear predictive
technique of kernel regression to this problem, and employs cross-validation to
test the ability of the resulting model to extract, from extremely noisy clinical
data, information with predictive value. The importance of comparison with a
suitable null hypothesis is illustrated.
1 Introduction
Depression, a serious and sometimes fatal illness, was estimated to affect six percent of the
US population in 1990, at an economic cost of 44 billion dollars (Greenberg et al., 1993).
A more recent study finds the prevalence of depression to be at an even higher rate, ten
percent of the US population in a given year (Kessler et al., 1994).
Since different patients respond differently to different treatments, an efficient approach
to treatment is to select at the outset the optimal treatment for a given patient out of
several potential treatments (Luciano et al., 1994; Luciano, 1996), and to do so using
information readily available to the clinician at intake, such as scores on the Hamilton
Depression Rating Scale (Hamilton, 1960). Such optimal treatment selection is possible
provided that, from this information, it is possible to predict with sufficient accuracy the
respective outcomes of treatment with several modalities and then select the treatment
with the best predicted outcome.
Although prior attempts at outcome prediction with linear regression models have failed
(Joyce and Paykel, 1989), recent work on this problem utilizing the multilayer perceptron
and quadratic regression techniques (Luciano, 1996; Luciano, Cohen and Samson, 1996,
1997) has indicated that these nonlinear predictive models can account for a significant
proportion of the variance in the data. The present work applies the nonlinear technique
of kernel regression to the prediction of treatment outcome for depression, and utilizes
cross-validation to evaluate its effectiveness at this task.
2 The data and the variables
The data employed in this study were obtained from clinical research studies on depression
that were previously conducted at the Depression Research Facility, McLean Hospital, in
Belmont, Massachusetts by Dr. Jacqueline A. Samson, Dr. Joseph J. Schildkraut, Dr. John
J. Mooney, and Dr. Alan Schatzberg. Data from the treatment of 99 patients are used.
Of these patients, 49 were treated with the drug desipramine, 37 with the drug fluoxetine,
and 13 with cognitive-behavior therapy. Differences in response rates between the three
groups are not statistically significant (Luciano, 1996).
The Hamilton Depression Rating Scale consists of 21 items, quantifying such symptoms
as depressed mood, feelings of guilt, sleep disturbance, etc. Each item takes an integer value
in either a 0-2 or a 0-4 range; the "severity," the sum of all 21 Hamilton scores, ranges
from 0-65. Based on clinical experience, the initial values of seven "symptom factors"
(Hamilton items 1, 4, and 7, and the pairwise averages of items 2 and 3, 5 and 6, and 8
and 13), as well as the initial severity, are taken to describe the initial state of the patient.
A three-component binary-valued "treatment flag" is appended to this set of variables to
represent the treatment given the patient. The outcome of the treatment, the quantity to
be predicted by the trained network, is taken to be the fractional change in severity.
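As a rough illustration of how such an input vector might be assembled, the following Python sketch builds the symptom factors, the initial severity and a one-hot treatment flag from 21 item scores. Several details are assumptions rather than facts from the study: the item lists are taken from the text as read (which names six quantities, so the sketch yields ten components rather than eleven), the treatment labels and one-hot ordering are hypothetical, and the sign convention of `fractional_change` (positive meaning improvement) is a guess.

```python
import numpy as np

SINGLE_ITEMS = (1, 4, 7)                   # items used directly (as read from the text)
PAIRED_ITEMS = ((2, 3), (5, 6), (8, 13))   # items averaged pairwise
TREATMENTS = ("desipramine", "fluoxetine", "cbt")  # hypothetical labels and ordering


def build_input(hamilton, treatment):
    """Return symptom factors, initial severity and treatment flag for one patient."""
    h = np.asarray(hamilton, dtype=float)
    if h.shape != (21,):
        raise ValueError("expected 21 Hamilton item scores")
    if treatment not in TREATMENTS:
        raise ValueError(f"unknown treatment: {treatment!r}")
    singles = [h[k - 1] for k in SINGLE_ITEMS]                    # 1-based item numbers
    pairs = [(h[a - 1] + h[b - 1]) / 2 for a, b in PAIRED_ITEMS]
    severity = h.sum()                                            # sum of all 21 items
    flag = [1.0 if treatment == t else 0.0 for t in TREATMENTS]   # one-hot treatment flag
    return np.array(singles + pairs + [severity] + flag)


def fractional_change(initial_severity, final_severity):
    # Assumed convention: positive values mean the severity decreased.
    return (initial_severity - final_severity) / initial_severity
```

Keeping the item lists as module-level constants makes it easy to substitute a different factor definition without touching the function body.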
3 Backpropagation and quadratic regression models
Multiple linear regression, backpropagation, and quadratic regression models were fitted
to the data of the 99 patients, and the proportion of variance accounted for by the models,
Pearson's $r^2$, was used to evaluate their performance. Multiple linear regression accounted
for 14% of the variance; based on the F-statistic, this was not significant. The nonlinear
methods of backpropagation (multilayer perceptron) and quadratic regression accounted,
respectively, for 44% and 41% of the variance. These results were significant at a level of
p < 0.0005, determined by a maximum likelihood method of Rao (1973, pp. 420-422) as
adapted by Cohen (1995). For details, see Luciano (1996).
4 Cross-validation and kernel regression (GRNN)
Although the nonlinear models described above seem to predict outcome in a statistically
significant manner, the small size of the data set, as well as the lack of separation between
training and test data, still leaves open the possibility that, in spite of the best efforts
to avoid it, overfitting has occurred. The small size of the data set also precludes the
option of training the models on a subset of the data and testing on the rest. We are thus
led naturally to the method of leave-one-out cross-validation (Efron and Tibshirani, 1993;
Masters, 1993). That is, we train 99 different predictive models, each one on 98 out of the
99 patients, with a different patient left out each time. Each model is then used to predict
the outcome only of the patient whose data was left out in the training of that model. In
this way, information concerning the patient whose treatment outcome is to be predicted
is never used in the training of the model doing the predicting, thus providing a completely
"blind" test of the model's predictive power. (By leaving out only a single patient at a
time we construct models which should have nearly as much predictive power as a model
trained on the entire data set.)
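The leave-one-out procedure described above can be sketched generically in Python. This is an illustrative outline, not the code used in the study: the `fit` and `predict` callables are hypothetical placeholders for whatever model is being evaluated, and the least-squares model below is only a stand-in so that the example runs on its own.

```python
import numpy as np


def leave_one_out_predictions(X, y, fit, predict):
    """Predict each row's outcome from a model trained on all the other rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i           # mask that drops patient i
        model = fit(X[keep], y[keep])      # patient i is never seen in training
        preds[i] = predict(model, X[i])    # model i predicts patient i only
    return preds


# Stand-in model: ordinary least squares with an intercept.
def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, x):
    return coef[0] + x @ coef[1:]
```

With 99 patients the loop trains 99 models, each on 98 rows, which is why a model that trains quickly is attractive for this kind of evaluation.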
Clearly this sort of cross-validation is a time-consuming procedure. On the other hand,
since the cross-validation serves to guard against overfitting, we are free to choose a
predictive model which can be trained and tested relatively quickly, even if it involves a relatively
large number of parameters. The model we use is the "general regression neural network"
(GRNN) (Specht, 1990; Schiøler and Hartmann, 1992). This model derives from the Parzen
window estimator for probability density, modified to perform function estimation.
The GRNN associates to each vector in its training set a Gaussian kernel. In the
variant of GRNN employed here (Masters, 1995), the covariance matrices of the kernels
are diagonal and the same for all training vectors. The GRNN prediction for the outcome
$O$ of a patient described by the $n$-component input vector $\vec{v}$ (symptom factors, initial
severity and treatment flag), based on training with $N_{\mathrm{tr}}$ patients, is

$$O = \frac{\sum_{j=1}^{N_{\mathrm{tr}}} O_j K_j}{\sum_{j=1}^{N_{\mathrm{tr}}} K_j}, \qquad (1)$$

where

$$K_j = \exp\left(-(\vec{v}_j - \vec{v})^T \Sigma^{-1} (\vec{v}_j - \vec{v})/2\right), \qquad (2)$$

$$\Sigma = \mathrm{diag}(v_1, v_2, \ldots, v_n), \qquad (3)$$

and $O_j$ and $\vec{v}_j$ are, respectively, the outcome and input vector associated with the $j$th training
patient.
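Equations (1)-(3) amount to a kernel-weighted average of the training outcomes. The following Python sketch shows one way to compute it, under the reading that the kernel is a Gaussian with a diagonal covariance shared by all training vectors. The function name and argument layout are illustrative choices, and the subtraction of the largest exponent is an added numerical safeguard that does not change the ratio in equation (1).

```python
import numpy as np


def grnn_predict(x, X_train, y_train, variances):
    """Kernel-weighted average of training outcomes, as in equation (1)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    diff = X_train - np.asarray(x, dtype=float)       # row j holds v_j - v
    # Exponent of equation (2) with the diagonal Sigma of equation (3).
    exponent = -0.5 * np.sum(diff**2 / np.asarray(variances, dtype=float), axis=1)
    exponent -= exponent.max()    # common factor; cancels between numerator and denominator
    k = np.exp(exponent)
    return float(k @ y_train / k.sum())
```

For example, a query point midway between two training points with outcomes 0 and 2 receives equal kernel weights, so the prediction is their mean.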
The values of the $v_i$ (the diagonal entries of $\Sigma$) are determined so as to minimize an error
criterion, which is itself the average squared leave-one-out predictive error over the training set.
Despite the fact that finding the optimal values involves a conjugate-gradient minimization, GRNN still
provides a great savings in time over backpropagation.
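The training criterion can be evaluated without refitting anything, because leaving a patient out of a GRNN only removes that patient's own kernel from the sums. The Python sketch below computes the mean squared leave-one-out error in this way. As a deliberately simplified stand-in for the conjugate-gradient minimization mentioned in the text, it then picks one shared variance from a list of candidates; the function names, the candidate grid and the fallback for fully underflowed kernels are all assumptions made for illustration.

```python
import numpy as np


def grnn_loo_error(X, y, variances):
    """Mean squared leave-one-out error of a GRNN with diagonal kernel variances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = X[:, None, :] - X[None, :, :]                  # all pairwise differences
    K = np.exp(-0.5 * np.sum(diff**2 / np.asarray(variances, dtype=float), axis=2))
    np.fill_diagonal(K, 0.0)                              # drop each patient's own kernel
    denom = K.sum(axis=1)
    # If every remaining kernel underflows to zero, use the mean of the other outcomes.
    fallback = (y.sum() - y) / (len(y) - 1)
    preds = np.where(denom > 0, K @ y / np.where(denom > 0, denom, 1.0), fallback)
    return float(np.mean((preds - y) ** 2))


def fit_shared_variance(X, y, candidates):
    """Pick the single shared variance with the lowest leave-one-out error."""
    n_features = np.asarray(X).shape[1]
    errors = [grnn_loo_error(X, y, np.full(n_features, c)) for c in candidates]
    return candidates[int(np.argmin(errors))]
```

A per-variable search, or a gradient-based optimizer applied to `grnn_loo_error`, would be closer to the procedure the text describes; the grid is used here only to keep the example short.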
5 Performance evaluation
We evaluate the predictive performance of the GRNN using two measures of the proportion
of variance in the data accounted for by the model. The first of these measures is

$$R^2 = 1 - \frac{\sum_j (a_j - p_j)^2}{\sum_j (a_j - \bar{a})^2}. \qquad (4)$$

Here $a_j$ is the actual outcome of the $j$th test patient, $p_j$ is the prediction of the model for
the outcome of the $j$th test patient, and $\bar{a}$ is the average actual outcome. This measure is
applicable to linear and nonlinear models; it is generally close to Pearson's $r^2$ and is equal
to it in the case of fitting with a linear model.
The measure in equation (4) above is subject to the criticism that it "rewards" the model
simply for learning the average outcome of those patients receiving a given treatment. We
therefore also employ an alternative measure,

$$\tilde{R}^2 = 1 - \frac{\sum_j (a_j - p_j)^2}{\sum_j (a_j - \bar{a}^{[j]})^2}. \qquad (5)$$

Here $\bar{a}^{[j]}$ denotes the average outcome of all those patients who received the same treatment
as received by patient $j$. As will be seen, the two measures gave similar results; those for
$\tilde{R}^2$ are given in parentheses below.
Note that when cross-validation is employed, each prediction $p_j$ is actually being made
by a different predictive model; i.e., the one trained on data from which patient $j$ was
excluded.
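The difference between the two measures is easiest to see numerically. The Python sketch below assumes both take the usual form of one minus the ratio of residual to total sum of squares, with the total taken about the overall mean for the first measure and about each patient's treatment-group mean for the second. The function names are illustrative.

```python
import numpy as np


def proportion_of_variance(actual, predicted):
    """1 - SSE/SST, with SST taken about the overall mean outcome."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((a - p) ** 2) / np.sum((a - a.mean()) ** 2)


def proportion_of_variance_by_treatment(actual, predicted, treatment):
    """1 - SSE/SST, with SST taken about each patient's treatment-group mean."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    t = np.asarray(treatment)
    group_mean = np.empty_like(a)
    for label in np.unique(t):
        group_mean[t == label] = a[t == label].mean()   # mean outcome within one treatment
    return 1.0 - np.sum((a - p) ** 2) / np.sum((a - group_mean) ** 2)
```

With outcomes 1, 2, 3, 4 split into two treatment groups of two, a model that merely predicts each group's mean scores 0.8 on the first measure and 0 on the second. Both measures become negative when the predictions are worse than the corresponding baseline mean.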
6 Application of GRNN to depression data
We first train GRNN on the entire 99-patient data set, using the seven symptom factors,
initial severity and treatment flag as the 11-component input vector. Both the input
variables and the output variable (final severity) are standardized (converted to z-scores). In
addition, the outcome of treatment to be predicted by the model is taken to be the actual
final severity, rather than the fractional change in severity used in the previous studies.
(Results without these data-preprocessing modifications are inferior to those reported
below.)
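Standardization to z-scores can be sketched as follows in Python. The text does not say whether the means and standard deviations were computed once on the full data set or recomputed inside each cross-validation fold, so the optional `other` argument (a hypothetical convenience for applying training statistics to held-out rows) covers the second case without implying that it was the procedure used.

```python
import numpy as np


def standardize(train, other=None):
    """Convert columns to z-scores using the mean and standard deviation of `train`."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std > 0, std, 1.0)      # leave constant columns unscaled
    z_train = (train - mean) / std
    if other is None:
        return z_train
    return z_train, (np.asarray(other, dtype=float) - mean) / std
```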
The proportion of variance which the model accounts for, without cross-validation, is
57% (55%). However, when leave-one-out cross-validation is employed, $R^2$ ($\tilde{R}^2$) drops to
9% (5%); much smaller, to be sure, but still nonzero. To assess the significance of this
small level of predictive power, we compare it with the proportion of variance obtained
under the null hypothesis that the data in fact contains no information of any predictive
value.
Specifically, we train and test GRNN, with cross-validation, on randomly-constructed
"data sets." The first random data set consists of values for each variable chosen uniformly
over that variable's allowed range. The cross-validated proportion of variance is negative,
suggesting that the small positive effect found in the actual data is meaningful.
We next construct a random data set in which the value of each variable has a Gaussian
distribution, with a mean and variance determined by the mean and variance found in the
real data set. When $R^2$ ($\tilde{R}^2$) is computed, with cross-validation, for GRNN trained and
tested on this data set, the result is 8% (4%). Since the proportion of variance accounted
for by the model trained on this random data is so close to that accounted for with the
actual data, we cannot consider the figure of 9% (5%) proportion of variance on the actual
data to be meaningful.
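The two kinds of random data set can be generated along the following lines in Python. This is a sketch under stated assumptions, not the procedure actually used: the uniform version draws between each column's observed minimum and maximum as a proxy for the "allowed range", both versions produce continuous values and draw each variable independently, and the function names are illustrative.

```python
import numpy as np


def uniform_surrogate(data, rng):
    """Each column drawn uniformly between that column's observed min and max."""
    data = np.asarray(data, dtype=float)
    return rng.uniform(data.min(axis=0), data.max(axis=0), size=data.shape)


def gaussian_surrogate(data, rng):
    """Each column drawn independently from a Gaussian with that column's mean and variance."""
    data = np.asarray(data, dtype=float)
    return rng.normal(data.mean(axis=0), data.std(axis=0), size=data.shape)
```

A typical use would be to pass `np.random.default_rng(seed)` as `rng`, run the same cross-validated training on the surrogate as on the real data, and compare the resulting proportions of variance. Because the Gaussian surrogate matches the real data's per-variable mean and variance, it is the closer of the two to the real data in structure, which is what makes it the more stringent null hypothesis.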
7 Discussion
The above results indicate that, to the extent that the data set we are using contains
information of predictive value, GRNN is not extracting it. Perhaps of even greater interest
is the observation that cross-validation by itself is not necessarily sufficient to unmask an
overfit disguised as an apparently positive result. As in any test of an hypothesis,
comparison with a null hypothesis must be performed. Furthermore, different null hypotheses
may give different results, and it is critical to devise ones which are similar enough in
structure to the alternative hypothesis to provide sufficiently stringent tests. The negative
GRNN results make even more imperative the evaluation of the previous backpropagation
and quadratic regression findings in a similar manner. These tests, on the same and
extended data sets, are currently in progress.
Acknowledgements
This research was supported in part by a grant from the Office of Naval Research, ONR
N00014-95-1-0409 (M. A. R.).
References
Cohen, M. A. (1995) Unpublished.
Greenberg, P. E., Stiglin, L. E., Finkelstein, S. N. and Berndt, E. R. (1993) The economic
burden of depression in 1990. Journal of Clinical Psychiatry, 54, pp. 405-418.
Hamilton, M. (1960) A rating scale for depression. Journal of Neurology, Neurosurgery, and
Psychiatry, 23, pp. 56-62.
Joyce, P. M. and Paykel, E. (1989) Predictors of drug response in depression. Archives of
General Psychiatry, 46, pp. 89-99.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshleman, S.,
Wittchen, H.-U. and Kendler, K. S. (1994) Lifetime and 12-month prevalence of DSM-III-R
psychiatric disorders in the United States: Results from the national comorbidity
study. Archives of General Psychiatry, 51, pp. 8-19.
Luciano, J. S. (1996) Neural network modeling of unipolar depression: Patterns of recovery
and prediction outcome. PhD thesis, Department of Cognitive and Neural Systems, Boston
University, Boston, MA.
Luciano, J. S., Cohen, M. A. and Samson, J. A. (1996) Neural network modeling of unipolar
depression, in Reggia, J., Ruppin, E. and Berndt, R., Neural Modeling of Brain
and Cognitive Disorders, World Scientific Publishing Company, pp. 469-483.
Luciano, J. S., Cohen, M. A. and Samson, J. A. (1997) Predictive medicine: initial symptoms
may determine outcome in clinically treated depressions, in ICNN '97: Proceedings
of the International Conference on Neural Networks, Houston, June 1997,
vol. I, pp. 71-80.
Luciano, J. S., Cohen, M. A., Samson, J. A. and Hagan, P. (1994) A prototype and nonlinear
neural network approach to subtyping depressive illness by initial input data
and treatment outcome, in Proceedings of the Irish Neural Network Conference 1994,
Dublin, Ireland.
Masters, T. (1993) Practical Neural Network Recipes in C++, Academic Press, Inc.
Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, John
Wiley and Sons, Inc.
Rao, C. R. (1973) Linear Statistical Inference and its Applications, 2nd ed. Wiley-Interscience.
Schiøler, H., and Hartmann, U. (1992) Mapping neural network derived from the Parzen
window estimator, Neural Networks 5, pp. 903-909.
Specht, D. (1990) A general regression neural network, IEEE Transactions on Neural
Networks, 2, pp. 568-576.