Eur Child Adolesc Psychiatry
DOI 10.1007/s00787-014-0609-9
ORIGINAL CONTRIBUTION
Denoting treatment outcome in child and adolescent psychiatry:
a comparison of continuous and categorical outcomes
Edwin de Beurs · Marko Barendregt · Bente Rogmans · Sylvana Robbers · Marieke van Geffen · Marleen van Aggelen-Gerrits · Huub Houben

Received: 29 April 2014 / Accepted: 23 August 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract Various approaches have been proposed to denote treatment outcome, such as the effect size of the pre-to-posttest change, percentage improvement, statistically reliable change, and clinically significant change. The aim of the study is to compare these approaches and evaluate their aptitude to differentiate among child and adolescent mental healthcare providers regarding their treatment outcome. We compared outcomes according to continuous and categorical outcome indicators, using real-life data of seven mental healthcare providers, three using the Child Behavior Checklist and four using the Strengths and Difficulties Questionnaire as primary outcome measure. Within each dataset, consistent differences were found between providers, and the various methods led to comparable rankings of providers. Statistical considerations designate continuous outcomes as the optimal choice: change scores have more statistical power and allow for a ranking of providers at first glance. Expressing providers' performance in proportions of recovered, improved, unchanged, or deteriorated patients has supplementary value, as it denotes outcome in a manner more easily interpreted and appreciated by clinicians, managerial staff, and, last but not least, by patients or their parents.
Keywords Treatment outcome research · Effect size (ES) · Reliable change index (RCI) · Percentage improvement (PI) · Benchmarking
E. de Beurs (✉) · M. Barendregt
Stichting Benchmark GGZ, Rembrandtlaan 46, 3723 BK Bilthoven, The Netherlands
e-mail: [email protected]

B. Rogmans
Department of Clinical Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands

S. Robbers
Yulius Academy, Mathenesserlaan 202, 3014 HH Rotterdam, The Netherlands

M. van Geffen
De Viersprong, De Beeklaan 2, 4661 EP Halsteren, The Netherlands

M. van Aggelen-Gerrits
PIONN, Papenvoort 21, 9447 TT Papenvoort, The Netherlands

H. Houben
Mentaal Beter, Soestdijkerweg 17, 3734 MG Den Dolder, The Netherlands

Introduction
Recently, Dutch mental healthcare has embarked on an endeavor to implement two treatment-supportive measures countrywide: routine outcome monitoring (ROM) and benchmarking. ROM is the process of routinely monitoring and evaluating the progress made by individual patients in mental healthcare (ROM [8]; ROMCKAP http://www.romckap.org).
Each patient is assessed at the onset of treatment and
assessments are repeated periodically to track changes over
time. Thus, standardized information about the treatment
effect at regular intervals is obtained, and feedback regarding
progress is provided to the therapist and the patient and/or
parents. There is ample evidence that ROM improves treatment outcome, especially through the process of providing
feedback [5, 24], albeit to a modest extent [6, 22]. Internationally, implementation of ROM initiatives in child and
adolescent psychiatry is on the rise [13, 20].
Aggregated ROM data, collected in mental health centers across the country, allow for comparison of providers’
performance (benchmarking), aimed at learning what
works best for whom in everyday clinical practice. In The
Netherlands, a foundation (Stichting Benchmark GGZ or
SBG http://www.sbggz.nl), was established in 2011 for
benchmarking mental healthcare providers. SBG is an
independent foundation, governed conjointly by the Dutch
association of mental healthcare providers (GGZ-Nederland; http://www.ggznederland.nl), the association for
healthcare insurers (Zorgverzekeraars Nederland; http://
www.zn.nl), and the Dutch Patient Advocate Organisation
(Landelijk Platform GGZ; http://www.lpggz.nl). The goal
of SBG is to aggregate ROM-data, compare the outcome of
groups of patients and provide results to insurers, managers, therapists, and, eventually, to consumers: the patients
or their parents. If certain conditions are met, aggregated
data allow for comparing treatment outcomes of healthcare
providers (external benchmarking). Aggregated data may also reveal differences in outcome between locations, departments, or treatment methods within a provider, or can be used for comparing the caseloads of therapists (internal benchmarking). Transparency regarding outcomes and a valid feedback system for discussing outcomes may improve the quality of care [4, 13, 20].
Healthcare insurers seek reliable and valid performance
indicators and their attention has shifted from process to
outcome indicators [29]. A valid comparison requires
unbiased data and potential confounding of results should
be counteracted by case-mix adjusted outcomes [14]. In
The Netherlands, access of insurers to aggregated outcome
data is restricted to the level of comparing providers; they
use this information to monitor quality and may, in the near
future, use it in their contracting negotiations. In the long run, patients will also benefit from transparency about the effectiveness of mental healthcare providers, as quality of care is a relevant factor in choosing a provider.
Use of treatment outcome as the main performance
indicator of quality of care requires a simple, straightforward, and valid method to denote outcome. Various
methods have been proposed to delineate treatment outcome and improvement over time [23]. In treatment outcome research, the effect size of the standardized pre-toposttest change (ES or Cohen’s d) [7] is often used to
express the magnitude of the treatment effect. It is, however, difficult to translate a given effect size in clinical
terms and appreciate its clinical implication. Therefore,
such scores alone may not be sufficient to communicate to
a broad audience what is accomplished by mental healthcare. To express the effectiveness of a course of psychotherapy for an individual patient, Jacobson and colleagues have proposed two concepts: clinical significance and statistically reliable change [17]. Jacobson and colleagues
introduced their method to evaluate the effectiveness of
treatment beyond a statistical comparison at group level
(comparing pre–post group means) and to convey
information regarding the progress of individual patients.
They proposed to use the term "clinical significance" for the change that is attained when a client transgresses a threshold score and has a postscore falling within the range of the "normal" population. However, whether a patient has moved during therapy from the dysfunctional to the healthy population is in itself not sufficient to denote meaningful change. A patient with a pretest score very close to the threshold score may experience a very minor change, but still transgress the cutoff value, and would thus be classified as meaningfully changed. Therefore, Jacobson and colleagues [17, 18] introduced an additional requirement: the Reliable Change Index (RCI). It is defined as the difference between the pre- and postscore, divided by the standard error of measurement of the difference score. If this index exceeds 1.96, the change is statistically significant (p < 0.05). The two proposed criteria, clinical significance and statistical significance, can be taken together and utilized to categorize patients into four categories: recovered, merely improved, unchanged, and deteriorated [18].
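As an illustration, a minimal sketch of this two-criterion categorization in Python (the function name and example scores are ours; the cutoff of T = 42.5 and reliability of 0.95 are the values used later in this paper, and scores are assumed to be scaled so that higher means more symptoms):

```python
import math

def jt_classify(pre, post, sd=10.0, reliability=0.95, cutoff=42.5, z=1.96):
    """Classify one patient by the Jacobson-Truax criteria.

    pre, post: scores on a common scale, higher = more symptoms.
    Reliable change requires |pre - post| > z * sqrt(2) * SEM,
    with SEM = sd * sqrt(1 - reliability).
    """
    sem = sd * math.sqrt(1.0 - reliability)
    rc_threshold = z * math.sqrt(2.0) * sem       # ~6.2 for these constants
    change = pre - post                           # positive = improvement
    reliable = abs(change) > rc_threshold
    crossed = pre > cutoff and post <= cutoff     # clinically significant
    if reliable and change > 0 and crossed:
        return "recovered"
    if reliable and change > 0:
        return "improved"
    if reliable and change < 0:
        return "deteriorated"  # relapsed cases are folded in here as well
    return "unchanged"

print(jt_classify(pre=55.0, post=41.0))  # recovered
print(jt_classify(pre=55.0, post=52.0))  # unchanged
```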
Over the years, several refinements of the traditional
Jacobson–Truax (JT) method have been suggested [3, 36].
Alternative methods are recommended to correct for
regression to the mean, which the JT method does not.
McGlinchey, Atkins, and Jacobson [26] compared these methods and concluded that four of the five investigated methods yielded similar results in patients' outcome classification, with a mean agreement of 93.7 %. Because all indexes
performed similarly, they recommended the original JT
method as it is easy to compute and cutoff estimates are
available for a number of widely used instruments. Moreover, the JT method takes a moderate position between the
more extreme alternatives. If the JT method is used with
measures for which large normative samples are available,
then standard cutoff and RCI values can be applied uniformly by researchers rather than requiring sample- and
study-specific recalculations. This means that a standard
for clinically significant change can be set and applied
regardless of the research or clinical setting in which the
study is conducted [3].
A recent addition to the expanding literature on outcome
indicators is the method proposed by Hiller, Schindler, and
Lambert [15]: the percentage improvement (PI). PI is
defined as the percentage reduction in symptom score
within each patient. PI has a long tradition of use in
pharmacological research, where it is used to establish
responder status (e.g., by requiring a 50 % reduction in score on the outcome measure, or PI50). Hiller et al. [15] proposed to calculate PI as the pre-to-posttest change as a proportion of the difference between the prescore and the mean score of the functional population: (pre − post)/(pre − functional). The method offers two ways of
usage: just expressing the percentage of change within a patient, or categorizing patients above or below a cutoff of ≥50 % reduction of symptoms into "responders" and "non-responders". For responder status, Hiller et al. [15] proposed an additional criterion of 25 % reduction on the entire range of the scale, which corresponds to the RCI requirement of the JT method. As this is a substantial modification of the original percentage improvement score, we will refer to this indicator as PImod. Hiller et al. [15] compared PImod with the RCI for 395 adult depressive outpatients who were treated with cognitive behavior therapy. The RCI was significantly more conservative than PImod in classifying patients as responders.
An important asset of PImod is that it corrects for baseline severity. According to the law of initial value [35], pretreatment severity is positively associated with the degree of improvement: patients with high initial scores tend to achieve a larger pre–post change than those with low scores and, consequently, the RCI is more easily transgressed. In contrast, with PImod, the higher the baseline score of an individual, the more improvement in terms of score reduction is needed to meet the responder criterion of ≥50 % reduction. For benchmarking, this is advantageous: if PImod appears insensitive to pretreatment level, no case-mix correction for pretest severity [14] is needed, which would simplify comparison of treatment providers. The other advantage Hiller et al. [15] point out for PImod is that no reliability coefficients and standard deviations have to be determined, as is the case for the JT method.
Three issues regarding conceptual and methodological
aspects of outcome measurement need to be addressed; two
pertain to measurement scale properties, and one regards
statistical power considerations in relation to continuous
vs. categorical data. Every textbook on statistics will point out that subtracting pre- and posttest scores requires an interval scale (the distance between 9 and 7 is equal to the distance between 6 and 4). Raw scores on self-report scales seldom meet this criterion, as score distributions on these instruments tend to be peaked and skewed to the right. Consequently, the magnitude of a two-point change on the high end of the scale does not correspond to the magnitude of a two-point change on its low end. Normalizing scores deals with skewness and kurtosis in the raw data and, from a measurement scale perspective, the difference in T-scores (ΔT) is superior to ES. Thus, for benchmarking, raw scores are transformed into normalized T-scores [21] with a pretest mean of M = 50 and SD = 10.
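The paper does not detail the norming procedure itself; as an illustrative assumption, a rank-based normalization along the following lines yields T-scores with M = 50 and SD = 10 regardless of skew in the raw data:

```python
import numpy as np
from scipy import stats

def normalized_t_scores(raw, reference=None):
    """Map raw scores to normalized T-scores (M = 50, SD = 10).

    Each score's percentile within the reference sample is converted
    to a standard-normal quantile, which removes skewness and kurtosis.
    """
    raw = np.asarray(raw, dtype=float)
    if reference is None:
        reference = raw                      # norm on the sample itself
    pct = np.array([stats.percentileofscore(reference, x, kind="mean")
                    for x in raw]) / 100.0
    pct = np.clip(pct, 0.001, 0.999)         # avoid infinite quantiles
    return 50.0 + 10.0 * stats.norm.ppf(pct)

raw_pre = [10, 14, 15, 17, 22, 30, 31, 33, 40, 52]   # skewed raw scores
print(np.round(normalized_t_scores(raw_pre), 1))
```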
Another issue pertaining to properties of measurement scales is mentioned by Russel [30] and concerns the shortcomings of PImod as an indicator of change. A percentage improvement score requires a ratio measurement scale with a meaningful zero point (such as distance in miles or degrees Kelvin). Again, self-report questionnaires yield scores that do not meet this requirement. Using the cutoff value between the functional and the dysfunctional population as a proxy for a zero point, the solution chosen for PImod, invokes new problems, as it makes improvement in excess of 100 % possible and likely. Thus, percentage change based upon an interval or ordinal measurement scale is methodologically flawed.
Transforming continuous data into categories diminishes their informative value and has negative consequences for the statistical power to detect differences. Fedorov, Mannino, and Zhang [9] show that categorizing treatment outcome into two categories of equal size amounts to a loss of power of 36.3 % (1 − 2/π). To compensate for this loss of power, a 1.571 times larger sample size is required. Some examples of power calculation can illustrate this point (see the sketch after this paragraph). Sufficient statistical power (e.g., 1 − β = 0.80) to demonstrate a large difference (e.g., M = 50 vs. M = 42, SD = 10; d = 0.80) between the means of two groups with a t test at α = 0.05 (two-tailed) requires at least N = 26 observations in each group; a medium size difference (e.g., M = 50 vs. M = 45, SD = 10; d = 0.50) requires N = 64 in each group [12]. In comparison, a Chi square test requires N = 32 to demonstrate a large difference in proportions (e.g., 75/25 vs. 50/50; w = 0.50) and N = 88 to demonstrate a medium size difference in proportions (e.g., 65/35 vs. 50/50; w = 0.30) [10]. The loss of
information, going from continuous data to a dichotomous
categorization, needs to be offset with more participants to
maintain the same statistical power to demonstrate differences in performance between providers. Loss of statistical
power could lead to the faulty conclusion that there are no
differences between providers (type II error), an undesirable side effect of categorization when benchmarking
mental healthcare. One way to increase statistical power
and counter type II error would be to raise the number of
possible outcome categories from a dichotomy to three or
more. It should be noted that the negative influence of diminished statistical power depends on the number of observations to begin with. If one starts out with
a large dataset, which grants substantial statistical power,
some decrease of power can be sustained, and the probability of a type II error would not be raised to unacceptable
heights. However, with smaller datasets, e.g., when we
would compare the treatment effect of individual therapists, statistical power definitely becomes an issue and
categorized outcome indicators may no longer be an
alternative for average pre-to-post change. Thus, caution
should be applied when comparing small data sets, e.g., the
caseload of individual therapists, as true differences in
effectiveness may get obscured when using categorical
outcomes.
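The sample sizes quoted above can be reproduced with standard power routines; a sketch using statsmodels, with the effect sizes d and w taken from the examples in this paragraph (exact outputs round up to the quoted N):

```python
from statsmodels.stats.power import TTestIndPower, GofChisquarePower

# t test on a continuous outcome: required n per group
t_power = TTestIndPower()
for d in (0.80, 0.50):
    n = t_power.solve_power(effect_size=d, alpha=0.05, power=0.80,
                            alternative="two-sided")
    print(f"t test, d = {d}: n per group = {n:.1f}")   # 25.5 -> 26; 63.8 -> 64

# Chi square test on a dichotomized outcome: required total N
chi_power = GofChisquarePower()
for w in (0.50, 0.30):
    n = chi_power.solve_power(effect_size=w, n_bins=2, alpha=0.05, power=0.80)
    print(f"chi square, w = {w}: N = {n:.1f}")         # 31.4 -> 32; 87.2 -> 88
```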
Finally, a relevant consideration for the choice between
a continuous vs. a categorical outcome is the issue whether
psychopathology should be regarded as a discrete (sick vs.
healthy) or a continuous phenomenon. Markon, Chmielewski, and Miller [25] provide a comprehensive discussion of this topic and demonstrate superior reliability and
validity of continuous over discrete measures in a meta-analysis of studies assessing psychopathology. They report
a 15 % increase in reliability and a 37 % increase in
validity when continuous measures of psychopathology are
used instead of discrete measures. They argue that these
increases in reliability and validity may be based on the
more fitting conceptualization of (developmental) psychopathology as a dimensional rather than as a categorical
phenomenon [16, 34]. If psychopathology is better viewed
as dimensional, then treatment outcome is also best considered on a continuous scale. When a continuous
outcome variable is discretized, information on variation
within outcome categories is lost and patients with highly
similar outcomes may fall on either side of the cutoff
ending up in different categories. However, in support of
categorization, Markon et al. [25] also mention the discrepancy between optimizing reliability and validity vs.
optimizing clinical usefulness, which might be dependent
on the goal of measurement (idiosyncratic or nomothetic).
When the goal is to denote clinical outcome in individual
cases to, for instance, decide on the further course of
treatment, optimized idiosyncratic assessment is key.
Moreover, patients and parents may be better informed by discrete information (e.g., the percentage of recovered patients), which may be taken as an approximation of their chance of recovery.
The above brings forth the following research questions:
How do the various continuous and categorical approaches to
denote treatment outcome compare in practice, when applied
to real-life data? What is their concordance? Which method
reveals the largest differences in outcome between providers?
How do various indicators compare in ranking treatment
providers? Which approach is most useful or informative for
benchmarking purposes, i.e., which method is best to establish differential effectiveness of Dutch mental healthcare
providers treating children and adolescents?
Method
Participants
To ensure sufficient statistical power to detect differences
between treatment providers, we selected from the SBG database only providers that had collected outcome data
from at least 200 patients in the period from January 2012
to December 2013. We randomly selected four providers
using the Child Behavior Checklist (CBCL) [1] and four
using the Strengths and Difficulties Questionnaire (SDQ)
[11]. From the first group, one center declined participation. All patients were treated for an Axis I or II diagnosis according to the Diagnostic and Statistical Manual of Mental Disorders, DSM-IV [2]. All data were anonymized before
being submitted to SBG. According to Dutch law, no
informed consent is required for data collected during
therapy to support the clinical process. Nevertheless,
patients were informed about ROM procedures and anonymous use of their data for research purposes, as is part of
the general policy at Dutch mental healthcare providers.
Response rates for ROM differed among providers, as
implementation of ROM in child and adolescent mental
healthcare is an ongoing process, with response rates
growing annually by about 10 %. For the year 2013, the response rates were between 20 and 30 %.
The duration of psychiatric treatment for children and
adolescents may vary from months to several years. To
standardize the pre-to-posttest interval, SBG utilizes the
Dutch reimbursement system. For administrative purposes,
long treatments are segmented into units of one year called
Diagnosis Treatment Combinations (DTCs). To get a more
homogeneous dataset, in the present study, only outcomes
of initial DTCs were analyzed (outcome of the first year of
treatment as opposed to prolonged treatment or follow-up).
Thus, the maximum pre-to-posttest interval was one year.
Table 1 presents data on gender, age, and pre-to-posttest
interval of the included patients of the seven providers. In
the CBCL dataset, it appeared that provider 3 treated more boys [χ²(2) = 6.25, p = 0.04], and the providers differed in the age of their clients [F(2) = 347.1, p < 0.001, η² = 0.30; 2 > 1 > 3] and in pre-to-posttest interval [F(2) = 238.2, p < 0.001, η² = 0.23; 3 > 2 > 1]. In the SDQ dataset, providers C and D treated more boys [χ²(3) = 45.2, p < 0.001], patients of providers A and C were younger than those of providers B and D [F(3) = 16.5, p < 0.001, η² = 0.04], and the providers differed in the duration of the pre-to-posttest interval [F(3) = 154.0, p < 0.001, η² = 0.28; C > D > A > B].
Measurement instruments
For ROM in The Netherlands, the two most commonly
used self-report questionnaires are the CBCL [1] and the
SDQ [11]. The CBCL assesses the child's behavioral and emotional problems and competencies [1, 33]. The CBCL questionnaire contains 113
items, which are scored on a three-point scale (0 = absent,
1 = occurs sometimes, 2 = occurs often). We used the
total score of parents’ reports.
The SDQ is a 25-item questionnaire with good psychometric properties [11, 27, 32]. The SDQ measures five constructs: emotional symptoms, conduct problems, hyperactivity-inattention symptoms, peer problems, and pro-social behavior, rated on a three-point scale from 0 = not true to 2 = certainly true.
Table 1 Descriptive data of patients and duration of pre-to-posttest interval in both datasets

             N      Boys          Girls         Age (years)   Duration (days)
                    n      %      n      %      M      SD     M       SD
CBCL dataset
Provider 1   515    287    59.2   198    40.8   11.4   3.4    213.9   92.2
Provider 2   411    250    60.8   161    39.2   15.5   1.5    263.3   65.7
Provider 3   715    471    65.9   244    34.1   10.9   3.2    313.1   74.6
Total        1,641  1,008  62.6   603    37.4   12.2   3.5    269.1   89.3
SDQ dataset
Provider A   372    238    64.0   134    36.0   9.6    2.4    241.1   94.9
Provider B   367    192    52.3   175    47.7   10.9   3.6    220.7   105.7
Provider C   288    208    72.2   80     27.8   9.5    2.5    360.1   24.8
Provider D   180    118    79.2   31     20.8   10.5   3.3    285.4   103.6
Total        1,207  756    64.3   420    35.7   10.1   3.0    270.3   104.3
For the present study, we used
the sum score of the 20 items regarding difficulties (leaving
pro-social behavior items out) from the parent version, as
this selection of items is comparable to what is measured
with the total score of the CBCL. Both questionnaires
correlate highly with one another [12].
Outcome indicators
ES and ΔT
Cohen's d effect size (ES) is calculated on raw scores and defined as the pre-to-post difference divided by the standard deviation of pretest scores. To make outcomes on various measurement instruments comparable, and to make raw scores with a skewed frequency distribution suitable for subtraction, scores are standardized into normalized T-scores (mean = 50, SD = 10; [21]). Delta-T (ΔT) is the pre-to-post difference of normalized T-scores. As T-scores are standardized with SD = 10, the size of ΔT is approximately 10 times Cohen's d. For the CBCL and the SDQ, we based the T-score conversion on normative data of clinical samples. (In the case of the CBCL, our approach deviates from the usual T-score conversion in the user manual, which is based on ratings of a normative sample of "healthy" children.)
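A sketch of both continuous indicators under these definitions (the arrays are hypothetical; the T-score conversion is assumed to have been done already):

```python
import numpy as np

def cohens_d(pre_raw, post_raw):
    """ES: mean pre-to-post change divided by the pretest SD."""
    return (np.mean(pre_raw) - np.mean(post_raw)) / np.std(pre_raw, ddof=1)

def delta_t(pre_t, post_t):
    """Per-patient change on normalized T-scores; because SD = 10,
    the mean of delta_t is roughly 10 times Cohen's d."""
    return np.asarray(pre_t) - np.asarray(post_t)

# Hypothetical normalized T-scores for five patients
pre_t = np.array([51.0, 62.0, 48.0, 55.0, 44.0])
post_t = np.array([43.0, 50.0, 47.0, 41.0, 45.0])
print(delta_t(pre_t, post_t).mean())   # 6.8 T-points average improvement
```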
PImod
The percentage improvement (PImod) method as applied by Hiller et al. [15] defines a responder as a patient who reports at least a 50 % reduction of symptoms from pretest toward the cutoff halfway between the mean scores of the functional and the dysfunctional population, as well as at least a 25 % reduction on the entire range of the scale. For instance, with a baseline score of T = 55 on the SDQ (Tcutoff = 42.5; Trange is 20–90), a posttest score ≤46 is required to meet the criterion for responder status (a nine-point change implies a 72 % reduction toward the cutoff value and a 26 % reduction on the entire scale).
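A sketch of this responder rule, checked against the worked example above (the function name is ours; the thresholds follow the "at least 50 %/25 %" wording):

```python
def pimod_responder(pre_t, post_t, cutoff=42.5, scale_min=20.0):
    """PImod responder status as applied by Hiller et al. [15]:
    >= 50 % reduction toward the cutoff AND >= 25 % reduction on the
    entire scale range, both on T-scores (requires pre_t > cutoff)."""
    change = pre_t - post_t
    toward_cutoff = change / (pre_t - cutoff)
    on_range = change / (pre_t - scale_min)
    return toward_cutoff >= 0.50 and on_range >= 0.25

# Worked example from the text: baseline T = 55, posttest T = 46
print(pimod_responder(55.0, 46.0))   # True: 9/12.5 = 72 %, 9/35 = 26 %
```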
JTRCI, JTCS, and JTRCI&CS
The Jacobson–Truax (JT) method yields three performance indicators. Reliable change requires a pre–post change beyond a cutoff value based on the reliability and pretest variance of the instrument. The indicator for comparing the proportions of reliably changed patients is called the Reliable Change Index (JTRCI). Clinical significance (JTCS) requires a pre-to-posttest score transition by which a cutoff value is crossed, denoting the passage from the dysfunctional to the functional population. The outcome indicator JTCS is based on the proportion of patients meeting the required passage. Combining both criteria yields a categorization (JTRCI&CS) with theoretically five levels (recovered, improved, unchanged, deteriorated, and relapsed). In the present study, four levels were used, as patients rarely met the criteria for the category "relapsed" (<2.9 % in our datasets started "healthy" and concluded their first year of treatment with a score in the pathological range). Therefore, we collapsed deteriorated and relapsed patients into a single category.
Categorization is based on T-values. For JTRCI, we chose a value of ΔT = 5.0, which corresponds to half a standard deviation and is commonly considered the minimal clinically important difference [19, 28, 31]. Use of the formulas provided by Jacobson and Truax [18], with an instrument reliability of Cronbach's α = 0.95, would result in RCI90 = 5.22. For JTCS, we used a value of T = 42.5, which is based on the third option for calculating the cutoff value described by Jacobson and Truax [18], i.e., halfway between the means of the functional and the dysfunctional population.
Table 2 Overview of outcome indicators, their calculation, advantages, and disadvantages

ES
  Definition/calculation: (pretest mean − posttest mean)/pretest SD
  Frequently used effect size indicator; unduly influenced by variations in pretest SD

ΔT
  Definition/calculation: Tpretest − Tposttest
  Based on T-score conversion, thus especially suitable for measures with a non-normal distribution of scores; not easy to interpret as a performance indicator

PImod
  Definition/calculation: (Tpretest − Tposttest)/(Tpretest − 42.5) > 0.50 ∧ (Tpretest − Tposttest)/(Tpretest − 20) > 0.25
  Responder status with pretest severity taken into account; formally not appropriate for ordinal or interval scales, as a ratio scale is required

JTRCI
  Definition/calculation: RC = 1.96 × √(2 × SEM²), with SEM = SD × √(1 − r)
  Frequently used to denote individual and group outcome; sensitive to pretest severity differences

JTCS
  Definition/calculation: CS = (SDdysfunc × Mfunc + SDfunc × Mdysfunc)/(SDdysfunc + SDfunc)
  Inherently appealing to denote clinical endstate; a minimal change can be sufficient to transgress the cutoff value CS

JTRCI&CS
  Recovered:     Tpretest − Tposttest > 5.0 ∧ Tpretest > 42.5 and Tposttest ≤ 42.5
  Improved:      Tpretest − Tposttest > 5.0
  Unchanged:     −5.0 ≤ Tpretest − Tposttest ≤ 5.0
  Deteriorated:  Tpretest − Tposttest < −5.0
  Relapsed*:     Tpretest − Tposttest < −5.0 ∧ Tpretest < 42.5 and Tposttest ≥ 42.5
  Classification into four clinically meaningful outcome categories; rank order depends on which category is focused upon

r reliability, SD standard deviation for a clinical group, SEM standard error of measurement, SDdysfunc standard deviation for the dysfunctional population, Mfunc mean score of the functional population (SDfunc and Mdysfunc defined analogously), CS clinical significance, JTRCI reliable change, JTCS clinical significance, JTRCI&CS reliable change and clinical significance combined
* Usually taken together with 'Deteriorated'
(The mean of the dysfunctional population is by definition T = 50; the mean of the functional population was derived from a translation of raw mean scores from the manuals of the CBCL and the SDQ to T-scores. This resulted in T-scores of 37 and 33 for the CBCL and the SDQ, respectively. Using these reference values, the mean of the functional population was estimated as T = 35.0.)
Table 2 gives an overview of the indicators, summarizes
how each indicator is calculated and briefly describes the
advantages and disadvantages of each indicator.
Statistical analysis
All analyses were performed for the CBCL dataset and the
SDQ dataset separately. Raw scores were standardized into
normalized T-scores. Concordance among the different
outcome indicators was assessed with correlational analyses (Pearson’s PMCC, Kendall’s Tau, and Spearman’s Rho
correlation coefficients). Also, the prognostic value of
pretest severity for outcome (change score or category) was
assessed with correlational analysis. Pre- and posttest
T-scores of providers were compared with repeated measures ANOVA with time as the within-subjects factor and
provider as the between-subjects factor. To investigate
which providers differed significantly from each other,
we conducted pairwise comparisons with a Bonferroni
correction for multiple comparisons. Differences in proportions of patients according to the various categorical
approaches were assessed with Chi square tests. Providers
were placed in rank order from best to worst results and the
ranking according to the various outcome indicators was
compared.
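A sketch of these analyses with scipy (the per-patient arrays are hypothetical; the contingency table reuses the provider 1 vs. provider 3 counts from Table 4):

```python
import numpy as np
from scipy import stats

# Hypothetical per-patient outcomes for the concordance analyses
delta_t = np.array([8.1, -1.2, 5.5, 12.0, 0.3, 6.8])       # continuous
jt_cat = np.array([3, 1, 2, 3, 1, 2])    # 0=deteriorated ... 3=recovered

print(stats.pearsonr(delta_t, jt_cat))    # Pearson's PMCC
print(stats.kendalltau(delta_t, jt_cat))  # Kendall's Tau
print(stats.spearmanr(delta_t, jt_cat))   # Spearman's Rho

# Chi square test on categorical outcome, provider 1 vs. provider 3
counts = np.array([[29, 201, 151, 134],   # det./unch./impr./recovered
                   [84, 338, 139, 154]])
chi2, p, dof, _ = stats.chi2_contingency(counts)
print(chi2, p, dof)
```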
Results
Comparison of outcome indicators
For both the CBCL data and the SDQ data, the association between ES and ΔT amounted to r = 0.99, p < 0.001. Although transforming raw scores into normalized T-scores is important from a theoretical perspective, in practice it has little effect on the relative position of outcome scores within each dataset.
The pattern of associations among indicators based on T-scores is presented in Table 3. Findings are highly similar for both datasets. Pretest severity is substantially associated with ΔT (r = 0.30 and r = 0.41 for the CBCL and SDQ, respectively) and JTRCI (r = 0.19 and r = 0.23). Compared to ΔT, PImod is relatively insensitive to pretest level. When looking at associations among performance indicators, the strongest concordance is found between PImod and the combined JTRCI&CS index (Spearman's Rho = 0.76 and 0.77 for the CBCL and SDQ, respectively), followed by the concordance of ΔT and JTRCI&CS (Kendall's Tau = 0.75 and 0.77). JTCS appears less concordant with ΔT and PImod.
The CBCL outcomes
Table 4 shows CBCL-based results of 1,641 patients, differentiating providers according to the various outcome
indicators along with a ranking of providers.
A 2 (time) × 3 (provider) repeated measures ANOVA revealed a statistically significant effect of time, F(1, 1638) = 469.0, p < 0.001, η² = 0.28, indicating a pre–post difference, irrespective of provider. Overall, the pre-to-posttest difference amounts to d = 0.44, an effect of medium size.
Table 3 Association of outcome indicators with pretest severity and among outcome indicators: ΔT, PImod, and JT methods (Kendall's Tau, unless otherwise indicated)

CBCL dataset (N = 1,641)
            Pre        ΔT      PImod
ΔT          0.30^a
PImod       0.18       0.62
JTRCI       0.19       0.71    0.69^b
JTCS        −0.04      0.45    0.68^b
JTRCI&CS    0.12       0.75    0.76^b

SDQ dataset (N = 1,207)
            Pre        ΔT      PImod
ΔT          0.41^a
PImod       0.26       0.62
JTRCI       0.23       0.71    0.69^b
JTCS        0.03 (ns)  0.47    0.72^b
JTRCI&CS    0.19       0.77    0.77^b

All correlations p < 0.01 except those indicated with "ns"
Pre prescore, ΔT pre-T-score minus post-T-score, PImod percentage improvement responder
^a Pearson's PMCC
^b Spearman's Rho
There is also a main effect of provider, F(2, 1638) = 4.3, p < 0.01, η² = 0.005, indicating differences between providers, irrespective of time. Finally, there was a significant interaction effect (time × provider), indicating a difference in effectiveness between providers, F(2, 1638) = 16.2, p < 0.001, η² = 0.02. Pairwise comparisons (Bonferroni corrected) indicated that providers 2 and 3 had different outcomes. Also with the indicators PImod, ΔT, JTRCI, and JTCS, significant differences were found in outcome between providers. Overall, the ranking is consistent: 1–2–3. JTRCI&CS yields a similar pattern of outcome ranking in all categories. In plain English, these results indicate that a year of treatment from provider 1 was beneficial to 55.3 % of the patients, vs. 40.9 % of the patients of provider 3. Moreover, the percentage of deteriorated patients of provider 3 was twice the percentage of provider 1 (11.7 vs. 5.6 %). Figure 1 shows the proportion of patients according to the JTRCI&CS four-level categorization in a stacked-bar graph, based on normalized T-scores. Evidently, provider 3 has fewer recovered and more unchanged patients as compared to the other two providers.
The SDQ outcomes

Table 5 displays the results from the dataset containing pre-to-posttest scores of 1,207 patients. Overall, the results mimic the findings in the CBCL dataset, with the various methods yielding similar results and all indicators showing significant differences among the providers.
Table 4 Treatment outcomes of providers using the CBCL

                 Provider 1   Provider 2   Provider 3   Total        χ²(2)     Ranking
Pretest M (SD)   51.0 (9.7)   51.8 (10.2)  49.0 (9.5)   50.4 (9.8)             –
Posttest M (SD)  44.7 (10.5)  45.8 (10.8)  45.2 (11.1)  45.2 (10.8)            –
Continuous outcomes
ΔT M (SD)        6.3 (7.7)    6.0 (8.6)    3.8 (8.7)    4.8 (8.4)              1–2–3
ES d             0.65         0.59         0.40         0.44                   1–2–3
Categorical outcomes, n (%)
PImod            182 (35.3)   135 (32.8)   184 (25.7)   539 (32.8)   14.4**    1–2–3
JTRCI            285 (55.3)   212 (51.6)   293 (41.0)   790 (48.1)   27.4***   1–2–3
JTCS             147 (28.5)   103 (25.1)   173 (24.2)   423 (25.8)   3.1       1–2–3
JTRCI&CS                                                             37.0***
  Deteriorated   29 (5.6)     31 (7.5)     84 (11.7)    144 (8.8)              1–2–3
  Unchanged      201 (39.0)   168 (40.9)   338 (47.3)   707 (43.1)             1–2–3
  Improved       151 (29.3)   119 (29.0)   139 (19.4)   409 (24.9)             1–2–3
  Recovered      134 (26.0)   93 (22.6)    154 (21.5)   381 (23.2)             1–2–3

Categories 'deteriorated' and 'unchanged' have a reversed rank order: the fewer patients, the better the outcome
ES Cohen's d, JTRCI reliable change, JTCS clinical significance, JTRCI&CS reliable change and clinical significance combined
* p < 0.05, ** p < 0.01, *** p < 0.001
Fig. 1 Clinical results according to JTRCI&CS for normalized T-scores of the CBCL (stacked bars per provider showing the proportions of recovered, improved, unchanged, and deteriorated patients)
Table 5 Treatment outcomes of providers using the SDQ

                 Provider A   Provider B   Provider C   Provider D   Total        χ²(3)     Ranking
Pretest M (SD)   51.8 (9.0)   48.2 (9.9)   47.4 (9.6)   51.2 (10.9)  49.5 (9.8)             –
Posttest M (SD)  44.3 (10.4)  41.3 (10.7)  44.0 (9.2)   48.7 (10.7)  43.9 (10.1)            –
Continuous outcomes
ΔT M (SD)        7.5 (9.6)    6.9 (9.7)    3.4 (8.6)    2.5 (8.2)    5.6 (9.4)              A–B–C–D
ES d             0.83         0.69         0.35         0.23         0.57                   A–B–C–D
Categorical outcomes, n (%)
PImod            147 (39.5)   118 (32.0)   63 (21.9)    32 (16.5)    380 (31.5)   42.9***   A–B–C–D
JTRCI            221 (59.4)   197 (53.5)   105 (36.5)   63 (35.0)    586 (48.6)   51.5***   A–B–C–D
JTCS             124 (33.3)   111 (30.2)   66 (22.9)    23 (12.8)    324 (26.8)   30.5***   A–B–C–D
JTRCI&CS                                                                          65.4***
  Deteriorated   26 (7.0)     28 (7.6)     36 (12.5)    30 (16.7)    120 (9.9)              A–B–C–D
  Unchanged      125 (33.6)   142 (38.7)   147 (51.0)   87 (48.3)    501 (41.5)             A–B–D–C
  Improved       105 (28.2)   95 (25.9)    57 (19.8)    44 (24.4)    301 (24.9)             A–B–D–C
  Recovered      116 (31.2)   102 (27.8)   48 (16.7)    19 (10.6)    285 (23.6)             A–B–C–D

Categories 'deteriorated' and 'unchanged' have a reversed rank order: the fewer patients, the better the outcome
ES Cohen's d, JTRCI reliable change, JTCS clinical significance, JTRCI&CS reliable change and clinical significance combined
* p < 0.05, ** p < 0.01, *** p < 0.001
Analysis of variance for a 2 (time) × 4 (provider) design indicated a significant effect of time, F(1, 1203) = 356.5, p < 0.001, η² = 0.23, a significant effect of provider, F(3, 1203) = 21.2, p < 0.001, η² = 0.05, and a significant interaction effect, F(3, 1203) = 19.1, p < 0.001, η² = 0.05, meaning that the change over time differed among the providers. Pairwise comparisons revealed that A and B outperformed C and D. The rank order is the same with the various
approaches, but varies somewhat among the four categories of the JTRCI&CS approach (with providers C and D trading places). For the total sample, almost half the patients improved or recovered after one year of service delivery; about 10 % deteriorated (similar numbers are found in the CBCL sample). With provider A, 59.4 % of the patients benefited and 7.0 % deteriorated; with provider D, only 35.0 % benefited and 16.7 % deteriorated. Figure 2 presents the proportions of patients in the four outcome categories graphically for each provider.
Fig. 2 Clinical results according to JTRCI&CS for normalized T-scores of the SDQ (stacked bars per provider showing the proportions of recovered, improved, unchanged, and deteriorated patients)
Discussion
The results revealed consistent differences in outcome between providers and, by and large, the various methods converged. Each index showed a similar assignment of patients to outcome categories, as evidenced by the concordance among indicators, especially between ΔT, PImod, and JTRCI&CS. Also, a similar ranking of providers was found by the various methods.
Hiller et al. [15] proposed the PImod method as an alternative to the RCI, as it takes differences in pretreatment severity into account. The present findings support this contention partially, as PImod was less strongly associated with pretest level than ΔT was. In addition, Hiller et al. claimed that the RCI leads to a more conservative categorization of patients as compared to the PI method. Comparing the performance of the RCI and PImod in this study does not support their claim, as the percentage of patients reaching the RCI criterion (about 48 %) exceeded that for PImod (about 32 %). Methodologically, percentage change based upon an interval or ordinal measurement scale is flawed, and the solution chosen by Hiller et al. of using the cutoff between the functional and the dysfunctional population as a proxy for a zero point invokes new problems, as it makes improvement in excess of 100 % likely for many patients (14 and 23 % for the CBCL and the SDQ, respectively). Given these drawbacks, we recommend against the use of PImod.
An advantage of the categorical approach is that it provides a more easily interpretable presentation of the comparison of provider effectiveness than differences in pre-to-posttest change do. Moreover, the categorical approach denotes differences in clinically meaningful terms (e.g., provider 1 outperforms provider 3 in the proportion of patients that benefited from treatment: improvement or recovery, 55.3 vs. 40.9 %). As the
categorical outcomes (PImod and JT) and the continuous
outcomes (DT or Cohen’s d) largely converged, one could
conclude that the categorical approach is a good alternative
for continuous outcomes.
As was pointed out in the introduction, transforming continuous data into categories diminishes their informative value, may not be fitting for an inherently continuous phenomenon [25], and has negative consequences for the statistical power to detect differences. Fedorov, Mannino, and Zhang [9] demonstrated that with a "trichotomous" categorization the loss of power would still be substantial (19 %). Statistical power increases gradually with the use of more categories. Conversely, the ease of interpretation diminishes as the number of outcome categories increases. The JTRCI&CS approach applied to the SDQ data nicely illustrates this point: it can be problematic to compare four providers on four outcome categories, and ranking is not straightforward. With JTRCI&CS, ranking providers depends on the focus: Do we consider primarily treatment successes, i.e., the proportion of recovered or improved patients? Or do we focus on failures, i.e., unchanged or deteriorated patients? Alternatively, it could be argued that the most extreme classifications are key to ranking providers. This cannot be decided beforehand, which makes the interpretation of a multi-categorical outcome indicator more subjective than comparing providers on average pre-to-posttest change. Hence, using more than two categories may successfully preserve statistical power, but it also diminishes the plainness of the results.
Strengths
The present study used real-life data, containing the treatment outcomes of young patients submitted to the database
of SBG. Six different outcome indicators were investigated
and compared to denote treatment outcome on two
frequently used outcome measures, the CBCL and the
SDQ. Both questionnaires seemed equally suitable to
assess pre-to-post differences and compare outcome indicators. This observation corresponds to other research
comparing the CBCL and SDQ, which ascertained that
both questionnaires correlate highly with one another [12].
Limitations
The aim of the present study was to compare various
indicators on their ability to differentiate between mental
healthcare providers regarding their average treatment
outcome. As such, the present study compares methods, not
mental healthcare providers. Data per provider were far
from complete (leaving ample room for selection bias),
treatment duration was limited to one year, and no correction
was applied for potential confounders. Furthermore, variations in outcome measurement (e.g., questionnaire use or
parental respondent) or timing of assessments may have
differential effects on outcome and were not controlled for
in this study. Hence, the present report should not be
regarded as a definitive report on the comparison of mental
healthcare providers or on the effectiveness of Dutch
mental healthcare for children and adolescents, but merely
as an investigation of various methods of comparing
treatment outcomes.
Secondly, the present study lacks a criterion or 'gold standard' that can be used to test the validity of each outcome indicator. Indicators were compared solely among each other, and a similar rank order is more easily obtained with a limited set of three or four providers than in countrywide benchmarking (currently, 175 providers submit their outcome data to SBG). In future studies, we may assess and compare the predictive value of each indicator for an external criterion, such as functioning at long-term follow-up, readmission, or continued care.
In sum, we conclude that outcome is best operationalized as a continuous variable: preferably the pre-to-post difference on normalized T-scores (ΔT), as this outcome indicator is not unduly influenced by pretest variance in the sample. Continuous outcome is superior to categorical outcome, as change scores have more statistical power and allow for a ranking of providers at first glance. This performance indicator can be supplemented with categorical outcome information, preferably the JTRCI&CS method, which presents outcome information in a manner that is more informative to clinicians and patients or their parents.
Acknowledgments The authors are grateful to the following mental
health providers for allowing us to use their outcome data: De
Viersprong, Mentaal Beter, Mutsaersstichting, OCNR, Praktijk Buitenpost, Yorneo, and Yulius.
Conflict of interest No conflicts declared.
References
1. Achenbach TM (1991) Manual for the child behavior checklist
4–18 and 1991 profiles. Department of Psychiatry, University of
Vermont, Burlington
2. American Psychiatric Association (1994) Diagnostic and statistical manual of mental disorders IV. Author, Washington, DC
3. Bauer S, Lambert MJ, Nielsen SL (2004) Clinical significance methods: a comparison of statistical techniques. J Pers Assess 82:60–70
4. Berwick DM, Nolan TW, Whittington J (2008) The triple aim:
care, health, and cost. Health Aff 27:759–769
5. Bickman L, Kelley SD, Breda C, de Andrade AR, Riemer M
(2011) Effects of routine feedback to clinicians on mental health
outcomes of youths: results of a randomized trial. Psychiatr Serv
62:1423–1429
6. Carlier IV, Meuldijk D, van Vliet IM, van Fenema E, van der
Wee NJ, Zitman FG (2012) Routine outcome monitoring and
feedback on physical or mental health status: evidence and theory. J Eval Clin Pract 18:104–110
7. Cohen J (1988) Statistical power analysis for the behavioral
sciences. Lawrence Erlbaum Associates, Hillsdale
8. de Beurs E, den Hollander-Gijsman ME, van Rood YR, van der
Wee NJ, Giltay EJ, van Noorden MS, van der Lem R, van Fenema E, Zitman FG (2011) Routine outcome monitoring in the
Netherlands: practical experiences with a web-based strategy for
the assessment of treatment outcome in clinical practice. Clin
Psychol Psychother 18:1–12
9. Fedorov V, Mannino F, Zhang R (2009) Consequences of
dichotomization. Pharm Stat 8:50–61
10. Fleiss JL (1973) Statistical methods for rates and proportions.
Wiley, New York
11. Goodman R (2001) Psychometric properties of the strengths and
difficulties questionnaire. J Am Acad Child Adolesc Psychiatr
40:1337–1345
12. Goodman R, Scott S (1999) Comparing the strengths and difficulties questionnaire and the child behavior checklist: is small
beautiful? J Abnorm Child Psychol 27:17–24
13. Hall CL, Moldavsky M, Baldwin L, Marriot M, Newell K, Taylor
J, Hollis C (2013) The use of routine outcome measures in two
child and adolescent mental health services: a completed audit
cycle. BMC Psychiatr 13:270
14. Hermann RC, Rollins CK, Chan JA (2007) Risk-adjusting outcomes of mental health and substance-related care: a review of
the literature. Harv Rev Psychiatr 15:52–69
15. Hiller W, Schindler AC, Lambert MJ (2012) Defining response and
remission in psychotherapy research: a comparison of the RCI and
the method of percent improvement. Psychother Res 22:1–11
16. Hudziak JJ, Achenbach TM, Althoff RR, Pine DS (2007) A
dimensional approach to developmental psychopathology. Int J
Methods Psychiatr Res 16:S16–S23
17. Jacobson NS, Follette WC, Revenstorf D (1984) Psychotherapy
outcome research: methods for reporting variability and evaluating clinical significance. Behav Ther 15:336–352
18. Jacobson NS, Truax P (1991) Clinical significance: a statistical
approach to defining meaningful change in psychotherapy
research. J Consult Clin Psychol 59:12–19
19. Jaeschke R, Singer J, Guyatt GH (1989) Measurement of health
status: ascertaining the minimal clinically important difference.
Control Clin Trials 10(4):407–415
20. Kelley SD, Bickman L (2009) Beyond routine outcome monitoring: measurement feedback systems (MFS) in child and adolescent clinical practice. Cur Opin Psychiatr 22:363–368
21. Klugh HE (2006) Normalized T Scores. In: Kotz S, Read CB,
Balakrishnan N, Vidakovic B (eds) Encyclopedia of statistical
sciences, 2nd edn. Wiley, New York
22. Knaup C, Koesters M, Schoefer D, Becker T, Puschner B (2009)
Effect of feedback of treatment outcome in specialist mental
healthcare: meta-analysis. Br J Psychiatry 195:15–22
23. Kraemer HC, Morgan GA, Leech NL, Gliner JA, Vaske JJ,
Harmon RJ (2003) Measures of clinical significance. J Am Acad Child Adolesc Psychiatry 42:1524–1529
24. Lambert M (2007) Presidential address: what we have learned
from a decade of research aimed at improving psychotherapy
outcome in routine care. Psychother Res 17:1–14
25. Markon KE, Chmielewski M, Miller CJ (2011) The reliability
and validity of discrete and continuous measures of psychopathology: a quantitative review. Psychol Bull 137:856–879
26. McGlinchey JB, Atkins DC, Jacobson NS (2002) Clinical significance methods: which one to use and how useful are they?
Behav Ther 33:529–550
27. Muris P, Meesters C, van den Berg F (2003) The strengths and
difficulties questionnaire (SDQ). Eur Child Adolesc Psychiatr
12:1–8
28. Norman GR, Sloan JA, Wyrwich KW (2003) Interpretation of
changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care 41:582–592
29. Porter ME, Teisberg EO (2006) Redefining healthcare: creating
value-based competition on results. Harvard Business Press,
Cambridge
30. Russel M (2000) Summarizing change in test scores: shortcomings of three common methods. Practical Assessment, Research
& Evaluation 7
31. Sloan JA, Cella D, Hays RD (2005) Clinical significance of
patient-reported questionnaire data: another step toward consensus. J Clin Epidem 58:1217–1219
32. van Widenfelt B, Goedhart A, Treffers P, Goodman R (2003)
Dutch version of the strengths and difficulties questionnaire
(SDQ). Eur Child Adolesc Psychiatr 12:281–289
33. Verhulst FC, Van der Ende J, Koot HM (1996) Handleiding voor
de CBCL/4-18 [Dutch manual for the CBCL/4-18]. Department
of Child and Adolescent Psychiatry, Erasmus University, Sophia
Children’s Hospital, Rotterdam
34. Walton K, Ormel J, Krueger R (2011) The dimensional nature of
externalizing behaviors in adolescence: evidence from a direct
comparison of categorical, dimensional, and hybrid models.
J Abnorm Child Psychol 39:553–561
35. Wilder J (1965) Pitfalls in the methodology of the law of initial
value. Am J Psychother 19:577–584
36. Wise EA (2004) Methods for analyzing psychotherapy outcomes:
a review of clinical significance, reliable change, and recommendations for future directions. J Pers Assess 82:50–59