Rheumatology 1999;38:1213–1220
Reading radiographs in chronological order,
in pairs or as single films has important
implications for the discriminative power of
rheumatoid arthritis clinical trials
D. van der Heijde, A. Boonen, M. Boers2, P. Kostense1
and S. van der Linden
Department of Internal Medicine, Division of Rheumatology, University Hospital
Maastricht, Maastricht and 1Department of Epidemiology and Biostatistics,
Medical Faculty, Vrije Universiteit, Amsterdam, The Netherlands
Abstract
Objective. To determine the influence of reading series of films in chronological order, in
pairs with unknown time sequence, or as single films, on precision and sensitivity to change.
Methods. Two studies were performed with 10 and 12 patients fulfilling the American
College of Rheumatology criteria. In Study 1, two sets of films with a 1 yr interval were
scored in chronological order, in pairs, and as single films. In Study 2, four sets of films, with
a 1 yr interval each, were scored in chronological order, as single films and as single-pair
(right and left together). All films were scored with the Sharp/van der Heijde method by two
independent observers. Data were analysed with a repeated measures ANOVA using a full
mixed effects model. Two generalizability (G) coefficients were constructed for reliability and
for change.
Results. Study 1: the interobserver reliability was similar for the three methods (G_reliability:
chronological 0.94, paired 0.88, single 0.93); progression was a mean increase (averaged over
patients, observers and methods) from 26 to 37 (P = 0.046). The sensitivity for change was
greater for the chronological than for the paired and single scoring (G_change: 0.39, 0.22 and
0.24, respectively). Study 2: the interobserver reliability was 0.86 for chronological, 0.76 for
0.24, respectively). Study 2: the interobserver reliability was 0.86 for chronological, 0.76 for
single-pair and 0.91 for single readings. Significantly more progression was measured with the
chronological compared with the single-pair and single methods (15.9 vs 8.5 and 8.3; P =
0.0001). A constant progression was suggested by chronological reading, in contrast to a
stabilization in the other two methods after 1 yr.
Conclusion. Reading films in chronological order is most sensitive to change in a time
period up to 3 yr follow-up; this was already present after 1 yr, but even more pronounced
with longer follow-up.
K : Radiographs, Reading order, Paired/single films, Rheumatoid arthritis,
Clinical trials.
Submitted 13 July 1998; revised version accepted 4 June 1999.
Correspondence to: D. van der Heijde, Department of Internal
Medicine, Division of Rheumatology, University Hospital Maastricht,
PO Box 5800, 6202 AZ Maastricht, The Netherlands.
2 Present address: Department of Clinical Epidemiology, VU
University Hospital, Amsterdam, The Netherlands.
© 1999 British Society for Rheumatology
Structural damage as seen on radiographs is an important feature in rheumatoid arthritis (RA). The ability to
sustain joint structure and functional capacity determines
whether a drug has disease-controlling capacity [1].
Radiographs are included in the core set of measures to
evaluate trials with a duration of 1 yr and longer [2].
Many scoring methods for the assessment of radiographic
damage are available [3]. Probably the most
well known and widely used are the Larsen and Sharp
methods with their modifications [4–7]. In the past,
various aspects of the validity of these assessments have
been addressed, such as which view should be used,
which abnormalities should be scored, aspects of intra- and
interobserver reliability, and so on [3]. One of the
issues that has not yet received a lot of attention is in
which order films should be scored to measure progression.
Many possibilities exist, e.g. (1) films grouped per
patient (all available radiographs of the hands and feet
of one particular patient) ordered chronologically
(‘chronological’); (2) films grouped per patient, but in
random time order (‘paired’); (3) films grouped per
region (e.g. both hands) from a particular patient at a
single point in time (‘single-pair’); (4) single films without
any grouping or ordering, i.e. all films of all patients
mixed randomly (‘single’). Theoretically, there are
advantages and disadvantages for any of these possibilities.
Scoring in chronological order probably provides
the most information to the reader. This may help to
reduce measurement error introduced solely by variation
in positioning or quality of the films. However, it could
also introduce bias as the observer may expect progression
of damage over time. In contrast, the reading of a
single film at a time is unbiased, but probably more
prone to measurement error. The advantages and disadvantages
of the other methods range between these two
extremes.
Recently, we carried out a non-standardized literature
review on the methods to score films in randomized
clinical trials. This survey revealed some interesting
facts. A number of therapeutic trials did not state in
what order films were scored. One trial used the method
of scoring single films; half of the remaining trials scored
in chronological order, the other half used paired scoring
but provided no information on timing. Studies comparing
methods were also scarce. Fries et al. [8] showed
that if films were read in pairs compared with single
films, precision was greater. Recently, two Italian groups
assessed the influence of reading the films in chronological
order, in pairs or as single films [9, 10]. Films of
hands and feet were read with the Larsen method in
one study and films of hands with the Sharp method in
the other study. Both groups conclude that paired
reading is preferable, although they draw their conclusions
on completely different grounds.
At the time of the above-mentioned publications, we
had performed two studies to evaluate these issues. The
aim of our studies was to assess the possible influence
of knowledge about the chronology. In these two studies,
films were scored in chronological order, in pairs, as
single-pairs (hands or feet) and single (one hand or
foot). The first study had films of two points in time
per patient, the second study of four points in time. As
far as we are aware, this is the first study with more
than two points in time per patient and also the first
one scored with the Sharp/van der Heijde method including
hands and feet (Table 1). We chose this method
because it seems to be the most sensitive and we have a
lot of experience with it, although it has the disadvantage
of being more time consuming than the Larsen
method [3].
Table 1. Schedule of the methods used in the present and published studies
                                       Ferrara et al. [9]        Salaffi and Carotti [10]   Study 1                Study 2
Scoring method                         Larsen                    Sharp                      Sharp/van der Heijde   Sharp/van der Heijde
Number of patients                     284                       100                        10                     12
Number of points in time per patient   2                         2                          2                      4
Films                                  Hands and feet            Hands                      Hands and feet         Hands and feet
Observers                              Three (consensus score)   Two independent            Two independent        Two independent
Chronological                          +                         +                          +                      +
Paired                                 +                         +                          +                      Not done
Single-pair                            Not done                  Not done                   Not done               +
Single                                 +                         +                          +                      +
Chronological: all hands and feet of the same patient of 2 (or more)
points in time available together; left and right hand (or foot) of the
same point in time are presented together; the order in time is known.
Paired: all hands and feet of the same patient of 2 (or more) points
in time available together; left and right hand (or foot) of the same
point in time are presented together; the order in time is unknown.
Single-pair: both hands (or feet) are presented, time and patient in
random order.
Single: one single hand (or foot); time and patient in random order.
Patients and methods
Two studies were performed. First, patients, methods
and results will be described for the first study, thereafter
for the second study. The details of both studies are
presented in Table 1. All patients fulfilled the 1987
American College of Rheumatology criteria for RA and
had a disease duration of <1 yr at the start. Films were
made at the start and after 1 yr of follow-up. Films were
selected for high and low scores at the first film, and for
having high and low progression between the two films.
The films were made in posteroanterior view, copied
three times and blinded, and scored by three different
methods. The order of patients, as well as the order of
application of the methods in a particular patient, were
also randomized (random number table). Selection,
ordering and blinding were performed by one of us
(MB) not involved in scoring.
All films in both studies were scored by two experienced
readers (DvdH and AB) independently by the
Sharp/van der Heijde method [3, 7]. Erosions were
scored in 32 joints in the hands and in 12 joints in the
feet with a maximum of 5 per joint in the hands and 10
in the feet. Erosions were scored according to the surface
of the joint involved. Joint space narrowing was graded
from 0 to 4 in 30 joints in the hands and in 12 joints in
the feet. This results in a total damage score that can
range from 0 to 448. When scored in true chronological
order, scores cannot decrease (‘once an erosion, always
an erosion’). This rule is the same as in the published Sharp/
van der Heijde modification, which has since been applied in
many studies [3, 7]. Erosions and joint space narrowing
were summed to obtain the total score. The results of
the total score are presented, unless stated otherwise.
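The stated maximum of 448 follows directly from the joint counts and per-joint maxima given above. A quick arithmetic check, as a sketch only (the variable names are ours and are not part of the scoring method):

```python
# Joint counts and per-joint maxima as stated in the text above.
EROSION_JOINTS_HANDS, EROSION_MAX_HANDS = 32, 5
EROSION_JOINTS_FEET, EROSION_MAX_FEET = 12, 10
JSN_JOINTS_HANDS, JSN_JOINTS_FEET, JSN_MAX = 30, 12, 4

# Maximum erosion score: hands plus feet.
max_erosion = (EROSION_JOINTS_HANDS * EROSION_MAX_HANDS
               + EROSION_JOINTS_FEET * EROSION_MAX_FEET)
# Maximum joint space narrowing score: same grade range in hands and feet.
max_jsn = (JSN_JOINTS_HANDS + JSN_JOINTS_FEET) * JSN_MAX
max_total = max_erosion + max_jsn

print(max_erosion, max_jsn, max_total)  # 280 168 448
```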
Statistical analysis
The main analyses concentrated on the total damage
score. Secondary analyses were performed for erosions
and narrowing scores, respectively. Results are expressed
as means.
Study 1
Patients and methods
In Study 1, two sets of hand and foot films of 10
patients, with a 1 yr interval between the first and second
set, were scored. In this study, the three scoring methods
were as follows. (1) Single: a single hand or foot is
presented to the reader. The time, patients and other
radiographs of the same patient at that point in time
are in random order. (2) Paired: all hands and feet of
the same patient of both points in time are presented
together. The left and right hand and foot of the same
point in time are kept together. The order in time is
unknown to the reader. (3) Chronological: all hands
and feet of the same patient of both points in time are
presented together. The left and right hands and feet of
the same point in time are kept together. The order in
time is known to the reader.
Statistical analysis
The data were analysed with a repeated measures ANOVA; a full
mixed effects model was constructed with patient
(n = 10) and order of application of method (n = 6) as
random factors, and method (n = 3), observer (n = 2)
and time (n = 2) as fixed factors. In order to check the
assumptions of the ANOVA (i.e. normality of the distribution of the residuals and constancy of variance of
residuals), we plotted residuals vs fitted values. We found
no substantial deviations from the assumptions of
ANOVA. As order of application of method did not
change the findings of the analysis, the results will be
presented without this factor, for clarity.
For each method, separate ANOVA tables yielded
mean square estimates that were used to calculate
expected mean squares and, from these, variance estimates. Two generalizability (G) coefficients were constructed to express the efficiency of the three scoring
methods; in each, the numerator identifies the signal
and the denominator the sum of signal and noise.
G coefficients can range from 0 to 1, and usually values
above 0.80 are considered as good.
Per method, interobserver reliability was expressed as
the G coefficient (VC is the variance component) [11]:
\[
G_{\mathrm{interobs}} = \frac{VC_{\mathrm{pat}} + VC_{\mathrm{time,pat}}}{VC_{\mathrm{pat}} + VC_{\mathrm{time,pat}} + VC_{\mathrm{observer,pat}} + VC_{\mathrm{observer,time,pat}}}
\]
Also per method, sensitivity to change was expressed as
the G coefficient:
\[
G_{\mathrm{change}} = \frac{VC_{\mathrm{time}} + VC_{\mathrm{observer,time}}}{VC_{\mathrm{time}} + VC_{\mathrm{observer,time}} + VC_{\mathrm{time,pat}} + VC_{\mathrm{observer,time,pat}}}
\]
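Both coefficients have the same shape: signal variance divided by signal plus noise variance. The sketch below spells this out as two small functions. The function and argument names are ours, and the variance components in the example are hypothetical values chosen only to illustrate the arithmetic; they are not estimates from this study.

```python
def g_interobs(vc_pat, vc_time_pat, vc_obs_pat, vc_obs_time_pat):
    """Interobserver reliability: patient-related variance is the signal."""
    signal = vc_pat + vc_time_pat
    noise = vc_obs_pat + vc_obs_time_pat
    return signal / (signal + noise)


def g_change(vc_time, vc_obs_time, vc_time_pat, vc_obs_time_pat):
    """Sensitivity to change: time-related variance is the signal."""
    signal = vc_time + vc_obs_time
    noise = vc_time_pat + vc_obs_time_pat
    return signal / (signal + noise)


# Hypothetical variance components (illustration only).
example = g_change(vc_time=30.0, vc_obs_time=2.0,
                   vc_time_pat=40.0, vc_obs_time_pat=8.0)
print(round(example, 2))  # 0.4
```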
For comparability with other reports, signal-to-noise
ratios were also calculated by dividing the square root
of the signal variance (VC_time + VC_observer,time), i.e. the
signal S.D., by the square root of the noise variance
(VC_time,pat + VC_observer,time,pat), i.e. the noise S.D.
However, the S.D. values are not directly comparable
with those of other reports because our model is more
complex, and the S.D. estimates are composed of more
terms.
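Because the G coefficient for change is the signal variance over the sum of signal and noise variance, the signal-to-noise ratio defined above can be recovered from it as the square root of G/(1 − G). A minimal sketch of that conversion follows; the function name is ours, and the inputs are the G coefficients for change reported in the Results of Study 1.

```python
import math


def snr_from_g(g):
    """Signal S.D. divided by noise S.D. for g = S / (S + N), with 0 <= g < 1."""
    return math.sqrt(g / (1.0 - g))


for method, g in [("chronological", 0.39), ("paired", 0.23), ("single", 0.22)]:
    print(method, round(snr_from_g(g), 2))
```

This gives roughly 0.80, 0.55 and 0.53, in line with the reported ratios of 0.79, 0.55 and 0.53; a small discrepancy is to be expected because the G coefficients are themselves rounded.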
Results
The two observers found the single method less time
consuming, but also stated that the paired and chronological readings offered the possibility to compare right
and left, or baseline and follow-up, and to assess the
quality of films. The interobserver reliability (expressed
as G coefficients) of the absolute scores was good and
not influenced by the method (chronological 0.94, paired
0.88 and single 0.93).
Table 2 shows the ANOVA table. The main effect of
time, i.e. progression, was a mean increase in the joint
score (averaged over patients, observers and methods)
from 26 to 37 (P = 0.046). The main effect of observer
(averaged over patients, time and methods) was highly
significant because one observer scored consistently
higher than the second observer (P = 0.003). The main
effect of method (averaged over patients, time and
observers) was not significant (P = 0.34). The two-factor
interaction effects show that the difference in absolute
scores between the two observers (i.e. averaged over
time) was not considerably influenced by the method
(method × observer): chronological 6.1, paired 8.2 and
single 6.8 (P = 0.706). Likewise, the effect of time (progression) averaged over methods did not differ significantly between the observers (observer × time): 10 vs 13
(P = 0.10). However, the effect of time averaged over
observers differed significantly between the methods
(method × time): chronological 14.6, paired 10.4, single
9.4 (P = 0.027). Finally, the three-factor interaction
(method × observer × time) was not significant (P =
0.34), indicating that the interaction between the
method and time was not significantly different between
the two observers.
Table 2. ANOVA table of Study 1 with patient (n = 10) as random
factor, and method (n = 3), observer (n = 2) and time (n = 2) as
fixed factors
Source of variance                    d.f.   Mean square   F       P
Patient                               9      4057.2
Time                                  1      3933.1        5.37    0.046
Time × patient                        9      732.6
Observer                              1      1477.0        15.79   0.003
Observer × patient                    9      93.5
Method                                2      58.0          1.15    0.339
Method × patient                      18     50.5
Method × observer                     2      10.9          0.36    0.706
Method × observer × patient           18     30.5
Observer × time                       1      95.4          3.35    0.100
Observer × time × patient             9      28.4
Method × time                         2      76.7          4.47    0.027
Method × time × patient               18     17.2
Method × time × observer              2      16.4          1.13    0.344
Method × time × observer × patient    18     14.5
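The F ratios in Table 2 can be reproduced from the mean squares. The sketch below assumes that each fixed effect is tested against its interaction with the random factor patient; this is our reading of the table rather than something the text states, but it reproduces the reported values to within rounding.

```python
# (effect mean square, error mean square, reported F), all taken from Table 2.
# The error term is assumed to be the effect's interaction with patient.
TABLE2 = {
    "time": (3933.1, 732.6, 5.37),
    "observer": (1477.0, 93.5, 15.79),
    "method": (58.0, 50.5, 1.15),
    "method x observer": (10.9, 30.5, 0.36),
    "observer x time": (95.4, 28.4, 3.35),
    "method x time": (76.7, 17.2, 4.47),
    "method x time x observer": (16.4, 14.5, 1.13),
}


def f_ratio(ms_effect, ms_error):
    """F statistic as the ratio of two mean squares."""
    return ms_effect / ms_error


for effect, (ms_effect, ms_error, reported) in TABLE2.items():
    print(f"{effect}: F = {f_ratio(ms_effect, ms_error):.2f} (reported {reported})")
```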
Figure 1 shows the Bland and Altman plot of the
difference in progression between the two observers per
method [12]. On the y-axis, the difference in progression
as assessed by the two observers is presented; on the
x-axis, the mean of progression as assessed by the two
observers is presented. In the ideal situation, all points
would be situated on or close to y = 0. It can be clearly
seen that there are major differences between the observers in the paired method, but not in the chronological
and single readings. The differences between the observers are quite similar over the full range of measured
progression; there is no clear tendency to increase or
decrease if more progression is observed. From this plot,
it can also be seen that one observer is consistently
scoring higher than the other observer, resulting in more
points in the lower right quadrant.
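The coordinates of such a plot are simple to compute: for each patient, the x value is the mean of the two observers' progression scores and the y value is their difference. A minimal sketch with hypothetical progression scores (the data and function name are ours, for illustration only):

```python
from statistics import mean


def bland_altman_points(progression_obs1, progression_obs2):
    """Return one (mean, difference) pair per patient."""
    return [((a + b) / 2, a - b)
            for a, b in zip(progression_obs1, progression_obs2)]


# Hypothetical progression scores (follow-up minus baseline) for five patients.
obs1 = [4, 10, 0, 22, 15]
obs2 = [6, 9, 2, 25, 18]

points = bland_altman_points(obs1, obs2)
# Average difference between observers; values far from 0 indicate
# that one observer systematically measures more progression.
bias = mean(diff for _, diff in points)
print(points)
print(bias)
```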
The G coefficient for change was 0.39 for chronological, 0.23 for paired and 0.22 for single readings.
Expressed differently, the ratio of signal S.D. to noise
S.D. was 0.79 for chronological, 0.55 for paired and 0.53
for single readings. Given that the S.D. of progression is
around 16 in all three methods (data not shown), the
signal is almost 50% stronger in chronological vs
the other methods. This has major implications for the
power of a study. For example, a trial with 10 patients
per group and an S.D. of 16 (as in this study) has a
calculated power of 47% with chronological reading
(difference 14.6) and for the other methods 24% (paired)
and 21% (single) (differences 10.4 and 9.4, respectively).
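The text does not state how these power figures were calculated. As a rough illustration only, the sketch below uses a normal approximation for a two-sided, two-group comparison at alpha = 0.05; the approximation, the two-group reading of the example and the function name are all our assumptions. It yields somewhat higher values (about 53%, 31% and 26%) than the reported 47%, 24% and 21%, but the same ordering and a similar gap between chronological reading and the other methods.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(difference, sd, n_per_group, alpha=0.05):
    """Two-sided, two-sample power under a normal approximation (assumption)."""
    z = NormalDist()
    # Difference in means divided by its standard error.
    ncp = difference / (sd * sqrt(2.0 / n_per_group))
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


for method, diff in [("chronological", 14.6), ("paired", 10.4), ("single", 9.4)]:
    print(method, round(approx_power(diff, sd=16.0, n_per_group=10), 2))
```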
A secondary analysis showed that both erosions and
narrowing contributed to the effect of time (factor time
for erosions P = 0.06 and for narrowing P = 0.04).
Erosions and narrowing had no significantly different
effect on the method; in other words, the difference
noted in progression between the three methods could
not be attributed to one of the two subscores of the
total score.
F. 1. Bland and Altman plot presenting the difference in progression between the two observers in relation to the mean
progression measured by the two observers for chronological, paired and single readings (Study 1). The chronological and single
readings are distributed equally around a difference of zero, one observer is reading paired sets systematically higher than the
other observer.
Order of reading radiographs
Study 2
Patients and methods
Given the results of the first study, we hypothesized that
the chronological method would be able to detect radiographic differences earlier, but that this effect would
disappear with longer follow-up. This hypothesis was
tested in the second study. Four sets of hand and foot
films of 12 patients (completely different from those in
Study 1), with a 1 yr interval between each of the four
sets, were scored. Because (1) Study 1 did not show
relevant differences between paired and single readings
in assessing progression, (2) the interobserver reliability
was lower for the paired readings, (3) the
observers assumed that single readings were less time
consuming and (4) it might be helpful to have the right
and left hand (or foot) at the same time, it was decided
to investigate the influence of reading in chronological
order vs single vs single with pairs of hands or feet
(‘single-pair’; Table 1). The same readers as in Study 1
were involved. The time for each reading was recorded,
but due to inadequate instruction, one observer noted
the time of the reading in seconds and the other in full
minutes. Therefore, the differences between the observers
are not precise, but the differences between the three
methods for each observer reflect true differences. Thus,
the films were scored by three methods that varied
partially from the first study: (1) chronological; (2)
single-pair: both hands or both feet from the same point
in time are presented together, time and patient are in
random order; (3) single. In this second study, it was
also recorded whether a film, according to one or both
observers, had technical problems that could interfere
with appropriate scoring, such as bad positioning of the
joints on the radiograph.
Statistical analysis
The study question was whether the differences between
the three methods would disappear with longer follow-up.
A repeated measures ANOVA similar to that in
Study 1 was performed. However, only the interobserver
G coefficient was calculated. The G coefficient for change
would be hard to interpret given the multiple time points
and thus many different and interrelated possibilities to
express change (e.g. change 0–1, change 1–2, change
0–2, change 0–3 yr). Again, plots of residuals vs fitted
values indicated that the residuals were normally distributed.
Results
Observer 1 needed on average 11.2 min (range 7.3–21.5)
for a set in chronological order, 11.3 min (4.2–22.0) for
single-pair and 10.7 min (4.8–18.0) for single readings.
However, observer 2 took almost twice as long to score
a set in chronological order (27.7 min; range 20.0–38.0)
compared with single-pair (15.4 min; range 10.5–21.0) and
single reading (18.5 min; range 9.5–30.0).
Technical problems were felt to play a role in 2% of the
films if scored in chronological order vs 13% in single-pair
and 4% in single readings.
The interobserver G coefficients for absolute scores
were 0.86 for chronological, 0.76 for single-pair and
0.91 for single readings.
The main effect of time (averaged over patients,
observers and methods) was significant: the mean initial
score was 19.2, the final (3 yr) score was 30.1
(P = 0.002). Again, the main effect of observer (averaged
over patients and methods) was significant, with a mean
difference of 5.4 (P = 0.03). The main effect of method
was not significant (P = 0.53). As in Study 1, the
difference in absolute scores between the two observers
was not statistically different between the methods:
chronological 3.2, single-pair 5.6 and single 7.3
(P = 0.15). Likewise, the effect of time (progression)
averaged over methods did not differ significantly
between the observers: 11.2 vs 10.5 (P = 0.83). However,
the effect of time averaged over observers differed
significantly between the three methods: chronological
15.9, single-pair 8.5, single 8.3 (P = 0.0001) (Fig. 2).
A constant progression is suggested by chronological
reading, in contrast to a stabilization in the other two
methods after 1 yr. A similar pattern was observed for
both observers (three-factor interaction not significant;
P = 0.57).
A separate analysis of erosions and narrowing showed
comparable results (data not shown). The effect on the
difference in progression between the three methods was
as strong for erosions and narrowing as for total score.
Discussion
Our studies have shown that scoring films in chronological order results in higher progression rates and a
better signal-to-noise ratio than scoring films in pairs or
singly. This was evident after 1 yr, but even more
pronounced with longer follow-up. From Study 1, it
was obvious that the results between paired and single
scoring were very similar in respect to progression rates.
However, paired scoring performed worse in terms of
precision. This could be seen in both G coefficients of
change and reliability. In the second study, it became
clear that there is no relevant difference between reading
according to the single and the single-pair method.
Our data are based on small numbers of patients. In
general, our position is that non-significant results from
small studies should be considered with great caution,
with an eye open for the possibility that lack of power,
rather than absence of effects, may be the explanation.
However, if a relatively small study yields convincing
(significant) results, as our study does, it can safely be
concluded in hindsight that the study, relatively small
as it is, had sufficient power to demonstrate the effects
of interest. Moreover, although our study is indeed
small in number of patients included, we believe that
our design is highly efficient; this may explain why the
power (precision) of our study turns out to be greater
than might be expected at first sight.
Fig. 2. Influence of order of scoring on progression over 3 yr (average of two observers, Study 2).
Recently, two other papers dealing extensively with
this methodological issue were published (Table 1)
[9, 10]. The first is a letter to the editor on 100 patients
with early RA and 18 month follow-up films [10]. Hand
films were scored according to Sharp by two independent
readers. They found a higher progression rate in chronological readings, followed by the paired and single
readings. The interobserver reliability of the absolute
scores for the total score was greater for the paired
readings (0.90–0.93) than for the chronological
(0.82–0.85) and single (0.76–0.80) readings. However,
the differences in interobserver reliability for the progression scores were only marginal between the chronological and paired readings (0.67 and 0.64, respectively),
but greater than the single reading (0.54). They also
calculated a signal-to-noise ratio, but this ratio is not
directly comparable with ours because they used a simple
two-factor model without interaction terms, as opposed
to our full three-factor model. Their ratios, calculated
as intrapatient S.D. divided by inter-rater S.D., were 4.8,
3.8 and 2.3 for the chronological, paired and single
methods, respectively. The higher ratio of chronological
scoring compared with that of single scoring is in
agreement with our finding, although the results of their
paired reading were better than ours. They concluded
that paired reading is the preferred method because of
the higher sensitivity to change, better interobserver
reliability, and conservative reading if the order is not
known.
Comparing our results on interobserver reliability
with the above cited study shows that the interobserver
reliability on absolute scores was somewhat higher in
our study (chronological 0.94, paired 0.88 and single
0.93), and that in our study it was only
slightly worse for the paired readings compared with
chronological and single readings.
The second study concerns 284 RA patients with early
disease, with films at baseline and after 12 months.
Hand and foot films were scored according to Larsen
by a panel of three observers that provided a consensus
score (Table 1) [9]. The authors found that the S.D. of
the paired readings was smallest. However, the mean
change with this method was also smallest. A more
appropriate way is to compare the coefficients of variation (CVs), relating the S.D. to the change.
For progression in eroded joint counts, they were 1.46, 1.64 and
1.93 for the chronological, paired and single readings,
respectively, and for progression in damage score 1.38,
1.72 and 2.47, respectively. Expressed in this way, the
chronological method is the one to prefer, with the
greatest sensitivity to change (and therefore most powerful in a trial).
The authors also performed a bootstrap
analysis, give most weight to this final analysis and
prefer paired readings.
Neither study provided information on the order
in which the three methods were applied:
totally random, or first all films with one method,
thereafter with the second and lastly with the third
method. Only the second paper stated that the films of
the same patient were scored in different sessions. If the
order of the methods was not randomized, this could have
influenced the results. We copied the films in order to
be able to randomize the orders and showed that
ordering did not influence the results. The use of copies
may have influenced precision overall, but with all
methods read on copies, it is unlikely that the quality
of the films biased the differences between the methods
in any way.
The methodological distinctions between our two
studies and the two studies mentioned before are as
follows: (1) compared with the study by Salaffi and
Carotti [10], we also included feet (in contrast to hands
alone); (2) compared with the study by Ferrara et al.
[9], we used the Sharp/van der Heijde method (in
contrast to the Larsen method). Sensitivity to change is
reported to be greater for the Sharp/van der Heijde
score compared with the Larsen score, and this also
applies for combined assessment of hands and feet
compared with hands only [3]. So it could be hypothesized that the combination of the most sensitive methods
(Sharp/van der Heijde and hands and feet) distinguishes
best between the various methods of scoring. On the
other hand, the numbers of patients in the other two
studies were much larger than in ours. However, the
repeated measures ANOVA and the application of generalizability theory use all information available, reducing the number of patients needed to draw reliable
conclusions.
Our data on 3 yr follow-up give very interesting and
new additional information. This is especially relevant
because in clinical therapeutic trials increasingly more
than two films are used to evaluate treatment effects.
Differences in absolute scores between the two observers were marked, but there was no difference in
assessing progression. This has been shown before and
is not a major drawback if average scores of the observers are used [13].
The signal-to-noise ratio of chronological scoring was
greater than for paired and single scoring. This results
in a marked difference in power in a clinical trial. The
choice in which order to score the films could have a
similar or even greater impact than the choice of the
scoring method itself. The question we cannot answer
at this moment is whether this higher sensitivity to
change coincides with bias. In other words: more signal,
but less valid, i.e. a false signal? We hypothesized that
the chronological method would be able to detect differences earlier, but that this would disappear with longer
follow-up. This was tested in the second study. However,
the differences became even more apparent. Also, the
chronological order suggested progression in all three
periods of follow-up, whereas the single and single-pair
methods suggested progression only during the first
period of follow-up. Thus, we still do not have the
definite answer as to which method is measuring the
truth. To answer this, we need a gold standard, which
is in fact currently not available for radiographic
damage. Therefore, we need to use a surrogate gold
standard, another external criterion for damage. We
think that sonography and MRI are also not suitable
because the scoring methods for these techniques have not
yet been validated. One way forward might be to have an
expert panel decide whether
there is a true difference between a set of films, and
subsequently to examine the relationship between this judgement
and the data obtained with the three methods.
An alternative would be to read the
radiographs of a therapeutic trial with known efficacy
with the three methods.
What type of information do we have available
depending on the radiographs that are present at the
same time when scoring the films? Table 3 summarizes
the different sources of information that can influence
the scoring and in which type of scoring method these
are present. Sources of information which can influence
the scoring are: (1) the identical joint at a different time;
(2) contralateral joint at the same point in time; (3)
other joints in different regions. This information can
be helpful in scoring a joint, e.g. a change in positioning
can be seen in (1); an anatomical variation can be seen
in (2). However, these various sources of information
can also introduce bias, e.g. if the joints in the feet do
not show erosions, the expectation could be that no
erosions will be present in the hands. The balance
between positive extra information, which leads to
reduction in measurement error, and negative extra
information, which leads to the introduction of bias,
should be investigated for the various sources of
information.
The rule that the scores cannot decrease in chronological scoring means that error can only be in one
direction: overestimation. This was the rule as it was
published with the modification of the Sharp method
[3, 7]. This was chosen because in series of films it was
regularly seen that on film 1 there was an erosion, which
was not present on film 2, but was present again on film
3. So, it was obvious that even though the erosion could
not be seen on film 2, it must have been present. It
would be worthwhile to see what would happen if this
rule was not applied. One obvious effect would be the
introduction of more measurement error.
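One way to express the rule that scores cannot decrease is as a running maximum over a joint's scores in time order. The text does not say how the readers applied the rule in practice, so the sketch below is our illustration only, with a hypothetical function name.

```python
from itertools import accumulate


def enforce_non_decreasing(scores_over_time):
    """Carry the highest score seen so far forward through the series."""
    return list(accumulate(scores_over_time, max))


# Erosion seen on film 1, not seen on film 2, seen again on film 3
# (the situation described above).
print(enforce_non_decreasing([1, 0, 1]))  # [1, 1, 1]
```

Under this rule any reading error can only push the later scores upwards, which is why the error is in one direction only.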
It might very well be that the chronological order is
biased and overestimates progression of damage, i.e. the
signal is false. On the other hand, the films were selected
to show progression of damage in at least some of the
sets. With the single and single-pair methods, on average
no progression at all could be detected in the second
and third year. One might really doubt whether this
occurrence is realistic. The interpretation could also be
that non-chronological scoring introduces measurement
error by limiting the information the reader gets, and
that the signal is lost in the noise.
Table 3. Possible sources of information when reading radiographs on the presence of abnormalities (e.g. films of the hands)
Source of information                                      Study 1   Study 2
Identical joint at a different time
  Chronology known to investigator: ‘chronological’        +         +
  Time sequence unknown to investigator: ‘paired’          +         −
Contralateral joint (same point in time)
  Available: ‘single-pair’                                 −         +
  Not available: ‘single’                                  +         +
  (the contralateral joints are usually available with ‘chronological’ or ‘paired’ reading)
Other joints in different regions (e.g. feet)              +         +
  (usually combined with either ‘chronological’ reading or ‘paired’ reading)
In conclusion, reading films in chronological order is
most sensitive to change, but it cannot yet be excluded
that it overestimates the progression of damage. In
randomized clinical trials where sensitivity to change is
pivotal, we would advise scoring chronologically. In this
situation, bias (if present) will be in the same direction
in both arms of the trial, assuming films are read in a
blinded fashion. Effective treatment will show a measurable reduction in the progression of joint damage compared with the less
(or not) effective treatment arm. In observational studies
with many potential sources of bias, the choice is more
difficult. Chronological reading could be chosen as the
most sensitive scoring method and to ensure comparability with the results of clinical trials. On the other hand,
the reduction of possible bias could be an argument to
choose paired, single-pair or single reading, although it
cannot be excluded that these methods are biased in the
opposite direction (i.e. show no change where in fact
one has occurred). More data are needed to make a
final choice for observational studies.
Acknowledgement
We would like to thank Mrs L. Heusschen for her
excellent assistance in the entire project.
References
1. Edmonds JP, Scott DL, Furst DE, Brooks P, Paulus HE.
Antirheumatic drugs: a proposed new classification.
[Editorial] Arthritis Rheum 1993;36:336–9.
2. Boers M, Tugwell P, Felson DT et al. World Health
Organization and International League of Associations
for Rheumatology core endpoints for symptom modifying
antirheumatic drugs in rheumatoid arthritis clinical trials.
J Rheumatol 1994;41(suppl.):86–9.
3. van der Heijde DM. Plain X-rays in rheumatoid arthritis:
overview of scoring methods, their reliability and applicability. Baillière’s Clin Rheumatol 1996;10:435–53.
4. Larsen A, Dale K, Eek M. Radiographic evaluation of
rheumatoid arthritis and related conditions by standard
reference films. Acta Radiol Diagn Stockh 1977;18:481–91.
5. Sharp JT, Young DY, Bluhm GB et al. How many joints
in the hands and wrists should be included in a score of
radiologic abnormalities used to assess rheumatoid arthritis? Arthritis Rheum 1985;28:1326–35.
6. Rau R, Herborn G. A modified version of Larsen’s scoring
method to assess radiologic changes in rheumatoid arthritis. J Rheumatol 1995;22:1976–82.
7. van der Heijde DM, van Riel PL, Nuver Zwart IH,
Gribnau FW, van de Putte LB. Effects of hydroxychloroquine and sulphasalazine on progression of joint damage
in rheumatoid arthritis. Lancet 1989;i:1036–8.
8. Fries JF, Bloch DA, Sharp JT et al. Assessment of
radiologic progression in rheumatoid arthritis. A randomized, controlled trial. Arthritis Rheum 1986;29:1–9.
9. Ferrara R, Priolo F, Cammisa M et al. Clinical trials in
rheumatoid arthritis: methodological suggestions for
assessing radiographs arising from the Grisar Study. Ann
Rheum Dis 1997;56:608–12.
10. Salaffi F, Carotti M. Interobserver variation in quantitative analysis of hand radiographs in rheumatoid arthritis:
comparison of 3 different reading procedures. J Rheumatol
1997;24:2055–6.
11. Streiner DL, Norman GR. Health measurements scales.
A practical guide to their development and use. Oxford:
Oxford University Press, 1995:104–80.
12. Bland JM, Altman DG. Statistical methods for assessing
agreement between two methods of clinical measurement.
Lancet 1986;i:307–10.
13. Sharp JT. Radiologic assessment as an outcome measure
in rheumatoid arthritis. Arthritis Rheum 1989;32:221–9.