
INSTRUMENTATION AND METHODOLOGY
Reliability of Scoring Respiratory Disturbance Indices and
Sleep Staging
Coralyn W. Whitney,1 Daniel J. Gottlieb,2 Susan Redline,3 Robert G. Norman,4 Russell R. Dodge,5 Eyal Shahar,6
Susan Surovec,3 and F. Javier Nieto7
(1) Department of Biostatistics, University of Washington, Seattle, Wash; (2) Department of Medicine,
Boston University School of Medicine, Boston, Mass; (3) Department of Pediatrics, Case Western Reserve
University, Cleveland, Ohio; (4) Department of Medicine, New York University Medical Center, New York,
NY; (5) Respiratory Science Center, University of Arizona, Tucson, Ariz; (6) Division of Epidemiology,
University of Minnesota, Minneapolis, Minn; (7) Department of Epidemiology, School of Hygiene and Public
Health, Johns Hopkins University, Baltimore, Md
Study Objectives: Unattended, home-based polysomnography (PSG) is increasingly used in both research and clinical
settings as an alternative to traditional laboratory-based studies, although the reliability of the scoring of these studies has
not been described. The purpose of this study is to describe the reliability of the PSG scoring in the Sleep Heart Health
Study (SHHS), a multicenter study of the relation between sleep-disordered breathing measured by unattended, in-home
PSG using a portable sleep monitor, and cardiovascular outcomes.
Design: The reliability of SHHS scorers was evaluated based on 20 randomly selected studies per scorer, assessing both
interscorer and intrascorer reliability.
Results: Both inter- and intrascorer comparisons on epoch-by-epoch sleep staging showed excellent reliability (kappa
statistics >0.80), with stage 1 having the greatest discrepancies in scoring and stage 3/4 being the most reliably discriminated. The arousal index (number of arousals per hour of sleep) was moderately reliable, with an intraclass correlation
(ICC) of 0.54. The scorers were highly reliable on various respiratory disturbance indices (RDIs), which incorporate an
associated oxygen desaturation in the definition of respiratory events (2% to 5%) with or without the additional use of associated EEG arousal in the definition of respiratory events (ICC>0.90). When RDI was defined without considering oxygen
desaturation or arousals to define respiratory events, the RDI was moderately reliable (ICC=0.74). The additional use of
associated EEG arousals, but not oxygen desaturation, in defining respiratory events did little to increase the reliability of
the RDI measure (ICC=0.77).
Conclusions: The SHHS achieved a high degree of intrascorer and interscorer reliability for the scoring of sleep stage
and RDI in unattended in-home PSG studies.
Key words: Sleep apnea syndrome; polysomnography; scoring; reliability
POLYSOMNOGRAPHY (PSG) DATA have been collected for over 30 years to assess sleep continuity and—more
recently—to assess sleep-related respiratory disturbances.
Over the last 10 years, interest has grown in the use of PSG
data to characterize the level of exposure to sleep disruption and sleep-disordered breathing as potential risk factors
for cardiovascular diseases. The Sleep Heart Health Study
(SHHS)1 is a multicenter, longitudinal study designed to
relate PSG data to cardiovascular outcomes. Such a study
is feasible in part due to the recent development of
lightweight, portable PSG monitors which have allowed us
Accepted for publication August, 1998
Address correspondence and requests for reprints to Coralyn W. Whitney,
PhD, Department of Biostatistics, Box 358223, University of Washington,
Seattle, WA 98195, e-mail: [email protected]
SLEEP, Vol. 21, No. 7, 1998
749
Reliability of PSG scoring—Whitney et al
to perform 6,440 multichannel sleep recordings in unattended home settings for participants from 10 clinical sites located across the United States. The ability to identify meaningful associations between the PSG measures and cardiovascular outcomes depends on the ability to score relevant PSG parameters reliably. The reliability of measures obtained from unattended, in-home PSG has not been previously described; however, the lack of both a standardized testing environment and a technician monitoring the study and intervening to optimize signal quality suggests that such studies might be more difficult to score reliably than traditional laboratory-based studies. The reliability of scoring of unattended, in-home PSG studies is also of clinical concern, as such studies are being introduced into the practice of sleep medicine.

For SHHS, a central Reading Center (Cleveland, Ohio) was established to process and score all PSGs, with the goal of maximizing interscorer and intrascorer reliability. This required the establishment and documentation of clear scoring rules designed to operationalize Rechtschaffen and Kales2 staging and the American Sleep Disorders Association3 arousal criteria. New approaches for hypopnea identification, utilizing software innovations for categorizing apneas and hypopneas according to the degree of desaturation and the presence or absence of arousals, were established. Scorer training and certification procedures were established to minimize differences among scorers. Early in the study, an assessment of the inter- and intrascorer reliability of PSG data was performed to provide a general indication of the reliability of specific PSG indices highlighted to be of importance for SHHS hypotheses. Respiratory disturbance indices (as measures of sleep-disordered breathing), the arousal index, and the epoch-by-epoch comparison of sleep staging were assessed. This paper reports the results of this study, highlighting differences in the reliability of different indices, and suggesting areas for further development.

METHODS

The objectives and design of the SHHS have been described elsewhere.1,4 Briefly, for the purpose of studying the cardiovascular consequences of sleep apnea, the SHHS recruited participants from ongoing community-based cohort studies of cardiovascular disease and hypertension. These parent cohorts included the Framingham Heart Study (FHS), Atherosclerosis Risk in Communities Study (ARIC), Cardiovascular Health Study (CHS), New York Hypertension Cohorts, Strong Heart Study (SHS), and Tucson's Epidemiological Study (TES) of Airways Obstructive Disease Cohort, and Health & Environment Cohort (H&E). Participants from the parent cohorts were recruited into SHHS without consideration of the presence or absence of a history of sleep apnea. All participants 40 years of age or older were invited to participate in the SHHS, excluding only those individuals who were currently treated with continuous positive airway pressure, oral devices, or home oxygen therapy, and subjects who had had a tracheostomy. Those who agreed to participate in SHHS underwent an unattended PSG study at home. Recruitment for the SHHS was initiated in November 1995 and completed in January 1998.

Equipment

The portable monitor used for the home sleep studies was the Compumedics PS-2 system (Abbotsford, Victoria, Australia). Briefly, this system consists of a 12-channel recording montage: C3/A1 and C4/A2 EEG, right and left EOG, ECG, chin EMG, oxygen saturation (from finger pulse oximetry), chest and abdominal excursion (inductance plethysmography), airflow (a Pro-Tech oral-nasal thermistor), body position, and ambient light. The data collected during the SHHS PSG are therefore typical of the PSG montage used in most clinical laboratories, with the exception that leg movement data are not collected. Software to score the raw data was developed by Compumedics in collaboration with SHHS investigators (see reference 5 for complete details).

Data Collection

Data from a PSG study were collected on a PCMCIA card, then downloaded to computers and transferred to magnetic cartridges at the local field centers. The average PSG study consisted of approximately 20 megabytes of raw data. These cartridges were mailed to the central PSG Reading Center (Case Western Reserve University, Cleveland, Ohio).

Scorers

The SHHS scorers were trained and certified according to rigorous methods.5 Scoring rules were documented in a written Reading Center Manual of Operations,6 which contained examples of typical and atypical events and patterns, and classification schema. Scoring issues, including review of any problematic studies, were reviewed on a weekly basis by the staff, including three scorers, a chief polysomnologist, a study coordinator, a neurologist (diplomate of the American Board of Sleep Medicine), and a pulmonologist. A standardized "Scorer's Notes Form" was also used to identify patterns and scoring issues which were not reported by the software package and could not be identified from the study quality assurance forms (coding signal and overall study quality).4,5 These included notations on the occurrence of patterns of periodic breathing, abnormal eye movements, abnormal wake EEG, alpha intrusion, alpha artifact, and periodic large breaths suggesting an arousal disorder. Scorers also coded each study according to perceived problems with sleep staging and arousal detection. Such problems were identified if the scorer felt uncertain about >10% of the event (arousal) determinations or staging decisions because of signal noise (eg, sweat artifact, electrical interference) or biological disturbances (alpha intrusion).

Studies were scored manually on a high-resolution monitor using 30-second epochs for staging and arousal detection, and using 5-minute epochs for respiratory data. Sleep data were scored without visualization of respiratory channels, and respiratory data were scored without visualization of EEG and EMG channels.

The reliability study occurred during the first year of more than 3 years of PSG scoring. Three of the six scorers employed during the course of the study were on staff at the time of the study. These three scorers were evaluated in this reliability study, and are identified by scorer ID number (912, 914, and 915). Scorer 912 worked halftime and the other two fulltime.

Selection of Studies to Assess Reliability

The SHHS Coordinating Center and Quality Control Subcommittee designed and coordinated the reliability study, which was implemented at the Reading Center from October 1996 through January 1997. Twenty previously scored studies, rated good or better in overall quality,6 were chosen randomly for each of the three scorers, yielding 60 studies to be used in the assessment of reliability. Each study was rescored by the original scorer, and half were rescored by the other two scorers. The previously scored results were used only in the assessment of the intrascorer reliability,* comparing the results from 20 studies scored twice with a median time interval between scorings of 6.5 months. Estimation of interscorer reliability was based on 30 PSG studies scored by all three SHHS scorers. These studies represent the first 6 months of PSG studies performed for SHHS, when approximately 14% of the studies (out of 6,440) had been conducted. Selection was made from those studies scored after a scorer had completed a minimum of 80 studies.

*One study was determined to be nonscorable during review because of identification of excessive artifact on the oximetry channel. Another study was not sleep-staged because of excessive EEG artifact.

Scoring and Definitions

Sleep stages were scored according to the criteria developed by Rechtschaffen and Kales.2 The predominant sleep stage was scored for each 30-second epoch and coded as wake, stage 1, stage 2, stage 3/4 (delta sleep), or REM. Arousals were identified according to the criteria of the American Sleep Disorders Association,3 modified to accommodate situations in which EMG artifact obscures the EEG signal. The arousal index was calculated as the number of arousals per hour of sleep.

Scoring of breathing disturbances was based on identifying changes in breathing amplitude, regardless of associated desaturation or arousal. Potential apneas were identified if airflow was nearly flat (≤25% of baseline, where baseline amplitude is identified during the nearest preceding period of regular breathing with stable oxygen saturation) lasting for at least 10 seconds. Potential hypopneas were identified if airflow or thoracoabdominal excursion was approximately ≤70% of baseline for at least 10 seconds. The software linked each apnea and hypopnea with data from the oximetry channel and EEG channels. Thus, each event was characterized according to its associated desaturation (0 to 5%) and whether an arousal occurred within 3 seconds of the termination of the event. Various measures of respiratory disturbance indices (RDIs), defined as the number of apneas and hypopneas per hour of sleep, were calculated based on varying the requirement for an associated desaturation or arousal.

Assignment of Studies

To minimize study recognition and the potential for scoring recall bias, scorers were blinded as to whether a study they were scoring was a "reliability" study (one previously scored and discussed), with all reliability studies integrated into the scorers' normal workloads. Even though such measures were taken to maintain blinding of the reliability studies, equipment improvements during the first 6 months resulted in some software montage displays being slightly different from the majority of studies being scored along with the reliability studies. Thus, although study identification numbers and dates were modified in an attempt to make these studies look like current studies, it is possible that scorers identified some studies as being reliability studies. However, no changes in workflow or scoring behavior were observed. Order-of-assignment tables were constructed by the SHHS Coordinating Center (Seattle, Wash), providing the Reading Center with the order in which the studies were to be assigned to the three scorers. For the interscorer evaluation, all 30 studies were scored once before cycling through the studies for the second and third scorings. The purpose of this was to ensure that (a) no two scorers would be scoring the same study at the same time; and (b) since the scorers did discuss scoring with each other on a day-to-day basis, the studies would be spread out to minimize recall and identification of "reliability" studies.
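The amplitude, duration, desaturation, and arousal-linkage rules described under Scoring and Definitions can be made concrete in code. The following is an illustrative sketch only, not the Compumedics/SHHS scoring software; the Event structure and all function names are hypothetical.

```python
# Hypothetical sketch of the SHHS event-linking rules (not the actual software).
from dataclasses import dataclass

@dataclass
class Event:
    duration_s: float        # event duration in seconds
    airflow_ratio: float     # event amplitude as a fraction of baseline breathing
    desat_pct: float         # associated oxygen desaturation, in percent
    arousal_within_3s: bool  # EEG arousal within 3 s of event termination

def classify(ev):
    """Apnea if airflow is nearly flat (<=25% of baseline) for >=10 s;
    hypopnea if amplitude is <=70% of baseline for >=10 s."""
    if ev.duration_s < 10:
        return None
    if ev.airflow_ratio <= 0.25:
        return "apnea"
    if ev.airflow_ratio <= 0.70:
        return "hypopnea"
    return None

def rdi(events, sleep_hours, min_desat=0.0, allow_arousal=False):
    """Apneas plus hypopneas per hour of sleep, counting an event only if its
    desaturation is at least min_desat percent, or (optionally) if an arousal
    occurred within 3 seconds of its termination."""
    qualifying = [ev for ev in events
                  if classify(ev) is not None
                  and (ev.desat_pct >= min_desat
                       or (allow_arousal and ev.arousal_within_3s))]
    return len(qualifying) / sleep_hours
```

Under these assumptions, rdi(events, 7.5, min_desat=4) would correspond to the "RDI with 4% desat" definition, and rdi(events, 7.5, min_desat=4, allow_arousal=True) to "RDI with 4% desat or arousal."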
Table 1.—Interscorer reliability—descriptive statistics (n = 30)

Variable                        Scorer 912       Scorer 914       Scorer 915       ICC
                                Mean (SD)*       Mean (SD)        Mean (SD)
RDI w/wo desat or arousal       25.93 (13.12)    32.95 (13.57)    27.38 (13.88)    0.74
RDI with 2% desat               19.90 (12.14)    18.44 (10.63)    18.57 (11.12)    0.93
RDI with 3% desat               11.35 (9.32)      9.97 (8.43)     10.80 (8.90)     0.97
RDI with 4% desat                6.08 (7.00)      5.38 (6.39)      5.75 (6.70)     0.99
RDI with 5% desat                3.53 (4.95)      3.16 (4.57)      3.32 (4.70)     0.99
RDI with arousals only           5.56 (4.14)      7.22 (5.44)      6.57 (5.30)     0.77
RDI with 2% desat or arousal    21.24 (12.34)    21.60 (11.13)    20.43 (11.29)    0.91
RDI with 3% desat or arousal    14.11 (9.87)     14.58 (9.16)     14.06 (9.24)     0.95
RDI with 4% desat or arousal     9.75 (7.92)     10.78 (7.88)     10.05 (7.90)     0.94
RDI with 5% desat or arousal     7.69 (6.32)      9.08 (6.76)      8.37 (6.64)     0.90
Arousal index (AI)              13.50 (6.33)     20.63 (11.47)    17.31 (7.68)     0.54

*SD = standard deviation; ICC = intraclass correlation; desat = desaturation.
Statistical Analysis

Interscorer reliability of the respiratory disturbance and arousal indices was estimated using the intraclass correlation coefficient (ICC),7 and the intrascorer reliability by paired t test. Using analysis of variance power curves,8 the number of studies necessary for adequate evaluation of the intraclass correlation for a fixed number of scorers was determined. These curves vary as a function of the number of repeated measurements (number of scorers), the number of studies (n), and values of the hypothesized intraclass correlations. It was determined that a sample size (n) of 30 studies provides at least 80% power at the 5% level of significance to detect the alternative of "almost perfect" reliability between the scorers. The null hypothesis specifies the ICC is less than 0.80, and the alternative hypothesis specifies the observed ICC is at least 0.90.

The assessment of intrascorer reliability, based on a sample size of n=20, was sufficient to detect a difference between scorings (original score minus reliability study rescore) of 0.63 standard deviations, for α=0.05 and 80% power.

Scorer agreement on epoch-by-epoch sleep staging was evaluated using the kappa statistic.9 For some kappa analyses, stages 1 through 3/4 were collapsed into a "nonREM" (NREM) group. For each stage, the number of pairwise agreements and disagreements between scorers are presented (eg, number of occurrences of stage 2 coded as stage 3/4 between two scorers and vice versa). The percent positive agreement (PPA)10 within each scorer pair is computed from these data, where PPA is defined as the percent agreement on a particular stage among those epochs in which at least one in the scorer pair indicated that stage (see footnote to Table 4 for example of computation of PPA).
RESULTS
The reliability of RDI varied with the definition of RDI
(according to use of oxygen saturation and/or arousal as
event-defining criteria), but overall was extremely high, up
to an intraclass correlation of 0.99 (Table 1). The greatest
reliability was found when the most stringent criterion was
applied (5% desaturation linked with a breathing amplitude
change). When a less stringent, but clinically applied, criterion of a 2% desaturation was used, the intraclass correlation was still excellent at 0.93. Using a definition of RDI
which required either a specified level of oxygen desaturation or an arousal resulted in a slightly lower reliability
than when oxygen desaturation alone was required (eg, the
ICC for RDI with 4% desaturation is 0.99, while the ICC
for RDI with 4% desaturation or arousal is 0.94—see Table
1). When the definition of RDI required arousal, but did
not require oxygen desaturation, the reliability dropped to
0.77. The reliability of the arousal index was modest, with
an intraclass correlation of 0.54.
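The intraclass correlations reported above can be illustrated with a one-way random-effects estimator, ICC(1) = (MSB - MSW)/(MSB + (k-1)MSW), where MSB and MSW are the between-study and within-study mean squares from a one-way ANOVA over n studies and k scorers. The exact estimator used in the study is specified in reference 7; this particular form is an assumption for illustration, and the function name is hypothetical.

```python
# Hypothetical sketch: one-way random-effects intraclass correlation, ICC(1).
def icc1(ratings):
    """ratings[i][j] is scorer j's index (eg, an RDI) for study i."""
    n = len(ratings)             # number of studies
    k = len(ratings[0])          # number of scorers per study
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-study and within-study mean squares from one-way ANOVA
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfect agreement among scorers gives an ICC of 1; disagreement inflates the within-study mean square and pulls the ICC toward 0.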
For assessment of intrascorer reliability, the paired t tests showed 3 or 4 significant differences, out of the 11, at the 0.05 level for each scorer (Table 2). Scorers 912 and 915 tended to score the RDI measures slightly higher during the reliability study (ie, mean differences are negative), and 914 tended to score slightly lower (mean differences positive). However, even where these statistically significant differences occurred, the magnitude of the differences was small, ranging from 1%-19% of the original mean scores. For the arousal index, there were greater mean differences and variability observed in the original vs rescore comparisons. Scorer 912 scored more arousals in the reliability study, while the other two scorers rescored fewer arousals. These differences were larger and represented 21%-25% of the original mean score, again indicating that arousals are not as easily scored.

Table 2.—Intrascorer+ reliability

Variable                               Scorer 912 (n=20)   Scorer 914 (n=19)   Scorer 915 (n=20)
                                       Mean (SD)           Mean (SD)           Mean (SD)
RDI w/wo desaturation or arousal        0.51 (6.50)         1.90 (8.11)        -3.60 (7.56)
RDI with 2% desaturation               -1.17 (4.11)        -0.17 (3.95)        -2.34** (2.74)
RDI with 3% desaturation               -0.80 (1.89)        -0.04 (2.54)        -0.85** (1.26)
RDI with 4% desaturation               -0.44* (0.85)       -0.14 (1.49)        -0.31* (0.52)
RDI with 5% desaturation               -0.24* (0.50)        0.06 (0.80)        -0.10 (0.34)
RDI with arousals only                 -0.60 (3.48)         2.05 (4.73)        -0.89 (3.14)
RDI with 2% desaturation or arousal    -0.68 (4.58)         1.33 (4.52)        -2.27* (3.70)
RDI with 3% desaturation or arousal    -0.88 (3.47)         2.02* (3.84)       -1.13 (2.67)
RDI with 4% desaturation or arousal    -0.75 (3.11)         2.30* (3.94)       -0.79 (2.74)
RDI with 5% desaturation or arousal    -0.72 (3.23)         2.24* (4.27)        2.48 (8.55)

+Mean (SD) of intrascorer differences. Difference equals the original score minus the reliability rescore.
*p < 0.05; **p < 0.01.
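The paired t comparisons underlying the intrascorer analysis can be sketched minimally (the function name is hypothetical):

```python
# Hypothetical sketch: paired t statistic for original-minus-rescore differences.
import math

def paired_t(diffs):
    """Return (mean difference, t statistic) for paired differences; the t
    statistic has len(diffs) - 1 degrees of freedom under the null."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean, mean / math.sqrt(var / n)
```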
For epoch-by-epoch staging, the interscorer comparisons yielded kappas in the range of 0.81-0.83 (Table 3). When stage was recoded into three stages (wake, NREM, REM), the kappas increased to the range of 0.87-0.90. Intrascorer kappas revealed the same high degree of reliability. The stage-specific results (Table 4) revealed that stage 1 had the lowest percent positive agreement (PPA), ranging from 16.7% for the within-scorer evaluation of 914 up to only 42.2% for scorer 915. Wake had the highest percent positive agreements. The data show that stage 3/4 was very rarely misscored by another scorer or on rescore as wake, stage 1, or REM. However, stages 2 and 3/4 were more often interchanged, as were stage 2 and REM.

Table 3.—Kappa statistics evaluating reliability of epoch-by-epoch staging

Scorer           Number of    All stages*   Wake vs NREM+ vs REM
comparison       epochs       Kappa         Kappa
Interscorer
  912 vs 914     29,507       0.81          0.89
  912 vs 915     29,507       0.81          0.87
  914 vs 915     29,507       0.83          0.90
Intrascorer
  912            20,289       0.81          0.90
  914            17,828       0.79          0.90
  915            19,236       0.87          0.92

*All stages categories: wake, 1, 2, 3/4, REM.
+NREM combines stages 1, 2, and 3/4.
Table 4.—Interscorer and intrascorer comparisons of epoch-by-epoch staging

                       Interscorer pair                     Intrascorer
Stage comparison       912 & 914   912 & 915   914 & 915    912       914       915
Wake
  Wake                 8183*       8361        8347         5116      5513      4632
  PPA+                 88.7%       87.7%       90.1%        88.5%     93.3%     89.7%
  Stage 1              476         481         412          223       178       211
  Stage 2              461         551         418          388       207       223
  Stage 3/4            4           10          7            6         3         6
  REM                  102         136         84           45        8         91
Stage 1
  Stage 1              379         333         532          190       178       485
  PPA+                 20.4%       19.8%       28.1%        20.1%     16.7%     42.2%
  Stage 2              760         691         717          457       599       336
  Stage 3/4            0           1           1            0         0         2
  REM                  240         174         229          74        114       115
Stage 2
  Stage 2              11,208      10,804      11,126       7475      7065      7582
  PPA+                 78.5%       77.6%       80.1%        75.6%     76.4%     84.8%
  Stage 3/4            1241        1084        1028         1141      929       593
  REM                  605         789         597          432       445       209
Stage 3/4
  Stage 3/4            2487        2690        2453         2155      1153      1824
  PPA+                 66.5%       70.9%       70.1%        65.2%     55.3%     75.2%
  REM                  10          7           11           2         0         1
REM
  REM                  3351        3395        3545         2585      1436      2926
  PPA+                 77.8%       75.4%       79.4%        82.4%     71.7%     87.6%
Total # of epochs      29,507      29,507      29,507       20,289    17,828    19,236
Total # agreements     25,608      25,583      26,003       17,521    15,345    17,449
% agreement            86.8        86.7        88.1         86.4      86.1      90.7

*Number of occurrences of specified stage combinations relative to total # of epochs scored.
+PPA is percent positive agreement for the specified stage. For example, the PPA associated with Wake is 8183/(8183 + 476 + 461 + 4 + 102) = 88.7%. That is, both scorers agreed that a particular stage was wake 88.7% of the time, and the other 11.3% of the epochs were scored as wake by only one or the other of the scorers.

DISCUSSION

Scoring of PSG data largely relies on procedures for pattern identification. Reliability will be influenced by the explicitness of definitions used to identify discrete events
or patterns, the experience of the scorer, and the quality of
the underlying signals that contribute to the pattern. A limitation of analysis of PSG data is a paucity of standardized
definitions for sleep staging applicable to both disease and
nondisease states and for hypopnea identification.11 The
reliability of scored PSG data for a multicenter study with
the scope of SHHS can be influenced by the large data-processing needs (as many as 90 studies per week persisting
over 3 years), the need to employ a team of scorers who
may differ in scoring approaches, and the variability of signal quality from studies performed in unattended settings in
diverse environments. For SHHS, we attempted to maximize reliability by rigorous training of scorers, systematic
reporting of signal quality, and attempts to formulate
explicit scoring rules. The findings of this study provide
new data that address the relative reliability of staging, respiratory, and arousal data collected in this setting.
This study demonstrates a high degree of reliability in
the scoring of sleep stage and RDI from unattended, in-home PSG studies performed for the SHHS. The interscorer reliability of scoring of sleep stage (kappas of 0.81-0.83) is similar to that reported in the literature from laboratory-based studies.12-14 Differences in scoring between
scorers were approximately the same as differences in scoring when studies were rescored by the same scorer, suggesting that the methods utilized to standardize scoring in
the SHHS achieved as great a concordance among readers
as could be achieved. As has been observed by others,
stage 1 was the sleep stage scored least reliably.
Distinguishing reliably between REM and NREM may be
particularly important in addressing hypotheses regarding
stage-specific respiratory disturbances, and for ascertaining
the effects of REM deprivation on functional outcomes.
We have found that this discrimination can be made
extremely reliably, with kappa for interscorer reliability of
epoch-by-epoch staging rising to 0.87-0.90 when all
NREM sleep stages are combined.
The literature provides little information on the reliability of scoring of arousals. For periodic leg movements (PLMS), high reliability (range of r of 0.71-0.99) among scorers has been demonstrated for the number of movements per hour (PLMS index), the number of episodes, and the mean number of movements per episode.15 However, in that study, when arousal data were included with the PLMS index, the reliability was lower. New
definitions for EEG arousals, requiring identification of an
“abrupt shift in EEG frequency,” were adopted by the
American Sleep Disorders Association in 1992 without
information on their reliability. These definitions can be
difficult to operationalize in settings with variable underlying EEG frequencies, as in light sleep or in studies with
alpha intrusion. Poor-to-moderate agreement in arousal
identification was recently reported in an interscorer reliability study of 14 accredited European laboratories. Using
the same definitions, they reported an overall kappa of
0.47. Agreement was best for arousals scored during delta
sleep (when the overall background EEG most contrasted
with the faster frequencies of arousals), and for studies
noted a priori to be easily classified.16
Our data also suggest that achievement of high scorer
reliability for arousal identification is more difficult than
achieving high reliability for sleep staging and hypopnea
identification. The modest level of interscorer reliability
(ICC=0.54) reflects the difficulty in discerning the occurrence of an "abrupt" change in EEG frequency in a setting
of ongoing changes in background EEG frequency (as
occurs in stages 1, 2, and REM sleep). The experience of
the scorer and the quality of the underlying signals will
influence the ability to identify such changes. The
polysomnograms for the reliability study were chosen after
each scorer had scored at least 80 PSGs (80, 129, and 134
for scorers 912, 914, and 915, respectively). However,
even this requirement may not have been enough; the low
reliability could reflect the relative inexperience of the
scorer at the time of the study. Scoring data, tracked over
the course of the Sleep Heart Health Study, showed a reduction in differences among scorers in the arousal index (data
not shown), suggesting that arousal reliability may be
impacted positively by ongoing training and experience.
Similar convergence in RDI and sleep stage was not
observed. This suggests that achievement of high scorer
reliability for arousals may require considerably more
experience than for respiratory events and sleep staging.
Additionally, the assessment of reliability using unattended
studies—which may be vulnerable to electrical interference
from background environmental contamination, and to
intermittent loss of EMG signals—will reduce the ability to
accurately identify arousals. In fact, within the sample of
reliability studies, eight studies were noted at the time of
scoring to contain electrical artifact severe enough to interfere with reliable arousal identification. As an indication of
the potential impact of both scorer training and signal quality on reliability, the interscorer ICC for the arousal index
increased to 0.72 (95% CI: 0.44, 0.88) when analyses were
restricted to the two fulltime (most intensively trained)
scorers. The extent to which reliability may be further
improved with additional training, and the extent to which
findings may be generalizable to clinical laboratories where
signals may be less vulnerable to electrical artifact—but
where scorer training and experience may be variable—
require further study.
There is controversy over the optimal way to score
hypopneas.3 One source of disagreement concerns the use
of oxygen desaturation and arousal as corroborative criteria
for identifying a respiratory event. We found that when
reductions in breathing amplitude were used in isolation to
identify respiratory events, there was nonetheless good reliability of the measurement of RDI (ICC=0.74). When evidence of arousal was also required, the reliability was only
slightly improved (ICC=0.77). When an associated oxygen
desaturation was required, however, the reliability of the
RDI was extremely high, with ICC rising from 0.93 when a
minimum 2% desaturation was required, to 0.99 when a
minimum 4% desaturation was required. When respiratory
events were defined as reductions in airflow accompanied
by either oxygen desaturation or arousal, the ICC was
somewhat lower than when desaturation alone was
required. These results are consistent with the findings of
Bliwise et al,17 who found high reliability (r=0.94) among
three scorers for RDI but low reliability of particular types
of events (eg, apnea vs hypopnea; obstructive vs central;
range of r of 0.18-0.69), and Lord et al,18 who also evaluated three scorers and found measures of RDI were reliable
(range of kappa 0.71-0.87).
These findings relate to respiratory events identified at
minimum by a 30% or greater reduction in breathing amplitude (measured by oronasal thermocouples) or thoracoabdominal excursion. It is probable that the reliability of respiratory-event scoring varies according to the amplitude
criteria used to identify breathing reductions. The higher
reliability indices for RDIs defined using events with higher levels of desaturation may relate to the more pronounced
breathing changes that characterize these events. Further
resolution of this issue will require a comparison of reliability according to differences in amplitude criteria for
hypopneas.
In summary, we conclude that the training and quality
assurance methods used in the scoring of SHHS
polysomnograms have produced a high degree of interscorer reliability of sleep staging and RDI. It appears that the
use of rigorous scoring methods can yield highly reliable
measures, even for data collected in an unattended setting,
vulnerable to various sources of artifact. The reliability of
scoring of RDI is considerably improved by the requirement of an associated oxygen desaturation in the identification of respiratory events. It appears to be more difficult
to achieve a high level of reliability for the arousal index
than for the RDI or staging distributions. The former
appears to be particularly sensitive to both scorer experience and signal quality.
ACKNOWLEDGMENTS

The Sleep Heart Health Study (SHHS) acknowledges the Atherosclerosis Risk in Communities Study (ARIC), the Cardiovascular Health Study (CHS), the Framingham Heart Study (FHS), the Cornell Worksite and Hypertension Studies, the Strong Heart Study (SHS), the Tucson Epidemiology Study of Airways Obstructive Diseases (TES), and the Tucson Health and Environment Study (H&E) for allowing their cohort members to be part of the SHHS and for permitting data acquired by them to be used in the study.

Participating Institutions and SHHS Investigators—Framingham, Mass: Boston University: George T. O’Connor, Sanford H. Auerbach, Emelia J. Benjamin, Ralph B. D’Agostino, Rachel J. Givelber, Daniel J. Gottlieb, Philip A. Wolf; University of Wisconsin: Terry B. Young. Minneapolis, Minn: University of Minnesota: Eyal Shahar, Conrad Iber, Mark W. Mahowald, Paul G. McGovern, Lori L. Vitelli. New York, NY: New York University: David M. Rapoport, Joyce A. Walsleben; Cornell University: Thomas G. Pickering, Gary D. James; State University of New York, Stonybrook: Joseph E. Schwartz; Columbia University (Harlem Hospital): Velvie A. Pogue, Charles K. Francis. Sacramento, CA/Pittsburgh, PA: University of California, Davis: John A. Robbins, William H. Bonekat; University of Pittsburgh: Anne B. Newman, Mark Sanders. Tucson, Ariz/Strong Heart Study: University of Arizona: Stuart F. Quan, Michael D. Lebowitz, Paul L. Enright, Richard R. Bootzin, Anthony E. Camilli, Bruce M. Coull, Russell R. Dodge, Gordon A. Ewy, Steven R. Knoper, Linda S. Snyder; Medlantic Research Institute—Phoenix Strong Heart: Barbara V. Howard; University of Oklahoma—Oklahoma Strong Heart: Elisa T. Lee, J. L. Yeh; Missouri Breaks Research Institute—Dakotas Strong Heart: Thomas K. Welty. Washington County, MD: The Johns Hopkins University: F. Javier Nieto, Jonathan M. Samet, Joel G. Hill, Alan R. Schwartz, Philip L. Smith, Moyses Szklo. Coordinating Center—Seattle, Wash: University of Washington: Patricia W. Wahl, Bonnie K. Lind, Coralyn W. Whitney, Richard A. Kronmal, Bruce M. Psaty, David S. Siscovick. Sleep Reading Center—Cleveland, Ohio: Case Western Reserve University: Susan Redline, Carl E. Rosenberg, Kingman P. Strohl. NHLBI Project Office—Bethesda, MD: James P. Kiley, Richard R. Fabsitz.

This work was supported by National Heart, Lung, and Blood Institute cooperative agreements #U01HL53940 (University of Washington), U01HL53941 (Boston University), U01HL53938 (University of Arizona), U01HL53916 (University of California, Davis), U01HL53934 (University of Minnesota), U01HL53931 (New York University), and U01HL53937 (Johns Hopkins University).

REFERENCES
1. Quan SF, Howard BV, Iber C, et al. The Sleep Heart Health Study—
Design, Rationale and Methods. Sleep 1997;20:1077-1085.
2. Rechtschaffen A, Kales A. A manual of standardized terminology,
techniques, and scoring system for sleep stages of human subjects.
Washington, DC, US Government Printing Office, 1968 (NIH
Publication No. 204).
3. The Atlas Task Force. EEG arousals: scoring rules and examples.
Sleep 1992;15:173-84.
4. Sleep Heart Health Study Manual of Operations. SHHS Coordinating
Center, Seattle, WA, 1996.
5. Redline S, Sanders MH, Lind BK, Quan SF, Iber C, Gottlieb D, Bonekat WH, Rapoport DM, Smith PL, Kiley JP. Methods for obtaining and analyzing unattended polysomnography data for a multicenter study. Sleep 1998;21:759-767.
6. Sleep Heart Health Study Reading Center Manual of Operations.
SHHS Coordinating Center, Seattle, Wash, 1996.
7. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater
reliability. Psychol Bull 1979;86:420-428.
8. Donner A, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987;6:441-448.
9. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
10. Fleiss JL. Statistical methods for rates and proportions. New York: Wiley and Sons, 1981, p. 214.
11. Moser NJ, Phillips BA, Berry DT, Harbison L. What is hypopnea,
anyway? Chest 1994;105:426-8.
12. Schaltenbrand N, Lengelle R, Toussaint M. Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients. Sleep 1996;19(1):26-35.
13. Stanus E, Lacroix B, Kerkhofs M, Mendlewicz J. Automated sleep
scoring: a comparative reliability study of two algorithms.
Electroencephalogr Clin Neurophysiol 1987;66(4):448-56.
14. Karacan I, Orr WC, Roth T, et al. Establishment and implementation of standardized sleep laboratory data collection and scoring procedures. Psychophysiology 1978;15(2):173-9.
15. Bliwise DL, Keenan S, Burnburg D. Inter-rater reliability for scoring periodic leg movements in sleep. Sleep 1991;14(3):249-51.
16. Drinnan MJ, Murray A, Griffiths CJ, Gibson GJ. Interobserver variability in recognizing arousal in respiratory sleep disorders. Am J Respir Crit Care Med (in press).
SLEEP, Vol. 21, No. 7, 1998
17. Bliwise D, Bliwise NG, Kraemer HC, Dement W. Measurement
error in visually scored electrophysiological data: respiration during
sleep. J Neurosci Methods 1984;12:49-56.
18. Lord S, Sawyer B, Pond D, et al. Interrater reliability of computer-assisted scoring of breathing during sleep [see comments]. Sleep 1989;12(6):550-8. Comment in: Sleep 1991;14(1):89-90.