INSTRUMENTATION AND METHODOLOGY Reliability of Scoring Respiratory Disturbance Indices and Sleep Staging Coralyn W. Whitney,1 Daniel J. Gottlieb,2 Susan Redline,3 Robert G. Norman,4 Russell R. Dodge,5 Eyal Shahar,6 Susan Surovec,3 and F. Javier Nieto7 (1) Department of Biostatistics, University of Washington, Seattle, Wash; (2) Department of Medicine, Boston University School of Medicine, Boston, Mass; (3) Department of Pediatrics, Case Western Reserve University, Cleveland, Ohio; (4) Department of Medicine, New York University Medical Center, New York, NY; (5) Respiratory Science Center, University of Arizona, Tucson, Ariz; (6) Division of Epidemiology, University of Minnesota, Minneapolis, Minn; (7) Department of Epidemiology, School of Hygiene and Public Health, Johns Hopkins University, Baltimore, Md Study Objectives: Unattended, home-based polysomnography (PSG) is increasingly used in both research and clinical settings as an alternative to traditional laboratory-based studies, although the reliability of the scoring of these studies has not been described. The purpose of this study is to describe the reliability of the PSG scoring in the Sleep Heart Health Study (SHHS), a multicenter study of the relation between sleep-disordered breathing measured by unattended, in-home PSG using a portable sleep monitor, and cardiovascular outcomes. Design: The reliability of SHHS scorers was evaluated based on 20 randomly selected studies per scorer, assessing both interscorer and intrascorer reliability. Results: Both inter- and intrascorer comparisons on epoch-by-epoch sleep staging showed excellent reliability (kappa statistics >0.80), with stage 1 having the greatest discrepancies in scoring and stage 3/4 being the most reliably discriminated. The arousal index (number of arousals per hour of sleep) was moderately reliable, with an intraclass correlation (ICC) of 0.54. 
The scorers were highly reliable on various respiratory disturbance indices (RDIs) that incorporate an associated oxygen desaturation (2% to 5%) in the definition of respiratory events, with or without the additional use of an associated EEG arousal (ICC>0.90). When RDI was defined without considering oxygen desaturation or arousals, the RDI was moderately reliable (ICC=0.74). The additional use of associated EEG arousals, but not oxygen desaturation, in defining respiratory events did little to increase the reliability of the RDI measure (ICC=0.77). Conclusions: The SHHS achieved a high degree of intrascorer and interscorer reliability for the scoring of sleep stage and RDI in unattended, in-home PSG studies.

Key words: Sleep apnea syndrome; polysomnography; scoring; reliability

Accepted for publication August, 1998. Address correspondence and requests for reprints to Coralyn W. Whitney, PhD, Department of Biostatistics, Box 358223, University of Washington, Seattle, WA 98195, e-mail: [email protected]

SLEEP, Vol. 21, No. 7, 1998

POLYSOMNOGRAPHY (PSG) DATA have been collected for over 30 years to assess sleep continuity and, more recently, to assess sleep-related respiratory disturbances. Over the last 10 years, interest has grown in the use of PSG data to characterize the level of exposure to sleep disruption and sleep-disordered breathing as potential risk factors for cardiovascular diseases. The Sleep Heart Health Study (SHHS)1 is a multicenter, longitudinal study designed to relate PSG data to cardiovascular outcomes. Such a study is feasible in part due to the recent development of lightweight, portable PSG monitors, which have allowed us to perform 6,440 multichannel sleep recordings in unattended home settings for participants from 10 clinical sites located across the United States.

The ability to identify meaningful associations between the PSG measures and cardiovascular outcomes depends on the ability to score relevant PSG parameters reliably. The reliability of measures obtained from unattended, in-home PSG has not been previously described; however, the lack of both a standardized testing environment and a technician monitoring the study and intervening to optimize signal quality suggests that such studies might be more difficult to score reliably than traditional laboratory-based studies. The reliability of scoring of unattended, in-home PSG studies is also of clinical concern, as such studies are being introduced into the practice of sleep medicine.

For SHHS, a central Reading Center (Cleveland, Ohio) was established to process and score all PSGs, with the goal of maximizing interscorer and intrascorer reliability. This required the establishment and documentation of clear scoring rules designed to operationalize Rechtschaffen and Kales2 staging and the American Sleep Disorders Association3 arousal criteria. New approaches for hypopnea identification, utilizing software innovations for categorizing apneas and hypopneas according to the degree of desaturation and the presence or absence of arousals, were established. Scorer training and certification procedures were established to minimize differences among scorers.

Early in the study, an assessment of the inter- and intrascorer reliability of PSG data was performed to provide a general indication of the reliability of specific PSG indices highlighted to be of importance for SHHS hypotheses. Respiratory disturbance indices (as measures of sleep-disordered breathing), the arousal index, and the epoch-by-epoch comparison of sleep staging were assessed. This paper reports the results of this study, highlighting differences in the reliability of different indices and suggesting areas for further development.

METHODS

The objectives and design of the SHHS have been described elsewhere.1,4 Briefly, for the purpose of studying the cardiovascular consequences of sleep apnea, the SHHS recruited participants from ongoing community-based cohort studies of cardiovascular disease and hypertension. These parent cohorts included the Framingham Heart Study (FHS), the Atherosclerosis Risk in Communities Study (ARIC), the Cardiovascular Health Study (CHS), the New York Hypertension Cohorts, the Strong Heart Study (SHS), the Tucson Epidemiological Study of Airways Obstructive Disease (TES) Cohort, and the Tucson Health and Environment Cohort (H&E). Participants from the parent cohorts were recruited into SHHS without consideration of the presence or absence of a history of sleep apnea. All participants 40 years of age or older were invited to participate in the SHHS, excluding only those individuals who were currently treated with continuous positive airway pressure, oral devices, or home oxygen therapy, and subjects who had had a tracheostomy. Those who agreed to participate in SHHS underwent an unattended PSG study at home. Recruitment for the SHHS was initiated in November 1995 and completed in January 1998.

Equipment

The portable monitor used for the home sleep studies was the Compumedics PS-2 system (Abbotsford, Victoria, Australia). Briefly, this system records a 12-channel montage consisting of C3/A1 and C4/A2 EEG, right and left EOG, ECG, chin EMG, oxygen saturation (from finger pulse oximetry), chest and abdominal excursion (inductance plethysmography), airflow (a Pro-Tech oral-nasal thermistor), body position, and ambient light. The data collected during the SHHS PSG are therefore typical of the PSG montage used in most clinical laboratories, with the exception that leg movement data are not collected. Software to score the raw data was developed by Compumedics in collaboration with SHHS investigators (see reference 5 for complete details).

Data Collection

Data from a PSG study were collected on a PCMCIA card, then downloaded to computers and transferred to magnetic cartridges at the local field centers. The average PSG study consisted of approximately 20 megabytes of raw data. These cartridges were mailed to the central PSG Reading Center (Case Western Reserve University, Cleveland, Ohio).

Scorers

The SHHS scorers were trained and certified according to rigorous methods.5 Scoring rules were documented in a written Reading Center Manual of Operations,6 which contained examples of typical and atypical events and patterns, and classification schema. Scoring issues, including review of any problematic studies, were reviewed on a weekly basis by the staff, including three scorers, a chief polysomnologist, a study coordinator, a neurologist (diplomate of the American Board of Sleep Medicine), and a pulmonologist. A standardized Scorers' Notes Form was also used to identify patterns and scoring issues which were not reported by the software package and could not be identified from the study quality assurance forms (coding signal and overall study quality).4,5 These included notations on the occurrence of patterns of periodic breathing, abnormal eye movements, abnormal wake EEG, alpha intrusion, alpha artifact, and periodic large breaths suggesting an arousal disorder. Scorers also coded each study according to perceived problems with sleep staging and arousal detection. Such problems were identified if the scorer felt uncertain about >10% of the event (arousal) determinations or staging decisions because of signal noise (eg, sweat artifact, electrical interference) or biological disturbances (alpha intrusion). Studies were scored manually on a high-resolution monitor using 30-second epochs for staging and arousal detection, and using 5-minute epochs for respiratory data. Sleep data were scored without visualization of respiratory channels, and respiratory data were scored without visualization of EEG and EMG channels.

The reliability study occurred during the first year of more than 3 years of PSG scoring. Three of the six scorers employed during the course of the study were on staff at the time of the study. These three scorers were evaluated in this reliability study and are identified by scorer ID number (912, 914, and 915). Scorer 912 worked half-time and the other two full-time.

Selection of Studies to Assess Reliability

The SHHS Coordinating Center and Quality Control Subcommittee designed and coordinated the reliability study, which was implemented at the Reading Center from October 1996 through January 1997. Twenty previously scored studies, rated good or better in overall quality,6 were chosen randomly for each of the three scorers, yielding 60 studies to be used in the assessment of reliability. Each study was rescored by the original scorer, and half were rescored by the other two scorers. The previously scored results were used only in the assessment of the intrascorer reliability,* comparing the results from 20 studies scored twice with a median time interval between scorings of 6.5 months. Estimation of interscorer reliability was based on 30 PSG studies scored by all three SHHS scorers. These studies represent the first 6 months of PSG studies performed for SHHS, when approximately 14% of the studies (out of 6,440) had been conducted. Selection was made from those studies scored after a scorer had completed a minimum of 80 studies.

*One study was determined to be nonscorable during review because of identification of excessive artifact on the oximetry channel. Another study was not sleep-staged because of excessive EEG artifact.

Assignment of Studies

To minimize study recognition and the potential for scoring recall bias, scorers were blinded as to whether a study they were scoring was a reliability study (one previously scored and discussed), with all reliability studies integrated into the scorers' normal workloads. Even though such measures were taken to maintain blinding of the reliability studies, equipment improvements during the first 6 months resulted in some software montage displays being slightly different from the majority of studies being scored along with the reliability studies. Thus, although study identification numbers and dates were modified in an attempt to make these studies look like current studies, it is possible that scorers identified some studies as being reliability studies. However, no changes in workflow or scoring behavior were observed. Order-of-assignment tables were constructed by the SHHS Coordinating Center (Seattle, Wash), providing the Reading Center with the order in which the studies were to be assigned to the three scorers. For the interscorer evaluation, all 30 studies were scored once before cycling through the studies for the second and third scorings. The purpose of this was to ensure that (a) no two scorers would be scoring the same study at the same time; and (b) since the scorers did discuss scoring with each other on a day-to-day basis, the studies would be spread out to minimize recall and identification of reliability studies.

Scoring and Definitions

Sleep stages were scored according to the criteria developed by Rechtschaffen and Kales.2 The predominant sleep stage was scored for each 30-second epoch and coded as wake, stage 1, stage 2, stage 3/4 (delta sleep), or REM. Arousals were identified according to the criteria of the American Sleep Disorders Association,3 modified to accommodate situations in which EMG artifact obscures the EEG signal. The arousal index was calculated as the number of arousals per hour of sleep. Scoring of breathing disturbances was based on identifying changes in breathing amplitude, regardless of associated desaturation or arousal. Potential apneas were identified if airflow was nearly flat (≤25% of baseline, where baseline amplitude is identified during the nearest preceding period of regular breathing with stable oxygen saturation), lasting for at least 10 seconds. Potential hypopneas were identified if airflow or thoracoabdominal excursion was approximately ≤70% of baseline for at least 10 seconds. The software linked each apnea and hypopnea with data from the oximetry and EEG channels. Thus, each event was characterized according to its associated desaturation (0 to 5%) and whether an arousal occurred within 3 seconds of the termination of the event. Various respiratory disturbance indices (RDIs), defined as the number of apneas and hypopneas per hour of sleep, were calculated by varying the requirement for an associated desaturation or arousal.
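The event-linking logic described above (each event tagged with its associated desaturation and whether an arousal fell within 3 seconds of its end, with RDI variants computed by varying the corroborative requirement) can be sketched as follows. This is an illustrative reconstruction, not the SHHS software: the data structures, names, and example numbers are invented.

```python
# Hypothetical sketch of the event-linking rule described above.
# All names and example values are invented for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class RespiratoryEvent:
    start_s: float            # event start (seconds from study start)
    end_s: float              # event end; events last at least 10 s
    desat_pct: float          # associated oxygen desaturation (0 to 5%)
    arousal_within_3s: bool   # EEG arousal within 3 s of event termination

def rdi(events: List[RespiratoryEvent], sleep_hours: float,
        min_desat: float = 0.0, allow_arousal: bool = False) -> float:
    """Events per hour of sleep; an event counts if it meets the minimum
    desaturation, or (optionally) has an associated arousal."""
    def qualifies(e: RespiratoryEvent) -> bool:
        if e.desat_pct >= min_desat:
            return True
        return allow_arousal and e.arousal_within_3s
    return sum(qualifies(e) for e in events) / sleep_hours

events = [RespiratoryEvent(0, 15, 4.0, True),
          RespiratoryEvent(60, 75, 1.0, True),
          RespiratoryEvent(120, 135, 0.0, False)]
print(rdi(events, sleep_hours=6.0))                                 # no requirement
print(rdi(events, sleep_hours=6.0, min_desat=3.0))                  # "RDI with 3% desat"
print(rdi(events, sleep_hours=6.0, min_desat=3.0, allow_arousal=True))
```

Varying `min_desat` from 2 to 5 and toggling `allow_arousal` yields the family of RDI definitions compared in Table 1.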
Table 1. Interscorer reliability: descriptive statistics (n = 30)

Variable                          Scorer 912      Scorer 914      Scorer 915      ICC
                                  Mean (SD)*      Mean (SD)       Mean (SD)
RDI w/wo desat or arousal         25.93 (13.12)   32.95 (13.57)   27.38 (13.88)   0.74
RDI with 2% desat                 19.90 (12.14)   18.44 (10.63)   18.57 (11.12)   0.93
RDI with 3% desat                 11.35 (9.32)    9.97 (8.43)     10.80 (8.90)    0.97
RDI with 4% desat                 6.08 (7.00)     5.38 (6.39)     5.75 (6.70)     0.99
RDI with 5% desat                 3.53 (4.95)     3.16 (4.57)     3.32 (4.70)     0.99
RDI with arousals only            5.56 (4.14)     7.22 (5.44)     6.57 (5.30)     0.77
RDI with 2% desat or arousal      21.24 (12.34)   21.60 (11.13)   20.43 (11.29)   0.91
RDI with 3% desat or arousal      14.11 (9.87)    14.58 (9.16)    14.06 (9.24)    0.95
RDI with 4% desat or arousal      9.75 (7.92)     10.78 (7.88)    10.05 (7.90)    0.94
RDI with 5% desat or arousal      7.69 (6.32)     9.08 (6.76)     8.37 (6.64)     0.90
Arousal index (AI)                13.50 (6.33)    20.63 (11.47)   17.31 (7.68)    0.54

* SD = standard deviation; ICC = intraclass correlation; desat = desaturation.

Statistical Analysis

Interscorer reliability of the respiratory disturbance and arousal indices was estimated using the intraclass correlation coefficient (ICC),7 and intrascorer reliability by the paired t test. Using analysis-of-variance power curves,8 the number of studies necessary for adequate evaluation of the intraclass correlation for a fixed number of scorers was determined. These curves vary as a function of the number of repeated measurements (number of scorers), the number of studies (n), and the values of the hypothesized intraclass correlations. It was determined that a sample size (n) of 30 studies provides at least 80% power at the 5% level of significance to detect the alternative of almost perfect reliability between the scorers. The null hypothesis specifies that the ICC is less than 0.80, and the alternative hypothesis specifies that the observed ICC is at least 0.90. The assessment of intrascorer reliability, based on a sample size of n=20, was sufficient to detect a difference between scorings (original score minus reliability study rescore) of 0.63 standard deviations, for α=0.05 and 80% power.

Scorer agreement on epoch-by-epoch sleep staging was evaluated using the kappa statistic.9 For some kappa analyses, stages 1 through 3/4 were collapsed into a nonREM (NREM) group. For each stage, the number of pairwise agreements and disagreements between scorers is presented (eg, the number of occurrences of stage 2 coded as stage 3/4 between two scorers, and vice versa). The percent positive agreement (PPA)10 within each scorer pair is computed from these data, where PPA is defined as the percent agreement on a particular stage among those epochs in which at least one of the scorer pair indicated that stage (see footnote to Table 4 for an example of the computation of PPA).

RESULTS

The reliability of RDI varied with the definition of RDI (according to the use of oxygen desaturation and/or arousal as event-defining criteria), but overall was extremely high, up to an intraclass correlation of 0.99 (Table 1). The greatest reliability was found when the most stringent criterion was applied (5% desaturation linked with a breathing amplitude change). When a less stringent, but clinically applied, criterion of a 2% desaturation was used, the intraclass correlation was still excellent at 0.93. Using a definition of RDI which required either a specified level of oxygen desaturation or an arousal resulted in slightly lower reliability than when oxygen desaturation alone was required (eg, the ICC for RDI with 4% desaturation is 0.99, while the ICC for RDI with 4% desaturation or arousal is 0.94; see Table 1). When the definition of RDI required arousal, but did not require oxygen desaturation, the reliability dropped to 0.77. The reliability of the arousal index was modest, with an intraclass correlation of 0.54.
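The intraclass correlations reported above can be illustrated with a short computation. The paper cites Shrout and Fleiss7 but does not state which ICC form was used, so this sketch uses the one-way random-effects ICC(1,1) on synthetic data; the variable names and simulated scores are invented for illustration.

```python
# Illustrative one-way intraclass correlation (Shrout & Fleiss ICC(1,1)).
# Synthetic data; this is a sketch, not the SHHS analysis code.
import numpy as np

def icc_oneway(scores: np.ndarray) -> float:
    """scores: n_targets x k_raters matrix of ratings."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    # between-target and within-target mean squares from one-way ANOVA
    bms = k * ((row_means - grand) ** 2).sum() / (n - 1)
    wms = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

rng = np.random.default_rng(0)
truth = rng.normal(20, 10, size=30)                    # 30 studies' "true" RDI
ratings = truth[:, None] + rng.normal(0, 2, (30, 3))   # 3 scorers, small error
print(round(icc_oneway(ratings), 2))  # close to 1 when scorers agree closely
```

Larger rater error relative to between-study variance pulls the ICC toward 0, which is the pattern seen here for the arousal index versus the desaturation-based RDIs.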
For assessment of intrascorer reliability, the paired t tests showed 3 or 4 significant differences, out of the 11, at the 0.05 level for each scorer (Table 2). Scorers 912 and 915 tended to score the RDI measures slightly higher during the reliability study (ie, mean differences are negative), and 914 tended to score slightly lower (mean differences positive). However, even where these statistically significant differences occurred, the magnitude of the differences was small, ranging from 1% to 19% of the original mean scores. For the arousal index, there were greater mean differences and variability observed in the original vs rescore comparisons. Scorer 912 scored more arousals in the reliability study, while the other two scorers rescored fewer arousals. These differences were larger and represented 21% to 25% of the original mean score, again indicating that arousals are not as easily scored.

Table 2. Intrascorer reliability+

Variable                               Scorer 912 (n=20)   Scorer 914 (n=19)   Scorer 915 (n=20)
                                       Mean (SD)           Mean (SD)           Mean (SD)
RDI w/wo desaturation or arousal       0.51 (6.50)         1.90 (8.11)         -3.60 (7.56)
RDI with 2% desaturation               -1.17 (4.11)        -0.17 (3.95)        -2.34** (2.74)
RDI with 3% desaturation               -0.80 (1.89)        -0.04 (2.54)        -0.85** (1.26)
RDI with 4% desaturation               -0.44* (0.85)       -0.14 (1.49)        -0.31* (0.52)
RDI with 5% desaturation               -0.24* (0.50)       0.06 (0.80)         -0.10 (0.34)
RDI with arousals only                 -0.60 (3.48)        2.05 (4.73)         -0.89 (3.14)
RDI with 2% desaturation or arousal    -0.68 (4.58)        1.33 (4.52)         -2.27* (3.70)
RDI with 3% desaturation or arousal    -0.88 (3.47)        2.02* (3.84)        -1.13 (2.67)
RDI with 4% desaturation or arousal    -0.75 (3.11)        2.30* (3.94)        -0.79 (2.74)
RDI with 5% desaturation or arousal    -0.72 (3.23)        2.24* (4.27)        -0.76 (2.90)
Arousal index (AI)                     -3.11* (6.08)       7.32** (10.59)      2.48 (8.55)

+ Mean (SD) of intrascorer differences; difference equals the original score minus the reliability rescore.
* : p < 0.05; ** : p < 0.01.
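The intrascorer comparison above is a paired t test on the original-minus-rescore differences. A minimal sketch, using made-up difference values rather than SHHS data:

```python
# Paired t test on original-minus-rescore differences (illustrative data).
from statistics import mean, stdev
from math import sqrt

def paired_t(diffs):
    """t statistic for H0: mean difference = 0 (df = n - 1)."""
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# hypothetical per-study differences for one scorer and one RDI variant
diffs = [-0.4, 0.2, -0.6, -0.3, 0.1, -0.5, -0.2, 0.0, -0.4, -0.1]
print(round(paired_t(diffs), 2))  # -2.62
```

A negative t, as in this toy example, corresponds to the pattern for scorers 912 and 915 in Table 2, who scored the RDI measures slightly higher on rescore.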
For epoch-by-epoch staging, the interscorer comparisons yielded kappas in the range of 0.81-0.83 (Table 3). When stage was recoded into three categories (wake, NREM, REM), the kappas increased to the range of 0.87-0.90. Intrascorer kappas revealed the same high degree of reliability. The stage-specific results (Table 4) revealed that stage 1 had the lowest percent positive agreement (PPA), ranging from 16.7% for the within-scorer evaluation of scorer 914 up to only 42.2% for scorer 915. Wake had the highest percent positive agreement. The data show that stage 3/4 was very rarely misscored by another scorer or on rescore as wake, stage 1, or REM. However, stages 2 and 3/4 were more often interchanged, as were stage 2 and REM.

Table 3. Kappa statistics evaluating reliability of epoch-by-epoch staging

Scorer comparison   Number of epochs   All stages* Kappa   Wake vs NREM+ vs REM Kappa
Interscorer
912 vs 914          29,507             0.81                0.89
912 vs 915          29,507             0.81                0.87
914 vs 915          29,507             0.83                0.90
Intrascorer
912                 20,289             0.81                0.90
914                 17,828             0.79                0.90
915                 19,236             0.87                0.92

* All-stages categories: wake, 1, 2, 3/4, REM.
+ NREM combines stages 1, 2, and 3/4.

Table 4. Interscorer and intrascorer comparisons of epoch-by-epoch staging

                            Interscorer pair                    Intrascorer
Stage      Comparison       912 & 914   912 & 915   914 & 915   912     914     915
Wake       Wake             8183*       8361        8347        5116    5513    4632
           PPA+             88.7%       87.7%       90.1%       88.5%   93.3%   89.7%
           Stage 1          476         481         412         223     178     211
           Stage 2          461         551         418         388     207     223
           Stage 3/4        4           10          7           6       3       6
           REM              102         136         84          45      8       91
Stage 1    Stage 1          379         333         532         190     178     485
           PPA+             20.4%       19.8%       28.1%       20.1%   16.7%   42.2%
           Stage 2          760         691         717         457     599     336
           Stage 3/4        0           1           1           0       0       2
           REM              240         174         229         74      114     115
Stage 2    Stage 2          11,208      10,804      11,126      7475    7065    7582
           PPA+             78.5%       77.6%       80.1%       75.6%   76.4%   84.8%
           Stage 3/4        1241        1084        1028        1141    929     593
           REM              605         789         597         432     445     209
Stage 3/4  Stage 3/4        2487        2690        2453        2155    1153    1824
           PPA+             66.5%       70.9%       70.1%       65.2%   55.3%   75.2%
           REM              10          7           11          2       0       1
REM        REM              3351        3395        3545        2585    1436    2926
           PPA+             77.8%       75.4%       79.4%       82.4%   71.7%   87.6%
Total # of epochs           29,507      29,507      29,507      20,289  17,828  19,236
Total # agreements          25,608      25,583      26,003      17,521  15,345  17,449
% agreement                 86.8        86.7        88.1        86.4    86.1    90.7

* : Number of occurrences of specified stage combinations relative to the total # of epochs scored.
+ : PPA is percent positive agreement for the specified stage. For example, the PPA associated with wake for scorers 912 & 914 is 8183/(8183 + 476 + 461 + 4 + 102) = 88.7%. That is, both scorers agreed that a particular epoch was wake 88.7% of the time, and the other 11.3% of the epochs were scored as wake by only one or the other of the scorers.

DISCUSSION

Scoring of PSG data largely relies on procedures for pattern identification. Reliability will be influenced by the explicitness of definitions used to identify discrete events or patterns, the experience of the scorer, and the quality of the underlying signals that contribute to the pattern.
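The two agreement measures used for staging, Cohen's kappa9 and percent positive agreement,10 can be sketched briefly. The function names are invented; the PPA check reuses the wake counts for scorers 912 & 914 from Table 4.

```python
# Sketch of the two staging agreement measures: Cohen's kappa from a
# confusion matrix, and per-stage percent positive agreement (PPA).
import numpy as np

def cohens_kappa(confusion: np.ndarray) -> float:
    """Chance-corrected agreement from a square confusion matrix."""
    total = confusion.sum()
    p_obs = np.trace(confusion) / total                       # observed agreement
    p_exp = (confusion.sum(0) * confusion.sum(1)).sum() / total**2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

def ppa(agree: int, disagree: list) -> float:
    """Agreements on a stage / epochs in which at least one scorer chose it."""
    return agree / (agree + sum(disagree))

# wake PPA for scorers 912 & 914, using the Table 4 counts
print(round(ppa(8183, [476, 461, 4, 102]), 3))  # 0.887, matching the footnote
```

Kappa rises when disagreements concentrate off the diagonal less than chance would predict, which is why collapsing stages 1 through 3/4 into a single NREM category (removing the hard stage 1 vs stage 2 distinction) raises kappa from 0.81-0.83 to 0.87-0.90.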
A limitation of analysis of PSG data is a paucity of standardized definitions for sleep staging applicable to both disease and nondisease states and for hypopnea identification.11 The reliability of scored PSG data for a multicenter study with the scope of SHHS can be influenced by the large data-processing needs (as many as 90 studies per week persisting over 3 years), the need to employ a team of scorers who may differ in scoring approaches, and the variability of signal quality from studies performed in unattended settings in diverse environments. For SHHS, we attempted to maximize reliability by rigorous training of scorers, systematic reporting of signal quality, and attempts to formulate explicit scoring rules. The findings of this study provide new data that address the relative reliability of staging, respiratory, and arousal data collected in this setting. This study demonstrates a high degree of reliability in the scoring of sleep stage and RDI from unattended, in-home PSG studies performed for the SHHS. The interscorer reliability of scoring of sleep stage (kappas of 0.81-0.83) is similar to that reported in the literature from laboratory-based studies.12-14 Differences in scoring between scorers were approximately the same as differences in scoring when studies were rescored by the same scorer, suggesting that the methods utilized to standardize scoring in the SHHS achieved as great a concordance among readers as could be achieved. As has been observed by others, stage 1 was the sleep stage scored least reliably. Distinguishing reliably between REM and NREM may be particularly important in addressing hypotheses regarding stage-specific respiratory disturbances, and for ascertaining the effects of REM deprivation on functional outcomes.
We have found that this discrimination can be made extremely reliably, with the kappa for interscorer reliability of epoch-by-epoch staging rising to 0.87-0.90 when all NREM sleep stages are combined. The literature provides little information on the reliability of scoring of arousals. For periodic leg movements in sleep (PLMS), high reliability (range of r of 0.71-0.99) among scorers has been demonstrated for the number of movements per hour (PLMS index), the number of episodes, and the mean number of movements per episode.15 However, in the latter study, when arousal data were included with the PLMS index, the reliability was lower. New definitions for EEG arousals, requiring identification of an abrupt shift in EEG frequency, were adopted by the American Sleep Disorders Association in 1992 without information on their reliability. These definitions can be difficult to operationalize in settings with variable underlying EEG frequencies, as in light sleep or in studies with alpha intrusion. Poor-to-moderate agreement in arousal identification was recently reported in an interscorer reliability study of 14 accredited European laboratories. Using the same definitions, they reported an overall kappa of 0.47. Agreement was best for arousals scored during delta sleep (when the overall background EEG most contrasted with the faster frequencies of arousals), and for studies noted a priori to be easily classified.16 Our data also suggest that achieving high scorer reliability for arousal identification is more difficult than achieving high reliability for sleep staging and hypopnea identification. The modest level of interscorer reliability (ICC=0.54) reflects the difficulty in discerning the occurrence of an abrupt change in EEG frequency in a setting of ongoing changes in background EEG frequency (as occurs in stages 1, 2, and REM sleep). The experience of the scorer and the quality of the underlying signals will influence the ability to identify such changes.
The polysomnograms for the reliability study were chosen after each scorer had scored at least 80 PSGs (80, 129, and 134 for scorers 912, 914, and 915, respectively). However, even this requirement may not have been enough; the low reliability could reflect the relative inexperience of the scorer at the time of the study. Scoring data, tracked over the course of the Sleep Heart Health Study, showed a reduction in differences among scorers in the arousal index (data not shown), suggesting that arousal reliability may be impacted positively by ongoing training and experience. Similar convergence in RDI and sleep stage was not observed. This suggests that achieving high scorer reliability for arousals may require considerably more experience than for respiratory events and sleep staging. Additionally, the assessment of reliability using unattended studies, which may be vulnerable to electrical interference from background environmental contamination and to intermittent loss of EMG signals, will reduce the ability to accurately identify arousals. In fact, within the sample of reliability studies, eight studies were noted at the time of scoring to contain electrical artifact severe enough to interfere with reliable arousal identification. As an indication of the potential impact of both scorer training and signal quality on reliability, the interscorer ICC for the arousal index increased to 0.72 (95% CI: 0.44, 0.88) when analyses were restricted to the two full-time (most intensively trained) scorers. The extent to which reliability may be further improved with additional training, and the extent to which these findings may be generalizable to clinical laboratories, where signals may be less vulnerable to electrical artifact but where scorer training and experience may be variable, require further study.
There is controversy over the optimal way to score hypopneas.3 One source of disagreement concerns the use of oxygen desaturation and arousal as corroborative criteria for identifying a respiratory event. We found that when reductions in breathing amplitude were used in isolation to identify respiratory events, there was nonetheless good reliability of the measurement of RDI (ICC=0.74). When evidence of arousal was also required, the reliability was only slightly improved (ICC=0.77). When an associated oxygen desaturation was required, however, the reliability of the RDI was extremely high, with the ICC rising from 0.93 when a minimum 2% desaturation was required to 0.99 when a minimum 4% desaturation was required. When respiratory events were defined as reductions in airflow accompanied by either oxygen desaturation or arousal, the ICC was somewhat lower than when desaturation alone was required. These results are consistent with the findings of Bliwise et al,17 who found high reliability (r=0.94) among three scorers for RDI but low reliability for particular types of events (eg, apnea vs hypopnea; obstructive vs central; range of r of 0.18-0.69), and of Lord et al,18 who also evaluated three scorers and found measures of RDI to be reliable (range of kappa 0.71-0.87). These findings relate to respiratory events identified at minimum by a 30% or greater reduction in breathing amplitude (measured by oronasal thermocouples) or thoracoabdominal excursion. It is probable that the reliability of respiratory-event scoring varies according to the amplitude criteria used to identify breathing reductions. The higher reliability indices for RDIs defined using events with higher levels of desaturation may relate to the more pronounced breathing changes that characterize these events. Further resolution of this issue will require a comparison of reliability according to differences in amplitude criteria for hypopneas.
In summary, we conclude that the training and quality-assurance methods used in the scoring of SHHS polysomnograms have produced a high degree of interscorer reliability for sleep staging and RDI. Rigorous scoring methods can yield highly reliable measures, even for data collected in an unattended setting that is vulnerable to various sources of artifact. The reliability of RDI scoring is considerably improved by requiring an associated oxygen desaturation in the identification of respiratory events. A high level of reliability appears more difficult to achieve for the arousal index than for the RDI or the staging distributions; the arousal index appears particularly sensitive to both scorer experience and signal quality.

ACKNOWLEDGMENTS

The Sleep Heart Health Study (SHHS) acknowledges the Atherosclerosis Risk in Communities Study (ARIC), the Cardiovascular Health Study (CHS), the Framingham Heart Study (FHS), the Cornell Worksite and Hypertension Studies, the Strong Heart Study (SHS), the Tucson Epidemiology Study of Airways Obstructive Diseases (TES), and the Tucson Health and Environment Study (H&E) for allowing their cohort members to be part of the SHHS and for permitting data acquired by them to be used in the study.

Participating Institutions and SHHS Investigators--Framingham, Mass: Boston University: George T. O'Connor, Sanford H. Auerbach, Emelia J. Benjamin, Ralph B. D'Agostino, Rachel J. Givelber, Daniel J. Gottlieb, Philip A. Wolf; University of Wisconsin: Terry B. Young. Minneapolis, Minn: University of Minnesota: Eyal Shahar, Conrad Iber, Mark W. Mahowald, Paul G. McGovern, Lori L. Vitelli. New York, NY: New York University: David M. Rapoport, Joyce A. Walsleben; Cornell University: Thomas G. Pickering, Gary D. James; State University of New York, Stony Brook: Joseph E. Schwartz; Columbia University (Harlem Hospital): Velvie A. Pogue, Charles K. Francis. Sacramento, CA/Pittsburgh, PA: University of California, Davis: John A. Robbins, William H. Bonekat; University of Pittsburgh: Anne B. Newman, Mark Sanders. Tucson, Ariz/Strong Heart Study: University of Arizona: Stuart F. Quan, Michael D. Lebowitz, Paul L. Enright, Richard R. Bootzin, Anthony E. Camilli, Bruce M. Coull, Russell R. Dodge, Gordon A. Ewy, Steven R. Knoper, Linda S. Snyder; Medlantic Research Institute--Phoenix Strong Heart: Barbara V. Howard; University of Oklahoma--Oklahoma Strong Heart: Elisa T. Lee, J. L. Yeh; Missouri Breaks Research Institute--Dakotas Strong Heart: Thomas K. Welty. Washington County, MD: The Johns Hopkins University: F. Javier Nieto, Jonathan M. Samet, Joel G. Hill, Alan R. Schwartz, Philip L. Smith, Moyses Szklo. Coordinating Center--Seattle, Wash: University of Washington: Patricia W. Wahl, Bonnie K. Lind, Coralyn W. Whitney, Richard A. Kronmal, Bruce M. Psaty, David S. Siscovick. Sleep Reading Center--Cleveland, Ohio: Case Western Reserve University: Susan Redline, Carl E. Rosenberg, Kingman P. Strohl. NHLBI Project Office--Bethesda, MD: James P. Kiley, Richard R. Fabsitz.

This work was supported by National Heart, Lung, and Blood Institute cooperative agreements U01HL53940 (University of Washington), U01HL53941 (Boston University), U01HL53938 (University of Arizona), U01HL53916 (University of California, Davis), U01HL53934 (University of Minnesota), U01HL53931 (New York University), and U01HL53937 (Johns Hopkins University).

REFERENCES

1. Quan SF, Howard BV, Iber C, et al. The Sleep Heart Health Study: design, rationale, and methods. Sleep 1997;20:1077-1085.
2. Rechtschaffen A, Kales A. A manual of standardized terminology, techniques, and scoring system for sleep stages of human subjects. Washington, DC: US Government Printing Office, 1968 (NIH Publication No. 204).
3. The Atlas Task Force. EEG arousals: scoring rules and examples. Sleep 1992;15:173-184.
4. Sleep Heart Health Study Manual of Operations. SHHS Coordinating Center, Seattle, Wash, 1996.
5. Redline S, Sanders MH, Lind BK, Quan SF, Iber C, Gottlieb D, Bonekat WH, Rapoport DM, Smith PL, Kiley JP. Methods for obtaining and analyzing unattended polysomnography data for a multicenter study. Sleep 1998;21:759-767.
6. Sleep Heart Health Study Reading Center Manual of Operations. SHHS Coordinating Center, Seattle, Wash, 1996.
7. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420-428.
8. Donner A, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987;6:441-448.
9. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
10. Fleiss JL. Statistical methods for rates and proportions. New York: Wiley and Sons, 1981, p. 214.
11. Moser NJ, Phillips BA, Berry DT, Harbison L. What is hypopnea, anyway? Chest 1994;105:426-428.
12. Schaltenbrand N, Lengelle R, Toussaint M. Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients. Sleep 1996;19(1):26-35.
13. Stanus E, Lacroix B, Kerkhofs M, Mendlewicz J. Automated sleep scoring: a comparative reliability study of two algorithms. Electroencephalogr Clin Neurophysiol 1987;66(4):448-456.
14. Karacan I, Orr WC, Roth T, et al. Establishment and implementation of standardized sleep laboratory data collection and scoring procedures. Psychophysiology 1978;15(2):173-179.
15. Bliwise DL, Keenan S, Burnburg D. Inter-rater reliability for scoring periodic leg movements in sleep. Sleep 1991;14(3):249-251.
16. Drinnan MJ, Murray A, Griffiths CJ, Gibson GJ. Interobserver variability in recognizing arousal in respiratory sleep disorders. Am J Respir Crit Care Med (in press).
17. Bliwise D, Bliwise NG, Kraemer HC, Dement W. Measurement error in visually scored electrophysiological data: respiration during sleep. J Neurosci Methods 1984;12:49-56.
18. Lord S, Sawyer B, Pond D, et al. Interrater reliability of computer-assisted scoring of breathing during sleep. Sleep 1989;12(6):550-558. Comment in: Sleep 1991;14(1):89-90.