Sleep, 18(8):659-666
© 1995 American Sleep Disorders Association and Sleep Research Society

Methodological and Statistical Problems in Sleep Apnea Research: The Literature on Uvulopalatopharyngoplasty

*Kenneth B. Schechtman, †Aaron E. Sher and ‡Jay F. Piccirillo

*Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri, U.S.A.; †Division of Otolaryngology-Head and Neck Surgery and Capital Region Sleep Wake Disorders Center, Albany Medical College, Albany, New York, U.S.A.; and ‡Department of Otolaryngology-Head and Neck Surgery, Washington University School of Medicine, St. Louis, Missouri, U.S.A.

Accepted for publication March 1995.
Address correspondence and reprint requests to Kenneth B. Schechtman, Ph.D., Washington University School of Medicine, Division of Biostatistics, Box 8067, 660 South Euclid Ave., St. Louis, Missouri 63110, U.S.A.

This manuscript discusses methodological and statistical problems identified in our comprehensive review of the literature on the surgical treatment of sleep apnea. A companion paper based on the same literature review will be published in a forthcoming issue of Sleep. That paper will discuss the efficacy of surgical approaches to the treatment of sleep apnea.

Summary: A comprehensive review of the literature on the surgical treatment of sleep apnea found 37 appropriate papers (total n = 992) on uvulopalatopharyngoplasty (UPPP). Methodological and statistical problems in these papers included the following: 1) There were no randomized studies and few (n = 4) with control groups. 2) Median sample size was only 21.5; thus statistical power was low and clinically important associations were routinely classified as "not statistically significant". 3) Only one paper presented the confidence bounds that might distinguish between statistical and clinical significance. 4) Because of short follow-up time and infrequent repeat follow-ups, little is known about whether UPPP results deteriorate with time. 5) In at least 15 papers, bias caused by retrospective designs and nonrandom loss to follow-up raised questions about the generalizability of results. 6) Few papers associated polysomnographic data with patient-based quality of life measures. 7) Missing data and missing and inconsistent definitions were common. 8) Baseline measures were often biased because the same assessment was inappropriately but routinely used for both screening and baseline. We conclude that because of these and other problems, there is much that is needlessly unknown about UPPP. It is the responsibility of the research and professional communities to define training, editorial and review procedures that will raise the methodological and statistical quality of published research. Key Words: Methodological problems - Control groups - Sample size - Confidence bounds - Bias - Sleep apnea - Uvulopalatopharyngoplasty.

There is a broad perception among clinical researchers that there are serious problems with the methodological and statistical components of much of the medical literature. Commonly reported problems include inadequate sample size (1,2), not distinguishing between statistical and clinical significance (3,4), inappropriate statistical methods (5,6), lack of randomization or absent or inappropriate controls (5) and inadequate or biased follow-up (7). Deficiencies of this type can have a variety of negative consequences with respect to the conclusions of published research.
Thus, when sample sizes are too small, a conclusion such as "no difference between groups" can in reality mean that there was not enough statistical power to determine if a difference existed. When control groups are either not present or are inappropriate, it may be impossible to determine the cause of any suggested improvement in patient outcome. When patients are lost to follow-up, the followed subsample may be self-selected and biased so that generalizability is compromised.

The purpose of this paper is to discuss methodological and statistical issues such as those noted above as they relate to the literature on the surgical treatment of obstructive sleep apnea (OSA) using uvulopalatopharyngoplasty (UPPP). Conclusions are based on 37 UPPP papers discussed in our comprehensive review of the literature on all surgical approaches to the treatment of OSA (8).

METHODS

As part of a comprehensive review of the literature on the surgical treatment of sleep apnea, a Medline search identified all relevant articles published in English between January 1966 and December 1994. A total of 175 reports of clinical studies were selected by this process. From this list, we included all clinical studies that reported on at least 10 adult patients undergoing surgery for OSA and that contained data from both pre- and post-surgery polysomnograms (PSGs). If two or more papers contained data on the same patients, the one with the most complete information was retained. Details of the selection process that produced 37 UPPP papers with 992 patients who had both pre- and post-surgery PSG data are reported in a companion article to be published in a forthcoming issue of Sleep (8). These 37 papers, which are the "group 1" papers in the companion article, form the basis of the present report.

RESULTS

Our review of the UPPP literature produced the following methodological concerns:

1. Inadequate sample size and little statistical power

The power of a statistical test is the probability that the test will be significant when there really is a difference between groups. The type II error (often termed the beta error) is the false negative rate. It equals 1 minus the power and is the probability of not claiming statistically that there are between-group differences when such differences actually exist. As we shall see, inadequate sample size (i.e. low power and high type II error) is a common problem, both in UPPP reports and in the medical literature in general.

Among the 37 UPPP papers, not a single one discussed statistical power. Indeed, there was no indication in any paper that the question of how large the sample size needed to be was even considered before the study began. We suspect, therefore, that the sample size in most of these reports was based on how many patients were available, with little or no concern for the number of patients that were actually required to answer the questions of interest.

Large differences between groups, particularly when combined with small standard deviations, imply small sample size requirements. Occasionally, as few as 5 or 10 patients per group will yield adequate statistical power. Unfortunately, sample size requirements are far greater when between-group differences are small. In this setting, prospective power computations are especially important. They tell the investigator that sample sizes are adequate or, alternatively, that patient availability is too limited for the protocol to be a worthwhile endeavor.
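A prospective computation of this kind takes only a few lines. The following is a minimal sketch in Python, assuming the statsmodels package is available; the planning values are purely illustrative and are not drawn from any of the reviewed studies:

    # Prospective sample-size computation for a two-group comparison of means.
    # The planning values below are hypothetical placeholders.
    from statsmodels.stats.power import TTestIndPower

    assumed_difference = 15.0  # smallest clinically important difference
    assumed_sd = 30.0          # anticipated standard deviation
    effect_size = assumed_difference / assumed_sd  # ratio d = 0.5

    n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                              alpha=0.05, power=0.80)
    print(f"patients required per group: {n_per_group:.0f}")  # about 64

Run before enrollment begins, such a computation reveals immediately whether the available patient pool can support the planned comparison.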
Freiman et al. (1) evaluated 71 papers that reported "negative" results and found that inadequate sample size is a common problem. The goal of these authors was to determine how frequently statements such as "There was no significant difference" could be more accurately reported as: "While we got non-significant results, we did not have enough patients to determine whether there was or was not a clinically important difference (i.e., we were wasting our time insofar as the stated endpoint is concerned)". The results reported by Freiman et al. (1) are startling. Sixty-seven of the 71 trials (94.4%) had a sample size that would have missed a 25% therapeutic improvement at least 10% of the time. That is, the power to detect a 25% treatment effect was almost always < 90%. In 50 of the 71 trials (70.4%), a very large 50% improvement in the response rate would have been called "nonsignificant" at least 10% of the time. Although they do not present the data, it is likely that most of these trials had a < 50% chance of detecting large and clinically important differences.

The problems described by Freiman et al. (1) may be more serious in the UPPP literature. Our 37 UPPP papers had from 9 to 90 patients with follow-up data (mean n = 26.8, median = 21.5), numbers that are grossly inadequate for many purposes. For example, UPPP papers that had respiratory distress index (RDI) data showed a combined decrease in RDI of 33 ± 61% in the oropharyngeal group (i.e. patients whose airway collapse is in the retropalatal segment only) and 6.5 ± 47% in the hypopharyngeal group (i.e. patients with retrolingual airway collapse). Assuming these are the true population differences, detecting them statistically would require 93 patients per group to achieve a power of 0.9 and 70 per group for a power of 0.8. With 10 patients per group, typical in the papers we reviewed, the power is only 0.18. With a maximum total of 90 UPPP patients, none of the reviewed papers comes close to having sufficient sample size to provide a reasonable test of the hypothesis that the percent change in RDI following UPPP is equal in oropharyngeal and hypopharyngeal patients. The assertion in any one of these papers that "no difference was found" has very little meaning.

Sample size requirements for statistical tests comparing mean values depend on the ratio defined as the difference between the means in the two groups divided by the standard deviation. A simple rule of thumb is that if this ratio is 1 (i.e. the standard deviation equals the difference in means), about 20 subjects per group will yield a statistical test with reasonable power (about 0.8). But if the ratio is 0.5, some 65 patients are required in each group, whereas a ratio of 0.25 means that 250 must be enrolled in each group. The earlier RDI example is typical of a UPPP literature in which standard deviations for many parameters are far greater than mean differences. When this occurs, the ratio defined above is < 1, and sample sizes of 20 or 30 may yield a power of only 0.2 or 0.3. This means that nonsignificant results will be commonly reported when between-group differences are both clinically important and substantial. That is, in most of the papers we reviewed, assertions such as "no significant difference was found" must be interpreted with great caution at best.
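The figures above can be checked with the standard normal-approximation sample-size formula. A quick sketch follows, assuming scipy is available; the results differ slightly from the rounder numbers quoted in the text because the exact values depend on the t-distribution correction and on rounding:

    # Closed-form check of the rule of thumb and of the RDI power figure.
    import math
    from scipy.stats import norm

    def n_per_group(d, alpha=0.05, power=0.80):
        """Approximate per-group n for a two-sided two-sample comparison."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return 2 * (z / d) ** 2

    for d in (1.0, 0.5, 0.25):
        print(f"ratio d = {d}: about {n_per_group(d):.0f} per group")
    # -> about 16, 63 and 251 per group, near the 20/65/250 rule of thumb

    # Power with 10 patients per group for the RDI example above
    # (mean decreases 33 vs. 6.5, standard deviations 61 and 47):
    d_rdi = (33 - 6.5) / math.sqrt((61**2 + 47**2) / 2)  # about 0.49
    power_10 = norm.cdf(d_rdi * math.sqrt(10 / 2) - norm.ppf(0.975))
    print(f"power with n = 10 per group: {power_10:.2f}")  # about 0.19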
Summary

Sample size requirements and statistical power should be considered before research projects begin. When sample sizes are small, the reader should be skeptical about negative results. Statements that "no difference was found" will often be a polite way of saying "We have no idea whether the two groups are substantially different. We didn't give the question a fair test because of limited sample size". This is the reality in much of the UPPP literature.

2. Confidence bounds: the link between statistical and clinical significance

Although all of the papers report p-values, only one of the 37 UPPP papers presented confidence bounds. p-values provide information about the statistical significance of associations between variables or differences between groups. But they are uninformative as to the magnitude of potential differences and say nothing about the association between statistical and clinical significance. Because small and clinically irrelevant differences may be statistically significant in large studies, whereas large and clinically important differences may be nonsignificant in small studies, p-values in isolation provide an incomplete message. Confidence bounds yield statements like "I am 95% confident that the magnitude of the between-group difference in the mean of some parameter (or the success rates of some treatment) lies between two numbers, say a and b". Confidence bounds go beyond black and white statements about statistical significance. They quantify how large differences might actually be and evaluate the clinical importance of potential differences.

As an example, we define success as a 50% decrease in apnea index (AI) and use one of the reviewed UPPP papers (9) to evaluate the association between UPPP success and baseline AI. In this paper (n = 30), the success rate was 46.7% and baseline AI was 42.6 ± 22.0 in successes and 69.6 ± 26.6 in failures. The associated p-value of 0.006 suggests that baseline AI is lower in patients who respond to UPPP. The 95% confidence bounds on the potential difference between the baseline AI of successes and failures range from 8.6 to 45.4 events/hour. Thus, results are significant, but potential differences range widely from 9 to 45. That is, this statistically significant result might reflect anything from a small clinically irrelevant difference to a large and clinically important difference. The sample size is too small to permit greater precision.

It is worth noting that 10 of the 37 UPPP papers (total n = 241, success rate = 60.2%) had sufficient information to permit a meta-analysis focused on the association between UPPP success and baseline AI. Using the DerSimonian and Laird approach to meta-analysis (10), we find that pooling these 241 patients results in baseline AI being smaller in the successfully treated patients by a nonsignificant (p = 0.664) 2.3 ± 5.3 events/hour. The 95% confidence bounds range from -8.1 to 12.8. Without the confidence bounds, we can say only that we were unable to detect a statistically significant difference. But the narrow confidence bounds that result from the increased sample size in the meta-analysis permit us to conclude also that if there is a difference between groups, it is almost certainly small in magnitude and clinically unimportant.
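The confidence bounds in the single-paper example can be reproduced from the published summary statistics alone. The following is a minimal sketch, assuming scipy is available and inferring the group sizes (14 successes, 16 failures) from the reported 46.7% success rate among n = 30:

    # 95% confidence bounds for the difference in baseline AI between
    # UPPP failures and successes, computed from summary statistics.
    import math
    from scipy import stats

    n1, mean1, sd1 = 14, 42.6, 22.0  # successes (inferred group size)
    n2, mean2, sd2 = 16, 69.6, 26.6  # failures (inferred group size)

    diff = mean2 - mean1             # 27.0 events/hour
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
    print(f"95% CI: {diff - t_crit * se:.1f} to {diff + t_crit * se:.1f}")
    # -> roughly 8 to 46 events/hour, matching the wide 8.6-45.4 interval.
    # The pooled meta-analytic bounds quoted above work the same way:
    # 2.3 +/- 1.96 * 5.3 gives roughly -8.1 to 12.7 events/hour.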
Summary

p-values are only part of the story. They tell the reader whether observed differences are statistically significant. But they say nothing about the magnitude of the difference and are uninformative as to the relationship between statistical significance and clinical importance. Confidence bounds can be the missing link; they should be a routine component of many reports of clinical research. Only one of the UPPP papers we reviewed reported confidence bounds.

3. Uncontrolled studies

In their book on clinical trials, the key point about study design made by Friedman et al. (11) is that "Sound scientific clinical investigation almost always demands that a control group be used against which the new intervention can be compared. Randomization is the preferred way of assigning subjects to control and intervention groups". Although this reflects the broadly held view of the scientific community, our review of the UPPP literature revealed only four reports that used control groups to compare one treatment with another. We found no randomized clinical trials and only one nonrandomized study that compared two versions of UPPP (12).

The reasons for control groups and for randomization are clear. Without a control, there is no way to know if therapeutic outcomes reflect the tested treatment or some other factor. The suggested "superiority" of a given therapy in an uncontrolled study may reflect the night-to-night variability of PSG testing, concurrent treatments and behaviors, more experienced or more skilled surgeons or patient characteristics that happen to favor success in the study sample. For example, in our 37 UPPP papers, success rates among 111 patients with oropharyngeal collapse only were in the 70-80% range as compared to some 20% in 57 patients who also had hypopharyngeal collapse. Clearly, the importance of this single factor in determining success far outweighs the likely impact of any modified approach to UPPP. An assertion that one UPPP procedure is better than another because the results of uncontrolled studies seem better may be largely meaningless because of the enormous impact of the location of collapse. But in a randomized study, between-group balance with respect to this key variable is likely. Moreover, if there is a control, statistical adjustment can often address the between-group imbalances that sometimes occur by chance.

Although it is often difficult to conduct surgical trials with properly defined control groups, there are many settings in which randomization is both ethical and practical. For example, modifications of the original UPPP procedure developed by Fujita et al. (13) have been discussed by many authors (12,14-16). Because there is almost no literature that formally compares these procedures, there are few scientific data with which to decide that one approach is best. Thus, randomization is ethical. Because experience indicates that many patients are eager to participate in a scientific investigation when their physician indicates honest uncertainty as to which treatment is better, and when the best possible care is offered as part of a clinical trial, randomization is practical. Finally, because the literature is filled with inferior or useless treatments that became standard (sometimes for decades) because of nonrandomized and uncontrolled clinical studies, it might be argued that randomization is an ethical obligation when therapies have as many associated uncertainties as does UPPP. Among the prominent surgical procedures that are documented as having seen extensive inappropriate use are routine radical mastectomy in breast cancer patients (17), tonsillectomy in children with chronic sore throat (18), internal mammary artery ligation in angina patients (19) and routine back surgery for patients with disk problems (20).
Summary

Randomized controlled trials are the gold standard of clinical research. Without a control group, there is no way to know that results reflect the relative benefit of the study treatment. Control groups are almost never used in the UPPP literature. They should be standard when efficacy is being evaluated.

4. Inadequate follow-up

Among the 37 UPPP papers, 7 (18.9%) provided no information about the length of follow-up, 25 (67.6%) provided mean follow-up times or follow-up ranges from which means could be accurately derived, and 5 (13.5%) provided minimum follow-up times only. Three of the 25 papers with mean follow-up data contained both a short (at most 6 months) and a long-term (> 12 months) follow-up on each patient (21-23). A fourth paper had a long-term follow-up in progress (24), with the results of that follow-up being presented in a paper excluded from our review (25) because it reported on the same patients that were included in the report (24) it was extending. Two other papers had a single follow-up whose mean time was > 1 year, whereas mean follow-up time was > 6 and at most 12 months in three papers. Using the shorter time in the papers with both short- and long-term follow-ups, overall mean time was 4.9 months in the 25 papers that provided the necessary data. Average minimum follow-up in the five papers presenting no other follow-up information was 1.7 months.

As important as missing follow-up information are short follow-up times and the lack of repeat follow-up. Without studies that contain both short- and long-term follow-up data, there is no way to know if the benefits of UPPP are maintained over time. Thus, despite suggestions by some authors that the deterioration of UPPP results after surgery is uncommon (26), the fact is that we know very little about the long-term impact of UPPP. Two of the four identified reports (21,22) that provide both short- and long-term follow-up data suggest there is deterioration in PSG results a year or two after surgery. A third paper (25) suggests the opposite, while a fourth paper (23) found little improvement at either the early or the late assessment. Given that the limited amount of existing data results from small studies with conflicting results, it is clear that the question of whether UPPP deteriorates with time remains open.

Summary

Follow-up duration in most UPPP papers is brief, and few papers present both short- and long-term follow-up data. As a result, we do not know if UPPP results deteriorate with time. Research that includes both long-term and repeat follow-up is essential.

5. Results with uncertain generalizability

When a clinical report reflects data from patients who may not be typical of some defined larger population, it may be difficult to assess the generalizability of results. This problem arises 1) in prospective studies when outcome data are missing for nonrandom reasons related to the outcomes themselves and 2) in retrospective studies that require follow-up information as an entry criterion so that patients with missing follow-up are excluded at the outset. These problems are widespread in the UPPP literature.

A fundamental goal of prospective clinical research is keeping enrolled subjects on study until follow-up is complete. When follow-up data are missing, an unavoidable danger is that the data are missing for reasons directly related to the probability of a successful outcome. When this happens, patients remaining on study do not reflect the true impact of therapy in the larger population. With comments like "There is some inherent bias in our study in that those patients with a less satisfactory outcome may be more apt to undergo a post-operative evaluation" (27) and "Many patients have reported a complete lack of interest in a postoperative study because they felt so much better after surgery" (28), several authors have acknowledged this potential source of bias.

In retrospective database studies that require follow-up as an entry criterion, bias can result when patients with missing follow-up are excluded at the outset. Among the 37 UPPP papers, 12 were definitely retrospective, 11 were definitely prospective, 5 were probably retrospective, 5 were probably prospective and 4 were indeterminate. Thus, the status of 14 papers (37.8%) as prospective or retrospective was inconclusive and about half of the 37 papers were retrospective. Because of missing follow-up information, 15 papers were categorized as "definitely biased", 12 were definitely not biased (insofar as follow-up patterns are concerned), 3 were probably biased, 3 were probably not biased and 4 were indeterminate. In the biased papers, results are based on an unrepresentative sample so that it may be impossible to define a population to which those results can be generalized.

Summary

Biased unrepresentative samples are common in the UPPP literature. To avoid this problem, entry and exclusion criteria should be carefully defined and rigorously followed. Prospective designs, with a determined effort to ensure complete follow-up, should be standard.
6. Quality of life

Polysomnography results provide the traditional objective measures of OSA severity. However, it is frequently noted that the results of objective tests correlate poorly with symptoms, functional impairments and the emotional consequences of disease (27). This discrepancy between objective and subjective measures has been noted in other disorders (29). Because of the scientific preference for "hard" data over "soft" data, physicians often disregard subjective patient-based reports of disease severity as unreliable and unworthy of scientific examination. Although objective measures of severity are essential, the frequent exclusion of patient-based data has diminished the clinical relevance of studies of prognosis and treatment effectiveness.

Disease-specific health status refers to the physical problems, functional limitations and emotional consequences of a disease. It can be measured by subjective and objective tests. Although some investigators view quality of life as a reflection of the patient's perception of and reaction to his or her health status (30), there is no universally accepted definition. Nevertheless, subjective reports of OSA-specific symptom severity have existed for some time (31-33), and a disease-specific health-related quality-of-life measure was recently developed by Piccirillo et al. (34).

Our literature review identified many articles that reported on patient symptoms and patient reports of outcomes. However, no paper utilized validated measures of disease-specific health status in reporting disease severity and patient outcomes. Subjective reports of health status and health-related quality of life should be standard in clinical research. Such measures would provide a more complete description of OSA while yielding a better understanding of the association between subjective and objective measures. As one author stated, "Measures in addition to PSG, including patient subjective response, would more fully characterize the outcome of revision of the upper airway for sleep apnea" (35).
Summary

Disease-specific health status and health-related quality-of-life measures describe important aspects of illness and supplement traditional objective measures of disease severity. They have not been used in the UPPP literature. They should be routine.

7. Multiple endpoints

Well-designed protocols prospectively define a small number of primary endpoints (ideally one or two). One reason is to help focus the research. But equally important are the statistical uncertainties that are associated with multiple endpoints. When a single statistical test is performed at the 0.05 level of significance, there is a 5% chance of claiming significance when there is no difference between groups. This is the type I error, the probability of wrongly claiming significance. When more than one statistical test is performed, the laws of chance mean that the probability of wrongly claiming significance at least once exceeds 5%. With five independent tests, that probability is about 20%. With 14 tests it exceeds 50%. Translated, when one reads a paper in which large numbers of statistical tests have been performed, it is likely that some significant results have occurred by chance.
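The arithmetic behind these figures is simple: with k independent tests each performed at the 0.05 level, the chance of at least one spurious "significant" result is 1 - 0.95^k. A quick sketch:

    # Family-wise false-positive probability for k independent tests at 0.05.
    alpha = 0.05
    for k in (1, 5, 14, 28):
        fwer = 1 - (1 - alpha) ** k
        print(f"{k:2d} tests: P(at least one spurious result) = {fwer:.2f}")
    # -> 1 test: 0.05; 5 tests: 0.23; 14 tests: 0.51; 28 tests: 0.76

The 28-test row corresponds to the paper with the largest number of endpoints noted below.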
In the 37 UPPP papers, the mean number of endpoints was 6.8, about the same as has been reported elsewhere (36). Six papers reported 1 or 2 endpoints, 14 had 3-5 endpoints, 9 papers had 6-9 endpoints and 8 papers had ≥ 10 endpoints. The maximum number was 28. These numbers indicate that in many of the papers, it is all but certain that some significant relationships reflect the laws of chance and nothing more. And we emphasize, the above tabulation does not include endpoints that were evaluated but not reported or the many other statistical tests that were reported but that did not involve pre- to post-surgery changes in endpoints.

Scientists are curious by nature, and it would be absurd to suggest that data be ignored. Instead, we recommend that the researcher: 1) prospectively define one or two endpoints to stand out separately as the only predefined primary endpoints, explicitly categorize other analyses as "secondary" or "exploratory" and state clearly that secondary results must be interpreted cautiously if there are a large number of secondary hypothesis tests and 2) report precise p-values rather than simply "p < 0.05" so that confidence in conclusions can be better assessed.

Summary

Multiple endpoints reduce confidence in results and are present in most UPPP papers. High-quality research requires that a small number of primary endpoints be prospectively defined. Other analyses should be interpreted cautiously because chance significance may be common.

8. Missing data and missing or inconsistent definitions

We have already noted that 12 of 37 UPPP papers (32.4%) did not provide mean follow-up data and 14 papers (37.8%) did not provide sufficient information to determine conclusively whether the study was prospective or retrospective. There were seven other papers in which the status could be determined from the context, but there was no clear statement as to whether the research was or was not prospective. Additional examples of inappropriately missing information include 12 papers with missing age on followed patients, 15 papers where the gender distribution among followed patients could not be determined and 22 papers in which the definition of OSA was not presented even though the presence of sleep apnea was a stated entry requirement. Among the papers that did define OSA, there were at least five definitions (AI > 4, AI > 5, AI > 10, at least 30 episodes of apnea during the recording period and RDI > 5). There was even greater variety in the definition of a hypopnea. Moreover, among the 22 papers in which hypopneas were discussed as an outcome measure, only 10 (45.4%) included a definition.

Summary

Missing information and inconsistent definitions are common in the UPPP literature. This compromises between-paper comparisons and is particularly problematic when there are no concurrent controls. Editorial policies should seek to minimize these problems.

9. Biased baseline data

Entry into many clinical trials requires that subjects achieve some stated minimum value for a parameter that will serve as an outcome measure or a covariate. In such a setting, standard research practice requires that two separate pretreatment assessments be performed on each subject: a screening assessment that determines eligibility with respect to the parameter and an entirely distinct assessment that determines the baseline value of the parameter.

To see why this is necessary, one may consider a study with a minimum acceptable AI of 10. Because of night-to-night variability (37,38), some subjects whose screening AI is > 10 will have true indices of only 8 or 9. For these subjects, the screening AI will be a biased overestimate of their true AI. But there will be no one to balance out this bias in the other direction. Indeed, a subject whose true AI is > 10 but whose screening value happens to be 8 or 9 will have been excluded from the trial. The net result is that the bias is unidirectional, and if the screening AI is also used as the baseline AI, baseline values will be biased overestimates of the true apnea index of enrolled subjects. In this setting, post-treatment values will tend to be lower than pretreatment values even when the treatment has no effect. That is, there will be a biased overestimate of treatment effectiveness.

The magnitude of the bias noted above is a function of the variability of the parameters of interest. But because there is little precise information about the night-to-night variability of AI and RDI, we cannot quantify this bias. However, the Multiple Risk Factor Intervention Trial (MRFIT) (39) demonstrates that the potential bias is large. This primary-prevention cardiovascular trial randomized nearly 13,000 subjects whose high-risk status for eligibility was partly determined by blood pressure. According to accepted standards, MRFIT used separate screening and baseline assessments. The result was a mean screening diastolic blood pressure (DBP) of 99 mm Hg as compared to a baseline value of only 91 mm Hg. Had the screening value served also as the baseline value, the effect of the intervention on DBP would have been overestimated by 8 mm Hg, and the true role of DBP in MRFIT would have been obscured.

In closing, we note that all 37 UPPP papers had minimum acceptable values for key outcome parameters. None of them indicated the use of separate screening and baseline assessments. Thus, we believe that published UPPP success rate data may overstate the true impact of surgery.
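The direction of this bias is easy to demonstrate by simulation. The following is a minimal sketch, assuming numpy is available; the gamma-distributed "true" AI values and the normal night-to-night noise are hypothetical choices made purely for illustration:

    # Simulate reusing a screening PSG as the baseline when entry requires AI > 10.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    true_ai = rng.gamma(shape=4.0, scale=5.0, size=n)  # hypothetical true AI
    night1 = true_ai + rng.normal(0, 5, size=n)        # screening night
    night2 = true_ai + rng.normal(0, 5, size=n)        # separate baseline night

    enrolled = night1 > 10                             # entry criterion: AI > 10
    print(f"true AI of enrollees:      {true_ai[enrolled].mean():.1f}")
    print(f"screening AI of enrollees: {night1[enrolled].mean():.1f}")
    print(f"independent baseline AI:   {night2[enrolled].mean():.1f}")
    # The screening night systematically overstates the enrollees' severity,
    # while a second, independent baseline night does not; reusing the
    # screening value as baseline therefore inflates apparent improvement.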
Summary

When eligibility requirements include minimum values for outcome parameters, biased overestimates of treatment effectiveness will result if the screening value is also used as the baseline value. This bias was present in every one of the UPPP papers. The magnitude of the bias cannot be estimated because there is little information about the night-to-night variability of PSG data.

Conclusions and recommendations

The UPPP literature contains methodological and statistical problems that include inadequate sample size, the absence of confidence bounds, short and missing follow-up, the infrequent use of controls, results with uncertain generalizability, too many retrospective studies, too many endpoints, little discussion of quality of life, missing data and inconsistent definitions of key variables. The catalogue of information that is unknown because of these problems or because key questions have not been asked is needlessly large. The list includes information about the degree to which many baseline factors predict UPPP outcome; whether and to what extent UPPP results deteriorate with time; the relative benefits of UPPP modifications; the impact of concomitant surgeries and comorbidities on UPPP success; the association between health status, quality of life and PSG outcomes and a variety of questions about the impact of UPPP on long-term morbidity and mortality.

The following steps might yield a substantial improvement in research quality and in the information it provides about the treatment of patients with sleep apnea. These recommendations are based on the results presented herein, on our experience in teaching biostatistics and research design to medical school faculty and fellows and on the suggestions of many authors who have grappled with these issues (40-44).

1) The first step in the research process should be a written protocol that has been prepared in collaboration with a statistician/methodologist. Retrospective database and chart reviews should be a last-resort methodology. Studies with concurrent controls, with randomization when possible, with adequate sample size and complete follow-up, with explicit definitions, with a small number of primary endpoints and with both objective and subjective outcome measures should be the goal of protocols that are intended to evaluate a given treatment in a defined set of patients.

2) Journal editors and representatives of professional and scientific societies should meet jointly to establish research standards. These should include editorial expectations regarding research methodology, a minimum set of methodological and statistical information to be included in every published manuscript and a review process that includes methodological and statistical expertise.

3) National meetings of relevant professional and scientific societies should present short courses and symposia on research design, statistical methodology and protocol development.

4) The standards of research protocols sponsored by professional societies should be evaluated and upgraded.

5) Training programs for residents and fellows should routinely include a focus on research design. Because a critical reading of the medical literature requires an appreciation of design issues, such programs should not be restricted to future researchers.

REFERENCES
1. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978;299:690-4.
2. Williams HC, Seed P. Inadequate size of 'negative' clinical trials in dermatology. Br J Dermatol 1993;128:317-26.
3. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515-7.
4. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429-35.
5. Glantz SA. Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation 1980;61:1-7.
6. Altman DG. Statistics in medical journals. Stat Med 1982;1:59-71.
7. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol 1986;4:942-51.
8. Sher AE, Schechtman KB, Piccirillo JF. The efficacy of surgery for obstructive sleep apnea syndrome: a comprehensive review of the literature. Sleep (in press).
9. deBerry-Borowiecki B, Kukwa AA, Blanks RHI. Indications for palatopharyngoplasty. Arch Otolaryngol 1985;111:659-63.
10. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7:177-88.
11. Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. Boston: John Wright, PSG Inc, 1982.
12. Kimmelman CP, Levine SB, Shore ET, Millman RP. Uvulopalatopharyngoplasty: a comparison of two techniques. Laryngoscope 1985;95:1488-90.
13. Fujita S, Conway WA, Zorick F, Roth T. Surgical correction of anatomic abnormalities in obstructive sleep apnea syndrome: uvulopalatopharyngoplasty. Otolaryngol Head Neck Surg 1981;89:923-34.
14. O'Leary MJ, Millman RP. Technical modifications of uvulopalatopharyngoplasty: the role of the palatopharyngeus. Laryngoscope 1991;101:1332-5.
15. Ryan CF, Dickson RI, Lowe AA, Blokmanis A, Fleetham JA. Upper airway measurements predict response to uvulopalatopharyngoplasty in obstructive sleep apnea. Laryngoscope 1990;100:248-53.
16. Dickson RI, Blokmanis A. Treatment of obstructive sleep apnea by uvulopalatopharyngoplasty. Laryngoscope 1987;97:1054-9.
17. Fisher B, Bauer M, Margolese R, et al. Five-year results of a randomized clinical trial comparing total mastectomy and segmental mastectomy with or without radiation in the treatment of breast cancer. N Engl J Med 1985;312:665-73.
18. Bakwin H. Pseudodoxia pediatrica. N Engl J Med 1945;232:691-7.
19. Dimond EG, Kittle CF, Crockett JE. Comparison of internal mammary ligation and sham operation for angina pectoris. Am J Cardiol 1960;5:483-6.
20. Jensen MC, Brant-Zawadzki MN, Obuchowski N, Modic MT, Malkasian D, Ross JS. Magnetic resonance imaging of the lumbar spine in people without back pain. N Engl J Med 1994;331:69-73.
21. Larsson H, Carlsson-Nordlander B, Svanborg E. Long-time follow-up after UPPP for obstructive sleep apnea syndrome: results of sleep apnea recordings and subjective evaluation 6 months and 2 years after surgery. Acta Otolaryngol (Stockh) 1991;111:582-90.
22. Launois SH, Feroah TR, Campbell WN, et al. Site of pharyngeal narrowing predicts outcome of surgery for obstructive sleep apnea. Am Rev Respir Dis 1993;147:182-9.
23. Walker EB, Frith RW, Harding DA, Cant BR. Uvulopalatopharyngoplasty in severe idiopathic obstructive sleep apnoea syndrome. Thorax 1989;44:205-8.
24. Fujita S, Conway WA, Zorick FJ, et al. Evaluation of the effectiveness of uvulopalatopharyngoplasty. Laryngoscope 1985;95:70-4.
25. Conway W, Fujita S, Zorick F, et al. Uvulopalatopharyngoplasty: one year followup. Chest 1985;88:385-7.
26. Philip-Joet F, Rey M, Triglia JM, et al. Uvulopalatopharyngoplasty in snorers with sleep apneas: predictive value of presurgical polysomnography. Respiration 1991;58:100-5.
27. Davis JA, Fine ED, Maniglia AJ. Uvulopalatopharyngoplasty for obstructive sleep apnea in adults: clinical correlation with polysomnographic results. Ear Nose Throat J 1993;72:63-6.
28. Maisel RH, Antonelli PJ, Iber C, et al. Uvulopalatopharyngoplasty for obstructive sleep apnea: a community's experience. Laryngoscope 1992;102:604-7.
29. Barry MJ, Fowler FJ. The methodology for evaluating the subjective outcomes of treatment for benign prostatic hyperplasia. Adv Urol 1993;6:83-99.
30. Berger M. Quality of life, health status, and clinical research. Med Care 1989;27:S148-56.
31. Hoddes E, Zarcone V, Smythe H, Phillips R, Dement WC. Quantification of sleepiness: a new approach. Psychophysiology 1973;10:431-6.
32. Buysse DJ, Reynolds CF, Monk TH, Berman SR, Kupfer DJ. The Pittsburgh sleep quality index: a new instrument for psychiatric practice and research. Psychiatry Res 1989;28:193-213.
33. Johns MW. Reliability and factor analysis of the Epworth sleepiness scale. Sleep 1992;15:376-81.
34. Piccirillo JF, White DA, Schechtman KB, Edwards DA, Sher AE. Development of the OSA patient oriented severity index. Fourth World Congress on Sleep Apnea, San Francisco, CA, October 4, 1994 (abstract).
35. Regestein QR, Ferber R, Johnson TS, Murawski BJ, Strome M. Relief of sleep apnea by revision of the adult upper airway. Arch Otolaryngol Head Neck Surg 1988;114:1109-13.
36. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med 1987;317:426-32.
37. Wittig RM, Romaker A, Zorick FJ, Roehrs TA, Conway WA, Roth T. Night-to-night consistency of apneas during sleep. Am Rev Respir Dis 1984;129:244-6.
38. Meyer TJ, Eveloff SE, Lewis RK, Millman RP. One negative polysomnogram does not exclude obstructive sleep apnea. Chest 1993;103:756-60.
39. Multiple Risk Factor Intervention Trial Research Group. Relationship between baseline risk factors and coronary heart disease and total mortality in the Multiple Risk Factor Intervention Trial. Prev Med 1986;15:254-73.
40. George SL. Statistics in medical journals: a survey of current policies and proposals for editors. Med Pediatr Oncol 1985;13:109-12.
41. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med 1982;306:1332-7.
42. Hokanson JA, Stiernberg CM, McCracken MS, Quinn FB Jr. The reporting of statistical techniques in otolaryngology journals. Arch Otolaryngol Head Neck Surg 1987;113:45-50.
43. Zelen M. Guidelines for publishing papers on cancer clinical trials: responsibilities of editors and authors. J Clin Oncol 1983;1:164-9.
44. Colditz GA, Emerson JD. The statistical content of published medical research: some implications for biomedical education. Med Educ 1985;19:248-55.