
Sleep, 18(8):659-666
© 1995 American Sleep Disorders Association and Sleep Research Society
Methodological and Statistical Problems in Sleep Apnea Research: The Literature on Uvulopalatopharyngoplasty

*Kenneth B. Schechtman, †Aaron E. Sher and ‡Jay F. Piccirillo

*Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri, U.S.A.; †Division of Otolaryngology-Head and Neck Surgery and Capital Region Sleep Wake Disorders Center, Albany Medical College, Albany, New York, U.S.A.; and ‡Department of Otolaryngology-Head and Neck Surgery, Washington University School of Medicine, St. Louis, Missouri, U.S.A.

Accepted for publication March 1995.
Address correspondence and reprint requests to Kenneth B. Schechtman, Ph.D., Washington University School of Medicine, Division of Biostatistics, Box 8067, 660 South Euclid Ave., St. Louis, Missouri 63110, U.S.A.
This manuscript discusses methodological and statistical problems identified in our comprehensive review of the
literature on the surgical treatment of sleep apnea. A companion paper based on the same literature review will be
published in a forthcoming issue of Sleep. That paper will discuss the efficacy of surgical approaches to the treatment
of sleep apnea.
Summary: A comprehensive review of the literature on the surgical treatment of sleep apnea found 37 appropriate papers (total n = 992) on uvulopalatopharyngoplasty (UPPP). Methodological and statistical problems in these papers included the following: 1) There were no randomized studies and few (n = 4) with control groups. 2) Median sample size was only 21.5; thus statistical power was low and clinically important associations were routinely classified as "not statistically significant". 3) Only one paper presented the confidence bounds that might distinguish between statistical and clinical significance. 4) Because of short follow-up time and infrequent repeat follow-ups, little is known about whether UPPP results deteriorate with time. 5) In at least 15 papers, bias caused by retrospective designs and nonrandom loss to follow-up raised questions about the generalizability of results. 6) Few papers associated polysomnographic data with patient-based quality of life measures. 7) Missing data and missing and inconsistent definitions were common. 8) Baseline measures were often biased because the same assessment was inappropriately but routinely used for both screening and baseline. We conclude that because of these and other problems, there is much that is needlessly unknown about UPPP. It is the responsibility of the research and professional communities to define training, editorial and review procedures that will raise the methodological and statistical quality of published research. Key Words: Methodological problems-Control groups-Sample size-Confidence bounds-Bias-Sleep apnea-Uvulopalatopharyngoplasty.
There is a broad perception among clinical researchers that there are serious problems with the methodological and statistical components of much of the
medical literature. Commonly reported problems include inadequate sample size (1,2), not distinguishing
between statistical and clinical significance (3,4), inappropriate statistical methods (5,6), lack of randomization or absent or inappropriate controls (5) and inadequate or biased follow-up (7). Deficiencies of this
type can have a variety of negative consequences with
respect to the conclusions of published research. Thus,
when sample sizes are too small, a conclusion such as
"no difference between groups" can in reality mean
that there was not enough statistical power to determine if a difference existed. When control groups are
either not present or are inappropriate, it may be impossible to determine the cause of any suggested improvement in patient outcome. When patients are lost
to follow-up, the followed subsample may be self-selected and biased so that generalizability is compromised.
The purpose of this paper is to discuss methodological and statistical issues such as those noted above as
they relate to the literature on the surgical treatment
of obstructive sleep apnea (OSA) using uvulopalatopharyngoplasty (UPPP). Conclusions are based on 37
UPPP papers discussed in our comprehensive review
of the literature on all surgical approaches to the treatment of OSA (8).
METHODS
As part of a comprehensive review of the literature
on the surgical treatment of sleep apnea, a Medline
search identified all relevant articles published in English between January 1966 and December 1994. A
total of 175 reports of clinical studies were selected by
this process. From this list, we included all clinical
studies that reported on at least 10 adult patients undergoing surgery for OSA and that contained data from
both pre- and post-surgery polysomnograms (PSGs).
If two or more papers contained data on the same
patients, the one with the most complete information
was retained. Details of the selection process that produced 37 UPPP papers with 992 patients who had both
pre- and post-surgery PSG data are reported in a companion article to be published in a forthcoming issue
of Sleep (8). These 37 papers, which are the "group 1"
papers in the companion article, form the basis of the
present report.
RESULTS
Our review of the UPPP literature produced the following methodological concerns:
1. Inadequate sample size and little
statistical power
The power of a statistical test is the probability that
the test will be significant when there really is a difference between groups. The type II error (often termed
the beta error) is the false negative rate. It equals 1
minus the power and is the probability of not claiming
statistically that there are between-group differences
when such differences actually exist. As we shall see,
inadequate sample size (i.e. low power and high type
II error) is a common problem, both in UPPP reports
and in the medical literature in general.
Among the 37 UPPP papers, not a single one discussed statistical power. Indeed, there was no indication in any paper that the question of how large the
sample size needed to be was even considered before
the study began. We suspect, therefore, that the sample
size in most of these reports was based on how many
patients were available, with little or no concern for
the number of patients that were actually required to
answer the questions of interest.
Large differences between groups, particularly when
combined with small standard deviations, imply small
sample size requirements. Occasionally, as few as 5 or
10 patients per group will yield adequate statistical
power. Unfortunately, sample size requirements are
far greater when between-group differences are small.
In this setting, prospective power computations are
especially important. They tell the investigator that
sample sizes are adequate or, alternatively, that patient
availability is too limited for the protocol to be a worthwhile endeavor.
Freiman et al. (1) evaluated 71 papers that reported
"negative" results and found that inadequate sample
size is a common problem. The goal of these authors
was to determine how frequently statements such as
"There was no significant difference" could be more
accurately reported as: "While we got non-significant
results, we did not have enough patients to determine
whether there was or was not a clinically important
difference (i.e., we were wasting our time insofar as the
stated endpoint is concerned)". The results reported
by Freiman et al. (1) are startling. Sixty-seven of the
71 trials (94.4%) had a sample size that would have
missed a 25% therapeutic improvement at least 10% of the time. That is, the power to detect a 25% treatment effect was almost always <90%. In 50 of the 71 trials (70.4%), a very large 50% improvement in the response rate would have been called "nonsignificant" at least 10% of the time. Although they do not present the data, it is likely that most of these trials had a <50% chance of detecting large and clinically important differences.
The problems described by Freiman et al. (1) may
be more serious in the UPPP literature. Our 37 UPPP
papers had from 9 to 90 patients with follow-up data
(mean n = 26.8, median = 21.5), numbers that are
grossly inadequate for many purposes. For example,
UPPP papers that had respiratory distress index (RDI)
data showed a combined decrease in RDI of 33 ± 61 %
in the oropharyngeal group (i.e. patients whose airway
collapse is in the retropalatal segment only) and 6.5 ±
47% in the hypopharyngeal group (i.e. patients with
retrolingual airway collapse). Assuming these are the
true population differences, detecting them statistically
would require 93 patients per group to achieve a power
of 0.9 and 70 per group for a power of 0.8. With 10
patients per group, typical in the papers we reviewed,
the power is only 0.18. With a maximum total of 90
UPPP patients, none of the reviewed papers comes
close to having sufficient sample size to provide a reasonable test of the hypothesis that the percent change
in RDI following UPPP is equal in oropharyngeal and
hypopharyngeal patients. The assertion in any one of
these papers that "no difference was found" has very
little meaning.
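As an illustration, the following sketch computes these power values from the summary statistics above. It uses a two-sided two-sample comparison under the normal approximation; this is our reconstruction, not the computation used for the figures in the text, so small discrepancies (e.g. 0.19 rather than 0.18) are expected.

    import numpy as np
    from scipy import stats

    def power_two_group(m1, s1, m2, s2, n_per_group, alpha=0.05):
        # Approximate power of a two-sided two-sample z-test
        # (normal approximation; a t-based computation differs slightly).
        se = np.sqrt((s1**2 + s2**2) / n_per_group)
        z = abs(m1 - m2) / se
        return stats.norm.cdf(z - stats.norm.ppf(1 - alpha / 2))

    # Percent decrease in RDI: 33 +/- 61 (oropharyngeal) vs. 6.5 +/- 47 (hypopharyngeal)
    for n in (10, 70, 93):
        print(n, round(power_two_group(33, 61, 6.5, 47, n), 2))
    # n = 10 -> power 0.19; n = 70 -> 0.82; n = 93 -> 0.91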
Sample size requirements for statistical tests comparing mean values depend on the ratio defined as the
difference between the means in the two groups divided
METHODOLOGICAL PROBLEMS IN UPPP RESEARCH
by the standard deviation. A simple rule of thumb is
that if this ratio is 1 (i.e. the standard deviation equals
the difference in means), about 20 subjects per group
will yield a statistical test with reasonable power (about
0.8). But if the ratio is 0.5, some 65 patients are required in each group, whereas a ratio of 0.25 means
that 250 must be enrolled in each group. The earlier
RDI example is typical of a UPPP literature in which
standard deviations for many parameters are far greater than mean differences. When this occurs, the ratio
defined above is < 1, and sample sizes of 20 or 30 may
yield a power of only 0.2 or 0.3. This means that nonsignificant results will be commonly reported when
between-group differences are both clinically important and substantial. That is, in most of the papers we
reviewed, assertions such as "no significant difference
was found" must be interpreted with great caution at
best.
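The rule of thumb can be checked directly. A minimal sketch, assuming the usual normal-approximation sample size formula with two-sided alpha = 0.05 and power = 0.8, gives per-group requirements close to the figures quoted above:

    from scipy import stats

    def n_per_group(ratio, power=0.8, alpha=0.05):
        # ratio = (difference in means) / (standard deviation)
        z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
        return 2 * (z / ratio) ** 2

    for r in (1.0, 0.5, 0.25):
        print(r, round(n_per_group(r)))
    # ratio 1.0 -> ~16 per group, 0.5 -> ~63, 0.25 -> ~251, in line with
    # the rule-of-thumb values of about 20, 65 and 250 quoted above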
Summary
Sample size requirements and statistical power
should be considered before research projects begin.
When sample sizes are small, the reader should be
skeptical about negative results. Statements that "no
difference was found" will often be a polite way of
saying "We have no idea whether the two groups are
substantially different. We didn't give the question a
fair test because of limited sample size". This is the
reality in much of the UPPP literature.
2. Confidence bounds: the link between
statistical and clinical significance
Although all report p-values, only one of the 37 UPPP
papers presented confidence bounds. p-values provide
information about the statistical significance of associations between variables or differences between
groups. But they are uninformative as to the magnitude
of potential difference and say nothing about the association between statistical and clinical significance.
Because small and clinically irrelevant differences may
be statistically significant in large studies, whereas large
and clinically important differences may be nonsignificant in small studies, p-values in isolation provide an
incomplete message.
Confidence bounds yield statements like "I am 95%
confident that the magnitude of the between-group difference in the mean of some parameter (or the success
rates of some treatment) lies between two numbers,
say a and b". Confidence bounds go beyond black and
white statements about statistical significance. They
quantify how large differences might actually be and
evaluate the clinical importance of potential differences.
661
As an example, we define success as a 50% decrease
in apnea index (AI) and use one of the reviewed UPPP
papers (9) to evaluate the association between UPPP
success and baseline AI. In this paper (n = 30), the
success rate was 46.7% and baseline AI was 42.6 ±
22.0 in successes and 69.6 ± 26.6 in failures. The
associated p-value of 0.006 suggests that baseline AI
is lower in patients who respond to UPPP. The 95%
confidence bounds on the potential difference between
the baseline AI of successes and failures range from
8.6 to 45.4 events/hour. Thus, results are significant,
but potential differences range widely from 9 to 45.
That is, this statistically significant result might reflect
anything from a small clinically irrelevant difference
to a large and clinically important difference. The sample size is too small to permit greater precision.
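These bounds can be reproduced from the published summary statistics. The sketch below assumes a pooled two-sample t interval and assumes the split was 14 successes and 16 failures (46.7% of n = 30); both are our reconstruction rather than a documented feature of the cited paper (9).

    import numpy as np
    from scipy import stats

    n1, m1, s1 = 14, 42.6, 22.0   # baseline AI in successes (assumed n)
    n2, m2, s2 = 16, 69.6, 26.6   # baseline AI in failures (assumed n)

    diff = m2 - m1
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # pooled SD
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
    print(f"difference = {diff:.1f}, 95% CI = ({diff - t_crit * se:.1f}, "
          f"{diff + t_crit * se:.1f})")
    # -> difference = 27.0, 95% CI = (8.6, 45.4), matching the bounds above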
It is worth noting that 10 of the 37 UPPP papers
(total n = 241, success rate = 60.2%) had sufficient
information to permit a meta-analysis focused on the
association between UPPP success and baseline AI.
Using the DerSimonian and Laird approach to meta-analysis (10), we find that pooling these 241 patients
results in baseline AI being smaller in the successfully
treated patients by a nonsignificant (p = 0.664) 2.3 ±
5.3 events/hour. The 95% confidence bounds range
from -8.1 to 12.8. Without the confidence bounds,
we can say only that we were unable to detect a statistically significant difference. But the narrow confidence bounds that result from the increased sample
size in the meta-analysis permit us to conclude also that if there is a difference between groups, it is almost certainly small in magnitude and clinically unimportant.
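For readers unfamiliar with the method, the following is a minimal sketch of the DerSimonian and Laird random-effects estimator (10); the per-study effects and variances extracted from the 10 papers are not reproduced here.

    import numpy as np

    def dersimonian_laird(effects, variances):
        # effects: per-study mean differences; variances: their sampling variances
        y = np.asarray(effects, dtype=float)
        v = np.asarray(variances, dtype=float)
        w = 1.0 / v                                  # fixed-effect weights
        y_fixed = np.sum(w * y) / np.sum(w)
        q = np.sum(w * (y - y_fixed) ** 2)           # Cochran's Q statistic
        tau2 = max(0.0, (q - (len(y) - 1)) /
                   (np.sum(w) - np.sum(w**2) / np.sum(w)))  # between-study variance
        w_star = 1.0 / (v + tau2)                    # random-effects weights
        pooled = np.sum(w_star * y) / np.sum(w_star)
        se = np.sqrt(1.0 / np.sum(w_star))
        return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)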
Summary
p-values are only part of the story. They tell the
reader whether differences exist. But they say nothing
about the magnitude of the difference and are uninformative as to the relationship between statistical significance and clinical importance. Confidence bounds
can be the missing link; they should be a routine component of many reports of clinical research. Only one of the 37 UPPP papers we reviewed reported confidence bounds.
3. Uncontrolled studies
In their book on clinical trials, the key point about
study design made by Friedman et al. (11) is that "Sound
scientific clinical investigation almost always demands
that a control group be used against which the new
intervention can be compared. Randomization is the
preferred way of assigning subjects to control and intervention groups". Although this reflects the broadly
held view of the scientific community, our review of
the UPPP literature revealed only four reports that
used control groups to compare one treatment with
another. We found no randomized clinical trials and
only one nonrandomized study that compared two versions of UPPP (12).
The reasons for control groups and for randomization are clear. Without a control, there is no way to
know if therapeutic outcomes reflect the tested treatment or some other factor. The suggested "superiority"
of a given therapy in an uncontrolled study may reflect
the night-to-night variability of PSG testing, concurrent treatments and behaviors, more experienced or
more skilled surgeons or patient characteristics that
happen to favor success in the study sample. For example, in our 37 UPPP papers, the success rate among 111 patients with oropharyngeal collapse only was in the 70-80% range, as compared to some 20% in 57
patients who also had hypopharyngeal collapse. Clearly, the importance of this single factor in determining
success far outweighs the likely impact of any modified
approach to UPPP. An assertion that one UPPP procedure is better than another because the results of
uncontrolled studies seem better may be largely meaningless because of the enormous impact of the location
of collapse. But in a randomized study, between-group
balance with respect to this key variable is likely. Moreover, if there is a control, statistical adjustment can
often address the between-group imbalances that
sometimes occur by chance.
Although it is often difficult to conduct surgical trials
with properly defined control groups, there are many
settings in which randomization is both ethical and
practical. For example, modifications of the original
UPPP procedure developed by Fujita et al. (13) have
been discussed by many authors (12,14-16). Because
there is almost no literature that formally compares
these procedures, there is little scientific data with which
to decide that one approach is best. Thus, randomization is ethical. Because experience indicates that many
patients are eager to participate in a scientific investigation when their physician indicates honest uncertainty as to which treatment is better, and when the
best possible care is offered as part of a clinical trial,
randomization is practical. Finally, because the literature is filled with inferior or useless treatments that
became standard (sometimes for decades) because of
nonrandomized and uncontrolled clinical studies, it
might be argued that randomization is an ethical obligation when therapies have as many associated uncertainties as does UPPP. Among the prominent surgical procedures that are documented as having seen
extensive inappropriate use are routine radical mastectomy in breast cancer patients (17), tonsillectomy
in children with chronic sore throat (18), internal
mammary artery ligation in angina patients (19) and
routine back surgery for patients with disk problems
(20).
Summary
Randomized controlled trials are the gold standard
of clinical research. Without a control group, there is
no way to know that results reflect the relative benefit
of the study treatment. Control groups are almost never
used in the UPPP literature. They should be standard
when efficacy is being evaluated.
4. Inadequate follow-up
Among the 37 UPPP papers, 7 (18.9%) provided no information about the length of follow-up, 25 (67.5%)
provided mean follow-up times or follow-up ranges
from which means could be accurately derived, and 5
(13.5%) provided minimum follow-up times only.
Three of the 25 papers with mean follow-up data contained both a short-term (at most 6 months) and a long-term (>12 months) follow-up on each patient (21-23).
A fourth paper had a long-term follow-up in progress
(24), with the results of that follow-up being presented
in a paper excluded from our review (25) because it
reported on the same patients that were included in
the report (24) it was extending. Two other papers had
a single follow-up whose mean time was > 1 year,
whereas mean follow-up time was >6 and at most 12
months in three papers. Using the shorter time in the
papers with both short- and long-term follow-ups,
overall mean time was 4.9 months in the 25 papers
that provided the necessary data. Average minimum
follow-up in the five papers presenting no other follow-up information was 1.7 months.
As important as missing follow-up information are
short follow-up times and the lack of repeat follow-up.
Without studies that contain both short- and long-term
follow-up data, there is no way to know if the benefits
of UPPP are maintained over time. Thus, despite suggestions by some authors that the deterioration of UPPP
results after surgery is uncommon (26), the fact is that
we know very little about the long-term impact of
UPPP. Two of the four identified reports (21,22) that
provide both short- and long-term follow-up data suggest there is deterioration in PSG results a year or two
after surgery. A third paper (25) suggests the opposite,
while a fourth paper (23) found little improvement at
either the early or the late assessment. Given that the
limited amount of existing data results from small
studies with conflicting results, it is clear that the ques-
METHODOLOGICAL PROBLEMS IN UPPP RESEARCH
tion of whether UPPP deteriorates with time remains
open.
Summary
Follow-up duration in most UPPP papers is brief,
and few papers present both short- and long-term follow-up data. As a result, we do not know if UPPP
results deteriorate with time. Research that includes
both long-term and repeat follow-up is essential.
5. Results with uncertain generalizability

When a clinical report reflects data from patients who may not be typical of some defined larger population, it may be difficult to assess the generalizability of results. This problem arises 1) in prospective studies when outcome data are missing for nonrandom reasons related to the outcomes themselves and 2) in retrospective studies that require follow-up information as an entry criterion, so that patients with missing follow-up are excluded at the outset. These problems are widespread in the UPPP literature.

A fundamental goal of prospective clinical research is keeping enrolled subjects on study until follow-up is complete. When follow-up data are missing, an unavoidable danger is that the data are missing for reasons directly related to the probability of a successful outcome. When this happens, patients remaining on study do not reflect the true impact of therapy in the larger population. With comments like "There is some inherent bias in our study in that those patients with a less satisfactory outcome may be more apt to undergo a post-operative evaluation" (27) and "Many patients have reported a complete lack of interest in a postoperative study because they felt so much better after surgery" (28), several authors have acknowledged this potential source of bias.

In retrospective database studies that require follow-up as an entry criterion, bias can result when patients with missing follow-up are excluded at the outset. Among the 37 UPPP papers, 12 were definitely retrospective, 11 were definitely prospective, 5 were probably retrospective, 5 were probably prospective and 4 were indeterminate. Thus, the status of 14 papers (37.8%) as prospective or retrospective was inconclusive, and about half of the 37 papers were retrospective. Because of missing follow-up information, 15 papers were categorized as "definitely biased", 12 were definitely not biased (insofar as follow-up patterns are concerned), 3 were probably biased, 3 were probably not biased and 4 were indeterminate. In the biased papers, results are based on an unrepresentative sample, so that it may be impossible to define a population to which those results can be generalized.

Summary

Biased, unrepresentative samples are common in the UPPP literature. To avoid this problem, entry and exclusion criteria should be carefully defined and rigorously followed. Prospective designs, with a determined effort to ensure complete follow-up, should be standard.

6. Quality of life

Polysomnography results provide the traditional objective measures of OSA severity. However, it is frequently noted that the results of objective tests correlate poorly with symptoms, functional impairments and the emotional consequences of disease (27). This discrepancy between objective and subjective measures has been noted in other disorders (29). Because of the scientific preference for "hard" data over "soft" data, physicians often disregard subjective patient-based reports of disease severity as unreliable and unworthy of scientific examination. Although objective measures of severity are essential, the frequent exclusion of patient-based data has diminished the clinical relevance of studies of prognosis and treatment effectiveness.

Disease-specific health status refers to the physical problems, functional limitations and emotional consequences of a disease. It can be measured by subjective and objective tests. Although some investigators view quality of life as a reflection of the patient's perception of and reaction to his or her health status (30), there is no universally accepted definition. Nevertheless, subjective reports of OSA-specific symptom severity have existed for some time (31-33), and a disease-specific health-related quality-of-life measure was recently developed by Piccirillo et al. (34).

Our literature review identified many articles that reported on patient symptoms and patient reports of outcomes. However, no paper utilized validated measures of disease-specific health status in reporting disease severity and patient outcomes. Subjective reports of health status and health-related quality of life should be standard in clinical research. Such measures would provide a more complete description of OSA while yielding a better understanding of the association between subjective and objective measures. As one author stated, "Measures in addition to PSG, including patient subjective response, would more fully characterize the outcome of revision of the upper airway for sleep apnea" (35).

Summary

Disease-specific health status and health-related quality-of-life measures describe important aspects of illness and supplement traditional objective measures of disease severity. They have not been used in the UPPP literature. They should be routine.
7. Multiple endpoints
Well-designed protocols prospectively define a small
number of primary endpoints (ideally one or two). One
reason is to help focus the research. But equally important are the statistical uncertainties that are associated with multiple endpoints. When a single statistical test is performed at the 0.05 level of significance,
there is a 5% chance of claiming significance when there
is no difference between groups. This is the type I error,
the probability of wrongly claiming significance. When
more than one statistical test is performed, the laws of
chance mean that the probability of wrongly claiming significance at least once exceeds 5%. With five independent tests, that probability is about 23%. With 14 tests it exceeds 50%. Translated, when one reads a
paper in which large numbers of statistical tests have
been performed, it is likely that some significant results
have occurred by chance.
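The arithmetic is straightforward: with k independent tests each performed at the 0.05 level, the probability of at least one false-positive result is 1 - 0.95^k, as the following sketch shows (28 is the maximum number of endpoints we observed, noted below).

    # Familywise false-positive probability for k independent tests at alpha = 0.05
    alpha = 0.05
    for k in (1, 5, 14, 28):
        print(k, round(1 - (1 - alpha) ** k, 3))
    # 1 -> 0.05, 5 -> 0.226, 14 -> 0.512, 28 -> 0.762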
In the 37 UPPP papers, the mean number of endpoints was 6.8, about the same as has been reported
elsewhere (36). Six papers reported 1 or 2 endpoints,
14 had 3-5 endpoints, 9 papers had 6-9 endpoints and
8 papers had ≥10 endpoints. The maximum number
was 28. These numbers indicate that in many of the
papers, it is all but certain that some significant relationships reflect the laws of chance and nothing more.
And we emphasize that the above tabulation does not include endpoints that were evaluated but not reported
or the many other statistical tests that were reported
but that did not involve pre- to post-surgery changes
in endpoints.
Scientists are curious by nature, and it would be
absurd to suggest that data be ignored. Instead, we
recommend that the researcher: 1) prospectively define
one or two endpoints to stand out separately as the
only predefined primary endpoints, explicitly categorize other analyses as "secondary" or "exploratory"
and state clearly that secondary results must be interpreted cautiously if there are a large number of secondary hypothesis tests and 2) report precise p-values
rather than simply "p < 0.05" so that confidence in
conclusions can be better assessed.
Summary
Multiple endpoints reduce confidence in results and
are present in most UPPP papers. High-quality research requires that a small number of primary endpoints be prospectively defined. Other analyses should
be interpreted cautiously because chance significance
may be common.
8. Missing data and missing or inconsistent
definitions
We have already noted that 12 of 37 UPPP papers
(32.4%) did not provide mean follow-up data and 14
papers (37.8%) did not provide sufficient information
to determine conclusively whether the study was prospective or retrospective. There were seven other papers in which the status could be determined from the
context, but there was no clear statement as to whether
the research was or was not prospective. Additional
examples of inappropriately missing information include 12 papers with missing age on followed patients,
15 papers where the gender distribution among followed patients could not be determined and 22 papers
in which the definition of OSA was not presented even
though the presence of sleep apnea was a stated entry
requirement. Among the papers that did define OSA,
there were at least five definitions (AI > 4, AI > 5,
AI > 10, at least 30 episodes of apnea during the
recording period and RDI > 5). There was even greater
variety in the definition of a hypopnea. Moreover,
among the 22 papers in which hypopneas were discussed as an outcome measure, only 10 (45.4%) included a definition.
Summary
Missing information and inconsistent definitions are
common in the UPPP literature. This compromises
between-paper comparisons and is particularly problematic when there are no concurrent controls. Editorial policies should seek to minimize these problems.
9. Biased baseline data
Entry into many clinical trials requires that subjects
achieve some stated minimum value for a parameter
that will serve as an outcome measure or a covariate.
In such a setting, standard research practice requires
that two separate pretreatment assessments be performed on each subject: a screening assessment that
determines eligibility with respect to the parameter and
an entirely distinct assessment that determines the
baseline value of the parameter.
To see why this is necessary, one may consider a
study with a minimum acceptable AI of 10. Because
of night-to-night variability (37,38), some subjects
whose screening AI is > 10 will have true indices of
only 8 or 9. For these subjects, the screening AI will
be a biased overestimate of their true AI. But there
will be no one to balance out this bias in the other
direction. Indeed, a subject whose true AI is > 10 but
whose screening value happens to be 8 or 9 will have
been excluded from the trial. The net result is that the
bias is unidirectional, and if the screening AI is also
used as the baseline AI, baseline values will be biased
overestimates of the true apnea index of enrolled subjects. In this setting, post-treatment values will tend to
be lower than pretreatment values even when the treatment has no effect. That is, there will be a biased overestimate of treatment effectiveness.
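A simple simulation illustrates the mechanism. The numbers used here (true AI centered near the cutoff of 10, nightly measurement noise with a standard deviation of 4 events/hour) are purely hypothetical, since, as noted below, the true night-to-night variability of AI is not well characterized.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical population: true AI near the entry cutoff of 10,
    # measured with nightly noise of SD = 4 events/hour (assumed values).
    true_ai = rng.normal(10.0, 3.0, size=100_000)
    screening_ai = true_ai + rng.normal(0.0, 4.0, size=true_ai.size)

    enrolled = screening_ai > 10.0   # the screening PSG doubles as the entry test
    bias = screening_ai[enrolled].mean() - true_ai[enrolled].mean()
    print(f"screening AI overstates true AI by {bias:.1f} events/hour on average")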
The magnitude of the bias noted above is a function
of the variability of the parameters of interest. But
because there is little precise information about the
night-to-night variability of AI and RDI, we cannot
quantify this bias. However, the Multiple Risk Factor
Intervention Trial (MRFIT) (39) demonstrates that the
potential bias is large. This primary-prevention cardiovascular trial randomized nearly 13,000 subjects
whose high risk status for eligibility was partly determined by blood pressure. According to accepted standards, MRFIT used separate screening and baseline
assessments. The result was a mean screening diastolic
blood pressure (DBP) of 99 mm Hg as compared to a
baseline value of only 91 mm Hg. Had the screening
value served also as the baseline value, the result would
have been that the effect of the intervention on DBP
would have been overestimated by 8 mm Hg, and the
true role of DBP in MRFIT would have been obscured.
In closing, we note that all 37 UPPP papers had
minimum acceptable values for key outcome parameters. None of them indicated the use of separate
screening and baseline assessments. Thus, we believe
that published UPPP success rate data may overstate
the true impact of surgery.
Summary
When eligibility requirements include minimum
values for outcome parameters, biased overestimates
of treatment effectiveness will result if the screening
value is also used as the baseline value. This bias was
present in every one of the UPPP papers. The magnitude of the bias cannot be estimated because there
is little information about the night-to-night variability
of PSG data.
Conclusions and recommendations
The UPPP literature contains methodological and
statistical problems that include inadequate sample size,
the absence of confidence bounds, short and missing
follow-up, the infrequent use of controls, results with
uncertain generalizability, too many retrospective
studies, too many endpoints, little discussion of quality
of life, missing data and inconsistent definitions of key variables. The catalogue of information that is unknown because of these problems or because key questions have not been asked is needlessly large. The list
includes information about the degree to which many
baseline factors predict UPPP outcome; whether and
to what extent UPPP results deteriorate with time; the
relative benefits of UPPP modifications; the impact of
concomitant surgeries and comorbidities on UPPP
success; the association between health status, quality
of life, and PSG outcomes; and a variety of questions about the impact of UPPP on long-term morbidity and mortality.
The following steps might yield a substantial improvement in research quality and in the information
it provides about the treatment of patients with sleep
apnea. These recommendations are based on the results presented herein, on our experience in teaching
biostatistics and research design to medical school faculty and fellows and on the suggestions of many authors
who have grappled with these issues (40-44).
1) The first step in the research process should be a
written protocol that has been prepared in collaboration with a statistician/methodologist. Retrospective
database and chart reviews should be a last resort
methodology. Studies with concurrent controls, with
randomization when possible, with adequate sample
size and complete follow-up, with explicit definitions,
with a small number of primary endpoints and with
both objective and subjective outcome measures should
be the goal of protocols that are intended to evaluate
a given treatment in a defined set of patients.
2) Journal editors and representatives of professional and scientific societies should meet jointly to
establish research standards. These should include editorial expectations regarding research methodology, a
minimum set of methodological and statistical information to be included in every published manuscript
and a review process that includes methodological and
statistical expertise.
3) National meetings of relevant professional and
scientific societies should present short courses and
symposia on research design, statistical methodology
and protocol development.
4) The standards of research protocols sponsored
by professional societies should be evaluated and upgraded.
5) Training programs for residents and fellows
should routinely include a focus on research design.
Because a critical reading of the medical literature requires an appreciation of design issues, such programs
should not be restricted to future researchers.
REFERENCES
1. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978;299:690-4.
2. Williams HC, Seed P. Inadequate size of 'negative' clinical trials in dermatology. Br J Dermatol 1993;128:317-26.
3. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515-7.
4. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429-35.
5. Glantz SA. Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation 1980;61:1-7.
6. Altman DG. Statistics in medical journals. Stat Med 1982;1:59-71.
7. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol 1986;4:942-51.
8. Sher AE, Schechtman KB, Piccirillo JF. The efficacy of surgery for obstructive sleep apnea syndrome. A comprehensive review of the literature. Sleep (in press).
9. deBerry-Borowiecki B, Kukwa AA, Blanks RHI. Indications for palatopharyngoplasty. Arch Otolaryngol 1985;111:659-63.
10. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7:177-88.
11. Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. Boston: John Wright, PSG Inc, 1982.
12. Kimmelman CP, Levine SB, Shore ET, Millman RP. Uvulopalatopharyngoplasty: a comparison of two techniques. Laryngoscope 1985;95:1488-90.
13. Fujita S, Conway WA, Zorick F, Roth T. Surgical correction of anatomic abnormalities in obstructive sleep apnea syndrome: uvulopalatopharyngoplasty. Otolaryngol Head Neck Surg 1981;89:923-34.
14. O'Leary MJ, Millman RP. Technical modifications of uvulopalatopharyngoplasty: the role of the palatopharyngeus. Laryngoscope 1991;101:1332-5.
15. Ryan CF, Dickson RI, Lowe AA, Blokmanis A, Fleetham JA. Upper airway measurements predict response to uvulopalatopharyngoplasty in obstructive sleep apnea. Laryngoscope 1990;100:248-53.
16. Dickson RI, Blokmanis A. Treatment of obstructive sleep apnea by uvulopalatopharyngoplasty. Laryngoscope 1987;97:1054-9.
17. Fisher B, Bauer M, Margolese R, et al. Five-year results of a randomized clinical trial comparing total mastectomy and segmental mastectomy with or without radiation in the treatment of breast cancer. N Engl J Med 1985;312:665-73.
18. Bakwin H. Pseudodoxia pediatrica. N Engl J Med 1945;232:691-7.
19. Dimond EG, Kittle CF, Crockett JE. Comparison of internal mammary ligation and sham operation for angina pectoris. Am J Cardiol 1960;5:483-6.
20. Jensen MC, Brant-Zawadzki MN, Obuchowski N, Modic MT, Malkasian D, Ross JS. Magnetic resonance imaging of the lumbar spine in people without back pain. N Engl J Med 1994;331:69-73.
21. Larsson H, Carlsson-Nordlander B, Svanborg E. Long-time follow-up after UPPP for obstructive sleep apnea syndrome: results of sleep apnea recordings and subjective evaluation 6 months and 2 years after surgery. Acta Otolaryngol (Stockh) 1991;111:582-90.
22. Launois SH, Feroah TR, Campbell WN, et al. Site of pharyngeal narrowing predicts outcome of surgery for obstructive sleep apnea. Am Rev Respir Dis 1993;147:182-9.
23. Walker EB, Frith RW, Harding DA, Cant BR. Uvulopalatopharyngoplasty in severe idiopathic obstructive sleep apnoea syndrome. Thorax 1989;44:205-8.
24. Fujita S, Conway WA, Zorick FJ, et al. Evaluation of the effectiveness of uvulopalatopharyngoplasty. Laryngoscope 1985;95:70-4.
25. Conway W, Fujita S, Zorick F, et al. Uvulopalatopharyngoplasty: one year followup. Chest 1985;88:385-7.
26. Philip-Joet F, Rey M, Triglia JM, et al. Uvulopalatopharyngoplasty in snorers with sleep apneas: predictive value of presurgical polysomnography. Respiration 1991;58:100-5.
27. Davis JA, Fine ED, Maniglia AJ. Uvulopalatopharyngoplasty for obstructive sleep apnea in adults: clinical correlation with polysomnographic results. Ear Nose Throat J 1993;72:63-6.
28. Maisel RH, Antonelli PJ, Iber C, et al. Uvulopalatopharyngoplasty for obstructive sleep apnea: a community's experience. Laryngoscope 1992;102:604-7.
29. Barry MJ, Fowler FJ. The methodology for evaluating the subjective outcomes of treatment for benign prostatic hyperplasia. Adv Urol 1993;6:83-99.
30. Berger M. Quality of life, health status, and clinical research. Med Care 1989;27:S148-56.
31. Hoddes E, Zarcone V, Smythe H, Phillips H, Dement WC. Quantification of sleepiness: a new approach. Psychophysiology 1973;10:431-6.
32. Buysse DJ, Reynolds CF, Monk TH, Berman SR, Kupfer DJ. The Pittsburgh Sleep Quality Index: a new instrument for psychiatric practice and research. Psychiatry Res 1989;28:193-213.
33. Johns MW. Reliability and factor analysis of the Epworth sleepiness scale. Sleep 1992;15:376-81.
34. Piccirillo JF, White DA, Schechtman KB, Edwards DA, Sher AE. Development of the OSA patient oriented severity index. Fourth World Congress on Sleep Apnea, San Francisco, CA, October 4, 1994 (abstract).
35. Regestein QR, Ferber R, Johnson TS, Murawski BJ, Strome M. Relief of sleep apnea by revision of the adult upper airway. Arch Otolaryngol Head Neck Surg 1988;114:1109-13.
36. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med 1987;317:426-32.
37. Wittig RM, Romaker A, Zorick FJ, Roehrs TA, Conway WA, Roth T. Night-to-night consistency of apneas during sleep. Am Rev Respir Dis 1984;129:244-6.
38. Meyer TJ, Eveloff SE, Lewis RK, Millman RP. One negative polysomnogram does not exclude obstructive sleep apnea. Chest 1993;103:756-60.
39. Multiple Risk Factor Intervention Trial Research Group. Relationship between baseline risk factors and coronary heart disease and total mortality in the Multiple Risk Factor Intervention Trial. Prev Med 1986;15:254-73.
40. George SL. Statistics in medical journals: a survey of current policies and proposals for editors. Med Pediatr Oncol 1985;13:109-12.
41. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med 1982;306:1332-7.
42. Hokanson JA, Stiernberg CM, McCracken MS, Quinn FB Jr. The reporting of statistical techniques in otolaryngology journals. Arch Otolaryngol Head Neck Surg 1987;113:45-50.
43. Zelen M. Guidelines for publishing papers on cancer clinical trials: responsibilities of editors and authors. J Clin Oncol 1983;1:164-9.
44. Colditz GA, Emerson JD. The statistical content of published medical research: some implications for biomedical education. Med Educ 1985;19:248-55.