American Journal of Epidemiology
Volume 147, Number 9, May 1, 1998
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
Sponsored by the Society for Epidemiologic Research

ORIGINAL CONTRIBUTIONS
Invited Commentary: Re: "Multiple Comparisons and Related Issues in the
Interpretation of Epidemiologic Data"
John R. Thompson

Received for publication April 1, 1996, and accepted for publication October 1, 1996.
From the Department of Ophthalmology, Robert Kilpatrick Clinical Sciences Building, University of Leicester, Leicester, England.
Reprint requests to Dr. John R. Thompson, Department of Ophthalmology, Robert Kilpatrick Clinical Sciences Building, University of Leicester, Leicester LE2 7LX, England.
In a recent commentary, Savitz and Olshan (1) discussed a number of statistical issues relevant to the
analysis of epidemiologic data. They pointed out how
the use of hypothesis tests can have counterintuitive
consequences. Two examples of this are the use of
adjustments for multiple comparisons, implying that
the interpretation of a test depends on whether or not
other tests are conducted, and the fact that a hypothesis
test is traditionally interpreted differently, depending
on whether or not the investigator thought of the
hypothesis before the data were seen, implying that the
interpretation depends on the source of the hypothesis.
These seemingly illogical consequences led Savitz and
Olshan to suggest that a concern with multiple comparisons is unwarranted and that the state of mind of
the investigator when the hypothesis was put forward
is irrelevant. Indeed, they suggest that a concern with
such issues may lead to the "unjustified dismissal of
meaningful results or exaggerated confidence in weak
results" (1, p. 904).
The fact that traditional statistical methods have
counterintuitive consequences is not in doubt and has
been discussed many times (2-4). What is open to
debate is the way that epidemiologists should react.
There are two reasonable responses. The first notes
that traditional statistics has served medical research
well over the last 50 years and has been an important
force in improving the quality of that research. The
requirement by journals for a statistical analysis has
meant that investigators cannot ignore chance variation, and it has led to improvements in design through
a greater awareness of the importance of sample size.
Given this track record, it might reasonably be argued
that epidemiologists should stick with these tried and
well-accepted methods. However, the corollary of this
is that they must also accept the consequences of
traditional statistics, including adjustment for multiple
comparisons and a concern with the process by which
the hypothesis was generated. A second equally reasonable response is to say that the traditional statistical
approach served us well in the past but is no longer
sufficient to cope with the type of work that epidemiologists do. In particular, it is not well suited to teasing
out causal relations in complex observational studies.
If this view is taken, then a better way of handling
uncertainty in epidemiologic studies is needed. This
will almost certainly lead to some form of likelihood
or Bayesian analysis, in which the counterintuitive
consequences of the traditional methods do not arise.
The one approach that is not tenable is to continue
with traditional statistical methods but to ignore their
logical, if counterintuitive, consequences. This type of
ad hoc modification of traditional statistics has no
theoretical basis and is potentially dangerous.
In coming to a conclusion about the desirability of
adjustments for multiple comparisons and other issues
related to hypothesis testing, one is drawn into a series
of difficult but fundamental questions. It is necessary
to consider, in order,
1. Should hypothesis tests be used at all?
2. If tests are to be used, should they be based on p values?
3. If p values are used, should they be adjusted for multiple comparisons?
In brief, I will argue that in epidemiology the answers to these questions should be 1) "infrequently";
2) "no"; and 3) if you really insist on p values, "yes."
The arguments raised against multiple comparisons by
Savitz and Olshan (1) are very powerful, but they
highlight issues that would make one want either to
abandon hypothesis testing altogether or at least to
abandon the use of the p value as the basis for testing.
If one gets as far as asking the third question, then it is
only logical to accept the consequences of the previous
steps and to use adjustments for multiple comparisons.
SHOULD HYPOTHESIS TESTS BE USED AT ALL?
A glance through an epidemiology journal will confirm the popularity of hypothesis tests in our field of
application, but closer examination shows that the
tests are not all used for the same reasons. It is convenient to divide the uses into formal applications, in
which the test is used on its own to assess some
hypothesis, and informal applications, in which the
test is interpreted freely by the investigator alongside
other, often unquantified, information. Informal testing may be the basis for an entire analysis or just a
part, as when an exploratory analysis uses p values as
summary statistics. The formal use of tests as a
method of inference, that is, to inform about the parameters of a model, is much less common than their
informal use. Unfortunately, this distinction is rarely
made explicit, and informal users may seek to gain
unjustified credibility for their analysis by association
with the theoretical ideas that underlie formal tests.
When used informally, the logical foundations of p
values are of little interest, since the tests need no
more justification than does a graph of the data; it is left to observers to see in them what they can. In
these circumstances, there may be disagreement over
whether a test result is the best way to summarize the
data, but arguments over issues such as adjustment for
multiple comparisons are rather academic. The proof
of worth lies in the tests' ability to combine with other
information and help the investigator toward a better
understanding of the study.
There is now a consensus that confidence intervals
are preferable to statistical tests as a means of summarizing the results of an analysis (5, 6). They give
more information by describing a range of alternatives
rather than concentrating on the value chosen as the
null hypothesis. This argument is strongest when applied to informal analyses because the formal benefits
of confidence intervals are, in part, illusory owing to
their close association with hypothesis tests. If there
are fundamental objections to hypothesis tests, then
the same criticisms are likely to apply to confidence
intervals. In particular, if there is a dispute over
whether tests should be adjusted for multiple comparisons, then the same disagreement will arise over the
need for simultaneous confidence intervals.
Hypothesis tests are sometimes used to aid decision
making despite the fact that, in their standard form, they
do not incorporate information on the consequences of
the decision (4). In theory, it is possible to consider the
costs or utilities associated with each decision and to
modify the threshold of the test accordingly. In practice,
this can be very difficult and time consuming, so the
investigator may prefer to use the result of a hypothesis
test along with other information to arrive at a decision
by an informal, subjective process.
A commonly occurring example of a decision theory problem is whether or not to adjust for a potential
covariate in a model. In theory, the likely bias that
would result from the exclusion of the covariate could
be weighed against the loss in precision and simplicity
that would result from its unnecessary inclusion. To do
this whenever the problem arises would be very time
consuming, and so empirically derived rules are created that work well in most circumstances. Experienced investigators might calculate a p value for the
covariate term in the model but then judge it against a
threshold of 20 percent rather than against the conventional threshold of 5 percent. They will appreciate that
such procedures are only guidelines and that prior
knowledge should always overrule the results of the
test (7). Thus, a decision is arrived at by free interpretation of the hypothesis test along with other information. This process relies heavily on the experience and
common sense of the investigator.
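To make the informal rule above concrete, the following sketch (a minimal illustration only, with hypothetical data, and with ordinary least squares chosen purely for convenience; nothing here is prescribed by the discussion above) computes the p value for a covariate term and compares it with the 20 percent guideline rather than the conventional 5 percent.

```python
# A minimal sketch of the informal covariate-retention guideline; all data
# and variable names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
exposure = rng.normal(size=n)
covariate = 0.4 * exposure + rng.normal(size=n)        # a potential confounder
outcome = 0.3 * exposure + 0.2 * covariate + rng.normal(size=n)

X = sm.add_constant(np.column_stack([exposure, covariate]))
fit = sm.OLS(outcome, X).fit()

p_covariate = fit.pvalues[2]          # p value for the covariate term
keep_covariate = p_covariate < 0.20   # a guideline only; prior knowledge should overrule it
print(round(p_covariate, 3), keep_covariate)
```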
The informal use of hypothesis tests is often helpful,
but it has disadvantages:
1. The analysis relies on the experience of the investigator.
2. The analysis is not necessarily reproducible.
3. In retrospect, it can be hard to see why decisions
were made.
4. It is difficult to detect fraud.
These objections have always held in statistical analysis and make the occasional claims of objectivity
rather hollow.
The remaining use of hypothesis tests is as a formal
mechanism for learning about model parameters.
There are sometimes situations in which investigators
wish to rely on the data alone to judge between rival
theories, although such formal situations are relatively
infrequent.
IF TESTS ARE TO BE USED, SHOULD THEY BE
BASED ON p VALUES?
When hypothesis tests are to be used, a decision
must be made about whether or not to base the test on
a p value. Once again, it helps to distinguish between
the formal and informal applications. It is much harder
to come to any general conclusions if the tests are used
informally, since they then require no logical foundation but rather justify themselves by their usefulness. It
is only in a formal setting that the assumptions and
consequences of the use of p values can be critically
assessed, although if there were a wider appreciation
of the theoretical limitations of p values in formal
settings, it is likely that their informal use would be
greatly diminished.
Frequentist hypothesis tests can be traced back to
two distinct schools of thought (8, 9). One, typified by
the work of Neyman and Pearson (10), advocates an
approach based on rigid thresholds that divide the
sample space into acceptance and rejection regions,
while the second, typified by the work of Fisher (11),
advocates the calculation of p values, in some sense, to supply a measure of evidence. Current practice usually
combines these two approaches, so that investigators
may take the idea of power, with its rigid thresholds,
from Neyman and Pearson, but then quote exact p
values, as Fisher advocated. This combined approach
makes it more difficult to criticize the logical foundations of what people actually do.
The two frequentist approaches, although different, share the idea of the observed data as one realization of many potential sets of values, and this leads both of them to tests based on a model's probability distribution over the entire sample space of possible outcomes. This means that a test depends on all of the
values that might occur and not just on the actual data,
a requirement that breaks the "likelihood principle" (3,
12) and that has been criticized many times for its
ridiculous consequences. In Jeffreys' frequently
quoted words,
What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has
not predicted observable results that have not occurred (2, p. 385).
The Neyman-Pearson formulation, with its acceptance
and rejection regions, resembles the solution to a decision theory problem but in its standard form does not
take proper account of costs. On the other hand,
Fisher's approach can be criticized because the p value
is such a poor measure of evidence and because it
ignores the need for an explicit alternative hypothesis;
even the smallest p value will not give evidence
against a hypothesis if the observed data are even less
likely under every reasonable alternative. Berger and
Sellke (13) have shown that, in realistic circumstances,
the p value can greatly exaggerate the evidence against
the null. In one example, they find that a p value of
0.05 corresponds to posterior belief in the null hypothesis of at least 0.3 over a range of alternatives.
Goodman (14) has demonstrated the weakness of the p
value as a measure of evidence without turning to
Bayesian arguments by looking at the probability that
a further study would replicate a significant finding; again, p values are seen to exaggerate the strength of evidence against the null.
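The size of the discrepancy is easy to reproduce with a short calculation. The sketch below is a minimal illustration rather than Berger and Sellke's own analysis: it assumes a unit standard error, equal prior probabilities for the null and the alternative, and a normal prior for the effect under the alternative, and it shows that an observation just reaching two-sided p = 0.05 can leave the posterior probability of the null at roughly 0.35.

```python
import math

def normal_pdf(x, var):
    """Density of a normal distribution with mean zero and variance var."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

z = 1.96        # an estimate 1.96 standard errors from the null (two-sided p = 0.05)
prior_h0 = 0.5  # equal prior weight on the null and the alternative (an assumption)
tau2 = 1.0      # assumed prior variance of the effect under the alternative

like_h0 = normal_pdf(z, 1.0)          # density of the estimate if the null is true
like_h1 = normal_pdf(z, 1.0 + tau2)   # marginal density if the effect is N(0, tau2)

posterior_h0 = prior_h0 * like_h0 / (prior_h0 * like_h0 + (1.0 - prior_h0) * like_h1)
print(round(posterior_h0, 2))         # about 0.35, despite the "significant" p value
```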
No matter which formulation is adopted, traditional
tests are potentially misleading. In practice, the situation is made even worse by the misinterpretation of
tests (14, 15), such as when p values are talked of as
observed error rates or when the conditioning of results on the model selection is ignored.
Hypothesis tests have been used successfully for so
long in part because all schemes of inference lead to
similar results in the most simple situations or when
the data set is very large and in part because experienced investigators have learned not to be too constrained by the size of a p value. For instance, if the
sample size is very large, the p value may be highly
significant, even though the deviation of the population from the null hypothesis is small. Therefore, most
investigators will modify their interpretation of such a
p value accordingly (16). Traditional hypothesis tests
often work only because people have learned to use
them in their own way.
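A small numerical illustration of the large-sample point (the figures are invented): with a million subjects per group and a standard deviation of 1, a difference in means of only 0.01 is overwhelmingly "significant."

```python
import math

n = 1_000_000                        # subjects per group (hypothetical)
diff, sd = 0.01, 1.0                 # a negligibly small true difference in means
se = sd * math.sqrt(2.0 / n)         # standard error of the difference in means
z = diff / se
p = math.erfc(z / math.sqrt(2.0))    # two-sided p value for the z statistic
print(f"z = {z:.1f}, p = {p:.1e}")   # z is about 7.1 and p is about 1.5e-12
```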
IF p VALUES ARE TO BE USED, SHOULD THEY
BE ADJUSTED FOR MULTIPLE COMPARISONS?
The reasons that the p value survives are its relative
simplicity and familiarity compared with alternative
methods. If, despite their limitations, p values are to be
used, should they be adjusted for multiple comparisons? The answer will again depend on the context.
Formal inference is the easiest situation to address,
for to have gotten as far as asking about adjusting p
values implies that the decision must have been made
to adopt the frequentist line and to ignore the criticisms of conventional hypothesis tests. In that case,
adjustment is a logical consequence. Within the frequentist formulation (17), there is a basic asymmetry
between the null and the explicit or implicit alternative
hypothesis. The test will only reject the null if the
evidence against it is strong, a constraint that is enforced by insisting on a small probability for the Type
I error (rejecting the hypothesis when it is true), regardless of any resulting loss of power. If a set of tests
is performed, then it is natural to place a familywise
(experimentwise) limit on the Type I error. Ways of
applying multiple comparisons to specific types of
analysis are discussed at length by Miller (18) and by
Hochberg and Tamhane (19). Westfall and Young (20)
show how resampling can be used to estimate an
adjusted p value in an even wider range of models.
Although methods for adjustment are available for
many situations, the main problem usually lies in
defining the set of tests that are to be treated as a
family. The arbitrariness of this decision is inescapable, given the poor theoretical foundations of traditional tests.
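As one concrete, generic example of a familywise adjustment (not the specific procedures of the references above), the sketch below applies the Bonferroni and Holm adjustments to a hypothetical family of five p values.

```python
def bonferroni_adjust(pvals):
    """Multiply each p value by the family size, capping at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm_adjust(pvals):
    """Holm's step-down adjustment: less conservative, same familywise control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# A hypothetical family of five tests from the same study.
pvals = [0.003, 0.020, 0.040, 0.300, 0.700]
print([round(p, 3) for p in bonferroni_adjust(pvals)])   # [0.015, 0.1, 0.2, 1.0, 1.0]
print([round(p, 3) for p in holm_adjust(pvals)])         # [0.015, 0.08, 0.12, 0.6, 0.7]
```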
When tests are to be used informally, the situation is
somewhat different. An experienced investigator may
adjust or not as he or she pleases, for even if no
familywise adjustment is made, an informal discounting process will take place. This and the added complexity may partially explain why so many people who
advocate traditional methods do not use familywise
adjustments. Some authors have advocated reporting
large sets of unadjusted p values with a warning about the number that might be expected to be significant
by chance (21, 22). This is a classic example of the
informal use of hypothesis tests. The results may be
very useful, but the p values so calculated have little
or no real meaning.
Unfortunately, the freedom offered by an informal
approach has its own dangers and is open to abuse.
There is a risk that some will be accidentally misled by
unadjusted p values or that others will deliberately use
them to mislead. Given the desire, it is very easy to
produce "significant" results from any set of data (23).
An investigator merely subdivides the subjects by age,
sex, geographic area, etc., and then categorizes the
exposure and uses separate tests to compare different
levels of exposure within each subgroup. With a little
ingenuity, this method is almost sure to find some
significant results. Once it has been found where the differences lie, the groups can then be redefined to
preserve the differences but reduce the number of
comparisons, and reasons can usually be invented for
why those are just the effects that would have been
anticipated (24). The investigator who wants to avoid
multiple testing can, of course, perform the search by
tabulation so that only the final test actually needs to
be performed.
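The scale of the problem is easy to demonstrate by simulation. In the sketch below (the subgroup counts and sample sizes are invented), twenty subgroup comparisons are made in data that contain no real effect at all, yet close to two thirds of the simulated studies yield at least one "significant" result, in line with the theoretical 1 - 0.95^20, or about 0.64.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_subgroups, n_per_arm = 2000, 20, 50

studies_with_a_hit = 0
for _ in range(n_studies):
    significant = 0
    for _ in range(n_subgroups):
        # No true exposure effect in any subgroup: both arms come from the same distribution.
        exposed = rng.normal(size=n_per_arm)
        unexposed = rng.normal(size=n_per_arm)
        if stats.ttest_ind(exposed, unexposed).pvalue < 0.05:
            significant += 1
    studies_with_a_hit += significant > 0

print(studies_with_a_hit / n_studies)   # roughly 0.64
```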
Less dramatic, but fundamentally the same, is the
situation in which the exposures or subgroups are
defined in advance. If, for instance, a large number of
aspects of diet are compared in cases and referents, it
would be surprising if no significant results were
found. These arguments call into doubt the worth of a
p value in these circumstances, but if traditional hypothesis tests are to be used formally, then adjustment
will at least preserve their one merit, the protection
against Type I errors. If nonadjustment for multiple
comparisons became acceptable or if investigators
were free to dredge through their data before deciding
what hypotheses to test, then published p values would
be of even less worth than they are now. It would be
a license to publish coincidences with a pseudoscientific gloss.
AVOIDING THE PROBLEMS
If p values do not measure evidence and tests based
on them lead us to counterintuitive actions, what is the
alternative? The solution is well known but is rarely
put into practice. Measures of evidence based on the
data alone must be comparative, that is, they can only
say how much more one hypothesis is supported than
is another. Under any given model, all of the evidence
in the data is summarized by the likelihood. By comparing the height of the likelihood curve at different
points, one can see how well different hypotheses are
supported by the data. To go further and make statements about specific hypotheses is possible only if we are able to specify the probabilities that we associated with those hypotheses before the data were collected. The information in the likelihood can then be
used to update that prior assessment to a posterior
probability. This is the essence of Bayesian statistics
as described in standard textbooks (25).
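Written out, for a set of mutually exclusive hypotheses H_1, ..., H_k and data X, the update from prior to posterior probability described here is the standard form of Bayes' theorem:

```latex
P(H_i \mid X) \;=\; \frac{P(X \mid H_i)\,P(H_i)}{\sum_{j=1}^{k} P(X \mid H_j)\,P(H_j)},
\qquad i = 1, \ldots, k.
```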
Some researchers object to the use of prior probabilities, in which case they must stop with the comparative information contained in the likelihood. The log-likelihood curve is frequently drawn to illustrate this
approach, but the appeal of the method in simple
examples should not hide the fact that in practice the
likelihood is often multidimensional, and there may be
no obvious way to make statements about one of the
parameters in isolation from the others (26). Without
such a reduction in dimensionality, interpretation is
difficult.
If prior probabilities are accepted, then the problem
of multidimensionality is solved, at least in theory,
because unwanted parameters can be removed by integration. Simulation methods and recent software
make this a practical proposition (27). Within the
Bayesian framework, the investigator quantifies his or
her prior beliefs in the set of specified hypotheses, H,
and the data, X, are used to update those beliefs,
producing a statement of posterior belief in the hypotheses given the data, P(H|X). Unlike the p value that is derived from P(X|H0), where H0 is the specified
null, posterior belief really is a measure that can be
relied on, and it has the properties that Savitz and
Olshan (1) desire. Indeed, within the Bayesian framework, there is no problem of multiple comparisons
because posterior belief in any one hypothesis is the
same, whether or not other hypotheses are considered.
The problem of hypothesis generation after the data
have been seen is not quite so easily dismissed because
in the Bayesian paradigm every theory requires a
statement of prior belief, and although some people
have tried to get around the problem (28), it is not
really possible to make such a statement after the data
have been seen. The best solution is to process the data
sequentially or in batches so that if a previously unsuspected hypothesis is formulated from inspection of
the early data, that experience can be used as the basis
for the quantification of one's prior opinion, and then
the remainder of the data are available to update those
beliefs.
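A minimal sketch of this batch strategy, assuming a simple binomial outcome with a Beta prior (the counts and the choice of model are mine, purely for illustration):

```python
# Early batch: inspected freely; it suggests the new hypothesis and is used only
# to quantify the prior (a flat Beta(1, 1) updated by the early counts).
early_cases, early_n = 12, 40
a, b = 1 + early_cases, 1 + (early_n - early_cases)

# Remaining data: reserved for the genuine update of that prior.
later_cases, later_n = 55, 160
a, b = a + later_cases, b + (later_n - later_cases)

posterior_mean = a / (a + b)    # posterior mean of the event probability
print(f"posterior is Beta({a}, {b}); mean = {posterior_mean:.3f}")
```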
Problems of repeated testing come in many forms.
The term "multiple comparisons" usually refers to the
analysis of the same outcome with several exposures,
such as when a series of treatments are compared. A
similar situation occurs with interim analysis when the
data are analyzed part way through a study, and then,
if it is felt necessary, the study is continued and
reanalyzed later. On the other hand, multiple testing
arises when different outcomes are to be compared
within the same study; this commonly arises when
several different endpoints are measured or when preliminary tests are made to help formulate a model. The
problem of many independent measurements of the
same quantity is perhaps the simplest from a Bayesian
point of view because exchangeability can often be
used to create a relatively simple analysis (28, 29).
When the tests are not exchangeable, the investigators
will need to specify the interdependence between their
prior beliefs, and this is not easy. Although Bayesian
analysis provides an excellent and realistic theoretical
framework for data analysis, it would be wrong to
underestimate its practical problems.
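For the exchangeable case, a minimal sketch in the spirit of the empirical-Bayes adjustments of reference 29, assuming approximately normal log relative risks with known estimation variances (the numbers and the crude moment estimator of the between-effect variance are mine):

```python
import numpy as np

beta = np.array([0.90, -0.20, 0.30, 1.10, 0.00])   # estimated log relative risks (hypothetical)
v = np.array([0.10, 0.08, 0.12, 0.20, 0.09])       # their estimation variances (assumed known)

w = 1.0 / v
mu = np.average(beta, weights=w)                    # estimated common prior mean
# Crude method-of-moments estimate of the between-effect variance tau^2.
tau2 = max(0.0, np.average((beta - mu) ** 2 - v, weights=w))

shrinkage = tau2 / (tau2 + v)                       # weight kept by each individual estimate
posterior = mu + shrinkage * (beta - mu)            # estimates pulled toward the common mean
print(np.round(posterior, 2))                       # the extreme, imprecise estimates move most
```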
CONCLUSIONS
Savitz and Olshan ridicule adjustment for multiple
comparisons and a concern for the method of generation of the hypothesis, claiming that they are "irrelevant to assessing the validity of the product" (1, p.
904). In one sense, there is little to argue with in this,
but if, as they imply, they use p values and the corresponding confidence intervals to measure that validity,
then it is these measures that will cause a problem. If
p values actually measured evidence, then they would
have all the characteristics that Savitz and Olshan
desire, but, unfortunately, they do not and so their
interpretation requires great care.
The arguments of Savitz and Olshan (1) identify a
genuine problem in epidemiologic analysis. However,
I disagree with their recommendation that it is best not
to adjust for multiple comparisons or to be concerned
about the method of hypothesis generation. This is
dangerous advice that is open to abuse. Rather, I see
the points they make as highlighting the deficiencies
of p values, and I look forward to a time when epidemiologists adopt sounder methods of data analysis. In
the meantime, those researchers who cannot bring
themselves to abandon traditional statistical methods
should be advised not to attach any specific meaning
to individual unadjusted p values used informally and
to adjust all formal tests, for otherwise those tests lose
their one merit, the protection against Type I errors.
REFERENCES
1. Savitz DA, Olshan AF. Multiple comparisons and related
issues in the interpretation of epidemiologic data. Am J Epidemiol 1995;142:904-8.
2. Jeffreys H. Theory of probability. Oxford, England: Oxford
University Press, 1939.
3. Edwards AWF. Likelihood. Cambridge, England: Cambridge
University Press, 1972.
4. Barnett V. Comparative statistical inference. New York, NY:
John Wiley & Sons, 1973.
5. Rothman K. A show of confidence. N Engl J Med 1978;299:
1362-3.
6. Gardner MJ, Altman DG. Confidence intervals rather than p
values. Estimation rather than hypothesis testing. Br Med J
1986;292:746-50.
7. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J Epidemiol 1993;138:923-36.
8. Johnstone DJ. Tests of significance in theory and practice.
Statistician 1986;35:491-504.
9. Goodman SN. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.
Am J Epidemiol 1993;137:485-96.
10. Neyman J, Pearson ES. On the problem of the most efficient
tests of statistical hypotheses. Philos Trans R Soc A 1933;231:
289-337.
11. Fisher RA. Statistical methods and scientific inference. 3rd ed.
New York, NY: Hafner, 1973.
12. Barnard GA. Statistical inference (with discussion). J R Stat
Soc B 1949;11:115-49.
13. Berger JO, Sellke T. Testing a point null hypothesis: the
irreconcilability of p values and the evidence (with discussion). J Am Stat Assoc 1987;83:687-97.
14. Goodman SN. A comment on replication, p values and evidence. Stat Med 1992;11:875-9.
15. Goodman SN, Royall R. Evidence and scientific research.
Am J Public Health 1988;78:1568-74.
16. Chatfield C. Model uncertainty, data mining and statistical
inference (with discussion). J R Stat Soc A 1995;158:419-66.
17. Kendall MG, Stuart A. The advanced theory of statistics. Vol.
2, 3rd ed. London, England: Charles Griffin, 1973:190-1.
18. Miller RG. Simultaneous statistical inference. 2nd ed. New
York, NY: Springer-Verlag, 1981.
19. Hochberg Y, Tamhane AC. Multiple comparison procedures.
New York, NY: John Wiley & Sons, 1987.
20. Westfall PH, Young SS. Resampling-based multiple testing.
New York, NY: John Wiley & Sons, 1993.
21. Thomas DC, Siemiatycki J, Dewar R, et al. The problem of
multiple inference in studies designed to generate hypotheses.
Am J Epidemiol 1985;122:1080-95.
22. Walker AM. Reporting the results of epidemiological studies.
Am J Public Health 1986;76:556-8.
23. Mills JL. Data torturing. N Engl J Med 1993;329:1196-9.
24. Davey-Smith G, Phillips AN, Neaton JD. Smoking as "independent" risk factor for suicide: illustration of an artifact from observational epidemiology? Lancet 1992;340:709-11.
25. Bernardo JM, Smith AFM. Bayesian theory. New York, NY:
John Wiley & Sons, 1994.
26. Kalbfleisch JD, Sprott DA. Applications of likelihood methods to models involving large numbers of parameters. J R Stat
Soc 1970:175-94.
27. Spiegelhalter DJ, Thomas A, Best NG, et al. BUGS: Bayesian
inference using Gibbs sampling. Version 0.50. Cambridge,
England: MRC Biostatistics Unit, 1995.
28. Berry DA. Multiple comparisons, multiple tests and data
dredging: a Bayesian perspective. In: Bernardo JM, DeGroot
MH, Lindley DV, et al., eds. Bayesian statistics 3. Oxford,
England: Oxford University Press, 1988:79-94.
29. Greenland S, Robins JM. Empirical-Bayes adjustments for
multiple comparisons are sometimes useful. Epidemiology
1991;2:244-51.