American Journal of Epidemiology
Vol. 147, No. 9, May 1, 1998
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
Sponsored by the Society for Epidemiologic Research

ORIGINAL CONTRIBUTIONS

Invited Commentary: Re: "Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data"

John R. Thompson

From the Department of Ophthalmology, Robert Kilpatrick Clinical Sciences Building, University of Leicester, Leicester, England. Reprint requests to Dr. John R. Thompson, Department of Ophthalmology, Robert Kilpatrick Clinical Sciences Building, University of Leicester, Leicester LE2 7LX, England. Received for publication April 1, 1996, and accepted for publication October 1, 1996.

In a recent commentary, Savitz and Olshan (1) discussed a number of statistical issues relevant to the analysis of epidemiologic data. They pointed out how the use of hypothesis tests can have counterintuitive consequences. Two examples of this are the use of adjustments for multiple comparisons, implying that the interpretation of a test depends on whether or not other tests are conducted, and the fact that a hypothesis test is traditionally interpreted differently depending on whether or not the investigator thought of the hypothesis before the data were seen, implying that the interpretation depends on the source of the hypothesis. These seemingly illogical consequences led Savitz and Olshan to suggest that a concern with multiple comparisons is unwarranted and that the state of mind of the investigator when the hypothesis was put forward is irrelevant. Indeed, they suggest that a concern with such issues may lead to the "unjustified dismissal of meaningful results or exaggerated confidence in weak results" (1, p. 904).

The fact that traditional statistical methods have counterintuitive consequences is not in doubt and has been discussed many times (2-4). What is open to debate is the way that epidemiologists should react. There are two reasonable responses. The first notes that traditional statistics has served medical research well over the last 50 years and has been an important force in improving the quality of that research. The requirement by journals for a statistical analysis has meant that investigators cannot ignore chance variation, and it has led to improvements in design through a greater awareness of the importance of sample size. Given this track record, it might reasonably be argued that epidemiologists should stick with these tried and well-accepted methods. However, the corollary of this is that they must also accept the consequences of traditional statistics, including adjustment for multiple comparisons and a concern with the process by which the hypothesis was generated.

A second, equally reasonable response is to say that the traditional statistical approach served us well in the past but is no longer sufficient to cope with the type of work that epidemiologists do. In particular, it is not well suited to teasing out causal relations in complex observational studies. If this view is taken, then a better way of handling uncertainty in epidemiologic studies is needed. This will almost certainly lead to some form of likelihood or Bayesian analysis, in which the counterintuitive consequences of the traditional methods do not arise.

The one approach that is not tenable is to continue with traditional statistical methods but to ignore their logical, if counterintuitive, consequences. This type of ad hoc modification of traditional statistics has no theoretical basis and is potentially dangerous.

In coming to a conclusion about the desirability of adjustments for multiple comparisons and other issues related to hypothesis testing, one is drawn into a series of difficult but fundamental questions. It is necessary to consider, in order:
1. Should hypothesis tests be used at all?
2. If tests are to be used, should they be based on p values?
3. If p values are used, should they be adjusted for multiple comparisons?

In brief, I will argue that in epidemiology the answers to these questions should be 1) "infrequently"; 2) "no"; and 3) if you really insist on p values, "yes." The arguments raised against multiple comparisons by Savitz and Olshan (1) are very powerful, but they highlight issues that would make one want either to abandon hypothesis testing altogether or at least to abandon the use of the p value as the basis for testing. If one gets as far as asking the third question, then it is only logical to accept the consequences of the previous steps and to use adjustments for multiple comparisons.

SHOULD HYPOTHESIS TESTS BE USED AT ALL?

A glance through an epidemiology journal will confirm the popularity of hypothesis tests in our field of application, but closer examination shows that the tests are not all used for the same reasons. It is convenient to divide the uses into formal applications, in which the test is used on its own to assess some hypothesis, and informal applications, in which the test is interpreted freely by the investigator alongside other, often unquantified, information. Informal testing may be the basis for an entire analysis or just a part, as when an exploratory analysis uses p values as summary statistics. The formal use of tests as a method of inference, that is, to inform about the parameters of a model, is much less common than their informal use. Unfortunately, this distinction is rarely made explicit, and informal users may seek to gain unjustified credibility for their analysis by association with the theoretical ideas that underlie formal tests.

When tests are used informally, the logical foundations of p values are of little interest, since the tests need no more justification than does a graph of the data; it is left to the observer to see in them what they can. In these circumstances, there may be disagreement over whether a test result is the best way to summarize the data, but arguments over issues such as adjustment for multiple comparisons are rather academic. The proof of worth lies in the tests' ability to combine with other information and help the investigator toward a better understanding of the study.

There is now a consensus that confidence intervals are preferable to statistical tests as a means of summarizing the results of an analysis (5, 6). They give more information by describing a range of alternatives rather than concentrating on the value chosen as the null hypothesis. This argument is strongest when applied to informal analyses because the formal benefits of confidence intervals are, in part, illusory owing to their close association with hypothesis tests. If there are fundamental objections to hypothesis tests, then the same criticisms are likely to apply to confidence intervals. In particular, if there is a dispute over whether tests should be adjusted for multiple comparisons, then the same disagreement will arise over the need for simultaneous confidence intervals.
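The confidence-interval analogue of a familywise adjustment can be made concrete with a small numerical sketch. The example below is entirely hypothetical (the log rate ratios, the standard errors, and the choice of a simple Bonferroni adjustment are all assumptions made for illustration); it merely contrasts conventional per-estimate 95 percent intervals with simultaneous intervals for the same estimates.

```python
# Hypothetical illustration: individual versus simultaneous (Bonferroni-adjusted)
# 95 percent confidence intervals for several estimated log rate ratios.
import numpy as np
from scipy.stats import norm

# Invented point estimates (log rate ratios) and standard errors for k exposures.
log_rr = np.array([0.26, 0.41, -0.10, 0.55, 0.08])
se = np.array([0.15, 0.20, 0.12, 0.25, 0.18])
k = len(log_rr)

z_individual = norm.ppf(1 - 0.05 / 2)            # each interval covers with 95% probability
z_simultaneous = norm.ppf(1 - 0.05 / (2 * k))    # Bonferroni: all k intervals cover jointly

for i in range(k):
    ind = np.exp(log_rr[i] + np.array([-1, 1]) * z_individual * se[i])
    sim = np.exp(log_rr[i] + np.array([-1, 1]) * z_simultaneous * se[i])
    print(f"exposure {i + 1}: RR {np.exp(log_rr[i]):.2f}  "
          f"individual ({ind[0]:.2f}, {ind[1]:.2f})  "
          f"simultaneous ({sim[0]:.2f}, {sim[1]:.2f})")
```

The simultaneous intervals are necessarily wider; whether that widening is wanted is exactly the same question as whether a family of tests should be adjusted for multiple comparisons.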
Hypothesis tests are sometimes used to aid decision making despite the fact that, in their standard form, they do not incorporate information on the consequences of the decision (4). In theory, it is possible to consider the costs or utilities associated with each decision and to modify the threshold of the test accordingly. In practice, this can be very difficult and time consuming, so the investigator may prefer to use the result of a hypothesis test along with other information to arrive at a decision by an informal, subjective process.

A commonly occurring example of a decision theory problem is whether or not to adjust for a potential covariate in a model. In theory, the likely bias that would result from the exclusion of the covariate could be weighed against the loss in precision and simplicity that would result from its unnecessary inclusion. To do this whenever the problem arises would be very time consuming, and so empirically derived rules are created that work well in most circumstances. Experienced investigators might calculate a p value for the covariate term in the model but then judge it against a threshold of 20 percent rather than against the conventional threshold of 5 percent. They will appreciate that such procedures are only guidelines and that prior knowledge should always overrule the results of the test (7). Thus, a decision is arrived at by free interpretation of the hypothesis test along with other information. This process relies heavily on the experience and common sense of the investigator.

The informal use of hypothesis tests is often helpful, but it has disadvantages:

1. The analysis relies on the experience of the investigator.
2. The analysis is not necessarily reproducible.
3. In retrospect, it can be hard to see why decisions were made.
4. It is difficult to detect fraud.

These objections have always held in statistical analysis and make the occasional claims of objectivity rather hollow.

The remaining use of hypothesis tests is as a formal mechanism for learning about model parameters. There are sometimes situations in which investigators wish to rely on the data alone to judge between rival theories, although such formal situations are relatively infrequent.
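As a concrete companion to the informal covariate-screening guideline described earlier in this section (judging the covariate term against a liberal 20 percent threshold rather than the conventional 5 percent), the sketch below works through one possible version of such a rule on simulated data. The variable names, the simulated effect sizes, and the additional 10 percent change-in-estimate check are all assumptions made purely for illustration, not a recommended procedure.

```python
# Minimal sketch of an informal covariate-screening rule: retain a potential
# confounder if its p value is below 0.20 or if dropping it changes the
# exposure coefficient appreciably. All data are simulated; nothing here is
# taken from the commentary itself.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
covariate = rng.normal(size=n)                     # potential confounder
exposure = 0.5 * covariate + rng.normal(size=n)    # exposure related to the confounder
outcome = 0.3 * exposure + 0.2 * covariate + rng.normal(size=n)

X_full = sm.add_constant(np.column_stack([exposure, covariate]))
X_reduced = sm.add_constant(exposure)

full = sm.OLS(outcome, X_full).fit()
reduced = sm.OLS(outcome, X_reduced).fit()

p_covariate = full.pvalues[2]                                   # p value for the covariate term
change = (reduced.params[1] - full.params[1]) / full.params[1]  # shift in exposure estimate

keep = (p_covariate < 0.20) or (abs(change) > 0.10)
print(f"covariate p = {p_covariate:.3f}, change in exposure estimate = {change:+.1%}")
print("retain covariate" if keep else "drop covariate")
```

As the text stresses, any such rule is only a guideline; prior knowledge about the covariate should overrule the numerical screen.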
IF TESTS ARE TO BE USED, SHOULD THEY BE BASED ON p VALUES?

When hypothesis tests are to be used, a decision must be made about whether or not to base the test on a p value. Once again, it helps to distinguish between the formal and informal applications. It is much harder to come to any general conclusions if the tests are used informally, since they then require no logical foundation but rather justify themselves by their usefulness. It is only in a formal setting that the assumptions and consequences of the use of p values can be critically assessed, although if there were a wider appreciation of the theoretical limitations of p values in formal settings, it is likely that their informal use would be greatly diminished.

Frequentist hypothesis tests can be traced back to two distinct schools of thought (8, 9). One, typified by the work of Neyman and Pearson (10), advocates an approach based on rigid thresholds that divide the sample space into acceptance and rejection regions, while the second, typified by the work of Fisher (11), advocates the calculation of p values to supply, in some sense, a measure of evidence.

Current practice usually combines these two approaches, so that investigators may take the idea of power, with its rigid thresholds, from Neyman and Pearson, but then quote exact p values, as Fisher advocated. This combined approach makes it more difficult to criticize the logical foundations of what people actually do. The two frequentist approaches, although different, share the idea of the observed data as one realization of many potential sets of values, and this leads both to tests based on a model's probability distribution over the entire sample space of possible outcomes. This means that a test depends on all of the values that might occur and not just on the actual data, a requirement that breaks the "likelihood principle" (3, 12) and that has been criticized many times for its ridiculous consequences. In Jeffreys' frequently quoted words, "What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (2, p. 385).

The Neyman-Pearson formulation, with its acceptance and rejection regions, resembles the solution to a decision theory problem but in its standard form does not take proper account of costs. On the other hand, Fisher's approach can be criticized because the p value is such a poor measure of evidence and because it ignores the need for an explicit alternative hypothesis; even the smallest p value will not give evidence against a hypothesis if the observed data are even less likely under every reasonable alternative. Berger and Sellke (13) have shown that, in realistic circumstances, the p value can greatly exaggerate the evidence against the null. In one example, they find that a p value of 0.05 corresponds to posterior belief in the null hypothesis of at least 0.3 over a range of alternatives. Goodman (14) has demonstrated the weakness of the p value as a measure of evidence without turning to Bayesian arguments by looking at the probability that a further study would replicate a significant finding; again, p values are seen to exaggerate the strength of evidence against the null.

No matter which formulation is adopted, traditional tests are potentially misleading. In practice, the situation is made even worse by the misinterpretation of tests (14, 15), such as when p values are talked of as observed error rates or when the conditioning of results on the model selection is ignored. Hypothesis tests have been used successfully for so long in part because all schemes of inference lead to similar results in the most simple situations or when the data set is very large, and in part because experienced investigators have learned not to be too constrained by the size of a p value. For instance, if the sample size is very large, the p value may be highly significant even though the deviation of the population from the null hypothesis is small. Therefore, most investigators will modify their interpretation of such a p value accordingly (16). Traditional hypothesis tests often work only because people have learned to use them in their own way.
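The degree to which a p value can overstate the evidence against the null is easy to see with a rough calibration. The sketch below uses a well-known lower bound on the Bayes factor in favour of the null, B >= -e * p * ln(p) for p < 1/e; this bound is offered only as an illustrative approximation in the same spirit as, not a reproduction of, the Berger and Sellke calculation cited above.

```python
# Rough calibration (an illustrative approximation, not the exact calculation
# of Berger and Sellke): convert a p value into a lower bound on the posterior
# probability of the null, using the bound B >= -e * p * ln(p) on the Bayes
# factor in favour of the null (valid for p < 1/e) and equal prior odds.
import math

for p in (0.05, 0.01, 0.001):
    bayes_factor_bound = -math.e * p * math.log(p)       # B(p) >= -e p ln p
    posterior_null = bayes_factor_bound / (1 + bayes_factor_bound)
    print(f"p = {p:<6} minimum P(H0 | data) is roughly {posterior_null:.2f}")
```

Under this calibration a p value of 0.05 is compatible with a posterior probability of the null of about 0.3, the order of magnitude quoted in the text, and even p = 0.01 leaves the null with more than a 10 percent posterior probability.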
IF p VALUES ARE TO BE USED, SHOULD THEY BE ADJUSTED FOR MULTIPLE COMPARISONS?

The reasons that the p value survives are its relative simplicity and familiarity compared with alternative methods. If, despite their limitations, p values are to be used, should they be adjusted for multiple comparisons? The answer will again depend on the context.

Formal inference is the easiest situation to address, for to have gotten as far as asking about adjusting p values implies that the decision must have been made to adopt the frequentist line and to ignore the criticisms of conventional hypothesis tests. In that case, adjustment is a logical consequence. Within the frequentist formulation (17), there is a basic asymmetry between the null and the explicit or implicit alternative hypothesis. The test will only reject the null if the evidence against it is strong, a constraint that is enforced by insisting on a small probability for the Type I error (rejecting the null hypothesis when it is true), regardless of any resulting loss of power. If a set of tests is performed, then it is natural to place a familywise (experimentwise) limit on the Type I error. Ways of applying multiple comparisons to specific types of analysis are discussed at length by Miller (18) and by Hochberg and Tamhane (19). Westfall and Young (20) show how resampling can be used to estimate an adjusted p value in an even wider range of models. Although methods for adjustment are available for many situations, the main problem usually lies in defining the set of tests that are to be treated as a family. The arbitrariness of this decision is inescapable, given the poor theoretical foundations of traditional tests.

When tests are to be used informally, the situation is somewhat different. An experienced investigator may adjust or not as he or she pleases, for even if no familywise adjustment is made, an informal discounting process will take place. This and the added complexity may partially explain why so many people who advocate traditional methods do not use familywise adjustments. Some authors have advocated reporting large sets of unadjusted p values with a warning about the number that might be expected to be significant by chance (21, 22). This is a classic example of the informal use of hypothesis tests. The results may be very useful, but the p values so calculated have little or no real meaning.

Unfortunately, the freedom offered by an informal approach has its own dangers and is open to abuse. There is a risk that some will be accidentally misled by unadjusted p values or that others will deliberately use them to mislead. Given the desire, it is very easy to produce "significant" results from any set of data (23). An investigator merely subdivides the subjects by age, sex, geographic area, etc., and then categorizes the exposure and uses separate tests to compare different levels of exposure within each subgroup. With a little ingenuity, this method is almost sure to find some significant results. When the place where the differences lie is found, the groups can then be redefined to preserve the differences but reduce the number of comparisons, and reasons can usually be invented for why those are just the effects that would have been anticipated (24). The investigator who wants to avoid multiple testing can, of course, perform the search by tabulation so that only the final test actually needs to be performed. Less dramatic, but fundamentally the same, is the situation in which the exposures or subgroups are defined in advance. If, for instance, a large number of aspects of diet are compared in cases and referents, it would be surprising if no significant results were found.
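The arithmetic behind that warning is straightforward. The simulation below is a hypothetical sketch: twenty subgroup comparisons are generated with no true differences at all, and the code counts how often at least one unadjusted test reaches p < 0.05 and how often the same happens after a simple Bonferroni familywise limit. It illustrates the general point only; it is not one of the specific procedures of references 18-20.

```python
# Simulation sketch: with 20 truly null comparisons, how often does at least one
# unadjusted test reach p < 0.05, and what does a Bonferroni familywise limit do?
# All data are simulated under the null; nothing comes from a real study.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sim, n_tests, n_per_group = 2000, 20, 50

any_unadjusted = 0
any_bonferroni = 0
for _ in range(n_sim):
    pvals = np.array([
        ttest_ind(rng.normal(size=n_per_group), rng.normal(size=n_per_group)).pvalue
        for _ in range(n_tests)
    ])
    any_unadjusted += pvals.min() < 0.05
    any_bonferroni += pvals.min() < 0.05 / n_tests   # familywise 5 percent limit

print(f"expected by theory: 1 - 0.95**20 = {1 - 0.95**20:.2f}")
print(f"simulated, unadjusted:          {any_unadjusted / n_sim:.2f}")
print(f"simulated, Bonferroni-adjusted: {any_bonferroni / n_sim:.2f}")
```

Roughly two such studies in three would report at least one "significant" subgroup finding by chance alone, which is precisely the license to publish coincidences discussed below.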
These arguments call into doubt the worth of a p value in these circumstances, but if traditional hypothesis tests are to be used formally, then adjustment will at least preserve their one merit, the protection against Type I errors. If nonadjustment for multiple comparisons became acceptable, or if investigators were free to dredge through their data before deciding what hypotheses to test, then published p values would be of even less worth than they are now. It would be a license to publish coincidences with a pseudoscientific gloss.

AVOIDING THE PROBLEMS

If p values do not measure evidence and tests based on them lead us to counterintuitive actions, what is the alternative? The solution is well known but is rarely put into practice. Measures of evidence based on the data alone must be comparative, that is, they can only say how much more one hypothesis is supported than is another. Under any given model, all of the evidence in the data is summarized by the likelihood. By comparing the height of the likelihood curve at different points, one can see how well different hypotheses are supported by the data. To go further and make statements about specific hypotheses is possible only if we are able to specify the probability that we associated with the hypotheses before the data were collected. The information in the likelihood can then be used to update that prior assessment to a posterior probability. This is the essence of Bayesian statistics as described in standard textbooks (25).

Some researchers object to the use of prior probability, in which case they must stop with the comparative information contained in the likelihood. The log-likelihood curve is frequently drawn to illustrate this approach, but the appeal of the method in simple examples should not hide the fact that in practice the likelihood is often multidimensional, and there may be no obvious way to make statements about one of the parameters in isolation from the others (26). Without such a reduction in dimensionality, interpretation is difficult. If prior probabilities are accepted, then the problem of multidimensionality is solved, at least in theory, because unwanted parameters can be removed by integration. Simulation methods and recent software make this a practical proposition (27).

Within the Bayesian framework, the investigator quantifies his or her prior beliefs in the set of specified hypotheses, H, and the data, X, are used to update those beliefs, producing a statement of posterior belief in the hypotheses given the data, P(H|X). Unlike the p value, which is derived from P(X|H0), where H0 is the specified null, posterior belief really is a measure that can be relied on, and it has the properties that Savitz and Olshan (1) desire. Indeed, within the Bayesian framework, there is no problem of multiple comparisons, because posterior belief in any one hypothesis is the same whether or not other hypotheses are considered.
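The contrast between P(X|H0) and P(H|X) can be made concrete with a small worked example. Everything in the sketch below is invented for illustration: a point null and a single point alternative for an exposure proportion, equal prior weights, and hypothetical counts; it is not a general prescription for Bayesian testing.

```python
# Invented example contrasting the p value, P(data at least this extreme | H0),
# with a Bayesian posterior, P(H0 | data), for a point null versus a single
# specified point alternative about an exposure proportion.
from scipy.stats import binom

n, x = 100, 60            # hypothetical study: 60 of 100 cases exposed
p_null, p_alt = 0.5, 0.6  # H0 and the alternative exposure proportion
prior_null = 0.5          # prior belief in H0 before seeing the data

p_value = binom.sf(x - 1, n, p_null)     # one-sided P(X >= 60 | H0)

like_null = binom.pmf(x, n, p_null)      # likelihood of the data actually observed
like_alt = binom.pmf(x, n, p_alt)
posterior_null = (prior_null * like_null) / (
    prior_null * like_null + (1 - prior_null) * like_alt)

print(f"one-sided p value      = {p_value:.3f}")
print(f"posterior P(H0 | data) = {posterior_null:.3f}")
```

In this invented example, a one-sided p value of about 0.03 coexists with a posterior probability of the null of roughly 0.12, and that posterior depends only on the prior and on the likelihood of the data actually observed, not on how many other comparisons happen to have been made in the same study.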
The problem of hypothesis generation after the data have been seen is not quite so easily dismissed, because in the Bayesian paradigm every theory requires a statement of prior belief, and although some people have tried to get around the problem (28), it is not really possible to make such a statement after the data have been seen. The best solution is to process the data sequentially or in batches, so that if a previously unsuspected hypothesis is formulated from inspection of the early data, that experience can be used as the basis for the quantification of one's prior opinion, and the remainder of the data are then available to update those beliefs.

Problems of repeated testing come in many forms. The term "multiple comparisons" usually refers to the analysis of the same outcome with several exposures, such as when a series of treatments are compared. A similar situation occurs with interim analysis, when the data are analyzed part way through a study and then, if it is felt necessary, the study is continued and reanalyzed later. On the other hand, multiple testing arises when different outcomes are to be compared within the same study; this commonly arises when several different endpoints are measured or when preliminary tests are made to help formulate a model. The problem of many independent measurements of the same quantity is perhaps the simplest from a Bayesian point of view, because exchangeability can often be used to create a relatively simple analysis (28, 29). When the tests are not exchangeable, the investigators will need to specify the interdependence between their prior beliefs, and this is not easy. Although Bayesian analysis provides an excellent and realistic theoretical framework for data analysis, it would be wrong to underestimate its practical problems.

CONCLUSIONS

Savitz and Olshan ridicule adjustment for multiple comparisons and a concern for the method of generation of the hypothesis, claiming that they are "irrelevant to assessing the validity of the product" (1, p. 904). In one sense, there is little to argue with in this, but if, as they imply, they use p values and the corresponding confidence intervals to measure that validity, then it is these measures that will cause a problem. If p values actually measured evidence, then they would have all the characteristics that Savitz and Olshan desire, but, unfortunately, they do not, and so their interpretation requires great care.

The arguments of Savitz and Olshan (1) identify a genuine problem in epidemiologic analysis. However, I disagree with their recommendation that it is best not to adjust for multiple comparisons or to be concerned about the method of hypothesis generation. This is dangerous advice that is open to abuse. Rather, I see the points they make as highlighting the deficiencies of p values, and I look forward to a time when epidemiologists adopt sounder methods of data analysis. In the meantime, those researchers who cannot bring themselves to abandon traditional statistical methods should be advised not to attach any specific meaning to individual unadjusted p values used informally and to adjust all formal tests, for otherwise those tests lose their one merit, the protection against Type I errors.

REFERENCES

1. Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995;142:904-8.
2. Jeffreys H. Theory of probability. Oxford, England: Oxford University Press, 1939.
3. Edwards AWF. Likelihood. Cambridge, England: Cambridge University Press, 1972.
4. Barnett V. Comparative statistical inference. New York, NY: John Wiley & Sons, 1973.
5. Rothman K. A show of confidence. N Engl J Med 1978;299:1362-3.
6. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J 1986;292:746-50.
7. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J Epidemiol 1993;138:923-36.
8. Johnstone DJ. Tests of significance in theory and practice. Statistician 1986;35:491-504.
9. Goodman SN. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485-96.
10. Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc A 1933;231:289-337.
11. Fisher RA. Statistical methods and scientific inference. 3rd ed. New York, NY: Hafner, 1973.
12. Barnard GA. Statistical inference (with discussion). J R Stat Soc B 1949;11:115-49.
13. Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of p values and the evidence (with discussion). J Am Stat Assoc 1987;83:687-97.
14. Goodman SN. A comment on replication, p values and evidence. Stat Med 1992;11:875-9.
15. Goodman SN, Royall R. Evidence and scientific research. Am J Public Health 1988;78:1568-74.
16. Chatfield C. Model uncertainty, data mining and statistical inference (with discussion). J R Stat Soc A 1995;158:419-66.
17. Kendall MG, Stuart A. The advanced theory of statistics. Vol. 2, 3rd ed. London, England: Charles Griffin, 1973:190-1.
18. Miller RG. Simultaneous statistical inference. 2nd ed. New York, NY: Springer-Verlag, 1981.
19. Hochberg Y, Tamhane AC. Multiple comparison procedures. New York, NY: John Wiley & Sons, 1987.
20. Westfall PH, Young SS. Resampling-based multiple testing. New York, NY: John Wiley & Sons, 1993.
21. Thomas DC, Siemiatycki J, Dewar R, et al. The problem of multiple inference in studies designed to generate hypotheses. Am J Epidemiol 1985;122:1080-95.
22. Walker AM. Reporting the results of epidemiological studies. Am J Public Health 1986;76:556-8.
23. Mills JL. Data torturing. N Engl J Med 1993;329:1196-9.
24. Davey Smith G, Phillips AN, Neaton JD. Smoking as "independent" risk factor for suicide: illustration of an artifact from observational epidemiology? Lancet 1992;340:709-11.
25. Bernardo JM, Smith AFM. Bayesian theory. New York, NY: John Wiley & Sons, 1994.
26. Kalbfleisch JD, Sprott DA. Applications of likelihood methods to models involving large numbers of parameters. J R Stat Soc 1970:175-94.
27. Spiegelhalter DJ, Thomas A, Best NG, et al. BUGS: Bayesian inference using Gibbs sampling. Version 0.50. Cambridge, England: MRC Biostatistics Unit, 1995.
28. Berry DA. Multiple comparisons, multiple tests and data dredging: a Bayesian perspective. In: Bernardo JM, DeGroot MH, Lindley DV, et al., eds. Bayesian statistics 3. Oxford, England: Oxford University Press, 1988:79-94.
29. Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 1991;2:244-51.