Achieving statistical significance with covariates

Draft

Gabriel Lenz and Alexander Sahn

April 7, 2017

Abstract

An important and understudied area of hidden researcher discretion is the use of covariates. Researchers choose which covariates to include in statistical models, and these choices affect the size and statistical significance of estimates reported in studies. How often does the statistical significance of published findings depend on these discretionary choices? The main hurdle to studying this problem is that researchers never know the true model and can always make a case that their choices are most plausible, closest to the true data generating process, or most likely to rule out alternative explanations. We attempt to surmount this hurdle through a meta-analysis of articles published in the American Journal of Political Science (AJPS). In almost 40% of observational studies, we find that researchers achieve conventional levels of statistical significance through covariate adjustments. Although that discretion may be justified, researchers almost never disclose or justify it.

1 Introduction

When researchers conduct statistical analyses, they have a great deal of discretion over how to analyze their data. They can decide which covariates to include, how to code their variables, what statistical model to estimate, how to calculate the standard errors, and so on. Absent pre-specification, they can observe how each of those decisions changes the estimated effect of their key variable and then decide which one to publish. By trying many different specifications, researchers may happen upon a false positive, that is, find a statistically and substantively significant effect when none exists. Through this process, researchers may intentionally "p-hack" (Simmons, Nelson and Simonsohn, 2011) their way to a statistically significant finding, or they may do so unintentionally. Of course, reviewers and editors may limit this discretion, but they likely do not eliminate it. Although researchers have known about this problem for decades (e.g., Leamer, 1983), the extent of the problem remains unclear.

In this paper, we focus on the discretion researchers exert over one modeling decision in particular: the choice of covariates. Although the crisis in reproducibility has attracted considerable attention (Miguel et al., 2014), researchers have largely focused on aspects other than covariate choice, in part because covariate choice is such a hard problem.[1] In most studies, investigators have numerous covariates they can include in (or exclude from) their statistical models. The choice of which covariates to include is rarely discussed or scrutinized, but it may have large effects on the key variable on which the author's claim rests. We assess the extent to which researchers could be (intentionally or unintentionally) using covariates to present false positives as real effects.

[1] There are important exceptions (Leamer, 1985; Athey and Imbens, 2015; Simonsohn, Simmons and Nelson, 2015).

Including covariates can have many effects on statistical estimates. As students learn in introductory statistics, covariates can deflate (decrease in absolute value) estimates when they correlate, with the same sign, with both the dependent variable and the key independent variable. They can inflate estimates (increase them in absolute value) when they correlate with these variables with different signs; the simulation below illustrates both patterns. If researchers knew the true function generating their data, they could correctly specify their statistical model, exerting no discretion in their choice of covariates. In practice, however, they do not know the true model.
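To make these two patterns concrete, the following minimal simulation is ours alone; the variables and coefficients are invented, not drawn from any article we review. It shows a same-sign covariate deflating a key estimate and an opposite-sign covariate (a suppressor) inflating one.

```python
# Minimal sketch: how a covariate can deflate or inflate a key estimate.
# Entirely simulated data; all coefficients are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

# Deflation: z correlates with the same sign with both x and y (a confounder).
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 * x + 1.0 * z + rng.normal(size=n)

no_covariate = sm.OLS(y, sm.add_constant(x)).fit()
with_covariate = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(no_covariate.params[1], with_covariate.params[1])   # ~1.4 deflates to ~1.0

# Inflation: s correlates positively with x but negatively with y (a suppressor).
s = 0.5 * x + rng.normal(size=n)
y2 = 0.8 * x - 1.0 * s + rng.normal(size=n)

no_covariate = sm.OLS(y2, sm.add_constant(x)).fit()
with_covariate = sm.OLS(y2, sm.add_constant(np.column_stack([x, s]))).fit()
print(no_covariate.params[1], with_covariate.params[1])   # ~0.3 inflates to ~0.8
```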
To quote a well-known remark attributed to George Box, "All models are wrong, but some are useful," and another from Theil (1971), "Models are to be used, but not to be believed." Since researchers almost never know the true data generating process, they have to choose which covariates to include. They can therefore potentially choose to include covariates that tend to inflate their key coefficient estimates and exclude those that deflate them.

The main hurdle to studying this problem is that researchers can always make a case that their model is most plausible, closest to the true data generating process, most likely to rule out alternative explanations, and so on. We attempt to surmount this hurdle through a meta-analysis. Although we cannot compare a researcher's multivariate estimate to the "true" estimate, we can compare it to an estimate without covariates. The estimate without covariates has an important and often overlooked advantage: it cannot be "hacked," at least not with covariates. This limited discretion reduces the number of "forking paths" (Loken and Gelman, 2017). We therefore examine how researchers' estimates diverge from their minimal specification, where researchers have no discretion over covariates, to their full specification, where they may have a great deal of discretion. In particular, we examine how often studies depend on potentially discretionary covariates to achieve statistical significance, and how often they report that to readers. Of course, researchers' covariate choices may be well justified, and we are reluctant to second-guess them. We thus focus on prevalence and disclosure.

Our analysis is possible because the American Journal of Political Science (AJPS), a leading political science journal, began enforcing the posting of replication data and code. In practice, then-Editor Rick Wilson would not proceed with the copyediting of accepted papers until the authors posted their data to Dataverse. We therefore analyze AJPS volumes 56-59 (2012-2015). We find considerable evidence of undisclosed p-value deflation. In almost 40% of cases, achieving conventional levels of statistical significance depends on covariates. Although that discretion may be justified, researchers almost never disclose or justify it.

Data

According to our pre-analysis plan,[2] we selected articles from AJPS 2012-2015 that focused on one key causal claim and had a statistical model with at least three covariates. Because of a paucity of experimental studies that met our criteria, we subsequently relaxed the threshold to a single covariate for experimental studies. We excluded a handful of articles that met these criteria but reported a null result. Eighteen experimental and 46 observational studies met these criteria, and our analysis focuses on these 64.

[2] https://osf.io/p6y2g/ We report all analyses specified in the pre-analysis plan in an appendix available at this link. In the process of collecting data, we departed from our pre-analysis plan in several ways and explain those departures in the revised plan. None of the departures change the findings.

Some articles reported multiple estimates of the key finding (e.g., multiple cases or multiple measures). To keep the analysis manageable, we reanalyze only one estimate, selected by the following rules. We always use estimates highlighted as most important in the text. If no estimate is highlighted, we then use estimates based on summary measures of the key variables. If neither of these exists, we use the model with the smallest absolute value coefficient on the key variable. As figure 1 shows, we successfully replicate the main finding in each of these papers.
Given the difficulties researchers have faced with replication (Dewald, Thursby and Anderson, 1986; King, 1995), this result is reassuring: AJPS's replication policy works.

[Figure 1: Replication of Key Estimates in Full Specification Models. Scatterplot of the log replicated estimate against the log published estimate.]

Undisclosed P-Value Deflation in Observational Studies

We start by analyzing the 46 observational studies that met our criteria, comparing the key estimate with and without covariates. In some cases, estimation requires certain covariates to be sensible, and we retain covariates in those cases. For example, if the key test is an interaction, we include the main effects. We also include indicators for experimental conditions and fixed effects when the analysis depends on them (e.g., within-subject designs). We therefore describe our estimate without potentially discretionary covariates as the "minimal specification" and contrast it with the "full specification."

Figure 2 presents evidence of undisclosed p-value deflation. For the observational studies, it shows the p-value in the minimal specification (triangle) and in the full specification (circle). Of the 46 observational studies, 35 (76%) failed to show a minimal specification (left panel of figure 2). We count studies as showing a minimal specification if they presented one in a statistical model, scatterplot, cross tab, or difference in means. Among the 35 that lack a minimal specification, a surprisingly large number have minimal estimates that fall short of the conventional level of statistical significance (0.05): 17, or about 50%.[3] In these 17 cases, researchers used covariates to decrease their p-values, sometimes substantially. As figure 2 shows, eight of the studies lowered their p-values considerably, from more than 0.75 to less than or about 0.05. Another seven or so lowered their p-values by a smaller amount, from about 0.20 to less than or about 0.05. Finally, three lowered their p-values a trivial amount to reach 0.05.

[3] Of the 35 studies that did not report a minimal specification, three had a minimal specification with a different sign than the full specification (one of which was highly significant). We code these to a p-value of 1.

[Figure 2: Undisclosed P-Value Deflation (Observational Studies). P-values in the minimal and full specifications, with panels for studies that did and did not show a minimal specification.]

Of the 46 observational studies, 11 did show a minimal specification. These exhibit a very different pattern of changes in their p-values from the minimal specification to the full specification (right panel of figure 2). In fact, only one study shows substantial p-value deflation.

To varying degrees, then, researchers are deflating their p-values with covariates and not revealing this deflation to readers. They are doing so to some degree in almost 40% of observational articles in AJPS during this period, and to a moderate or large degree in over 30% of articles. Of course, these researchers may well be justified, a point we discuss in the next section. Arguing the merits of each case is difficult, and our instinct is to defer to the authors. Our main point is that researchers are decreasing their p-values from minimal specifications, where covariate discretion is absent, to full specifications, which are rife with forking paths and open to p-hacking. Furthermore, researchers are not disclosing the deflation or offering justifications (though in some cases we can find a strong implied one, as we discuss below).
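For concreteness, the core comparison behind figure 2 can be sketched as follows for a linear model. This is our schematic reconstruction, not the authors' replication code; the data frame, outcome, key variable, and covariate names are all hypothetical.

```python
# Schematic of the minimal-versus-full comparison; all names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def minimal_vs_full(df: pd.DataFrame, outcome: str, key: str, covariates: list[str]) -> dict:
    """Fit the key variable alone, then with the article's chosen covariates."""
    minimal = smf.ols(f"{outcome} ~ {key}", data=df).fit()
    full = smf.ols(f"{outcome} ~ {key} + " + " + ".join(covariates), data=df).fit()
    return {
        "beta_minimal": minimal.params[key], "p_minimal": minimal.pvalues[key],
        "beta_full": full.params[key], "p_full": full.pvalues[key],
    }

# Hypothetical usage:
# minimal_vs_full(df, "turnout", "exposure", ["age", "income", "education"])
```

In practice, as noted above, covariates needed for the estimate to be sensible (main effects of an interaction, condition indicators, necessary fixed effects) would stay in the minimal model, and nonlinear outcomes call for the analogous comparison with the appropriate estimator.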
Are the Decreased P-Values Justified?

Although we are tempted to focus exclusively on the lack of disclosure and justification, we cannot resist investigating whether these p-value decreases are reasonable. They have two main justifications. First, covariates may soak up noise in the dependent variable, decreasing the standard error of the key estimate. Second, covariates may increase the key estimate's absolute value; they may do so for several reasons, but especially when they capture a suppressor variable. Soaking up noise is (mostly) uncontroversial. In contrast, increasing the absolute value of the estimate is potentially controversial, though justifiable in some cases. It is potentially controversial because, since no one knows the true model, researchers never know whether they are inflating an estimate beyond its true value. More generally, adding covariates to a model does not necessarily reduce bias unless researchers have specified the true data generating process, which in most studies they can never knowingly do (Clarke, 2005).

How much of the p-value decrease results from soaking up noise versus increasing the absolute value of the coefficients? Figure 3 shows that increasing the key coefficients drives much of the p-value decrease, though both contribute. The light gray arrows show the total decrease in p-values (the same as figure 2), and the black arrows show the decrease from changing only the key coefficient estimate. More precisely, the black arrows show the change in the p-value from the minimal specification to the full specification if only the coefficient changes, not the standard error (that is, the p-value implied by the full-specification coefficient combined with the minimal-specification standard error). Only in one case does the decrease result mostly from a smaller standard error.

[Figure 3: Components of P-Value Deflation. Total p-value change from the minimal to the full specification, and the change attributable to the coefficient alone.]

Several dark arrows point upwards in figure 3. They do so because the key coefficient estimate decreases in absolute value, making the result less statistically significant. Given that ruling out alternative explanations is, for many, the main point of covariates, we should arguably see more upward arrows.

We carefully read through these articles and attempted to assess which ones were justifiable cases of key-estimate inflation (in absolute value), a difficult judgment call. Some instances of inflation seem well justified. One excellent article that exemplifies reasonable justification is Davenport (2015), which studies the effect of casualties and low draft numbers on parents' turnout during the Vietnam War. She finds that parents whose children were at high risk of being drafted (low lottery numbers) and who lived in towns that suffered casualties were more likely to turn out. In this context, one can tell a simple story about suppressor variables: poor regions of the country have lower turnout and disproportionately contributed soldiers to combat roles in Vietnam, so they were hit disproportionately by casualties. Socio-economic status may therefore suppress Davenport's key interaction; the toy simulation below makes the mechanism concrete.
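The simulation is entirely invented (it is not Davenport's data, model, or code), but it builds in the hypothesized structure: an SES-like variable that raises turnout while lowering exposure to the key variable.

```python
# Toy suppressor simulation; invented data, loosely mimicking the story above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
ses = rng.normal(size=n)                     # stand-in for socio-economic status
exposure = -0.6 * ses + rng.normal(size=n)   # key variable: lower-SES places more exposed
turnout = 0.4 * exposure + 0.4 * ses + rng.normal(size=n)

minimal = sm.OLS(turnout, sm.add_constant(exposure)).fit()
full = sm.OLS(turnout, sm.add_constant(np.column_stack([exposure, ses]))).fit()

# In expectation the minimal slope is about 0.22 and the adjusted slope 0.40:
# omitting ses suppresses the key estimate, while adjusting for it roughly
# doubles the coefficient and shrinks the p-value (any single draw will vary).
print(minimal.params[1], minimal.pvalues[1])
print(full.params[1], full.pvalues[1])
```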
Indeed, in reanalyzing Davenport's data, we find that prior turnout (which likely captures individual socio-economic status) and town-level measures of socio-economic status double the size of her key interaction, lowering its p-value from 0.11 in the minimal specification to 0.01 in the full specification.

For several other articles, we can tell a reasonable story about covariates inflating the key estimate (lowering the p-value), though a less compelling one than in the Davenport case. For many articles, however, there is no obvious story, though the articles' authors or subfield experts may be able to generate one. Unusual covariate choices, such as the inclusion of post-treatment variables, helped researchers achieve statistical significance in more than a few of these articles.

Experimental Studies

In observational studies, covariates can frequently change the key variable's estimate substantially because they can correlate with both the key variable and the dependent variable. In experimental studies, by contrast, covariates rarely change the key variable's estimate much because researchers randomly assign the key variable. Experiments may therefore be less vulnerable to intentional or unintentional p-hacking with covariates, especially as their sample sizes increase.

In figure 4, we examine p-value deflation in the 18 experimental articles that met our criteria. In contrast with the observational results, we find little sign of p-value deflation: only two of the 18 studies show substantial deflation, and only one of those went undisclosed.

[Figure 4: Undisclosed P-Value Deflation (Experimental Studies). P-values in the minimal and full specifications, with panels for studies that did and did not show a minimal specification.]

Experimental approaches have many advantages. One rarely discussed advantage is reduced discretion in covariate choice; the sketch below illustrates why randomization confers it.
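This final toy simulation is again invented, not drawn from any of the 18 articles. Random assignment makes the treatment nearly orthogonal to any covariate, so adjustment mostly shrinks the standard error rather than moving the coefficient.

```python
# Toy randomized experiment; invented data and coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000
treat = rng.integers(0, 2, size=n).astype(float)  # randomly assigned treatment
age = rng.normal(size=n)                          # prognostic covariate, independent of treat
y = 0.2 * treat + 0.8 * age + rng.normal(size=n)

unadjusted = sm.OLS(y, sm.add_constant(treat)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([treat, age]))).fit()

# The treatment coefficient barely moves; the standard error shrinks because
# `age` soaks up outcome noise, the (mostly) uncontroversial channel above.
print(unadjusted.params[1], unadjusted.bse[1])
print(adjusted.params[1], adjusted.bse[1])
```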
Conclusion

We find evidence of undisclosed p-value deflation. In almost 40% of the 2012-2015 observational AJPS articles we analyzed, researchers achieved conventional levels of statistical significance using covariates. Much of the deflation comes from inflating the key effect estimates, less from reducing uncertainty. Although that discretion may be justified, researchers almost never disclose or justify it.

The reproducibility crisis has highlighted the value of discretion-resistant statistics. Because they admit no potentially discretionary covariates, minimal specifications have this quality: they are transparent. To help readers evaluate the level of discretion, researchers should disclose the minimal specification. If it departs noticeably from other specifications, they should explain why.

References

Athey, Susan and Guido Imbens. 2015. "A Measure of Robustness to Misspecification." The American Economic Review 105(5):476–480.

Clarke, Kevin A. 2005. "The Phantom Menace: Omitted Variable Bias in Econometric Research." Conflict Management and Peace Science 22(4):341–352.

Davenport, Tiffany C. 2015. "Policy-Induced Risk and Responsive Participation: The Effect of a Son's Conscription Risk on the Voting Behavior of His Parents." American Journal of Political Science 59(1):225–241.

Dewald, William G., Jerry G. Thursby and Richard G. Anderson. 1986. "Replication in Empirical Economics: The Journal of Money, Credit and Banking Project." The American Economic Review 76(4):587–603.

King, Gary. 1995. "Replication, Replication." PS: Political Science and Politics 28(3):444–452.

Leamer, Edward E. 1983. "Let's Take the Con Out of Econometrics." The American Economic Review 73:31–43.

Leamer, Edward E. 1985. "Sensitivity Analyses Would Help." The American Economic Review 75:308–313.

Loken, Eric and Andrew Gelman. 2017. "Measurement Error and the Replication Crisis." Science 355(6325):584–585.

Miguel, E., C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn and M. Van der Laan. 2014. "Promoting Transparency in Social Science Research." Science 343(6166):30–31.

Simmons, Joseph P., Leif D. Nelson and Uri Simonsohn. 2011. "False-Positive Psychology." Psychological Science 22:1359–1366.

Simonsohn, Uri, Joseph P. Simmons and Leif D. Nelson. 2015. "Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications." SSRN Scholarly Paper ID 2694998. Rochester, NY: Social Science Research Network.

Theil, Henri. 1971. Principles of Econometrics. New York, NY: Wiley.