Achieving statistical significance with covariates
Draft
Gabriel Lenz
Alexander Sahn
April 7, 2017
Abstract
An important and understudied area of hidden researcher discretion is the use of covariates.
Researchers choose which covariates to include in statistical models and these choices affect the
size and statistical significance of estimates reported in studies. How often does the statistical
significance of published findings depend on these discretionary choices? The main hurdle
to studying this problem is that researchers never know the true model and can always make
a case that their choices are most plausible, closest to the true data generating process, or
most likely to rule out alternative explanations. We attempt to surmount this hurdle through
a meta-analysis of articles published in the American Journal of Political Science (AJPS). In
almost 40% of observational studies, we find that researchers achieve conventional levels of
statistical significance through covariate adjustments. Although that discretion may be justified,
researchers almost never disclose or justify it.
Introduction
When researchers conduct statistical analyses, they have a great deal of discretion about how to
analyze their data. They can decide which covariates to include, how to code their variables, what
statistical model to estimate, how to calculate the standard errors, etc. Absent pre-specification,
they can observe how each of those decisions changes the estimated effect of their key variable
and then decide which one to publish. By trying many different specifications, researchers may
happen upon a false positive—that is, find a statistically and substantively significant effect when
none exists. Through this process, researchers may intentionally “p-hack” (Simmons, Nelson and
Simonsohn, 2011) their way to a statistically significant finding or they may do so unintentionally.
Of course, reviewers and editors may limit this discretion, but they likely do not eliminate it. Although researchers have known about this problem for decades (e.g., Leamer, 1983), the extent of the problem remains unclear.
In this paper, we focus on the discretion researchers exert on one modeling decision in particular:
the choice of covariates. Although the crisis in reproducibility has attracted considerable attention
(Miguel et al., 2014), researchers have largely focused on aspects other than covariate choice—
in part because covariate choice is such a hard problem.1 In most studies, investigators have
numerous covariates they can include in (or exclude from) their statistical models. The choice of
which covariates to include is rarely discussed or scrutinized, but it can have large effects on the estimate for the key variable on which the author’s claim rests. We assess the extent to which researchers could be (intentionally or unintentionally) using covariates to present false positives as real effects.

1 There are important exceptions (Leamer, 1985; Athey and Imbens, 2015; Simonsohn, Simmons and Nelson, 2015).
Including covariates can have many effects on statistical estimates. As students learn in introductory
statistics, covariates can deflate estimates (decrease them in absolute value) when they correlate, with the same sign, with both the dependent variable and the key independent variable. They can inflate estimates (increase them in absolute value) when they correlate with opposite signs with these variables.
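For intuition, consider a short simulation. The sketch below is ours, not from the paper; all variable names and coefficient values are illustrative:

```python
# Illustrative simulation of covariate-driven deflation and inflation.
# Names and coefficients are hypothetical, not taken from any study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

def key_coef(y, X):
    """OLS coefficient on the first column of X (after a constant)."""
    return sm.OLS(y, sm.add_constant(X)).fit().params[1]

# Deflation: z correlates with the same sign with both x and y, so
# adding z shrinks the key estimate (here from about 1.33 toward 1.00).
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = x + 0.7 * z + rng.normal(size=n)
print(key_coef(y, x), key_coef(y, np.column_stack([x, z])))

# Inflation: s correlates with opposite signs with x and y (a classic
# suppressor), so adding s grows the key estimate (from about 0.67 to 1.00).
s = rng.normal(size=n)
x = 0.7 * s + rng.normal(size=n)
y = x - 0.7 * s + rng.normal(size=n)
print(key_coef(y, x), key_coef(y, np.column_stack([x, s])))
```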
If researchers knew the true function generating their data, they could correctly specify their
statistical model, exerting no discretion in their choice of covariates. In practice, however, they do
not know the true model. To quote a well-known remark attributed to George Box, “All models
are wrong, but some are useful,” and another from Theil (1971), “Models are to be used, but not to
be believed.” Since researchers almost never know the true data generating process, they have to
choose which covariates to include. They can therefore potentially choose to include covariates
that tend to inflate their key coefficient estimates and exclude those that deflate their estimates.
The main hurdle to studying this problem is that researchers can always make a case that their model
is most plausible, closest to the true data generating process, most likely to rule out alternative
explanations, etc. We attempt to surmount this hurdle through a meta-analysis.
Although we cannot compare a researcher’s multivariate estimate to the “true” estimate, we can
compare it to an estimate without covariates. The estimate without covariates has an important and
often overlooked advantage: it cannot be “hacked,” at least not with covariates. This limited discretion reduces the number of “forking paths” (Loken and Gelman, 2017). We therefore examine how
researchers’ multivariate estimates change from their minimal specification—where researchers
have no discretion over covariates—to their full specification—where researchers may have a great
deal of discretion. In particular, we examine how often studies depend on potentially discretionary
covariates to achieve statistical significance, and how often they report that to readers. Of course,
researchers’ covariate choices may be well justified—we are reluctant to second-guess them. We
thus focus on prevalence and disclosure.
Our analysis is possible because the American Journal of Political Science (AJPS), a leading political
science journal, began enforcing the posting of replication data and code. In practice, then-Editor
Rick Wilson would not proceed with the copyediting of accepted papers until the authors posted
their data to Dataverse. We therefore analyze AJPS volumes 56-59 (2012-2015).
We find considerable evidence of undisclosed p-value deflation. In almost 40% of cases, achieving
conventional levels of statistical significance depends on covariates. Although that discretion may
be justified, researchers almost never disclose or justify it.
Data
According to our pre-analysis plan,2 we selected articles from AJPS 2012-2015 that focused on one
key causal claim and had a statistical model with at least three covariates. Because few experimental studies met this threshold, we subsequently relaxed the requirement to a single covariate for experimental studies. We excluded a handful of articles that met these criteria but reported a null result. Eighteen experimental and 46 observational studies met these criteria, and our analysis focuses on these 64.

2 https://osf.io/p6y2g/. We report all analyses specified in the pre-analysis plan in an appendix available at this link. In the process of collecting data, we departed from the pre-analysis plan in several ways and explain those departures in the revised plan. None of the departures change the findings.
Some articles reported multiple estimates of the key finding (e.g., multiple cases or multiple
measures). To keep the analysis manageable, we reanalyze only one estimate. To select this
estimate, we use the following rules: We always use estimates highlighted as most important in the
text. If no estimate is highlighted, we then use estimates based on summary measures of the key
variables. If neither of these exists, we use the model with the smallest absolute-value coefficient on
the key variable.
In figure 1, we successfully replicate the main finding in each of these papers. Given the difficulties researchers have faced with replication (Dewald, Thursby and Anderson, 1986; King, 1995), this result is reassuring—AJPS’s replication policy works.

Figure 1: Replication of Key Estimates in Full Specification Models (log replicated estimate plotted against log published estimate)
Undisclosed P-Value Deflation in Observational Studies
We start by analyzing the 46 observational studies that met our criteria. We compare the key
estimate with covariates and without covariates. In some cases, estimation is sensible only with certain covariates, and we include those. For example, if the key test is an interaction, we include the main effects. We also include indicators for experimental conditions and include fixed effects when the analysis depends on them (e.g., within-subject experiments). We therefore describe our
estimate without potentially discretionary covariates as the “minimal specification” and contrast it
with the “full specification.”
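Concretely, the comparison we run for each article looks like the following sketch; the data frame, variable names, and covariates are hypothetical stand-ins for a study’s replication file:

```python
# Minimal vs. full specification for a hypothetical replication file.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"key_iv": rng.normal(size=n),
                   "cov1": rng.normal(size=n),
                   "cov2": rng.normal(size=n)})
df["dv"] = 0.1 * df["key_iv"] + 0.5 * df["cov1"] + rng.normal(size=n)

# Minimal specification: no potentially discretionary covariates.
minimal = smf.ols("dv ~ key_iv", data=df).fit()
# Full specification: the covariates the authors chose to include.
full = smf.ols("dv ~ key_iv + cov1 + cov2", data=df).fit()

for label, fit in [("minimal", minimal), ("full", full)]:
    print(f"{label}: b = {fit.params['key_iv']:.3f}, "
          f"p = {fit.pvalues['key_iv']:.3f}")
```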
Figure 2 presents evidence of undisclosed p-value deflation. For the observational studies, it shows the p-value in the minimal specification (triangle) and in the full specification (circle).
Of the 46 observational studies, 35 (76%) failed to show a minimal specification (left panel of figure 2). We count studies as showing a minimal specification if they presented one in a statistical model, scatterplot, cross tab, or difference in means. Among the 35 that lack a minimal specification, a surprisingly large number have minimal estimates that fall short of the conventional level of statistical significance (0.05): 17, or 50%.3 In these 17 cases, researchers used covariates to decrease their p-values, sometimes substantially. As figure 2 shows, eight of the studies lowered their p-values considerably, from more than 0.75 to less than or about 0.05. Another seven or so lowered their p-values by a smaller amount, from about 0.20 to less than or about 0.05. Finally, three lowered their p-values a trivial amount to reach 0.05.

3 Of the 35 studies that did not report a minimal specification, three had a minimal specification with a different sign than the full specification (one of which was highly significant). We code these to a p-value of 1.

Figure 2: Undisclosed P-Value Deflation (Observational Studies). Left panel: minimal specification not shown; right panel: shown. Arrows run from the minimal-specification p-value (triangle) to the full-specification p-value (circle).
Of the 46 observational studies, 11 did show a minimal specification. These exhibit a very different pattern of changes in their p-values from the minimal specification to the full specification (right panel of figure 2). In fact, only one shows substantial p-value deflation.
To varying degrees, then, researchers are deflating their p-values with covariates and not revealing this deflation to readers. They are doing so to some degree in almost 40% of observational articles in AJPS during this period, and to a moderate or large degree in over 30% of articles. Of course, these researchers may be well justified, a point we discuss in the next section. Arguing the merits of each case is difficult, and our instinct is to defer to the authors. Our main point is that researchers are decreasing their p-values from minimal specifications—where covariate discretion is absent—to full specifications—which are rife with forking paths and open to p-hacking. Furthermore, researchers are not disclosing the deflation or offering justifications (though in some cases we can find a strong implied one, as we discuss below).
Are the Decreased P-Values Justified?
Although we are tempted to focus exclusively on the lack of disclosure and justification, we cannot resist some investigation into whether these p-value decreases are reasonable. They have two main justifications. First, covariates may soak up noise in the dependent variable, decreasing the standard error of the key estimate. Second, covariates may increase the key estimate’s absolute value—they may do so for several reasons, but especially when they capture a suppressor variable.
Soaking up noise is (mostly) uncontroversial. In contrast, increasing the absolute value of the estimate is potentially controversial, though justifiable in some cases. It is potentially controversial because, since no one knows the true model, researchers never know whether they are over-inflating an estimate. More generally, adding covariates to a model does not necessarily reduce bias unless researchers have specified the true data generating process, which in most studies they can never knowingly do (Clarke, 2005).
How much of the p-value decrease results from soaking up noise versus increasing the absolute value of the coefficients? Figure 3 shows that increasing the key coefficients drives much of the p-value decreases, though both contribute. The light gray arrows show the total decrease in p-values (same as figure 2) and the black arrows show the decreases from changing only the key coefficient estimate. More precisely, the black arrows show the change in the p-value from the minimal specification to the full specification if only the coefficient changes (not the standard error). Only in one case does the decrease result mostly from decreasing the standard error.
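This decomposition can be sketched as follows: hold the minimal-specification standard error fixed and let only the coefficient move to its full-specification value. The normal approximation and the input values below are our illustrative assumptions, not the paper’s exact computation:

```python
# Decomposing a p-value change into a coefficient part and a
# standard-error part (normal approximation; inputs are hypothetical).
from scipy import stats

def pval(beta, se):
    """Two-sided p-value for beta/se under a normal approximation."""
    return 2 * stats.norm.sf(abs(beta) / se)

b_min, se_min = 0.50, 0.40    # minimal specification
b_full, se_full = 1.00, 0.38  # full specification

total = pval(b_full, se_full) - pval(b_min, se_min)      # gray arrow
coef_only = pval(b_full, se_min) - pval(b_min, se_min)   # black arrow
print(f"total change: {total:+.3f}, coefficient-only: {coef_only:+.3f}, "
      f"standard-error remainder: {total - coef_only:+.3f}")
```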
Several black arrows point upward in figure 3. They do so because the key coefficient estimate
decreases in absolute value, making the result less statistically significant. Given that ruling out
alternative explanations is, for many, the main point of covariates, we should arguably see more
upward arrows.
We carefully read through these articles and attempted to assess which ones were justifiable cases
of key-estimate inflation (in absolute value), a difficult judgment call.
Some instances of inflation seem well justified. One excellent article that exemplifies reasonable
justification is Davenport (2015), which studies the effect of casualties and low draft numbers
on parents’ turnout during the Vietnam War. She finds that parents whose children are at high risk of being drafted (low lottery numbers) and who live in towns that have casualties are more likely to turn out. In this context, one can tell a simple story about suppressor variables: poor regions of the country have lower turnout and disproportionately contribute soldiers to combat roles in Vietnam, so they are hit disproportionately by casualties. Socio-economic status may therefore
suppress Davenport’s key interaction. Indeed, in reanalyzing Davenport’s data, we find that
prior turnout (which likely captures individual socio-economic status) and town measures of
socio-economic status double the size of her key interaction, lowering its p-value from 0.11 in the
minimal specification to 0.01 in the full specification.
For several other articles, we can tell a reasonable story about covariates inflating the key estimate
(lowering the p-value), though a less compelling one than in the Davenport case. For many articles,
however, there is no obvious story—though the articles’ authors or subfield experts may be able to
generate one. Unusual covariate choices, such as the inclusion of posttreatment variables, helped researchers achieve statistical significance in more than a few of these articles.
Figure 3: Components of P-Value Deflation. Light gray arrows: total p-value change from the minimal to the full specification; black arrows: change from the coefficient change alone. Left panel: minimal specification not shown; right panel: shown.
Experimental Studies
In observational studies, covariates can frequently change the key variable’s estimate substantially because they can correlate with both the key variable and the dependent variable. In experimental studies, by contrast, covariates rarely change the key variable’s estimate much because
researchers randomly assign the key variable. Experiments may therefore be less vulnerable to
intentional or unintentional p-hacking with covariates, especially as their sample size increases.
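A quick simulation illustrates why: with random assignment, a pre-treatment covariate is uncorrelated with the key variable in expectation, so adjusting for it leaves the coefficient nearly unchanged and mainly tightens the standard error. The setup below is hypothetical:

```python
# Covariate adjustment under random assignment (hypothetical setup).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000
treat = rng.integers(0, 2, size=n).astype(float)  # randomized key variable
cov = rng.normal(size=n)                          # pre-treatment covariate
y = 0.2 * treat + 0.8 * cov + rng.normal(size=n)

unadj = sm.OLS(y, sm.add_constant(treat)).fit()
adj = sm.OLS(y, sm.add_constant(np.column_stack([treat, cov]))).fit()
print(f"unadjusted: b = {unadj.params[1]:.3f} (se {unadj.bse[1]:.3f})")
print(f"adjusted:   b = {adj.params[1]:.3f} (se {adj.bse[1]:.3f})")
```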
In figure 4, we examine p-value deflation in the 18 experimental articles that met our criteria. In contrast with the observational results, we find little sign of p-value deflation. Only two of the 18 studies show substantial deflation, and only one of those went undisclosed.
Experimental approaches have many advantages. One rarely discussed advantage is reduced discretion in covariate choice.

Figure 4: Undisclosed P-Value Deflation (Experimental Studies). Panels: minimal specification shown / not shown; p-values in the minimal and full specifications.
Conclusion
We find evidence of undisclosed p-value deflation. In almost 40% of the 2012-2015 AJPS articles
we analyzed, researchers achieved conventional levels of statistical significance using covariates.
Much of the deflation comes from inflating the key effect estimates, less from reducing uncertainty.
Although that discretion may be justified, researchers almost never disclose or justify it.
The reproducibility crisis has highlighted the value of discretion-resistant statistics. Because they lack potentially discretionary covariates, minimal specifications have this quality; they are valuable because they are transparent.
To help readers evaluate the level of discretion, researchers should disclose the minimal specification. If it departs noticeably from other specifications, they should explain why.
References
Athey, Susan and Guido Imbens. 2015. “A Measure of Robustness to Misspecification.” The American
Economic Review 105(5):476–480.
Clarke, Kevin A. 2005. “The Phantom Menace: Omitted Variable Bias in Econometric Research.”
Conflict Management and Peace Science 22(4):341–352.
Davenport, Tiffany C. 2015. “Policy-Induced Risk and Responsive Participation: The Effect of
a Son’s Conscription Risk on the Voting Behavior of His Parents.” American Journal of Political
Science 59(1):225–241.
Dewald, William G., Jerry G. Thursby and Richard G. Anderson. 1986. “Replication in Empirical
Economics: The Journal of Money, Credit and Banking Project.” The American Economic Review
76(4):587–603.
King, Gary. 1995. “Replication, Replication.” PS: Political Science and Politics 28(3):444–452.
Leamer, Edward E. 1983. “Let’s Take the Con Out of Econometrics.” The American Economic Review
73:31–43.
Leamer, Edward E. 1985. “Sensitivity Analyses Would Help.” The American Economic Review
75:308–313.
Loken, Eric and Andrew Gelman. 2017. “Measurement Error and the Replication Crisis.” Science
355(6325):584–585.
Miguel, E., C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green,
M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr,
J. P. Simmons, U. Simonsohn and M. Van der Laan. 2014. “Promoting Transparency in Social
Science Research.” Science 343(6166):30–31.
Simmons, Joseph P., Leif D. Nelson and Uri Simonsohn. 2011. “False-Positive Psychology.” Psychological Science 22:1359–1366.
Simonsohn, Uri, Joseph P. Simmons and Leif D. Nelson. 2015. Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications. SSRN Scholarly Paper ID 2694998. Rochester, NY: Social Science Research Network.
Theil, Henri. 1971. Principles of Econometrics. New York, NY: Wiley.