Using Hierarchical Models to Calibrate Selection Bias
Douglas Rivers
Stanford University and YouGov
February 26, 2016
Margins of error
We agree that margin of sampling error in surveys has an
accepted meaning and that this measure is not
appropriate for non-probability samples. . . . We believe
that users of non-probability samples should be
encouraged to report measures of the precision of their
estimates, but suggest that to avoid confusion, the set of
terms be distinct from those currently used in probability
sample surveys.
AAPOR Report on Non-probability Sampling (2013)
Is inference possible with unknown selection probabilities?
- It had better be, since we certainly don't know what the selection probabilities are for most public opinion polls and market research surveys. With single-digit response rates, actual sample inclusion probabilities differ by two orders of magnitude from the initial unit selection probabilities.
- The usual approach is to assume ignorable selection (conditional independence of selection and the survey variables, given a set of covariates). Such inferences are made conditional upon the selection model, which is unlikely to hold exactly. Shouldn't this be reflected somehow in the margin of error?
- Empirically, calculated standard errors in pre-election polls substantially underestimate the RMSE. Gelman, Goel and Rothschild (2016) find that the actual RMSE was understated by 25% to 50% (depending upon the type of election) in 4,221 polls.
Three questions
A 100(1 − α)% confidence interval for a descriptive population
parameter θ0 is usually computed as

  θ̂ ± z_{1−α/2} s.e.(θ̂)
where θ̂ is a sample mean or proportion, possibly weighted.
1. Does θ̂ have a normal sampling distribution?
2. Can we estimate s.e.(θ̂) without knowing the selection
probabilities?
3. Is the sampling distribution of θ̂ centered on the population
parameter θ0 ?
If the answer to all three questions is “yes,” then the confidence
interval will have the stated level of coverage.
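The conventional computation can be sketched in a few lines (Python, for illustration; `confidence_interval` is a hypothetical helper, not code from the talk):

```python
from statistics import NormalDist
import math

def confidence_interval(y, alpha=0.05):
    """Conventional 100(1 - alpha)% interval for a sample mean/proportion
    (hypothetical helper; z is the standard normal quantile)."""
    n = len(y)
    theta = sum(y) / n
    se = math.sqrt(sum((v - theta) ** 2 for v in y) / n) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta - z * se, theta + z * se

# 52 "yes" answers out of 100: interval roughly 0.52 +/- 0.098
lo, hi = confidence_interval([1] * 52 + [0] * 48)
print(lo, hi)
```

Whether this interval deserves its nominal coverage is exactly what the three questions probe.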
1. Is θ̂ normally distributed?
Suppose {yᵢ}ᵢ₌₁ᴺ is a bounded sequence of real numbers and {Dᵢ}ᵢ₌₁ᴺ is
a sequence of independent Bernoulli random variables with E(Dᵢ) = πᵢ. Let

  n = Σᵢ₌₁ᴺ Dᵢ        θ̂_N = (1/n) Σᵢ₌₁ᴺ Dᵢ yᵢ

  π̄_N = (1/N) Σᵢ₌₁ᴺ πᵢ        θ*_N = Σᵢ₌₁ᴺ πᵢ yᵢ / (N π̄_N)

  ω²_N = (1/(N π̄_N)) Σᵢ₌₁ᴺ πᵢ (1 − πᵢ)(yᵢ − θ*_N)²

If

  (i) lim_{N→∞} π̄_N = π̄, where 0 < π̄ < 1
  (ii) lim_{N→∞} θ*_N = θ*
  (iii) lim_{N→∞} ω²_N = ω², where 0 < ω² < ∞

then

  √n (θ̂_N − θ*_N) →_L N(0, ω²)
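The result can be checked by simulation. The sketch below is illustrative only (the population values and inclusion probabilities are invented): it draws repeated samples with heterogeneous, "unknown" inclusion probabilities and confirms that √n(θ̂_N − θ*_N) is centred near zero with a stable spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population: bounded y values plus heterogeneous
# inclusion probabilities pi_i (all values invented for illustration).
N = 20_000
y = rng.binomial(1, 0.4, size=N).astype(float)
pi = rng.uniform(0.01, 0.10, size=N)

theta_star = np.sum(pi * y) / np.sum(pi)     # the target theta*_N

draws = []
for _ in range(1000):
    D = rng.random(N) < pi                   # independent Bernoulli inclusion
    n = D.sum()
    draws.append(np.sqrt(n) * (y[D].mean() - theta_star))
draws = np.asarray(draws)

print(draws.mean(), draws.std())             # centred near 0; spread estimates omega
```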
2. Can we estimate s.e.(θ̂)?
Under the same assumptions as the preceding result,

  ŝ.e.(θ̂) = [ (1/n²) Σ_{i∈s} (yᵢ − θ̂)² ]^{1/2}

is a conservative estimator of s.e.(θ̂), with asymptotic bias of order O(π̄_N).

This also works with weighting: yᵢ is replaced everywhere by wᵢ yᵢ (where wᵢ is the weight) and

  ŝ.e.(θ̂) = [ Σ_{i∈s} wᵢ² (yᵢ − θ̂)² / (n² w̄²) ]^{1/2}

Independence of the draws is enough. You don't need to know the
selection probabilities.
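A minimal sketch of this estimator (`se_hat` is a hypothetical helper name; the weighted formula reduces to the unweighted one when every weight equals 1):

```python
import numpy as np

def se_hat(y, w=None):
    """Hypothetical helper: conservative standard error of a (weighted)
    sample mean. Only independence of the inclusion draws is used; the
    selection probabilities themselves never appear."""
    y = np.asarray(y, dtype=float)
    w = np.ones(y.size) if w is None else np.asarray(w, dtype=float)
    theta = np.sum(w * y) / np.sum(w)
    return np.sqrt(np.sum(w ** 2 * (y - theta) ** 2)) / np.sum(w)

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.5, size=400)
print(se_hat(y))                                   # about 0.5 / sqrt(400) = 0.025
print(se_hat(y, rng.uniform(0.5, 2.0, size=400)))  # weighted version
```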
3. Is the distribution of θ̂ centered on θ0 ?
Unfortunately, no. The sampling distribution of θ̂ is approximately

  θ̂ ∼ N(θ*_N, ω²_N / n)

so the confidence interval

  θ̂ ± z_{1−α/2} ŝ.e.(θ̂)

is shifted by the quantity

  Bias(θ̂) ≐ θ*_N − θ0.

The margin of error has approximately correct coverage for θ*_N.
The interval is still useful for quantifying sampling error (how
much variation could be expected from selecting another sample
using the same process), but actual coverage for the population
parameter θ0 is overstated (sometimes by a lot).
Post-stratification to correct for selection bias
Bias can be eliminated if we can identify a set of covariates that
make selection conditionally independent of the survey variables.
The conditional independence (ignorability) assumption is more
plausible if the number of covariates is large.
However, post-stratification involves a bias-variance tradeoff.
Post-stratifying on a large number of variables is a form of
over-fitting which, while it may reduce bias, can increase the mean
square error by inflating the variance.
MSE(θ̂) = Bias2 (θ̂) + V(θ̂)
Is the problem bias or variance?
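The decomposition above is an exact algebraic identity, which a small synthetic simulation (all numbers invented for illustration) makes concrete: selection that favours high-y units produces a bias term that dominates the variance term.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic population in which selection favours high-y units
# (every number here is invented for illustration).
N = 20_000
x = rng.random(N)
y = rng.binomial(1, 0.3 + 0.4 * x)       # outcome rises with the covariate
pi = 0.005 + 0.02 * x                    # ...and so does selection -> bias
theta0 = y.mean()                        # population parameter

estimates = []
for _ in range(1000):
    s = rng.random(N) < pi               # independent Bernoulli inclusion
    estimates.append(y[s].mean())
estimates = np.asarray(estimates)

bias = estimates.mean() - theta0
variance = estimates.var()               # ddof=0, so the identity is exact
mse = np.mean((estimates - theta0) ** 2)
print(bias, variance, mse)               # mse equals bias**2 + variance
```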
Data
Seven opt-in internet surveys, a probability internet panel study,
and an RDD phone survey fielded almost identical questionnaires
in 2004-05, including eight items also included in the 2004
American Community Survey (ACS).
Primary demographics. Gender, age, race, education.
Secondary demographics. Marital status, home ownership,
number of bedrooms, number of vehicles.
Six of the opt-in surveys used online panels, while one
(SPSS-AOL) used a “river sample.” All of the opt-in samples used
some form of quota sampling on gender, sometimes on age and/or
region, and only one on race. The probability internet panel (KN)
uses purposive sampling for within-panel selection, while it appears
that the phone survey may have used gender quotas.
Only one of the opt-in survey vendors (Harris Interactive) provided
post-stratification weights.
Parallel estimates with different post-stratification schemes
[Figure: Error (Percent, −8 to 8) versus number of raking variables (0–5), one panel per survey: Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing.]
95% confidence intervals for estimates
[Figure: Error (Percent, −8 to 8) versus number of raking variables (0–5), with 95% confidence intervals, one panel per survey: Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing.]
Variance inflation caused by post-stratification
[Figure: Effects of Weighting on Standard Errors. S.E. inflation (percent, 0–250) for the weighting schemes Gender; Gender + Age; Gender + Age + Race; Gender + Age + Race + Educ; Gender + Age + Race + Educ + Region.]
Testing for nonignorable selection bias
Problem: It is difficult to distinguish between sampling variability
and selection bias in the full sample.

Idea: Post-stratify into 96 cells based on the primary demographics
(2 gender × 4 age × 3 race × 4 education categories), compute the
error in each cell for the four secondary demographics, and compare
it to the expected sampling error under no selection bias.

The ACS has a high response rate and a large sample (nearly one
million persons), so it provides reasonably accurate estimates of the
population proportion in each cell.
Notation
  x = covariates with finite support X
  y = survey variable (assumed to be dichotomous)
  p0(x) = population distribution of x
  q0(x) = P{y = 1 | x} = population conditional distribution of y
  π_y(x) = average sample selection probability for (x, y) units
  π(x) = q0(x)π1(x) + [1 − q0(x)]π0(x) = selection probability in cell x
  p̂(x) = sample proportion in cell x
  p*(x) = E[p̂(x)] = p0(x)π(x) / Σ_{x′∈X} p0(x′)π(x′)
  q̂(x) = sample proportion with y = 1 in cell x
  q*(x) = E[q̂(x)] = q0(x)π1(x) / {q0(x)π1(x) + [1 − q0(x)]π0(x)} = q0(x)π1(x)/π(x)
Sources of error
Estimation error has three components:

  θ̂ − θ0 = Σ_{x∈X} { [p̂(x) − p*(x)] q̂(x) + p*(x)[q̂(x) − q*(x)] }     (sampling)
          + Σ_{x∈X} [p*(x) − p0(x)] q*(x)                             (post-stratification)
          + Σ_{x∈X} p0(x) [q*(x) − q0(x)]                             (selection bias)

Post-stratification error can be eliminated by weighting the
observations in cell x by the ratio p0(x)/p̂(x).

We wish to test whether the last component (selection bias) is
present, i.e., whether q*(x) = q0(x) for all x ∈ X.
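A sketch of that weighting step (`poststratify` is a hypothetical helper; the population cell shares p0(x) are assumed known, as they would be from the ACS):

```python
import numpy as np

def poststratify(y, cells, p0):
    """Hypothetical helper: weight each observation in cell x by
    p0(x) / phat(x) and return the weighted mean. p0 holds the known
    population cell shares (e.g. taken from the ACS)."""
    y = np.asarray(y, dtype=float)
    cells = np.asarray(cells)
    phat = np.bincount(cells, minlength=len(p0)) / y.size
    w = (np.asarray(p0) / np.where(phat > 0, phat, 1.0))[cells]
    return np.sum(w * y) / np.sum(w)

# Toy check: the sample over-represents cell 1 (5/8 instead of 1/2)
y     = np.array([1, 1, 1, 0, 0, 0, 0, 0])
cells = np.array([0, 0, 0, 1, 1, 1, 1, 1])
p0    = np.array([0.5, 0.5])
print(poststratify(y, cells, p0))   # cell means 1 and 0, reweighted toward 0.5
```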
A chi-squared test for selection bias
Let n_x denote the sample count in cell x. Conditional upon
{n_x : x ∈ X}, the statistic

  X² = Σ_{n_x > 0} [ (q̂(x) − q0(x)) / s.e.(q̂(x)) ]²

has expected value J (where J is the number of cells for which
n_x > 0) if there is no selection bias.

When there is selection bias, X²/J will tend to be large. A
rough test of the hypothesis compares X² to a chi-squared
distribution with J degrees of freedom.
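The statistic is straightforward to compute. The sketch below uses synthetic cell counts and probabilities (purely for illustration) to show X²/J landing near 1 when q̂(x) fluctuates around q0(x) by sampling error alone:

```python
import numpy as np

def chi2_selection_test(qhat, q0, se, counts):
    """X^2 from the slide: sum over non-empty cells of ((qhat-q0)/se)^2.
    Under no selection bias its expected value is J, the number of
    non-empty cells."""
    keep = counts > 0
    z = (qhat[keep] - q0[keep]) / se[keep]
    return float(np.sum(z ** 2)), int(keep.sum())

# Illustrative null draw: 96 synthetic cells, qhat differing from q0 by
# sampling error alone (all counts and probabilities invented).
rng = np.random.default_rng(3)
counts = rng.integers(5, 40, size=96)
q0 = rng.uniform(0.2, 0.8, size=96)
qhat = np.array([rng.binomial(n, p) / n for n, p in zip(counts, q0)])
se = np.sqrt(q0 * (1 - q0) / counts)
X2, J = chi2_selection_test(qhat, q0, se, counts)
print(X2 / J)                        # near 1 in the absence of selection bias
```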
Results
[Figure: Chi-squared / degrees of freedom (0–25) versus number of raking variables (0–5), one panel per survey: Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing.]
Estimating the magnitude of selection bias
Although the amount of selection bias is not particularly large, we
can still reject the hypothesis of no selection bias. We model the
bias as a random variable with a multilevel variance structure.
We have S + 1 surveys, denoted by s = 0, 1, . . . , S (s = 0 for the ACS).

  Sampling model:  y_{i,s} | x_i ∼ Bernoulli(q*_s(x_i))
  First level:     logit(q*_s(x)) ∼ Normal(µ0(x) + η_s, τ_x² + σ_s²)
  Second level:    µ0(x) ∼ Normal(Z_xᵀα, ω²)

where Z_xᵀα = α_gender + α_age + α_race + α_education.

We use diffuse normal priors for the means and half-Cauchy priors
(with scale 5) for the variances.

We impose the restriction that η0 (the bias in the ACS) is zero.
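A forward simulation of this generative structure can clarify the shapes involved. All parameter values below are invented for illustration, the four-column design matrix is a toy stand-in for the gender/age/race/education effects, and an actual fit would of course use an MCMC sampler rather than this forward pass:

```python
import numpy as np

rng = np.random.default_rng(4)

def expit(v):
    return 1.0 / (1.0 + np.exp(-v))

# 96 demographic cells, S + 1 = 9 surveys (s = 0 plays the role of the ACS).
X_CELLS, S = 96, 8
Z = rng.integers(0, 2, size=(X_CELLS, 4))          # toy design: gender/age/race/educ
alpha = rng.normal(0.0, 0.5, size=4)
omega = 0.3
mu0 = rng.normal(Z @ alpha, omega)                 # second level: mu0(x)

eta = np.concatenate([[0.0], rng.normal(0, 0.5, size=S)])  # eta_0 = 0 (no ACS bias)
tau = np.full(X_CELLS, 0.2)
sigma = np.full(S + 1, 0.2)

# First level: logit(q*_s(x)) ~ Normal(mu0(x) + eta_s, tau_x^2 + sigma_s^2)
logit_q = rng.normal(mu0[None, :] + eta[:, None],
                     np.sqrt(tau[None, :] ** 2 + sigma[:, None] ** 2))
q_star = expit(logit_q)                            # shape (S + 1, X_CELLS)

# Sampling model: binomial responses in each survey-cell combination
y_counts = rng.binomial(30, q_star)
print(q_star.shape, y_counts.shape)
```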
Multilevel model estimates
[Figure: Estimated Bias (Percent, −6 to 4) by survey: Phone (SRBI), Probability Web (KN), 1. Harris, 2. Luth, 3. Greenfield, 4. SSI, 5. Survey Direct, 6. SPSS/AOL, 7. GoZing.]
Discussion
- The model is estimated separately for each survey variable. Alternatively, estimate the model on one set of variables and use the predictive distribution for another variable or survey to estimate how much the MSE exceeds the variance of the estimate.
- The current data are limited to items correlated with income and family size, which should instead be used as covariates.
- There is little evidence of substantial selection bias: for most of the surveys, selection bias adds 1–2% to the standard error.
- Easy improvements are possible in panel surveys by selecting a balanced sample, which reduces variability. The good performance of the probability panel is due primarily to balanced within-panel selection, not to probabilistic recruitment.