
Ch. 14 Missing Data and Other Opportunities
A Solomon Kurz
4/8/2017
14.1. Measurement error
First, let's grab our data.
library(rethinking)
library(ggplot2)  # needed for the custom theme and plots below
data(WaffleDivorce)
d <- WaffleDivorce
rm(WaffleDivorce)
Before we plot, let's make a custom theme, theme_black:
theme_black <- theme(
  # Specify axis options
  axis.line = element_line(size = 1/4, color = "white"),
  axis.text = element_text(color = "white", lineheight = 0.9),
  axis.ticks = element_line(color = "white", size = 0.2),
  axis.title = element_text(color = "white"),
  # Specify legend options
  legend.background = element_rect(color = NA, fill = "black"),
  legend.key = element_rect(color = "white", fill = "black"),
  legend.text = element_text(color = "white"),
  legend.title = element_text(face = "bold", hjust = 0, color = "white"),
  # Specify panel options
  panel.background = element_rect(fill = "black", color = NA),
  panel.border = element_rect(fill = NA, color = NA),
  panel.grid = element_blank(),
  # Specify faceting options
  strip.background = element_rect(fill = "grey30", color = "grey10"),
  strip.text = element_text(color = "white", angle = -90),
  # Specify plot options
  plot.background = element_rect(color = "black", fill = "black"),
  plot.title = element_text(color = "white"),
  plot.subtitle = element_text(color = "white")
)
Now, let's make use of our custom theme and reproduce/reimagine Figure 14.1.a.
ggplot(data = d, aes(x = MedianAgeMarriage, y = Divorce)) +
  geom_point(color = "white", alpha = .5, size = 2, shape = 16) +
  geom_segment(aes(xend = MedianAgeMarriage,
                   y = Divorce - Divorce.SE,
                   yend = Divorce + Divorce.SE),
               color = "white", size = 1/4) +
  labs(x = "Median age marriage", y = "Divorce rate") +
  theme_black
Figure 14.1.b.
ggplot(data = d, aes(x = log(Population), y = Divorce)) +
  geom_point(color = "white", alpha = .5, size = 2, shape = 16) +
  geom_segment(aes(xend = log(Population),
                   y = Divorce - Divorce.SE,
                   yend = Divorce + Divorce.SE),
               color = "white", size = 1/4) +
  labs(x = "log population", y = "Divorce rate") +
  theme_black
14.1.1. Error on the outcome.
Before we fit our models, let's switch packages.
detach(package:rethinking)
library(brms)
Now we're ready to fit our model. In brms, you specify error on the criterion variable
with a formula of the form response | se(se_response, sigma = TRUE). Here se stands
for standard error, the loose frequentist analogue to the Bayesian posterior SD. Unless
you're fitting a meta-analysis on summary information, make sure to specify sigma = TRUE.
Without it, you'll have no estimate for σ!
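To build intuition for what se(div_sd, sigma = TRUE) implies, here is a quick simulation sketch of the generative story (my own toy example, not from the text): each observed rate is a latent true rate plus Gaussian noise with a known standard error.

```r
# Toy version of the measurement-error likelihood behind
# div_obs | se(div_sd, sigma = TRUE): observed = latent + known noise.
set.seed(14)
n    <- 50
mu   <- rnorm(n, mean = 10, sd = 1)     # latent true rates; their SD is what sigma targets
se_i <- runif(n, min = 0.5, max = 1.5)  # known state-level standard errors, held fixed
obs  <- rnorm(n, mean = mu, sd = se_i)  # the rates we actually get to observe
sd(obs)  # noisier than sd(mu), since the known error piles on top of sigma
```

This is the intuition for why a model that ignores the known error, like the one we fit later in this section, lands on an inflated estimate of σ.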
dlist <- list(
  div_obs = d$Divorce,
  div_sd  = d$Divorce.SE,
  R       = d$Marriage,
  A       = d$MedianAgeMarriage)
b14.1 <- brm(data = dlist, family = gaussian,
             div_obs | se(div_sd, sigma = TRUE) ~ 1 + R + A,
             prior = c(set_prior("normal(0, 50)", class = "Intercept"),
                       set_prior("normal(0, 10)", class = "b"),
                       set_prior("cauchy(0, 2.5)", class = "sigma")),
             chains = 2, iter = 5000, warmup = 1000, cores = 2,
             control = list(adapt_delta = 0.95))
print(b14.1)
##  Family: gaussian (identity) 
##  Formula: div_obs | se(div_sd, sigma = TRUE) ~ 1 + R + A 
##     Data: dlist (Number of observations: 50) 
##  Samples: 2 chains, each with iter = 5000; warmup = 1000; thin = 1; 
##           total post-warmup samples = 8000
##     WAIC: Not computed
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    35.72      8.17    19.99    52.27       3876    1
## R             0.00      0.09    -0.18     0.18       3906    1
## A            -1.00      0.26    -1.52    -0.50       4135    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     1.08      0.19     0.73      1.5       5575    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
To my knowledge, you cannot specify start values as liberally in brms as you can in
rethinking.
Also, I have increased the SD in the intercept prior. The normal(0, 10) prior McElreath
used was quite informative and led to discrepancies between the rethinking and brms
results. You'll also note that with McElreath's strong prior, the estimate he reported in the
text for the intercept was 21.30, just on the outer edge of what such a prior would predict.
With the softer constraints on the intercept, the slopes in our brms model differ from those
in the text.
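As a rough check on that claim (my own arithmetic, not McElreath's), we can ask where 21.30 falls within a Normal(0, 10) prior:

```r
# How plausible is an intercept of 21.30 under a Normal(0, 10) prior?
z       <- 21.30 / 10                           # prior SDs above the mean
p_above <- 1 - pnorm(21.30, mean = 0, sd = 10)  # prior mass above 21.30
round(c(z = z, p_above = p_above), 3)           # z = 2.13, p_above ≈ 0.017
```

So the reported intercept sits a bit beyond two prior standard deviations out, with under 2% of the prior mass above it.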
Even so, I assert it's still worthwhile to take measurement error seriously.
Figure 14.2.a.
dfittedData <- data.frame(fitted(b14.1),
                          Divorce.SE = d$Divorce.SE,
                          Divorce    = d$Divorce,
                          A          = d$MedianAgeMarriage)

ggplot(data = dfittedData, aes(x = Divorce.SE, y = Estimate - Divorce)) +
  geom_hline(yintercept = 0, linetype = 2, color = "grey50") +
  geom_point(color = "white", alpha = .5, size = 2, shape = 16) +
  theme_black
Before we make Figure 14.2.b., we need to fit a model that ignores measurement error.
b14.1b <- brm(data = dlist, family = gaussian,
              div_obs ~ 1 + R + A,
              prior = c(set_prior("normal(0, 50)", class = "Intercept"),
                        set_prior("normal(0, 10)", class = "b"),
                        set_prior("cauchy(0, 2.5)", class = "sigma")),
              chains = 2, iter = 5000, warmup = 1000, cores = 2,
              control = list(adapt_delta = 0.95))
print(b14.1b)
##  Family: gaussian (identity) 
##  Formula: div_obs ~ 1 + R + A 
##     Data: dlist (Number of observations: 50) 
##  Samples: 2 chains, each with iter = 5000; warmup = 1000; thin = 1; 
##           total post-warmup samples = 8000
##     WAIC: Not computed
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    36.96      8.01    21.00    52.67       4005    1
## R            -0.06      0.08    -0.22     0.11       4105    1
## A            -1.00      0.26    -1.50    -0.49       4258    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     1.52      0.16     1.24     1.88       6126    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Figure 14.2.b.
nd <- data.frame(R      = rep(mean(d$Marriage), times = 30),
                 A      = seq(from = 23, to = 30.2, length.out = 30),
                 div_sd = rep(mean(d$Divorce.SE), times = 30))

dfittedLine  <- data.frame(fitted(b14.1,  newdata = nd), nd)
dfittedLineb <- data.frame(fitted(b14.1b, newdata = nd), nd)
ggplot(data = dfittedData, aes(x = A, y = Estimate)) +
  geom_ribbon(data = dfittedLineb,
              aes(ymin = X2.5.ile, ymax = X97.5.ile),
              fill = "grey50", alpha = 1/3) +
  geom_line(data = dfittedLineb,
            color = "grey50", linetype = 2) +
  geom_ribbon(data = dfittedLine,
              aes(ymin = X2.5.ile, ymax = X97.5.ile),
              fill = "royalblue", alpha = 1/3) +
  geom_line(data = dfittedLine,
            color = "royalblue") +
  geom_point(color = "white", alpha = .5, size = 2, shape = 16) +
  geom_segment(aes(xend = A,
                   y = Estimate - Est.Error,
                   yend = Estimate + Est.Error),
               color = "white", size = 1/4) +
  labs(x = "Median age marriage", y = "Divorce rate (posterior)") +
  coord_cartesian(xlim = c(23.3, 29.5), ylim = c(4, 14)) +
  theme_black
You might note two things. First, the different slopes in our brms model changed how the
model-implied estimates line up along the regression line for median marriage age. Second,
those differing slopes produced a much smaller gap between the measurement-error model
and the model that ignored measurement error.
14.1.2. Error on both outcome and predictor.
In brms, you specify error on predictors with an me() term of the form
me(predictor, sd_predictor).
dlist <- list(
  div_obs = d$Divorce,
  div_sd  = d$Divorce.SE,
  mar_obs = d$Marriage,
  mar_sd  = d$Marriage.SE,
  A       = d$MedianAgeMarriage)
b14.2 <- brm(data = dlist, family = gaussian,
             div_obs | se(div_sd, sigma = TRUE) ~ 1 + me(mar_obs, mar_sd) + A,
             prior = c(set_prior("normal(0, 50)", class = "Intercept"),
                       set_prior("normal(0, 10)", class = "b"),
                       set_prior("cauchy(0, 2.5)", class = "sigma")),
             save_mevars = TRUE,
             chains = 3, iter = 5000, warmup = 1000, cores = 3,
             control = list(adapt_delta = 0.95))  # 8.02222 seconds
print(b14.2)
##  Family: gaussian (identity) 
##  Formula: div_obs | se(div_sd, sigma = TRUE) ~ 1 + me(mar_obs, mar_sd) + A 
##     Data: dlist (Number of observations: 50) 
##  Samples: 3 chains, each with iter = 5000; warmup = 1000; thin = 1; 
##           total post-warmup samples = 12000
##     WAIC: Not computed
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept          35.80      8.25    19.59    52.18       5476    1
## A                  -1.01      0.26    -1.52    -0.49       6188    1
## memar_obsmar_sd     0.00      0.09    -0.18     0.18       4749    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     1.07       0.2     0.72     1.49      12000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Note that you'll need to specify save_mevars = TRUE in order to save the posterior samples
of the error-adjusted variables obtained with the me() term.
Here is the code for Figure 14.3.a.
dfittedData2 <- data.frame(fitted(b14.2),
                           Divorce.SE = d$Divorce.SE,
                           Divorce    = d$Divorce,
                           Marriage   = d$Marriage)

ggplot(data = dfittedData2, aes(x = Divorce.SE, y = Estimate - Divorce)) +
  geom_hline(yintercept = 0, linetype = 2, color = "grey50") +
  geom_point(color = "white", alpha = .5, size = 2, shape = 16) +
  theme_black
In order to get the posterior samples for error-adjusted Marriage rate, we'll need to use
posterior_samples.
post14.2 <- posterior_samples(b14.2)
Examine the object with head or str. You'll notice 50 Xme_memar_obsmar_sd[i] vectors,
each corresponding to one of the 50 states. You can get the mean for these using apply and
then put those results in the dfittedData2 data frame for plotting.
# Columns 5:54 of post14.2 hold the 50 Xme_memar_obsmar_sd[i] vectors
dfittedData2$MarriageFitted <- apply(post14.2[, 5:54], 2, mean)
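Positional indexing works here, but it breaks silently if the model gains or loses parameters. As a hypothetical illustration (a made-up data frame standing in for the actual post14.2 object), you can select those columns by name instead:

```r
# Mock posterior data frame mimicking posterior_samples() column names.
post <- data.frame(b_Intercept            = rnorm(100),
                   Xme_memar_obsmar_sd.1. = rnorm(100, mean = 20),
                   Xme_memar_obsmar_sd.2. = rnorm(100, mean = 25))
# Grab the error-adjusted vectors by name pattern, not by column position.
me_cols <- grep("^Xme_", names(post), value = TRUE)
apply(post[, me_cols], 2, mean)
```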
With that information, we can now make our version of Figure 14.3.b.
ggplot(data = dfittedData2, aes(x = MarriageFitted, y = Estimate)) +
  geom_segment(aes(xend = Marriage, yend = Divorce),
               color = "white", size = 1/4) +
  geom_point(color = "white", size = 2, shape = 20, alpha = .5) +
  geom_point(aes(x = Marriage, y = Divorce),
             color = "royalblue", size = 2, shape = 16) +
  labs(x = "Marriage rate (posterior)", y = "Divorce rate (posterior)") +
  coord_cartesian(xlim = c(14, 30.5), ylim = c(4, 14)) +
  theme_black
One consequence of our less-informative prior on the intercept appears to be less shrinkage
for the marriage-rate estimates than McElreath reported in the text.
14.2. Missing data
At this time, brms does not support Bayesian imputation the way rethinking does. But it
is on Bürkner's list of things to do, so keep checking in on brms updates.
Anyway, remove your objects.
rm(d, theme_black, b14.1, dfittedData, b14.1b, nd, dfittedLine, dfittedLineb,
dlist, b14.2, dfittedData2, post14.2)
Note. The analyses in this document were done with:

• R 3.3.2
• RStudio 1.0.136
• rmarkdown 1.3
• rethinking 1.59
• brms 1.5.1.9000
• ggplot2 2.2.1
References
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan.
Chapman & Hall/CRC Press.