Posterior-predictive checking with brms predict:
Working with simulation iterations
A. Solomon Kurz
3/15/2017
Starting out with the data and the model
Let's simulate our data. These data and the model to follow are based heavily on
McElreath's Statistical Rethinking text, chapter 4.
set.seed(393700787) # Setting our seed to make the results reproducible

d <- data.frame(ID = 1:100,
                male = rep(0:1, times = 50),
                lbs = rnorm(100, mean = 99.19, sd = 14.23))
d$inches <- rnorm(100, mean = 48.31 + .11*d$lbs + 2.56*d$male, sd = 1.68)
d$sex <- ifelse(d$male == 0, "female", "male") # This will come in handy later
So we've got a data frame with height and weight values for 100 individuals evenly split by
male/female. The goal of our initial model is to predict height with weight. Here it is in
brms.
library(brms)
b1 <- brm(data = d, family = "gaussian",
          inches ~ 1 + lbs,
          prior = c(set_prior("normal(50, 10)", class = "Intercept"),
                    set_prior("normal(0, 5)", class = "b"),
                    set_prior("cauchy(0, 1)", class = "sigma")),
          chains = 4, iter = 2000, warmup = 500, cores = 4)
plot(b1)
print(b1)
##   Family: gaussian (identity)
##  Formula: inches ~ 1 + lbs
##     Data: d (Number of observations: 100)
##  Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
##           total post-warmup samples = 6000
##     WAIC: Not computed
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    49.98      1.56    46.94    53.05       6000    1
## lbs           0.11      0.02     0.08     0.14       6000    1
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     2.32      0.16     2.02     2.68       6000    1
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The chains look good and the model estimates look sensible.
Model-implied predictions with predict
With the predict function you can get summary statistics for model-implied criterion values.
By default, predict returns criterion values based on the values of the predictor(s) in the data
frame used to fit the model. Because our data frame, d, had 100 cases, predict has 100 predictor values to use.
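Though we won't use it here, predict() also accepts a newdata argument if you want predictions at predictor values other than the ones in d. Here is a minimal sketch; the data frame nd and its three weights are arbitrary values of my own choosing, not part of the workflow below.

# A minimal sketch (not run): predictions at three hypothetical weights
# rather than the 100 observed ones.
nd <- data.frame(lbs = c(80, 100, 120))
predict(b1, newdata = nd)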
By default, predict estimates n synthetic sets of criterion values, where n = the total number of
iterations minus those used in warmup. We fit our model with 4 chains, each of which had
1500 post-warmup iterations following 500 warmup iterations. Therefore, we have 6000 (i.e., 1500*4 =
6000) total sets of model-implied criterion values. When you specify summary = F in predict,
you get the results of all those simulations. For convenience, we'll go ahead and put them in a
data frame and index each row as an iteration.
predb1 <- data.frame(Iter = 1:6000,
                     predict(b1, summary = F))
str(predb1)
Each row corresponds to the simulated data based on one of the HMC iterations, which
we've identified with the first vector, Iter. The subsequent 100 vectors are the individual
cases. You can get basic summary information using the apply function.
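For example, here is a minimal sketch of pulling per-case means and 95% intervals straight from the draws (columns 2 through 101 hold the 100 cases):

# Column-wise summaries of the 6000 draws for each of the 100 cases
apply(predb1[, 2:101], 2, mean)                             # posterior-predictive means
apply(predb1[, 2:101], 2, quantile, probs = c(.025, .975))  # 95% intervals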
With your giant data frame of simulated values, you could also make a more compact
summary data frame that corresponds nicely with predict without the summary = F
argument.
dfsummary <- data.frame(mean = apply(predb1[, 2:101], 2, mean),
                        sd   = apply(predb1[, 2:101], 2, sd),
                        ll   = apply(predb1[, 2:101], 2, quantile, .025),
                        ul   = apply(predb1[, 2:101], 2, quantile, .975))
dfpred1withoutsummaryF <- data.frame(predict(b1))
head(dfsummary)
##        mean       sd       ll       ul
## X1 59.39038 2.328336 54.83234 64.10416
## X2 59.44288 2.320962 54.94083 64.17259
## X3 61.56093 2.357847 56.91136 66.13335
## X4 60.11286 2.335895 55.55721 64.76826
## X5 62.27618 2.362836 57.54220 66.86371
## X6 61.28375 2.338140 56.64375 65.91954
head(dfpred1withoutsummaryF)
##   Estimate Est.Error X2.5.ile X97.5.ile
## 1 59.41117  2.350736 54.80449  64.09160
## 2 59.46774  2.317751 54.90343  63.95555
## 3 61.58379  2.355710 57.04271  66.25112
## 4 60.09462  2.324144 55.67653  64.68663
## 5 62.24553  2.368566 57.67773  66.86274
## 6 61.26609  2.332270 56.62512  66.00718
See?
But reducing these data into simple summaries seems like a waste of such a rich data
frame. We can do better.
To make this data set easier to work with in ggplot2, we might employ the reshape2
package to convert it to the long format. You can learn more about going from the wide to
the long format with reshape2 here and here.
library(reshape2)
predb1 <- data.frame(melt(data = predb1, id.vars = "Iter"))
head(predb1)
##   Iter variable    value
## 1    1       X1 62.16708
## 2    2       X1 62.26839
## 3    3       X1 57.03063
## 4    4       X1 59.21549
## 5    5       X1 64.68024
## 6    6       X1 57.05494
Our reformatted data frame is already ready to go. But a little extra data processing might
make it easier to work with.
# We might reformat variable, our ID vector, to match the ID in the original data set
predb1$variable <- as.integer(gsub("X", "", predb1$variable))

# We might rename the value vector to something more descriptive
names(predb1)[3] <- 'Estimate'

# Finally, it can be handy to merge the synthetic data with the original values
predb1 <- merge(x = d, y = predb1, by.x = "ID", by.y = "variable")
head(predb1)
##   ID male      lbs   inches    sex Iter Estimate
## 1  1    0 88.47016 55.91295 female    1 62.16708
## 2  1    0 88.47016 55.91295 female    2 62.26839
## 3  1    0 88.47016 55.91295 female    3 57.03063
## 4  1    0 88.47016 55.91295 female    4 59.21549
## 5  1    0 88.47016 55.91295 female    5 64.68024
## 6  1    0 88.47016 55.91295 female    6 57.05494
Plotting
To get an overall picture, you can compare the original distribution of the criterion to model-implied random draws.
Let i index the random draws you'd like to plot.
i <- 1:50
library(ggplot2) # in case ggplot2 isn't already attached along with brms

ggplot(data = d, aes(x = inches)) +
  geom_density(color = "purple", size = 2) +
  geom_density(data = predb1[predb1$Iter %in% i, ],
               aes(x = Estimate, group = Iter), size = .1) +
  scale_y_continuous(NULL, breaks = NULL) +
  theme(panel.grid = element_blank())
And you can facet the plot by a grouping variable, like sex.
ggplot(data = d, aes(x = inches)) +
  geom_density(aes(color = sex), size = 2) +
  scale_color_manual(values = c("red3", "blue3")) +
  geom_density(data = predb1[predb1$Iter %in% i, ],
               aes(x = Estimate, group = Iter), size = .1) +
  scale_y_continuous(NULL, breaks = NULL) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  facet_wrap(~sex)
But the beauty of these data is that you can get a more refined sense of how the model fits
at the single-case level. Let's start with box plots.
Let i index the subset of cases you'd like to examine.
i <- 1:40
ggplot(data = predb1[predb1$ID %in% i, ], aes(x = factor(ID), y = Estimate)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(aes(y = inches), size = 1, color = "purple") +
  theme(panel.grid = element_blank())
Violin plots might give us a better feel for the shapes of the distributions.
ggplot(data = predb1[predb1$ID %in% i, ], aes(x = factor(ID), y = Estimate)) +
  geom_violin(fill = "black", size = 0) +
  geom_point(aes(y = inches), size = 1, color = "purple") +
  theme(panel.grid = element_blank())
You can reorder these or the box plots by the medians of the simulated values.
ggplot(data = predb1[predb1$ID %in% i, ],
       aes(x = reorder(factor(ID), Estimate, FUN = median), y = Estimate)) +
  geom_violin(fill = "black", size = 0) +
  geom_point(aes(y = inches), size = 1, color = "purple") +
  theme(panel.grid = element_blank())
Or you might reorder them by the original criterion value, inches.
ggplot(data = predb1[predb1$ID %in% i, ],
       aes(x = reorder(factor(ID), inches), y = Estimate)) +
  geom_violin(fill = "black", size = 0) +
  geom_point(aes(y = inches), size = 1, color = "purple") +
  theme(panel.grid = element_blank())
And you can facet these by sex.
ggplot(data = predb1[predb1$ID %in% i, ],
       aes(x = reorder(factor(ID), inches), y = Estimate)) +
  geom_violin(fill = "black", size = 0) +
  geom_point(aes(y = inches, color = sex), size = 1) +
  scale_color_manual(values = c("red3", "blue3")) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  facet_wrap(~sex)
If you wanted a more granular analysis with, say, histograms:
i <- 1:9
ggplot(data = predb1[predb1$ID %in% i, ], aes(x = Estimate)) +
  geom_histogram(binwidth = 1) +
  geom_vline(aes(xintercept = inches, color = sex)) +
  scale_color_manual(values = c("red3", "blue3")) +
  facet_wrap(~ID) +
  theme(panel.grid = element_blank(),
        legend.position = "none")
Heck, you could even do something like case-level trace plots for the synthetic draws.
ggplot(data = predb1[predb1$ID %in% i, ], aes(x = Iter, y = Estimate)) +
  geom_line(size = .1) +
  geom_hline(aes(yintercept = inches, color = sex)) +
  scale_color_manual(values = c("red3", "blue3")) +
  facet_wrap(~ID) +
  theme(panel.grid = element_blank(),
        legend.position = "none")
And again, perhaps we might reorder them by the original values.
i <- 1:25 # increasing the number of cases, just for kicks
ggplot(data = predb1[predb1$ID %in% i, ], aes(x = Iter, y = Estimate)) +
  geom_line(size = .1) +
  geom_hline(aes(yintercept = inches, color = sex)) +
  scale_color_manual(values = c("red3", "blue3")) +
  facet_wrap(~reorder(ID, inches)) +
  theme(panel.grid = element_blank(),
        legend.position = "none")
Back to the drawing board
The common theme across all the plots we faceted by sex is that model b1 systematically
overpredicts height for females and underpredicts it for males. We can formally address
that with a multivariable model.
b2 <- brm(data = d, family = "gaussian",
          inches ~ 1 + lbs + male,
          prior = c(set_prior("normal(50, 10)", class = "Intercept"),
                    set_prior("normal(0, 5)", class = "b"),
                    set_prior("cauchy(0, 1)", class = "sigma")),
          chains = 4, iter = 2000, warmup = 500, cores = 4)
LOO(b1, b2)
##          LOOIC    SE
## b1      455.42 13.82
## b2      398.13 14.05
## b1 - b2  57.29 10.97
Based on the LOO, we have good reason to suspect that our multivariable model, b2, will make
better out-of-sample predictions than the simple univariate model, b1; the LOOIC difference of
about 57 is more than five times its standard error. We can follow the same steps as before to
see how well this model does for within-sample predictions, too.
Data processing.
predb2 <- data.frame(Iter = 1:6000,
                     predict(b2, summary = F))
predb2 <- data.frame(melt(data = predb2, id.vars = "Iter"))
# Cleaning up the ID variable
predb2$variable <- as.integer(gsub("X", "", predb2$variable))
# Renaming a vector
names(predb2)[3] <- 'Estimate'
# Merging the synthetic data with the original values
predb2 <- merge(d, predb2, by.x = "ID", by.y = "variable")
Plotting our new model-implied synthetic data
Starting off with a group-level plot. Once again, i indexes which simulations we want to
compare with the original data.
i <- 1:50
ggplot(data = d, aes(x = inches)) +
  geom_density(aes(color = sex), size = 2) +
  scale_color_manual(values = c("red3", "blue3")) +
  geom_density(data = predb2[predb2$Iter %in% i, ],
               aes(x = Estimate, group = Iter), size = .1) +
  scale_y_continuous(NULL, breaks = NULL) +
  theme(panel.grid = element_blank(),
        legend.position = "none") +
  facet_wrap(~sex)
These simulations do a much better job of matching up with the original data values by sex.
Now let's refocus our analysis on the single-case level.
i <- 1:25
ggplot(data = predb2[predb2$ID %in% i, ], aes(x = Iter, y = Estimate)) +
  geom_line(size = .1) +
  geom_hline(aes(yintercept = inches, color = sex)) +
  scale_color_manual(values = c("red3", "blue3")) +
  facet_wrap(~reorder(ID, inches)) +
  theme(panel.grid = element_blank(),
        legend.position = "none")
Still not perfect, but better. We’d probably improve our estimates at the single-case level
with time series data and a multilevel model.
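For instance, if we had repeated measurements per person, a multilevel analogue of b2 might look something like the sketch below. This is only an illustration: the data frame d_long and the model b3 are hypothetical, assuming a long-format data set with multiple inches/lbs measurements per ID, which we did not simulate here.

# A minimal sketch, not run: b2 extended with person-specific intercepts,
# fit to a hypothetical repeated-measures data frame d_long
b3 <- brm(data = d_long, family = "gaussian",
          inches ~ 1 + lbs + male + (1 | ID),
          prior = c(set_prior("normal(50, 10)", class = "Intercept"),
                    set_prior("normal(0, 5)", class = "b"),
                    set_prior("cauchy(0, 1)", class = "sigma"),
                    set_prior("cauchy(0, 1)", class = "sd")),
          chains = 4, iter = 2000, warmup = 500, cores = 4)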
Remove your objects.
rm(d, b1, predb1, dfsummary, dfpred1withoutsummaryF, i, b2, predb2)
Note. The analyses in this document were done with:
• R 3.3.2
• RStudio 1.0.136
• rmarkdown 1.3
• brms 1.5.1.9000
• rstan 2.14.1
• ggplot2 2.2.1
References
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan.
Chapman & Hall/CRC Press.