Web Appendix to:
Metrics for covariate balance in cohort studies for causal effects
July 31, 2013
Jessica M Franklin, Jeremy A Rassen, Diana Ackermann, Dorothee B Bartels, and Sebastian
Schneeweiss
Web Appendix 1: Code for Balance Metrics
In this section, we provide R code for calculating each of the ten balance metrics
implemented in the simulation study. In addition, we provide a function for calculating the
estimated bias based on the observed covariate distributions, as in the simulations. In
general, the dat argument in each function is the name of the dataset where data are
stored. It is assumed that the exposure variable is in this dataset and is named X. The cov
argument in the functions that calculate imbalance one covariate at a time indicates on
which covariate the balance should be calculated. It can be specified either by the
appropriate column number or by the variable name in the dat dataset. Other function
arguments are explained below.
1. Absolute difference
abd <- function(dat, cov) {
  cov <- dat[,cov]
  abs(mean(cov[dat$X==1]) - mean(cov[dat$X==0]))
}
2. Standardized difference
std <- function(dat, cov) {
  cov <- dat[,cov]
  s <- sqrt((var(cov[dat$X==1]) + var(cov[dat$X==0]))/2)
  abs(mean(cov[dat$X==1]) - mean(cov[dat$X==0]))/s
}
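By construction, the standardized difference is the absolute difference divided by the pooled standard deviation. The following self-contained check illustrates this on simulated data (abd and std are restated so the snippet runs on its own; the dataset and covariate name C1 are illustrative, not the paper's simulation design):

```r
abd <- function(dat, cov) {
  cov <- dat[,cov]
  abs(mean(cov[dat$X==1]) - mean(cov[dat$X==0]))
}
std <- function(dat, cov) {
  cov <- dat[,cov]
  s <- sqrt((var(cov[dat$X==1]) + var(cov[dat$X==0]))/2)
  abs(mean(cov[dat$X==1]) - mean(cov[dat$X==0]))/s
}
## Illustrative data: exposure X depends on covariate C1, so some
## imbalance is expected
set.seed(2)
dat <- data.frame(C1 = rnorm(500))
dat$X <- rbinom(500, 1, plogis(dat$C1))
## std should equal abd divided by the pooled SD
s <- sqrt((var(dat$C1[dat$X==1]) + var(dat$C1[dat$X==0]))/2)
stopifnot(all.equal(std(dat, "C1"), abd(dat, "C1")/s))
```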
3. Overlapping coefficient
ovl <- function(dat, cov) {
  cov <- dat[,cov]
  if(length(unique(cov)) <= 10) { # discrete covariate: compare category proportions
    pt <- apply(prop.table(table(cov, dat$X), 2), 1, min)
    return(1-sum(pt)) # reversed to measure imbalance
  }
  ## extend the evaluation range 25% of the observed range beyond each
  ## extreme so the density tails are covered (multiplying min and max
  ## by 1.25 shrinks the range when min(cov) > 0)
  r <- max(cov) - min(cov)
  mn <- min(cov) - 0.25*r
  mx <- max(cov) + 0.25*r
  f1 <- approxfun(density(cov[dat$X==1], from=mn, to=mx, bw="nrd"))
  f0 <- approxfun(density(cov[dat$X==0], from=mn, to=mx, bw="nrd"))
  fn <- function(x) pmin(f1(x), f0(x))
  s <- try(integrate(fn, lower = mn, upper = mx,
                     subdivisions = 500)$value, silent = TRUE)
  if(inherits(s, "try-error")) NA else 1-s # reversed to measure imbalance
}
4. K-S distance
ksd <- function(dat, cov) {
  cov <- dat[,cov]
  F1 <- ecdf(cov[dat$X==1])
  F0 <- ecdf(cov[dat$X==0])
  max(abs(F1(cov) - F0(cov)))
}
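The value returned by ksd coincides with the two-sample Kolmogorov-Smirnov statistic, so it can be cross-checked against stats::ks.test. A self-contained sketch (ksd restated; the simulated data are illustrative):

```r
ksd <- function(dat, cov) {
  cov <- dat[,cov]
  F1 <- ecdf(cov[dat$X==1])
  F0 <- ecdf(cov[dat$X==0])
  max(abs(F1(cov) - F0(cov)))
}
## Illustrative continuous covariate (no ties, so ks.test gives an
## exact statistic)
set.seed(3)
dat <- data.frame(C1 = rnorm(400))
dat$X <- rbinom(400, 1, 0.5)
d1 <- ksd(dat, "C1")
d2 <- unname(ks.test(dat$C1[dat$X==1], dat$C1[dat$X==0])$statistic)
stopifnot(all.equal(d1, d2))
```

The equality holds because the supremum of |F1 - F0| over the pooled sample points, which ksd evaluates, is exactly the two-sample K-S statistic.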
5. Lévy distance
ld <- function(dat, cov) {
  cov <- dat[,cov]
  F1 <- ecdf(cov[dat$X==1])
  F0 <- ecdf(cov[dat$X==0])
  e <- max(abs(F1(cov) - F0(cov))) # K-S distance, an upper bound for the Levy distance
  if(length(unique(cov)) <= 10) return(e)
  x <- seq(min(cov), max(cov), length.out=1000)
  levy.ok <- function(e) all(F0(x-e) - e <= F1(x) & F1(x) <= F0(x+e) + e)
  ## shrink e in steps of 0.01 for as long as the Levy condition still
  ## holds, returning the smallest e on the grid that satisfies it
  while(e - .01 > 0 && levy.ok(e - .01)) e <- e - .01
  e
}
6. Mahalanobis balance
### covs should be a reduced datset that contains only those covariates
# that will be used for calculating Mahalanobis balance, for example,
# covs=dat[,1:6]
### trt should be the exposure variable, for example, trt=dat$X
mhb <- function(covs, trt) {
  S <- cov(covs)
  Sinv <- solve(S)
  x1 <- colMeans(covs[trt==1,])
  x0 <- colMeans(covs[trt==0,])
  sum((t(x1-x0) %*% Sinv) * (x1-x0))
}
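mhb computes the squared Mahalanobis distance between the exposed and unexposed mean vectors, so it can be cross-checked against stats::mahalanobis. A self-contained sketch (mhb restated; the two-covariate dataset is illustrative):

```r
mhb <- function(covs, trt) {
  S <- cov(covs)
  Sinv <- solve(S)
  x1 <- colMeans(covs[trt==1,])
  x0 <- colMeans(covs[trt==0,])
  sum((t(x1-x0) %*% Sinv) * (x1-x0))
}
## Illustrative data: two continuous covariates, random exposure
set.seed(7)
covs <- data.frame(C1 = rnorm(300), C2 = rnorm(300))
trt  <- rbinom(300, 1, 0.5)
m1 <- mhb(covs, trt)
## stats::mahalanobis returns the squared distance of a point from a
## center; here the "point" is the exposed mean vector
m2 <- mahalanobis(colMeans(covs[trt==1,]), colMeans(covs[trt==0,]), cov(covs))
stopifnot(all.equal(m1, unname(m2)))
```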
7. L1 measure
### X is the exposure
### covs is a reduced dataset that contains only those covariates that
# will be used for calculating the L1 measure (see cem documentation)
library(cem)
L1.meas(X, covs)$L1
8. L1 median
### In the simulations, we calculated medSv in each unmatched dataset,
# and then used that stratification in all matched samples based on
# that dataset to calculate the L1 median.
### In the code below, X1 is the exposure in a matched sample and covs1
# is the covariate matrix in the matched sample.
library(cem)
medSv <- L1.profile(X, covs, plot = FALSE, M = 101)
L1.meas(X1, covs1, breaks=medSv$medianCP)$L1
9. C-statistic
### dat should contain a variable X which defines the exposure and
# variables PS1 and PS2 that define the two estimated propensity scores
### ps = 1 or 2 identifies which PS should be used for calculating the
# C-statistic
library(ROCR)
c.stat <- function(dat, ps) {
  if(!ps %in% c(1, 2)) stop("ps must be 1 or 2")
  prd <- if(ps == 1) prediction(dat$PS1, dat$X) else prediction(dat$PS2, dat$X)
  unlist(performance(prd, "auc")@y.values) - 0.5 # standardized so that 0 = no discrimination
}
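When ROCR is not available, the same AUC can be obtained from the Mann-Whitney rank relation. A self-contained sketch (the function name auc.rank and the simulated scores are illustrative, not part of the original appendix code):

```r
## Rank-based AUC (Mann-Whitney relation); mid-ranks handle ties
auc.rank <- function(ps, x) {
  n1 <- sum(x == 1)
  n0 <- sum(x == 0)
  r <- rank(ps)
  (sum(r[x == 1]) - n1*(n1 + 1)/2) / (n1*n0)
}
## Illustrative data: scores that discriminate the exposure somewhat
set.seed(1)
x  <- rbinom(500, 1, 0.4)
ps <- plogis(0.8*x + rnorm(500))
## Brute-force pairwise definition of the AUC for comparison
auc.pair <- mean(outer(ps[x==1], ps[x==0], ">") +
                 0.5*outer(ps[x==1], ps[x==0], "=="))
stopifnot(all.equal(auc.rank(ps, x), auc.pair))
auc.rank(ps, x) - 0.5 # standardized as in c.stat above
```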
10. General weighted difference
### dat is the dataset as in metrics 1-5
### nc is the number of covariates to be used for calculation. It is
# assumed that the covariates will be in positions 1-nc in the dat
# dataframe
### sf, the pooled standard deviation used in the standardized
# difference (metric 2), is not defined in the code above; it is
# reconstructed here so that gwd runs as written
sf <- function(cov, dat) sqrt((var(cov[dat$X==1]) + var(cov[dat$X==0]))/2)
gwd <- function(dat, nc) {
  C <- dat[,1:nc]
  for(i in 1:nc) { # append all pairwise products, including squares
    for(j in i:nc) C <- cbind(C, dat[,i]*dat[,j])
  }
  m1 <- colMeans(C[dat$X==1,])
  m0 <- colMeans(C[dat$X==0,])
  s <- apply(C, 2, sf, dat = dat)
  b <- c(rep(1, nc), rep(.5, nc*(nc+1)/2)) # weight 1 for main terms, 0.5 for products
  mean(b*abs(m1-m0)/s)
}
Web Appendix 2: Full Simulation Specifications
In addition to the binary event simulation described in the text, we repeated the
simulations using a continuous outcome and a Poisson event count. All simulation
specifications were identical across the differing outcome types except for the outcome-generating model.
Specifically, when Yᵢ was generated as a continuous variable, we used the linear
model:
μᵢ = β₀ + βᵀCᵢ + β_X Xᵢ,   Yᵢ ~ N(μᵢ, σ²),
where all β parameters now represent the linear effect of treatment or confounders and σ²
was chosen in each scenario so that the variance explained in the outcome model was 25%
of the total variability (σ² = 3σ_μ²).
When Yᵢ was generated as a Poisson event count, we used the log-linear model:
log{E(Yᵢ)} = β₀ + βᵀCᵢ + β_X Xᵢ,
where the β parameters now represent log-rate-ratio effects of treatment or confounders.
In both the linear and Poisson models, the same simulation scenarios were run, using the
values presented in Table 1 of the text.
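The outcome-generating step under these two models can be sketched as follows (the linear predictor and its coefficients are illustrative placeholders, not the scenario values from Table 1):

```r
set.seed(11)
n   <- 10000
X   <- rbinom(n, 1, 0.5)     # exposure
C1  <- rnorm(n)              # a single illustrative confounder
mu  <- 0.2 + 0.5*C1 + 0.3*X  # linear predictor with illustrative betas
## Continuous outcome: residual variance chosen so the model explains
## 25% of the total variability, i.e. sigma2 = 3*Var(mu)
sigma2 <- 3*var(mu)
Y.cont <- rnorm(n, mean = mu, sd = sqrt(sigma2))
## Poisson outcome: log-linear model, log E(Y) = mu
Y.pois <- rpois(n, lambda = exp(mu))
## Empirical variance explained should be close to 25%
var(mu) / var(Y.cont)
```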
Web Appendix 3: Full Simulation Results
In this section, we provide figures that show example datasets from each of the
simulation scenarios. In each of Web Figures 1–5, we have plotted the PS distribution in
exposed and unexposed simulated patients before and after matching on PS1 (left) or PS2
(right). Unmatched data are in the top panel, and lower panels are matched with successively
decreasing calipers. The average number of treated patients in each sample across
simulated datasets is shown in the upper left corner. In Web Figures 6–10, we have
plotted the mean and 95% quantile bars for bias (x-axis) and each covariate imbalance
metric (y-axis) in the unmatched data (the rightmost point in each plot) and in each
matched sample (moving to the left as matching was performed with increasingly tight
calipers), matched on PS1 (left) and on PS2 (right). The linear correlation (ρ) and variation
explained (R²) for each metric are in the lower right corners. The association measures for
estimated bias are at the top of each panel.
Web Figure 1: Base case (Scenario 1) results, binary outcome (Figure 2 from the text).
Web Figure 2: Nonlinear outcome (Scenario 2) results, binary outcome.
Web Figure 3: Nonlinear outcome and exposure (Scenario 3) results, binary outcome.
Web Figure 4: Redundant covariates (Scenario 4) results, binary outcome.
Web Figure 5: Instrumental variables (Scenario 5) results, binary outcome.
Web Figure 6: Low exposure prevalence (Scenario 6) results, binary outcome.
Web Figure 7: Small study size (Scenario 7) results, binary outcome.
Web Figure 8: Base case (Scenario 1) results, Poisson outcome.
Web Figure 9: Nonlinear outcome (Scenario 2) results, Poisson outcome.
Web Figure 10: Nonlinear outcome and exposure (Scenario 3) results, Poisson outcome.
Web Figure 11: Redundant covariates (Scenario 4) results, Poisson outcome.
Web Figure 12: Instrumental variables (Scenario 5) results, Poisson outcome.
Web Figure 13: Low exposure prevalence (Scenario 6) results, Poisson outcome.
Web Figure 14: Small study size (Scenario 7) results, Poisson outcome.
Web Figure 15: Base case (Scenario 1) results, continuous outcome.
Web Figure 16: Nonlinear outcome (Scenario 2) results, continuous outcome.
Web Figure 17: Nonlinear outcome and exposure (Scenario 3) results, continuous outcome.
Web Figure 18: Redundant covariates (Scenario 4) results, continuous outcome.
Web Figure 19: Instrumental variables (Scenario 5) results, continuous outcome.
Web Figure 20: Low exposure prevalence (Scenario 6) results, continuous outcome.
Web Figure 21: Small study size (Scenario 7) results, continuous outcome.