eAppendix for:
Distributed Lag Models
Examining Associations Between the Built Environment and Health
Jonggyu Baek, Brisa N. Sánchez, Veronica J. Berrocal, and Emma V. Sanchez-Vaznaugh
SIMULATION
We performed a small scale simulation study to improve our understanding of estimation and
inference of associations of interest in DLMs. In particular, we wanted to assess how estimates
of the associations of interest are affected by the degree of spatial correlation in the built
environment, and the shape of association between measured environment factors and an
outcome across distance. Further, we compared results obtained from DLMs to those obtained
from traditional approaches based on linear models whose goal is to estimate the average
association between features of the built environment and an outcome up to a-priori specified
distances.
For our simulations, we used as spatial domain the square (0, 500) × (0, 500). In the
square, we simulated food store locations (e.g., features of the built environment) by simulating
realizations from an inhomogeneous Poisson point process. The intensity of the inhomogeneous
Poisson process was taken to be a realization of a log Gaussian process with mean $\mu_x$, marginal
variance $\sigma_x^2$, and exponential correlation function. In other words, the correlation between the log
intensity at two points on the 500 × 500 grid is given by $\exp(-d/\phi)$, where $d$ is the distance
between two points and $\phi$ is the decay parameter, which controls the rate at which the correlation decays.
We considered three scenarios for the spatial variability of the intensity function: 1) the
marginal variance of the intensity function $\sigma_x^2$ is set equal to 0; this implies that the intensity is
constant over space and store locations are realizations of a homogeneous Poisson point process
with intensity equal to $\log(\mu_x)$; 2) $\sigma_x^2 = 1$ and $\phi = 5/3$; this corresponds to an intensity with a
spatial correlation that is equal to 0.05 when the distance between two points is equal to 5 units,
resulting in sampled food stores that display a small amount of clustering; and 3) $\sigma_x^2 = 1$ and
$\phi = 20/3$; this corresponds to an intensity function with a correlation that decays to 0.05 at a
distance of 20 units, resulting in sampled food stores that display a large amount of clustering. In
each case, the mean of the log Gaussian process used to simulate the intensity of the
inhomogeneous Poisson process was taken to be equal to 0.15 (see Figure 2 in the manuscript).
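As a quick arithmetic check of the two clustering scenarios, the sketch below evaluates the exponential correlation function at the stated distances. The function name `exp_correlation` is introduced here for illustration only.

```python
import numpy as np

def exp_correlation(d, phi):
    """Exponential correlation exp(-d / phi) between log intensities at distance d."""
    return np.exp(-np.asarray(d, dtype=float) / phi)

# Scenario 2: phi = 5/3 evaluated at d = 5; scenario 3: phi = 20/3 evaluated at d = 20.
corr_small = float(exp_correlation(5.0, 5.0 / 3.0))
corr_large = float(exp_correlation(20.0, 20.0 / 3.0))

# Both reduce to exp(-3), which is approximately 0.05.
print(round(corr_small, 4), round(corr_large, 4))
```

In both scenarios the ratio $d/\phi$ equals 3, so the two settings differ only in the distance at which the correlation has decayed to about 0.05.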
For each of the three built environment settings, we simulated one realization of the built
environment; however, given a realization of the built environment, we simulated 1000 datasets
with different locations for the health outcomes (e.g., the schools in our motivating application)
and different outcome values (e.g., children's BMIz at the various schools).
To simulate school locations within the (0, 500) × (0, 500) region, we proceeded as
follows: we sampled the coordinates $(x_i, y_i)$ of $n \in \{1{,}000, 6{,}000\}$ schools from a Uniform(0, 500)
distribution, for $i = 1, \ldots, n$. After counting the number of built environment locations
around each outcome location, we obtained, for each location $i$, the counts $X_i(r_{l-1}; r_l)$. We
used these counts to generate values of the outcome $Y_i$ by sampling from the model
$Y_i = \sum_{l=1}^{L} \beta(r_l) X_i(r_{l-1}; r_l) + \epsilon_i$, where $r_0 = 0$, $r_L = 10$, $L = 100$ and $\epsilon_i \sim N(0, \tau^2)$. We used two
function shapes for $\beta(r)$: 1) a step function given by $\beta(r) = 0.1$ if $r \le 5$ and 0 otherwise, which
results in the true data generating model $Y_i = 0.1 X_i(0; 5) + \epsilon_i$ (Figure 3A in the manuscript),
and 2) a smooth function $\beta(r) = 0.1 f_N(r)/f_N(0)$, where $f_N(r)$ is a normal density function with
mean 0 and standard deviation 5/3 (Figure 3B in the manuscript). Note that in the
traditional models used to study the effect of the built environment on health, the tacit
assumption is that the effect of the environment on health can be described by a step function of
distance; in other words, the association $\beta(r_{l-1}; r_l)$ is deemed constant up to a specified distance
$r_k$ but is zero beyond $r_k$. A step function $\beta(r)$ is likely unrealistic since it is hard to believe that
the association abruptly vanishes beyond distance 5, yet this assumption is frequently (implicitly)
made in practice. In contrast, the second function used for $\beta(r)$ implies that the association
decays smoothly with distance and is near zero by distance 5. We chose the variance $\tau^2$ of the
error term so that the model $R^2$ was equal to either 0.2, 0.5 or 0.8 for the three different built
environment schemes. In our motivating example the number of available schools is near 6,000,
and the model $R^2$ was near 0.2 when the DLM was fitted without adjustment for confounders.
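The data-generating steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: stores come from a homogeneous Poisson process (the no-clustering scenario) with an illustrative intensity of 0.15 stores per unit area, the window is shrunk to 100 × 100 to keep the example fast, and $\tau^2$ is chosen by assuming $R^2 = \mathrm{Var}(\text{signal}) / (\mathrm{Var}(\text{signal}) + \tau^2)$. The helper names `ring_counts` and `simulate_outcome` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed (assumption)

L, r_max = 100, 10.0
edges = np.arange(0, L + 1) * r_max / L   # r_0 = 0, r_1 = 0.1, ..., r_L = 10
r = edges[1:]                             # outer radius r_l of each ring

# The two shapes used for beta(r): a step at distance 5 and a rescaled
# normal density with SD 5/3, i.e. 0.1 * f_N(r) / f_N(0).
beta_step = np.where(r <= 5.0, 0.1, 0.0)
beta_curve = 0.1 * np.exp(-0.5 * (r / (5.0 / 3.0)) ** 2)

def ring_counts(schools, stores, edges):
    """X[i, l-1] = number of stores whose distance from school i falls in ring l."""
    X = np.empty((len(schools), len(edges) - 1))
    for i, s in enumerate(schools):
        d = np.hypot(stores[:, 0] - s[0], stores[:, 1] - s[1])
        X[i], _ = np.histogram(d, bins=edges)  # stores beyond r_L are ignored
    return X

def simulate_outcome(X, beta, r2, rng):
    """Y_i = sum_l beta(r_l) X_i(r_{l-1}; r_l) + eps_i, with tau^2 set from a target R^2."""
    signal = X @ beta
    tau2 = signal.var() * (1.0 - r2) / r2   # assumed definition of the model R^2
    return signal + rng.normal(0.0, np.sqrt(tau2), size=len(signal))

# Small illustration on a 100 x 100 window (the paper uses 500 x 500).
side = 100.0
n_stores = rng.poisson(0.15 * side * side)
stores = rng.uniform(0.0, side, size=(n_stores, 2))
schools = rng.uniform(0.0, side, size=(200, 2))

X = ring_counts(schools, stores, edges)
Y = simulate_outcome(X, beta_curve, r2=0.2, rng=rng)
```

Schools near the border of the window see fewer stores than interior schools; the sketch makes no edge correction.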
In fitting DLMs, we chose 100 lags, $L = 100$, with $r_L = 10$. We fitted the model within a
Bayesian framework and specified the following prior distributions: $\beta_0 \propto 1$, $\boldsymbol{\alpha} \propto \mathbf{1}$,
$\mathbf{b}_1 \sim N(\mathbf{0}, \sigma_b^2 \mathbf{I}_{L-2})$, $\sigma_b^2 \sim IG(0.1, 1 \times 10^{-6})$, and $\tau^2 \sim IG(0.1, 1 \times 10^{-6})$. Details on posterior
inference and the MCMC algorithm are provided in the next section. For comparison, we also
fitted the traditional linear model, $Y_i = \beta_0 + \beta_1 X_i(0; r_k) + \epsilon_i$, which assumes a constant effect of
the built environment up to a distance $r_k$. We used $r_k = 2.5$, 5, and 7.5, respectively, and
compared the estimate of $\beta_1$ with the estimate of $\bar{\beta}(0; r_k)$ obtained from the DLM for these three
distances.
To examine how well DLMs capture true buffer effects at given distance lags, bias,
variance, mean squared error (MSE), and coverage rate were calculated at each $r_l$, $l = 1, 2, \ldots, L$,
using the formulas:
$$\mathrm{Bias}(r_l) = \sum_{i=1}^{1000} \left( \hat{\beta}_i(r_{l-1}; r_l) - \beta(r_{l-1}; r_l) \right) / 1000,$$
$$\mathrm{Var}(r_l) = \sum_{i=1}^{1000} \widehat{\mathrm{Var}}\left( \hat{\beta}_i(r_{l-1}; r_l) \right) / 1000,$$
$$\mathrm{MSE}(r_l) = \sum_{i=1}^{1000} \left( \hat{\beta}_i(r_{l-1}; r_l) - \beta(r_{l-1}; r_l) \right)^2 / 1000,$$
$$\mathrm{Coverage}(r_l) = \sum_{i=1}^{1000} I\left( \hat{\beta}_{i,2.5\%}(r_l) \le \beta(r_l) \le \hat{\beta}_{i,97.5\%}(r_l) \right).$$
To summarize their overall performance and compare DLMs with classical regression models,
we calculated the integrated MSE, $\mathrm{IMSE} = \frac{r_L}{L} \sum_{l=1}^{L} \mathrm{MSE}(r_l)$, for both models. In evaluating
IMSE for the classical regression models, $\hat{\beta}(r_{l-1}; r_l)$ was set equal to $\hat{\beta}_1$ for $r_l \le r_k$ and zero
otherwise.
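The summary measures above can be computed from arrays of simulation output along the following lines. This is a sketch under assumptions: the function name `dlm_performance` and the array layout (one row per simulated dataset, one column per lag) are introduced here, and coverage is returned as a proportion, that is, the indicator sum divided by the number of datasets.

```python
import numpy as np

def dlm_performance(est, var_est, lower, upper, truth, r_max):
    """Per-lag bias, variance, MSE and coverage, plus the integrated MSE.

    est, var_est, lower, upper: arrays of shape (n_sim, L) holding, for each
    simulated dataset, the point estimates, their estimated variances, and the
    2.5% and 97.5% interval limits. truth: array of shape (L,).
    """
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    err = est - truth                                   # broadcasts over datasets
    bias = err.mean(axis=0)
    variance = np.asarray(var_est, dtype=float).mean(axis=0)
    mse = (err ** 2).mean(axis=0)
    covered = (np.asarray(lower) <= truth) & (truth <= np.asarray(upper))
    coverage = covered.mean(axis=0)                     # proportion of datasets
    imse = r_max / truth.size * mse.sum()               # (r_L / L) * sum_l MSE(r_l)
    return {"bias": bias, "variance": variance, "mse": mse,
            "coverage": coverage, "imse": imse}

# Toy usage: every estimate overshoots the truth by 0.1 at every lag.
truth = np.array([0.1, 0.1, 0.0, 0.0])
est = np.tile(truth + 0.1, (5, 1))
out = dlm_performance(est, np.full_like(est, 0.04), est - 0.2, est + 0.2,
                      truth, r_max=10.0)
```

For a traditional model, `est` would hold $\hat{\beta}_1$ in the columns with $r_l \le r_k$ and zero elsewhere, as described above.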
When the true DL coefficient function $\beta(r)$ is a step function, bias occurs around
distance lags where the step happens (eFigure 2A and 2B). Since the fitted DLM assumes that
the buffer effect is a continuous function of distance, bias at those lags is expected, and that
results in low coverage rates as well. When $\beta(r)$ varies continuously in $r$ (eFigure 2C and 2D),
much less bias is present, and the bias primarily occurs at the smallest lags because the estimated
buffer effects are smoother than the true $\beta(r)$. Some degree of over-smoothing is expected to
occur when using random effect variances (vs GCV) to compute smoothing parameters.1 Also, at
the first few lags, there is a relatively small amount of information since many DL covariates
$X_i(r_{l-1}; r_l)$ in the first few lags have many zero values. Hence bias at the smallest lags is expected.
Additionally, when the degree of clustering in the built environment becomes large, the range of
lags at which bias occurs becomes wider and coverage rates tend to be smaller.
For both functions $\beta(r)$, the variance of the estimated buffer effects is larger at the first few
distance lags due to less information in DL covariates as previously explained. Note also that the
variance of the estimated coefficients at both end points ($r_l = 0.1$ and 10) tends to be larger than
for other values of $r_l$ because at the end points the coefficients are constrained only in one
direction. The estimated buffer effects are more variable when the spatial dependence in the
intensity function controlling the spatial distribution of the built environment features decays at a
slower rate. This can be anticipated because the amount of independent information contributed by the built
environment covariates $X_i(r_{l-1}; r_l)$ is decreased (compare panels eFigure 2A vs 2B, and 2C vs
2D). The MSE is primarily dominated by bias since the variance is fairly constant across a range
of distances, except at the endpoints, as mentioned above.
The comparison of the estimated average association up to distance $r_k$, with $r_k = 2.5$, 5, and
7.5, obtained from DLMs and traditional linear models is reported in Table 1 of the manuscript.
The true average association up to distance $r_k$, $\bar{\beta}(0; r_k)$, is calculated using
$\bar{\beta}(0; r_k) = \sum_{l=1}^{k} \beta(r_{l-1}; r_l)\, \pi (r_l^2 - r_{l-1}^2) / (\pi r_k^2)$. When locations of food stores are generated from a
homogeneous Poisson point process, the estimated associations from the traditional linear
models are very close to the true values and their coverage rates are close to 95% (i.e., valid
inference) for both functions used for $\beta(r)$. However, if there is clustering of locations in the
built environment, the estimated associations from the traditional models are positively biased
(away from the null), giving invalid inference unless the model is correctly specified (i.e., when
$\beta(r)$ is the step function with $r_k = 5$). In particular, when $r_k = 2.5$, a large amount of bias
occurs in the traditional models because they fail to adjust for the effects at longer lags. When the
buffer size selected was greater than the true buffer size in traditional models (e.g., $r_k = 7.5$), the
amount of bias in estimates was smaller; however, standard errors of the estimated coefficients
were grossly underestimated, yielding invalid inference (e.g., very low coverage). Note that when
negative and positive biases cancel up to specified distances in the fitted DLMs, bias in
estimating the average buffer effect is close to zero (eFigure 2). Compared to the
traditional regression models, estimated average buffer effects obtained using DLMs generally
performed better, having much less bias and better coverage rates, except when the fitted
traditional models coincide with the true data generating models.
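The area-weighted average $\bar{\beta}(0; r_k)$ used as the true value in this comparison can be sketched as follows. The function name `average_association` is hypothetical, and the sketch assumes, as in the data-generating model, that the coefficient of ring $l$ is $\beta$ evaluated at the outer radius $r_l$.

```python
import numpy as np

L, r_max = 100, 10.0
edges = np.arange(0, L + 1) * r_max / L   # r_0 = 0, ..., r_L = 10
r = edges[1:]

beta_step = np.where(r <= 5.0, 0.1, 0.0)                      # step function
beta_curve = 0.1 * np.exp(-0.5 * (r / (5.0 / 3.0)) ** 2)      # 0.1 * f_N(r) / f_N(0)

def average_association(beta, edges, k):
    """Average of the ring coefficients over the first k rings, weighted by ring area.

    Ring l has area pi * (r_l^2 - r_{l-1}^2); the weights sum to pi * r_k^2.
    """
    ring_area = np.pi * (edges[1:k + 1] ** 2 - edges[:k] ** 2)
    return float(np.sum(beta[:k] * ring_area) / (np.pi * edges[k] ** 2))

# k = 25, 50, 75 correspond to r_k = 2.5, 5, 7.5 when the ring width is 0.1.
for k in (25, 50, 75):
    print(edges[k], average_association(beta_step, edges, k),
          average_association(beta_curve, edges, k))
```

For the step function the average stays at 0.1 up to $r_k = 5$ and is diluted by the zero-effect rings beyond that distance; the values for the smooth curve can be compared with the "True value" row of eTable 3.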
Since both the traditional regression models and the DLMs have some degree of bias, we
summarize their relative performance in terms of integrated mean squared error (IMSE) up to
distance $r_L = 10$ (eTable 1).1 When the true form of the $\beta(r)$ function is the step function, the
IMSE was minimum for the traditional regression models using the a-priori distance lag $r_k = 5$,
which is not surprising since the estimated model is the data generating model. However, when
$\beta(r)$ decays with distance $r$, the DLMs consistently yield the smallest IMSE.
To conserve space and avoid redundancy, here we only report results for the simulation
setting with $n = 6000$ and $R^2 = 0.2$, since this scenario corresponds to the data in our
motivating example. For the smaller sample size, bias and coverage rates of the DLM estimates
deteriorate, and the strong confounding bias in the traditional regression models persists. The
bias in the DLM is largely attenuated when the model $R^2$ increases, but this does not happen for
the traditional regression models.
To further examine assumptions used by the fitted DLMs we conducted additional
simulations: 1) we specified different numbers of lags, i.e., $L = 25, 50, 200$, to define
ring-shaped areas that differ from the ones ($L = 100$) used in the data generating model; and 2) we
assumed different maximum distances, $r_L = 3, 20$. As expected, using a smaller number of lags
in DLMs ($L = 25$) resulted in smoother estimated DL coefficients because the DL coefficients
are estimated in wider ring-shaped areas and thus become coarser. A larger number of lags ($L =
200$) yielded results similar to those for $L = 100$. When the maximum distance was misspecified and
$r_L = 3$, we observed bias in the DL coefficients when there is clustering of locations in the built
environment. However, the amount of bias in estimates of the average buffer effect at $r_k = 2.5$
was less than that from traditional regression models. Results were consistent with those for $r_L =
10$ when the maximum lag distance used to fit the DLMs was equal to 20.
ESTIMATION
We constrain the coefficients $\beta(r_{l-1}; r_l)$ to vary as a smooth function of distance $r_l$, $l =
1, 2, \ldots, L$, by using splines.2,3 This ensures that coefficients corresponding to adjacent areas are
similar, as we would not typically expect associations to change abruptly across distance. It also
alleviates possible numerical problems that may arise when many locations have zero food stores
between two given radii $r_{l-1}$ and $r_l$. In particular, we model the association coefficients
$\beta(r_{l-1}; r_l)$ using a radial basis function; that is
$$\beta(r_{l-1}; r_l) = \alpha_0 + \alpha_1 r_l + \sum_{k=1}^{L} \tilde{\alpha}_k |r_l - r_k|^3. \qquad \text{(1a)}$$
In matrix form, (1a) can be written as $\boldsymbol{\beta} = \mathbf{C}_0 \boldsymbol{\alpha} + \mathbf{C}_1 \tilde{\boldsymbol{\alpha}}$, where
$\mathbf{C}_0 = \begin{bmatrix} 1 & r_1 \\ \vdots & \vdots \\ 1 & r_L \end{bmatrix}$, $\mathbf{C}_1 = \left[ |r_l - r_k|^3 \right]_{1 \le l,k \le L}$, $\boldsymbol{\alpha} = \begin{bmatrix} \alpha_0 \\ \alpha_1 \end{bmatrix}$, and $\tilde{\boldsymbol{\alpha}} = (\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L)^T$.
The coefficients $\tilde{\boldsymbol{\alpha}}$ are penalized so that the
squared second derivative of the estimated DL coefficient function is penalized. The objective is
to minimize $\| \mathbf{Y} - \mathbf{1}_n \beta_0 - \mathbf{X}\boldsymbol{\beta} \|^2 = \| \mathbf{Y} - \mathbf{1}_n \beta_0 - \mathbf{X}(\mathbf{C}_0 \boldsymbol{\alpha} + \mathbf{C}_1 \tilde{\boldsymbol{\alpha}}) \|^2$ subject to the
constraints $\tilde{\boldsymbol{\alpha}}^T \mathbf{C}_1 \tilde{\boldsymbol{\alpha}} \le const$, and $\mathbf{C}_0^T \tilde{\boldsymbol{\alpha}} = 0$. The latter constraint implies that there are really $L$
free parameters $\boldsymbol{\alpha}$ and $\tilde{\boldsymbol{\alpha}}$ rather than the $L + 2$ implied by the columns of $\mathbf{C}_0$ and $\mathbf{C}_1$.1,4 As is well
known, the optimization problem can be re-written as a mixed model by redefining $\tilde{\boldsymbol{\alpha}} = \mathbf{M}_1 \mathbf{a}_1$,
where $\mathbf{M}_1$ is an $L \times (L - 2)$ matrix whose columns are orthogonal to $\mathbf{C}_0$; $\mathbf{M}_1$ can be determined using the QR
decomposition $[\mathbf{C}_0\ \mathbf{C}_1] = \mathbf{Q}_c \mathbf{R}_c$ and setting $\mathbf{M}_1$ as the 3rd to last columns of $\mathbf{Q}_c$.4 Further,
finding $\mathbf{M}_2^{1/2}$ that satisfies $\mathbf{M}_2 = \mathbf{M}_2^{1/2} \mathbf{M}_2^{1/2} = \mathbf{M}_1^T \mathbf{C}_1 \mathbf{M}_1$, defining $\mathbf{b}_1$ through the
transformation $\mathbf{a}_1 = \mathbf{M}_2^{-1/2} \mathbf{b}_1$, and re-structuring the data as $\mathbf{X}^* = \mathbf{X}\mathbf{C}_0$ and $\mathbf{Z}^* = \mathbf{X}\mathbf{C}_1 \mathbf{M}_1 \mathbf{M}_2^{-1/2}$, the
mixed model becomes $\mathbf{Y} = \beta_0 \mathbf{1}_n + \mathbf{X}^* \boldsymbol{\alpha} + \mathbf{Z}^* \mathbf{b}_1 + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim N_n(\mathbf{0}, \tau^2 \mathbf{I})$ and
$\mathbf{b}_1 \sim N_{L-2}(\mathbf{0}, \sigma_b^2 \mathbf{I})$. The smoothing parameter is $\lambda = \tau^2 / \sigma_b^2$.
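The mixed model can be fitted with packaged software for mixed models in the
frequentist framework. Once we have the estimates from the fitted regression, the estimates of
the DL coefficients can be obtained as $\boldsymbol{\beta} = \boldsymbol{\Omega} \begin{bmatrix} \boldsymbol{\alpha} \\ \mathbf{b}_1 \end{bmatrix}$ and $Cov(\boldsymbol{\beta}) = \boldsymbol{\Omega}\, Cov\left( \begin{bmatrix} \boldsymbol{\alpha} \\ \mathbf{b}_1 \end{bmatrix} \right) \boldsymbol{\Omega}^T$, where
$\boldsymbol{\Omega} = [\mathbf{C}_0\ \ \mathbf{C}_1 \mathbf{M}_1 \mathbf{M}_2^{-1/2}]$.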
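The reparameterization of the spline basis into mixed-model design matrices, together with the back-transformation matrix $\boldsymbol{\Omega}$, can be sketched with NumPy as below. This is an illustrative sketch rather than the authors' code: the function name `dlm_mixed_model_design` is hypothetical, and the inverse square root of $\mathbf{M}_2$ is obtained from an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

def dlm_mixed_model_design(X, r):
    """Build X* = X C0, Z* = X C1 M1 M2^{-1/2} and Omega = [C0, C1 M1 M2^{-1/2}].

    X: (n, L) matrix of ring counts; r: length-L vector of outer radii r_1..r_L.
    """
    r = np.asarray(r, dtype=float)
    L = r.size
    C0 = np.column_stack([np.ones(L), r])            # L x 2, unpenalized linear part
    C1 = np.abs(r[:, None] - r[None, :]) ** 3        # L x L, cubic radial basis
    # QR of [C0 C1]: the first two columns of Q span C0, so the remaining
    # L - 2 columns are orthogonal to C0.
    Q, _ = np.linalg.qr(np.hstack([C0, C1]))
    M1 = Q[:, 2:]                                    # L x (L - 2)
    M2 = M1.T @ C1 @ M1
    M2 = (M2 + M2.T) / 2.0                           # symmetrize against round-off
    evals, evecs = np.linalg.eigh(M2)
    M2_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W = C1 @ M1 @ M2_inv_sqrt                        # L x (L - 2)
    Omega = np.hstack([C0, W])                       # maps (alpha, b1) to beta
    return X @ C0, X @ W, Omega, M1, M2

# Toy usage with 20 rings of width 0.5 and Poisson ring counts.
rng = np.random.default_rng(0)
r = np.arange(1, 21) * 0.5
X = rng.poisson(2.0, size=(50, 20)).astype(float)
X_star, Z_star, Omega, M1, M2 = dlm_mixed_model_design(X, r)
```

Given estimates of $(\boldsymbol{\alpha}, \mathbf{b}_1)$ from any mixed-model routine, `Omega @ np.concatenate([alpha, b1])` returns the DL coefficients on the original lag scale.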
Alternatively, the model can be estimated in the Bayesian framework. With prior
distributions of $\beta_0 \propto 1$, $\boldsymbol{\alpha} \propto \mathbf{1}$, $\mathbf{b}_1 \sim N(\mathbf{0}, \sigma_b^2 \mathbf{I}_{L-2})$, $\sigma_b^2 \sim IG(a_\sigma, b_\sigma)$, $\tau^2 \sim IG(a_\tau, b_\tau)$, the full
conditionals are all available in closed form. Let $\mathbf{D}^* = [\mathbf{1}_n\ \mathbf{X}^*\ \mathbf{Z}^*] =
[\mathbf{1}_n\ \ \mathbf{X}\mathbf{C}_0\ \ \mathbf{X}\mathbf{C}_1 \mathbf{M}_1 \mathbf{M}_2^{-1/2}]$; then the full conditional for $\beta_0, \boldsymbol{\alpha}, \mathbf{b}_1$ is $p(\beta_0, \boldsymbol{\alpha}, \mathbf{b}_1 \mid \cdot) = N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
where $\boldsymbol{\Sigma} = \left( \mathbf{D}^{*T} \mathbf{D}^* / \tau^2 + \sigma_b^{-2} \mathbf{G} \right)^{-1}$, $\mathbf{G} = diag\{\mathbf{0}_3, \mathbf{1}_{L-2}\}$ and $\boldsymbol{\mu} = \boldsymbol{\Sigma} \mathbf{D}^{*T} \mathbf{Y} / \tau^2$. The full
conditional distribution for $\sigma_b^2$ is $p(\sigma_b^2 \mid \cdot) = IG(a_\sigma + (L - 2)/2,\ b_\sigma + \mathbf{b}_1^T \mathbf{G} \mathbf{b}_1 / 2)$, while the
full conditional distribution of $\tau^2$ is $p(\tau^2 \mid \cdot) = IG(a_\tau + n/2,\ b_\tau + (\mathbf{r}^T \mathbf{r})/2)$, where $\mathbf{r} = \mathbf{Y} -
\mathbf{D}^* (\beta_0, \boldsymbol{\alpha}, \mathbf{b}_1)^T$. Inference for the DL coefficients $\boldsymbol{\beta}$ is obtained by transforming posterior samples of
$\boldsymbol{\alpha}, \mathbf{b}_1$ by $\boldsymbol{\Omega} \begin{bmatrix} \boldsymbol{\alpha} \\ \mathbf{b}_1 \end{bmatrix}$ with $\boldsymbol{\Omega}$ as described above. Inference for average lag effects, $\bar{\beta}(0; r_k)$, can be
easily determined from posterior samples.
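A Gibbs sampler that cycles through the three full conditionals above can be sketched as follows. This is a minimal sketch, not the authors' MCMC code. It assumes that $IG(a, b)$ denotes an inverse gamma distribution with shape $a$ and scale $b$, so that a draw is the reciprocal of a gamma draw with shape $a$ and rate $b$. The function name `gibbs_dlm` and its arguments are hypothetical; `D` plays the role of $\mathbf{D}^*$ and `n_fixed = 3` corresponds to the unpenalized $\beta_0, \alpha_0, \alpha_1$.

```python
import numpy as np

def gibbs_dlm(Y, D, n_fixed=3, n_iter=500, a_sigma=0.1, b_sigma=1e-6,
              a_tau=0.1, b_tau=1e-6, seed=0):
    """Gibbs sampler for Y = D theta + eps with flat priors on the first
    n_fixed coefficients and N(0, sigma_b^2) priors on the remaining ones."""
    rng = np.random.default_rng(seed)
    n, p = D.shape
    q = p - n_fixed                                   # number of random effects (L - 2)
    G = np.diag(np.r_[np.zeros(n_fixed), np.ones(q)])
    DtD, DtY = D.T @ D, D.T @ Y
    sigma2_b, tau2 = 1.0, 1.0                         # arbitrary starting values
    coef_draws = np.empty((n_iter, p))
    var_draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # (beta_0, alpha, b_1) | . ~ N(mu, Sigma), Sigma^{-1} = D'D/tau^2 + G/sigma_b^2
        prec = DtD / tau2 + G / sigma2_b
        chol = np.linalg.cholesky(prec)               # prec = chol @ chol.T
        mu = np.linalg.solve(prec, DtY / tau2)
        coef = mu + np.linalg.solve(chol.T, rng.standard_normal(p))
        # sigma_b^2 | . ~ IG(a_sigma + q/2, b_sigma + b_1'b_1 / 2)
        b1 = coef[n_fixed:]
        sigma2_b = 1.0 / rng.gamma(a_sigma + q / 2.0,
                                   1.0 / (b_sigma + b1 @ b1 / 2.0))
        # tau^2 | . ~ IG(a_tau + n/2, b_tau + r'r / 2)
        resid = Y - D @ coef
        tau2 = 1.0 / rng.gamma(a_tau + n / 2.0,
                               1.0 / (b_tau + resid @ resid / 2.0))
        coef_draws[t] = coef
        var_draws[t] = sigma2_b, tau2
    return coef_draws, var_draws
```

Posterior draws of the DL coefficients would then follow by multiplying the columns of `coef_draws` after the intercept by the transpose of $\boldsymbol{\Omega}$, and draws of $\bar{\beta}(0; r_k)$ by taking the area-weighted average of each transformed draw.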
References
1. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; 2003.
2. Hastie TJ, Tibshirani RJ. Generalized Additive Models. CRC Press; 1990.
3. Zanobetti A, Wand MP, Schwartz J, Ryan LM. Generalized additive distributed lag models: quantifying. Biostatistics. 2000;1(3):279-292.
4. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press; 1993.
5. Wood S. Generalized Additive Models: An Introduction with R. CRC Press; 2006.
eFigure 1. The estimated DL coefficients up to 7 miles from schools in the student characteristics
adjusted DLM. Food environment associations are only for (A) boys or (B) girls, and (C) the
difference of association by sex; associations are only for (D) non-Hispanic Whites or (E)
Hispanics, and (F) the difference of association by race/ethnicity; associations are only for (G)
5th grade children or (H) 7th grade children, and (I) the difference of association by grade.
eFigure 2. Bias, variance, MSE, and coverage rate at each $r_l$, $l = 1, 2, \ldots, 100$, for the cases when
$\beta(r)$ is: (A) the step function under the built environment without clustering; (B) the step function
under the built environment with a large amount of clustering; (C) the normal pdf under
the built environment without clustering; (D) the normal pdf under the built environment
with a large amount of clustering. Reported results are from a simulation case with $n = 6{,}000$
and $R^2 = 0.2$.
eTable 1. Integrated MSE from fitted traditional linear models with distance lag $r_k = 2.5$, 5, and
7.5 and from fitted DLMs with a maximum distance $r_L = 10$. Reported results are from a
simulation case with $n = 6{,}000$ and $R^2 = 0.2$.

$\beta(r)$   Spatial range in the   Traditional linear   Traditional linear   Traditional linear   DLM
            built environment      model ($r_k = 2.5$)  model ($r_k = 5$)    model ($r_k = 7.5$)
                                   IMSE*                IMSE*                IMSE*                IMSE*
Step        Independence           2.512                0.004                2.055                0.125
Step        5                      7.500                0.004                1.861                0.269
Step        20                     12.883               0.003                1.945                0.360
Curve       Independence           0.205                0.790                1.113                0.006
Curve       5                      0.169                0.743                1.060                0.010
Curve       20                     0.179                0.767                1.083                0.013

* IMSE is multiplied by 100.
eTable 2. Simulation findings regarding the use of DIC for model selection. For each of 1000 datasets simulated for each scenario, we calculated DIC
for the traditional models with the pre-specified distances in Table 1 of the manuscript and for the DLM. Because selection of the traditional vs the DL model may depend on the a-priori
specified distance, as a more comprehensive way to select the best traditional model, we also computed DIC for $L$ traditional models using buffer
sizes $r_k \in \{r_1, \ldots, r_L\}$. The distance $r_{(\min DIC)}$ that gave the traditional model with minimum DIC was selected as the "best" buffer size. The bias in
coefficients for the pre-specified buffer sizes is given in Table 1. Percent bias in the coefficient from the "best" traditional model,
$\hat{\theta}_{1, r_{\min(DIC)}}$, compared to $\bar{\beta}(0, r_{(\min DIC)})$, was calculated, as well as percent bias in the cumulative effect up to $r_{(\min DIC)}$ computed from the DLM. The
minimum DIC value among traditional models was compared with the DIC from the DLM. DIC selected the model that generated the data in almost
all cases, except when the curve (Figure 3B in the manuscript) was used to generate data and there was lower power (e.g., n=1000). However, in
these cases, the estimates from even the best traditional model (even when its DIC is lower than that of the DLM) remain more biased. While DIC may
select models that fit better, it may not select models that give unbiased effect estimates.
Columns: (1) true $\beta(r)$; (2) spatial range in the built environment; (3) N; (4) mean DIC from the DLM; (5) mean DIC from the traditional model
with buffer size 2.5 / 5 / 7.5 / $r_{(\min DIC)}$; (6) P(DLM is selected) via DIC; (7) mean of $r_{(\min DIC)}$ chosen via DIC in the traditional model;
(8) mean %Bias in $\hat{\theta}_{1, r_{\min(DIC)}}$ from the traditional model; (9) mean %Bias in $\hat{\bar{\beta}}(0, r_{(\min DIC)})$ from the DLM.

(1)     (2)   (3)    (4)     (5)                               (6)    (7)   (8)    (9)
Step    0     1000   2410    2526 / 2394 / 2491 / 2393         0.01   5.0   5.4    9.4
Step    0     6000   14391   15155 / 14353 / 14939 / 14353     0.00   5.0   2.3    3.8
Step    5     1000   4340    4400 / 4333 / 4368 / 4332         0.03   5.0   7.4    12.5
Step    5     6000   25988   26378 / 25973 / 26181 / 25972     0.01   5.0   2.3    7.6
Step    20    1000   4694    4744 / 4689 / 4710 / 4687         0.03   5.0   8.1    14.0
Step    20    6000   28146   28467 / 28133 / 28256 / 28132     0.01   5.0   2.3    8.7
Curve   0     1000   186     213 / 277 / 314 / 204             0.96   2.6   7.6    6.5
Curve   0     6000   1082    1276 / 1651 / 1877 / 1256         1.00   2.6   3.2    2.6
Curve   5     1000   1643    1652 / 1680 / 1721 / 1643         0.50   2.9   24.2   7.2
Curve   5     6000   9810    9886 / 10054 / 10302 / 9861       1.00   2.9   20.1   3.1
Curve   20    1000   1793    1802 / 1819 / 1845 / 1793         0.44   3.0   26.1   9.0
Curve   20    6000   10739   10813 / 10913 / 11071 / 10785     1.00   2.9   22.1   3.4
eTable 3. Results from a simulation study that examines whether accounting for spatial patterning in the outcome resolves issues of bias in the
traditional model. Results are shown for simulation settings where the true $\beta(r)$ is the curve shown in Figure 3B of the manuscript, $n \in
\{1000, 6000\}$, and small or large spatial correlation in the built environment. Results from a traditional linear model including spatial
smoothing effects (the model fitted is $Y_i = \beta_0 + \beta_1 X_i(0; r_k) + s(x_i, y_i) + \epsilon_i$, where $s(x_i, y_i)$ is a bivariate smoothing term of longitude (x) and
latitude (y) constructed using tensor product basis functions5) are compared to the traditional model without spatial terms. Summaries are
derived from 1000 simulated datasets. Accounting for the spatial patterning in the outcome does not resolve the bias in the estimated built
environment association. Est. beta is the mean of estimates in 1000 datasets. Coverage rate is the proportion of 95% confidence intervals
including the true association. SD(beta) is the standard deviation of 1000 estimates. Mean(SE) is the mean of 1000 standard error estimates.
Range = spatial range in the built environment; TLM = traditional linear model. Within each $r_k$ block the columns are
Est. beta, Coverage rate, SD*(beta), Mean*(SE).

                                                  r_k = 2.5                        r_k = 5                          r_k = 7.5
Model                              N      Range   Est.    Cov.    SD*     Mean*    Est.    Cov.    SD*     Mean*    Est.    Cov.    SD*     Mean*
True value                                        0.058   -                        0.021   -                        0.010   -
TLM with spatial smoothing terms   1000   5       0.074   0.154   5.241   5.231    0.024   0.723   1.842   1.799    0.011   0.510   1.097   1.007
TLM with spatial smoothing terms   6000   5       0.074   0.000   2.130   2.120    0.024   0.106   0.802   0.732    0.011   0.004   0.480   0.410
TLM with spatial smoothing terms   1000   20      0.076   0.252   6.901   6.471    0.022   0.878   2.159   1.999    0.010   0.813   1.085   1.000
TLM with spatial smoothing terms   6000   20      0.076   0.000   2.717   2.632    0.022   0.658   0.870   0.813    0.010   0.361   0.446   0.407
TLM                                1000   5       0.074   0.152   5.234   5.218    0.024   0.727   1.839   1.792    0.011   0.499   1.090   1.002
TLM                                6000   5       0.074   0.000   2.125   2.116    0.024   0.104   0.798   0.729    0.011   0.006   0.476   0.408
TLM                                1000   20      0.076   0.216   6.782   6.376    0.022   0.888   2.093   1.957    0.010   0.833   1.043   0.974
TLM                                6000   20      0.076   0.000   2.643   2.594    0.022   0.662   0.839   0.796    0.010   0.390   0.426   0.396

* SD(beta) and Mean(SE) are multiplied by 1000 for readability.