Appendix S1
We used the Spatio-Temporal Exploratory Model (STEM) to estimate species’
distributions because of its ability to adapt to non-stationarity in predictor-response relationships
modeled from large sets of irregularly distributed observational data (Fink et al., 2010, 2014).
STEM is an ensemble of local regression models generated by repeatedly partitioning the study
extent into grids of spatiotemporal blocks, called stixels, and then fitting independent regression
models, called base models, within all stixels. Together, the base models form an ensemble of
local occurrence estimates and local land cover association estimates uniformly distributed
across the study extent.
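As a rough illustration of the partitioning step, the following Python sketch assigns observations to stixels under one randomly located grid and groups them so that a base model could be fit to each group. The function names, the (lon, lat, day) tuple layout, the default stixel dimensions, and the random-offset scheme are assumptions made for illustration; this is not the STEM implementation.

```python
import math
import random
from collections import defaultdict

def stixel_id(lon, lat, day, origin, dims):
    """Return the integer (i, j, k) index of the stixel containing a point."""
    point = (lon, lat, day)
    return tuple(math.floor((v - o) / d) for v, o, d in zip(point, origin, dims))

def partition(observations, dims=(10.0, 7.0, 40.0), rng=random):
    """Group observations by stixel under one randomly located grid.

    observations: iterable of tuples whose first three fields are
    (lon, lat, day); dims: stixel size in (degrees lon, degrees lat, days).
    """
    # Shift the grid origin by a random fraction of one stixel per dimension
    # (assumed randomization scheme).
    origin = tuple(-rng.uniform(0.0, d) for d in dims)
    stixels = defaultdict(list)
    for obs in observations:
        lon, lat, day = obs[:3]
        stixels[stixel_id(lon, lat, day, origin, dims)].append(obs)
    return stixels
```

Each value in the returned dictionary holds the observations that would train one base model; repeating the call with new random offsets gives further partitions of the same data.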
In the methods sections of the paper, we describe how we used the ensemble to estimate
1) fine-scale occurrence and 2) regional-scale trajectories of several land cover association
statistics. In this appendix, we discuss how we specified the stixel size, how the partitions
were created and randomized, and the base models we used for this analysis.
S1.1 Specifying Stixel Size
Specifying the size of the stixels that define the local neighborhoods is an important part
of STEM. Because of the non-uniform distribution of eBird observations in space and through
time, stixel size controls a bias-variance tradeoff (Fink et al., 2010, 2014). The larger the stixel,
the larger the number of observations in the stixel used to train the base model (reducing the
variance associated with sample size). The smaller the stixel, the weaker the assumption of
spatial-temporal stationarity (reducing the bias associated with less flexible models).
Because we wanted to capture seasonal variation in distributions, the time
dimension had to be short enough to adapt to seasonal changes. In our experience,
a grid with a 40-day window can adapt to a wide variety of complex avian migration
patterns across a diverse set of terrestrial species using eBird data (NABCI 2011, 2013). We
selected the smallest spatial dimensions (lowest bias) that could capture non-stationary
patterns while still meeting the minimum sample size requirement (variance control) needed
to fit the boosted regression tree (BRT) base models throughout the study area.
The number of stixels supporting a STEM estimate at a given location and time is called
the ensemble support. Given the 40-day window, we estimated the smallest latitude-longitude
stixel dimensions necessary to achieve at least 50% support throughout the study area, based on
the eBird training data locations during the month with the fewest observations. To
do this, we generated a random sample of 10 uniformly distributed partitions at this time and
recorded ensemble support across a fine grid of locations within the study area. To be counted
toward the ensemble support, each stixel was required to meet the base model minimum sample size
requirement (see S1.3 Boosted Regression Tree Base Models).
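The support calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: observations are (lon, lat) pairs already restricted to the 40-day window of interest, partitions are randomized by shifting the grid origin, and the name `ensemble_support` and its parameters are hypothetical rather than taken from the STEM code.

```python
import math
import random
from collections import Counter

def ensemble_support(obs, grid_points, dims, n_partitions=10, min_n=30, seed=0):
    """Count, for each grid point, the partitions whose covering stixel
    holds at least min_n observations."""
    rng = random.Random(seed)
    support = [0] * len(grid_points)
    for _ in range(n_partitions):
        # One randomly located partition: shift the grid origin (assumed scheme).
        origin = [-rng.uniform(0.0, d) for d in dims]

        def key(point):
            return tuple(math.floor((v - o) / d)
                         for v, o, d in zip(point, origin, dims))

        # Number of observations falling in each stixel of this partition.
        counts = Counter(key(p) for p in obs)
        for i, g in enumerate(grid_points):
            if counts[key(g)] >= min_n:
                support[i] += 1
    return support
```

Under these assumptions, candidate stixel dimensions could be scanned from small to large, keeping the first size for which `min(support) / n_partitions` reaches 0.5 across the grid.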
For this analysis we used a regular spatiotemporal grid with dimensions of 10 degrees
longitude by 7 degrees latitude by 40 days and a minimum base-model sample size of 30.
Figure 1 shows a partition with stixels of 7 degrees latitude by 10 degrees longitude (left panel)
and the associated map of ensemble support (right panel). The support map shows that most of the
continental US is covered by at least 9 of the 10 possible base models. The combined effects of
regionally sparse data and boundary effects are evident in Montana and North Dakota, where ensemble
support drops to 6 of 10 partitions.
Figure 1. Ensemble support diagnostics. The left panel shows a typical stixel partition of 7
degrees latitude by 10 degrees longitude. The right panel shows the associated map of ensemble
support. The support map shows that most of the continental US is covered by at least 9 of the 10
possible base models. The combined effects of regionally sparse eBird data and boundary effects
are evident in Montana and North Dakota, where ensemble support drops to 6 of 10
partitions.
S1.2 Partitions: Creating and Randomizing
Each individual partition divides the study extent into a regular grid of spatiotemporal
neighborhoods (stixels), and regression base models are fit independently to the data in each
stixel. The STEM ensemble is created from a sample of partitions to generate a uniformly
distributed ensemble of model estimates across the study extent while facilitating bootstrap
estimation. First, to capture sampling variation, we generated 50 subsamples, each consisting of
70% of the data. Second, we generated four randomly located partitions for each subsample
(200 partitions in total) so that “edge” effects associated with individual partitions could be
averaged out. STEM uses
bootstrap smoothing, also known as bagging, to combine estimates across the ensemble while
controlling inter-model variability (Efron, 2014). Each bootstrap smooth was based on a random
subset of 100 partitions. We generated 100 bootstrap replicates in this way, each
equivalent to a grid-based block subsample (Lahiri & Zhu, 2006).
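The resampling scheme above can be sketched in Python as follows. The function names, the representation of a partition as a (row indices, grid offset) pair, and the choice to draw each subset of partitions without replacement are assumptions for illustration. The sketch also assumes every partition yields an estimate at the location of interest, which need not hold where ensemble support is low.

```python
import random
from statistics import mean

def build_partitions(n_obs, n_subsamples=50, frac=0.7, per_subsample=4, seed=0):
    """Return one (row_indices, grid_offset) pair per partition."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_subsamples):
        # Subsample 70% of the observations without replacement.
        rows = rng.sample(range(n_obs), round(frac * n_obs))
        for _ in range(per_subsample):
            # Random grid offset, as a fraction of one stixel in (lon, lat, day).
            offset = (rng.random(), rng.random(), rng.random())
            partitions.append((rows, offset))
    return partitions

def bootstrap_smooths(estimates, n_replicates=100, subset_size=100, seed=1):
    """Average per-partition estimates at one location over random subsets
    of partitions, giving one smoothed value per replicate."""
    rng = random.Random(seed)
    return [mean(rng.sample(estimates, subset_size)) for _ in range(n_replicates)]
```

With the defaults, `build_partitions` returns 50 × 4 = 200 partitions, and `bootstrap_smooths` turns the 200 per-partition estimates at a location into 100 replicate values whose spread reflects inter-model variability.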
S1.3 Boosted Regression Tree Base Models
Within each stixel, species’ occurrence was assumed to be stationary, and we fit Boosted
Regression Trees (BRTs) with a stixel minimum sample size requirement of 30 observations.
BRTs are a flexible, highly automated nonparametric regression technique that can accommodate
a wide range of potential covariates with non-linear effects and interactions (Hastie et al., 2009).
BRTs have been found to perform well for species distribution modeling (Elith et al., 2008).
BRT occurrence models were fit using the presence or absence of a species on a checklist as the
binomial response variable. Effort and time covariates were included to account for variation in
detectability and in the availability of birds for detection.
We selected BRT parameters to facilitate a bagging strategy when combining
information across base models. The strategy is to aim for base models with low bias, erring on
the side of overfitting rather than underfitting, and then to rely on the variance-reducing
properties of combining estimates across base models to control overfitting. All BRTs were
run with an interaction depth of 3, a shrinkage parameter of 0.01, a bag fraction
of 0.80, and 500 trees in the ensemble. The interaction depth was set to 3 to ensure that we
would capture 2-way interactions (which we have good reason to expect a priori, e.g. interactions
of elevation and land cover) while also giving the base models extra flexibility to produce low-bias
estimates. We selected the other parameters because
they tended to produce overfit base models when tested over different regions, seasons, and
species. The trees were fit in R (R Core Team, 2015) with the gbm package (Ridgeway, 2015).
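For reference, the stated settings can be read against the generic gradient-boosting recursion. The following is a textbook-style sketch (cf. Hastie et al., 2009), not a description of gbm internals; the symbols $F_m$, $h_m$, $\nu$, and $M$, and the logit link for the binomial response, are assumptions introduced here.

```latex
\begin{align}
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M, \\
\hat{p}(x) &= \frac{1}{1 + \exp\{-F_M(x)\}}
\end{align}
```

Here $M = 500$ is the number of trees, $\nu = 0.01$ is the shrinkage parameter, and each $h_m$ is a regression tree with interaction depth 3, fit at step $m$ to a random 80% (the bag fraction) of the observations in the stixel. A small $\nu$ with a fixed $M$ limits how far each tree can move the fit, so the remaining settings govern how closely the base model follows the local data.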
References
Allouche, O., Tsoar, A. & Kadmon, R. (2006) Assessing the accuracy of species distribution
models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology, 43,
1223-1232.
Efron, B. (2014) Estimation and accuracy after model selection. Journal of the American
Statistical Association, 109, 991-1007.
Elith, J., Leathwick, J. R. & Hastie, T. (2008) A working guide to boosted regression
trees. Journal of Animal Ecology, 77, 802-813.
Fink, D., Damoulas, T., Bruns, N. E., La Sorte, F. A., Hochachka, W. M., Gomes, C. P. &
Kelling, S. (2014) Crowdsourcing meets ecology: hemispherewide spatiotemporal species
distribution models. AI Magazine, 35, 19-30.
Fink, D., Hochachka, W. M., Zuckerberg, B., Winkler, D. W., Shaby, B., Munson, M. A.,
Hooker, G. J., Riedewald, M., Sheldon, D. & Kelling, S. (2010) Spatiotemporal exploratory
models for large-scale survey data. Ecological Applications, 20, 2131-2147.
Hastie, T., Tibshirani, R. & Friedman, J. (2009) The elements of statistical learning: data
mining, inference, and prediction. Second edition. Springer-Verlag, New York, USA.
Lahiri, S. N., & Zhu, J. (2006) Resampling methods for spatial regression models under a class
of stochastic designs. The Annals of Statistics, 34(4), 1774-1813.
NABCI (2011) The State of the Birds 2011 Report on Public Lands and Waters. U.S.
Department of Interior, Washington, DC.
NABCI (2013) The State of the Birds 2013 Report on Private Lands and Waters. U.S.
Department of Interior, Washington, DC.
R Core Team (2015) R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Ridgeway, G. (2015) Generalized Boosted Regression Models. R package version 2.1.1.
http://CRAN.R-project.org/package=gbm.
Wood, S. N. (2006) Generalized additive models: an introduction with R. Chapman &
Hall/CRC, Boca Raton, FL.