guevara_etal_soc_mx_us_6_10_2016

Soil organic carbon across Mexico
and the conterminous United States
Guevara M., Vargas R., et al. (co-authorship likely in
alphabetical order)
There is a need for an interoperable effort to
efficiently characterize soil functionality across
the Mexico and United States continuum
• To facilitate capacity building (technical and institutional) and avoid
interoperability barriers in carbon cycle science and reporting
(e.g. Vargas et al., 2016)
– To better represent variability across geopolitical borders
– To periodically provide estimates of soil organic carbon and its
associated uncertainty based on the best information available
– To better inform policy-relevant decisions about carbon cycle
science across both countries
Vargas et al., 2016. Enhancing interoperability to facilitate implementation of
REDD+: case study of Mexico. Carbon Management. In press.
Introduction
• Why map soil organic carbon across Mexico and the United States?
– Reporting purposes for climate change adaptation guidelines
– Only a few digital soil organic carbon mapping efforts cover
the Mexico and United States continuum, and the available
information carries high uncertainty
• Why 30 cm depth?
– A harmonized database representing soil organic carbon at this
depth is the first result of institutional (US and Mexican) data
sharing and curation
– 0-30 cm is also a priority standard soil depth interval required
by the IPCC
Why build analytical and institutional
capacities for soil carbon mapping?
To bring the benefits of digital soil mapping into the federal
agenda and facilitate policy-relevant research on the
carbon cycle
• We work with the following principles:
– reproducible research,
– data sharing,
– transparent methods and
– (ideally) open source platforms
Image from: A Safe Operating Space for Humanity. Rockström et al. 2009. Nature
461:24
Knowledge gaps:
• Spatial variability
• Spatial and temporal detail
• Sources or sinks?
• Reducing model uncertainty
Scharlemann, J.P.W., Hiederer, R., Kapos, V. (2009). Global map of terrestrial
soil organic carbon stocks. UNEP-WCMC & EU-JRC, Cambridge, UK.
Global efforts: Polygon based approaches
Hiederer R, Kochy M (2011) Global Soil Organic Carbon Estimates and the Harmonized
World Soil Database. EUR 25225 EN. Publications Office of the EU, Luxembourg.
Global efforts: ISRIC SoilGrids1km, ~30% explained variance
Map annotation: unexpected pattern across the Yucatan Peninsula (legend units: permille)
SoilGrids1km — Global Soil Information Based on Automated Mapping, Hengl
et al., 2014 http://dx.doi.org/10.1371/journal.pone.0105992
National efforts: Cloud-based soil mapping, ~30% of explained variance
Using Google's cloud-based platform for digital soil mapping, Padarian et al.
2015 doi:10.1016/j.cageo.2015.06.023
National efforts: Linear geostatistics
Organic carbon
Cruz-Cardenas et al., 2014. Interpolation of Mexican soil properties at a scale of
1:1,000,000. 213, 29–35.
This is a collective effort
• To build technical and institutional capacities as well as key alliances
for digital soil mapping across Mexico and the United States
• Data -> Information -> Knowledge -> Wisdom?
– Spatial detail and depth relationships
– Precision and accuracy
– Spatial and temporal modeling
Scientific questions of this study
• How much soil organic carbon is stored in the first 30 centimeters
of soil across Mexico and the United States, and what is the associated
uncertainty?
• How much of the spatial variability of soil organic carbon can we
explain across Mexico and the United States?
Objectives
• To generate an interpretable and predictive model to describe
organic carbon variability
• To quantify the soil organic carbon stock stored in the first 30 cm
of soil across Mexico and the United States
– To design a spatial soil organic carbon inference system
• Parametric, machine learning and Bayesian statistics
• Uncertainty estimates
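The stock objective comes down to a unit conversion and a sum over map pixels. The sketch below assumes that SOC density is mapped in g cm-2 (the unit used in this deck) on square pixels of known size. The function name and the toy grid are hypothetical and are not part of the study's workflow.

```python
# Hypothetical helper; the grid values below are toy numbers, not real data.
G_PER_CM2_TO_KG_PER_M2 = 10.0  # 1 g cm-2 = 10,000 g m-2 = 10 kg m-2


def soc_stock_kg(densities_g_cm2, pixel_size_m):
    """Sum the SOC stock (kg) over square pixels of SOC density in g cm-2.

    Pixels without data are passed as None and skipped.
    """
    pixel_area_m2 = pixel_size_m ** 2
    total = 0.0
    for density in densities_g_cm2:
        if density is None:
            continue
        total += density * G_PER_CM2_TO_KG_PER_M2 * pixel_area_m2
    return total


grid = [1.0, 1.0, None, 1.0, 1.0]  # four valid 250 m pixels and one gap
stock = soc_stock_kg(grid, pixel_size_m=250)
print(stock, "kg =", stock / 1e12, "Pg")
```

Running the same sum on the median map and on maps offset by the per-pixel SD would give a rough range for the stock, although that ignores spatial correlation between pixel errors.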
Methods
• Conceptual model
• The data set (point data and SCORPAN factors)
• Statistical design
– Interpretability (e.g. linear terms)
– Prediction capacity (e.g. nonparametric modeling)
– Mexico and the United States (5 km and 250 m)
– Downscaling model <100 m (State of Delaware in the US
and La Encrucijada in Mexico)
• Computational resources available at the University of Delaware
thanks to the Soil Plant Atmosphere Continuum working group and
the UD IT team.
A Pedometric mapping approach
Courtesy of Tomislav Hengl, ISRIC Summer school 2012, Wageningen NL
The conceptual model
Available data
Available soil covariates
Static 2D
(Baseline estimates)
Grunwald et al., 2011 SSSAJ
SOC 30cm depth = f (SCORPAN) + e
n = 38,218 (legend units: g cm-2)
Available data representative of the first 30 cm depth
SOC 30cm depth = f (SCORPAN) + e
Digital terrain analysis
Remote sensing
Climate surfaces
Soil organic carbon prediction factors, up to 250 m pixel size (Hengl et al., in preparation)
SOC 30cm depth = f (SCORPAN) + e
• Linear models (e.g. lm)
• Machine learning (e.g. random forest, kknn)
• Bayesian methods (e.g. Hamiltonian Monte Carlo simulations)
• Bootstrapping methods (i.e. independent uncertainty estimates)
There is no best method (Wolpert's "no free lunch" theorem). We assume that
different prediction algorithms will capture different portions of soil organic carbon
variability
SOC 30cm depth = f (SCORPAN) + e
Example of residual mapping for a multivariate linear model predicting SOC on new data.
See Hengl et al., A generic framework for spatial prediction of soil variables
based on regression-kriging, Geoderma 120 (1-2), doi:10.1016/j.geoderma.2003.08.018
Results
• Descriptive statistics and spatial autocorrelation
• Linear models (variable selection)
• Machine learning models (kknn and random forest)
• Bayesian statistics
• Predictive maps (5 km – 250 m)
• Soil organic carbon stocks (and their uncertainties) across land uses
of Mexico and the United States
Histogram of available data
Most of the available data values fall between 0 and 1.
Both data sets (Mexican and US) show a similar distribution and spatial
autocorrelation structure
Linear model
coefficients
R2 0.31
We use stepwise linear models to analyze variable importance
The problem of multicollinearity:
variance inflation factor (VIF) plot
Statistical redundancy among predictors affects both model interpretability and model
prediction capacity. More predictors often yield a higher R2, but not necessarily a better model
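For reference, the VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the others. The sketch below computes it with the standard library only. The helper names and the synthetic predictors x1, x2 and x3 are hypothetical, and x3 is deliberately built as a near copy of x1 to show the effect of redundancy.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(X, y):
    """In-sample R2 of an OLS fit of y on the columns of X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(X):
    """VIF per column: regress each predictor on all the others."""
    out = []
    for j in range(len(X[0])):
        target = [row[j] for row in X]
        others = [[v for k, v in enumerate(row) if k != j] for row in X]
        out.append(1.0 / (1.0 - r_squared(others, target)))
    return out


rng = random.Random(42)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [v + 0.1 * rng.gauss(0, 1) for v in x1]  # nearly a copy of x1
for name, value in zip(["x1", "x2", "x3"], vif(list(zip(x1, x2, x3)))):
    print(name, round(value, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of redundancy. The deck does not state which threshold, if any, was applied.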
SOC decreases as topographic wetness index
(twi) and temperature (xTemp) increase;
SOC increases with vegetation (xEVI)
R2 0.28
After hundreds of runs and intuition-guided variable drops, we found an
interpretable (linear) model with only 3 predictors that does not sacrifice much
prediction capacity (we use 10-fold cross-validation to avoid overfitting in performance
evaluation, R2 0.28)
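A minimal sketch of the 10-fold cross-validation idea: every observation is predicted by a model that never saw it, and R2 is computed on those held-out predictions. To keep it short, it uses a single-predictor linear model on synthetic data rather than the three-predictor model on this slide. All names and numbers are illustrative assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r2(obs, pred):
    """Coefficient of determination of predictions against observations."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def cross_validated_r2(x, y, k=10, seed=0):
    """Pool the out-of-fold predictions from k folds and score them once."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    pred = [None] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            pred[i] = a + b * x[i]
    return r2(y, pred)


rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 - 0.5 * xi + rng.gauss(0, 1) for xi in x]  # synthetic, not SOC data
print("10-fold CV R2:", round(cross_validated_r2(x, y), 2))
```

For least squares, the cross-validated R2 cannot exceed the in-sample R2, which is why it is the more conservative figure to report.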
Linear models (median of 500 realizations)
R2 0.28
(0.281, 0.283)
Each realization was generated with an independent random split of the available
data into training and test sets (70% and 30%)
Linear models (SD of 500 realizations)
~1 hr
We use the standard deviation (SD) of all model realizations as a measure of
uncertainty. We also benchmark run time to compare performance
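The realization scheme can be sketched as follows: refit the model on many random 70/30 splits, predict the same target locations each time, then summarize each location by the median and SD across realizations. The single-predictor model and the synthetic data are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r2(obs, pred):
    """Coefficient of determination of predictions against observations."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def split_realizations(x, y, x_new, n_real=500, train_frac=0.7, seed=0):
    """Median and SD of predictions at x_new over random train/test splits."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(x)))
    preds = [[] for _ in x_new]
    scores = []
    for _ in range(n_real):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        # Score each realization on its own held-out 30%.
        scores.append(r2([y[i] for i in test], [a + b * x[i] for i in test]))
        for j, xn in enumerate(x_new):
            preds[j].append(a + b * xn)
    median_map = [statistics.median(p) for p in preds]
    sd_map = [statistics.stdev(p) for p in preds]
    return median_map, sd_map, statistics.median(scores)


rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [2.0 - 0.5 * xi + rng.gauss(0, 1) for xi in x]  # synthetic, not SOC data
median_map, sd_map, median_r2 = split_realizations(x, y, x_new=[1.0, 5.0, 9.0])
print("median:", [round(v, 2) for v in median_map])
print("SD:", [round(v, 3) for v in sd_map])
print("median test R2:", round(median_r2, 2))
```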
Bayesian multivariate model R2 0.28 (0.27, 0.30)
Median
SD
~30 hrs
Bayesian analysis aims to describe the moments of a given distribution of data and
simulate them (on a probabilistic basis) across the predictor domain
Kernel weighted nearest neighbors (500 realizations) R2 0.39 (0.37, 0.41)
Median
SD
~3 hrs
Kknn is a fast pattern recognition technique that generates reasonable
results; it uses a kernel function to convert distances into weights and, for regression,
averages the neighbors using those weights
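A minimal sketch of kernel-weighted nearest-neighbor regression, loosely following the scheme used by kknn: distances to the k nearest neighbors are scaled by the distance to the (k+1)-th neighbor, passed through a kernel, and used as weights in an average. The triangular kernel is an assumption, since the deck does not state which kernel was used, and the function name is hypothetical.

```python
import math


def kknn_predict(train_x, train_y, query, k=3):
    """Kernel-weighted k-nearest-neighbor regression with a triangular kernel.

    train_x and query are coordinate tuples; train_y holds the target values.
    """
    dists = sorted((math.dist(p, query), y) for p, y in zip(train_x, train_y))
    neighbours = dists[:k]
    # Scale by the distance to the (k+1)-th neighbor so that the scaled
    # distances of the k nearest neighbors lie in [0, 1].
    scale = dists[k][0] if len(dists) > k else neighbours[-1][0]
    if scale == 0:
        weights = [1.0] * len(neighbours)
    else:
        weights = [max(0.0, 1.0 - d / scale) for d, _ in neighbours]
    if sum(weights) == 0:
        # All neighbors are as far away as the (k+1)-th: use a plain mean.
        weights = [1.0] * len(neighbours)
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbours))
    return weighted_sum / sum(weights)


train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [0.0, 1.0, 2.0, 3.0]
print(kknn_predict(train_x, train_y, (1.5,), k=2))
```

Closer neighbors receive larger weights, so a prediction made exactly at an observed point is dominated by that observation.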
Random forests (500 realizations) R2 0.49 (0.47, 0.51)
Median
SD
~11 hrs
Random forests is a nonparametric ensemble of multiple decision trees generated by
means of ‘bagging’
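To illustrate bagging only, the sketch below fits one-split regression trees (stumps) to bootstrap resamples and averages their predictions. It is not a full random forest: real implementations grow deep trees and also subsample predictors at each split. All names and the toy step data are hypothetical.

```python
import random


def fit_stump(x, y):
    """One-split regression tree on a single predictor.

    Returns (threshold, left_mean, right_mean). The threshold is None when
    no split improves on the overall mean.
    """
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mean_all = sum(ys) / len(ys)
    best = (sum((v - mean_all) ** 2 for v in ys), None, mean_all, mean_all)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical x values
        left, right = ys[:i], ys[i:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, ml, mr)
    return best[1], best[2], best[3]


def predict_stump(stump, x_new):
    threshold, ml, mr = stump
    if threshold is None:
        return ml
    return ml if x_new < threshold else mr


def bagged_predict(x, y, x_new, n_trees=50, seed=0):
    """Average the predictions of stumps fitted to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    preds = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        stump = fit_stump([x[i] for i in sample], [y[i] for i in sample])
        preds.append(predict_stump(stump, x_new))
    return sum(preds) / len(preds)


x = list(range(20))
y = [0.0 if xi < 10 else 1.0 for xi in x]  # toy step function
print(bagged_predict(x, y, 2), bagged_predict(x, y, 17))
```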
Regression kriging of random forest
ISRIC - Global Soil Information Facilities
GSIF better captures
the highest-value tail
(>1 g cm-2) of the observations
Our best method was validated using 10-fold cross-validation in a geostatistical
framework, thanks to ISRIC facilities for automated mapping https://cran.r-project.org/web/packages/GSIF/index.html
ISRIC-GSIF provides a flexible platform for
transparent digital soil mapping capacity
building (top-down and bottom-up)
Median of lm, Bayes lm, kknn, and rf (2000
realizations)
Median of lm, Bayes lm, kknn, and rfGSIF
(2000 realizations)
SD of all models (2000 realizations)
The median of all models balances
uncertainty and prediction capacity on new data
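Combining the models is a per-pixel summary: take the median across models as the prediction and the SD across models as the spread. In the sketch below each model contributes one value per pixel. The deck pools all 2000 realizations, which works the same way with one entry per realization. The function name and the toy numbers are hypothetical.

```python
import statistics


def combine_models(predictions_by_model):
    """Per-pixel median and SD across models.

    predictions_by_model maps a model name to a list of predictions,
    one value per pixel, in the same pixel order for every model.
    """
    names = list(predictions_by_model)
    n_pixels = len(predictions_by_model[names[0]])
    if any(len(predictions_by_model[m]) != n_pixels for m in names):
        raise ValueError("all models must predict the same pixels")
    median_map, sd_map = [], []
    for i in range(n_pixels):
        values = [predictions_by_model[m][i] for m in names]
        median_map.append(statistics.median(values))
        sd_map.append(statistics.stdev(values))
    return median_map, sd_map


toy = {  # toy SOC predictions for two pixels, not real results
    "lm": [0.4, 1.0],
    "bayes_lm": [0.5, 1.0],
    "kknn": [0.6, 1.0],
    "rf": [0.9, 1.0],
}
median_map, sd_map = combine_models(toy)
print(median_map, sd_map)
```

The median is less sensitive than the mean to a single model that predicts extreme values, which is one reason to prefer it when the models disagree.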
This is a multiscale inference system (Dover,
DE, USA; SOC at 10 m pixel size)
Pending items and work in
progress:
• Updating stocks per country and land use (using the North American
Land Cover and Classification System)
• Manuscript writing (finishing first draft, committed to August 2016)
Ideas, comments, and expressions of interest are very welcome!