Higgs.etal.JSM.2006 - Colorado State University

Bayesian modeling for ordinal substrate size using EPA stream data
A spatial model for ordered categorical data
Megan Dailey Higgs
Jennifer Hoeting
Brian Bledsoe*
Department of Statistics, Colorado State University
*Department of Civil Engineering, Colorado State University
Substrate size in streams
► Influences in-stream physical habitat
► Often indicative of stream health
► EPA collected data at 485 sites in Washington and Oregon between 1994 and 2004
Data Collection Protocol
► At a site:
  ▪ 11 transects x 5 points along each transect
  ▪ Choose the particle under the sharp end of a stick
  ▪ Visually estimate and classify its size
Creating the response
► For a site:
  ▪ Transform the original size classes to log10(geometric mean) for all sample points
  ▪ Find the median for the site
► Geometric mean
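A minimal sketch of the two steps above, assuming the geometric mean is taken over the lower and upper diameter bounds of each size class. The class names and bounds in `SIZE_CLASSES_MM` are hypothetical placeholders, not the EPA protocol values.

```python
import math
import statistics

# Hypothetical size classes: (lower, upper) diameter bounds in mm.
# Illustrative only; these are NOT the EPA protocol classes.
SIZE_CLASSES_MM = {
    "sand": (0.06, 2.0),
    "fine_gravel": (2.0, 16.0),
    "coarse_gravel": (16.0, 64.0),
    "cobble": (64.0, 250.0),
}

def log10_geometric_mean(lower, upper):
    """log10 of the geometric mean of a class's two bounds (an assumption)."""
    return math.log10(math.sqrt(lower * upper))

def site_response(classes_at_site):
    """Median of log10(geometric mean) over all sample points at one site."""
    values = [log10_geometric_mean(*SIZE_CLASSES_MM[c]) for c in classes_at_site]
    return statistics.median(values)

# One site: 11 transects x 5 points = 55 classified particles (made-up counts).
points = ["sand"] * 10 + ["fine_gravel"] * 25 + ["cobble"] * 20
y = site_response(points)
```

Because each point can take only as many values as there are size classes, the site median also falls on a small set of values, which is why the response behaves as an ordered categorical variable.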
The response
► Yi = median[log10(geometric mean)] for site i
► Transformation provides a more symmetric, continuous-like variable
  ▪ Typically modeled as a continuous variable
  ▪ Predictive models have performed poorly
► Response is an ordered categorical variable
  ▪ 12 categories (6 with very few observations)
Ordered categorical data
► Yi is a categorical response variable with K ordered values: {1,…,K}
► Modeling objectives:
  ▪ Explain the variation in the ordered response from covariate(s)
  ▪ Incorporate the spatial dependence
  ▪ Estimate, predict, and create maps of Pr(Yi ≤ k) and Pr(Yi = k)
Formulating the spatial model
Non-spatial model for ordered categorical data
  • Albert & Chib (1993, 1997)
+
Spatial model for binary and count data
  • Diggle, Tawn, & Moyeed (1998)
  • Gelfand & Ravishanker (1998)
  • Generalized geostatistical models with a latent Gaussian process
  • Metropolis-Hastings within Gibbs sampling approach
=
Spatial model for ordered categorical data
Latent variable formulation
► Define a latent variable, Zi, such that Zi = Xi’β + εi
  ▪ εi ~ N(0,1) for the probit model
  ▪ εi ~ Standard Logistic for the logit model
► Define the categorical response, Yi ∈ {1,…,K}, using Zi and ordered cut-points, θ = (θ1 , … ,θK-1), where 0 = θ1 < θ2 < … < θK-1 < θK = ∞:
    Yi = 1   if   Zi < θ1
    Yi = k   if   θk-1 ≤ Zi < θk   (k = 2, …, K-1)
    Yi = K   if   Zi ≥ θK-1
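A short sketch of the rule above for the probit case. The cut-points and coefficients are hypothetical values chosen for illustration; `categorize` and `simulate_response` are names introduced here.

```python
import bisect
import random

def categorize(z, cutpoints):
    """Map latent z to Y in {1,...,K} given finite cut-points theta_1 < ... < theta_{K-1}.

    Y = 1 if z < theta_1; Y = k if theta_{k-1} <= z < theta_k; Y = K if z >= theta_{K-1}.
    """
    # bisect_right counts the cut-points that are <= z; the category is that count + 1.
    return bisect.bisect_right(cutpoints, z) + 1

def simulate_response(x, beta, cutpoints, rng):
    """Draw Z = x'beta + eps with eps ~ N(0,1) (probit case), then categorize it."""
    z = sum(xj * bj for xj, bj in zip(x, beta)) + rng.gauss(0.0, 1.0)
    return categorize(z, cutpoints)

theta = [0.0, 0.8, 1.7]   # hypothetical cut-points, so K = 4
beta = [0.5, 1.0]         # hypothetical intercept and slope
rng = random.Random(1)
ys = [simulate_response([1.0, rng.uniform(-1.0, 1.0)], beta, theta, rng)
      for _ in range(5)]
```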
Latent variable formulation
► Thus,
    Pr(Yi ≤ k | θ, β) = Pr(Zi < θk)
    Pr(Yi = k | θ, β) = Pr(θk-1 ≤ Zi < θk)
  ▪ If Zi ~ N(Xi’β, 1), then
    Pr(Yi ≤ k | θ, β) = Φ(θk – Xi’β)
    Pr(Yi = k | θ, β) = Φ(θk – Xi’β) – Φ(θk-1 – Xi’β)
    where Φ is the N(0,1) cdf and θ0 = –∞
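The probit probabilities above can be evaluated directly; the sketch below uses only the standard library. The cut-points and the linear predictor `eta` (standing in for Xi’β) are hypothetical values.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the N(0,1) cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cumulative_probs(eta, cutpoints):
    """Pr(Y <= k) = Phi(theta_k - eta) for k = 1..K, using theta_K = +infinity."""
    return [std_normal_cdf(t - eta) for t in cutpoints] + [1.0]

def category_probs(eta, cutpoints):
    """Pr(Y = k) as differences of consecutive cumulative probabilities."""
    cum = cumulative_probs(eta, cutpoints)
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

theta = [0.0, 0.8, 1.7]   # hypothetical cut-points (theta_1 = 0), K = 4
eta = 0.5                 # hypothetical value of the linear predictor
probs = category_probs(eta, theta)
```

The K category probabilities sum to one by construction, since the last cumulative probability is fixed at 1.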
Spatial cumulative model
► Zi = Xi’β + Wi + εi
  ▪ where Zi is the latent variable and
    εi ~ N(0,1)
    W ~ N(0, σ²H(δ))
    (H(δ))ij = ρ(si – sj; δ)
    Zi | β, Wi ~ N(Xi’β + Wi , 1)
► Pr(Yi ≤ k | β, θ, Wi) = Pr(Zi < θk) = Φ(θk – Xi’β – Wi)
  where θ = (θ1 , … ,θK) is a vector of cut-points such that 0 = θ1 < θ2 < … < θK-1 < θK = ∞
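As a sketch of how these conditional probabilities combine, the code below evaluates Pr(Yi = k | β, θ, Wi) and sums the log probabilities over sites. It assumes the responses are conditionally independent given W, which is an assumption of this sketch rather than a formula taken from the slides; all numeric values are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the N(0,1) cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def category_prob(y, eta, w, cutpoints):
    """Pr(Y = y | beta, theta, W) with eta = x'beta, spatial effect w,
    and finite cut-points [theta_1, ..., theta_{K-1}]."""
    K = len(cutpoints) + 1
    upper = 1.0 if y == K else std_normal_cdf(cutpoints[y - 1] - eta - w)
    lower = 0.0 if y == 1 else std_normal_cdf(cutpoints[y - 2] - eta - w)
    return upper - lower

def log_likelihood(ys, etas, ws, cutpoints):
    """Sum of log Pr(Y_i = y_i | ...) over sites (conditional independence assumed)."""
    return sum(math.log(category_prob(y, e, w, cutpoints))
               for y, e, w in zip(ys, etas, ws))

theta = [0.0, 0.8, 1.7]    # hypothetical cut-points, K = 4
ys = [1, 3, 4]             # hypothetical responses at three sites
etas = [-0.2, 0.9, 1.5]    # hypothetical x'beta values
ws = [0.1, -0.3, 0.6]      # hypothetical spatial effects
ll = log_likelihood(ys, etas, ws, theta)
```

A positive Wi shifts probability toward the higher categories at that site, which is how the spatial process enters the ordinal response.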
Fitting the spatial model
► The likelihood
► Estimating β = (β0, β1), γ = (σ², δ), θ = (θ2, … ,θK-1)
► Transform θ to real-valued, unrestricted cut-points: α = (α2 , ... , αK-1), where
    α2 = log(θ2)
    αk = log(θk – θk-1), k = 3, …, K-1
► MCMC sampling
  ▪ Metropolis-Hastings within Gibbs sampling
  ▪ Priors:
    • β – flat and conjugate Normal
    • σ² and δ – independent uniform priors
    • α – multivariate normal
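The cut-point transform can be sketched as a pair of inverse functions. With the first cut-point fixed at 0, the log of the second cut-point and the logs of the successive gaps are unrestricted, so a proposal on that scale always maps back to increasing cut-points. The function names and numeric values are illustrative.

```python
import math

def theta_to_alpha(free_theta):
    """Map the free cut-points [theta_2, ..., theta_{K-1}] (theta_1 = 0 fixed)
    to unrestricted values: log(theta_2), then log of each successive gap."""
    alpha = [math.log(free_theta[0])]
    alpha += [math.log(b - a) for a, b in zip(free_theta, free_theta[1:])]
    return alpha

def alpha_to_theta(alpha):
    """Inverse map: cumulative sums of exp(alpha) give increasing, positive cut-points."""
    theta, total = [], 0.0
    for a in alpha:
        total += math.exp(a)
        theta.append(total)
    return theta

free_theta = [0.8, 1.7, 3.0]            # hypothetical theta_2..theta_4 (K = 5)
alpha = theta_to_alpha(free_theta)
recovered = alpha_to_theta(alpha)       # round trip back to the cut-points
proposal = alpha_to_theta([-3.0, 2.0, 0.1])   # any real vector is valid
```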
Simulated data
► Simulated data at a subset of the original locations (n = 82)
  ▪ Cluster infill around the 82 sites (n = 120)
  ▪ Spatial process:
    • W is a stationary Gaussian process with E[W(s)] = 0 and Cov[W(si),W(sj)] = σ²ρ(si – sj; δ)
    • Exponential correlation function: ρ(d) = exp(–δd), where d is the distance between sites
  ▪ Covariate:
    • Distance-weighted stream power
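One way to simulate such a spatial process is through a Cholesky factor of the covariance matrix. The sketch below assumes Euclidean distance and uses made-up site coordinates and parameter values (`sigma2` for the process variance, `delta` for the decay parameter); it is not the simulation code behind the slides.

```python
import math
import random

def exp_covariance(sites, sigma2, delta):
    """Cov[W(s_i), W(s_j)] = sigma2 * exp(-delta * distance), Euclidean distance assumed."""
    return [[sigma2 * math.exp(-delta * math.dist(si, sj)) for sj in sites]
            for si in sites]

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive-definite matrix a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def simulate_w(sites, sigma2, delta, rng):
    """Draw W ~ N(0, sigma2 * H(delta)) as L e, with e a vector of iid N(0,1) draws."""
    L = cholesky(exp_covariance(sites, sigma2, delta))
    e = [rng.gauss(0.0, 1.0) for _ in sites]
    return [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(len(sites))]

rng = random.Random(42)
# Hypothetical coordinates for 82 sites on a 10 x 10 region.
sites = [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)) for _ in range(82)]
w = simulate_w(sites, sigma2=2.0, delta=0.5, rng=rng)
```

Swapping `math.dist` for a stream-based distance would change only `exp_covariance`, although a valid covariance matrix is then no longer guaranteed.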
Preliminary Results
► Posterior quantities
  ▪ Based on 1000 iterations (burn-in = 1000)
Posterior mean of the spatial process
Posterior SD of the spatial process
Posterior mean and SD for Pr(Yi = 2)
Posterior mean and SD for Pr(Yi = 5)
Posterior mean and SD for Pr(Yi ≤ 5)
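Posterior summaries of this kind can be computed from MCMC output by evaluating the probability at every retained draw and then taking the mean and SD. The sketch below does this for Pr(Yi ≤ k) at one site; the draws are synthetic placeholders generated in the code, not output from the fitted model, and `posterior_summary` is a name introduced here.

```python
import math
import random
import statistics

def std_normal_cdf(x):
    """Phi(x), the N(0,1) cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def posterior_summary(draws, k, burn_in):
    """Posterior mean and SD of Pr(Y_i <= k) for one site, for k <= K-1.

    draws: one (eta_i, w_i, cutpoints) tuple per MCMC iteration, where
    eta_i = x_i'beta and cutpoints = [theta_1, ..., theta_{K-1}].
    """
    kept = draws[burn_in:]   # discard the burn-in iterations
    probs = [std_normal_cdf(theta[k - 1] - eta - w) for eta, w, theta in kept]
    return statistics.mean(probs), statistics.stdev(probs)

# Synthetic placeholder draws (NOT real model output), 2000 iterations.
rng = random.Random(0)
theta = [0.0, 0.8, 1.7]
draws = [(0.5 + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.3), theta)
         for _ in range(2000)]
mean_p, sd_p = posterior_summary(draws, k=2, burn_in=1000)
```

Repeating the calculation at every site gives the values that would be plotted in a map of posterior means or SDs.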
Future Work
► Convergence and mixing for the spatial model
► Models and methods for large data sets
  ▪ Spectral parameterization of the spatial process
    • Wikle (2002), Paciorek & Ryan (2005), Royle & Wikle (2005)
  ▪ Importance sampling
    • Gelfand & Ravishanker (1998), Gelfand, Ravishanker, & Ecker (2000)
  ▪ Sub-sampling
► Investigate different spatial correlation functions and distance metrics
  ▪ Traditional
  ▪ Stream based
► Model selection for the spatial model
Funding and Affiliations
The work reported here was developed under the STAR Research
Assistance Agreement CR-829095 awarded by the U.S.
Environmental Protection Agency (EPA) to Colorado State
University. This presentation has not been formally reviewed by
EPA. The views expressed here are solely those of the authors
and STARMAP, the Program they represent. EPA does not endorse
any products or commercial services mentioned in this
presentation.
Megan’s research is also partially supported by the PRIMES
National Science Foundation Grant DGE-0221595003.
Thank you
Subset of data (nsmall = 82)
Sample path plot – Example
Surface for estimating γ = (σ², δ)
Sample path plot – Avoiding plateau