
High Resolution Bayesian Space-Time
Modelling for Ozone Concentration
Levels
Sujit Sahu
Southampton Statistical Sciences Research Institute,
University of Southampton
North Carolina State University, Raleigh:
September 2011
Collaborators: Shuvo Bakar, Alan Gelfand and Stan Yip.
Outline:
1. Ozone and Ozone standards.
2. Models for evaluating the primary standard.
3. Models for evaluating the secondary standard.
4. Software for implementing the models.
5. Discussion
Will be used as a reminder slide...
Sujit Sahu
2
Ozone pollution
Ozone high up is good, ozone down below is bad.
Ground level ozone: bad health effects: respiratory,
lung function, coughing, throat irritation, congestion,
bronchitis, emphysema, asthma.
Meteorological conditions affect ozone production:
sunlight, high temperature, wind direction and wind
speed, and possibly others.
So, the high ozone season is primarily from May to
September.
Ozone is a secondary pollutant.
Ozone production
Sunlight + VOC + NOx = Ozone.
VOCs (Volatile Organic Compounds): organic gases,
but really “chemicals that participate in the formation
of ozone.”
Two standards for ozone levels
Standards set by the US Environmental Protection
Agency (USEPA).
Primary standard
The true value of the 3-year rolling average of the
annual 4th highest daily 8-hour maximum should be
less than 85 parts per billion (ppb). (Current value is
75 after revision in 2007).
Protects human health.
Secondary standard
The true value of an annual index called W126
should be less than 21 parts per million (ppm).
W126 measures total high exposure in a year.
Protects human welfare like vegetation, crops etc.
The primary ozone standard
Definitions
1. Basic readings = ozone level (in ppb) at each hour.
2. Calculate averages of 8 successive hourly levels at each hour;
   e.g., the 8-hour average at 5PM is the average of the readings
   from 1PM-4PM, 5PM and 6PM-8PM, i.e. the 4 hours before the
   hour, that hour and 3 hours after.
3. Daily 8-hour maximum average = maximum of the 24
   8-hour averages in a day.
4. Calculate the annual 4th highest daily 8-hour maximum.
5. Average the three most recent years' annual 4th
   highest daily 8-hour maximums. Final Statistic!
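As a concrete illustration, the five steps can be coded directly. This is a minimal Python sketch (the talk's own software is in R); regulatory rounding and data-completeness rules are omitted, and, as a simplifying assumption, each 8-hour average is indexed by its starting hour rather than by the 4-before/3-after convention above.

```python
import numpy as np

def primary_design_value(hourly_by_year):
    """3-year average of the annual 4th-highest daily 8-hour maximum.

    hourly_by_year: list of 1-D arrays of hourly ozone readings (ppb),
    one array per year, each of length a multiple of 24.
    """
    annual_4th = []
    for hourly in hourly_by_year:
        hourly = np.asarray(hourly, dtype=float)
        # 8-hour moving averages, one per starting hour
        avg8 = np.convolve(hourly, np.ones(8) / 8, mode="valid")
        days = hourly.size // 24
        # daily maximum of the 8-hour averages starting within each day
        daily_max = [avg8[d * 24 : min((d + 1) * 24, avg8.size)].max()
                     for d in range(days)]
        annual_4th.append(sorted(daily_max, reverse=True)[3])
    # 3-year rolling average over the most recent three years
    return float(np.mean(annual_4th[-3:]))
```

For constant readings the design value simply reproduces that constant, which gives a quick sanity check of the pipeline.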
The secondary ozone standard
Definitions
1. Transform each hour's O3 level (in ppm):
   O3 / (1 + 4403 × e^(−126 × O3)).
2. Calculate the daily index by summing the above
   (rounded to three decimal places) over the 12 daylight
   hours from 8AM to 7PM.
3. Sum the daily indices in a month to obtain the
   monthly index.
4. Calculate the three-month running totals of the monthly
   indices.
5. Obtain the maximum of the three-month running
   totals in the ozone season; this is W126. Final Statistic!
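These five steps can likewise be sketched in a few lines. A Python toy version, with months approximated by 30-day blocks (a simplification; the real index uses calendar months) and hour 0 taken as midnight:

```python
import numpy as np

def w126(hourly_ppm):
    """W126 sketch: hourly_ppm has shape (n_days, 24), ozone in ppm,
    with n_days a multiple of 30 (months approximated as 30-day blocks)."""
    q = np.asarray(hourly_ppm, dtype=float)
    # weighted hourly values
    w = q / (1.0 + 4403.0 * np.exp(-126.0 * q))
    # daily index: daylight hours 8AM-7PM inclusive (columns 8..19)
    daily = np.round(w[:, 8:20].sum(axis=1), 3)
    # monthly indices over 30-day blocks, then 3-month running totals
    monthly = daily.reshape(-1, 30).sum(axis=1)
    running = monthly[:-2] + monthly[1:-1] + monthly[2:]
    return float(running.max())
```

An all-zero season gives W126 = 0, while a season held at 0.10 ppm every daylight hour gives a large index, consistent with the weighting described next.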
Why the logistic transformation?
The transformation effectively sets all values less
than 0.05 ppm (or 50 ppb) to zero.
Values above 0.10 ppm (or 100 ppb) are unchanged.
[Figure: the W126 weighting plotted against O3 in ppm, both axes 0.00-0.20.]
Will work with the daily index:
Z(s, t) = Σ_{k=8AM}^{7PM} Q_k(s, t) / (1 + 4403 × exp(−126 × Q_k(s, t))),
where Q_k(s, t) is the k-th hourly concentration on day t.
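A quick numerical check of the two claims above, using the weighting function exactly as defined:

```python
import math

def w(o3_ppm):
    # the hourly W126 weighting from the slide
    return o3_ppm / (1.0 + 4403.0 * math.exp(-126.0 * o3_ppm))

low = w(0.04)    # well below 0.05 ppm: effectively zero
high = w(0.12)   # above 0.10 ppm: very nearly the identity
```

At 0.04 ppm the weight is of the order 0.001, while at 0.12 ppm it is within about 0.0002 of the input, so the function acts as a smooth threshold.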
High spatial resolution
[Figure: map of ozone monitoring site locations.]
Both the standards are site specific.
However, typically ozone levels are monitored only at
a handful of sites!
Need a spatial model for interpolation.
Does the location where I live comply or not?
High temporal resolution
Both the standards are calculated at annual levels!
But it is advantageous to model data on a higher
temporal resolution. More about this later.
The 8-hour averages encourage normality.
Can avoid extreme value theory!
Will model daily data for both the standards!
Need to use time series analysis methods!
The primary challenge here is in modelling and
analysis of large space-time data sets using
hierarchical Bayesian models.
The space-time modelling work that we do has not
been attempted previously.
Objectives in modelling
Compliance, Trend and Pattern
Find the probability that any given site is
non-compliant with respect to the standards.
Is there an overall trend? Increasing? Decreasing?
Meteorological variables affect ozone levels. How to
assess trend after adjusting for them?
Explore spatial pattern in ozone levels: contrast rural
and urban areas.
In addition, there is the 8-hr average forecasting
problem.
The CMAQ model
CMAQ = Community Multi-scale Air Quality Model.
A computer simulation model which produces
“averaged” output on 36km, 12km (used here), and
now 4km grid cells in some cities.
Uses variables such as power station emission
volumes, meteorological data, land-use, etc. with
atmospheric science (appropriate differential
equations) to predict pollution levels. Not driven by
monitoring station data.
Outputs are biased but have no missing data.
Monitoring data provide more accurate
measurements, but are missing for some sites.
Digression to Forecasting
http://airnow.gov provides forecasts for the:
next day’s 8-hour maximum.
8-hour average map at the current hour.
[Figure: RMSEs of real-time forecasts of 8-hour averages (CMAQ, Regression and Bayes methods) by hour of day (EDT), for 694 sites during July 8-14, 2010. Forecasts used data from the last 7 days up to 3 hours before the forecasted hour.]
Outline:
1. Ozone and Ozone standards.
2. Models for evaluating the primary standard.
3. Models for evaluating the secondary standard.
4. Software for implementing the models.
5. Discussion
Data for the primary standard
Work with daily data in the high ozone season, May to
September (153 days), in each of the 10 years
1997-2006.
No need to model data for the remaining months
since we only need the annual 4th highest maximum.
This gives rise to ‘gaps’ in time series, between
successive years!
Covariates
Have: (1) temperature, (2) windspeed and (3) relative
humidity data from many meteorological stations.
We spatially interpolate those for the monitoring
stations and prediction locations.
Preliminaries in modelling
Apply transformation: Try square root or log.
Observed data = Zl (s, t), s = (long, lat)′ , at n sites.
l = 1, . . . , r denotes year, r = 10.
t = 1, . . . , T denotes day, T = 153.
Data are observed at n = 700 sites s1 , . . . , sn .
Covariates: xl (s, t), (temperature, wind speed and
relative humidity).
Need to model approximately 1 million
(700 × 153 × 10) observations.
Hierarchical models (Sahu et al., 2007)
Measurement error model:
Z_l(s, t) = O_l(s, t) + ε_l(s, t), ε_l(s, t) ~ N(0, σ_ε²).
O_l(s, t) = true value, the underlying space-time process.
The ε_l(s, t) are independent.
σ_ε² is called the 'nugget' effect.
Model for true ozone
O_l(s, t) = ρ O_l(s, t − 1) + x_l(s, t)′β + η_l(s, t)
ρ O_l(s, t − 1): auto-regressive term.
x_l(s, t)′β: adjustment for increment in local meteorology.
η_l(s, t): space-time intercept, independent in time.
Assume η_lt ~ N(0, Σ_η), Σ_η = σ_η² e^(−φ_η D).
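A toy simulation of this two-stage model may help fix ideas. The sizes and parameter values below are made-up illustrations, not estimates from the ozone data:

```python
import numpy as np

rng = np.random.default_rng(42)

n, T = 5, 50                                  # sites, days (one "year")
coords = rng.uniform(0.0, 1.0, (n, 2))
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
rho, sigma_eps, sigma_eta, phi_eta = 0.6, 0.3, 1.0, 1.0
Sigma_eta = sigma_eta**2 * np.exp(-phi_eta * D)   # exponential covariance
L = np.linalg.cholesky(Sigma_eta)

beta = np.array([0.5])                        # one covariate coefficient
x = rng.normal(size=(T, n, 1))                # e.g. centred temperature
O = np.zeros((T, n))                          # true ozone process
for t in range(1, T):
    eta_t = L @ rng.normal(size=n)            # spatially correlated intercept
    O[t] = rho * O[t - 1] + (x[t] @ beta) + eta_t
Z = O + rng.normal(scale=sigma_eps, size=O.shape)   # observed, with nugget
```

Each day's spatial intercept is drawn with the exponential covariance, and the nugget is added only at the observation stage, mirroring the two levels of the hierarchy.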
Hierarchical auto-regressive models
Advantages
These validate well, Cameletti et al. (2011).
They also fit better and have superior theoretical
properties, Sahu and Bakar (2011).
Disadvantages
However, it is computationally prohibitive to
implement the AR models for large data sets.
Known as the big-N problem, Banerjee et al. (2004).
But Banerjee et al. (2008) gave a solution based on a
Gaussian Predictive Process (GPP) approximation.
Models using GPP approximations
Ideally, in addition to the nugget effect, would like to
fit:
Ol (si , t) = xl (si , t)′β + ηl (si , t).
The problem is that then we will have nT spatially
correlated ηl (si , t), the same number as data.
GPP approximations reduce this number by
considering a smaller number, m ≪ n, of knot
locations, denoted by s*_1, ..., s*_m.
Consider a spatial Gaussian process
η̃ ∗lt = (η̃l (s∗1 , t), . . . , η̃l (s∗m , t)) at the knots.
Based on the above process, obtain a kriged value
η̃l (si , t) at each of the observation sites.
Now fit the model:
Ol (si , t) = xl (si , t)′β + η̃l (si , t).
Details of the GPP approximations
Let C be the n × m covariance matrix with ij-th
element Cov(η̃_l(s_i, t), η̃_l(s*_j, t)) = C(s_i, s*_j), for
i = 1, ..., n, j = 1, ..., m.
Let Σ* be the m × m covariance matrix of the spatial
process at the knots, η̃*_lt = (η̃_l(s*_1, t), ..., η̃_l(s*_m, t)).
Then the kriged η̃_lt = (η̃_l(s_1, t), ..., η̃_l(s_n, t)) is given by:
η̃_lt = C (Σ*)^(−1) η̃*_lt.
Thus η̃_lt provides a GPP approximation for the full-rank
process η_lt.
We shall assume an auto-regressive model for η̃*_lt.
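The knot-based construction can be sketched directly. The exponential covariance and the sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_cov(A, B, sigma2=1.0, phi=2.0):
    # exponential covariance between two point sets (an assumed choice)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return sigma2 * np.exp(-phi * d)

n, m = 200, 25                           # data sites versus knots
sites = rng.uniform(0.0, 1.0, (n, 2))
knots = rng.uniform(0.0, 1.0, (m, 2))

Sigma_star = exp_cov(knots, knots)       # m x m covariance at the knots
C = exp_cov(sites, knots)                # n x m cross-covariance
eta_star = np.linalg.cholesky(Sigma_star) @ rng.normal(size=m)

# kriged GPP values at all n sites: eta_tilde = C (Sigma*)^{-1} eta*
eta_tilde = C @ np.linalg.solve(Sigma_star, eta_star)
```

Only an m × m system is solved, which is the computational point of the approximation; kriging back at the knot locations themselves reproduces the knot values exactly.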
Dynamic η̃*_lt
Auto-regressive model
η̃*_lt = ρ η̃*_{l,t−1} + ω_lt
ρ is the auto-regressive parameter.
ω_lt ~ N(0, Σ*), independently in year l and day t.
Comments
The spatial knot locations do not change in time.
Hence, this can handle cross sectional data where
different locations are sampled at each time point.
It is possible to consider knots in time as well.
Knots do not need to be regularly spaced.
A sensitivity study may be conducted for the knot
selections.
Setting up the GPP approximations
[Figure: fitting sites, validation sites and knot locations in the
eastern US.]
Predictions
At a new site s′ and time t, the model is:
Z_l(s′, t) ~ N(O_l(s′, t), σ_ε²),
where
O_l(s′, t) = x_l(s′, t)′β + η̃_l(s′, t),
and
η̃_l(s′, t) = c′ (Σ*)^(−1) η̃*_lt,
where the j-th element of c is C(s′, s*_j).
Bayesian predictive distributions:
π(z_pred | z_obs) = ∫ π(z_pred | par) π(par | z_obs) d(par),
where par denotes all model parameters.
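In practice this integral is evaluated by composition sampling: draw the parameters from their posterior, then draw z_pred from the data model. A toy normal-mean example (flat prior, known variance; the numbers are illustrative assumptions, not the ozone model):

```python
import numpy as np

rng = np.random.default_rng(1)

sigma2 = 1.0
z_obs = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=100)

# posterior of the mean under a flat prior: N(zbar, sigma2 / n)
mean_draws = rng.normal(z_obs.mean(), np.sqrt(sigma2 / z_obs.size),
                        size=5000)
# one predictive draw per posterior draw of the parameters
z_pred = rng.normal(loc=mean_draws, scale=np.sqrt(sigma2))
```

The predictive draws are centred at the sample mean but slightly more dispersed than the data model alone, reflecting parameter uncertainty.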
Summaries: primary standard
1. The annual 4th highest daily maximum:
   f_l(s) = transformed 4th highest value of {O_l(s, 1), ..., O_l(s, T)}.
2. The 3-year rolling averages:
   g_l(s) = (f_{l−2}(s) + f_{l−1}(s) + f_l(s)) / 3.
3. The meteorology-adjusted summaries:
   h_l(s) = (1/T) Σ_{t=1}^{T} transform{O_l(s, t) − x_l′(s, t)β}.
4. The unadjusted levels:
   u_l(s) = (1/T) Σ_{t=1}^{T} transform(O_l(s, t)).
Inference using MCMC
All these summaries and trend values are simulated
at each MCMC iteration for a large number of
prediction sites.
Using the MCMC iterates we can make inference on
each of these summaries.
For example, to see whether a site complies with the
standards, we can calculate the probability of compliance.
We can also see if a change (reduction) is statistically
significant or not.
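Given MCMC draws of the relevant summary, the non-compliance probability is simply the fraction of draws exceeding the standard. A sketch with made-up draws standing in for the posterior of a site's 3-year rolling average:

```python
import numpy as np

rng = np.random.default_rng(7)

# stand-in for MCMC draws of g_l(s); location and spread are invented
draws = rng.normal(loc=83.0, scale=4.0, size=10_000)

# posterior probability of non-compliance with the 85 ppb standard
p_noncompliant = float(np.mean(draws > 85.0))
```

With these made-up numbers the probability is roughly 0.3; the same one-liner, applied per prediction site, produces the non-compliance maps shown later.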
Annual 4th highest maximums
[Figure: plot of the annual 4th highest maximum daily maximum
8-hr average ozone levels, by year.]
3-year rolling annual 4th highest maximums
[Figure: plot of the three-year rolling average of the annual 4th
highest maximum daily maximum 8-hr average ozone levels, by year.]
Validation mean square errors for daily predictions
[Figure: validation mean square errors (VMSE) by year, 1997-2006.]
Predicted versus the observed 4th highest
[Figure: predicted versus observed annual 4th highest maxima,
1997-2006, with 50-110 ppb on both axes.]
Model based map of the 4th highest max
Year 1997
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 4th highest max
Year 2006
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2001
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2002
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2003
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2004
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2005
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Model based map of the 3-year average
3−Year average for 2006
[Map: model-based predicted values (ppb) at sites across the eastern US.]
Probability of non-compliance of the primary
ozone standard in 2006
Probability that 3−Year average > 85ppb for 2006
[Map: posterior probability that the 3-year average exceeds 85 ppb,
on a 0-1 scale.]
Outline:
1. Ozone and Ozone standards.
2. Models for evaluating the primary standard.
3. Models for evaluating the secondary standard.
4. Software for implementing the models.
5. Discussion
Available data for the secondary standard
Work with daily indices for the 214 days from
April to October in 2006.
October is included because it is the last month of a
three-month window that can contain the high August values.
It is prohibitive to model hourly data from about 600
sites in the eastern US (about 1.5 million observations).
Covariates
Have CMAQ output that matches the data time
scales but on 12 km square grid.
We can get the meteorological data as well.
However, met variables fail to remain significant after
including CMAQ output!
Data scales
[Figure: histograms of (a) raw daily data, (b) square-root daily data,
(c) log daily data, (d) annual raw data.]
What if we just krige the annual indices?
Advantages: Simple spatial only model. Policy
makers do it all the time. Hard to beat using MSE.
Disadvantages: Cannot use learning from temporally
correlated data. Makes an observation missing when
only a few days’ data are missing.
Results from kriging with CMAQ as a covariate:
Tail areas are always under-predicted.
May never estimate anything more than the
standard value 21.
CMAQ has large bias.
[Figure: predictions versus observations; legend: c = CMAQ, * = Kriging.]
A global GLM
Assume Z (s, t) follows a GLM with mean µ(s, t) :
log µ(s, t) = a0 + a1 log x(s, t), t = 1, . . . , T .
x(s, t) is CMAQ output. Should really be x(A, t).
Gamma likelihood with shape ν and scale µ(s, t)/ν, so that the mean is µ(s, t).
Can adopt the Gaussian model instead.
Obviously, a global model won’t work!
Need a site specific model.
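Assuming the shape-scale parametrisation with shape ν and scale µ/ν (so that the mean is µ and the variance is µ²/ν), a quick Monte Carlo check of this Gamma set-up:

```python
import numpy as np

rng = np.random.default_rng(3)

# with shape nu and scale mu/nu the Gamma mean is mu and the variance
# is mu^2/nu, so nu acts as a precision-like parameter around the GLM mean
nu, mu = 5.0, 12.0
z = rng.gamma(shape=nu, scale=mu / nu, size=200_000)
```

The sample mean sits at µ and the sample variance at µ²/ν, confirming that ν only controls the noise level around the modelled mean.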
Site-wise GLM
Assume Z (s, t) follows a GLM with mean µ(s, t) :
log µ(s, t) = a0 (s) + a1 (s) log x(s, t)
Fit the model at each site si and obtain the estimates
of a0 (si ) and a1 (si ).
Do not require the standard errors.
Will treat a0 (s1 ), . . . , a0 (sn ) and a1 (s1 ), . . . , a1 (sn ) as
realisations (observations).
Need bivariate spatial model for the a0 (s) and a1 (s).
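The first stage can be sketched as an ordinary least-squares fit on the log scale at each site (the talk fits a GLM per site; the sizes and "true" coefficients below are invented for illustration, and the toy data are noise-free so the fits are exact):

```python
import numpy as np

rng = np.random.default_rng(5)

n, T = 10, 214
a0_true = rng.normal(0.5, 0.2, size=n)
a1_true = rng.normal(0.8, 0.1, size=n)
x = rng.uniform(1.0, 40.0, size=(n, T))       # CMAQ output per site/day

a0_hat = np.empty(n)
a1_hat = np.empty(n)
for i in range(n):
    # site-wise model: log mu(s,t) = a0(s) + a1(s) log x(s,t)
    z = np.exp(a0_true[i] + a1_true[i] * np.log(x[i]))  # noise-free toy data
    a1_hat[i], a0_hat[i] = np.polyfit(np.log(x[i]), np.log(z), 1)
```

The fitted pairs (a0_hat, a1_hat) are exactly the "observations" that feed the bivariate spatial model on the next slide.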
Bivariate spatial models for slope and
intercept
Define the 2 × 2 covariance matrix at any site s:
B_ij = Cov(a_i(s), a_j(s)), i, j = 0, 1.
Now assume that
Cov(a(s), a(s′)) = ρ(s, s′) B,
where ρ(s, s′) is a valid covariance function.
The covariance matrix of the 2n × 1 vector
a = (a(s_1), ..., a(s_n)) is given by
Σ_a = R ⊗ B,
where (R)_ij = ρ(s_i, s_j) and ⊗ denotes the Kronecker
product.
Specify a Wishart prior distribution for B^(−1).
Assume constant mean surfaces for a_0(s) and a_1(s).
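The separable covariance is a single Kronecker product; R and B below are illustrative choices, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(9)

n = 4
coords = rng.uniform(0.0, 1.0, (n, 2))
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
R = np.exp(-3.0 * d)                 # exponential correlation (a valid choice)
B = np.array([[1.0, 0.3],            # assumed 2 x 2 covariance of
              [0.3, 0.5]])           # (a0(s), a1(s)) at a single site

Sigma_a = np.kron(R, B)              # 2n x 2n separable covariance matrix
```

Since R and B are both positive definite, so is their Kronecker product, which is what makes this a legitimate joint covariance for all 2n coefficients.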
Model fitting and prediction algorithm...
1. Fit the site-wise GLMs, and obtain estimates of
   a_0(s_i) and a_1(s_i) for all i.
2. Model the pairs of observations a_0(s_i) and a_1(s_i) for all
   i using the Bayesian bivariate spatial model.
3. Predict at a new location s′ using MCMC. At the
   j-th MCMC iteration:
   (i) Generate a_0^(j)(s′) and a_1^(j)(s′) for each s′.
   (ii) Invert the site-wise model to obtain:
        µ^(j)(s′, t) = exp{a_0^(j)(s′) + a_1^(j)(s′) log x(s′, t)}.
4. Finally, calculate the annual W126 index value by
   aggregating µ^(j)(s′, t), t = 1, ..., T.
Summaries: secondary standard
1. Recall that we model Z(s, t), the daily index.
2. Calculate the monthly indices:
   M_j(s) = Σ_{t ∈ month j} Z(s, t), j = Apr, ..., Oct.
3. Calculate the three-month running totals:
   M̄_j(s) = Σ_{k=j−2}^{j} M_k(s), j = Jun, ..., Oct.
4. Calculate the maximum value:
   M̂(s) = max_{j=Jun,...,Oct} M̄_j(s).
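These steps can be coded directly from a daily index series. A Python sketch using calendar month lengths for April-October (214 days, matching the data description above):

```python
import numpy as np

# calendar month lengths for April ... October (214 days in total)
month_days = [30, 31, 30, 31, 31, 30, 31]

def w126_from_daily(daily):
    """Monthly indices, 3-month running totals, and their maximum."""
    daily = np.asarray(daily, dtype=float)
    monthly, start = [], 0
    for nd in month_days:
        monthly.append(daily[start:start + nd].sum())   # M_j(s)
        start += nd
    monthly = np.asarray(monthly)
    running = monthly[:-2] + monthly[1:-1] + monthly[2:]  # Jun ... Oct totals
    return float(running.max())                           # M_hat(s)
```

For a constant daily index of 1 the running totals are the 3-month day counts, the largest of which is 92 (any window from May-July onwards).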
Discussion of the method
Need bivariate spatial model for the univariate
problem!
Temporal variation enters only through the
downscaled CMAQ output, x(s, t).
It is a popular technique, currently being adopted by
many authors.
We are interested in out-of-sample predictions; the
justification comes from superiority in validations.
Alternative modelling strategies
Need the predictive process approximations for all
modelling strategies because of the big-N problem.
1. Adopt the space-time GPP approximations for
   daily indices.
   This has been developed for the primary standard.
   If it works, it will be a unified method for evaluating
   both the standards.
2. Krige the annual indices. Results have been shown
   already.
   This has advantages and disadvantages.
Validation of the W126 standard
[Figure: predicted versus observed W126 values for four methods;
legend: c = CMAQ, * = Krige, e = GAM, p = GPP.]
Model predicted map of the W126 standard
in 2006
[Map: model-predicted W126 values at sites across the eastern US.]
Uncertainty map of the predicted W126
standard in 2006
[Map: uncertainty of the predicted W126 values across the eastern US.]
Probability of non-compliance of the W126
standard in 2006
[Map: probability of non-compliance with the W126 standard,
on a 0-1 scale.]
Outline:
1. Ozone and Ozone standards.
2. Models for evaluating the primary standard.
3. Models for evaluating the secondary standard.
4. Software for implementing the models.
5. Discussion
Available R packages
There is no Bayesian R package for analysing
space-time data.
Packages such as geoR; geoRglm; spBayes can
analyse spatial and multivariate spatial data.
The package spBayes is useful for modelling
multivariate spatial data and can fit some
spatio-temporal regression models.
However, spBayes cannot implement well-known
time series models such as the auto-regressive
models.
And it is not easy to work with rich temporal datasets.
spTimer: Spatio-Temporal Bayesian
Modelling using R
Developed by Shuvo Bakar: [email protected]
The code for spTimer is developed using the C
language that is hidden from the R user.
At the moment spTimer can fit, predict and forecast
point referenced space-time data using three
hierarchical Bayesian modelling methods:
Regression models with Gaussian processes (GP).
Hierarchical Auto-regressive (AR) models.
Models based on a GPP approximation.
Computations are performed using MCMC, and users
are given options to select many parameters, on-the-fly
transformations, output aggregation, etc.
Main functions in spTimer
spT.Gibbs is the main routine to fit the models.
spT.prediction is for performing predictions
based on the results obtained from spT.Gibbs.
spT.forecast is to obtain forecasts for future time
points.
spT.priors can be used to specify the prior
distributions and hyper-parameter values.
spT.initials provides the initial values.
spT.decay provides the sampling type for the spatial
decay parameter.
Default choices are provided for the priors and initial
values!
The template for spT.Gibbs
Input for spT.Gibbs
formula, coords, priors, initials.
model: GP, AR, GPP.
scale.transform: NONE, SQRT, LOG.
distance.method: geodetic, euclidean.
cov.fnc: exponential, gaussian,
spherical, matern.
spatial.decay: fixed, discrete or random-walk.
Output from spT.Gibbs
Posterior samples
Predictive Model Choice criterion (PMCC)
and other quantities like computation time.
The prediction and forecasting functions
Input for spT.prediction
pred.data is for the data frame for prediction sites.
pred.coords is the prediction coordinates.
posteriors output from the spT.Gibbs.
Summary: True or False
Input for spT.forecast
K is for the k-step ahead forecast.
forecast.data is for the forecast covariate data
frame.
forecast.coords is the forecast coordinates.
posteriors output from the spT.Gibbs.
Summary: True or False.
Illustrating spTimer...
Example
Ozone monitoring sites
[Figure: the 29 ozone monitoring sites in New York.]
We have daily data for two months, July and August, 2006
collected at 29 ozone monitoring sites.
Will use data from 25 sites to model.
Data from the remaining 5 will be used for validation.
Illustrating spTimer
GP Models:
> priors <- spT.priors(model="GPP", var.prior=Gam(2,1),
    beta.prior=Nor(0,10^4))
> initials <- spT.initials(model="GPP", sig2eps=0.01,
    sig2eta=0.5, beta=NULL, phi=0.001)
> spatial.decay <- spT.decay(type="MH", tuning=0.05)
> post.gpp <- spT.Gibbs(formula = o8hrmax ~ cMAXTMP+WDSP+RH, data,
    model="GP", time.data, coords, priors, initials, tuning,
    its, burnin, distance.method="geodetic:km", cov.fnc =
    "exponential", scale.transform="SQRT", spatial.decay="MH")
Illustration ...
GP Model Summary:
> MCMC.stat(post.gpp, burnin=1000)
# Model: GPP
# Number of Iterations: 5000
# Number of Burn-in: 1000
               Mean  Median     SD Low2.5p Up97.5p
(Intercept)  4.3037  4.2542 0.7773  2.8986  5.9601
cMAXTMP      0.0938  0.0953 0.0220  0.0482  0.1334
WDSP         0.1188  0.1189 0.0229  0.0736  0.1634
RH          -0.1084 -0.1063 0.0716 -0.2544  0.0276
rho          0.1563  0.1563 0.0440  0.0711  0.2432
sig2eps      0.2068  0.2063 0.0111  0.1863  0.2303
sig2eta      0.6389  0.6312 0.0753  0.5074  0.8098
phi          0.0037  0.0036 0.0008  0.0023  0.0055
> MCMC.plot(post.gpp, burnin=1000, package="coda")
Computation time for spT.Gibbs: 1 minute 35 seconds (on a very old DELL
laptop upgraded to 2GB RAM, running the Ubuntu Linux OS).
Illustration ...
GP Model Predictions and Forecast:
> pred.coords <- as.matrix(val.site[,2:3])
> pred.gpp <- spT.prediction(burnin=1000, pred.data=val.dat,
    pred.coords=pred.coords, posteriors=post.gpp,
    tol.dist=2, Summary=TRUE)
> names(pred.gpp)
> spT.validation(val.dat$o8hrmax, c(pred.gpp$Mean))
> spT.pCOVER(val.dat$o8hrmax, c(pred.gpp$Up), c(pred.gpp$Low))
> fore.gpp <- spT.forecast(burnin=1000, K=1, forecast.data,
    forecast.coords, posteriors=post.gpp, Summary=TRUE)
> names(fore.gpp)
Illustration ...

                           spTimer              spBayes
Iterations                 5,000                5,000
Computation time           GP < 1 min.          > 15 hours
                           AR < 1 min.          –
RMSE                       GP 6.87              7.33
                           AR 6.80              –
Model selection criteria   GP 1001.83 (PMCC)    –
                           AR 874.14 (PMCC)     –

RMSE = Root Mean Squared Error
PMCC = Predictive Model Choice Criteria
Outline:
1. Ozone and Ozone standards.
2. Models for evaluating the primary standard.
3. Models for evaluating the secondary standard.
4. Software for implementing the models.
5. Discussion.
Discussion
Constructed high resolution spatio-temporal model
for both the ozone standards.
These are very challenging computational problems.
Methods/results are useful for many purposes, e.g.
evaluating long-term emission reduction policies.
measuring exposure.
Not discussed at all: Fusion with computer model
output, Forecasting.
The R-package spTimer is still under construction.
Can be obtained by request.