
Statistically Evaluating High-Resolution Precipitation Forecasts
in the San Francisco Bay Area
A THESIS
SUBMITTED TO THE DEPARTMENT OF EARTH & CLIMATE SCIENCE
OF SAN FRANCISCO STATE UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF BACHELOR OF SCIENCE
BY
Nicholas Christen
May 2016
Thesis Advisor:
Dr. David Dempsey
Committee Members:
Dr. Alexander Stine
Dr. John Monteverdi
Abstract
We statistically evaluate high-resolution WRF-ARW model precipitation forecasts within
three nested domains across California and the San Francisco Bay Area. Our nested WRF
domains have model grid resolutions of 10 km, 3.3 km, and 1.1 km, and are referred to in this
project as “CenCal,” “BayArea,” and “SFMarin,” respectively. This project
addresses topics including whether model grid resolution has a significant effect on forecast
accuracy, whether 24-hour precipitation forecasts are more accurate closer to or farther from the
model initialization time, and whether there are spatial patterns in forecast error. Our statistical
calculations are based on the distance-weighted mean of 24-hour forecast precipitation
accumulations at the nine model grid points nearest each of a core set of observation stations
that lie within the innermost WRF subdomain. From these forecast-observation
matched pairs, we calculate various measures of forecast error for each station, as well as for the
aggregate of matched pairs across the three WRF subdomains over the course of the 2015-2016
rainfall season (early October – April). Forecasts are classified based on model grid resolution as
SFMarin, BayArea_slim, and CenCal_slim. One-tailed t-tests at the 95% confidence level show
that differences in Mean Absolute Error (MAE) scores between SFMarin and CenCal_slim are
significant, indicating that lower-resolution forecasts are more accurate. These counterintuitive
results may be attributed to the fact that convective-scale precipitation is modeled explicitly in
the 1.1-km domain, while a parameterization scheme is used for both the 3.3- and 10-km
domains. Mean Error (ME) scores show a consistent negative forecast bias. Results also show
that forecast accuracy decreases in the second 24-hour period of forecasts relative to the first.
Analogous tests, based on bootstrapped confidence intervals, show that grid resolution has no
effect on the model’s ability to correctly forecast the occurrence of precipitation events and the
occurrence of nonevents. However, the model systematically forecasts nonevents more reliably
than events. Finally, we observe that the spatial pattern of MAE and ME scales with that
of observed precipitation.
I. Introduction and Background
The Advanced Research Version of the Weather Research and Forecasting (WRF-ARW)
model, developed by the National Center for Atmospheric Research, can make weather forecasts
on user-specified bounded regions in space (domains) with high spatial and temporal resolution
(Skamarock et al., 2008). To make a forecast, the model solves a set of mathematical
equations that describe how the state of the atmosphere, represented on a three-dimensional grid
of points, changes over time through a series of discrete time steps.
Since mid-August 2014, the WRF-ARW model has been run locally to produce 48-hour
forecasts every six hours (i.e., four runs per day, initialized respectively at 00Z, 06Z, 12Z, and
18Z). To calculate a forecast, the model requires two kinds of information: (1) the state of the
atmosphere at the grid of points at the starting time (i.e., initial conditions); and (2) the state of
the atmosphere at grid points on the boundary of the forecast domain at all subsequent forecast
times (i.e., boundary conditions). To provide the necessary initial and boundary conditions to the
WRF-ARW model, we use analyses and forecasts from a lower-resolution weather forecast
model, the 40-km-resolution North American Mesoscale (NAM) model, which is run by the
National Centers for Environmental Prediction four times per day (NCEP, 2014) (Figure 1). A
description of the WRF model is provided in the following section of this thesis.
The overarching motivation for this research is to understand (a) controls on mesoscale
weather patterns in the San Francisco Bay Area, and (b) the ability of a high-resolution model
such as the WRF-ARW to simulate and forecast these patterns. Of particular interest is the
climatological rainfall maximum on the leeward side of Mt. Tamalpais in Marin County
(Appendix I) (Figure 2). In this study, we focus on precipitation patterns in the Bay Area, and
evaluate WRF-ARW precipitation forecasts statistically during the fall and winter of 2015-16.
Section II of this thesis describes the configuration of the WRF-ARW model forecasts
that we use for this study. Section III then poses specific questions about model precipitation
forecast performance in the Bay Area. Section IV describes our choice of statistics, how we
calculate them, and how we use them to test statistical hypotheses. Section V presents and
discusses our results, and Section VI summarizes our conclusions.
II. The WRF Model
The WRF model is configured locally to run in three nested domains, each with specific
grid resolutions, time steps, and model output frequencies (Figure 3). The largest, “parent”
domain (referred to as “CenCal”) has the lowest resolution (10 km) and covers most of
California. A smaller, higher resolution domain (3.3 km, referred to as “BayArea”) is nested
within the parent domain and covers the San Francisco Bay area and surroundings. An even
smaller, yet higher resolution domain (1.1 km, referred to as “SFMarin”) is nested within the San
Francisco Bay Area domain, covering the central San Francisco Bay Area – the northern San
Francisco peninsula, Marin County, and parts of the East Bay.
To initialize WRF-ARW forecasts, we feed the NAM initialization to the WRF
Preprocessing System (WPS) software (Skamarock et al. 2008, Chapters 5 & 6), which
interpolates the NAM initialization onto the WRF-ARW’s higher-resolution grids. However, the
resulting fields do not contain any more detail (smaller-scale features resolvable by WRF’s
higher-resolution grids) than the lower-resolution NAM initialization. To start WRF forecasts with more
detail, we implement a dynamic start, in which we initialize WRF with a NAM initialization six
hours earlier than the intended start time, and then integrate for six hours with boundary
conditions interpolated in time between output from two successive NAM forecast runs,
which are available to us six hours apart. From the intended start time, output from the latest
NAM forecast run provides boundary conditions for the WRF forecast at intervals of six hours
out to 48 hours. (See Appendix II.)
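As an illustration of this timing only, the sketch below lays out the dynamic-start schedule for one run. It assumes NAM output is available every six hours, and the helper name is hypothetical; it is not part of our operational scripts.

    from datetime import datetime, timedelta

    def dynamic_start_schedule(intended_start):
        """Sketch of the 6-hour dynamic start timing for one WRF run (illustrative only)."""
        dynamic_init = intended_start - timedelta(hours=6)      # NAM analysis used to initialize WRF
        spin_up = (dynamic_init, intended_start)                # 6-hour spin-up segment before the intended start
        boundary_times = [intended_start + timedelta(hours=h)   # NAM forecast times supplying boundary conditions
                          for h in range(0, 49, 6)]             # every 6 hours out to 48 hours
        return dynamic_init, spin_up, boundary_times

    # Example: a WRF run intended to start at 00Z on 1 December 2015
    init, spin_up, bc_times = dynamic_start_schedule(datetime(2015, 12, 1, 0))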
III. Research Questions
We address four research questions about WRF-ARW precipitation forecast accuracy:
1) Does the pattern of forecast precipitation resemble the observed spatially variable pattern
in the Bay Area, and how does forecast accuracy vary spatially?
2) Does forecast accuracy depend on model grid resolution? Plots of 1.1 km and 10 km
resolution forecasts of the same rainfall event are shown in Figure 4. The
greater level of detail in the higher-resolution model suggests that its forecasts might be
more accurate, so we will test the statistical null hypothesis that model resolution does
not matter.
3) Does forecast accuracy depend on how long before a 24-hour verification period the
model is initialized? For example, we expect forecasts initialized at the start of a
verification period to be more accurate than those initialized 24 hours earlier. We
therefore test the statistical null hypothesis that the two forecasts do not differ.
4) Does forecast accuracy depend on the time of day of model initialization? We have no
reason to expect any difference, but we will test it anyway.
IV. Statistical Evaluation Methods
A. Measures of Model Forecast Accuracy
1. Continuous Statistics
We calculate Mean Absolute Error (MAE) and Mean Error (ME) (forecast bias). MAE is
the mean magnitude of the difference between forecast and observed precipitation at a set of
weather stations:
\mathrm{MAE} \equiv \frac{1}{n}\sum_{i=1}^{n} \left| F_i - O_i \right|
where Fi is a forecast at the same location and time as an observation, Oi, and n is the number of
these forecast-observation pairs (“matched pairs”).
We define observed and forecast precipitation as 24-hour accumulations starting at 00Z, 06Z,
12Z, or 18Z, the times corresponding to each of the four operational model initialization times.
ME tells us the sign of the forecast error, and is equal to the average forecast value minus
the average observed value. Thus it is useful for identifying negative or positive forecast bias.
\mathrm{ME} \equiv \frac{1}{n}\sum_{i=1}^{n} \left( F_i - O_i \right) = \bar{F} - \bar{O}
We choose to base continuous statistics strictly on days when there is precipitation forecast or
observed, or both. Periods when no precipitation is observed and none is forecast (a correct
forecast) represent the majority of days and therefore tend to dominate the statistics, so we
exclude them to focus on the accuracy of precipitation forecast amounts.
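A minimal sketch of these two measures in Python with NumPy (illustrative only; our statistics are actually computed with the MET software described later in this section), including the restriction to matched pairs with precipitation forecast or observed:

    import numpy as np

    def mae_and_me(forecast_mm, observed_mm, wet_threshold=0.254):
        """MAE and ME (bias) over matched pairs, excluding pairs in which neither
        the forecast nor the observation reaches the wet threshold (0.254 mm)."""
        f = np.asarray(forecast_mm, dtype=float)
        o = np.asarray(observed_mm, dtype=float)
        wet = (f >= wet_threshold) | (o >= wet_threshold)   # precipitation forecast, observed, or both
        err = f[wet] - o[wet]
        return np.mean(np.abs(err)), np.mean(err)           # MAE, ME (negative ME => under-forecasting)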
2. Categorical Statistics
We calculate categorical statistics to evaluate the model’s ability to forecast whether or
not precipitation events will occur, rather than its ability to forecast precipitation amounts.
Categorical statistics are based on counts of the number of correctly forecast precipitation events
and nonevents, as well as incorrect forecasts of each (four categories total). Figure 5 shows a
generalized contingency table that displays these counts and marginal and total sums of the
counts. The individual counts and marginal sums are used to calculate fractional measures of
forecast accuracy. Among the many possible categorical statistics are the probability of
forecasting (“detecting”) precipitation events (PODY = a/(a+c), where “Y” means “yes”) and
the probability of detecting precipitation non-events correctly (PODN = d/(b+d), where “N”
means “no” or “none”). In the interest of time, we chose to calculate and test the significance of
these two statistics and not others.
We use the Developmental Testbed Center’s (DTC) Model Evaluation Tools (MET)
software to calculate both continuous and categorical statistics (“MET Version 4.1 Users
Guide”).
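For reference, a small sketch of how PODY and PODN follow from the contingency counts in Figure 5 (the values reported in this study come from MET; the counts in the example call are hypothetical):

    def pody_podn(a, b, c, d):
        """PODY and PODN from contingency counts:
        a = rain forecast and observed, b = rain forecast but not observed,
        c = rain observed but not forecast, d = no rain forecast or observed."""
        pody = a / (a + c)   # fraction of observed rain events that were forecast
        podn = d / (b + d)   # fraction of observed nonevents that were forecast as dry
        return pody, podn

    print(pody_podn(a=620, b=240, c=310, d=5800))   # hypothetical counts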
B. Observations
To evaluate model forecast accuracy, we need to compare model forecasts to
observations. We use surface observations from weather stations in the Meteorological
Assimilation Data Ingest System (MADIS) UrbaNet and Mesonet4 databases (MADIS, 2014)
(step O1 in Figure 6). We receive and store observations in NetCDF format every hour
automatically via the internet, using Unidata’s Local Data Manager (LDM) software. Incoming
data comprise surface observations from over 4,000 North American stations maintained by
local, state, and federal agencies and private firms (NOAA, 2002). These observations come
from roughly 1,000 stations in the CenCal domain, 200 in the BayArea domain, and 50 in the
SFMarin domain.
We construct 24-hour precipitation accumulations from precipitation accumulations
reported about every six minutes at each weather station. This calculation takes into account two
different accumulated precipitation reporting conventions. Most stations (80 to 90%) report
running 24-hour precipitation totals, while the rest report precipitation accumulation since local
midnight.
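The sketch below illustrates this calculation under simplifying assumptions: each report is a (time, accumulation) pair, the station's reporting convention is known, and the additional quality-control issues described in Appendix III are ignored.

    from bisect import bisect_left

    def total_24h(reports, start, end, convention):
        """Illustrative 24-hour precipitation total from station accumulation reports.

        reports: time-ordered list of (datetime, accumulated_precip_mm) tuples.
        convention: 'running_24h' (station reports a running 24-hour total, so the report
            nearest the end of the period is itself the 24-hour total) or 'since_midnight'
            (station reports accumulation since local midnight, so we difference the reports
            nearest the period start and end; a reset inside the period is not handled here).
        """
        times = [t for t, _ in reports]

        def nearest(target):
            i = bisect_left(times, target)
            candidates = [j for j in (i - 1, i) if 0 <= j < len(reports)]
            return min(candidates, key=lambda j: abs(times[j] - target))

        if convention == 'running_24h':
            return reports[nearest(end)][1]
        return reports[nearest(end)][1] - reports[nearest(start)][1]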
We perform some quality control (Appendix III) in an attempt to obtain a more reliable
subset of observations for our analyses. The locations of surface weather stations remaining in
our database after quality control are shown in Figure 7.
C. WRF Forecasts and Matched Forecast-Observation Pairs
From each 48-hour precipitation forecast initialized at 00Z, 06Z, 12Z, and 18Z, two 24-hour precipitation accumulations are calculated at each grid point in each of the three nested
WRF model domains, beginning 6 and 30 hours after the model dynamic start, respectively. We
refer to the forecast accumulation from 6 to 30 hours after the dynamic start as a 0-24 hr forecast,
and the forecast accumulation from 30 to 54 hours as a 24-48 hr forecast. The calculations take
into account the fact that the model forecasts accumulations since the start of the model run.
These forecast 24-hour accumulations are then interpolated to the locations of
observation stations, using a distance-weighted mean of the values from the nine model grid
points nearest each station (Cressman, 1959) (Figure 8).
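A sketch of one common form of this interpolation, using Cressman (1959)-style weights over the nine nearest grid points (the exact weighting function applied by MET may differ; coordinates are assumed to be in a common projected system):

    import numpy as np

    def distance_weighted_mean(station_xy, grid_xy, grid_vals, n_nearest=9):
        """Distance-weighted mean of forecast values at the grid points nearest a station."""
        d = np.asarray(grid_xy, float) - np.asarray(station_xy, float)
        r = np.hypot(d[:, 0], d[:, 1])                  # distance from the station to every grid point
        idx = np.argsort(r)[:n_nearest]                 # the nine nearest grid points
        R = r[idx].max()                                # influence radius = farthest of the nine
        w = (R**2 - r[idx]**2) / (R**2 + r[idx]**2)     # Cressman (1959) weights
        w = np.clip(w, 1e-6, None)                      # keep the farthest point with a small weight
        return np.sum(w * np.asarray(grid_vals, float)[idx]) / np.sum(w)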
Our dataset of matched forecast-observation pairs spans most of the period from early
October 2015 through April 2016, which roughly corresponds to the 2015-16 rainfall season in
the Bay Area.
D. Matched Pair Aggregation, Test Statistics, and Hypothesis Testing
To address each of our research questions (1) – (4) (Section III), we select a different
subset (aggregation) of matched pairs from the full dataset, as described below.
(1) We focus our questions about spatial patterns of forecasts and forecast accuracy in the
Bay Area on the total precipitation for the 2015-16 rainfall season. To this end, we aggregate
matched pairs for a subset of stations in the SFMarin domain at times from October 2015 to
April 2016 at which: (a) at least some precipitation was reported in the SFMarin domain, and (b)
observations were available from all stations in the subset. There were at least 105 rainfall events
(24-hour periods with rainfall observed somewhere in the SFMarin domain) and (after quality
control) 42 stations reporting precipitation sometime during the October-April period. We want
to include as many 24-hour periods with rainfall at as many stations as possible to capture the
pattern of seasonal total rainfall. However, not all stations reported during all precipitation
events, and there were some precipitation events in which relatively few stations reported. As a
compromise to try to optimize both spatial and temporal coverage of the pattern of seasonal
rainfall, we calculate partial-seasonal precipitation totals at 28 stations, all of which reported
during 67 (of the 105+) rainfall events.
Our analysis to address question (1) starts with a qualitative comparison of plots of
observed and forecast partial-seasonal rainfall and inspection of plots of MAE and ME for each
station in the SFMarin domain (see Section V, “Results and Discussion”).
(2) To test whether or not grid resolution affects forecast accuracy, we aggregate matched
pairs according to the following criteria:
a) Matched pairs come from forecasts on the SFMarin, BayArea, or CenCal domain.
b) Matched pairs are for stations located within the SFMarin domain. We refer to the
subsets of all matched pairs from the larger BayArea and CenCal domains that lie
within the smaller SFMarin domain as BayArea_slim and CenCal_slim,
respectively.
c) Matched pairs are from the period from October 2015 to April 2016.
d) For continuous but not categorical error statistics, matched pairs include a
forecast, observation, or both of at least 0.254 mm (0.01 in., a trace) of
precipitation.
e) Matched pairs come from forecasts initialized at the same time of day (00Z, 06Z,
12Z, or 18Z). We aggregate these separately because 24-hour accumulations
calculated from the WRF runs initialized at these times overlap, and are therefore
not fully independent.
f) The forecast in each matched pair is for the same 24-hour period of the 48-hour
forecast (either 0-24 hours or 24-48 hours after the 6-hour dynamic start).
Based on these criteria, we create 24 aggregations corresponding to combinations of
forecasts from domains with three resolutions, at four initialization times, for two 24-hour
forecast periods in a 48-hour forecast (3 × 4 × 2 = 24). The number of matched pairs (n) in these
aggregations ranges from 2,440 to 2,864 for continuous statistics, and from 6,586 to 7,225 for
categorical statistics (which, unlike continuous statistics, include matched pairs of forecasts and
observations both with zero rainfall).
For each aggregation, we compute the Mean Absolute Error (MAE) and Mean Error
(ME) (Section IV.A.1) and the Probability of Detecting “Yes” (PODY) and “No” (PODN)
(Section IV.A.2). For each initialization time and 24-hour forecast period, we then pose the
statistical null hypothesis that the MAEs for each possible pair of domain resolutions are the
same. We do the same for MEs, PODYs, and PODNs.
To conduct each hypothesis test about the MAE’s, we calculate a t-statistic for each
possible pair of the SFMarin, BayArea_slim, and CenCal_slim aggregations:
t_{n-1} = \frac{\mathrm{MAE}_{dom1} - \mathrm{MAE}_{dom2}}{\sqrt{\dfrac{s_{dom1}^{2}}{n_{dom1}} + \dfrac{s_{dom2}^{2}}{n_{dom2}}}}
where the subscripts “dom1” and “dom2” refer to the members of each aggregation pair; n_dom1
and n_dom2 are the number of matched pairs in each aggregation pair; n is the minimum of n_dom1
and n_dom2; and s_dom1 and s_dom2 are the standard deviations of the absolute errors in each
aggregation pair. We calculate an analogous t-statistic to test the null hypothesis about
equivalence of MEs from different domains.
We begin with two-tailed t-tests at the 0.05 significance level (95% confidence) to determine whether
MAEs or MEs from each domain pair are significantly different. If we reject the null hypothesis,
we then perform a one-tailed t-test to test the hypothesis taking into account the apparent sign of
the difference.
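A sketch of the two-tailed step of this procedure on the absolute errors, using SciPy only to look up the critical value (an assumption of convenience in this sketch; the statistic is the one given above, with n taken as the smaller sample size):

    import numpy as np
    from scipy import stats

    def compare_mae(abs_err_1, abs_err_2, alpha=0.05):
        """Two-sample t-test of the null hypothesis that two aggregations have equal MAE."""
        x1 = np.asarray(abs_err_1, float)
        x2 = np.asarray(abs_err_2, float)
        n1, n2 = len(x1), len(x2)
        t = (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) / n1 + x2.var(ddof=1) / n2)
        df = min(n1, n2) - 1                            # degrees of freedom, as in the formula above
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)     # two-tailed critical value
        return t, t_crit, abs(t) > t_crit               # reject if |t| exceeds the critical value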
To conduct hypothesis tests about the equivalence of PODYs or PODNs for each possible
domain pair, we take advantage of 95% bootstrapped confidence intervals calculated by the MET
software for PODY and PODN. If PODY for one domain in a pair falls outside of the 95%
confidence interval for PODY for the other domain, we reject the null hypothesis, and similarly
for PODN.
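The idea behind these intervals can be sketched as follows (MET computes its intervals internally; this illustrative version resamples matched pairs and takes percentiles of the resampled PODY values):

    import numpy as np

    def bootstrap_pody_ci(forecast_rain, observed_rain, n_boot=1000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for PODY over matched pairs (a sketch)."""
        f = np.asarray(forecast_rain, bool)
        o = np.asarray(observed_rain, bool)
        rng = np.random.default_rng(seed)
        n = len(f)
        samples = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)           # resample matched pairs with replacement
            hits = np.sum(f[idx] & o[idx])        # a: rain forecast and observed
            misses = np.sum(~f[idx] & o[idx])     # c: rain observed but not forecast
            if hits + misses > 0:
                samples.append(hits / (hits + misses))
        return np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)])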
In addition, we test the null hypothesis that the ME is zero (that is, that the forecast bias is
not statistically significant) for each domain separately. The t-statistic for testing this hypothesis
is:
t_{n_{dom}-1} = \frac{\mathrm{ME}_{dom}}{\sqrt{\dfrac{s_{dom}^{2}}{n_{dom}}}}
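A compact equivalent of this test (here using SciPy's one-sample t-test on the signed errors, which is an assumption of convenience rather than our actual workflow):

    import numpy as np
    from scipy import stats

    def bias_test(forecast_mm, observed_mm, alpha=0.05):
        """Test whether the mean error (forecast bias) differs significantly from zero."""
        err = np.asarray(forecast_mm, float) - np.asarray(observed_mm, float)
        t, p = stats.ttest_1samp(err, 0.0)    # one-sample t-test against a mean of zero
        return t, p, p < alpha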
(3) To test for differences in forecast accuracy between forecasts initialized closer to vs.
farther away from 24-hour verification periods, we aggregate matched pairs using the same
criteria listed for research question (2) above, with two exceptions. First, we consider forecasts
from only the SFMarin domain, and second, we consider only 24-hour accumulation periods for
which a 0-24 hour forecast and a 24-48 hour forecast from the day before are both available (see
Figure 9).
Based on these criteria, we create 8 aggregations corresponding to combinations of
forecasts at four initialization times and two 24-hour forecast periods in a 48-hour forecast (4 × 2
= 8). The number of matched pairs (n) in these aggregations ranges from 2,004 to 2,316 for
continuous statistics.
For each aggregation, we compute the MAE. For each initialization time, we then pose
the statistical null hypothesis that the MAE for forecasts initialized at the start of a 24-hour
verification period are the same as the MAE for forecasts initialized 24 hours earlier.
To conduct each hypothesis test about the MAE’s, we calculate a t-statistic for the pair of
forecast start times:
t_{n-1} = \frac{\mathrm{MAE}_{24\text{-}48\,\mathrm{hr}} - \mathrm{MAE}_{0\text{-}24\,\mathrm{hr}}}{\sqrt{\dfrac{s_{24\text{-}48\,\mathrm{hr}}^{2}}{n} + \dfrac{s_{0\text{-}24\,\mathrm{hr}}^{2}}{n}}}
where n is the total number of matched pairs in each aggregation, and the subscripts “0-24hr”
and “24-48hr” refer to the forecasts starting at the beginning of a 24-hour verification period (6
hours after the dynamic start) and forecasts for the same verification period initialized 24 hours
earlier, respectively.
We test the hypotheses about the equivalence of MAEs in the same ways as for research
question (2) above.
(4) To test whether or not forecast accuracy depends on the time of day when the model
is initialized, we aggregate matched pairs using the same criteria listed for research question (2)
above, with the exception that we consider only 0-24hr forecasts.
Based on these criteria, we create 12 aggregations corresponding to combinations of
forecasts from domains with three resolutions and at four initialization times (3 × 4 = 12). The
number of matched pairs (n) in these aggregations ranges from 2,441 to 2,772 for continuous
statistics and from 6,917 to 7,225 for categorical statistics.
For each aggregation, we compute the MAE, PODY, and PODN. For each of the six
possible pairs of initialization times, we then pose the statistical null hypothesis that the MAEs,
PODYs, and PODNs for each pair are the same.
To conduct each hypothesis test about the MAE’s, we calculate a t-statistic for each of
the six possible pairs of four initialization times:
t_{n-1} = \frac{\mathrm{MAE}_{init1} - \mathrm{MAE}_{init2}}{\sqrt{\dfrac{s_{init1}^{2}}{n_{init1}} + \dfrac{s_{init2}^{2}}{n_{init2}}}}
where the subscripts “init1” and “init2” refer to two different model initialization times.
We test the hypotheses about the equivalence of PODYs and PODNs in the same ways as
for research question (2) above.
An additional question of interest is whether PODY and PODN are significantly
different from one another. We test this using 95% bootstrapped confidence intervals.
V. Results and Discussion
(1) Our first set of results provides information about spatial patterns of forecast
accuracy. At the expense of excluding some precipitation from the seasonal totals (as outlined in
Section IV.D above), we believe we are able to preserve the spatial pattern of seasonal rainfall
and forecast errors. The spatial pattern of October 2015 - April 2016 precipitation observations
suggests that this year was representative of many others in that we see a rainfall maximum on the
leeward side of Mt. Tamalpais in Marin County (Figure 10). Stations E7094 and RSSC report
partial-seasonal totals of 528 mm and 611 mm, respectively. These totals contrast with
totals in the 300-mm range at stations to the north (e.g., MSSMC) and totals elsewhere in the
northern and eastern San Francisco Bay Area mostly in the 200- to 300-mm range. These
stations are all within a geographically small area relative to the spatial coverage of mid-latitude
cyclones that deliver most of the annual precipitation to this part of the world, on average. Since
this entire area is affected by largely the same synoptic-scale weather patterns, regional
differences in precipitation totals among stations are attributed in large part to the area’s
topographical features.
We plot a map view of the MAE values at each station (Figure 10), and visual inspection
suggests that MAE roughly scales with total observed precipitation and that forecast bias is
negative for most stations. Table 1 shows that for all forecast initialization times and domains,
these negative biases are significantly different from zero. Figure 11 plots MAE and ME vs.
event observed precipitation for all stations and 24-hour periods except those in which no rain
was observed or forecast. Two main features of these data are again (a) negative forecast biases,
and (b) that the magnitude of forecast error is somewhat correlated with observed event size.
However, we note that the errors are clearly not normally distributed, and the trend is not
tested for significance. In Figure 12, forecast precipitation is plotted against observed
accumulations for the same periods as in the previous figure. This plot shows a linear regression
with a slope that is visually quite different from 1, providing further evidence of persistent
under-forecasting in the SFMarin forecasts. Normalizing the data from the spatial plots shown in
Figure 10, by calculating an average event precipitation value for each station and plotting it
against that station's MAE, yields a much tighter relationship between event size and forecast
error. An R2 value of 0.83 indicates strong correlation, and the slope of the regression line is 0.6
(Figure 13). This slope has not been
tested for significance, and should be tested in future work. However, the fact that the size of
error appears to correlate with the size of precipitation event throughout the domain, and that
stations in the lee of Mt. Tamalpais fit this general trend, may be a first line of evidence that the
model is not missing the local rainfall maximum feature. These are encouraging preliminary
results, and they call for future work on the subject.
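A sketch of this station-level normalization and fit (using NumPy's least-squares polynomial fit; the input arrays are hypothetical per-station summaries of our matched pairs):

    import numpy as np

    def station_mae_regression(mean_event_precip_mm, station_mae_mm):
        """Regress station MAE on mean observed event precipitation; return slope, intercept, R^2."""
        x = np.asarray(mean_event_precip_mm, float)
        y = np.asarray(station_mae_mm, float)
        slope, intercept = np.polyfit(x, y, 1)                        # ordinary least-squares line
        resid = y - (slope * x + intercept)
        r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)       # coefficient of determination
        return slope, intercept, r2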
Our results for research question (2) show that the 10-kilometer resolution CenCal_slim
precipitation forecasts consistently appear to have the lowest MAE scores, followed by the 3.3-kilometer BayArea_slim forecasts and 1.1-kilometer SFMarin forecasts (Table 2). MAE and ME
scores for the aggregate of all WRF domain forecasts at all model initialization times are
presented in Figure 14. Two-tailed t-tests show that for all SFMarin - CenCal_slim comparisons,
these differences are significant. For domain resolution pairs with significantly different MAE
scores, we then calculate one-tailed t-statistics and determine that MAE scores for CenCal_slim
forecasts are in fact significantly lower than for BayArea_slim and SFMarin (Table 3).
Categorical statistics provide an alternative perspective on forecast accuracy. By nature,
these are a less demanding test of forecast accuracy, in that they do not focus on the
amounts observed and forecast as the continuous statistics do. PODY and PODN show similar
results for research question (2), but to a lesser degree (i.e., the differences for the domain pairs
are not all statistically significant) (Table 4). Taking MAE as our primary measure of forecast
accuracy, we proceed to reject our null hypothesis and determine that model grid resolution does
indeed affect precipitation forecast accuracy. Although it may be counterintuitive, SFMarin
forecasts are less accurate than the lower-resolution forecasts.
The apparent outperformance of the higher-resolution forecasts by the lower-resolution
forecasts might be attributed to the physics of the WRF model. For the 1.1-km domain,
convective precipitation is represented explicitly, while a parameterization scheme is used to
represent convective precipitation in the 3.3-km and 10-km domains. It is entirely possible that
the scheme is better suited for grid resolutions of 10 km or coarser, and therefore does
not do as well with higher spatial resolutions. Furthermore, it may be that 1.1 km is still too
coarse for representing individual convective rain clouds; in the real atmosphere, entire 1.1-km-wide
air parcels are not commonly lofted to great heights. The seemingly poorer forecast skill of
the high-resolution domain relative to other domains may also be a function of the location of its
observation stations relative to the subdomain boundaries. Even after quality control to remove
stations that lie within 2.7 kilometers of the SFMarin boundary, errors propagating inward
from the boundary may affect more stations in this domain due to its small size. In the larger
domains, the stations used for the statistical evaluation are of course nowhere near the
boundaries and are therefore likely minimally affected by the propagating errors that arise when
the model calculates domain boundary conditions.
(3) MAE scores for the three WRF grid resolutions also reveal that forecasts for a given
24-hour observation period verify better for runs initialized immediately before that period (0-24
hr forecasts) than for those initialized 24 hours prior to its beginning (24-48 hr forecasts). Two-tailed t-tests for the MAE between 0-24 hr and 24-48 hr SFMarin forecasts show that the
differences are generally statistically significant, in favor of lower MAE scores for 0-24 hr
forecasts (Table 5). The exception to this is for 06Z WRF runs, which show no significant
difference between 0-24 hr and 24-48 hr forecasts. Possible explanations for this have not been
explored in this study and should be considered in future work. For the other three operational
WRF runs, however, the 0-24 hr forecasts are consistently more accurate.
(4) Results show that MAE scores for the domain forecasts at different initialization
times are generally not significantly different (Table 6). While this is true for many initialization
time pairs, some cases allow us to reject this null hypothesis. Most notably, the comparison
between 06Z and 18Z forecasts at all three model grid resolutions yields significant differences in
MAE, at least for the aggregate of all of the 0-24 hr forecasts. Also, we note that six of eight
cases in which we reject the null hypothesis are for comparisons involving 06Z runs. Future
work should explore why WRF precipitation forecasts initialized at 18Z are consistently more
accurate (as measured by MAE) than 06Z forecasts, and why 0-24 hr 06Z forecasts exhibit signs
of being less accurate than forecasts for the other three initialization times.
We also note additional features of the model performance that are not particularly
related to the questions of forecast accuracy between different model grid resolutions and periods
within 48-hour forecasts. For example, the model has a higher likelihood of correctly forecasting
nonevents than it does precipitation events. This is shown in the PODY and PODN statistics in
Table 7. The differences are significant for all grid resolutions and in both forecast periods. We
also detect forecast biases that are negative and significantly different from zero at the 95%
confidence level, suggesting that the model consistently under-forecasts precipitation events.
Statistical evaluation is one of many methods of determining forecast accuracy and
precision. Due to the nature of precipitation – namely its infrequent occurrence and high
variability on small spatial and temporal scales – it is a notoriously difficult quantity to forecast.
By nature, continuous statistics are an especially demanding test for forecast accuracy, as they
consider the amount of precipitation observed vs. the amount forecast. We believe that
categorical statistics are a less demanding, but still useful test. Although statistical verification
can often yield poor results, it can show how the model (1) captures spatial patterns of
precipitation, (2) performs at different grid resolution scales, (3) performs given various lengths
of time away from forecast initialization, and (4) performs given the initialization time of day.
Even if model verification is statistically poor for many individual cases, we can assess
whether the apparently poor skill is due to the model shifting the rainfall some small distance
from where it was observed, slightly missing the timing of rainfall events, or even correctly
forecasting rain but failing to reflect the observed totals.
VI. Conclusions
In this study, we examine the skill of WRF-ARW model precipitation forecasts in three
nested domains using continuous and categorical statistical tests. We conclude that lower-resolution CenCal_slim forecasts are more accurate than SFMarin forecasts, based on MAE
scores. The apparently better performance of the low-resolution domain relative to the domains
with finer grid resolution may be attributed to the fact that the 3.3 and 10 km domains
parameterize convective-scale precipitation, while the 1.1 km domain represents it explicitly.
MAE scores also tell us that precipitation forecasts for model runs initialized 24 hours
before the start of a 24-hour observation period are generally less accurate than those for runs
initialized immediately before the period. Other notable features include the model’s higher
likelihood of correctly forecasting nonevents than rainfall events, and the fact that the model
forecast bias is consistently negative. We also see that MAE scales with observed precipitation,
and that MAE scores at stations in the lee of Mt. Tamalpais fit this overall trend, though further
work is necessary to test whether the trend is significant. We take this as preliminary evidence
that the model does not miss the climatological local rainfall maximum pattern. Although we see
that higher spatial resolution does not improve WRF-ARW precipitation forecast accuracy as
measured by MAE, it is encouraging that we do not see extremely high MAE scores in the lee of
Mt. Tamalpais. With these results, it could be reasonable to use the WRF model at SFMarin grid
resolution to evaluate the mesoscale processes associated with precipitation around Mt.
Tamalpais.
Appendix I. The Kentfield Rainfall Maximum
Long-term monthly average precipitation values for Bay Area communities show that
Kentfield receives notably more precipitation than surrounding locations (WRCC, 2016). This
differential precipitation is caused by topography, since the same synoptic-scale weather patterns
affect the entire region, and because rainfall in this part of the world is dominated by wintertime
frontal precipitation rather than summertime convective-scale precipitation. The highest-resolution
WRF-ARW domain used in our research, referred to as “SFMarin,” has sufficiently fine grid
spacing to address the mesoscale processes associated with Mt. Tamalpais. We test the accuracy
of forecasts made with the SFMarin grid relative to the coarser-resolution “BayArea” and
“CenCal” domains. If the model can reproduce this spatial pattern over the course of a rainfall
season, then we have confidence that we can look at specific model output fields to address
physical hypotheses such as the mechanisms responsible for the Kentfield Rainfall Maximum.
Two existing hypotheses are that (a) the mountain is narrow relative to the advective scale of
orographically induced precipitation, and (b) clouds and precipitation are formed in place due to
low-level convergence on the leeward side of the mountain. This phenomenon motivates us to
evaluate the WRF model statistically.
Appendix II. WRF Model Configuration and Post-Processing
The WRF-ARW model produces output on terrain-following coordinate surfaces in
Network Common Data Format (NetCDF). We post-process the WRF-ARW output with
the Unified Post Processor (UPP) software developed by the National Centers for Environmental
Prediction (NCEP). UPP interpolates the WRF-ARW output fields vertically onto constant
pressure surfaces and reformats the data files to NCEP’s Gridded Binary 1 (GRIB1) format,
which our statistical analysis software, the Model Evaluation Tools (MET), can read. With each
operational model run, we create a 48-hour precipitation forecast.
During the 6-hour dynamic start period, topographic forcing, land-sea contrasts, and
internal dynamics produce detail resolvable by the high-resolution WRF grids that is missing
from the NAM initialization.
Appendix III. Quality Control of Precipitation Observation Reports
We eliminate stations that report physically impossible or highly implausible
precipitation amounts. For example, we exclude stations reporting excessive amounts in a matter
of minutes or hours, or high totals when there is no evidence of rain based on GOES-West 1-km
satellite archives and totals at nearby stations. We also discovered negative precipitation values
in the statistics files that arose from certain UrbaNet stations resetting their accumulations not at
midnight but a few minutes before. As a result, our algorithm, which subtracts the report closest
to the beginning of the 24-hour period from the report closest to the end, yielded
negative 24-hour precipitation accumulations at those stations. We recalculated all 24-hour
accumulations back to October 2015 to fix this error.
Another form of quality control is to remove stations that lie within 2.5 grid points (2.7
kilometers) of the boundary of the SFMarin subdomain. This is necessary because the model
forecast values are zero at all of the outermost grid points, which would otherwise influence
stations close enough to incorporate those values into their distance-weighted means.
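A sketch of the boundary-proximity screen, assuming station and domain-edge coordinates are expressed in the model's projected coordinates (kilometers); the names here are illustrative:

    def far_from_boundary(x_km, y_km, xmin, xmax, ymin, ymax, buffer_km=2.7):
        """True if a station lies more than buffer_km inside every edge of the SFMarin domain."""
        return (x_km - xmin > buffer_km and xmax - x_km > buffer_km and
                y_km - ymin > buffer_km and ymax - y_km > buffer_km)

    # keep only stations that pass the screen (stations and sfmarin_bounds are hypothetical names)
    # kept = [s for s in stations if far_from_boundary(s.x, s.y, *sfmarin_bounds)]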
Selected References
"Animation of Archived GOES-West Visible Satellite Images." http://squall.sfsu.edu/scripts/gwvis_big_archloop.html.
Cressman, George P. "An Operational Objective Analysis System." Monthly Weather Review 87,
no. 10 (1959): 367-74. http://docs.lib.noaa.gov/rescue/mwr/087/mwr-087-10-0367.pdf.
"Model Evaluation Tools Version 4.1 (METv4.1) Users Guide." DTC. May 2013.
http://www.dtcenter.org/.
Skamarock, W. C., et al. "A Description of the Advanced Research WRF Version 3." University
Corporation for Atmospheric Research, June 2008.
http://www2.mmm.ucar.edu/wrf/users/docs/arw_v3.pdf.
Western Regional Climate Center, 2016. "US COOP Station Map." http://www.wrcc.dri.edu/coopmap/.
Wilks, Daniel S. Statistical Methods in the Atmospheric Sciences: An Introduction. San Diego:
Academic Press, 1995.
Figures and Tables
Figure 1. Map view of the NAM 40-km domain used for WRF-ARW initialization and boundary
conditions. Locations of model grid points plotted as crosses.
[Figure 2 chart: 30-yr average (1981-2010) monthly rainfall (in.), January through December, for Kentfield, San Rafael, Muir Woods, Hamilton AFB, Downtown SF, and Richmond.]
Figure 2. Western Regional Climate Center long-term average monthly precipitation values at
COOP stations across the northern and central San Francisco Bay Area. Period of record: 30
years (1981 – 2010) for all stations except Hamilton AFB, which has data for 1934 – 1971.
Figure 3. Top: Map view of the nested WRF-ARW domains in which we operationally run the
model to produce 48-hour precipitation forecasts every six hours.
Bottom: Table showing the model output frequencies, spatial grid resolutions, and model time
steps of the three WRF-ARW domains.
Figure 4. Map view of WRF 48-hour accumulated precipitation (in.) (color-shaded contours) for
a rainfall event in December 2014 in the SFMarin domain, resolved at 10 km (top image) and 1.1
km (bottom).
Figure 5. Contingency table providing the logic for the programs that calculate categorical
statistics based on WRF precipitation forecasts. For n total events (or matched pairs), categories
are broken down into (a) number of events in which the model forecast rain and rain was
observed, (b) number of events in which the model forecast rain but none was observed, (c)
number of events in which the model did not forecast rain, but rain was observed, and (d) the
number of events in which the model did not forecast rain and no rain was observed. Subtotals
a+b, c+d, a+c, and b+d represent the number of total forecasts for rain, forecasts for no rain,
observations of rain, and observations of no rain, respectively.
Figure 6. Schematic diagram illustrating the logical steps for statistical evaluation of WRF-ARW forecasts.
Figure 7. Map view of the SFMarin WRF subdomain showing the locations of model grid points
(black crosses), and the surface weather stations (blue stars) remaining after quality control
Figure 8. Schematic illustration of the grid points involved in the distance-weighted mean
interpolation of WRF-ARW precipitation values for SFMarin, BayArea_slim, and CenCal_slim
forecasts. Shown are the hypothetical locations of SFMarin, BayArea, and CenCal grid points
(respectively as magenta, red, and blue crosses), as well as a surface weather station (blue star).
The boundary of the SFMarin domain is marked by the solid black line.
Figure 9. Periods corresponding to two example operational WRF forecasts, with the 24-hour
evaluation period established for comparing accuracy of the first 24 hours of the 00Z forecast on
Oct 25th with the second 24 hours of the 00Z Oct 24th forecast. This method is applied for all
operational runs in our data set in which some precipitation was observed or forecast.
Figure 10. Map view plots of total forecast and observed precipitation (mm), and forecast
errors (mm) in the SFMarin domain over the course of the 2015-16 rainfall season. Values are
plotted over WRF model SFMarin terrain height (color-filled contours) with lines of equal
elevation (solid black)
Top Left: Partial-seasonal MAE (mm) for the aggregate of 24-hour periods in common among all
stations reporting precipitation. Values are color coded and scaled in size according to their
magnitude.
Top Right: Partial-seasonal total observed precipitation (mm)
Bottom Left: Partial-seasonal ME (mm) for the aggregate of 24-hour periods in common among
all stations reporting precipitation. Numbers are color coded according to their value.
Bottom Right: Partial-seasonal total forecast precipitation (mm)
Figure 11. Top: SFMarin forecast MAE (mm) plotted against observed precipitation for all
stations and 24-hour periods remaining after quality control of precipitation observations. Also
shown are the one-to-one lines (dashed) and the linear regression fit. Bottom: SFMarin forecast
ME (mm) plotted against observed precipitation for all stations and 24-hour periods remaining
after quality control of precipitation observations.
Figure 12. SFMarin forecast precipitation (mm) plotted against observed precipitation for all
stations and 24-hour periods remaining after quality control of precipitation observations.
Figure 13. SFMarin MAE (mm) plotted against averaged observed precipitation at all stations
within the SFMarin subdomain remaining after quality control.
Figure 14. (Top) MAE and (bottom) ME values (mm) for the aggregate of all WRF domain 0-24 hr
forecasts (SFMarin, BayArea_slim, and CenCal_slim) at all model initialization times (00Z,
06Z, 12Z, and 18Z).
Table 1: t-statistics and test results addressing whether ME (forecast bias) for the first 24 hours
of WRF forecasts differs significantly from zero, for all grid resolutions and initialization times.

WRF Run (first 24 hrs) | Grid Forecast | ME (mm) | t       | t-crit | Reject Hnull?
00Z                    | SFMarin       | -0.925  | -5.543  | 0.960  | Yes
00Z                    | BayArea_slim  | -0.895  | -5.541  | 0.960  | Yes
00Z                    | CenCal_slim   | -1.016  | -6.865  | 0.960  | Yes
06Z                    | SFMarin       | -1.258  | -6.930  | 0.960  | Yes
06Z                    | BayArea_slim  | -1.272  | -7.098  | 0.960  | Yes
06Z                    | CenCal_slim   | -1.342  | -8.121  | 0.960  | Yes
12Z                    | SFMarin       | -1.593  | -9.571  | 0.960  | Yes
12Z                    | BayArea_slim  | -1.553  | -9.471  | 0.960  | Yes
12Z                    | CenCal_slim   | -1.634  | -10.238 | 0.960  | Yes
18Z                    | SFMarin       | -1.102  | -6.457  | 0.960  | Yes
18Z                    | BayArea_slim  | -1.074  | -6.326  | 0.960  | Yes
18Z                    | CenCal_slim   | -1.108  | -6.976  | 0.960  | Yes
Table 2: MAE and ME for the first 24-hour periods of WRF forecasts at all grid resolutions.

WRF Run | Grid Forecast | MAE (mm) | ME (mm)
00Z     | SFMarin       | 5.181    | -0.925
00Z     | BayArea_slim  | 5.045    | -0.895
00Z     | CenCal_slim   | 4.752    | -1.016
06Z     | SFMarin       | 5.382    | -1.258
06Z     | BayArea_slim  | 5.311    | -1.272
06Z     | CenCal_slim   | 4.921    | -1.342
12Z     | SFMarin       | 5.073    | -1.593
12Z     | BayArea_slim  | 5.021    | -1.553
12Z     | CenCal_slim   | 4.905    | -1.634
18Z     | SFMarin       | 4.916    | -1.102
18Z     | BayArea_slim  | 4.908    | -1.074
18Z     | CenCal_slim   | 4.665    | -1.108
Table 3: t-statistics and test results addressing whether MAE and ME scores differ significantly
between domain pairs. Shown are MAE and ME comparisons for all WRF initialization times, for
0-24 hr forecasts.

Statistic | Domain Pair     | t      | t-crit | Reject Hnull?
00Z, MAE  | SFMarin-BayArea | 0.585  | 0.960  | No
00Z, MAE  | SFMarin-CenCal  | 1.923  | 0.480  | Yes
00Z, MAE  | BayArea-CenCal  | 1.337  | 0.480  | Yes
00Z, ME   | SFMarin-BayArea | -0.131 | 0.960  | No
00Z, ME   | SFMarin-CenCal  | 0.407  | 0.960  | No
00Z, ME   | BayArea-CenCal  | 0.554  | 0.960  | No
06Z, MAE  | SFMarin-BayArea | 0.278  | 0.960  | No
06Z, MAE  | SFMarin-CenCal  | 1.878  | 0.480  | Yes
06Z, MAE  | BayArea-CenCal  | 1.600  | 0.480  | Yes
06Z, ME   | SFMarin-BayArea | 0.055  | 0.960  | No
06Z, ME   | SFMarin-CenCal  | 0.341  | 0.960  | No
06Z, ME   | BayArea-CenCal  | 0.286  | 0.960  | No
12Z, MAE  | SFMarin-BayArea | 0.225  | 0.960  | No
12Z, MAE  | SFMarin-CenCal  | 0.729  | 0.480  | Yes
12Z, MAE  | BayArea-CenCal  | 0.505  | 0.480  | Yes
12Z, ME   | SFMarin-BayArea | -0.172 | 0.960  | No
12Z, ME   | SFMarin-CenCal  | 0.179  | 0.960  | No
12Z, ME   | BayArea-CenCal  | 0.356  | 0.960  | No
18Z, MAE  | SFMarin-BayArea | 0.033  | 0.960  | No
18Z, MAE  | SFMarin-CenCal  | 1.077  | 0.480  | Yes
18Z, MAE  | BayArea-CenCal  | 1.046  | 0.480  | Yes
18Z, ME   | SFMarin-BayArea | -0.117 | 0.960  | No
18Z, ME   | SFMarin-CenCal  | 0.025  | 0.960  | No
18Z, ME   | BayArea-CenCal  | 0.147  | 0.960  | No
Table 4: t-statistics and test results for addressing whether significant differences in MAE exist
between 0-24 hr and 24-48 hr forecasts.

WRF Run | t      | t-crit (2-tailed) | t-crit (1-tailed) | Reject Hnull?
00Z     | 6.826  | 0.960             | N/A               | Yes
06Z     | -0.539 | 0.960             | N/A               | No
12Z     | 5.201  | 0.960             | 0.480             | Yes
18Z     | 2.991  | 0.960             | 0.480             | Yes
Table 5: Test results (based on 95% bootstrapped confidence intervals) addressing whether PODY
and PODN scores for domain forecast pairs differ significantly.

Statistic | Domain Pair     | Reject Hnull?
00Z, PODY | SFMarin-BayArea | Yes
00Z, PODY | SFMarin-CenCal  | Yes
00Z, PODY | BayArea-CenCal  | Yes
06Z, PODY | SFMarin-BayArea | Yes
06Z, PODY | SFMarin-CenCal  | Yes
06Z, PODY | BayArea-CenCal  | Yes
00Z, PODN | SFMarin-BayArea | No
00Z, PODN | SFMarin-CenCal  | Yes
00Z, PODN | BayArea-CenCal  | Yes
06Z, PODN | SFMarin-BayArea | No
06Z, PODN | SFMarin-CenCal  | No
06Z, PODN | BayArea-CenCal  | Yes
Table 6: t-statistics and test results for addressing whether MAE scores are significantly
different among aggregated forecast pairs of initialization times.

Forecast Init Pair | Grid Forecast | t      | t-crit | Reject Hnull?
00Z-06Z            | SFMarin       | -0.815 | 0.960  | No
00Z-06Z            | BayArea_slim  | -1.103 | 0.960  | Yes
00Z-06Z            | CenCal_slim   | -0.762 | 0.960  | No
00Z-12Z            | SFMarin       | 0.458  | 0.480  | No
00Z-12Z            | BayArea_slim  | 0.104  | 0.480  | No
00Z-12Z            | CenCal_slim   | -0.703 | 0.960  | No
00Z-18Z            | SFMarin       | 1.110  | 0.960  | Yes
00Z-18Z            | BayArea_slim  | 0.585  | 0.960  | No
00Z-18Z            | CenCal_slim   | 0.401  | 0.480  | No
06Z-12Z            | SFMarin       | 1.255  | 0.960  | Yes
06Z-12Z            | BayArea_slim  | 1.194  | 0.960  | Yes
06Z-12Z            | CenCal_slim   | 0.070  | 0.480  | No
06Z-18Z            | SFMarin       | 1.870  | 0.960  | Yes
06Z-18Z            | BayArea_slim  | 1.633  | 0.960  | Yes
06Z-18Z            | CenCal_slim   | 1.117  | 0.960  | Yes
12Z-18Z            | SFMarin       | 0.659  | 0.960  | No
12Z-18Z            | BayArea_slim  | 0.479  | 0.480  | No
12Z-18Z            | CenCal_slim   | 1.066  | 0.960  | Yes
Table 7: Test results for addressing whether PODN and PODY are significantly different from
each other, based on upper and lower 95% bootstrapped confidence intervals. Tests are performed
for multiple forecast runs (model initialization times), both forecast periods, and for forecasts
made with all three WRF model grid resolutions (SFMarin, BayArea_slim, and CenCal_slim).

Comparison        | WRF Run | Reject Hnull?
SFMarin PODN-PODY | 00Z     | Yes
SFMarin PODN-PODY | 06Z     | Yes
BayArea PODN-PODY | 00Z     | Yes
BayArea PODN-PODY | 06Z     | Yes
CenCal PODN-PODY  | 00Z     | Yes
CenCal PODN-PODY  | 06Z     | Yes