Statistically Evaluating High-Resolution Precipitation Forecasts in the San Francisco Bay Area

A THESIS SUBMITTED TO THE DEPARTMENT OF EARTH & CLIMATE SCIENCE OF SAN FRANCISCO STATE UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE

BY Nicholas Christen
May 2015

Thesis Advisor: Dr. David Dempsey
Committee Members: Dr. Alexander Stine, Dr. John Monteverdi

Abstract

We statistically evaluate high-resolution WRF-ARW model precipitation forecasts within three nested domains covering California and the San Francisco Bay Area. The nested WRF domains have model grid resolutions of 10 km, 3.3 km, and 1.1 km, and are referred to in this project as "CenCal," "BayArea," and "SFMarin," respectively. This project addresses whether model grid resolution has a significant effect on forecast accuracy, whether 24-hour precipitation forecasts are more accurate closer to or farther from the model initialization time, and whether there are spatial patterns in forecast error. Our statistical calculations are based on the distance-weighted mean of 24-hour forecast precipitation accumulations at the nine model grid points nearest each of a set of core observation stations that lie within the innermost WRF subdomain. From these forecast-observation matched pairs, we calculate various measures of forecast error for each station, as well as for the aggregate of matched pairs across the three WRF subdomains over the course of the 2015-2016 rainfall season (early October through April). Forecasts are classified by model grid resolution as SFMarin, BayArea_slim, and CenCal_slim. One-tailed t-tests at the 95% confidence level show that differences in Mean Absolute Error (MAE) scores between SFMarin and CenCal_slim are significant, indicating that the lower-resolution forecasts are more accurate. These counterintuitive results may be attributed to the fact that convective-scale precipitation is modeled explicitly in the 1.1-km domain, while a parameterization scheme is used for both the 3.3- and 10-km domains. Mean Error (ME) scores show a consistent negative forecast bias. Results also show that forecast accuracy decreases in the second 24-hour period of forecasts relative to the first. Analogous tests, based on bootstrapped confidence intervals, show that grid resolution has no effect on the model's ability to correctly forecast the occurrence of precipitation events and nonevents. However, the model systematically does better at forecasting nonevents. Finally, we observe that the spatial patterns of MAE and ME scale with that of observed precipitation.

I. Introduction and Background

The Advanced Research version of the Weather Research and Forecasting (WRF-ARW) model, developed by the National Center for Atmospheric Research, can make weather forecasts on user-specified bounded regions in space (domains) with high spatial and temporal resolution (Skamarock et al., 2008). To make a forecast, the model solves a set of mathematical equations that describe how the state of the atmosphere, represented on a three-dimensional grid of points, changes over time at a series of discrete times. Since mid-August 2014, the WRF-ARW model has been run locally to produce 48-hour forecasts every six hours (i.e., four runs per day, initialized at 00Z, 06Z, 12Z, and 18Z).
To calculate a forecast, the model requires two kinds of information: (1) the state of the atmosphere at the grid points at the starting time (the initial conditions); and (2) the state of the atmosphere at grid points on the boundary of the forecast domain at all subsequent forecast times (the boundary conditions). To provide the necessary initial and boundary conditions to the WRF-ARW model, we use analyses and forecasts from a lower-resolution weather forecast model, the 40-km-resolution North American Mesoscale (NAM) model, which is run by the National Centers for Environmental Prediction four times per day (NCEP, 2014) (Figure 1). A description of the WRF model configuration is provided in the following section of this thesis.

The overarching motivation for this research is to understand (a) controls on mesoscale weather patterns in the San Francisco Bay Area, and (b) the ability of a high-resolution model such as the WRF-ARW to simulate and forecast these patterns. Of particular interest is the climatological rainfall maximum on the leeward side of Mt. Tamalpais in Marin County (Appendix I) (Figure 2). In this study, we focus on precipitation patterns in the Bay Area and evaluate WRF-ARW precipitation forecasts statistically during the fall and winter of 2015-16.

Section II of this thesis describes the configuration of the WRF-ARW model forecasts that we use for this study. Section III then poses specific questions about model precipitation forecast performance in the Bay Area. Section IV describes our choice of statistics, how we calculate them, and how we use them to test statistical hypotheses. Section V presents and discusses our results, and Section VI summarizes our conclusions.

II. The WRF Model

The WRF model is configured locally to run in three nested domains, each with its own grid resolution, time step, and model output frequency (Figure 3). The largest, "parent" domain (referred to as "CenCal") has the lowest resolution (10 km) and covers most of California. A smaller, higher-resolution domain (3.3 km, referred to as "BayArea") is nested within the parent domain and covers the San Francisco Bay Area and surroundings. An even smaller, still higher-resolution domain (1.1 km, referred to as "SFMarin") is nested within the BayArea domain and covers the central San Francisco Bay Area: the northern San Francisco Peninsula, Marin County, and parts of the East Bay.

To initialize WRF-ARW forecasts, we feed the NAM initialization to the WRF Preprocessing System (WPS) software (Skamarock et al., 2008, Chapters 5 & 6), which interpolates the NAM initialization onto the WRF-ARW's higher-resolution grids. However, the resulting fields have no more detail (i.e., no smaller-scale features resolvable by WRF's higher-resolution grids) than the lower-resolution NAM initialization. To start WRF forecasts with more detail, we implement a dynamic start, in which we initialize WRF with a NAM initialization six hours earlier than the intended start time and then integrate for six hours with boundary conditions interpolated in time between initializations from two successive NAM forecast runs, which are available to us six hours apart. From the intended start time, output from the latest NAM forecast run provides boundary conditions for the WRF forecast at intervals of six hours out to 48 hours. (See Appendix II.)
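The time interpolation of boundary conditions during the dynamic start is handled internally by the WRF/WPS software; the following minimal Python sketch only illustrates the idea of linearly interpolating a boundary field between two NAM output times six hours apart. The function name, array shapes, and values are hypothetical and are not part of the operational workflow.

```python
import numpy as np

def interp_boundary(field_t0, field_t1, t0_hr, t1_hr, t_hr):
    """Linearly interpolate a boundary-condition field between two NAM
    output times (t0_hr and t1_hr, in hours) to an intermediate time t_hr.

    field_t0, field_t1 : 2-D arrays of some variable on the lateral-boundary
    points at the two bracketing NAM times.
    """
    w = (t_hr - t0_hr) / (t1_hr - t0_hr)   # 0 at t0, 1 at t1
    return (1.0 - w) * field_t0 + w * field_t1

# Hypothetical example: NAM fields 6 hours apart, interpolated to hour 2
# of the 6-hour dynamic-start integration.
nam_00z = np.random.rand(10, 10)   # stand-in for a NAM boundary field at 00Z
nam_06z = np.random.rand(10, 10)   # stand-in for the next NAM run's field
bc_02z = interp_boundary(nam_00z, nam_06z, t0_hr=0, t1_hr=6, t_hr=2)
```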
III. Research Questions

We address four research questions about WRF-ARW precipitation forecast accuracy:

1) Does the pattern of forecast precipitation resemble the observed spatially variable pattern in the Bay Area, and how does forecast accuracy vary spatially?

2) Does forecast accuracy depend on model grid resolution? Plots of the 1.1-km and 10-km resolution forecasts for the same rainfall event are shown in Figure 4. The greater level of detail in the higher-resolution forecast suggests that it might be more accurate, so we test the statistical null hypothesis that model resolution does not matter.

3) Does forecast accuracy depend on how long before a 24-hour verification period the model is initialized? For example, we expect forecasts initialized at the start of a verification period to be more accurate than those initialized 24 hours earlier. We therefore test the statistical null hypothesis that the two forecasts do not differ.

4) Does forecast accuracy depend on the time of day of model initialization? We have no reason to expect any difference, but we test for one anyway.

IV. Statistical Evaluation Methods

A. Measures of Model Forecast Accuracy

1. Continuous Statistics

We calculate the Mean Absolute Error (MAE) and the Mean Error (ME, or forecast bias). MAE is the mean magnitude of the difference between forecast and observed precipitation at a set of weather stations:

$$\mathrm{MAE} \equiv \frac{1}{n}\sum_{i=1}^{n}\left| F_i - O_i \right|$$

where F_i is a forecast at the same location and time as an observation, O_i, and n is the number of these forecast-observation pairs ("matched pairs"). We define observed and forecast precipitation as 24-hour accumulations starting at 00Z, 06Z, 12Z, or 18Z, the times corresponding to the four operational model initialization times. ME gives the sign of the forecast error and is equal to the average forecast value minus the average observed value; it is therefore useful for identifying negative or positive forecast bias:

$$\mathrm{ME} \equiv \frac{1}{n}\sum_{i=1}^{n}\left( F_i - O_i \right) = \bar{F} - \bar{O}$$

We base the continuous statistics strictly on 24-hour periods during which precipitation is forecast or observed, or both. Periods when no precipitation is observed and none is forecast (a correct forecast) represent the majority of days and therefore tend to dominate the statistics, so we exclude them to focus on the accuracy of forecast precipitation amounts.

2. Categorical Statistics

We calculate categorical statistics to evaluate the model's ability to forecast whether or not precipitation events will occur, rather than its ability to forecast precipitation amounts. Categorical statistics are based on counts of the number of correctly forecast precipitation events and nonevents, as well as incorrect forecasts of each (four categories in total). Figure 5 shows a generalized contingency table that displays these counts along with their marginal and total sums. The individual counts and marginal sums are used to calculate fractional measures of forecast accuracy. Among the many possible categorical statistics are the probability of forecasting ("detecting") precipitation events (PODY = a/(a+c), where "Y" means "yes") and the probability of correctly detecting precipitation nonevents (PODN = d/(b+d), where "N" means "no" or "none"). In the interest of time, we chose to calculate and test the significance of these two statistics and not others. We use the Developmental Testbed Center (DTC)'s Model Evaluation Tools (MET) software to calculate both continuous and categorical statistics ("MET Version 4.1 Users Guide").
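To make these definitions concrete (the calculations in this study are performed by the MET software), the following sketch computes MAE, ME, PODY, and PODN from a small set of hypothetical matched pairs. The event threshold of 0.254 mm (0.01 in.) anticipates the trace criterion used in Section IV.D; the function names and sample values are assumptions for illustration only.

```python
import numpy as np

def continuous_stats(forecasts, observations):
    """MAE and ME (forecast bias) for matched forecast-observation pairs (mm)."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(observations, dtype=float)
    mae = np.mean(np.abs(f - o))     # MAE = (1/n) * sum(|F_i - O_i|)
    me = np.mean(f - o)              # ME  = mean(F) - mean(O)
    return mae, me

def categorical_stats(forecasts, observations, threshold=0.254):
    """PODY and PODN from a 2x2 contingency table.

    An 'event' is a 24-hour accumulation >= threshold (0.254 mm = 0.01 in.).
    a: forecast yes / observed yes     b: forecast yes / observed no
    c: forecast no  / observed yes     d: forecast no  / observed no
    """
    f = np.asarray(forecasts) >= threshold
    o = np.asarray(observations) >= threshold
    a = np.sum(f & o)
    b = np.sum(f & ~o)
    c = np.sum(~f & o)
    d = np.sum(~f & ~o)
    pody = a / (a + c)               # probability of detecting events
    podn = d / (b + d)               # probability of detecting nonevents
    return pody, podn

# Hypothetical matched pairs of 24-hour precipitation (mm):
fcst = [0.0, 3.2, 10.5, 0.0, 1.1]
obs  = [0.0, 5.0, 14.0, 0.3, 0.0]
print(continuous_stats(fcst, obs), categorical_stats(fcst, obs))
```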
B. Observations

To evaluate model forecast accuracy, we need to compare model forecasts to observations. We use surface observations from weather stations in the Meteorological Assimilation Data Ingest System (MADIS) UrbaNet and Mesonet databases (MADIS, 2014) (step O1 in Figure 6). We receive and store observations in NetCDF format every hour automatically via the internet, using Unidata's Local Data Manager (LDM) software. Incoming data comprise surface observations from over 4,000 North American stations maintained by local, state, and federal agencies and private firms (NOAA, 2002). These observations come from roughly 1,000 stations in the CenCal domain, 200 in the BayArea domain, and 50 in the SFMarin domain. We construct 24-hour precipitation accumulations from precipitation accumulations reported roughly every six minutes at each weather station. This calculation takes into account two different accumulated-precipitation reporting conventions: most stations (80 to 90%) report running 24-hour precipitation totals, while the rest report accumulation since local midnight. We perform some quality control (Appendix III) in an attempt to obtain a more reliable subset of observations for our analyses. The locations of surface weather stations remaining in our database after quality control are shown in Figure 7.

C. WRF Forecasts and Matched Forecast-Observation Pairs

From each 48-hour precipitation forecast initialized at 00Z, 06Z, 12Z, and 18Z, two 24-hour precipitation accumulations are calculated at each grid point in each of the three nested WRF model domains, beginning 6 and 30 hours after the model dynamic start, respectively. We refer to the forecast accumulation from 6 to 30 hours after the dynamic start as a 0-24 hr forecast, and the forecast accumulation from 30 to 54 hours as a 24-48 hr forecast. The calculations take into account the fact that the model outputs accumulations measured from the start of the model run. These forecast 24-hour accumulations are then interpolated to the locations of observation stations, using a distance-weighted mean of the values from the nine model grid points nearest each station (Cressman, 1959) (Figure 8; a conceptual sketch of this interpolation follows this subsection). Our dataset of matched forecast-observation pairs spans most of the period from early October 2015 through April 2016, which roughly corresponds to the 2015-16 rainfall season in the Bay Area.
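The interpolation in this study follows Cressman (1959); the sketch below illustrates one plausible implementation of a distance-weighted mean over the nine grid points nearest a station, using the classic Cressman weight function. The grid coordinates, influence radius, and precipitation values are hypothetical and do not reproduce the exact weighting used operationally.

```python
import numpy as np

def cressman_interp(station_xy, grid_xy, grid_vals, n_points=9, radius=None):
    """Distance-weighted mean of the n_points grid values nearest a station.

    station_xy : (x, y) coordinates of the station (km)
    grid_xy    : (N, 2) array of grid-point coordinates (km)
    grid_vals  : (N,) array of forecast 24-hour accumulations (mm)
    radius     : Cressman influence radius R; defaults to just beyond the
                 largest of the n_points distances so all points get weight > 0.
    """
    d = np.hypot(grid_xy[:, 0] - station_xy[0], grid_xy[:, 1] - station_xy[1])
    nearest = np.argsort(d)[:n_points]
    r = d[nearest]
    R = radius if radius is not None else r.max() * 1.0001
    w = (R**2 - r**2) / (R**2 + r**2)          # Cressman (1959) weights
    return np.sum(w * grid_vals[nearest]) / np.sum(w)

# Hypothetical 1.1-km grid around a station at (x, y) = (5.3, 4.8) km:
xs, ys = np.meshgrid(np.arange(0, 11) * 1.1, np.arange(0, 11) * 1.1)
grid_xy = np.column_stack([xs.ravel(), ys.ravel()])
grid_vals = np.random.gamma(2.0, 3.0, grid_xy.shape[0])   # fake precip (mm)
print(cressman_interp((5.3, 4.8), grid_xy, grid_vals))
```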
D. Matched Pair Aggregation, Test Statistics, and Hypothesis Testing

To address each of our research questions (1)-(4) (Section III), we select a different subset (aggregation) of matched pairs from the full dataset, as described below.

(1) We focus our questions about spatial patterns of forecasts and forecast accuracy in the Bay Area on the total precipitation for the 2015-16 rainfall season. To this end, we aggregate matched pairs for a subset of stations in the SFMarin domain at times from October 2015 to April 2016 at which: (a) at least some precipitation was reported in the SFMarin domain, and (b) observations were available from all stations in the subset. There were at least 105 rainfall events (24-hour periods with rainfall observed somewhere in the SFMarin domain) and (after quality control) 42 stations reporting precipitation sometime during the October-April period. We want to include as many 24-hour periods with rainfall at as many stations as possible to capture the pattern of seasonal total rainfall. However, not all stations reported during all precipitation events, and there were some precipitation events during which relatively few stations reported. As a compromise intended to optimize both spatial and temporal coverage of the pattern of seasonal rainfall, we calculate partial-seasonal precipitation totals at 28 stations, all of which reported during 67 (of the 105+) rainfall events. Our analysis addressing question (1) starts with a qualitative comparison of plots of observed and forecast partial-seasonal rainfall and inspection of plots of MAE and ME for each station in the SFMarin domain (see Section V, "Results and Discussion").

(2) To test whether or not grid resolution affects forecast accuracy, we aggregate matched pairs according to the following criteria:

a) Matched pairs come from forecasts on the SFMarin, BayArea, or CenCal domain.
b) Matched pairs are for stations located within the SFMarin domain. We refer to the subsets of all matched pairs from the larger BayArea and CenCal domains that lie within the smaller SFMarin domain as BayArea_slim and CenCal_slim, respectively.
c) Matched pairs are from the period from October 2015 to April 2016.
d) For continuous but not categorical error statistics, matched pairs include a forecast, an observation, or both of at least 0.254 mm (0.01 in., a trace) of precipitation.
e) Matched pairs come from forecasts initialized at the same time of day (00Z, 06Z, 12Z, or 18Z). We aggregate these separately because 24-hour accumulations calculated from WRF runs initialized at these times overlap and are therefore not fully independent.
f) The forecast in each matched pair is for the same 24-hour period of the 48-hour forecast (either 0-24 hours or 24-48 hours after the 6-hour dynamic start).

Based on these criteria, we create 24 aggregations corresponding to combinations of forecasts from domains with three resolutions, at four initialization times, for two 24-hour forecast periods in a 48-hour forecast (3 × 4 × 2 = 24). The number of matched pairs (n) in these aggregations ranges from 2,440 to 2,864 for continuous statistics, and from 6,586 to 7,225 for categorical statistics (which, unlike continuous statistics, include matched pairs in which both the forecast and the observation report zero rainfall). For each aggregation, we compute the Mean Absolute Error (MAE) and Mean Error (ME) (Section IV.A.1) and the Probability of Detecting "Yes" (PODY) and "No" (PODN) (Section IV.A.2).

For each initialization time and 24-hour forecast period, we then pose the statistical null hypothesis that the MAEs for each possible pair of domain resolutions are the same. We do the same for MEs, PODYs, and PODNs. To conduct each hypothesis test about the MAEs, we calculate a t-statistic for each possible pair of the SFMarin, BayArea_slim, and CenCal_slim aggregations:

$$t_{n-1} = \frac{\mathrm{MAE}_{dom1} - \mathrm{MAE}_{dom2}}{\sqrt{\dfrac{s_{dom1}^{2}}{n_{dom1}} + \dfrac{s_{dom2}^{2}}{n_{dom2}}}}$$

where the subscripts "dom1" and "dom2" refer to the members of each aggregation pair; n_dom1 and n_dom2 are the numbers of matched pairs in each aggregation; n is the minimum of n_dom1 and n_dom2; and s_dom1 and s_dom2 are the standard deviations of the absolute errors in each aggregation. We calculate an analogous t-statistic to test the null hypothesis about the equivalence of MEs from different domains. We begin with two-tailed t-tests at the 0.05 significance level (95% confidence) to determine whether MAEs or MEs from each domain pair are significantly different. If we reject the null hypothesis, we then perform a one-tailed t-test that takes into account the apparent sign of the difference.
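As an illustration of this test procedure, the sketch below computes the t-statistic defined above from two hypothetical samples of absolute errors, applies the two-tailed test, and follows up with the one-tailed test when the two-tailed null hypothesis is rejected. The critical values come from SciPy's t distribution rather than from the values tabulated in this thesis, and the sample arrays are synthetic.

```python
import numpy as np
from scipy import stats

def compare_mae(abs_err_1, abs_err_2, alpha=0.05):
    """Two-sample t-test for a difference in MAE between two aggregations.

    abs_err_1, abs_err_2 : arrays of |F_i - O_i| for each aggregation (mm).
    Degrees of freedom follow the thesis convention: min(n1, n2) - 1.
    """
    a1, a2 = np.asarray(abs_err_1, float), np.asarray(abs_err_2, float)
    n1, n2 = len(a1), len(a2)
    t = (a1.mean() - a2.mean()) / np.sqrt(a1.var(ddof=1)/n1 + a2.var(ddof=1)/n2)
    df = min(n1, n2) - 1
    t_crit_2 = stats.t.ppf(1 - alpha/2, df)    # two-tailed critical value
    reject_two_tailed = abs(t) > t_crit_2
    # One-tailed follow-up in the direction of the apparent difference:
    t_crit_1 = stats.t.ppf(1 - alpha, df)
    reject_one_tailed = reject_two_tailed and abs(t) > t_crit_1
    return t, reject_two_tailed, reject_one_tailed

# Hypothetical absolute errors for two domains (sizes comparable to the study):
rng = np.random.default_rng(0)
sfmarin_abs_err = rng.gamma(2.0, 2.6, 2500)
cencal_abs_err = rng.gamma(2.0, 2.4, 2500)
print(compare_mae(sfmarin_abs_err, cencal_abs_err))
```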
To conduct hypothesis tests about the equivalence of PODYs or PODNs for each possible domain pair, we take advantage of 95% bootstrapped confidence intervals calculated by the MET software for PODY and PODN. If the PODY for one domain in a pair falls outside the 95% confidence interval for the PODY of the other domain, we reject the null hypothesis, and similarly for PODN.

In addition, we test the null hypothesis that the ME is zero (that is, that the forecast bias is not statistically significant) for each domain separately. The t-statistic for testing this hypothesis is:

$$t_{n_{dom}-1} = \frac{\mathrm{ME}_{dom}}{\sqrt{\dfrac{s_{dom}^{2}}{n_{dom}}}}$$

(3) To test for differences in forecast accuracy between forecasts initialized closer to vs. farther from a 24-hour verification period, we aggregate matched pairs using the same criteria listed for research question (2) above, with two exceptions. First, we consider forecasts from only the SFMarin domain, and second, we consider only 24-hour accumulation periods for which a 0-24 hour forecast and a 24-48 hour forecast from the day before are both available (see Figure 9). Based on these criteria, we create 8 aggregations corresponding to combinations of forecasts at four initialization times and two 24-hour forecast periods in a 48-hour forecast (4 × 2 = 8). The number of matched pairs (n) in these aggregations ranges from 2,004 to 2,316 for continuous statistics. For each aggregation, we compute the MAE. For each initialization time, we then pose the statistical null hypothesis that the MAE for forecasts initialized at the start of a 24-hour verification period is the same as the MAE for forecasts initialized 24 hours earlier. To conduct each hypothesis test about the MAEs, we calculate a t-statistic for the pair of forecast start times:

$$t_{n-1} = \frac{\mathrm{MAE}_{24\text{-}48hr} - \mathrm{MAE}_{0\text{-}24hr}}{\sqrt{\dfrac{s_{24\text{-}48hr}^{2}}{n} + \dfrac{s_{0\text{-}24hr}^{2}}{n}}}$$

where n is the total number of matched pairs in each aggregation, and the subscripts "0-24hr" and "24-48hr" refer to forecasts starting at the beginning of a 24-hour verification period (6 hours after the dynamic start) and forecasts for the same verification period initialized 24 hours earlier, respectively. We test the hypotheses about the equivalence of MAEs in the same ways as for research question (2) above.

(4) To test whether or not forecast accuracy depends on the time of day when the model is initialized, we aggregate matched pairs using the same criteria listed for research question (2) above, with the exception that we consider only 0-24 hr forecasts. Based on these criteria, we create 12 aggregations corresponding to combinations of forecasts from domains with three resolutions and at four initialization times (3 × 4 = 12). The number of matched pairs (n) in these aggregations ranges from 2,441 to 2,772 for continuous statistics and from 6,917 to 7,225 for categorical statistics. For each aggregation, we compute the MAE, PODY, and PODN. For each of the six possible pairs of initialization times, we then pose the statistical null hypothesis that the MAEs, PODYs, and PODNs for each pair are the same. To conduct each hypothesis test about the MAEs, we calculate a t-statistic for each of the six possible pairs of initialization times:

$$t_{n-1} = \frac{\mathrm{MAE}_{init1} - \mathrm{MAE}_{init2}}{\sqrt{\dfrac{s_{init1}^{2}}{n_{init1}} + \dfrac{s_{init2}^{2}}{n_{init2}}}}$$

where the subscripts "init1" and "init2" refer to two different model initialization times. We test the hypotheses about the equivalence of PODYs and PODNs in the same ways as for research question (2) above. An additional question of interest is whether PODY and PODN differ significantly from one another; we test this, too, using 95% bootstrapped confidence intervals.
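The bootstrapped confidence intervals used in this study are produced by the MET software; the following sketch shows the general idea with a simple percentile bootstrap for PODY over matched yes/no pairs (PODN would be handled analogously). The resampling count, random seeds, and example detection rates are assumptions for illustration only.

```python
import numpy as np

def pody_ci(forecast_yes, observed_yes, n_boot=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap confidence interval for PODY = a / (a + c).

    forecast_yes, observed_yes : boolean arrays over matched pairs.
    Only pairs with an observed event enter PODY, so those are resampled.
    """
    rng = np.random.default_rng(seed)
    hits = np.asarray(forecast_yes)[np.asarray(observed_yes)]  # True if detected
    estimate = hits.mean()
    boots = np.empty(n_boot)
    for i in range(n_boot):
        sample = rng.choice(hits, size=hits.size, replace=True)
        boots[i] = sample.mean()
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return estimate, (lo, hi)

# Hypothetical comparison of PODY between two domains: reject the null
# hypothesis of equal PODY if one domain's estimate falls outside the
# other domain's 95% interval.
obs = np.random.default_rng(1).random(7000) < 0.3            # observed events
fc_a = obs & (np.random.default_rng(2).random(7000) < 0.70)  # domain A detections
fc_b = obs & (np.random.default_rng(3).random(7000) < 0.72)  # domain B detections
pody_a, ci_a = pody_ci(fc_a, obs)
pody_b, _ = pody_ci(fc_b, obs)
print(pody_b < ci_a[0] or pody_b > ci_a[1])   # True -> reject equality
```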
V. Results and Discussion

(1) Our first set of results provides information about spatial patterns of forecast accuracy. At the expense of excluding some precipitation from the seasonal totals (as outlined in Section IV.D above), we believe we are able to preserve the spatial pattern of seasonal rainfall and forecast errors. The spatial pattern of October 2015 - April 2016 precipitation observations suggests that this year was representative of many others in that we see a rainfall maximum on the leeward side of Mt. Tamalpais in Marin County (Figure 10). Stations E7094 and RSSC report partial-seasonal totals of 528 mm and 611 mm, respectively. These totals contrast with totals in the 300-mm range at stations to the north (e.g., MSSMC) and totals elsewhere in the northern and eastern San Francisco Bay Area mostly in the 200- to 300-mm range. These stations all lie within a geographically small area relative to the spatial scale of the mid-latitude cyclones that, on average, deliver most of the annual precipitation to this part of the world. Since the entire area is affected by largely the same synoptic-scale weather patterns, regional differences in precipitation totals among stations are attributed in large part to the area's topography.

We plot a map view of the MAE values at each station (Figure 10); visual inspection suggests that MAE roughly scales with total observed precipitation and that the forecast bias is negative for most stations. Table 1 shows that for all forecast initialization times and domains, these negative biases are significantly different from zero. Figure 11 plots MAE and ME vs. observed event precipitation for all stations and 24-hour periods except those in which no rain was observed or forecast. Two main features of these data are again (a) negative forecast biases, and (b) a magnitude of forecast error that is somewhat correlated with observed event size. However, the scatter about this trend is clearly not normally distributed, and we have not tested the trend for significance. In Figure 12, forecast precipitation is plotted against observed accumulations for the same periods as in the previous figure. This plot shows a linear regression with a slope that is visually quite different from 1, providing further evidence of persistent under-forecasting in the SFMarin forecasts.

By normalizing the data from the spatial plots shown in Figure 10 (calculating an average event precipitation value for each station and plotting it against that station's MAE), we obtain a much tighter relationship between event size and forecast error at stations. An R² value of 0.83 indicates a strong correlation, and the slope of the regression line is 0.6 (Figure 13). The slope of this regression line has not been tested for significance and should be tested in future work. However, the fact that the size of the error appears to correlate with the size of the precipitation event throughout the domain, and that stations in the lee of Mt. Tamalpais fit this general trend, may be a first line of evidence that the model is not missing the local rainfall maximum. These are encouraging preliminary results, and they call for future work on the subject.
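The regression behind Figure 13 can be reproduced with an ordinary least-squares fit; since the per-station values are not tabulated in this thesis, the sketch below uses hypothetical station data and scipy.stats.linregress. The p-value it returns is one way the slope's significance could be tested in the future work suggested above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-station values: average observed event precipitation (mm)
# and that station's partial-seasonal MAE (mm). The study used 28 stations;
# these numbers are illustrative only.
avg_event_precip = np.array([4.1, 5.3, 6.0, 6.8, 7.5, 8.2, 9.0, 10.4,
                             11.1, 12.3, 13.0, 14.2, 15.5, 16.1])
station_mae = 0.6 * avg_event_precip + np.random.default_rng(3).normal(0, 0.8, 14)

fit = stats.linregress(avg_event_precip, station_mae)
print(f"slope = {fit.slope:.2f}, R^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.3g}")
# fit.pvalue tests the null hypothesis that the slope is zero.
```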
(2) Our results for research question (2) show that the 10-kilometer-resolution CenCal_slim precipitation forecasts consistently have the lowest MAE scores, followed by the 3.3-kilometer BayArea_slim forecasts and the 1.1-kilometer SFMarin forecasts (Table 2). MAE and ME scores for the aggregate of all WRF domain forecasts at all model initialization times are presented in Figure 14. Two-tailed t-tests show that for all SFMarin-CenCal_slim comparisons, these differences are significant. For domain resolution pairs with significantly different MAE scores, we then calculate one-tailed t-statistics and determine that MAE scores for CenCal_slim forecasts are in fact significantly lower than for BayArea_slim and SFMarin (Table 3).

Categorical statistics provide an alternative perspective on forecast accuracy. By nature, these are a less demanding test of statistical forecast accuracy, in that they do not consider the amounts observed and forecast as the continuous statistics do. PODY and PODN show similar results for research question (2), but to a lesser degree (i.e., the differences for the domain pairs are not all statistically significant) (Table 4). Taking MAE as our primary measure of forecast accuracy, we reject our null hypothesis and conclude that model grid resolution does indeed affect precipitation forecast accuracy. Although it may be counterintuitive, SFMarin forecasts are less accurate than the lower-resolution forecasts.

The apparent outperformance of the higher-resolution forecasts by the lower-resolution forecasts might be attributed to the physics of the WRF model. In the 1.1-km domain, convective precipitation is represented explicitly, while a parameterization scheme is used to represent convective precipitation in the 3.3-km and 10-km domains. It is entirely possible that the scheme is better suited to grid resolutions of 10 km or coarser, and therefore does not do as well at higher spatial resolutions. Furthermore, it may be that 1.1 km is still too coarse to represent individual convective rain clouds; we do not commonly observe entire 1.1-km-wide air parcels being lofted high into the atmosphere. The seemingly poorer forecast skill of the high-resolution domain relative to the other domains may also be a function of the locations of its observation stations relative to the subdomain boundaries. Even after quality control to remove stations that lie within 2.7 kilometers of the SFMarin boundary, it may be that errors propagating inward from the boundary affect more stations in this domain because of its small size. In the larger domains, the stations used for the statistical evaluation are nowhere near the boundaries and are therefore likely minimally affected by the propagating errors that arise when the model calculates domain boundary conditions.

(3) MAE scores for the three WRF grid resolutions also reveal that forecasts for a given 24-hour observation period verify better for runs initialized immediately before that period (0-24 hr forecasts) than for those initialized 24 hours before its beginning (24-48 hr forecasts). Two-tailed t-tests for the MAE between 0-24 hr and 24-48 hr SFMarin forecasts show that the differences are generally statistically significant, in favor of lower MAE scores for 0-24 hr forecasts (Table 5). The exception is the 06Z WRF runs, which show no significant difference between 0-24 hr and 24-48 hr forecasts.
Possible explanations for this have not been explored in this study and should be considered in future work. For the other three operational WRF runs, however, the 0-24 hr forecasts are consistently more accurate.

(4) Results show that MAE scores for the domain forecasts at different initialization times are generally not significantly different (Table 6). While this is true for many initialization-time pairs, some cases allow us to reject this null hypothesis. Most notably, the comparison between 06Z and 18Z forecasts at all three model grid resolutions yields significant differences in MAE, at least for the aggregate of all of the 0-24 hr forecasts. We also note that six of the eight cases in which we reject the null hypothesis are comparisons involving 06Z runs. Future work should explore why WRF precipitation forecasts initialized at 18Z are consistently more accurate (as measured by MAE) than 06Z forecasts, and why 0-24 hr 06Z forecasts exhibit signs of being less accurate than forecasts from the other three initialization times.

We also note additional features of the model performance that are not directly related to the questions of forecast accuracy among different model grid resolutions and periods within 48-hour forecasts. For example, the model has a higher likelihood of correctly forecasting nonevents than precipitation events, as shown by the PODY and PODN statistics in Table 7. The differences are significant for all grid resolutions and in both forecast periods. We also detect forecast biases that are negative and significantly different from zero at the 95% confidence level, suggesting that the model consistently under-forecasts precipitation.

Statistical evaluation is one of many methods of determining forecast accuracy and precision. Because of its infrequent occurrence and its high variability on small spatial and temporal scales, precipitation is a notoriously difficult quantity to forecast. By nature, continuous statistics are an especially demanding test of forecast accuracy, as they compare the amount of precipitation observed with the amount forecast. We believe that categorical statistics are a less demanding, but still useful, test. Although statistical verification can often yield poor results, it can show how the model (1) captures spatial patterns of precipitation, (2) performs at different grid resolutions, (3) performs at various lengths of time after forecast initialization, and (4) performs for different initialization times of day. Even if model verification is statistically poor for many individual cases, we can assess whether the apparently poor skill is due to the model shifting the rainfall some small distance from where it was observed, slightly missing the timing of rainfall events, or even correctly forecasting rain but failing to reflect the observed totals.

VI. Conclusions

In this study, we examine the skill of WRF-ARW model precipitation forecasts in three nested domains using continuous and categorical statistical tests. We conclude that the lower-resolution CenCal_slim forecasts are more accurate than the SFMarin forecasts, based on MAE scores. The apparently better performance of the low-resolution domain relative to the domains with finer grid resolution may be attributed to the fact that the 3.3- and 10-km domains use a convective parameterization scheme, whereas the 1.1-km domain represents convective-scale precipitation explicitly.
MAE scores also tell us that precipitation forecasts for a 24-hour observation period from model runs initialized 24 hours before that period are generally less accurate than those from runs initialized immediately before the period. Other notable features include the model's higher likelihood of correctly forecasting nonevents than rainfall events, and the fact that the model's forecast bias is consistently negative. We also see that MAE scales with observed precipitation and that MAE scores at stations in the lee of Mt. Tamalpais fit this overall trend, though further work is necessary to test whether the trend is significant. We take this as preliminary evidence that the model does not miss the climatological local rainfall maximum. Although higher spatial resolution does not improve WRF-ARW precipitation forecast accuracy as measured by MAE, it is encouraging that we do not see extremely high MAE scores in the lee of Mt. Tamalpais. With these results, it could be reasonable to use the WRF model at SFMarin grid resolution to evaluate the mesoscale processes associated with precipitation around Mt. Tamalpais.

Appendix I. The Kentfield Rainfall Maximum

Long-term monthly average precipitation values for Bay Area communities show that Kentfield receives notably more precipitation than surrounding locations (WRCC, 2016). This differential precipitation is caused by topography, since the same synoptic-scale weather patterns affect the entire region and rainfall in this part of the world is dominated by wintertime frontal precipitation rather than summertime convective-scale precipitation. The highest-resolution WRF-ARW domain used in our research, referred to as "SFMarin," has sufficiently fine spatial resolution to address the mesoscale processes associated with Mt. Tamalpais. We test the accuracy of forecasts made with the SFMarin grid relative to the coarser-resolution "BayArea" and "CenCal" domains. If the model can reproduce this spatial pattern over the course of a rainfall season, then we have confidence that we can examine specific model output fields to address physical hypotheses about the mechanisms responsible for the Kentfield rainfall maximum. Two existing hypotheses are that (a) the mountain is narrow relative to the advective scale of orographically induced precipitation, and (b) clouds and precipitation form in place due to low-level convergence on the leeward side of the mountain. This phenomenon motivates us to evaluate the WRF model statistically.

Appendix II. WRF Model Configuration and Post-Processing

The WRF-ARW model produces output on terrain-following coordinate surfaces in Network Common Data Format (NetCDF). We post-process the WRF-ARW output with the Unified Post Processor (UPP) software developed by the National Centers for Environmental Prediction (NCEP). UPP interpolates the WRF-ARW output fields vertically onto constant pressure surfaces and reformats the data files into NCEP's Gridded Binary 1 (GRIB1) format, which our statistical analysis software, the Model Evaluation Tools (MET), can read. With each operational model run, we create 48-hour precipitation forecasts. During the 6-hour dynamic start period, topographic forcing, land-sea contrasts, and internal dynamics produce detail resolvable by the high-resolution WRF grids that is missing from a NAM initialization.

Appendix III. Quality Control of Precipitation Observation Reports

We eliminate stations that report physically impossible or highly implausible precipitation amounts. For example, we exclude stations reporting excessive amounts in a matter of minutes or hours, or high totals when there is no evidence of rain in GOES-West 1-km satellite archives or in totals at nearby stations. We also discovered negative precipitation values in the statistics files, which arose from certain UrbaNet stations zeroing out their accumulations not at midnight but a few minutes before. As a result, our algorithm, which subtracts the report closest to the beginning of the 24-hour period from the report closest to the end, yielded negative 24-hour precipitation accumulations at those stations. We recalculated all 24-hour accumulations back to October 2015 to fix this error. Another form of quality control is to remove stations that lie within 2.5 grid points (2.7 kilometers) of the boundary of the SFMarin subdomain. This is necessary because the model forecast values are zero at all of the outermost grid points, and those values would otherwise be incorporated into the distance-weighted means of stations close enough to the boundary.
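As an illustration of the reporting-convention issue described above, the sketch below computes a 24-hour total from hypothetical since-midnight accumulation reports and adds back the amount lost when the counter resets inside the window (for example, a few minutes before midnight). The function and the report values are illustrative only and are not the exact algorithm used in this study.

```python
import numpy as np

def daily_total_from_since_midnight(accum_mm):
    """24-hour precipitation total from a chronological series of
    accumulation-since-local-midnight reports spanning one 24-hour window.

    A simple last-minus-first difference fails when the counter resets to
    zero inside the window, so resets are detected and the pre-reset
    accumulation is added back. (Rain falling between the two reports that
    bracket a reset is still lost; this is only a sketch.)
    """
    accum = np.asarray(accum_mm, dtype=float)
    diffs = np.diff(accum)
    resets = diffs < 0                        # counter dropped -> reset occurred
    total = accum[-1] - accum[0] + np.sum(accum[:-1][resets])
    return max(total, 0.0)                    # guard against residual negatives

# Hypothetical station whose counter zeroes out just before local midnight:
reports = [2.0, 5.0, 9.0, 12.0, 0.0, 0.3]
print(daily_total_from_since_midnight(reports))   # -> 10.3, not 0.3 - 2.0
```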
Selected References

"Animation of Archived GOES-West Visible Satellite Images." http://squall.sfsu.edu/scripts/gwvis_big_archloop.html.

Cressman, G. P., 1959: An operational objective analysis system. Monthly Weather Review, 87 (10), 367-374. http://docs.lib.noaa.gov/rescue/mwr/087/mwr-08710-0367.pdf.

Developmental Testbed Center, 2013: Model Evaluation Tools Version 4.1 (METv4.1) Users Guide. http://www.dtcenter.org/.

Skamarock, W. C., et al., 2008: A Description of the Advanced Research WRF Version 3. University Corporation for Atmospheric Research. http://www2.mmm.ucar.edu/wrf/users/docs/arw_v3.pdf.

Western Regional Climate Center, 2016: US COOP Station Map. http://www.wrcc.dri.edu/coopmap/.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, San Diego.

Figures and Tables

Figure 1. Map view of the NAM 40-km domain used for WRF-ARW initialization and boundary conditions. Locations of model grid points are plotted as crosses.

Figure 2. Western Regional Climate Center long-term average monthly precipitation (in.) at COOP stations across the northern and central San Francisco Bay Area: Kentfield, San Rafael, Muir Woods, Hamilton AFB, Downtown SF, and Richmond. Period of record: 30 years (1981-2010) for all stations except Hamilton AFB (1934-1971).

Figure 3. Top: Map view of the nested WRF-ARW domains in which we operationally run the model to produce 48-hour precipitation forecasts every six hours. Bottom: Table showing the model output frequencies, spatial grid resolutions, and model time steps of the three WRF-ARW domains.

Figure 4. Map view of WRF 48-hour accumulated precipitation (in.) (color-shaded contours) for a rainfall event in December 2014 in the SFMarin domain, resolved at 10 km (top) and 1.1 km (bottom).

Figure 5. Contingency table providing the logic for the programs that calculate categorical statistics based on WRF precipitation forecasts.
For n total events (matched pairs), the categories are: (a) the number of events in which the model forecast rain and rain was observed; (b) the number of events in which the model forecast rain but none was observed; (c) the number of events in which the model did not forecast rain but rain was observed; and (d) the number of events in which the model did not forecast rain and no rain was observed. Subtotals a+b, c+d, a+c, and b+d represent the total numbers of forecasts of rain, forecasts of no rain, observations of rain, and observations of no rain, respectively.

Figure 6. Schematic diagram illustrating the logical steps in the statistical evaluation of WRF-ARW forecasts.

Figure 7. Map view of the SFMarin WRF subdomain showing the locations of model grid points (black crosses) and the surface weather stations (blue stars) remaining after quality control.

Figure 8. Schematic illustration of the grid points involved in the distance-weighted mean interpolation of WRF-ARW precipitation values for SFMarin, BayArea_slim, and CenCal_slim forecasts. Shown are the hypothetical locations of SFMarin, BayArea, and CenCal grid points (magenta, red, and blue crosses, respectively), as well as a surface weather station (blue star). The boundary of the SFMarin domain is marked by the solid black line.

Figure 9. Periods corresponding to two example operational WRF forecasts, with the 24-hour evaluation period established for comparing the accuracy of the first 24 hours of the 00Z forecast on October 25 with the second 24 hours of the 00Z October 24 forecast. This method is applied to all operational runs in our dataset in which some precipitation was observed or forecast.

Figure 10. Map-view plots of total forecast and observed precipitation (mm) and forecast errors (mm) in the SFMarin domain over the course of the 2015-16 rainfall season. Values are plotted over WRF model SFMarin terrain height (color-filled contours) with lines of equal elevation (solid black). Top left: partial-seasonal MAE (mm) for the aggregate of 24-hour periods in common among all stations reporting precipitation; values are color-coded and scaled in size according to their magnitude. Top right: partial-seasonal total observed precipitation (mm). Bottom left: partial-seasonal ME (mm) for the aggregate of 24-hour periods in common among all stations reporting precipitation; numbers are color-coded according to their value. Bottom right: partial-seasonal total forecast precipitation (mm).

Figure 11. Top: SFMarin forecast MAE (mm) plotted against observed precipitation (mm) for all stations and 24-hour periods remaining after quality control of precipitation observations. Also shown are the one-to-one line (dashed) and the linear regression fit. Bottom: SFMarin forecast ME (bias, mm) plotted against observed precipitation (mm) for the same stations and periods.

Figure 12. SFMarin forecast precipitation (mm) plotted against observed precipitation for all stations and 24-hour periods remaining after quality control of precipitation observations.

Figure 13. SFMarin MAE (mm) plotted against average observed precipitation at all stations within the SFMarin subdomain remaining after quality control.
Figure 14. (Top) MAE and (bottom) ME values (mm) for the aggregate of all WRF domain 0-24 hr forecasts (SFMarin, BayArea_slim, and CenCal_slim) at all model initialization times (00Z, 06Z, 12Z, and 18Z).

Table 1: ME (forecast bias), t-statistics, and test results for whether ME differs significantly from zero, for the first 24-hour periods of WRF forecasts at all grid resolutions and initialization times.

WRF Run (first 24 hrs) | Grid Forecast | ME (mm) | t       | t-crit | Reject Hnull?
00Z                    | SFMarin       | -0.925  | -5.543  | 0.960  | Yes
00Z                    | BayArea_slim  | -0.895  | -5.541  | 0.960  | Yes
00Z                    | CenCal_slim   | -1.016  | -6.865  | 0.960  | Yes
06Z                    | SFMarin       | -1.258  | -6.930  | 0.960  | Yes
06Z                    | BayArea_slim  | -1.272  | -7.098  | 0.960  | Yes
06Z                    | CenCal_slim   | -1.342  | -8.121  | 0.960  | Yes
12Z                    | SFMarin       | -1.593  | -9.571  | 0.960  | Yes
12Z                    | BayArea_slim  | -1.553  | -9.471  | 0.960  | Yes
12Z                    | CenCal_slim   | -1.634  | -10.238 | 0.960  | Yes
18Z                    | SFMarin       | -1.102  | -6.457  | 0.960  | Yes
18Z                    | BayArea_slim  | -1.074  | -6.326  | 0.960  | Yes
18Z                    | CenCal_slim   | -1.108  | -6.976  | 0.960  | Yes

Table 2: MAE and ME for the first 24-hour periods of WRF forecasts at all grid resolutions.

WRF Run | Grid Forecast | MAE (mm) | ME (mm)
00Z     | SFMarin       | 5.181    | -0.925
00Z     | BayArea_slim  | 5.045    | -0.895
00Z     | CenCal_slim   | 4.752    | -1.016
06Z     | SFMarin       | 5.382    | -1.258
06Z     | BayArea_slim  | 5.311    | -1.272
06Z     | CenCal_slim   | 4.921    | -1.342
12Z     | SFMarin       | 5.073    | -1.593
12Z     | BayArea_slim  | 5.021    | -1.553
12Z     | CenCal_slim   | 4.905    | -1.634
18Z     | SFMarin       | 4.916    | -1.102
18Z     | BayArea_slim  | 4.908    | -1.074
18Z     | CenCal_slim   | 4.665    | -1.108

Table 3: t-statistics and test results for whether MAE and ME scores for domain pairs are significantly different. Shown are MAE and ME domain pairs for all WRF initialization times for 0-24 hr forecasts.

Comparison | Domain pair     | t      | t-crit | Reject Hnull?
00Z, MAE   | SFMarin-BayArea | 0.585  | 0.960  | No
00Z, MAE   | SFMarin-CenCal  | 1.923  | 0.480  | Yes
00Z, MAE   | BayArea-CenCal  | 1.337  | 0.480  | Yes
00Z, ME    | SFMarin-BayArea | -0.131 | 0.960  | No
00Z, ME    | SFMarin-CenCal  | 0.407  | 0.960  | No
00Z, ME    | BayArea-CenCal  | 0.554  | 0.960  | No
06Z, MAE   | SFMarin-BayArea | 0.278  | 0.960  | No
06Z, MAE   | SFMarin-CenCal  | 1.878  | 0.480  | Yes
06Z, MAE   | BayArea-CenCal  | 1.600  | 0.480  | Yes
06Z, ME    | SFMarin-BayArea | 0.055  | 0.960  | No
06Z, ME    | SFMarin-CenCal  | 0.341  | 0.960  | No
06Z, ME    | BayArea-CenCal  | 0.286  | 0.960  | No
12Z, MAE   | SFMarin-BayArea | 0.225  | 0.960  | No
12Z, MAE   | SFMarin-CenCal  | 0.729  | 0.480  | Yes
12Z, MAE   | BayArea-CenCal  | 0.505  | 0.480  | Yes
12Z, ME    | SFMarin-BayArea | -0.172 | 0.960  | No
12Z, ME    | SFMarin-CenCal  | 0.179  | 0.960  | No
12Z, ME    | BayArea-CenCal  | 0.356  | 0.960  | No
18Z, MAE   | SFMarin-BayArea | 0.033  | 0.960  | No
18Z, MAE   | SFMarin-CenCal  | 1.077  | 0.480  | Yes
18Z, MAE   | BayArea-CenCal  | 1.046  | 0.480  | Yes
18Z, ME    | SFMarin-BayArea | -0.117 | 0.960  | No
18Z, ME    | SFMarin-CenCal  | 0.025  | 0.960  | No
18Z, ME    | BayArea-CenCal  | 0.147  | 0.960  | No

Table 4: Test results (based on 95% bootstrapped confidence intervals) for whether PODY and PODN scores for domain forecast pairs are significantly different.

Comparison | Domain pair     | Reject Hnull?
00Z, PODY  | SFMarin-BayArea | Yes
00Z, PODY  | SFMarin-CenCal  | Yes
00Z, PODY  | BayArea-CenCal  | Yes
06Z, PODY  | SFMarin-BayArea | Yes
06Z, PODY  | SFMarin-CenCal  | Yes
06Z, PODY  | BayArea-CenCal  | Yes
00Z, PODN  | SFMarin-BayArea | No
00Z, PODN  | SFMarin-CenCal  | Yes
00Z, PODN  | BayArea-CenCal  | Yes
06Z, PODN  | SFMarin-BayArea | No
06Z, PODN  | SFMarin-CenCal  | No
06Z, PODN  | BayArea-CenCal  | Yes

Table 5: t-statistics and test results for whether significant differences in MAE exist between 0-24 hr and 24-48 hr forecasts.

WRF Run | t      | t-crit (2-tailed) | t-crit (1-tailed) | Reject Hnull?
00Z     | 6.826  | 0.960             | N/A               | Yes
06Z     | -0.539 | 0.960             | N/A               | No
12Z     | 5.201  | 0.960             | 0.480             | Yes
18Z     | 2.991  | 0.960             | 0.480             | Yes

Table 6: t-statistics and test results for whether MAE scores are significantly different between aggregated forecast pairs of initialization times.

Forecast Init Pair | Grid Forecast | t      | t-crit | Reject Hnull?
00Z-06Z            | SFMarin       | -0.815 | 0.960  | No
00Z-06Z            | BayArea_slim  | -1.103 | 0.960  | Yes
00Z-06Z            | CenCal_slim   | -0.762 | 0.960  | No
00Z-12Z            | SFMarin       | 0.458  | 0.480  | No
00Z-12Z            | BayArea_slim  | 0.104  | 0.480  | No
00Z-12Z            | CenCal_slim   | -0.703 | 0.960  | No
00Z-18Z            | SFMarin       | 1.110  | 0.960  | Yes
00Z-18Z            | BayArea_slim  | 0.585  | 0.960  | No
00Z-18Z            | CenCal_slim   | 0.401  | 0.480  | No
06Z-12Z            | SFMarin       | 1.255  | 0.960  | Yes
06Z-12Z            | BayArea_slim  | 1.194  | 0.960  | Yes
06Z-12Z            | CenCal_slim   | 0.070  | 0.480  | No
06Z-18Z            | SFMarin       | 1.870  | 0.960  | Yes
06Z-18Z            | BayArea_slim  | 1.633  | 0.960  | Yes
06Z-18Z            | CenCal_slim   | 1.117  | 0.960  | Yes
12Z-18Z            | SFMarin       | 0.659  | 0.960  | No
12Z-18Z            | BayArea_slim  | 0.479  | 0.480  | No
12Z-18Z            | CenCal_slim   | 1.066  | 0.960  | Yes

Table 7: Test results for whether PODN and PODY differ significantly from each other, based on upper and lower 95% bootstrapped confidence intervals. Tests are performed for multiple forecast runs (model initialization times), both forecast periods, and forecasts made with all three WRF model grid resolutions (SFMarin, BayArea_slim, and CenCal_slim).

Grid Forecast | Comparison | WRF Run | Reject Hnull?
SFMarin       | PODN-PODY  | 00Z     | Yes
SFMarin       | PODN-PODY  | 06Z     | Yes
BayArea       | PODN-PODY  | 00Z     | Yes
BayArea       | PODN-PODY  | 06Z     | Yes
CenCal        | PODN-PODY  | 00Z     | Yes
CenCal        | PODN-PODY  | 06Z     | Yes