An Assessment of Seasonal Water Supply Outlooks in the Colorado R. Basin

Jean C. Morrill (1), Holly C. Hartmann (1) and Roger C. Bales (2)

(1) Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ, USA
(2) School of Engineering, University of California, Merced, CA, USA

Corresponding author:
Roger C. Bales
University of California, Merced
P.O. Box 2039
Merced, CA 95344
209-724-4348 (o)
209-228-4047 (fax)
[email protected]

7/13/2017

Abstract

A variety of forecast skill measures of interest to the water resources applications community and other stakeholders were used to assess the strengths and weaknesses of seasonal water supply outlooks (WSO's) at 54 sites in the Colorado R. basin, and to provide a baseline against which alternative and experimental forecast methods can be compared. These included traditional scalar measures, categorical measures, probabilistic measures and distribution-oriented measures. Despite their shortcomings, the WSO's are an improvement over climatology at most sites over the period of record. The majority of forecast points have very conservative predictions of seasonal flow, with below-average flows often overpredicted and above-average flows underpredicted. Late-season forecasts at most locations are generally better than those issued in January. There is a low false alarm rate for both low and high flows at most sites; however, low and high flows are not forecast nearly as often as they are observed. Moderate flows have a very high probability of detection, but are forecast more often than they occur. There is also good discrimination between high and low flows, i.e. when high flows are forecast, low flows are not observed, and vice versa. The diversity of forecast performance metrics reflects the multi-attribute nature of forecasts and ensembles.

1.
Introduction

Seasonal water supply outlooks, or forecasts of total seasonal runoff volume, are routinely used by decision makers in the southwestern United States for making commitments for water deliveries, determining industrial and agricultural water allocation, and carrying out reservoir operations. In the Colorado R. basin, the National Weather Service (NWS) Colorado Basin R. Forecast Center (CBRFC) and the Natural Resources Conservation Service (NRCS) jointly issue seasonal water supply outlook (WSO) forecasts of naturalized, or unimpaired, flow, i.e. the flow that would most likely occur in the absence of diversions and reservoir storage (e.g., CBRFC, 1992; Soil Conservation Service and NWS, 1994). Currently, WSO's are issued once each month from January to June; until the mid-1990s, however, the forecasts were only issued through May.

The forecast period is the period of time over which the forecasted flow is predicted to occur. It is not the same for all sites, for all years at one location, or even for all months in a single year. In the past decade, the most common forecast period has been April-July for most sites in the upper Colorado R. basin and January-May for the Lower Colorado, regardless of the month of issue. Previously, however, many sites used April-September forecast periods, and before that the forecast period extended from the month of issue through September (January-September for the January forecast, February-September for the February forecast, etc.).

Both the CBRFC and NRCS base their WSO's on multivariate regression relationships (Day, 1985; Hartmann et al., 2002; Pagano et al., 2004). Unique regressions for each forecast period and location use subsets of monthly or seasonal observations of precipitation, streamflow, ground-based snow-water depths, and routed forecasted streamflows; some Arizona locations incorporate Southern Oscillation Index values to reflect climatic teleconnections.
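The regression approach described above can be illustrated with a minimal sketch. Everything below is hypothetical: the predictor names, coefficients, and synthetic data are illustrative only and are not the operational CBRFC/NRCS equations.

```python
import numpy as np

# Minimal sketch of a seasonal water-supply regression of the general form
# described in the text. Predictors, coefficients, and data are hypothetical.
rng = np.random.default_rng(0)
n_years = 30
snow = rng.normal(20.0, 5.0, n_years)         # e.g., April 1 snow-water depth
fall_precip = rng.normal(8.0, 2.0, n_years)   # e.g., Oct-Dec precipitation
# Synthetic "observed" seasonal runoff volume with noise.
runoff = 40.0 + 12.0 * snow + 3.0 * fall_precip + rng.normal(0.0, 15.0, n_years)

# Ordinary least squares fit: runoff = b0 + b1*snow + b2*precip.
X = np.column_stack([np.ones(n_years), snow, fall_precip])
coef, _, _, _ = np.linalg.lstsq(X, runoff, rcond=None)

# A single deterministic ("most probable") forecast for a new year's predictors.
forecast = coef @ np.array([1.0, 25.0, 9.0])
```

Fitting a separate equation of this kind for each site and issue month, and attaching an error distribution to the regression estimate, is the general pattern behind the quantile forecasts in the outlook bulletins.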
The regression equations produce only a single deterministic water supply volume, termed the "most probable" forecast in some publications, although the term is neither statistically rigorous nor preferred terminology (Hartmann et al., 2002). The WSO's typically also compare this value to the mean or median for a historical climatological period (usually 10-30 years). Additionally, seasonal total water volumes corresponding to 10%, 30%, 70% and 90% exceedance values have often been provided in the outlook bulletins. These quantiles are obtained by overlaying a normalized error distribution, determined during regression equation fitting, centered on the deterministic regression forecast, which then corresponds to the distribution median. Most of the sites at which forecasts are issued are impaired, i.e. have diversions above the forecast and gauging location. Therefore the CBRFC combines measured discharges with historical estimates of diversion to reconstruct the unimpaired observed flow (CBRFC, undated).

Forecast verification is important for assessing forecast quality and performance, improving forecasting procedures, and providing users with information helpful in applying the forecasts (Murphy and Winkler, 1987). Decision makers take account of forecast skill in using forecast information and are interested in having access to a variety of skill measures (Bales et al., 2004; Franz et al., 2003). Shafer and Huddleston (1984) examined average forecast error at over 500 forecast points in 10 western states. Using summary statistical measures, they found that forecast errors tended to be approximately normally distributed, but with a slight negative skew resulting from a few large negative errors (under-forecasts) with no corresponding large positive errors. High errors were not always associated with poor skill scores, however. Pagano et al.
(2004) found similar results for 29 locations throughout the West, using correlation-related summary measures. Both evaluation studies treated the WSO's as strictly deterministic products, i.e., as single-value forecasts without any uncertainty information.

The work reported here assesses the skill of forecasts relative to naturalized streamflow across the Colorado R. basin, using a greater variety of evaluation metrics of interest to stakeholders: traditional scalar measures (linear correlation, root-mean-square error and bias), categorical measures (false alarm rate, threat score), probabilistic measures (Brier score, ranked probability score) and distributive measures (resolution, reliability and discrimination). The purpose was to assess the strengths and weaknesses of the current water supply forecasts, and to provide a comprehensive, multidimensional baseline against which alternative and experimental forecast methods can be compared.

2. Data and methods

2.1. Data

WSO records from 136 forecast points on 84 water bodies were assembled, including some forecast locations that are no longer active. NEED TO APPEND DATA

Reconstructed flows were made available by the CBRFC and NOAA (T. Tolsdorf and S. Shumate, personal communication); however, data were not available for all forecast locations. Many current forecast points were established in 1993, and so do not yet have good long-term records. For this study we chose 54 sites having at least 10 years of both forecast and observed data (Figure 1). Another 33 sites have fewer than 10 years of data, but most are still active, and so should become more useful for statistical analysis in a few years' time. The earliest water supply forecasts used in this study were issued in 1953 at 22 of the 54 locations.
These 54 forecasting sites were divided into 9 smaller basins (or, in the case of Lake Powell, a single location), compatible with the divisions used by CBRFC in the tables and graphs accompanying the WSO forecasts (Table 1). The maximum number of years in the combined forecast and observation record was 48 (1953-2000), the minimum used was 21, and the median and average number of years were 46 and 41.5, respectively.

Each deterministic forecast was converted into a forecast probability distribution by using the "most probable" value as the distribution median, along with the 10% and 90% exceedance values, to also calculate the 30% and 70% exceedance values. Five forecast flow categories were calculated for each forecast, based on exceedance probability: 0-10%, >10-30%, >30-70%, >70-90%, and >90%. The probability of the flow falling within each of these categories is 0.1, 0.2, 0.4, 0.2 and 0.1, respectively. Selection of these categories was based on their common usage in NRCS communications; they reflect categories considered important to a broad range of water resources decision makers.

2.2. Summary and correlation measures

Summary measures are scalar measures of accuracy for forecasts of continuous variables, and include the mean absolute error (MAE) and mean square error (MSE):

MAE = \frac{1}{n} \sum_{i=1}^{n} | f_i - o_i |    (1)

MSE = \frac{1}{n} \sum_{i=1}^{n} ( f_i - o_i )^2    (2)

where for a given location, f_i is the forecast seasonal runoff for period i and o_i the naturalized observed flow for the same period. Since MSE is computed by squaring the forecast errors, it is more sensitive to large errors than is MAE. Both MSE and MAE increase from zero for perfect forecasts to large positive values as the discrepancies between forecasts and observations become larger. RMSE is the square root of the MSE. Often an accuracy measure is not meaningful by itself, and is compared to a reference value, usually based on the historical record.
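As a concrete illustration of the summary measures in Eqs. (1)-(2), using hypothetical forecast/observation pairs (the volumes below are invented for illustration):

```python
import numpy as np

# Hypothetical forecast (f) and naturalized observed (o) seasonal volumes.
f = np.array([900.0, 1100.0, 1000.0, 1250.0])
o = np.array([850.0, 1200.0, 1000.0, 1400.0])

mae = np.mean(np.abs(f - o))    # mean absolute error, Eq. (1)
mse = np.mean((f - o) ** 2)     # mean square error, Eq. (2)
rmse = np.sqrt(mse)             # root-mean-square error
```

Here MAE = 75 while RMSE is about 93.5; the squared-error measure is pulled upward by the single large miss in the final year, illustrating MSE's greater sensitivity to large errors.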
In order for a forecast technique to be worthwhile, it must generate better results than simply using the cumulative distribution of the climatological record, i.e. assuming that the most likely flow next year is the average flow in the climatological record. To judge this, skill scores are calculated for the accuracy measures:

SS_A = \frac{A - A_{ref}}{A_{perf} - A_{ref}}    (3)

where SS_A is a generic skill score, A_{ref} is the accuracy of a reference set of values (e.g. the climatological record) and A_{perf} is the value of A given by perfect forecasts. If A = A_{perf}, SS_A is at its maximum, 1. If A = A_{ref}, then SS_A = 0, indicating no improvement over the reference forecast. If SS_A < 0, the forecasts are not as good as the reference (Wilks, 1995). For MSE, since a perfect forecast has MSE = 0:

SS_{MSE} = \frac{MSE - MSE_{cl}}{MSE_{perf} - MSE_{cl}} = \frac{MSE - MSE_{cl}}{0 - MSE_{cl}} = 1 - \frac{MSE}{MSE_{cl}}    (4)

where the climatological value is

MSE_{cl} = \frac{1}{n} \sum_{i=1}^{n} ( o_{cl} - o_i )^2    (5)

and o_{cl} is the average observation associated with the reference climatology.

Correlation-based measures are widely used to determine the goodness-of-fit of hydrologic models. They have many limitations, including a high sensitivity to extreme values (outliers) and an insensitivity to additive or proportional differences between models and observations (Legates and McCabe, 1999). Correlation provides a summary measure of the joint distribution of the forecasts and observations. However, it does not account for any forecast bias, and when bias is large, the correlation is not likely to be informative. The most widely used correlation measure is the coefficient of determination, which describes the proportion of the variability of the observations that is linearly accounted for by the forecasts:

R^2 = \left[ \frac{\sum_{i=1}^{n} (o_i - \bar{o})(f_i - \bar{f})}{\left( \sum_{i=1}^{n} (o_i - \bar{o})^2 \right)^{0.5} \left( \sum_{i=1}^{n} (f_i - \bar{f})^2 \right)^{0.5}} \right]^2    (6)

where R^2 = 1 indicates perfect agreement between the observations and predictions and R^2 = 0 no agreement.
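A short sketch of the MSE skill score and the coefficient of determination, again with hypothetical forecast/observation pairs:

```python
import numpy as np

# Hypothetical forecast (f) and observed (o) seasonal volumes.
f = np.array([900.0, 1100.0, 1000.0, 1250.0])
o = np.array([850.0, 1200.0, 1000.0, 1400.0])

# MSE skill score relative to climatology (mean observation as reference).
mse = np.mean((f - o) ** 2)
mse_cl = np.mean((o.mean() - o) ** 2)
ss_mse = 1.0 - mse / mse_cl

# Coefficient of determination (squared linear correlation).
num = np.sum((o - o.mean()) * (f - f.mean()))
den = np.sqrt(np.sum((o - o.mean()) ** 2)) * np.sqrt(np.sum((f - f.mean()) ** 2))
r2 = (num / den) ** 2
```

A positive ss_mse (here about 0.8) indicates the forecasts outperform climatology. Note that r2 can remain high even for strongly biased forecasts, which is why bias is examined separately.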
Another correlation measure often used to evaluate the performance of hydrologic models is the coefficient of efficiency (Nash and Sutcliffe, 1970), also called the Nash-Sutcliffe coefficient:

NSC = 1 - \frac{\sum_{i=1}^{n} (o_i - f_i)^2}{\sum_{i=1}^{n} (o_i - \bar{o})^2}    (7)

It has a maximum of 1 for a perfect forecast and a minimum of negative infinity. Physically, NSC is 1 minus the ratio of MSE to the variance of the observed data. If NSC > 0, the forecast is a better predictor of flow than is the observed mean, but if NSC < 0, the observed mean is a better predictor and there is a lack of correlation between the forecast and observed values.

Discussion of correlation is often combined with that of the percent bias, which measures the difference between the average forecasted and observed values (Wilks, 1995):

Pbias = \frac{\bar{f} - \bar{o}}{\bar{o}} \times 100\%    (8)

which can assume positive (overforecasting), negative (underforecasting) or zero values.

Shafer and Huddleston (1984) used a similar calculation to examine forecast error and the distribution of forecast error in the analysis of seasonal streamflow forecasts. Forecast error for a particular forecast/observation pair was defined as

E = \frac{f - o}{o_{ref}} \times 100    (9)

where o_{ref} is the published seasonal average runoff at the time of the forecast (also called the climatological average or reference value). They also defined a skew coefficient associated with the distribution of a set of errors:

G = \frac{n \sum_{i=1}^{n} (E_i - \bar{E})^3}{(n-1)(n-2)(\sigma_E)^3}    (10)

where \sigma_E is the standard deviation of the errors.

2.3. Categorical measures

A categorical forecast states that one and only one set of possible events will occur, with an implied 100% certainty attached to the forecasted category. Contingency tables are used to display the possible combinations of forecast and event pairs, and the count of each pair. An event (e.g. seasonal flow in the upper 30% of the observed distribution) that is successfully forecast (both forecast and observed) occurs a times.
An event that is forecast but not observed occurs b times, and an event that is observed but not forecast occurs c times. An event that is neither forecast nor observed for the same period occurs d times. The total number of forecasts in the data set is n = a + b + c + d. A perfectly accurate binary (2×2) categorical forecast will have b = c = 0 and a + d = n. However, few forecasts are perfect. Several measures can be used to examine the accuracy of the forecast, including hit rate, threat score, probability of detection and false alarm rate (Wilks, 1995).

The hit rate is the proportion correct:

HR = \frac{a + d}{n}    (11)

and ranges from one (perfect) to zero (worst).

The threat score, also known as the critical success index, is the proportion of correctly forecast events out of the total number of times the event was either forecast or observed, and does not take into account the accurate non-occurrence of events:

TS = \frac{a}{a + b + c}    (12)

It also ranges from one (perfect) to zero (worst).

The probability of detection is the fraction of times the event was correctly forecast relative to the number of times it actually occurred, or the probability of the forecast given the observation:

POD = \frac{a}{a + c}    (13)

A perfect POD is 1 and the worst is 0. A related statistic is the false alarm rate, FAR, which is the fraction of forecasted events that do not happen. In terms of conditional probability, it is the probability of not observing an event given the forecast:

FAR = \frac{b}{a + b}    (14)

Unlike the other categorical measures described, the FAR has a negative orientation, with the best possible FAR being 0 and the worst being 1.

The bias of the categorical forecasts compares the average forecast with the average observation, and is represented by the ratio of "yes" forecasts to "yes" observations:

bias = \frac{a + b}{a + c}    (15)

An unbiased forecast has a value of 1, showing that the event occurred the same number of times that it was forecast.
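The contingency-table measures can be computed directly from the four counts; the counts below are hypothetical (a = hits, b = false alarms, c = misses, d = correct negatives):

```python
# Hypothetical 2x2 contingency-table counts.
a, b, c, d = 12, 3, 5, 20
n = a + b + c + d

hr = (a + d) / n          # hit rate: fraction of correct forecasts
ts = a / (a + b + c)      # threat score (critical success index)
pod = a / (a + c)         # probability of detection
far = b / (a + b)         # false alarm rate (negative orientation)
bias = (a + b) / (a + c)  # categorical bias: "yes" forecasts / "yes" observations
```

With these counts the hit rate is 0.8 and the false alarm rate 0.2, while the bias of about 0.88 shows the event is forecast slightly less often than it is observed.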
If the bias is greater than 1, the event is overforecast (forecast more often than observed); if the bias is less than 1, the event is underforecast. Since the bias does not actually show anything about whether the forecasts matched the observations, it is not an accuracy measure.

2.4. Probabilistic measures

Whereas categorical forecasts contain no expression of uncertainty, probabilistic forecasts do. Linear error in probability space (LEPS) assesses forecast errors with respect to their difference in probability, rather than their overall magnitude:

LEPS_i = | F_c(f_i) - F_c(o_i) |    (16)

where F_c(o) refers to the climatological cumulative distribution function of the observations, and F_c(f) to the corresponding distribution for the forecasts. The best possible LEPS value is 0, for identical distributions, and the worst is 1, for completely divergent distributions. LEPS reflects that correct forecasting of extreme events should warrant more credit than correct forecasting of more common moderate events. The corresponding skill score is:

SS_{LEPS} = 1 - \frac{\sum_{i=1}^{n} | F_c(f_i) - F_c(o_i) |}{\sum_{i=1}^{n} | 0.5 - F_c(o_i) |}    (17)

using the climatological median as the reference forecast.

The Brier score is analogous to MSE:

BS = \frac{1}{n} \sum_{i=1}^{n} ( f_i - o_i )^2    (18)

However, it compares the probability associated with a forecast event with whether or not that event occurred, instead of comparing the actual forecast and observation. Therefore f_i ranges from 0 to 1, o_i = 1 if the event occurred or o_i = 0 if it did not, and BS = 0 for perfect forecasts. The corresponding skill score is:

SS_{BS} = 1 - \frac{BS}{BS_{ref}}    (19)

where the reference forecast is generally the climatological relative frequency.

The ranked probability score (RPS) is essentially an extension of the Brier score to multi-event situations. Instead of just looking at the probability associated with one event or condition, it looks simultaneously at the cumulative probability of multiple events occurring.
RPS uses the forecast cumulative probability:

F_m = \sum_{j=1}^{m} f_j,  m = 1, ..., J    (20)

where f_j is the forecast probability in each of the J non-exceedance categories. In this paper, f_j = {0.1, 0.2, 0.4, 0.2, 0.1} for the five non-exceedance intervals {0-10%, >10-30%, >30-70%, >70-90%, and >90%}, so F_m = {0.1, 0.3, 0.7, 0.9, 1.0} and J = 5. The observation occurs in only one of the flow categories, which is given a value of 1; all the others are given a value of zero:

O_m = \sum_{j=1}^{m} o_j,  m = 1, ..., J    (21)

The RPS for a single forecast/observation pair is calculated from:

RPS_i = \sum_{m=1}^{J} ( F_m - O_m )^2    (22)

and the average RPS over a number of forecasts from:

\overline{RPS} = \frac{1}{n} \sum_{i=1}^{n} RPS_i    (23)

A perfect forecast will assign all the probability to the category in which the event occurs, resulting in RPS = 0. The RPS has a lower bound of 0 and an upper bound of J - 1. RPS values improve (decrease) as the observation falls closer to the categories assigned the highest probability. The RPS skill score is defined as:

SS_{RPS} = 1 - \frac{\overline{RPS}}{\overline{RPS}_{ref}}    (24)

where RPS_{ref} is based on the climatological cumulative frequency. The Brier score focuses on how well the forecasts perform in a single flow category; RPS is a measure of overall forecast quality. Note that statistics calculated from a small number of forecasts are more susceptible to being dominated by sampling variations, which makes assessing forecast quality difficult (Wilks, 1995). In addition, with smaller sample sizes it is more likely that some intervals have no data, because there are not enough forecasts to represent all combinations of forecast probability and flow categories.

2.5. Distributive measures

We used two distributive measures, reliability and discrimination, to assess the forecasts in various categories (i.e. low, medium, high flows). The same five forecast probabilities used for RPS were used to represent the probability given to each of the three flow categories. Our application of these measures follows that outlined by Franz et al. (2003).
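As a concrete sketch of the probabilistic measures of section 2.4, the RPS and Brier score can be computed as follows; the five category probabilities are those used for the WSO's, while the observations are hypothetical:

```python
import numpy as np

# Ranked probability score for one forecast/observation pair. The forecast
# assigns the climatological probabilities to the five flow categories; the
# (hypothetical) observation falls in the middle (>30-70%) category.
f_probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
obs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
F = np.cumsum(f_probs)    # forecast cumulative probability F_m
O = np.cumsum(obs)        # observed cumulative indicator O_m
rps = np.sum((F - O) ** 2)

# Brier score for a single binary event (e.g., flow in the lowest 30%),
# over three hypothetical forecast/outcome pairs.
f_event = np.array([0.3, 0.3, 0.7])   # forecast probabilities of the event
o_event = np.array([0.0, 1.0, 1.0])   # 1 if the event occurred, else 0
bs = np.mean((f_event - o_event) ** 2)
```

Because this observation falls in the highest-probability category, the RPS (0.2) is small; a forecast concentrating probability far from the observed category would score closer to the upper bound of J - 1 = 4.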
Reliability uses the conditional distribution p(o|f) and describes how often an observation occurred given a particular forecast. Ideally, p(o = 1 | f) = f (Murphy and Winkler, 1987). That is, for a set of forecasts where a forecast probability value f was given to a particular observation o, the forecasts are considered perfectly reliable if the relative frequency of the observation equals the forecast probability (Murphy and Winkler, 1992). For example, given all the times in which high flows were forecast with a 50% probability, the forecasts would be considered perfectly reliable if the actual flows turned out to be high in 50% of the cases. On a reliability diagram (Figure 2) the conditional distribution p(o|f) of a set of perfectly reliable forecasts falls along the 1:1 line. Forecasts that fall to the left of the line are underforecasting, i.e. not assigning enough probability to the subsequent observation. Those that fall to the right of the line are overforecasting. Conditional distributions of forecasts lacking resolution, meaning they are unable to identify occasions when the event is more or less likely than the overall climatology, plot along the horizontal line associated with their climatology value.

The discrimination diagram displays the conditional probability distributions p(f|o) of each possible flow category as a function of forecast probability (Figure 3). If the forecasts are discriminatory, the probability distribution functions of the forecasted flow categories will have minimal overlap on the discrimination diagram (Murphy et al., 1989). Ideally, a forecast issued prior to an observation of a low flow should say that there is a 100% chance of a low flow and a 0% chance of high or middle flows. A set of forecasts that consistently provides such strong and accurate statements is perfectly discriminatory and will produce a discrimination diagram like Figure 3a.
Figure 3b illustrates a case where the sample of forecasts is unable to consistently assign the largest probability to the occurrence of low flows. Users of forecasts from such a system could have no confidence in the predictions. A discrimination diagram is produced for occurrences of observations in each flow category; therefore, forecasts that were issued prior to observations that occurred in the lowest 30% (low flows), middle 40% (mid-flows), and highest 30% (high flows) are plotted on separate discrimination diagrams. The number of forecasts represented on each plot depends upon the number of historical observations in the respective flow category.

3. Results

3.1. Scalar measures

The New Fork R. near Big Piney (Upper Green R., 3,184 km²) and the Colorado R. near Dotsero (11,376 km²) together capture many of the patterns seen at the different sites, and are used to illustrate the different types of results. The Pbias values in Figure 4 show that 1997, an above-average flow year, represents an almost perfect forecast year for the New Fork at Big Piney, with forecast bias very close to 0. It is an excellent example of consistency in forecasting, with the concentric circles showing that the January forecast was the same as the July forecast. In many other years, such as 1992, a below-average flow year, there is significant forecast drift, with the January forecast farthest from 0 and values getting progressively better with each month. Comparing Figures 4a (4e) and 4b (4f) shows that years of above-average flow (e.g., 1983, 1986, 1995) are often associated with forecasts that are too low; conversely, in years of below-average flow (e.g. 1988 and 1992), forecasts were too high. This pattern of over- versus under-forecasting is seen more clearly by plotting f_i / o_i versus o_i / \bar{o} (Figure 5) for the two sites of Figure 4 plus the San Juan R. near Bluff (59,544 km²).
Ideally, all points should fall on the horizontal line f_i / o_i = 1, which would indicate that no matter how high or low (above or below average) the observed flow, the forecast values equal the observed values. In general, forecasts issued in May improve over those issued earlier in the year (920500 and 9070500).

Note that different years were used to produce the climatology against which forecasts were compared (e.g. Figures 4c and 4g). For example, for 1975-1980, data for 1958-1972 were used, while for 1993-2000, data from the 1961-1990 period were used. This trend is repeated for all the forecast locations. Every five or ten years, the definition of average observed flow changes, and different sites may use data from different time periods, although in 1991-2000 the majority of the forecasts were based on the 1961-1990 climatology. Starting in 2001, forecasts were based on the 1971-2000 climatology. Another problem in comparing forecasts from one year to another is that the forecast period, or months during which the forecasted flow is supposed to occur, changes, sometimes from month to month, other times from year to year (e.g. Figures 4d and 4h). For example, for 1975-1979, the forecast period for January was January-September and for May it was May-September. For 1980-1990, the forecast period was April through September for every month of issue, and from 1991 to the present, it was April to July. For these locations no one forecasting period has a visibly better correlation than another, nor do forecasts show any marked improvement over the period of record.

Like Pbias, R² values across all the sites are lowest in January (all sites < 0.5) and become progressively higher through May (0.4-0.9, with the highest around 0.8, although there is little difference among February through April values) (Figure 6). Even in April and May there are still many poorly correlated sites.
The distributions for MAE and RMSE are similar (Figure 6), with the slightly higher values for RMSE capturing the generally higher bias values of the higher-flow years. Although SS_MSE is sensitive to high forecast errors, e.g. in extreme flow years, it has a broader distribution because of the poor representation of the annual flow by the climatological mean at most sites. RMSE is a poorer measure of skill than the other summary and correlation measures, as it is as much related to flow volume as anything else. Of the 10 sites with the lowest RMSE, 5 are tributaries in the lower Green basin and the other 5 are smaller creeks and rivers as well. Of the 10 with the highest (worst) RMSE, 4 are on the Colorado R. and 2 are on the San Juan R.; the others are on the Green R., Gunnison R., Yampa R. and Salt R.

A similar pattern is seen for NSC as for the other measures, although there is little difference among the February through April values (Figure 6). No one region, with the possible exception of the Virgin R., has significantly better forecasts than the other regions (Figure 7). Multiple basins have near-zero NSC's. Two sites have negative values during all five months, indicating that the forecasts are not an improvement over climatology: the Strawberry R. near Duchesne in the lower Green basin, and the Florida R. inflow to Lemon Reservoir in the San Juan R. basin. One additional site has some negative values (9050700). Of the five sites with the highest average NSC values for all five months, three are in the upper Green: the Green R. at Warren Bridge, Pine Creek above Fremont Lake, and the Fontenelle Reservoir inflow. The other two are the Virgin R. near Virgin, which had good correlations in March-May despite very low January values, and the Gunnison R. inflow to Blue Mesa Reservoir. Overall the April forecasts display a tendency toward a negative skew of forecast errors (Figure 8), with this being most pronounced in the Gila R.
Basin, although most of the other basins had some sites with negative skew of forecast errors, some sites with no skew, and no sites with positive skew. A large negative skew means that the overall tendency of the forecasts is to underpredict rather than overpredict, although this is often influenced to some extent by a few large negative values (Shafer and Huddleston, 1984).

3.2. Categorical measures

Hit rate, threat score, false alarm rate and probability of detection (Figures 9-12) for each month and flow category need to be considered together. Eighty to ninety percent of sites have high HR values for the lower and upper 30% flow categories (Figure 9), meaning that these flows actually occur a majority of the time that the forecast is for high or low flows. Similarly, FAR (0 is perfect) is best for the low and high flows (Figure 11). However, the POD shows that the majority of high and low flows that occur are not being accurately forecast (Figure 12). In January-April, under 5% of flows in the upper or lower 30% are correctly forecast, i.e. the POD was near 0. There were very few forecast locations with POD above 0.5 for the high and low flows. POD for the mid 40% was high, because most forecasts predict that conditions will fall in this category. For the same reason, HR was low and FAR high in the middle category. Note that TS (Figure 10) combines some features of HR and POD: while it is similar to HR for the mid 40%, it is low for the upper and lower 30%. The bias (not shown) was near 0-0.25 (very low) for low and high flows, again showing that they are underpredicted, and between 2 and 4 (very high) for moderate flows, showing that they are overpredicted.

3.3. Probabilistic measures

At the New Fork R. near Big Piney, the LEPS is clearly better than LEPSref (Figure 13a), with the LEPS skill score increasing from January through May (Figure 13d). This same pattern was consistent across the basin (Figure 14a).
The Brier scores of the forecasts for the New Fork R. near Big Piney were also all better than those of the reference set (Figure 13b), with skill increasing slightly through the forecast period (Figure 13e). The same is true across the basin (Figure 14b). The drop in the May SS_BS (Figure 13e) is due to the shift in both the reference and observed values. Elsewhere in the basin, the February skill scores in the lower Green R. and San Juan R. basins are lower than those in January; otherwise patterns generally show consistent increases. Five of the sites in the Gila R. basin have negative SS_BS values in March, making the basin average negative. The Virgin R. basin had the highest average SS_BS.

At the New Fork R. near Big Piney, forecast RPS values are better than RPSref only for the earliest forecasts (January), with the poorest performance occurring in March and April. Across the basin, twenty-two of the 54 sites had a negative average SS_RPS for January-May. Thirteen had negative SS_RPS values for each month that a forecast was issued. Seven of these were the Gila R. basin locations; two were in the San Juan R. basin (the San Juan R. near Bluff and the San Juan R. inflow to Navajo Reservoir); one was along the main stem of the upper Colorado (Colorado R. near Cisco); and one each was in the upper Green, lower Green, and Yampa and White R. basins (Henry's Fork near Manila, Duchesne R. at Myton, and Little Snake R. near Dixon, respectively). However, four of the remaining San Juan R. basin sites had SS_RPS values in the top ten (averaging 30-40) and four of the Yampa and White R. basin sites were among the top fourteen.

3.4. Distributive measures

Table 2 shows the sum of the resolution in the <0.1 and >0.9 categories for each of the basins and for the study area as a whole. In a forecast system with perfect resolution, this sum should equal 1.
For the entire Colorado basin, the basin average of this sum increases from 0.5 in January to 0.8 in May for low and high flows, while for moderate flows the sum is lower, usually averaging less than 0.5, with values less than or equal to 0.3 in January and March in many of the basins. Low and high flows have the poorest resolution in the Virgin R. basin; the best average resolution for high and low flows occurs in the lower Green basin.

Table 3 shows this sum for the top 10 and bottom 10 sites in the high and low flow categories. Six of the ten sites with the best resolution for high and low flows are in the lower Green R. basin. The Gila R. at Calva (9466500) has the sixth best resolution of low flows and the second worst resolution of high flows. The poorest resolution of low flows occurred mostly at sites along the main stem of the upper Colorado R. and in the upper Green R. basins. The Eagle R. below Gypsum (9070000) and the Virgin R. near Virgin, UT (9406000) show poor resolution in all flow categories.

The reliability diagram (Figure 15) illustrates resolution as well, using a tributary near the New Fork at Big Piney, the Green R. at Warren Bridge, as an example. Later months have better resolution than earlier months, which have a larger fraction of flows, especially moderate flows, being forecast with only 30-70% likelihood. For the forecast low flows, forecasts of non-occurrence (<10% probability) are much more frequent than forecasts of occurrence (>90% probability).
The diagram shows that as forecasts gain resolution, the forecast probability becomes more narrowly distributed and more frequently assigned to the extreme intervals (i.e., 0-10% and 90-100%); this appears in the reliability diagram as shrinking sample sizes for the middle probability intervals as the forecasts sharpen (e.g., January versus April reliability diagrams for the highest flows). Reliability for this site shows similar patterns for all five months. Low flows are underconfident at low probability, have no reliability at moderate probability, and are overconfident at high probability. High flows are overforecast at low probabilities and at 30-70% and 70-90% likelihood, but overall seem to have better reliability than the low flows. Discrimination at this site, however, is better for low flows than for high flows (Figure 16). High flows are rarely observed when low flows are predicted. In March-May, when low flows were observed, 80-90% of the forecasts predicted less than 10% probability of high flow, and low flows were accurately predicted 50% of the time in April and 80% of the time in May. When moderate flows are observed, all flow categories are given about equal chance of occurring, and no flow is given a high probability of occurring, even late in the year. In the high flow category at this site, even in May, high flows are only predicted to occur about 50% of the time that they are observed, and are forecast not to occur about 30% of the time that they are observed. However, low flows are almost never observed when high flows are predicted. When high flows are observed, forecast discrimination of moderate flow is accurate in March-May as well.

4. Discussion

4.1.
General observations

Shafer and Huddleston (1984) compared forecasts for two 15-year periods, 1951-65 and 1966-80, and concluded that a slight relative improvement (about 10%) in forecast ability occurred about the time computers became widely used in developing forecasts. They attributed the gradual improvement in forecast skill to a combination of greater data-processing capacity and the inclusion of data from additional hydrologically important sites. They suggested that "modest improvement might be expected with the addition of satellite derived snow covered area or mean areal water equivalent data", which were not readily available for most operational applications at the time. Although satellite data of snow-covered area are now available, those data are not being routinely used in WSO's. We found no significant differences in our various measures across different parts of the period of record. We did not do a direct comparison with the Shafer and Huddleston (1984) results, as their sites were grouped by state, rather than basin, boundaries. According to their study, Arizona had the highest error (more than 55% for April 1 streamflow forecasts), but also the highest skill. Of the Colorado basin states, Wyoming (only part of which is in the basin) had the lowest forecast error (~20%) paired with the highest skill. In applying Shafer and Huddleston's (1984) measures of forecast skill to our data, we found similar trends. The Gunnison/Dolores, Upper Green, and Lake Powell sites consistently had absolute values of percent forecast errors less than 10%, the Gila had percent errors ranging from 24 to 52%, and the five other watersheds mostly had errors between 8 and 20%. The largest improvement in forecast error occurred between January and April in the Virgin R.
Basin (May was not as good, but still under 10%) and between January and March in the Gila (but April was extremely poor), although the March error in the Gila is still higher than at any other site. Skill coefficients generally improved from January to May (except for April in the Gila), from a Colorado basin-wide average of 1.31 to an average of 2.05. Despite the problems seen with some of the other forecast skill methods for Virgin R. data, the Virgin R. combined low forecast errors with high skill coefficients in April and May.

4.2. Regional differences

On the main stem of the Upper Colorado, the sites at Eagle R. below Gypsum, Colorado R. near Dotsero and Colorado R. near Cameo exhibit similar patterns of discrimination, resolution and reliability. The non-occurring event is predicted close to 0 probability for both the high and low flows, and the low flow category is given a good probability (>50%) of occurring during periods when low flows were observed, for the predictions made in March, April and May. Discrimination of high flows is not as good, but still an improvement over the climatological means in the later part of the forecast season. At the other six sites, the discrimination of non-occurrence is still good, but the occurrence of high and low flows is not predicted as well, or as early, or both. For example, at the Blue R. inflow to Dillon Reservoir and Williams Fork near Parshall, even in May the low flows are given over a 60% probability of not occurring during the times they were observed. Most of the forecasts do not exhibit much reliability, or even show much improvement in reliability over the forecast season. Hit rates are generally better for the lowest 30% of flows (0.6-0.95) than the upper 30% (0.2-0.8). The five sites in the Gunnison/Dolores show a higher overall hit rate for high flows than the main stem of the Upper Colorado. Best reliability occurs for the lowest 30% of flows at the Gunnison R.
inflow to Blue Mesa Reservoir and the East R. at Almont. Discrimination of the non-occurrence of extreme events is very good, but the events that do occur are not being forecast. Four of the five sites do a good job of predicting high flows in the April and May forecasts (50-70%). The occurrences of low flows are seldom accurately forecast. At four of the five sites in the Upper Green (all except Henrys Fork near Manila), the forecasts are usually within a factor of 2 (50% to 200%) of the observed value, with the Green R. at Warren Bridge and Pine Creek above Fremont Lake having forecast values closest to the naturalized streamflow. During two of the low-flow years (1979 and 1989), forecasts at Henrys Fork near Manila were as much as 7 times the naturalized streamflow. However, the flows at this site were generally less than 2.83 m3 s-1 (100 cfs), lower than at any of the other sites in this basin, and even small differences in the forecast can lead to large apparent discrepancies. Despite these low-flow problems, high-flow forecasts at this site were extremely reliable, the best of any site in this basin. Hit rates for high flows were overall better than for low flows, and the probability of detection was zero for most months and sites. Pine Creek above Fremont Lake has the best discrimination of the occurrence of high flows (40-50%, March-May) and low flows (80-100%, March-May) of these five sites. The years 1982-1984 were an extended period of above-average flows in the Yampa/White R. basin, characterized by low forecast/observed values at the six sites, while the years of lowest flows (1966, 1976-77, 1989-90, 1994) had the highest forecast/observed values, consistent with the pattern seen elsewhere. Extreme flows are the least well forecast. All the sites except the Little Snake R. near Dixon (which has the shortest record) have very good reliability for predicted low flows during all the forecast months.
High flows are less reliable, but still better than climatology for the most part. Discrimination of high and low flows is similar to that observed in other basins, with the occurrence of low flows being forecast with strong certainty about 50% of the time in April and May. High flows are forecast with much less certainty. The Lower Green R. basin has 11 sites, the largest number of any of the basins. The poorest forecasts occurred at Strawberry R. near Duchesne and Duchesne R. at Myton. Both had at least four months with negative LEPS and RPS skill scores, indicating that the forecasts were not an improvement over climatology. For the Strawberry R. near Duchesne, there was a high hit rate for low flows (0.8-0.9), a poor hit rate for high flows (0.3-0.5), and a low POD for either. There was effectively no reliability and no discrimination for any of the flow classes. For example, high flows were only given a 50% probability of occurring about 20% of the time they were observed, and a 0% probability of occurring the other 80% of the time. Other sites had better-than-climatology, although still imperfect, forecasts. Rock Creek near Mountain Home and Duchesne R. above Knight Diversion had excellent April-May low-flow discrimination. For the Green R. at Green R., low flows generally were given a 40-50% chance of not occurring when they were observed. High flows always were given some possibility of occurring, although sometimes only 10-50%, when high flows were observed. Huntington Creek near Huntington had a low but non-zero POD for low flows, but some false alarms. The forecasted low flows in the San Juan R. basin have a high hit rate (generally 0.7-0.9), while the forecasted high flows generally have a hit rate of only 0.4-0.6. However, the POD of high and low flows is still poor at all the sites.
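The hit rate, threat score, POD, and false alarm statistics quoted throughout all derive from a 2x2 contingency table (hits, misses, false alarms, correct negatives) built separately for each flow category. A sketch with illustrative counts, not actual WSO tallies; the false-alarm statistic here is computed relative to the number of times the event was forecast, one common convention:

```python
def categorical_scores(hits, misses, false_alarms, correct_negatives):
    """Standard 2x2 contingency-table measures for one flow category."""
    n = hits + misses + false_alarms + correct_negatives
    return {
        "hit_rate": (hits + correct_negatives) / n,   # fraction of all forecasts correct
        "pod": hits / (hits + misses),                # probability of detection
        "far": false_alarms / (hits + false_alarms),  # false alarms per event forecast
        "threat_score": hits / (hits + misses + false_alarms),
    }

# A high-flow category that is rarely forecast: few false alarms (low FAR)
# but many misses (low POD), the pattern described in the text.
scores = categorical_scores(hits=4, misses=8, false_alarms=1, correct_negatives=27)
print({k: round(v, 2) for k, v in scores.items()})
# {'hit_rate': 0.78, 'pod': 0.33, 'far': 0.2, 'threat_score': 0.31}
```
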
Discrimination of the non-occurrence of high flows during low-flow periods is excellent as early as February at six of the seven sites (every site except the Florida R. inflow to Lemon Reservoir), although the non-occurrence of low flows during high-flow periods is not discriminated as well (it is best at Piedra R. near Arboles, Animas R. near Durango, and the Florida R. inflow to Lemon Reservoir). One disadvantage of the two Virgin R. sites is that both have fairly significant gaps in time. However, this is an important watershed in the Southwest and these sites should not be excluded from the study. Extreme flows are not predicted well at either site, particularly early in the forecast season. Neither site shows any discrimination of the occurrence or non-occurrence of low flows, but the high flows have some discrimination for March-May. Low flows tend to be severely overestimated by the tendency to forecast moderate flows. Low-flow bias is very close to 1 for many sites in the Gila R. basin (Salt R. near Roosevelt, San Francisco R. at Clifton, San Francisco R. near Glenwood, Tonto Creek above Gun Creek near Roosevelt, Gila R. below Blue Creek near Virden, Verde R. below Tangle Creek); HR, TS, and POD also tend to be higher than 0.5, indicating that low flows in the basin are often predicted accurately. However, the TS and POD are near 0 for high flows in all months, suggesting a consistent inability to accurately predict high-flow seasons in this basin, which would contribute to the large negative values observed in the Shafer and Huddleston (1984) skew coefficient. Generally negative skill scores for the probabilistic measures (SSBS and SSRPS) also indicate that flows in this basin are not well forecast. The Gila R. sites do tend to show good discrimination of the non-occurrence of high events during times of low flow. The Salt R. near Roosevelt and Gila R. near Gila show good reliability of forecasts of low flows February through April.
The reliability of forecasts of moderate and high flows is still poor. The other sites (San Francisco R., Tonto Creek, Verde R.) show no pattern of reliability at all.

4.3. Comparison of different measures

While intuitively appealing, simple summary and correlation measures provide only a broad indication of forecast skill. Pbias is perhaps the most intuitive of the scalar measures, and is simple to communicate for an individual forecast. Pbias has also been used to compare averaged values rather than the correspondence between individual forecasts and their associated observations; in that case it is not strictly a measure of forecast accuracy. SSMSE and NSC are also intuitively appealing, in that they directly indicate the improvement of forecasts over climatology. However, they have limited diagnostic value for distinguishing performance on high versus low flows and forecasts. The categorical measures are an improvement, providing a more complete assessment of forecast skill by giving information about high and low forecasts. However, they neglect the uncertainty information inherent in the exceedance quantiles that accompany the deterministic forecast value, attributing an implied 100% probability to the "most probable" forecast value, even though that value (as the distribution median) is equally likely to be too high or too low. Probabilistic measures are an improvement over categorical measures because they reflect the inherently probabilistic nature of forecasts, by considering the probability specified for each category of interest. While they may be less intuitive, they are analogous to standard error estimates, but in a probability space rather than the measurement space (i.e., flow volumes). Distributive measures, e.g. discrimination and reliability, provide the most comprehensive forecast evaluations and allow performance in all streamflow categories to be examined individually, considering both forecast probabilities and observed frequencies of occurrence.
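Discrimination, for example, conditions on what was actually observed: for each observed category one examines the distribution of the probabilities that had been forecast, following the general framework of Murphy and Winkler (1987). A minimal sketch with invented pairs; well-separated conditional means indicate good discrimination:

```python
def discrimination_means(pairs):
    """pairs: (probability assigned to the event, event occurred?) tuples.
    Returns the mean forecast probability when the event occurred and
    when it did not; wide separation indicates good discrimination."""
    occurred = [p for p, obs in pairs if obs]
    absent = [p for p, obs in pairs if not obs]
    return sum(occurred) / len(occurred), sum(absent) / len(absent)

# Invented forecasts of the low-flow category.
pairs = [(0.80, True), (0.60, True), (0.90, True),
         (0.10, False), (0.30, False), (0.05, False)]
when_observed, when_absent = discrimination_means(pairs)
print(round(when_observed, 2), round(when_absent, 2))  # 0.77 0.15
```

A discrimination diagram such as Figure 16 generalizes this by binning the conditional probabilities rather than reducing them to means.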
However, their sensitivity to small sample sizes is an acknowledged limitation.

5. Conclusions

Despite their shortcomings, by most measures and at most forecast sites the federally issued seasonal water supply outlooks are an improvement over climatology. The majority of forecast points have very conservative predictions of seasonal flow. Below-average flows are often over predicted (forecast values are too high) and above-average flows are under predicted (forecast values are too low). This problem is most severe for early forecasts (e.g., January) at many locations, and improves somewhat with later forecasts (e.g., May). For low and high flows there is generally a low false alarm rate, which means that when low and high flows are forecast, these forecasts are generally accurate. However, for low and high flows there is also a low probability of detection at most sites, which indicates that these flows are not forecast nearly as often as they are observed. Moderate flows have a very high probability of detection, but also a very high false alarm rate, indicating that the likelihood of moderate flows is overforecast. There is also good discrimination between high and low flows, particularly with forecasts issued later in the year. This means that when high flows are forecast, low flows are not observed, and vice versa. However, the probability that high and low flows will be accurately predicted, particularly early in the year, is not as good. The accuracy of forecasts tends to improve with each month, so that forecasts issued in May tend to be much more reliable than those issued in January. Not all streams or areas show the same patterns and trends, but there is considerable similarity in the relationship between forecasts and observations, particularly in the Upper Colorado. The changes in forecasting periods (most recently to April-July in the Upper Basin and the forecast month through May in the Lower Basin) did not affect the accuracy of the forecasts.
More use of the categorical, probabilistic, and distributive measures is encouraged. Although the WSO's have historically been deterministically derived, their calibration error statistics and the use of probabilistic and distributive measures provide an avenue for comparison with experimental, ensemble-based probabilistic forecasts. Evaluations would be improved by further development of historical forecast and observation data, documentation of model details (e.g., changes in regression equations), and more realistic accounting of flow diversions and storage in the estimation of naturalized flows.

Acknowledgements

Support for this research was provided by NOAA's Office of Global Programs through the Climate Assessment for the Southwest Project at the University of Arizona. Additional support was provided by the National Science Foundation through the Center for the Sustainability of semi-Arid Hydrology and Riparian Areas, also centered at the University of Arizona.

References

Bales, R. C., D. M. Liverman and B. J. Morehouse, 2004: Integrated Assessment as a Step Toward Reducing Climate Vulnerability in the Southwestern United States. Bulletin of the American Meteorological Society, 85, 1727.

CBRFC, 1991: Lower Colorado water supply 1991 review. CBRFC/NWS, Salt Lake City.

CBRFC, 1992: Water supply outlook for the Lower Colorado, March 1, 1992. CBRFC/NWS, Salt Lake City.

CBRFC, undated: Guide to Water Supply Forecasting. CBRFC/NWS, Salt Lake City.

Day, G.N., 1985: Extended streamflow forecasting using NWSRFS. Journal of Water Resources Planning and Management, 111(2), 157-170.

Franz, K. J., H. C. Hartmann, S. Sorooshian and R. Bales, 2003: Verification of National Weather Service Ensemble Streamflow Predictions for Water Supply Forecasting in the Colorado R. Basin. Journal of Hydrometeorology, 4(6), 1105-1118.

Hartmann, H.C., R. Bales, and S. Sorooshian: Weather, climate, and hydrologic forecasting for the U.S.
Southwest: a survey. Climate Research, 21, 239-258, 2002.

Legates, D. R. and G. J. McCabe, Jr.: Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35, 233-241, 1999.

Murphy, A.H. and Winkler, R.L., 1992: Diagnostic verification of probability forecasts. International Journal of Forecasting, 7, 435-455.

Murphy, A.H., Brown, B.G., and Chen, Y., 1989: Diagnostic verification of temperature forecasts. Weather and Forecasting, 4, 485-501.

Murphy, A.H. and Winkler, R.L., 1987: A general framework for forecast verification. Monthly Weather Review, 115, 1330-1338.

Nash, J. E. and J. V. Sutcliffe, 1970: River flow forecasting through conceptual models part I — A discussion of principles. Journal of Hydrology, 10(3), 282-290.

Shafer, B.A. and Huddleston, J.M., 1984: Analysis of seasonal volume streamflow forecast errors in the western United States. Proceedings, A Critical Assessment of Forecasting in Water Quality Goals in Western Water Resources Management, Bethesda, MD, American Water Resources Association, 117-126.

Soil Conservation Service and NWS: Water supply outlook for the western United States. West National Technical Center, Soil Conservation Service, Portland, 1994.

Wilks, D.S., 1995: Forecast verification. Statistical Methods in the Atmospheric Sciences, Academic Press, 467 p.

Table 1. The 54 sites used in this study.

USGS no.  Name                                            Elev., m  Area, km2  Years

MAIN STEM UPPER COLORADO
9019000   Colorado R. inflow to L. Granby, CO             2,415     808        1953-00
9037500   Williams Fork nr. Parshall, CO                  2,343     476        1956-96
9050700   Blue R. inflow to Dillon Res., CO               2,628     867        1972-00
9057500   Blue R. inflow to Green Mountain Res., CO       2,305     1,551      1953-00
9070000   Eagle R. bel. Gypsum, CO                        1,883     2,444      1974-00
9070500   Colorado R. nr. Dotsero, CO                     1,839     11,376     1972-00
9085000   Roaring Fork at Glenwood Springs, CO            1,716     3,756      1953-00
9095500   Colorado R. nr. Cameo, CO                       1,444     20,840     1956-00
9180500   Colorado R. nr. Cisco, UT                       1,227     62,392     1956-00

GUNNISON / DOLORES
9112500   East R. at Almont, CO                           2,402     748        1956-00
9124800   Gunnison R. inflow to Blue Mesa Res., CO        2,145     8,939      1971-00
9147500   Uncompahgre R. at Colona, CO                    1,896     1,160      1953-00
9152500   Gunnison R. nr. Grand Junction, CO              1,388     20,525     1953-00
9166500   Dolores R. at Dolores, CO                       2,076     1,305      1953-00

UPPER GREEN
9188500   Green R. at Warren Bridge, WY                   2,240     1,212      1956-00
9196500   Pine Cr. abv. Fremont L., WY                    2,235     197        1969-00
9205000   New Fork R. nr. Big Piney, WY                   2,040     3,184      1974-00
9211150   Fontenelle Res. inflow, WY                      1,952     11,080     1971-00
9229500   Henrys Fork nr. Manila, UT                      1,818     1,346      1971-94

YAMPA / WHITE
9239500   Yampa R. at Steamboat Springs, CO               2,009     1,564      1953-00
9241000   Elk R. at Clark, CO                             2,180     559        1953-93
9251000   Yampa R. nr. Maybell, CO                        1,770     8,828      1956-00
9257000   Little Snake R. nr. Dixon, WY                   1,899     2,558      1980-00
9260000   Little Snake R. nr. Lily, CO                    1,706     9,657      1953-00
9304500   White R. nr. Meeker, CO                         1,890     1,955      1953-00

LOWER GREEN
9266500   Ashley Cr. nr. Vernal, UT                       1,869     261        1953-00
9275500   W. Fork Duchesne R. nr. Hanna, UT, unimp.       2,165     161        1974-00
9277500   Duchesne R. nr. Tabiona, UT, unimp.             1,857     914        1953-00
9279000   Rock Cr. nr. Mountain Home, UT                  2,175     381        1964-00
9279150   Duchesne R. abv. Knight Diversion, UT           1,752     1,613      1964-00
9288180   Strawberry R. nr. Duchesne, UT                  1,717     2,374      1953-00
9291000   L. Fork R. bel. Moon L. nr. Mountain Home, UT   2,391     290        1953-00
9295000   Duchesne R. at Myton, UT, unimp.                1,518     6,842      1956-00
9299500   Whiterocks R. nr. Whiterocks, UT                2,160     282        1953-00
9315000   Green R. at Green R., UT                        1,212     116,111    1956-00
9317997   Huntington Cr. nr. Huntington, UT               1,935     461        1953-00

SAN JUAN R. BASIN
9349800   Piedra R. nr. Arboles, CO                       1,844     1,628      1971-00
9353500   Los Pinos R. nr. Bayfield, CO                   2,275     699        1953-00
9355200   San Juan R. inflow to Navajo Res., NM           1,697     8,440      1963-00
9361500   Animas R. at Durango, CO                        1,951     1,792      1953-00
9363100   Florida R. inflow to Lemon Res., CO             1,941     47         1953-00
9365500   La Plata R. at Hesperus, CO                     2,432     96         1954-00
9379500   San Juan R. nr. Bluff, UT                       1,214     59,544     1956-00

LAKE POWELL
9379900   L. Powell at Glen Canyon Dam                    930       278,822    1963-00

VIRGIN R.
9406000   Virgin R. nr. Virgin, UT                        1,050     2,475      1957-00
9408150   Virgin R. nr. Hurricane, UT                     834       3,881      1972-00

GILA R. BASIN
9430500   Gila R. nr. Gila, NM                            1,397     4,826      1964-00
9432000   Gila R. bel. Blue Creek nr. Virden, NM          1,227     7,324      1954-00
9444000   San Francisco R. nr. Glenwood, NM               1,368     4,279      1964-00
9444500   San Francisco R. at Clifton, AZ                 1,031     7,161      1953-00
9466500   Gila R. at Calva, AZ                            755       29,694     1963-98
9498500   Salt R. nr. Roosevelt, AZ                       653       11,148     1953-00
9499000   Tonto Cr. abv. Gun Cr., nr. Roosevelt, AZ       757       1,747      1955-00
9508500   Verde R. bel. Tangle Cr., abv. Horseshoe Dam, AZ  609     15,168     1953-00

Table 2. Resolution in the <0.1 and >0.9 categories of flow

Low flows
Basin             Jan  Feb  Mar  Apr  May
Upper Colorado    0.4  0.5  0.6  0.6  0.7
Gunnison/Dolores  0.5  0.6  0.7  0.8  0.8
Upper Green       0.3  0.5  0.5  0.6  0.7
Yampa/White       0.5  0.6  0.6  0.7  0.8
Lower Green       0.6  0.7  0.7  0.8  0.9
San Juan          0.5  0.6  0.7  0.7  0.8
Lake Powell       0.8  0.8  0.8  0.8  0.8
Virgin            0.4  0.4  0.5  0.7  0.5
Gila              0.6  0.7  0.6  0.6  –
All               0.5  0.6  0.6  0.7  0.8

Moderate flows
Basin             Jan  Feb  Mar  Apr  May
Upper Colorado    0.2  0.2  0.3  0.4  0.5
Gunnison/Dolores  0.3  0.3  0.3  0.5  0.6
Upper Green       0.1  0.2  0.2  0.3  0.5
Yampa/White       0.3  0.2  0.3  0.4  0.5
Lower Green       0.4  0.4  0.5  0.6  0.7
San Juan          0.2  0.2  0.3  0.4  0.6
Lake Powell       0.5  0.5  0.5  0.5  0.5
Virgin            0.2  0.3  0.4  0.5  0.4
Gila              0.3  0.3  0.4  0.4  –
All               0.3  0.3  0.4  0.4  0.6

High flows
Basin             Jan  Feb  Mar  Apr  May
Upper Colorado    0.4  0.5  0.5  0.6  0.7
Gunnison/Dolores  0.5  0.5  0.6  0.7  0.8
Upper Green       0.4  0.6  0.6  0.6  0.7
Yampa/White       0.6  0.5  0.6  0.6  0.7
Lower Green       0.6  0.6  0.7  0.8  0.8
San Juan          0.4  0.4  0.5  0.7  0.8
Lake Powell       0.6  0.6  0.6  0.6  0.6
Virgin            0.2  0.4  0.6  0.6  0.7
Gila              0.4  0.5  0.6  0.6  –
All               0.5  0.5  0.6  0.7  0.8

–: no May forecast issued for the Gila R. basin sites.

Table 3.
Rank of sites by relative resolution of flows

Rank  Point    Resolution(a)  Basin

Low flows, best resolution
1     9277500  0.95  LG
2     9295000  0.94  LG
3     9288180  0.81  LG
4     9317997  0.79  LG
5     9466500  0.76  Gi
6     9379900  0.76  LP
7     9353500  0.74  SJ
8     9315000  0.74  LG
9     9241000  0.73  YW
10

Low flows, worst resolution
45    9070500  0.54  UC
46    9037500  0.54  UC
47    9070000  0.54  UC
48    9095500  0.52  UC
49    9188500  0.49  UG
50    9019000  0.47  UC
51    9211150  0.44  UG
52    9205000  0.43  UG
53    9363100  0.43  SJ
54    9406000  0.41  Vi

High flows, best resolution
1     9288180  0.95  LG
2     9257000  0.91  YW
3     9211150  0.85  UG
4     9277500  0.81  LG
5     9317997  0.77  LG
6     9363100  0.74  SJ
7     9299500  0.72  LG
8     9498500  0.70  Gi
9     9295000  0.68  LG
10

High flows, worst resolution
45    9508500  0.50  Gi
46    9365500  0.49  SJ
47    9070000  0.49  UC
48    9406000  0.49  Vi
49    9408150  0.49  Vi
50    9147500  0.47  GD
51    9349800  0.44  SJ
52    9304500  0.43  YW
53    9466500  0.34  Gi
54    9229500  0.25  UG

(a) Sum of high and low probabilities

List of Figures

1. Location of 54 water supply outlook forecast points in the Colorado R. Basin used in this study.
2. Example reliability diagram illustrating relative frequency for various forecast skills. Horizontal lines at 0.3 and 0.4 indicate no resolution for high and low flows and for middle flows, respectively. Light vertical lines indicate forecast categories.
3. Example discrimination diagram illustrating relative frequency for observed low flows. Light vertical lines indicate forecast categories.
4. For each year at 2 forecast points: Pbias values (a, e), with each circle representing a different month (smallest is for the first forecast made that year, largest is the last); observed/average discharge values (b, f); years used in computing the climatological average on which the forecast is based (c, g); and forecast period (d, h), with the top hatch representing the first month and the lower hatch marking the last month of the forecast period.
5. For the same 2 sites as on Figure 4 plus the San Juan R.
near Bluff, f_i/o_i plotted against o_i/ō for April and May. The horizontal lines at 0.8 and 1.2 are provided for reference.
6. Summary and correlation measures associated with forecasts issued in January through May. Pbias is a linear mean of values. There were 54 sites used in January-April and only 47 used in May, because the 8 Gila R. Basin sites do not issue May forecasts.
7. Nash-Sutcliffe coefficient in April for the entire area and each sub-region.
8. Skew coefficient G in April for the entire area and each sub-region.
9. Frequency histograms of Hit Rate for observations in the lowest 30% of flows, the middle 40% of flows, and the upper 30% of flows.
10. Frequency histograms of Threat Score for observations in the lowest 30% of flows, the middle 40% of flows, and the upper 30% of flows.
11. Frequency histograms of False Alarm Rate for observations in the lowest 30% of flows, the middle 40% of flows, and the upper 30% of flows.
12. Frequency histograms of Probability of Detection for observations in the lowest 30% of flows, the middle 40% of flows, and the upper 30% of flows.
13. The LEPS value, Brier score and ranked probability score, and the associated skill scores for the New Fork R. near Big Piney (USGS 9205000).
14. Monthly average (a) LEPS skill scores, (b) Brier skill scores and (c) ranked probability skill scores for each basin.
15. Reliability diagrams for the Green R. at Warren Bridge (9188500). The size of circles indicates the relative frequency of the forecast.
16. Discrimination diagrams for the Green R. at Warren Bridge (9188500).