An Assessment of Seasonal Water Supply Outlooks in the Colorado R. Basin
Jean C. Morrill¹, Holly C. Hartmann¹ and Roger C. Bales²,ᵃ

¹Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ, USA
²School of Engineering, University of California, Merced, CA, USA

ᵃCorresponding author:
Roger C. Bales
University of California, Merced
P.O. Box 2039
Merced, CA 95344
209-724-4348 (o)
209-228-4047 (fax)
[email protected]
Abstract
A variety of forecast skill measures of interest to the water resources applications
community and other stakeholders were used to assess the strengths and weaknesses of
seasonal water supply outlooks (WSOs) at 54 sites in the Colorado R. basin, and provide
a baseline against which alternative and experimental forecast methods can be compared.
These included traditional scalar measures, categorical measures, probabilistic measures
and distribution-oriented measures. Despite their shortcomings, the WSOs are an
improvement over climatology at most sites over the period of record. The majority of
forecast points have very conservative predictions of seasonal flow, with below-average
flows often overpredicted and above-average flows underpredicted. Late-season
forecasts at most locations are generally better than those issued in January. There is a
low false alarm rate for both low and high flows at most sites; however, low and high
flows are not forecast nearly as often as they are observed. Moderate flows have a very
high probability of detection, but are forecast more often than they occur. There is also
good discrimination between high and low flows, i.e. when high flows are forecast, low
flows are not observed, and vice versa. The diversity of forecast performance metrics
reflects the multi-attribute nature of forecasts and ensembles.
1. Introduction
Seasonal water supply outlooks, which predict the volume of total seasonal runoff, are routinely used by
decision makers in the southwestern United States for making commitments for water
deliveries, determining industrial and agricultural water allocation, and carrying out
reservoir operations. In the Colorado R. basin, the National Weather Service (NWS)
Colorado Basin R. Forecast Center (CBRFC) and the Natural Resources Conservation
Service (NRCS) jointly issue seasonal water supply outlook (WSO) forecasts of
naturalized, or unimpaired, flow, i.e. the flow that would most likely occur in the absence
of diversions and reservoir storage (e.g., CBRFC, 1992; Soil Conservation Service and
NWS, 1994).
Currently, WSOs are issued once each month from January to June. However, until
the mid-1990s, the forecasts were only issued until May. The forecast period is the
period of time over which the forecasted flow is predicted to occur. It is not the same for
all sites, all years at one location, or even all months in a single year. In the past decade,
the most common forecast period has been April-July for most sites in the upper
Colorado R. basin and January-May for the Lower Colorado, for each month a forecast
was issued. Previously, however, many sites used April-September forecast periods, and
before that the forecast period ran from the month of issue through September
(January-September for the January forecast, February-September for the February forecast, etc.).
Both the CBRFC and NRCS base their WSOs on multivariate regression relationships (Day,
1985; Hartmann et al., 2002; Pagano et al., 2004). Unique regressions for each forecast period and
location use subsets of monthly or seasonal observations of precipitation, streamflow,
ground-based snow-water depths, and routed forecasted streamflows; some Arizona locations
incorporate Southern Oscillation Index values to reflect climatic teleconnections. The regression
equations produce only a single deterministic water supply volume, termed the “most probable”
forecast in some publications, although the term is neither statistically rigorous nor preferred
terminology (Hartmann et al., 2002). The WSOs typically also compare this value to the mean or median for a
historical climatological period (usually 10-30 years). Additionally, seasonal total water volumes
corresponding to 10%, 30%, 70% and 90% exceedance values have often been provided in the
outlook bulletins. These quantiles are obtained by overlaying a normalized error distribution,
determined during regression equation fitting, centered on the deterministic regression forecast that
then corresponds to the distribution median. Most of the sites at which forecasts are issued are
impaired, i.e., have diversions above the forecast and gauging location. Therefore the
CBRFC combines measured discharges with historical estimates of diversions to reconstruct
the unimpeded observed flow (CBRFC, undated).
Forecast verification is important for assessing forecast quality and performance,
improving forecasting procedures, and providing users with information helpful in
applying the forecasts (Murphy and Winkler, 1987). Decision makers take account of
forecast skill in using forecast information and are interested in having access to a variety
of skill measures (Bales et al., 2004; Franz et al., 2003).
Shafer and Huddleston (1984) examined average forecast error at over 500 forecast
points in 10 western states. They used summary statistical measures and found that
forecast errors tended to be approximately normally distributed, but with a slightly
negative skew that resulted from a few large negative errors (under-forecasts) with no
corresponding large positive errors. High errors were not always associated with poor
skill scores, however. Pagano et al. (2004) found similar results for 29 locations
throughout the West, using correlation-related summary measures. Both evaluation
studies treated the WSO’s as strictly deterministic products, i.e., as single-value forecasts
without any uncertainty information.
The work reported here assesses the skill of forecasts relative to naturalized
streamflow across the Colorado R. basin, but using a greater variety of evaluation metrics
of interest to stakeholders: traditional scalar measures (linear correlation, root mean square error and bias), categorical measures (false alarm rate, threat score),
probabilistic measures (Brier score, ranked probability score) and distributive measures
(resolution, reliability and discrimination). The purpose was to assess the strengths and
weaknesses of the current water supply forecasts, and provide a comprehensive, multidimensional baseline against which alternative and experimental forecast methods can be
compared.
2. Data and methods
2.1. Data
WSO records from 136 forecast points on 84 water bodies were assembled, including
some forecast locations that are no longer active.
Reconstructed flows were made available by the CBRFC and NOAA (T. Tolsdorf and
S. Shumate, personal communication); however, data were not available for all forecast
locations. Many current forecast points were established in 1993, and so do not yet have
good long-term records. For this study we chose 54 sites having at least 10 years of both
forecast and observed data (Figure 1). Another 33 sites have fewer than 10 years of data,
but most are still active, and so should be more useful for statistical analysis in a few
years' time. The earliest water supply forecasts used in this study were issued in 1953 at
22 of the 54 locations.
These 54 forecasting sites were divided into 9 smaller basins (or, in the case of Lake
Powell, a single location), compatible with the divisions used by CBRFC in the tables
and graphs accompanying the WSO forecasts (Table 1). The maximum number of years
in the combined forecast and observation record was 48 (1953–2000), the minimum used
was 21, and the median and average number of years were 46 and 41.5 respectively.
Each deterministic forecast was converted into a forecast probability distribution by
using the “most probable” value as the distribution median, along with the 10% and 90%
exceedance values, to also calculate the 30% and 70% exceedance values. Five forecast
flow categories were calculated for each forecast, based on exceedance probability:
0-10%, >10-30%, >30-70%, >70-90%, and >90%. The probability of the flow falling
within each of these categories is 0.1, 0.2, 0.4, 0.2 and 0.1, respectively. Selection of these
categories was based on their common usage in NRCS communications and reflects
categories considered important to a broad range of water resources decision makers. A
minimal sketch of this conversion is given below.
2.2. Summary and correlation measures
Summary measures are scalar measures of accuracy from forecasts of continuous
variables, and include the mean absolute error (MAE) and mean square error (MSE):
\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| f_i - o_i \right| \tag{1} \]

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left( f_i - o_i \right)^2 \tag{2} \]
where, for a given location, fi is the forecast seasonal runoff for period i and oi the
naturalized observed flow for the same period. Since MSE is computed by squaring the
forecast errors, it is more sensitive to larger errors than is MAE. Both MSE and MAE
increase from zero for perfect forecasts to large positive values as the discrepancies
between the forecast and observations become larger. RMSE is the square root of the
MSE.
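A minimal sketch of these summary measures (Python with numpy; illustrative only, not the operational verification code):

```python
import numpy as np

def mae(f, o):
    """Mean absolute error, Eq. (1)."""
    return np.mean(np.abs(np.asarray(f) - np.asarray(o)))

def mse(f, o):
    """Mean square error, Eq. (2); more sensitive to large errors than MAE."""
    return np.mean((np.asarray(f) - np.asarray(o)) ** 2)

def rmse(f, o):
    """Root mean square error: the square root of the MSE."""
    return np.sqrt(mse(f, o))
```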
Often an accuracy measure is not meaningful by itself, and is compared to a
reference value, usually based on the historical record. In order for a forecast technique
to be worthwhile, it must generate better results than simply using the cumulative
distribution of the climatological record, i.e. assuming that the most likely flow next year
is the average flow in the climatological record. In order to judge this, skill scores are
calculated for the accuracy measures:
\[ SS_A = \frac{A - A_{\mathrm{ref}}}{A_{\mathrm{perf}} - A_{\mathrm{ref}}} \tag{3} \]
Here SSA is a generic skill score for accuracy measure A, Aref is the accuracy of a reference set of values (e.g., the
climatological record), and Aperf is the value of A given by perfect forecasts. If A = Aperf,
SSA will be at its maximum, 1. If A = Aref, then SSA = 0, indicating no improvement over the
reference forecast. If SSA < 0, then the forecasts are not as good as the reference (Wilks,
1995). For MSE:
\[ SS_{\mathrm{MSE}} = \frac{\mathrm{MSE} - \mathrm{MSE}_{\mathrm{cl}}}{\mathrm{MSE}_{\mathrm{perf}} - \mathrm{MSE}_{\mathrm{cl}}} = \frac{\mathrm{MSE} - \mathrm{MSE}_{\mathrm{cl}}}{0 - \mathrm{MSE}_{\mathrm{cl}}} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{cl}}} \tag{4} \]
since for a perfect forecast MSE is 0, and the climatological value is:
\[ \mathrm{MSE}_{\mathrm{cl}} = \frac{1}{n}\sum_{i=1}^{n}\left( \bar{o}_{\mathrm{cl}} - o_i \right)^2 \tag{5} \]

where ōcl is the average observation associated with the reference climatology.
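The corresponding computation might look as follows; note that this sketch takes the climatological reference ōcl as the mean of the verification sample, whereas the operational WSOs use a published climatological average:

```python
import numpy as np

def ss_mse(f, o):
    """MSE skill score, Eqs. (3)-(5): 1 for perfect forecasts, 0 for no
    improvement over climatology, negative for worse than climatology.
    The climatological reference here is the mean of the verification
    sample itself (an assumption; the WSOs use a published average)."""
    f, o = np.asarray(f), np.asarray(o)
    mse_cl = np.mean((o.mean() - o) ** 2)      # Eq. (5)
    return 1.0 - np.mean((f - o) ** 2) / mse_cl
```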
Correlation-based measures are widely used to determine the goodness-of-fit of
hydrologic models. They have many limitations, including a high sensitivity to extreme
values (outliers) and an insensitivity to additive or proportional differences between
models and observations (Legates and McCabe, 1999). Correlation provides a summary
measure of the joint distribution of the forecasts and observations. However, it does not
account for any forecast bias, and when bias is large, the correlation is not likely to be
informative.
The most widely used correlation measure is the coefficient of determination, which
describes the proportion of the variability of the observation that is linearly accounted for
by the forecast:
\[ R^2 = \left[ \frac{\sum_{i=1}^{n}\left( o_i - \bar{o} \right)\left( f_i - \bar{f} \right)}{\left( \sum_{i=1}^{n}\left( o_i - \bar{o} \right)^2 \right)^{0.5} \left( \sum_{i=1}^{n}\left( f_i - \bar{f} \right)^2 \right)^{0.5}} \right]^2 \tag{6} \]
where R² = 1 indicates perfect agreement between the observations and predictions and
R² = 0 no agreement.
Another correlation measure often used to evaluate the performance of hydrologic
models is the coefficient of efficiency (Nash and Sutcliffe, 1970), also called the Nash-Sutcliffe coefficient:
\[ \mathrm{NSC} = 1 - \frac{\sum_{i=1}^{n}\left( o_i - f_i \right)^2}{\sum_{i=1}^{n}\left( o_i - \bar{o} \right)^2} \tag{7} \]
It has a maximum of 1 for a perfect forecast and a minimum of negative infinity.
Physically, NSC is 1 minus the ratio of MSE to the variance of the observed data. If NSC
> 0, the forecast is a better predictor of flow than is the observed mean, but if NSC < 0,
the observed mean is a better predictor and there is a lack of correlation between the
forecast and observed values.
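A direct transcription of Eq. (7), as an illustrative sketch:

```python
import numpy as np

def nsc(f, o):
    """Nash-Sutcliffe coefficient, Eq. (7): 1 minus the ratio of the MSE
    to the variance of the observations."""
    f, o = np.asarray(f), np.asarray(o)
    return 1.0 - np.sum((o - f) ** 2) / np.sum((o - o.mean()) ** 2)
```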
Discussion of correlation is often combined with that of the percent bias, which
measures the difference between the average forecasted and observed values (Wilks,
1995):
\[ P_{\mathrm{bias}} = \frac{\bar{f} - \bar{o}}{\bar{o}} \times 100\% \tag{8} \]
which can assume positive (overforecasting), negative (underforecasting) or zero values.
Shafer and Huddleston (1984) used a similar calculation to examine forecast error
and the distribution of forecast error in the analysis of seasonal streamflow forecasts.
Forecast error for a particular forecast/observation pair was defined as
\[ E = \frac{f - o}{o_{\mathrm{ref}}} \times 100 \tag{9} \]
where oref is the published seasonal average runoff at the time of the forecast (also called
the climatological average or reference value).
They also defined a skew coefficient associated with the distribution of a set of
errors:
\[ G = \frac{n \sum_{i=1}^{n}\left( E_i - \bar{E} \right)^3}{(n-1)(n-2)\,\sigma_E^3} \tag{10} \]
where σE is the standard deviation of the errors.
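Both measures are straightforward to compute; the sketch below follows Eqs. (8) and (10), with the caveat that the sample standard-deviation convention (ddof=1) is our assumption:

```python
import numpy as np

def pbias(f, o):
    """Percent bias, Eq. (8): positive = overforecasting."""
    f, o = np.asarray(f), np.asarray(o)
    return 100.0 * (f.mean() - o.mean()) / o.mean()

def skew_G(E):
    """Skew coefficient of a set of forecast errors E_i, Eq. (10); a large
    negative value indicates a few large under-forecasts."""
    E = np.asarray(E, dtype=float)
    n, s = E.size, np.std(E, ddof=1)   # sample standard deviation (assumed)
    return n * np.sum((E - E.mean()) ** 3) / ((n - 1) * (n - 2) * s ** 3)
```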
2.3. Categorical measures
A categorical forecast states that one and only one set of possible events will occur,
with an implied 100% certainty attached to the forecasted category. Contingency tables
are used to display the possible combinations of forecast and event pairs, and the count of
each pair. An event (e.g. seasonal flow in the upper 30% of the observed distribution) that
is successfully forecast (both forecast and observed) occurs a times. An event that is
forecast but not observed occurs b times, and an event that is observed but not forecast
occurs c times. An event that is not forecast and not observed for the same period occurs
d times. The total number of forecasts in the data set is n=a+b+c+d. A perfectly
accurate binary (2×2) categorical forecast will have b=c=0 and a+d=n. However, few
forecasts are perfect. Several measures can be used to examine the accuracy of the
forecast, including hit rate, threat score, probability of detection and false alarm rate
(Wilks, 1995).
The hit rate is the proportion correct:
\[ \mathrm{HR} = \frac{a + d}{n} \tag{11} \]
and ranges from one (perfect) to zero (worst).
The threat score, also known as the critical success index, is the proportion of
correctly forecast events out of the total number of times the event was either forecast or
observed, and does not take into account the accurate non-occurrence of events:
\[ \mathrm{TS} = \frac{a}{a + b + c} \tag{12} \]
It also ranges from one (perfect) to zero (worst).
The probability of detection is the fraction of times when the event was correctly
forecast relative to the number of times it actually occurred, or the probability of the
forecast given the observation:
\[ \mathrm{POD} = \frac{a}{a + c} \tag{13} \]
A perfect POD is 1 and the worst 0.
A related statistic is the false alarm rate, FAR, which is the fraction of forecasted
events that do not happen. In terms of conditional probability, it is the probability of not
observing an event given the forecast:
\[ \mathrm{FAR} = \frac{b}{a + b} \tag{14} \]
Unlike the other categorical measures described, the FAR has a negative orientation, with
the best possible FAR being 0 and the worst being 1.
The bias of the categorical forecasts compares the average forecast with the average
observation, and is represented by the ratio of “yes” forecasts to “yes” observations:
\[ \mathrm{bias} = \frac{a + b}{a + c} \tag{15} \]
An unbiased forecast has a value of 1, showing that the event occurred the same number of
times that it was forecast. If the bias is greater than 1, the event is overforecast (forecast
more often than observed); if the bias is less than one, the event is underforecast. Since
the bias does not actually show anything about whether the forecasts matched the
observations, it is not an accuracy measure.
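The following sketch collects these contingency-table measures; it assumes each event category was both forecast and observed at least once, so the denominators are nonzero:

```python
import numpy as np

def categorical_scores(forecast_yes, observed_yes):
    """Contingency-table measures, Eqs. (11)-(15), for one event category.
    Inputs are boolean arrays: whether each year's forecast (observation)
    fell in the category."""
    fy = np.asarray(forecast_yes, dtype=bool)
    oy = np.asarray(observed_yes, dtype=bool)
    a = int(np.sum(fy & oy))    # forecast and observed
    b = int(np.sum(fy & ~oy))   # forecast but not observed (false alarms)
    c = int(np.sum(~fy & oy))   # observed but not forecast (misses)
    d = int(np.sum(~fy & ~oy))  # neither forecast nor observed
    n = a + b + c + d
    return {"HR": (a + d) / n,          # Eq. (11)
            "TS": a / (a + b + c),      # Eq. (12)
            "POD": a / (a + c),         # Eq. (13)
            "FAR": b / (a + b),         # Eq. (14)
            "bias": (a + b) / (a + c)}  # Eq. (15)
```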
2.4. Probabilistic measures
Whereas categorical forecasts contain no expression of uncertainty, probabilistic
forecasts do. Linear error in probability space assesses forecast errors with respect to
their difference in probability, rather than their overall magnitude:
\[ \mathrm{LEPS}_i = \left| F_c(f_i) - F_c(o_i) \right| \tag{16} \]
Fc(o) refers to the climatological cumulative distribution function of the observations,
and Fc(f) to the corresponding distribution for the forecasts. The best possible LEPS
value is 0, for identical distributions, and the worst is 1 for completely divergent
distributions. LEPS reflects that correct forecasting of extreme events should warrant
more credit, compared to more common moderate events. The corresponding skill score
is:
\[ SS_{\mathrm{LEPS}} = 1 - \frac{\sum_{i=1}^{n}\left| F_c(f_i) - F_c(o_i) \right|}{\sum_{i=1}^{n}\left| 0.5 - F_c(o_i) \right|} \tag{17} \]
using the climatological median as reference forecast.
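A sketch of the LEPS skill score using an empirical climatological CDF (one simple choice; the operational distribution fitting may differ):

```python
import numpy as np

def empirical_cdf(record):
    """Empirical climatological CDF F_c built from a record of flows."""
    srt = np.sort(np.asarray(record, dtype=float))
    return lambda x: np.searchsorted(srt, x, side="right") / srt.size

def ss_leps(f, o, climatology):
    """LEPS skill score, Eq. (17), with the climatological median
    (F_c = 0.5) as the reference forecast."""
    Fc = empirical_cdf(climatology)
    f, o = np.asarray(f), np.asarray(o)
    return 1.0 - np.sum(np.abs(Fc(f) - Fc(o))) / np.sum(np.abs(0.5 - Fc(o)))
```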
The Brier score is analogous to the MSE:
\[ \mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left( f_i - o_i \right)^2 \tag{18} \]
However, it compares the probability associated with a forecast event with whether or not
that event occurred instead of comparing the actual forecast and observation. Therefore fi
ranges from 0 to 1, oi=1 if the event occurred or oi =0 if the event did not occur, and
BS=0 for perfect forecasts. The corresponding skill score is:
\[ SS_{\mathrm{BS}} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}} \tag{19} \]
where the reference forecast is generally the climatological relative frequency.
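In code, with the climatological relative frequency of the event as the reference forecast (a minimal sketch):

```python
import numpy as np

def brier_score(p, occurred):
    """Brier score, Eq. (18): MSE of forecast probabilities against the
    0/1 outcomes of the event. 0 is perfect."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(occurred, dtype=float)   # 1 if the event occurred
    return np.mean((p - o) ** 2)

def ss_bs(p, occurred):
    """Brier skill score, Eq. (19); the reference forecast is the
    climatological relative frequency of the event."""
    o = np.asarray(occurred, dtype=float)
    bs_ref = brier_score(np.full(o.size, o.mean()), o)
    return 1.0 - brier_score(p, occurred) / bs_ref
```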
The ranked probability score (RPS) is essentially an extension of the Brier score to
multi-event situations. Instead of just looking at the probability associated with one event
or condition, it looks simultaneously at the cumulative probability of multiple events
occurring. RPS uses the forecast cumulative probability:
\[ F_m = \sum_{j=1}^{m} f_j, \qquad m = 1, \ldots, J \tag{20} \]
where fj is the forecast probability at each of the J non-exceedance categories. In this
paper, fj = {0.1 0.2 0.4 0.2 0.1} for the five non-exceedance intervals {0-10%, >10-30%,
>30-70%, >70-90%, and >90%}, so Fm = {0.1, 0.3, 0.7, 0.9, 1.0} and J=5. The observation
occurs in only one of the flow categories, which will be given a value of 1; all the others
are given a value of zero:
\[ O_m = \sum_{j=1}^{m} o_j, \qquad m = 1, \ldots, J \tag{21} \]
The RPS for a single forecast/observation pair is calculated from:
\[ \mathrm{RPS}_i = \sum_{m=1}^{J}\left( F_m - O_m \right)^2 \tag{22} \]
and the average RPS over a number of forecasts is calculated from:
\[ \overline{\mathrm{RPS}} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{RPS}_i \tag{23} \]
A perfect forecast will assign all the probability to the same percentile in which the event
occurs, which will result in RPS=0. The RPS has a lower bound of 0 and an upper bound
of J-1; lower (better) values result when the observation falls in or near the category
assigned the highest probability. The RPS skill score is defined as:
\[ SS_{\mathrm{RPS}} = 1 - \frac{\overline{\mathrm{RPS}}}{\overline{\mathrm{RPS}}_{\mathrm{ref}}} \tag{24} \]
where RPSref is the average RPS obtained using the climatological relative frequencies as the forecast.
The Brier score focuses on how well the forecasts perform in a single flow category;
RPS is a measure of overall forecast quality.
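Because every WSO assigns the same fixed probabilities to the five categories, the RPS for a forecast/observation pair depends only on which category the observation fell in, as the sketch below illustrates:

```python
import numpy as np

FORECAST_PROBS = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

def rps_single(obs_category, f_probs=FORECAST_PROBS):
    """RPS for one forecast/observation pair, Eqs. (20)-(22). obs_category
    is the index (0..J-1) of the flow category the observation fell in."""
    F = np.cumsum(f_probs)      # forecast cumulative probability, Eq. (20)
    O = np.zeros_like(F)
    O[obs_category:] = 1.0      # observed cumulative indicator, Eq. (21)
    return float(np.sum((F - O) ** 2))

# Example: an observation in the middle category (index 2) gives
# RPS = 0.1**2 + 0.3**2 + 0.3**2 + 0.1**2 + 0 = 0.20
```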
Note that statistics calculated from a small number of forecasts are more
susceptible to being dominated by sampling variations, which makes assessing forecast
quality difficult (Wilks, 1995). In addition, with smaller sample sizes, it is more likely
that some intervals have no data because there are not enough forecasts to represent all
combinations of forecast probability and flow categories.
2.5. Distributive measures
We used two distributive measures, reliability and discrimination, to assess the
forecasts in various categories (i.e. low, medium, high). The same five forecast
probabilities used for RPS were used to represent the probability given to each of the
three flow categories. Our application of these measures follows that outlined by Franz
et al. (2003).
Reliability uses the conditional distribution (p(o|f)) and describes how often an
observation occurred given a particular forecast. Ideally, p(o = 1 | f) = f (Murphy and
Winkler, 1987). That is, for a set of forecasts where a forecast probability value f was
given to a particular observation o, the forecasts are considered perfectly reliable if the
relative frequency of the observation equals the forecast probability (Murphy and
Winkler, 1992). For example, given all the times in which high flows were forecasted
with a 50% probability, the forecasts would be considered perfectly reliable if the actual
flows turned out to be high in 50% of the cases.
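A minimal sketch of this conditioning (here forecast probabilities are grouped by their exact values, which suits the fixed category probabilities used in this study):

```python
import numpy as np

def reliability_curve(forecast_probs, occurred):
    """Estimate p(o = 1 | f): for each distinct forecast probability value,
    the relative frequency with which the event was subsequently observed.
    Perfectly reliable forecasts satisfy p(o = 1 | f) = f."""
    f = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(occurred, dtype=float)    # 1 if the event occurred
    return {float(p): float(o[f == p].mean()) for p in np.unique(f)}
```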
On a reliability diagram (Figure 2) the conditional distribution (p(o|f)) of a set of
perfectly reliable forecasts will fall along the 1:1 line. Forecasts that fall to the left of the
line are underforecasting or not assigning enough probability to the subsequent
observation. Those that fall to the right of the line are overforecasting. Conditional
distributions of forecasts lacking resolution, meaning they are unable to identify
occasions when the event is more or less likely than the overall climatology, plot along
the horizontal line associated with their climatology value.
The discrimination diagram displays the conditional probability distributions
(p(f|o)) of each possible flow category as a function of forecast probability (Figure 3). If
the forecasts are discriminatory, then the probability distribution functions of the
forecasted flow categories will have minimal overlap on the discrimination diagram
(Murphy et al., 1989). Ideally, a forecast issued prior to an observation of a low flow
should state that there is a 100% chance of having a low flow and a 0% chance of having high
or middle flows. A set of forecasts that consistently provide such strong and accurate
statements is perfectly discriminatory and will produce a discrimination diagram like
Figure 3a. Figure 3b illustrates a case where the sample of forecasts is unable to
consistently assign the largest probability to the occurrence of low flows. Users of
forecasts from such a system could have no confidence in the predictions.
A discrimination diagram is produced for occurrences of observations in each flow
category; therefore, forecasts that were issued prior to observations that occurred in the
lowest 30% (low flows), middle 40% (mid-flows), and highest 30% (high flows) are
plotted on separate discrimination diagrams. The number of forecasts represented on
each plot depends upon the number of historical observations in the respective flow
category.
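A sketch of the corresponding computation for a single flow category, using the same probability intervals as the reliability analysis (the bin edges are our choice):

```python
import numpy as np

def discrimination(forecast_probs, occurred,
                   bins=(0.0, 0.1, 0.3, 0.7, 0.9, 1.001)):
    """Estimate p(f | o): the distribution of forecast probabilities given
    that the event did (o = 1) or did not (o = 0) occur. Minimal overlap
    between the two distributions indicates good discrimination."""
    f = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(occurred, dtype=bool)
    rel_freq = lambda x: np.histogram(x, bins=bins)[0] / max(x.size, 1)
    return {"event observed": rel_freq(f[o]),
            "event not observed": rel_freq(f[~o])}
```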
3. Results
3.1. Scalar measures
The New Fork R. near Big Piney (Upper Green R., 3,184 km²) and Colorado R. near
Dotsero (11,376 km²) together capture many of the patterns seen in the different sites,
and are used to illustrate the different types of results. The Pbias values on Figure 4
show that 1997, an above average flow year, represents an almost perfect forecast year
for the New Fork at Big Piney, with forecast bias very close to 0. It is an excellent
example of consistency in forecasting, with the concentric circles showing that the
January forecast was the same as the July forecast. In many of the other years, such as
1992, a below average flow year, there is significant forecast drift, with the January
forecast farthest from 0, and values getting progressively better with each month.
Comparing Figures 4a (4e) and 4b (4f) shows that years of above average flow (e.g.,
1983, 1986, 1995) are often associated with forecasts being too low; conversely, in years
of below average flow (e.g. 1988 and 1992), forecasts were too high.
This pattern of over- versus under-forecasting is seen more clearly by plotting fi/oi
versus oi/ō (Figure 5) for the two sites on Figure 4 plus the San Juan R. near Bluff
(59,544 km²). Ideally, all points should fall on the horizontal line fi/oi = 1, which would
indicate that no matter how high or low (above or below average) the observed flow, the
forecast values equal the observed value. In general, forecasts issued in May improve
over those issued earlier in the year (9205000 and 9070500).
Note that different years were used to produce the climatology against which
forecasts were compared (e.g. Figure 4c and 4g). For example, for 1975-1980, data for
1958-1972 were used, while for 1993-2000, data from the 1961-1990 period were used.
This trend is repeated for all the forecast locations. Every five or ten years, the definition
of average observed flow changes, and different sites may use data from different time
periods, although in 1991-2000 the majority of the forecasts were based on the 1961-1990
climatology. Starting in 2001, forecasts were based on the 1971-2000 climatology.
Another problem in comparing forecasts from one year to another is that the forecast
period, or months during which the forecasted flow is supposed to occur, changes,
sometimes from month to month, other times from year to year (e.g. Figures 4d and 4h).
For example, for 1975-1979, the forecast period for January was January-September and
for May it was May-September. For 1980-1990, the forecast period was April through
September for every month of issue, and from 1991 to present, it was from April to July.
For these locations no one forecasting period has a visibly better correlation than another,
nor do forecasts show any marked improvement over the period of record.
Like Pbias, R² values across all the sites are lowest in January (all sites < 0.5) and
become progressively higher through May (0.4–0.9, with the highest around 0.8, although
there is little difference in February through April values) (Figure 6). Even in April and
May there are still many poorly correlated sites.
The distributions for MAE and RMSE are similar (Figure 6), with the slightly lower
values for RMSE capturing the generally higher bias values for the higher flow years.
Although SSMSE is sensitive to high forecast errors, e.g. in extreme flow years, it has a
broader distribution because of the poor representation of the annual flow by the
climatological mean at most sites. RMSE is a poorer measure of skill than the other
summary and correlation measures, as it is as much related to flow volume as anything
else. Of the 10 sites with the lowest RMSE, 5 are tributaries in the Lower Green Basin
and the other five are smaller creeks/rivers as well. Of the 10 with the highest (worst)
RMSE, 4 are on the Colorado R. and 2 are on the San Juan R. Others are Green R.,
Gunnison R., Yampa R. and Salt R.
A similar pattern is seen for NSC as for the other measures, although there is little
difference in February through April values (Figure 6). No one region, with
the possible exception of the Virgin R., has significantly better forecasts than do the other
regions (Figure 7). Multiple basins have near-zero NSCs. Two sites have negative
values during all five months, indicating that those forecasts are not an improvement over
the climatology: the Strawberry R. near Duchesne in the Lower Green basin, and the
Florida R. inflow to Lemon Reservoir in the San Juan R. basin. One additional site has
some negative values (9050700). Of the five sites with the highest average NSC values
for all five months, three are in the Upper Green: Green R. at Warren Bridge, Pine Creek
above Fremont Lake, and Fontenelle Reservoir inflow. The other two are the Virgin R.
near Virgin, which had good correlations in March-May, despite very low January
values, and the Gunnison R. inflow to Blue Mesa Reservoir.
Overall the April forecasts display a tendency toward a negative skew of forecast
errors (Figure 8), with this being most pronounced in the Gila R. Basin, although most of
the other basins had some sites with negative skew of forecast errors, some sites with no
skew, and no sites with positive skew. A large negative skew means that the overall
tendency of the forecasts is to under predict rather than over predict, although this is often
influenced to some extent by a few negative values (Shafer and Huddleston, 1984).
3.2. Categorical measures
Hit rate, threat score, false alarm rate and probability of detection (Figure 9-12) for
each month and flow category need to be considered together. Eighty to ninety percent
of sites have high hit rates for the lower and upper 30% flow categories
(Figure 9), meaning that these flows actually occur a majority of the time that the forecast
is for high or low flows. Similarly FAR (0 is perfect) is best for the low and high flows
(Figure 11).
However, the POD shows that the majority of high and low flows that occur are not
being accurately forecast (Figure 12). In January-April, under 5% of flows in the upper
or lower 30% are correctly forecast, i.e. the POD was near 0. There were very few
forecast locations with POD above 0.5 for the high and low flows. POD for the mid 40%
was high, because most forecasts predict that conditions will fall in this category. For the
same reason, HR was low and FAR high in the middle category. Note that TS (Figure 10)
combines some features of HR and POD – while it is similar to HR for the mid 40%, it is
low for the upper and lower 30%. The bias (not shown) was near 0-0.25 (very low) for
low and high flows, again showing that they are underpredicted, and between 2-4 (very
high) for moderate flows, showing that they are overpredicted.
3.3. Probabilistic measures
At the New Fork R. near Big Piney, the LEPS is clearly better than LEPSref (Figure
13a) with the LEPS skill score increasing from January through May (Figure 13d). This
same pattern was consistent across the basin (Figure 14a).
The Brier scores of the forecast for the New Fork R. near Big Piney were all better
than those of the reference set as well (Figure 13b), with skill increasing slightly through
the forecast period (Figure 13e). The same is true across the basin (Figure 14b). The
drop in the May SSBS (Figure 13e) is due to the shift in both the reference and observed
values. Elsewhere in the basin, the February skill scores in the lower Green R. and the
San Juan R. basins are lower than those in January; otherwise patterns generally show
consistent increases. Five of the sites in the Gila R. basin have negative SSBS values in
March, making the basin average negative. The Virgin R. basin had the highest average
SSBS.
At the New Fork R. near Big Piney, forecast RPS values are better than RPSref only
for the earliest forecasts (January), with the poorest performance occurring in March and
April. Across the basin, twenty-two of the 54 sites had a negative average SSRPS for
January-May. Thirteen had negative SSRPS values for each month that a forecast was
issued.
Seven of these were the Gila R. basin locations; two were in the San Juan R.
basin (San Juan R. near Bluff and the San Juan R. inflow to Navajo Reservoir), one along
the main stem of the upper Colorado (Colorado R. near Cisco), and one each in the upper
Green, lower Green, and the Yampa and White R. basins (Henrys Fork near Manila,
Duchesne R. at Myton, and Little Snake R. near Dixon, respectively). However, four of
the remaining San Juan R. basin sites had SSRPS values in the top ten (averaging 30-40)
and four of the Yampa and White R. basin sites were among the top fourteen.
3.4. Distributive measures
Table 2 shows the sum of the resolution in the <0.1 and >0.9 categories for each of
the basins and the study area as a whole. In a forecast system with perfect resolution, this
should be equal to 1. For the entire Colorado basin, the basin average of this sum
increases from 0.5 in January to 0.8 in May for low and high flows, while for moderate
flows, this sum is lower, usually averaging less than 0.5, with values less than or equal to
0.3 in January and March at many of the basins. Low and high flows have the poorest
resolution in the Virgin R. basin. The best average resolution for high and low flows
occurs in the lower Green basin.
Table 3 shows this sum for the top 10 and bottom 10 sites in the high and the low
flow categories. Six of the sites with the best resolution for high and low flows are in the
lower Green R. Basin. The Gila R. at Calva (9466500) has the sixth best resolution of
low flows and the second worst resolution of high flows. The poorest resolution of low
flows occurred mostly at sites in the main stem of the upper Colorado R. and in the upper
Green R. basins. The Eagle R. below Gypsum (9070000) and the Virgin R. Near Virgin
UT (9406000) show poor resolution in all flow categories.
The reliability diagram (Figure 15) illustrates resolution as well, using a tributary
near the New Fork at Big Piney, the Green R. at Warren Bridge, as an example. Later
months have better resolution than earlier months, which have a larger fraction of flows
being forecast with only 30-70% likelihood, especially moderate flows. For the forecast
low flows, forecasts of non-occurrence (<10% probability) are much more frequent than
forecasts of occurrence (>90% probability). The diagram shows that as forecasts have
increasing resolution, the forecast probability becomes more narrowly distributed and
more frequently assigned to the extreme intervals (i.e., 0-10% and >90-100%); this is
shown in the reliability diagram as sample sizes for the middle probability intervals
become smaller with the sharper forecasts (e.g., January versus April reliability diagrams
for the highest flows).
Reliability for this site shows similar patterns for all five months. Low flows are
underconfident at low probability, have no reliability at moderate probability, and are
overconfident at high probability. High flows are overforecast at low probabilities and
overforecast at 30-70% and 70-90% likelihood, but overall seem to have better reliability
than the low flows.
Discrimination at this site, however, is better for low flows than the high flows
(Figure 16). High flows are rarely observed when low flows are predicted. In March-May,
when low flows were observed, 80-90% of the forecasts predicted less than 10%
probability of high flow, and low flows were accurately predicted 50% of the time in April
and 80% of the time in May. When moderate flows are observed, all flow categories are
given about equal chance of occurring, and no flow is given a high probability of
occurring, even late in the year. In the high flow category at this site, even in May, the
high flows are only predicted to occur about 50% of the time that they are observed, and
are forecast not to occur about 30% of the time that they are observed. However, low
flows are almost never observed when high flows are predicted. When high flows are
observed, forecast discrimination of moderate flow is accurate in Mar-May as well.
4. Discussion
4.1. General observations
Shafer and Huddleston (1984) compared forecasts for two 15-year periods, 1951–
65 and 1966–80, and concluded that a slight relative improvement (about 10%) in
forecast ability occurred about the time computers became widely used in developing
forecasts. They attributed the gradual improvement in forecast skill to a combination of
greater data processing capacity and the inclusion of data from additional hydrologically
important sites. They suggested that “modest improvement might be expected with the
addition of satellite derived snow covered area or mean areal water equivalent data”,
which were not readily available for most operational applications at the time. Although
satellite data of snow-covered area are now available, those data are not being routinely
used in WSO’s. We found no significant differences in our various measures across
different parts of the period of record.
We did not do a direct comparison with the Shafer and Huddleston (1984) results,
as their sites were grouped by state, rather than basin, boundaries. According to their
study, Arizona had the highest error (more than 55% for April 1 streamflow forecasts),
but also the highest skill. Of the Colorado basin states, Wyoming (only part of which is
in the basin) had the lowest forecast error (~20%) paired with the highest skill.
In applying Shafer and Huddleston’s (1984) measures of forecast skill to our data,
we found similar trends. The Gunnison/Dolores, Upper Green, and Lake Powell sites
consistently had absolute values of percent forecast errors less than 10%, the Gila had
percent errors ranging from 24 to 52%, and the five other watersheds mostly had errors
between 8 and 20%. The largest improvement in forecast error occurred between January
and April in the Virgin R. Basin (May was not as good, but still under 10%) and between
January and March in the Gila (but April was extremely poor), although the March error
in the Gila is still higher than at any other site. Skill coefficients generally improved
from January to May (except for April in the Gila), from a Colorado basin-wide average
of 1.31 to an average of 2.05. Despite the problems seen with some of the other forecast
skill methods for Virgin R. data, the Virgin R. combined low forecast errors with high
skill coefficients in April and May.
4.2. Regional differences
On the Main Stem Upper Colorado, the sites at Eagle R. below Gypsum, Colorado
R. near Dotsero and the Colorado R. near Cameo exhibit similar patterns of
discrimination, resolution and reliability. The non-occurring event is predicted close to 0
probability for both the high and low flows, and the low flow is given a good probability
(> 50%) of occurring during the periods when low flows were observed for the
predictions made in March, April and May. Discrimination of high flows is not as good,
but still an improvement over the climatological means in the later part of the forecast
season. At the other six sites, the discrimination of non-occurrence is still good, but the
occurrence of high and low flows is not predicted as well, or as early, or both. For
example, at the Blue R. inflow to Dillon Reservoir and Williams Fork near Parshall, even
in May the low flows are given over a 60% probability of not occurring during the times
they were observed. Most of the forecasts do not exhibit much reliability, or even show
much improvement in reliability over the forecast season. Hit rates are generally better
for the lowest 30% of flows (0.6 - 0.95) than the upper 30% (0.2 -0.8).
The five sites in the Gunnison / Dolores show a higher overall hit rate for high flows
than the main stem of the Upper Colorado. Best reliability occurs for the lowest 30% of
flows at the Gunnison R. inflow to Blue Mesa Reservoir and the East R. at Almont.
Discrimination of non-occurrence of extreme events is very good, but the events that do
occur are not being forecast. Four of the five sites do a good job of predicting high flows
in the April and May forecasts (50-70%). The occurrences of low flows are seldom
accurately forecast.
At four of the five sites in the Upper Green (all except Henrys Fork near Manila), the
forecasts are usually within a factor of 2 (50% to 200%) of the observed value, with
Green R. at Warren Bridge and Pine Creek above Fremont Lake having forecast values
closest to that of the naturalized streamflow. During two of the low-flow years (1979 and
1989), forecasts at Henrys Fork near Manila were as much as 7 times the naturalized
stream flow. However, the flows at this site were generally less than 2.83 m³ s⁻¹ (100 cfs),
lower than at any of the other sites in this basin, and even small differences in the forecast
can lead to large apparent discrepancies. Despite these low-flow problems, high-flow
forecasts at this site were extremely reliable, the best of any site in this basin. Hit rates for
high flows were overall better than for low flow, and the probability of detection was
zero for most months and sites. Pine Creek above Fremont Lake has the best
discrimination of the occurrence of high flows (40-50% Mar-May) and low flows (80-100% Mar-May) of these five sites.
1982-1984 was an extended period of above average flows in the Yampa/White R.
basin, characterized by low forecast/observed values at the six sites, while the years of
lowest flows (1966, 1976-7, 1989-90, 1994) had the highest forecast/observed values,
consistent with the pattern seen elsewhere. Extreme flows are the least well forecast. All
the sites except the Little Snake R. near Dixon (which has the shortest record) have very
good reliability for predicted low flows during all the forecast months. High flows are
less reliable, but still better than climatology, for the most part. Discrimination of high
and low flows is similar to that observed in other basins, with the occurrence of
low flows being forecast with strong certainty about 50% of the time in April and May.
High flows are forecast with much less certainty.
The Lower Green R. basin has 11 sites, the largest number of any of the basins. The
poorest forecasts occurred at Strawberry R. near Duchesne and Duchesne R. at Myton.
Both had at least four months with negative LEPS and RPS skill scores, indicating that
the forecasts were not an improvement over the climatology. For the Strawberry R. near
Duchesne, there was a high hit rate for low flows (0.8–0.9), a poor hit rate for high flows
(0.3-0.5), and a low POD for either. There was effectively no reliability and no
discrimination for any of the flow classes. For example, high flows were only given a
50% probability of occurring about 20% of the time they were observed, and a 0%
probability of occurring the other 80% of the time.
Other sites had better than climatology, although still imperfect, forecasts. Rock
Creek near Mountain Home and Duchesne R. above Knight Diversion had excellent April-May low flow discrimination. For Green R. at Green R., low flows generally were
given a 40-50% chance of not occurring when they were observed. High flows always
were given some possibility of occurring, although sometimes only 10-50%, when high
flows were observed. Huntington Creek near Huntington had low but non-zero POD for
low flows, but some false alarms.
The forecasted low flows in the San Juan R. basin have a high hit rate (generally 0.7-0.9), while the forecasted high flows generally have a hit rate of only 0.4-0.6. However,
the POD of high and low flows is still poor at all the sites. Discrimination of the non-occurrence of high flows during low-flow periods is excellent as early as February at six
of the seven sites (every site except the Florida R. inflow to Lemon Reservoir) although
the non-occurrence of low flows during high-flow periods is not as good (it is best at
Piedra R. near Arboles, Animas R. near Durango, and the Florida R. inflow to Lemon
Reservoir).
One disadvantage of the two Virgin R. sites is that both have fairly significant gaps
in their records. However, this is an important watershed in the Southwest and these sites should
not be excluded from the study. Extreme flows are not predicted well at either site,
particularly early in the forecast season. Neither site shows any discrimination of the
occurrence or non-occurrence of low flows, but the high flows have some discrimination
for Mar-May. Low flows tend to be severely overestimated by the tendency to forecast
moderate flows.
Low-flow bias is very close to 1 for many sites in the Gila R. basin (Salt R. near
Roosevelt, San Francisco R. at Clifton, San Francisco R. near Glenwood, Tonto Creek
above Gun Creek near Roosevelt, Gila R. below Blue Creek near Virden, Verde R. below
Tangle Creek); HR, TS, and POD also tend to be higher than 0.5, indicating that low
flows in the basin are often predicted accurately. However, the TS and POD are near 0
for high flows in all months, suggesting a consistent inability to accurately predict high-flow seasons in this basin, which would contribute to the large negative values observed
in the Shafer and Huddleston (1984) skew coefficient. Generally negative skill scores for
the probabilistic measures (SSBS and SSRPS) also indicate that flows in this basin are
not well forecast. The Gila R. sites do tend to show good discrimination of the non-occurrence of high events during times of low flow. The Salt R. near Roosevelt and Gila
near Gila show good reliability of forecasts of low flows February through April. The
reliability of forecasts of moderate and high flows is still poor. The other sites (San
Francisco R., Tonto R., Verde R.) show no pattern of reliability at all.
4.3. Comparison of different measures
While intuitively appealing, simple summary and correlation measures provide only
a broad indication of forecast skill. Pbias is perhaps the most intuitive of the scalar
measures, and is simple to communicate for an individual forecast. Pbias has also been
used to consider averaged values rather than correspondence between individual forecasts
and their associated observations. In that case it is not strictly a measure of forecast
accuracy. SSMSE and NSC are also intuitively appealing, in that they directly indicate the
improvement of forecasts over climatology. However, they have limited diagnostic value
for high versus low flows and forecasts.
The categorical measures are an improvement, providing a more complete
assessment of forecast skill, by giving information about high and low forecasts.
However, they neglect the uncertainty information inherent in the exceedance quantiles
that accompany the deterministic forecast value, attributing an implied 100% probability
to the “most probable” forecast value, even though that value (as the distribution median)
is equally likely to be too high or too low.
Probabilistic measures are an improvement over categorical measures because they
reflect the inherently probabilistic nature of forecasts, by considering the probability
specified for each category of interest. While they may be less intuitive, they are
analogous to standard error estimates, but in a probability space rather than the
measurement space (i.e., flow volumes).
Distributive measures, e.g. discrimination and reliability, provide the most
comprehensive forecast evaluations and allow performance in all streamflow categories
to be examined individually, considering both forecast probabilities and observed
frequencies of occurrence. However, their sensitivity to small sample sizes is an
acknowledged limitation.
5. Conclusions
Despite their shortcomings, by most measures and at most forecast sites the federally
issued seasonal water supply outlooks are an improvement over climatology. The
majority of forecast points have very conservative predictions of seasonal flow. Below-average flows are often overpredicted (forecast values are too high) and above-average
flows are underpredicted (forecast values are too low). This problem is most severe for
early forecasts (e.g., January) at many locations, and improves somewhat with later
forecasts (e.g., May).
For the low and high flows there is generally a low false alarm rate, which means
that when low and high flows are forecast, these forecasts are generally accurate.
However, for low and high flows there is also a low probability of detection at most sites,
which indicates that these flows are not forecast nearly as often as they are observed.
Moderate flows have a very high probability of detection, but also a very high false alarm
rate, indicating that the likelihood of moderate flows is overforecast.
There is also good discrimination between high and low flows, particularly with
forecasts issued later in the year. This means that when high flows are forecast, low
flows are not observed, and vice versa. However, the probability that high and low flows
will be accurately predicted, particularly early in the year, is not as good. The accuracy of
forecasts tends to improve with each month, so that forecasts issued in May tend to be
much more reliable than those issued in January. Not all streams or areas show the
same patterns and trends, but there is a lot of similarity in the relationship between
forecasts and observations, particularly in the Upper Colorado. The changes in
forecasting periods (most recently to April-July in the Upper Basin and the month of
issue through May in the Lower Basin) did not affect the accuracy of the forecasts.
More use of the categorical, probabilistic, and distributive measures is encouraged.
Although the WSOs have historically been deterministically derived, their calibration
error statistics and the use of probabilistic and distributive measures provide an avenue
for comparison with experimental ensemble-based probabilistic forecasts. Evaluations
would be improved by further development of historical forecast
and observation data, documentation of model details (e.g., changes in regression
equations), and more realistic accounting of flow diversions and storage in estimation of
naturalized flows.
Acknowledgements
Support for this research was provided by NOAA’s Office of Global Programs
through the Climate Assessment for the Southwest Project, at the University of Arizona.
Additional support was provided by the National Science Foundation through the Center
for the Sustainability of semi-Arid Hydrology and Riparian Areas, also centered at the
University of Arizona.
References
Bales, R. C., D. M. Liverman and B. J. Morehouse, 2004: Integrated Assessment as a
Step Toward Reducing Climate Vulnerability in the Southwestern United States,
Bulletin of the American Meteorological Society, 85, 1727
CBRFC, 1991: Lower Colorado water supply 1991 review, CBRFC/NWS, Salt Lake City.
CBRFC, 1992: Water supply outlook for the Lower Colorado, March 1, 1992. CBRFC/NWS,
Salt Lake City.
CBRFC, undated: Guide to Water Supply Forecasting. CBRFC/NWS, Salt Lake City.
Day, G.N., 1985: Extended streamflow forecasting using NWSRFS. Journal of Water
Resources Planning and Management, 111(2), 157-170.
Franz, K. J., H. C. Hartmann, S. Sorooshian and R. Bales, 2003: Verification of National
Weather Service Ensemble Streamflow Predictions for Water Supply Forecasting in
the Colorado R. Basin. Journal of Hydrometeorology, 4(6), 1105-1118.
Hartmann, H.C., R. Bales, and S. Sorooshian, 2002: Weather, climate, and hydrologic forecasting for
the U.S. Southwest: a survey. Climate Research, 21, 239-258.
Legates, D. R. and G. J. McCabe, Jr., 1999: Evaluating the use of “goodness-of-fit” measures in
hydrologic and hydroclimatic model validation. Water Resources Research, 35, 233-241.
Murphy, A.H., and Winkler, R.L., 1992: Diagnostic verification of probability forecasts.
International Journal of Forecasting, 7, 435-455.
Murphy, A.H., Brown, B.G., and Chen, Y., 1989: Diagnostic verification of temperature
forecasts. Weather and Forecasting, 4, 485-501.
Murphy, A.H. and Winkler, R.L., 1987: A general framework for forecast verification.
Monthly Weather Review, 115, 1330-1338.
Nash, J. E. and J. V. Sutcliffe, 1970: River flow forecasting through conceptual models part I
— A discussion of principles. Journal of Hydrology, 10 (3), 282–290.
Pagano, T., D. Garen, and S. Sorooshian, 2004: Evaluation of official western U.S. seasonal
water supply outlooks, 1922-2002. Journal of Hydrometeorology, 5, 896-909.
Shafer, B.A. and Huddleston, J.M., 1984: Analysis of seasonal volume streamflow
forecast errors in the western United States. Proceedings, A Critical Assessment of
Forecasting in Water Quality Goals in Western Water Resources Management,
Bethesda, MD, American Water Resources Association, 117-126.
Soil Conservation Service and NWS, 1994: Water supply outlook for the western United States.
West National Technical Center, Soil Conservation Service, Portland.
Wilks, D.S., 1995: Forecast verification. Statistical Methods in the Atmospheric
Sciences, Academic Press, 467 p.
Table 1. The 54 sites used in this study.

USGS no. | Name | Elev., m | Area, km² | Years

MAIN STEM UPPER COLORADO
9019000 | Colorado R. inflow to L. Granby, CO | 2,415 | 808 | 1953-00
9037500 | Williams Fork nr. Parshall, CO | 2,343 | 476 | 1956-96
9050700 | Blue R. inflow to Dillon Res., CO | 2,628 | 867 | 1972-00
9057500 | Blue R. inflow to Green Mountain Res., CO | 2,305 | 1,551 | 1953-00
9070000 | Eagle R. bel. Gypsum, CO | 1,883 | 2,444 | 1974-00
9070500 | Colorado R. nr. Dotsero, CO | 1,839 | 11,376 | 1972-00
9085000 | Roaring Fork at Glenwood Springs, CO | 1,716 | 3,756 | 1953-00
9095500 | Colorado R. nr. Cameo, CO | 1,444 | 20,840 | 1956-00
9180500 | Colorado R. nr. Cisco, UT | 1,227 | 62,392 | 1956-00

GUNNISON / DOLORES
9112500 | East R. at Almont, CO | 2,402 | 748 | 1956-00
9124800 | Gunnison R. inflow to Blue Mesa Res., CO | 2,145 | 8,939 | 1971-00
9147500 | Uncompahgre R. at Colona, CO | 1,896 | 1,160 | 1953-00
9152500 | Gunnison R. nr. Grand Junction, CO | 1,388 | 20,525 | 1953-00
9166500 | Dolores R. at Dolores, CO | 2,076 | 1,305 | 1953-00

UPPER GREEN
9188500 | Green R. at Warren Bridge, WY | 2,240 | 1,212 | 1956-00
9196500 | Pine Cr. abv. Fremont L., WY | 2,235 | 197 | 1969-00
9205000 | New Fork R. nr. Big Piney, WY | 2,040 | 3,184 | 1974-00
9211150 | Fontenelle Res. inflow, WY | 1,952 | 11,080 | 1971-00
9229500 | Henrys Fork nr. Manila, UT | 1,818 | 1,346 | 1971-94

YAMPA / WHITE
9239500 | Yampa R. at Steamboat Springs, CO | 2,009 | 1,564 | 1953-00
9241000 | Elk R. at Clark, CO | 2,180 | 559 | 1953-93
9251000 | Yampa R. nr. Maybell, CO | 1,770 | 8,828 | 1956-00
9257000 | Little Snake R. nr. Dixon, WY | 1,899 | 2,558 | 1980-00
9260000 | Little Snake R. nr. Lily, CO | 1,706 | 9,657 | 1953-00
9304500 | White R. nr. Meeker, CO | 1,890 | 1,955 | 1953-00

LOWER GREEN
9266500 | Ashley Cr. nr. Vernal, UT | 1,869 | 261 | 1953-00
9275500 | W. Fork Duchesne R. nr. Hanna, UT, unimp. | 2,165 | 161 | 1974-00
9277500 | Duchesne R. nr. Tabiona, UT, unimp. | 1,857 | 914 | 1953-00
9279000 | Rock Cr. nr. Mountain Home, UT | 2,175 | 381 | 1964-00
9279150 | Duchesne R. abv. Knight Diversion, UT | 1,752 | 1,613 | 1964-00
9288180 | Strawberry R. nr. Duchesne, UT | 1,717 | 2,374 | 1953-00
9291000 | L. Fork R. bel. Moon L. nr. Mountain Home, UT | 2,391 | 290 | 1953-00
9295000 | Duchesne R. at Myton, UT, unimp. | 1,518 | 6,842 | 1956-00
9299500 | Whiterocks R. nr. Whiterocks, UT | 2,160 | 282 | 1953-00
9315000 | Green R. at Green R., UT | 1,212 | 116,111 | 1956-00
9317997 | Huntington Cr. nr. Huntington, UT | 1,935 | 461 | 1953-00

SAN JUAN R. BASIN
9349800 | Piedra R. nr. Arboles, CO | 1,844 | 1,628 | 1971-00
9353500 | Los Pinos R. nr. Bayfield, CO | 2,275 | 699 | 1953-00
9355200 | San Juan R. inflow to Navajo Res., NM | 1,697 | 8,440 | 1963-00
9361500 | Animas R. at Durango, CO | 1,951 | 1,792 | 1953-00
9363100 | Florida R. inflow to Lemon Res., CO | 1,941 | 47 | 1953-00
9365500 | La Plata R. at Hesperus, CO | 2,432 | 96 | 1954-00
9379500 | San Juan R. nr. Bluff, UT | 1,214 | 59,544 | 1956-00

LAKE POWELL
9379900 | L. Powell at Glen Canyon Dam | 930 | 278,822 | 1963-00

VIRGIN R.
9406000 | Virgin R. nr. Virgin, UT | 1,050 | 2,475 | 1957-00
9408150 | Virgin R. nr. Hurricane, UT | 834 | 3,881 | 1972-00

GILA R. BASIN
9430500 | Gila R. nr. Gila, NM | 1,397 | 4,826 | 1964-00
9432000 | Gila R. bel. Blue Creek nr. Virden, NM | 1,227 | 7,324 | 1954-00
9444000 | San Francisco R. nr. Glenwood, NM | 1,368 | 4,279 | 1964-00
9444500 | San Francisco R. at Clifton, AZ | 1,031 | 7,161 | 1953-00
9466500 | Gila R. at Calva, AZ | 755 | 29,694 | 1963-98
9498500 | Salt R. nr. Roosevelt, AZ | 653 | 11,148 | 1953-00
9499000 | Tonto Cr. abv. Gun Cr., nr. Roosevelt, AZ | 757 | 1,747 | 1955-00
9508500 | Verde R. bel. Tangle Cr., abv. Horseshoe Dam, AZ | 609 | 15,168 | 1953-00
Table 2. Resolution in the <0.1 and >0.9 categories of flow

Low flows
Basin | Jan | Feb | Mar | Apr | May
Upper Colorado | 0.4 | 0.5 | 0.6 | 0.6 | 0.7
Gunnison/Dolores | 0.5 | 0.6 | 0.7 | 0.8 | 0.8
Upper Green | 0.3 | 0.5 | 0.5 | 0.6 | 0.7
Yampa/White | 0.5 | 0.6 | 0.6 | 0.7 | 0.8
Lower Green | 0.6 | 0.7 | 0.7 | 0.8 | 0.9
San Juan | 0.5 | 0.6 | 0.7 | 0.7 | 0.8
Lake Powell | 0.8 | 0.8 | 0.8 | 0.8 | 0.8
Virgin | 0.4 | 0.4 | 0.5 | 0.7 | 0.5
Gila | 0.6 | 0.7 | 0.6 | 0.6 | –
All | 0.5 | 0.6 | 0.6 | 0.7 | 0.8

Moderate flows
Basin | Jan | Feb | Mar | Apr | May
Upper Colorado | 0.2 | 0.2 | 0.3 | 0.4 | 0.5
Gunnison/Dolores | 0.3 | 0.3 | 0.3 | 0.5 | 0.6
Upper Green | 0.1 | 0.2 | 0.2 | 0.3 | 0.5
Yampa/White | 0.3 | 0.2 | 0.3 | 0.4 | 0.5
Lower Green | 0.4 | 0.4 | 0.5 | 0.6 | 0.7
San Juan | 0.2 | 0.2 | 0.3 | 0.4 | 0.6
Lake Powell | 0.5 | 0.5 | 0.5 | 0.5 | 0.5
Virgin | 0.2 | 0.3 | 0.4 | 0.5 | 0.4
Gila | 0.3 | 0.3 | 0.4 | 0.4 | –
All | 0.3 | 0.3 | 0.4 | 0.4 | 0.6

High flows
Basin | Jan | Feb | Mar | Apr | May
Upper Colorado | 0.4 | 0.5 | 0.5 | 0.6 | 0.7
Gunnison/Dolores | 0.5 | 0.5 | 0.6 | 0.7 | 0.8
Upper Green | 0.4 | 0.6 | 0.6 | 0.6 | 0.7
Yampa/White | 0.6 | 0.5 | 0.6 | 0.6 | 0.7
Lower Green | 0.6 | 0.6 | 0.7 | 0.8 | 0.8
San Juan | 0.4 | 0.4 | 0.5 | 0.7 | 0.8
Lake Powell | 0.6 | 0.6 | 0.6 | 0.6 | 0.6
Virgin | 0.2 | 0.4 | 0.6 | 0.6 | 0.7
Gila | 0.4 | 0.5 | 0.6 | 0.6 | –
All | 0.5 | 0.5 | 0.6 | 0.7 | 0.8
Table 3. Rank of sites by relative resolution of flows

Rank | Point | Resolutionᵃ | Basin

Low flows, best resolution
1 | 9277500 | 0.95 | LG
2 | 9295000 | 0.94 | LG
3 | 9288180 | 0.81 | LG
4 | 9317997 | 0.79 | LG
5 | 9466500 | 0.76 | Gi
6 | 9379900 | 0.76 | LP
7 | 9353500 | 0.74 | SJ
8 | 9315000 | 0.74 | LG
9 | 9241000 | 0.73 | YW
10 | – | – | –

Low flows, worst resolution
45 | 9070500 | 0.54 | UC
46 | 9037500 | 0.54 | UC
47 | 9070000 | 0.54 | UC
48 | 9095500 | 0.52 | UC
49 | 9188500 | 0.49 | UG
50 | 9019000 | 0.47 | UC
51 | 9211150 | 0.44 | UG
52 | 9205000 | 0.43 | UG
53 | 9363100 | 0.43 | SJ
54 | 9406000 | 0.41 | Vi

High flows, best resolution
1 | 9288180 | 0.95 | LG
2 | 9257000 | 0.91 | YW
3 | 9211150 | 0.85 | UG
4 | 9277500 | 0.81 | LG
5 | 9317997 | 0.77 | LG
6 | 9363100 | 0.74 | SJ
7 | 9299500 | 0.72 | LG
8 | 9498500 | 0.70 | Gi
9 | 9295000 | 0.68 | LG
10 | – | – | –

High flows, worst resolution
45 | 9508500 | 0.50 | Gi
46 | 9365500 | 0.49 | SJ
47 | 9070000 | 0.49 | UC
48 | 9406000 | 0.49 | Vi
49 | 9408150 | 0.49 | Vi
50 | 9147500 | 0.47 | GD
51 | 9349800 | 0.44 | SJ
52 | 9304500 | 0.43 | YW
53 | 9466500 | 0.34 | Gi
54 | 9229500 | 0.25 | UG

ᵃSum of high and low probabilities.
List of Figures
1. Location of 54 water supply outlook forecast points in the Colorado R. Basin used in
this study.
2. Example reliability diagram illustrating relative frequency for various forecast skills.
Horizontal lines at 0.3 and 0.4 indicate no resolution for high and low flows and for
middle flows, respectively. Light vertical lines indicate forecast categories.
3. Example discrimination diagram illustrating relative frequency for observed low
flows. Light vertical lines indicate forecast categories.
4. For each year at 2 forecast points: Pbias values (a, e) with each circle representing a
different month (smallest is for 1st forecast made that year, largest is last forecast),
observed/average discharge values (b, f), years used in computing the climatological
average on which the forecast is based (c,g), and forecast period (d,h), with the top
hatch representing the first month and the lower hatch marking the last month of the
forecast period.
5. For the same 2 sites as on Figure 4 plus the San Juan R. near Bluff, fi/oi against
oi/ō for April and May. The horizontal lines at 0.8 and 1.2 are provided for
reference.
6. Summary and correlation measures associated with forecasts issued in January
through May. Pbias is a linear mean of values. There were 54 sites used in January-April and only 47 used in May, because the 8 Gila R. Basin sites do not issue May
forecasts.
7. Nash-Sutcliffe coefficient in April for entire area and each sub-region.
8. Skew coefficient G in April for entire area and each sub-region.
9. Frequency histograms of Hit Rate for observations in the lowest 30% of flows, the
middle 40% of flows, and the upper 30% of flows.
10. Frequency histograms of Threat Score for observations in the lowest 30% of flows,
the middle 40% of flows, and the upper 30% of flows.
11. Frequency histograms of False Alarm Rate for observations in the lowest 30% of
flows, the middle 40% of flows, and the upper 30% of flows.
12. Frequency histograms of Probability of Detection for observations in the lowest 30%
of flows, the middle 40% of flows, and the upper 30% of flows.
13. The LEPS value, Brier Score and Ranked Probability score, and the associated skill
scores for the New Fork R. near Big Piney (USGS 9205000).
14. Monthly average (a) LEPS skill scores, (b) Brier skill scores, and (c) ranked
probability skill scores for each basin.
15. Reliability diagrams for Green R. at Warren Bridge (9188500). The size of circles
indicates the relative frequency of the forecast.
16. Discrimination diagrams for Green R. at Warren Bridge (9188500).