
A SMOOTHED BOOTSTRAP APPROACH FOR MONTE CARLO
GENERATION OF N SPATIALLY CORRELATED GRAPHS OF
ANNUAL FLOW DURATION OR CUMULATIVE DISCHARGE
W.E. BARDSLEY
Department of Earth Sciences, University of Waikato, Private Bag 3105
Hamilton, New Zealand.
It is common for river discharges in the same or adjacent river basins to be spatially
correlated because the flows are generated from the same precipitation systems passing
over the region. This correlation needs to be incorporated when simulating general
seasonal flow patterns at N different sites for input to models of multi-tributary or multi-river engineering schemes. However, it is often difficult to generate sets of correlated
discharge variables which maintain a good description of the correlation relations seen in
the flow record. As a means around this difficulty, a smoothed bootstrap approach is
advocated by which the original record of correlated discharge observations is interpreted
as a sequence of realizations from an unknown N-variate multivariate distribution. A
nonparametric estimate of this unknown distribution is obtained via a multivariate kernel
approach by which N-variate normal distributions are located over each of the original
data vectors. Simulation then reduces to a two-step process for a given realization:
(i) one of the multivariate normal distributions is selected at random, and (ii) a set of
N correlated discharges is simulated from that distribution. This two-step process is
particularly simple and requires only the generation of multivariate normal random
variables. The method has the potential for extension to large values of N, regardless of
the complexity of the correlation structure in the original data. This represents an
improvement over the copula approach which can have difficulty in representing
complex correlations as N increases. The smoothed bootstrap approach is illustrated with
simulation of correlated annual cumulative inflows into two adjacent hydro storage lakes
in the South Island of New Zealand. A similar approach is applicable to simulating
correlated annual flow duration curves or indeed to simulating any set of correlated
variables for which there is an existing data record.
INTRODUCTION
It is common for large irrigation, hydro power, and flood control schemes to be impacted
by the discharge variations of multiple tributaries or multiple rivers. Setting the design
parameters of such large and complex schemes is aided by simulating many years of
discharges which seek to mimic the patterns of existing discharges without necessarily
simply repeating them. Simple repetition of past records is always possible of course by
way of bootstrap resampling of the data record. However, such tight control given to past
data may not be desirable. In particular, discharge extremes can never exceed the
recorded extreme observations.
The next logical step in moving to allow greater flexibility in the simulated values is
to view the existing record from N different discharge recording sites as equivalent to
random variables generated from N different statistical distributions fitted to site data. A
feature which must be incorporated in this process is the site discharge spatial
correlations which arise as a consequence of the rivers experiencing the same sequence of
runoff-generating events in the form of precipitation systems and/or snowmelts. The
specific causal mechanism of these correlations varies with the degree of averaging of the
discharge variable of interest. For example, instantaneous discharges may be correlated
through experiencing common storm events while mean annual flows will be correlated
by experiencing the same sequence of wet and dry years.
Discharge spatial correlations present a problem for simulations in that it is not a
simple process to generate sets of correlated random variables from N-variate
distributions which are flexible enough to accurately represent the existing N-variate
discharge data set. Typically such data are highly skewed with complex correlation
structures far removed from the standard multivariate distributions such as the
multivariate normal distribution.
One potential solution to the multivariate discharge simulation problem is through
the use of copulas [2]. In essence, copulas are a flexible means of representing
multivariate distributions in terms of independently defined marginal distributions and
correlation structures. However, the correlation structure is fully determined by the
choice of copula and it seems probable that as N increases any given copula is unlikely to
have the capacity to accurately reflect in all dimensions the correlation structures evident
in the data record. The purpose of this paper is to present briefly a more robust but simple
alternative to the copula approach to simulation, through the use of a multivariate
smoothed bootstrap.
AN N-VARIATE SMOOTHED BOOTSTRAP
The smoothed bootstrap approach adopted here utilises a kernel technique [4] for
estimating the unknown N-variate probability density. The respective sets of correlated
data are first rescaled to unit standard deviation and an N-variate normal distribution is
placed on each vector of correlated rescaled data, such that each of the N distribution
means corresponds to a data value. The component standard deviations of any given N-variate normal distribution are set to a constant common value. The magnitude of this
single value is free to be varied to influence the simulated data pattern, as is the
associated matrix of correlation coefficients. Thus defined, a given set of K N-variate
normal distributions will display as K bivariate normal distributions when viewed in any
2-variate plane, with the long axis gradient of all the bivariate distributions being +1.0 or
-1.0 depending on the sign of the corresponding bivariate correlation coefficient [3,
p. 255]. A data point will appear in the centre of each bivariate distribution.
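As a minimal sketch of the set-up described above, the following Python fragment rescales a small record to unit standard deviation and builds the covariance matrix of one kernel from a common standard deviation and a correlation matrix. The function names, the three-year record, and the parameter values used here are illustrative assumptions and are not taken from the original MATLAB code.

```python
import statistics


def rescale_to_unit_sd(records):
    """Divide each column (site) of `records` by its sample standard deviation.

    `records` is a list of rows, one row per year and one column per site.
    Returns the rescaled rows and the standard deviations used.
    """
    columns = list(zip(*records))
    sds = [statistics.stdev(col) for col in columns]
    scaled = [[x / s for x, s in zip(row, sds)] for row in records]
    return scaled, sds


def kernel_covariance(common_sd, corr):
    """Covariance of one N-variate normal kernel: common_sd**2 times `corr`."""
    return [[common_sd ** 2 * r for r in row] for row in corr]


# Hypothetical three-year record at N = 2 sites (illustrative numbers only).
records = [[10.0, 4.0], [14.0, 5.5], [12.0, 6.0]]
scaled, sds = rescale_to_unit_sd(records)

# One kernel is centred on each row of `scaled`; all share this covariance.
cov = kernel_covariance(0.15, [[1.0, 0.7], [0.7, 1.0]])
```

Because every component of a kernel has the same standard deviation, the principal axis of each bivariate projection lies along a gradient of +1.0 or -1.0, as noted in the text.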
In setting up the simulation process, the user visits a display of each possible 2-variate plane and by trial and error adjusts the bivariate correlation coefficients and the
single standard deviation value per N-variate normal distribution. This process of
adjustment inevitably involves a degree of subjectivity but the data display does provide
constraint on the simulation pattern. The respective correlation coefficients can be
adjusted independently in each 2-variate plane, but a change in a standard deviation value
will influence simulation patterns in all 2-variate planes because this single value is
common to all planes for a given N-variate normal distribution. In setting up the desired
parameter values it is helpful to display the 99% bivariate confidence ellipses about each
data point in the various planes. Increasing a correlation coefficient increases the
elongation of the associated confidence ellipse and increasing the standard deviation
increases the area of the ellipse. As will be shown in the example, a reasonable first
approximation can be achieved by giving all the N-variate distributions a common
correlation matrix and a common standard deviation.
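The effect of the two adjustable parameters on the 99% confidence ellipse can be sketched numerically. The fragment below assumes a bivariate normal kernel with equal component standard deviations, for which the ellipse axes lie along the +1.0 and -1.0 gradients; the function name and the trial values are illustrative assumptions.

```python
import math


def ellipse_99(sd, rho):
    """Semi-axes and area of the 99% ellipse of a bivariate normal kernel
    with equal standard deviations `sd` and correlation coefficient `rho`."""
    # 99% quantile of the chi-square distribution with 2 degrees of freedom.
    chi2_99 = -2.0 * math.log(0.01)
    # The covariance eigenvalues are sd**2 * (1 + |rho|) and sd**2 * (1 - |rho|).
    long_axis = sd * math.sqrt(chi2_99 * (1.0 + abs(rho)))
    short_axis = sd * math.sqrt(chi2_99 * (1.0 - abs(rho)))
    area = math.pi * long_axis * short_axis
    return long_axis, short_axis, area


# Elongation (long axis / short axis) for two trial correlation coefficients.
long_06, short_06, _ = ellipse_99(0.15, 0.6)
long_07, short_07, area_small = ellipse_99(0.15, 0.7)
elongation_06 = long_06 / short_06
elongation_07 = long_07 / short_07

# Doubling the common standard deviation quadruples the ellipse area.
_, _, area_large = ellipse_99(0.30, 0.7)
```

Under these assumptions the elongation depends only on the correlation coefficient, while the standard deviation scales both axes equally.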
The simulation can proceed once the N-variate normal distribution parameter values
have been set. A given simulation realization is achieved by selecting one of the N-variate normal distributions at random and then generating a vector of N correlated
variables from that distribution. This process is a smoothed bootstrap technique because
the simulated data can be near or far from the data points depending on the extent to
which the parameter specification results in the probability density being concentrated
around the data points. In the limit as the standard deviations tend to zero the simulation
process reverts to classical bootstrap resampling of the multivariate data set. In this way
the user can determine the extent to which their simulations are constrained by the data
record.
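The two-step simulation can be sketched in a few lines of Python. This is an illustrative stand-in for the procedure described above, not the original implementation: it assumes a common standard deviation and a common correlation matrix for all kernels, uses a hand-written Cholesky factorisation to generate the correlated normal variables, and works on a small hypothetical data set.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def smoothed_bootstrap(data, common_sd, corr, n_sims, rng):
    """Simulate `n_sims` vectors from normal kernels centred on the rows of `data`."""
    low = cholesky(corr)
    n = len(corr)
    sims = []
    for _ in range(n_sims):
        centre = rng.choice(data)                    # step (i): pick a kernel
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent N(0, 1) draws
        noise = [common_sd * sum(low[i][k] * z[k] for k in range(i + 1))
                 for i in range(n)]
        sims.append([c + e for c, e in zip(centre, noise)])  # step (ii)
    return sims


# Hypothetical rescaled data vectors for N = 2 sites.
data = [[3.1, 4.0], [4.2, 5.1], [5.0, 6.3], [3.8, 4.4]]
corr = [[1.0, 0.7], [0.7, 1.0]]
rng = random.Random(1)

smoothed = smoothed_bootstrap(data, 0.15, corr, 1000, rng)
# With a zero standard deviation the draws coincide with the data vectors,
# which is classical bootstrap resampling.
classical = smoothed_bootstrap(data, 0.0, corr, 50, rng)
```

Setting the standard deviation to zero recovers the limiting case noted in the text, while larger values spread the simulated points further from the record.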
EXAMPLE
The simulation method described above is general and can be applied to any data set of
correlated values, which need not necessarily arise from an observation process. This
point is illustrated in the following example which simulates cumulative annual inflow
plots for Lakes Tekapo (88 km²) and Pukaki (169 km²), located in the South Island of
New Zealand. These two lakes provide most of New Zealand’s hydroelectric storage and
their adjacent location results in a considerable degree of correlation in their seasonal
water inflows.
The simulations were not intended to capture the fine structure of inflow variations
so the base data used was simply cumulative lake inflow volumes at the end of each
month. The simulation method could in principle be applied to generate correlated sets of
the 24 inflow values for the two lakes for each year of simulation. However, a degree of
reduction in dimensionality is desirable if only to avoid the tedious process of inspecting
276 2-variate planes. To this end, the following equations were employed to give 3-parameter approximations to the respective 12-value inflow plots per year:
$$Pc_{im} = w_{1i}\,m^{1.23} + w_{2i}\,m^{1.92} + w_{3i}\,m^{3.14} \tag{1}$$
$$Tc_{im} = \omega_{1i}\,m^{1.11} + \omega_{2i}\,m^{1.93} + \omega_{3i}\,m^{3.86} \tag{2}$$
which respectively give, for Lakes Pukaki (P) and Tekapo (T), the cumulative inflow
volume at the end of the mth month of the ith year of record. These expressions are in the
form of weighted additive power functions, with each year requiring three weight values,
symbolised as w or ω to emphasise that the numerical weight values are different for the
respective lakes. These 3-parameter functions give a reasonable approximation to the
monthly inflows. Two example fits for Lake Pukaki are shown in Figure 1 and the
observed and fitted values for all months of record are shown in Figure 2.
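To make the form of the weighted power functions concrete, the sketch below evaluates one year of end-of-month cumulative inflows from three weights. The exponents are those read from eqs (1) and (2); the weight values are hypothetical and chosen only to produce a plausible rising curve, and m is taken as the month number 1 to 12.

```python
PUKAKI_EXPONENTS = (1.23, 1.92, 3.14)  # exponents read from eq. (1)
TEKAPO_EXPONENTS = (1.11, 1.93, 3.86)  # exponents read from eq. (2)


def cumulative_inflow(weights, exponents, months=range(1, 13)):
    """Cumulative inflow at the end of each month m for one year of weights."""
    return [sum(w * m ** p for w, p in zip(weights, exponents)) for m in months]


# Hypothetical weights for a single year (not fitted to any record).
pukaki_weights = (2.0, 0.5, 0.01)
pukaki_curve = cumulative_inflow(pukaki_weights, PUKAKI_EXPONENTS)
```

With positive weights the resulting curve increases monotonically through the year, and at m = 1 it equals the sum of the three weights.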
[Figure 1 plot: cumulative inflow (10⁸ m³) against fraction of year for 1992 and 1993, with values from the fitted function.]
Figure 1. Example fitted function (1) to Lake Pukaki inflows for 1992 and 1993
[Figure 2 plots: (a) Lake Pukaki monthly cumulative inflows; (b) Lake Tekapo monthly cumulative inflows.]
Figure 2. All observed and fitted monthly cumulative inflows for (a) Lake Pukaki (1926-2000) and (b) Lake Tekapo (1940-2000). Fitted functions are eqs (1) and (2) respectively.
Inflow volumes are in units of 10⁸ cubic metres.
The weighted power expressions (1) and (2) are not unique in any way and no doubt
numerous other functions could also describe the data at least as well. Nor is the
goodness of fit shown in Figure 2 of any particular surprise because those plots
represent multiple sets of 12 variables being approximated by 3-parameter functions. The
functions are simply a utilitarian means of reducing the dimensionality of the simulation
process from 24 correlated inflow variables to 6 correlated weight variables.
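The reduction from 12 monthly values to 3 weights can be sketched as a linear least-squares fit, since the functions are linear in the weights once the exponents are fixed. The fitting method is not stated in the text, so ordinary least squares is an assumption here, as are the function names and the synthetic test year.

```python
import math

EXPONENTS = (1.23, 1.92, 3.14)  # Lake Pukaki exponents read from eq. (1)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_weights(monthly_cumulative, exponents=EXPONENTS):
    """Least-squares weights for 12 end-of-month cumulative inflow values."""
    basis = [[m ** p for p in exponents] for m in range(1, 13)]
    k = len(exponents)
    gram = [[sum(row[i] * row[j] for row in basis) for j in range(k)]
            for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(basis, monthly_cumulative))
           for i in range(k)]
    return solve(gram, rhs)


# Synthetic year generated from known (hypothetical) weights, then refitted.
true_weights = (2.0, 0.5, 0.01)
observed = [sum(w * m ** p for w, p in zip(true_weights, EXPONENTS))
            for m in range(1, 13)]
fitted = fit_weights(observed)

# Number of 2-variate planes to inspect before and after the reduction.
planes_before = math.comb(24, 2)  # 276
planes_after = math.comb(6, 2)    # 15
```

The plane counts show why the reduction matters in practice: the number of 2-variate displays to inspect falls from 276 to 15.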
The simulation process in this case generates random (scaled) weight variables via
suitably parameterised 6-variate normal distributions as described earlier. For a given
realization the resulting six simulated correlated variables are then rescaled back to serve
as the weight values in eqs. (1) and (2), defining a single realization of a correlated pair
of annual inflow hydrographs.
For the purposes of this paper a simple multivariate parameterisation was adopted,
with all the component 6-variate normal distributions having a common standard
deviation of 0.15 and a common correlation matrix (Table 1). Assignment of the 15
correlation coefficient values was made on the basis of inspection of the 15 possible 2-variate plots of the standardised weight values. Where a higher degree of correlation of
the weight data plots was evident, a higher correlation coefficient was assigned to extend
the probability density along the ±1.0 gradient, corresponding to a situation of narrower
bivariate confidence ellipses for the 2-variate plane concerned. Where more broadly scattered
data points suggested a more diffuse probability density, lower correlation coefficients
were assigned. Figures 3 and 4 respectively illustrate the effect of the application of
higher and lower correlation coefficients. Pukaki weight values from 1940 onwards were
employed to ensure no missing values in the utilised multivariate weight set.
Table 1. Common correlation matrix of the three standardised weights for lakes Pukaki
(P) and Tekapo (T), which was applied to all the multivariate normal distributions used in
the simulations.
        PW1     PW2     PW3     TW1     TW2     TW3
PW1     1      -0.85    0.70    0.70   -0.70    0.60
PW2             1      -0.85   -0.70    0.70   -0.60
PW3                     1       0.70   -0.70    0.70
TW1                             1      -0.85    0.70
TW2                                     1      -0.85
TW3                                             1
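A correlation matrix assigned by inspection must still be positive definite before multivariate normal variables can be generated from it. The sketch below rebuilds the full symmetric matrix from the upper triangle of Table 1 and checks this with a Cholesky factorisation; the helper names are illustrative assumptions, and the check is an addition rather than a step described in the text.

```python
import math

LABELS = ["PW1", "PW2", "PW3", "TW1", "TW2", "TW3"]

# Upper triangle of Table 1, row by row, starting at the diagonal.
UPPER = [
    [1.0, -0.85, 0.70, 0.70, -0.70, 0.60],
    [1.0, -0.85, -0.70, 0.70, -0.60],
    [1.0, 0.70, -0.70, 0.70],
    [1.0, -0.85, 0.70],
    [1.0, -0.85],
    [1.0],
]


def full_matrix(upper):
    """Expand an upper-triangular listing into a full symmetric matrix."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            full[i][i + offset] = value
            full[i + offset][i] = value
    return full


def is_positive_definite(a):
    """True if a Cholesky factorisation of `a` can be completed."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


CORR = full_matrix(UPPER)
VALID = is_positive_definite(CORR)

# Kernel covariance for the common standard deviation of 0.15 used in the text.
COV = [[0.15 ** 2 * r for r in row] for row in CORR]
```

If trial-and-error adjustment in individual planes produced a matrix that failed this check, the coefficients would need further adjustment before simulation.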
The patterns of simulated weight values shown in Figures 3 and 4 represent two of
the 15 possible 2-variate projections of 5,000 simulated sets of 6 correlated weight
values. The data display and simulation process were carried out using MATLAB code,
utilising the mvnrnd function in the Statistics Toolbox to generate the multivariate normal
random variables. The simulated data show a tendency to cluster around the actual
weight values, indicating that the assigned 0.15 common standard deviation value gives a
fairly strong emphasis to the weights originally obtained in the fitting process. The next
refinement might be to assign higher standard deviations to the outlying points to give a
more even spread of probability density through to the main mass of the data values.
[Figure 3 plot: Lake Tekapo Weight 1 against Lake Pukaki Weight 1.]
Figure 3. Standardised actual (black) and simulated (grey) weight values in the PW1-TW1
plane, showing the elongation effect of the 0.7 correlation coefficient (Table 1)
[Figure 4 plot: Lake Tekapo Weight 3 against Lake Pukaki Weight 1.]
Figure 4. Standardised actual (black) and simulated (grey) weight values in the PW1-TW3
plane, showing the dispersive effect of the 0.6 correlation coefficient (Table 1)
Figure 5 shows just four of the 5000 pairs of simulated correlated cumulative inflow
plots. It is interesting to note how the random variation from the multivariate normal
distributions can sometimes generate a similar inflow pair, while more different curves
arise from other simulated years. Further analysis of the simulated cumulative inflows
could be carried out to establish joint probabilities of low inflows to both lakes as part of
hydro power risk analysis. However, this is not pursued further here.
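As an indication of what such a further analysis might involve, the fragment below estimates a joint low-inflow probability as the fraction of simulated years in which both lakes fall below chosen thresholds. The simulated totals and the thresholds are hypothetical placeholders and are not results from this study.

```python
def joint_low_probability(pairs, threshold_p, threshold_t):
    """Fraction of simulated years with both annual inflow totals below threshold.

    `pairs` holds (Pukaki total, Tekapo total) for each simulated year.
    """
    hits = sum(1 for p, t in pairs if p < threshold_p and t < threshold_t)
    return hits / len(pairs)


# Hypothetical end-of-year cumulative inflows (units of 10^8 m^3).
simulated_totals = [(30.0, 20.0), (45.0, 35.0), (28.0, 33.0), (25.0, 18.0)]
probability = joint_low_probability(simulated_totals, 32.0, 25.0)
```

In practice the estimate would be computed over all simulated pairs rather than the four placeholder values used here.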
[Figure 5 plots: four panels of cumulative inflow against fraction of year.]
Figure 5. Selected pairs of simulated correlated cumulative inflow curves for Lakes
Pukaki (upper curves) and Tekapo (lower curves). Inflow volumes are in units of 10⁸ m³
ANNUAL FLOW DURATION CURVES
The above methodology of simulating annual cumulative discharge curves can be
extended to annual flow duration curves. This would give a nonparametric alternative to
deriving annual flow duration variability using parametric distributions of river discharge
variations [1]. Allowance needs to be made for the minimum point on the curve being the
annual discharge minimum rather than zero, so an additive weighted power function
approximation to the annual flow duration curve could be written:
$$Q_P = w_1 + w_2 P^{a} + w_3 P^{b} + w_4 P^{c} + \dots \tag{3}$$
where P is the proportion of time flow is less than the discharge Q_P, and the w values are
correlated weights. For positive exponents a, b, c, ..., the annual minimum (P = 0) and
maximum (P = 1) are respectively the first weight and the sum of all weights, so the
annual maximum is now a multivariate realization.
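The end-point behaviour of the flow duration approximation can be checked with a short sketch of eq. (3). The exponent and weight values below are hypothetical, since the text leaves a, b and c unspecified; positive exponents are assumed.

```python
def flow_duration(weights, exponents, p):
    """Discharge Q_P from eq. (3) for a proportion of time p in [0, 1].

    `weights[0]` is the constant term; each remaining weight is paired with
    one positive exponent.
    """
    return weights[0] + sum(w * p ** e for w, e in zip(weights[1:], exponents))


# Hypothetical weights and exponents for one simulated year.
weights = (5.0, 20.0, 40.0, 150.0)
exponents = (0.8, 3.0, 12.0)

annual_minimum = flow_duration(weights, exponents, 0.0)
annual_maximum = flow_duration(weights, exponents, 1.0)
```

Under these assumptions the minimum equals the first weight and the maximum equals the sum of all weights, so both extremes are determined by the simulated weight vector.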
CONCLUSION
It might be argued that simulating correlated hydrological variables using the smoothed
bootstrap approach is weakened by the subjective allocation of the multivariate normal
standard deviations and correlation coefficients. However, subjectivity is unavoidable in
any data-fitting process involving distribution selection. For example, copulas require
subjective specification of the marginal distributions and the inevitable error introduced
in this selection cannot be offset by the application of formal estimation of parameter
values. Also the selection of the utilised copula is itself a subjective decision, guided by
how well the simulated points appear to fall among the data points in the various
projection planes.
The smoothed bootstrap method would therefore appear to be no better or worse than
other means of simulating multivariate hydrological variables. However, it does have a
significant advantage in terms of flexibility and ease of use and therefore has potential as
a practical tool for application in a wide range of water resource simulations in a
correlated multivariate context.
REFERENCES
[1] Castellarin A., Vogel R. M. and Brath A., “A stochastic index flow model of flow
duration curves”, Water Resour. Res., Vol. 40, (2004), W03104, doi:
10.1029/2003WR002524.
[2] Favre A-C., El Adlouni S., Perreault L., Thiémonge N. and Bobée B., “Multivariate
hydrological frequency analysis using copulas”, Water Resour. Res., Vol. 40, (2004),
W01101, doi: 10.1029/2003WR002456.
[3] Kotz S., Balakrishnan N. and Johnson N. L., “Continuous Multivariate Distributions
Volume 1”, 2nd edition, Wiley, (2000).
[4] Scott D.W., “Multivariate Density Estimation”, 1st edition, Wiley, (1992).