The classification of hydrologically homogeneous regions

Hydrological Sciences-Journal-des Sciences Hydrologiques, 44(5) October 1999
M. J. HALL & A. W. MINNS
International Institute for Infrastructural, Hydraulic and Environmental Engineering,
PO Box 3015, 2601 DA Delft, The Netherlands
e-mail: [email protected]
Abstract With the operation and maintenance of streamgauging networks in many
developing countries coming under increasing pressure through lack of funds and
suitably trained personnel, greater reliance must be placed on procedures for transferring information from gauged to ungauged catchment areas. These approaches to
generalizing hydrological variables, such as the quantiles of the frequency distributions of floods and low flows, are collectively referred to as regionalization
methods. An important feature of these methods is the demarcation of hydrologically
homogeneous regions. The latter may be regarded as an example of the wider problem
of classification of data sets, for which a variety of modern informatic tools, such as
artificial neural networks and fuzzy sets, may be invoked. Application of examples of
these techniques to flood data for the southwest of England and Wales has
demonstrated that classes may be defined by Representative Regional Catchments
(RRCs), whose characteristics are hydrologically more appealing than those imparted
merely by geographical proximity. The techniques employed, Kohonen networks and
fuzzy c-means, are straightforward in application, and were found to identify broadly
similar RRCs. The results indicate the feasibility of employing these methodologies
on a country-wide basis.
Classification en régions hydrologiques homogènes
Résumé Du fait du manque de moyens et de personnel qualifié, la maintenance et la
mise en œuvre des réseaux de mesures en rivière dans les pays en voie de
développement deviennent de plus en plus difficiles. Aussi est-il nécessaire de
développer des procédures fiables pour le transfert de l'information entre bassins
jaugés et bassins non-jaugés. Ces approches de généralisation des variables hydrologiques comme les quantiles des pointes de crue et des étiages, sont désignées sous le
terme de méthodes de régionalisation. Une caractéristique importante de ces méthodes
est la détermination de régions hydrologiques homogènes. Ce problème peut être vu
comme un cas particulier de problème de classification, pour lequel divers outils
informatiques modernes comme les réseaux de neurones artificiels ou les ensembles
flous peuvent être utilisés. L'application de telles techniques à des données de crues
dans le sud-ouest de l'Angleterre et du Pays de Galles a permis de montrer que les
classes peuvent être définies en termes de Bassins Versants Représentatifs (BVR),
dont les caractéristiques hydrologiques sont un facteur déterminant plus que la
proximité géographique. Les techniques employées, des réseaux neuronaux de type
Kohonen et des méthodes de moyennes floues, sont très faciles d'utilisation et
permettent d'identifier des BVR similaires. Les résultats indiquent qu'il est possible
d'employer de telles méthodes à l'échelle du pays.
INTRODUCTION
The estimation of flow quantiles for a catchment area having no records of discharge
continues to be one of the principal problems facing the engineering hydrologist. The
flood formulae, relating the so-called maximum flood of a catchment to its morphological characteristics in general, and basin area in particular, can be recognized as one
Open for discussion until 1 April 2000
of the first approaches to this problem. The advent of systematic river gauging
subsequently provided the basis for the application of statistical methods, which in turn
led to the exploration of techniques by which the frequency information for the gauged
catchments could be transferred to nearby ungauged river basins. This process of
regionalization has generally been based upon two components:
(a) a dimensionless growth curve relating the quotient of the magnitudes of a T-year
flood and an index flood to return period, T, for a group of sites; and
(b) an equation relating the magnitude of an index flood, such as the mean annual
flood, to catchment and generalized rainfall characteristics that can be read from
available maps for the same group of sites.
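The two components combine multiplicatively: the T-year flood at a site is the growth factor for return period T multiplied by the site's index flood. A minimal sketch in Python follows. The growth factors and the index flood are hypothetical values chosen purely for illustration, and the function name `flood_quantile` is not taken from the literature cited here.

```python
def flood_quantile(index_flood, growth_curve, return_period):
    """Q_T = (dimensionless growth factor for T) * (index flood)."""
    return growth_curve[return_period] * index_flood

# Hypothetical regional growth factors Q_T / Q_index (illustrative only).
growth_curve = {2: 0.9, 10: 1.5, 50: 2.1, 100: 2.4}

# Hypothetical index flood (e.g. a mean annual flood) in m^3/s.
index_flood = 80.0

q100 = flood_quantile(index_flood, growth_curve, 100)
```

In practice, component (b) would supply `index_flood` for an ungauged site from its catchment characteristics, while component (a) would supply `growth_curve` for the region to which the site is allocated.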
This methodology has been widely applied (e.g. the review by Hall (1981) and
references therein), and depends fundamentally upon the ability to identify groupings
of sites which define a region. Invariably, item (b) above has been developed by the
systematic application of multiple linear regression analysis (MLRA) in which:
(i) the index flood is regressed upon catchment and rainfall characteristics for the
whole data set;
(ii) the residuals, i.e. the difference between observed and computed values of the
index flood, are plotted geographically in order to identify groups of these
differences that are similar in both magnitude and sign and therefore may be
regarded as a sub-region; and
(iii) the regression analysis is repeated for the sub-regions identified and then
generalized across the whole region.
Obviously, this approach depends heavily on geographical proximity in defining
the sub-regions. In contrast, more recent work has turned to the application of multivariate techniques, such as cluster analysis to define homogeneous regions, and discriminant analysis to allocate an ungauged catchment to an appropriate region
(e.g. Mosley, 1981; Hawley & McCuen, 1982; Acreman & Sinclair, 1986; Wiltshire,
1986b; Burn, 1989; Bhaskar & O'Connor, 1989). However, echoing the warnings in
the statistical literature (e.g. Chatfield & Collins, 1980, Chapter 10), Nathan &
McMahon (1990) have clearly summarized the pitfalls of these approaches. In
particular, any group of variables is capable of yielding clusters, and different
structures can be produced by adopting different algorithms and distance measures.
Moreover, discriminant analysis will always allocate an ungauged site to one of the
sub-regions. Nevertheless, the notion that a group of sites does not necessarily have to
be geographically close to form a sub-region as in Wiltshire (1985), or that each site
may have affinity with a different set of sites for quantile estimation, as in the region-of-influence approach (Burn, 1990), is intuitively appealing.
An alternative to geographical proximity as a measure of affinity offered by some
clustering algorithms is Euclidean distance in the n-dimensional feature space that is
defined by the n characteristics that have been adopted for site description. This
quantity is defined more formally below, but may be based on standardized flow
statistics (Mosley, 1981; Wiltshire, 1985), selected physical features of a catchment
(Acreman & Sinclair, 1986), or a combination of both (Burn, 1990). In the latter study,
a threshold value of distance was employed to identify the group of catchments that
define the region of influence of one particular site. This process of feature detection or
pattern classification may also be carried out using modern informatic tools, such as
Artificial Neural Networks (ANNs). To date, ANNs have been applied successfully to
rainfall-runoff modelling (e.g. Minns & Hall, 1996; Dawson & Wilby, 1998) by
training the network to develop a relationship between a rainfall "input" and a
discharge "output", a process known as supervised learning. However, neural
networks can also be applied in a mode of unsupervised or competitive learning (see
Beale & Jackson, 1990; Aleksander & Morton, 1990) using a particular type of ANN
called a Kohonen network for which there are no output data as such, but a feature line
or map. A Kohonen network therefore has the potential to define both the number of
"classes" in a data set and the features that define each class.
A possible disadvantage to methods of classification based upon Euclidean
distance, including Kohonen networks, is the absolute certainty of the allocation to a
particular class. Taking an alternative viewpoint, Wiltshire (1986b) employed
discriminant analysis based upon catchment characteristics to evaluate the fractional
memberships of the clusters previously defined from the flow statistics. These
concepts provided tacit recognition of the possibility that the prior definition of sub-regions may not be all-embracing, i.e. some sites may have an affinity with more than
one sub-region. This situation can also be expressed conveniently in terms of fuzzy
variables, which may have different levels of membership of different fuzzy sets.
Indeed, the situation may arise that the allocation of a set of features to (say) one of
two sets is totally ambiguous, with the features showing a membership level close to
0.5 for both.
In this paper, the problem of defining regions for the analysis of flow quantiles is
re-examined in the framework of both ANNs and fuzzy sets. In both cases, each site
may be defined by a finite number of features. As described in the following section,
the Kohonen network uses these features as inputs, and identifies similar patterns by
firing particular output units. The number of units that are fired defines the potential
number of classes, and the input patterns that trigger each output unit serve to quantify
that class. Alternatively, the allocation of a set of features to one of a predetermined
number of classes may be derived in terms of a membership level. This allocation may
be accomplished using the technique of fuzzy c-means (Ross, 1995; Klir & Yuan,
1995), as described in the next section. In the final section, the application of these
approaches to a sample of 101 sites from two of the regions identified in the United
Kingdom Flood Studies Report (FSR) (NERC, 1975a) is evaluated in terms of the
features adopted to define the clusters. The potential utility of the suggested
approaches is summarized in the concluding remarks.
CLASSIFICATION
The general problem of classification can be summarized in the following terms. Given
a sample X of K data, i.e.
$$X = \{x_1, x_2, x_3, \ldots, x_k, \ldots, x_K\} \tag{1}$$
where each data point is defined by N features:
$$x_k = \{x_{k1}, x_{k2}, \ldots, x_{kn}, \ldots, x_{kN}\} \tag{2}$$
a procedure is required to identify the number of classes c into which X can be
partitioned, where 2 ≤ c < K. The upper limit to this range represents the trivial case in
which each data point forms a separate cluster, whereas the use of 2 as the lower limit
avoids the notion that there are no clusters at all in the data set. The process of
classification is based upon the assumption that the members of a cluster are
mathematically more similar to each other than to members of other clusters. A
commonly-applied measure of similarity is the Euclidean distance, the use of which
depends upon such distances being considerably less between points in the same
cluster than between points in different clusters. Given that the features included are
indeed sensitive to the purpose of the analysis, the objective should be to identify the
c-value that partitions the data set into the most plausible number of clusters. Hence,
the dual objectives are (a) to minimize the Euclidean distance between each point in
the feature space and the centre of the cluster to which it belongs; and (b) to maximize
the distance between the centres of the clusters. Among the techniques available to
accomplish these objectives are methodologies based upon artificial neural networks
and the theory of fuzzy sets.
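The dual objectives (a) and (b) can be made concrete with a short Python sketch that computes, for a candidate partition, the mean distance of points to their own cluster centre and the smallest distance between any two centres. The helper names and the four two-feature data points are assumptions made for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points in the feature space."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(points):
    """Arithmetic mean of a list of points, feature by feature."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def cluster_quality(clusters):
    """Return (mean within-cluster distance, minimum between-centre distance)."""
    centres = [centroid(c) for c in clusters]
    within = [euclidean(p, centres[i])
              for i, c in enumerate(clusters) for p in c]
    between = [euclidean(centres[i], centres[j])
               for i in range(len(centres))
               for j in range(i + 1, len(centres))]
    return sum(within) / len(within), min(between)

# Hypothetical partition of four points into two clusters.
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 0.0], [10.0, 2.0]],
]
mean_within, min_between = cluster_quality(clusters)
```

A plausible partition is one for which the first quantity is small relative to the second; here the within-cluster distances are an order of magnitude smaller than the separation of the centres.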
Artificial neural networks
ANNs originated largely in the field of pattern recognition, and are notable for their
ability to "learn" the relationship between a set of inputs and outputs without a priori
knowledge of the underlying physical process that connects them. In general, the
numbers of input and output nodes in the network correspond to the numbers of inputs
and outputs of the deterministic relation being learned. However, sandwiched in
between the layers of input and output nodes are one or more intermediate layers
whose nodes are directly connected to all those in the input and output layers.
Associated with each connection is a weight which can either inhibit or amplify the
signal being transmitted. The nodes then act as summation devices for the (weighted)
incoming signals, which are then transformed to an output signal using a threshold
function, which restricts its range to the interval zero-to-one. Standard algorithms, such
as the back-propagation method, are available for manipulating the weights so that the
ANN reproduces the output from the input with minimum error (Beale & Jackson,
1990; Aleksander & Morton, 1990). The process of adjusting the weights is referred to
as "training the network", and the desired input-output relationship is encapsulated in
the weights.
This is the process which is referred to as "supervised learning". In "unsupervised
learning", the emphasis changes from "learning" input-output relationships to that of
"recognizing" patterns in the input data. The Kohonen network is a typical tool for this
purpose, consisting of a single layer of J output nodes, each of which is connected to
all N input nodes. The training process begins by initializing the weights, $w_{nj}$, between
the nth input and the jth output nodes. The network must then "decide" which output
node is associated with each of the K input patterns, $x_k$, as it is presented, and then
"fire" it. The node to be fired is decided by computing a similarity measure, such as
the Euclidean distance for each output node, j:
$$D_{kj} = \sum_{n=1}^{N} (x_{kn} - w_{nj})^2 \tag{3}$$
The "winning" node for a given input pattern is then selected as that with the smallest
Euclidean distance measure. The affinity of the winning node to the input is then
enhanced by adjusting the weights connected to the winning node by an amount that is
proportional to the difference between the input vector and the weight vector. Similar
input patterns should therefore fire nodes that are close together. In order to maintain
this neighbourhood feature, the weights of connections to nodes that are adjacent to the
winner are also updated, but the number of nodes being changed decreases as training
progresses. A visual impression of the final output can be obtained by mapping the
positions of the winning nodes for each vector of inputs, or by counting the number of
occasions each node is fired for the whole input data set. In effect, each frequently-fired node defines a class, and the input vectors that fire that node are the members of
that class.
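The training procedure described above can be sketched in Python for a one-dimensional line of output nodes. This is a minimal illustration rather than the implementation used in this study: the linear decay of the learning rate and neighbourhood radius, the default parameter values, the random initialization and the four test patterns are all assumptions. The winner is chosen on the sum of squared differences, which selects the same node as the Euclidean distance itself.

```python
import random

def winner(weights, x):
    """Index of the output node whose weight vector is closest to pattern x."""
    dists = [sum((xn - wn) ** 2 for xn, wn in zip(x, w)) for w in weights]
    return dists.index(min(dists))

def train_kohonen(patterns, n_nodes, epochs=200, rate0=0.5, seed=0):
    """Train a 1-D Kohonen layer; returns one weight vector per output node."""
    rng = random.Random(seed)
    n_features = len(patterns[0])
    weights = [[rng.random() for _ in range(n_features)]
               for _ in range(n_nodes)]
    radius0 = n_nodes // 2
    for epoch in range(epochs):
        shrink = 1.0 - epoch / epochs      # assumed linear decay schedule
        rate = rate0 * shrink              # learning rate falls towards zero
        radius = int(radius0 * shrink)     # neighbourhood narrows with training
        for x in patterns:
            j_win = winner(weights, x)
            lo = max(0, j_win - radius)
            hi = min(n_nodes, j_win + radius + 1)
            for j in range(lo, hi):
                # Move the winner and its neighbours towards the input pattern.
                weights[j] = [w + rate * (xn - w)
                              for w, xn in zip(weights[j], x)]
    return weights

# Hypothetical patterns, already scaled to the interval zero-to-one.
patterns = [[0.10, 0.10], [0.12, 0.08], [0.90, 0.90], [0.88, 0.92]]
weights = train_kohonen(patterns, n_nodes=4)
fired = [winner(weights, x) for x in patterns]
counts = [fired.count(j) for j in range(4)]
```

The list `counts` corresponds to the number of occasions on which each node is fired, and the trained weight vectors of the frequently-fired nodes play the role of class centres in the feature space.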
Fuzzy classification
The method of c-means (Ross, 1995, Chapter 11; Klir & Yuan, 1995, Chapter 13) is a
method of classification that may be applied using either hard (crisp) or soft (fuzzy)
partitions. With hard partitions of the data, each point is assigned to one, and only one,
cluster. However, if the partitions are fuzzy, each point is allowed a degree of
membership in more than one class. In effect, the fuzzy partitioning defines a family of
fuzzy sets, $A_i$, i = 1, 2, ..., c, on the universe of data points, X. The membership value
that the data point k has in the class i may be denoted by:
$$\mu_{ik} = \mu_{A_i}(x_k) \in [0, 1], \quad \text{provided that} \quad \sum_{i=1}^{c} \mu_{ik} = 1 \ \text{for all } k \tag{4}$$
The restrictions on membership dictate that the sum of all membership values for a
single data point over all classes must be unity; and that there are neither empty classes
nor a class that contains all the data points. These membership values may be
summarized in terms of a fuzzy partition matrix, U, with c rows and K columns. Since
the number of membership values that are possible to describe class membership for
each data point is infinite, a classification criterion or objective function is required to
cluster the data set. In the method of c-means, the objective function is based upon the
Euclidean distance between each data point and its cluster centre, $v_i$, i = 1, 2, ..., c:
$$d_{ik} = d(x_k - v_i) = \left[ \sum_{n=1}^{N} (x_{kn} - v_{in})^2 \right]^{1/2}, \quad \text{where } v_i = \{v_{i1}, v_{i2}, \ldots, v_{iN}\} \tag{5}$$
Using the definition of membership of equation (4), the objective function is given
by:
$$F_{obj} = \sum_{k=1}^{K} \sum_{i=1}^{c} (\mu_{ik})^r (d_{ik})^2 \tag{6}$$
where r is a weighting parameter controlling the amount of fuzziness in the process of
classification. When r = 1, the partitioning becomes hard, but as r increases the
membership assignments of the clustering become more fuzzy. Reported values (Ross,
1995) are generally in the range 1.25 < r < 2.
The coordinates of the cluster centre in the feature space for class i may be
computed from:
$$v_{in} = \frac{\sum_{k=1}^{K} (\mu_{ik})^r x_{kn}}{\sum_{k=1}^{K} (\mu_{ik})^r}, \quad n = 1, 2, \ldots, N \tag{7}$$
The optimum fuzzy c-partition is associated with the minimum value of Fobj from
equation (6). An iterative approach may be applied to determine the best solution
available to a prescribed level of accuracy as follows:
1. Select values for the number of classes, c, and the weighting parameter, r; then
denoting each step by a superscript (p) = 0, 1, 2, ...,
2. Initialize the partition matrix, $U^{(0)}$.
3. Calculate the cluster centres, $v_i^{(p)}$, using equation (7).
4. Update the partition matrix to $U^{(p+1)}$, the elements of which are given by:
$$\mu_{ik}^{(p+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}^{(p)}}{d_{jk}^{(p)}} \right)^{2/(r-1)} \right]^{-1} \tag{8}$$
5. If $U^{(p+1)}$ does not differ from $U^{(p)}$ by more than a prescribed limit ε, then terminate
the algorithm; otherwise set p = p + 1 and repeat from step 3.
In practice, this algorithm has been shown to be robust and tolerant of the
membership values assumed in the initial partition matrix, U(0). However, convergence
tends to be slower as the value assumed for r increases.
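The iterative procedure can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code used for the case study: the deterministic starting matrix produced by `initial_partition`, the handling of a point that coincides exactly with a centre, the iteration cap `max_iter` and the six test points are all introduced here for illustration. The update of the memberships requires r > 1.

```python
import math

def initial_partition(c, K):
    """Assumed starting matrix: c rows, K columns, each column summing to 1."""
    U = [[0.4 / (c - 1)] * K for _ in range(c)]
    for k in range(K):
        U[k % c][k] = 0.6
    return U

def fuzzy_c_means(X, c, r=2.0, eps=0.01, max_iter=100):
    """Fuzzy c-means with weighting parameter r > 1 and tolerance eps."""
    K, N = len(X), len(X[0])
    U = initial_partition(c, K)
    for _ in range(max_iter):
        # Cluster centres as membership-weighted means of the data points.
        V = []
        for i in range(c):
            w = [U[i][k] ** r for k in range(K)]
            V.append([sum(w[k] * X[k][n] for k in range(K)) / sum(w)
                      for n in range(N)])
        # Euclidean distance from every data point to every centre.
        D = [[math.sqrt(sum((X[k][n] - V[i][n]) ** 2 for n in range(N)))
              for k in range(K)] for i in range(c)]
        # Updated membership values.
        U_new = [[0.0] * K for _ in range(c)]
        for k in range(K):
            hits = [i for i in range(c) if D[i][k] == 0.0]
            if hits:
                # Point coincides with a centre: assumed equal sharing.
                for i in hits:
                    U_new[i][k] = 1.0 / len(hits)
            else:
                for i in range(c):
                    U_new[i][k] = 1.0 / sum(
                        (D[i][k] / D[j][k]) ** (2.0 / (r - 1.0))
                        for j in range(c))
        # Largest change in any element of the partition matrix.
        change = max(abs(U_new[i][k] - U[i][k])
                     for i in range(c) for k in range(K))
        U = U_new
        if change <= eps:
            break
    return U, V

# Hypothetical data: two well-separated groups of three points each.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
U, V = fuzzy_c_means(X, c=2)
```

The returned centres `V` are those from which the final memberships were computed. For data as clearly separated as this example, each point ends with a membership close to unity in one class; less distinct data would leave more of the membership shared.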
The entries in the partition matrix corresponding to minimum $F_{obj}$ indicate the
extent to which any point $x_k$ has shared membership across the assumed number of
classes, c. A measure of the success to which the data set has been decomposed into
classes is given by the fuzzy partition coefficient:
$$F_c = \frac{\operatorname{tr}(U * U^{T})}{K} \tag{9}$$
where * and T denote the standard matrix operations of multiplication and transposition,
and tr denotes the trace that is the sum of the diagonal elements of the c x c matrix within
the brackets. An Fc value of unity is obtained if the partitioning has been crisp, i.e. all
entries in U are either zero or one, and a value of 1/c indicates complete ambiguity, i.e. all
membership values are 1/c. In effect, the diagonal elements in the matrix are proportional
to the unshared membership of the data sets within the fuzzy classes.
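Because the trace of the product of U with its transpose equals the sum of the squares of all membership values, the fuzzy partition coefficient can be computed without forming the matrix product. The short Python sketch below checks the two limiting cases quoted above; the function name and the two small example matrices are illustrative assumptions.

```python
def partition_coefficient(U):
    """tr(U * U^T) / K for a partition matrix U with c rows and K columns."""
    K = len(U[0])
    return sum(u * u for row in U for u in row) / K

# Crisp partition of three points into two classes.
crisp = [[1.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

# Completely ambiguous partition: every membership equals 1/c = 0.5.
ambiguous = [[0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5]]

fc_crisp = partition_coefficient(crisp)
fc_ambiguous = partition_coefficient(ambiguous)
```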
Although the ambiguity in the classification may be of importance, particularly
with respect to the behaviour of individual data points, an ultimate assignment to a
particular class may be obtained by hardening the fuzzy partition matrix, U. The two
most common methods of defuzzification are the maximum membership and the
nearest centre methods. In the former, the largest element in each column of matrix, U,
is assigned a value of unity and all the others are set to zero. In the latter, each data
point is assigned to the class to which it is closest, i.e. the criterion is the minimum
Euclidean distance between the data point and the nearest cluster centre.
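Both methods of defuzzification are simple to express in code. The following Python sketch is illustrative only: the function names, the rule that a tie is resolved in favour of the lower-numbered class, and the small example inputs are assumptions.

```python
def harden_max_membership(U):
    """Set the largest element in each column of U to one and the rest to zero."""
    c, K = len(U), len(U[0])
    H = [[0] * K for _ in range(c)]
    for k in range(K):
        # Ties go to the lower-numbered class (an assumption of this sketch).
        best = max(range(c), key=lambda i: U[i][k])
        H[best][k] = 1
    return H

def harden_nearest_centre(X, V):
    """Assign each data point to the class whose centre is closest."""
    labels = []
    for x in X:
        d2 = [sum((xn - vn) ** 2 for xn, vn in zip(x, v)) for v in V]
        labels.append(d2.index(min(d2)))
    return labels

# Hypothetical fuzzy partition of three points into two classes.
U = [[0.7, 0.2, 0.5],
     [0.3, 0.8, 0.5]]
H = harden_max_membership(U)

# Hypothetical data points and cluster centres.
labels = harden_nearest_centre([[0.0, 0.0], [5.0, 5.0]],
                               [[1.0, 1.0], [4.0, 4.0]])
```

The third column of `U` shows why the ambiguity is worth inspecting before hardening: a point with equal membership of both classes is still forced into one of them.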
APPLICATION OF ALGORITHMS
In practice, the situation faced by the engineering hydrologist is that of having flow data
and catchment characteristics for some stations but only the latter for other sites for
which flow quantiles are required. Therefore, tests of homogeneity applied to observed
flow series (e.g. Wiltshire, 1986a; Hosking & Wallis, 1993) are only partially helpful in
that a methodology is still required to classify the ungauged sites. Moreover, the
resources available to develop regionalized flow estimates are generally constrained
such that only the most easily measured, or the most readily available, catchment
characteristics can be adopted as a basis for classification. Therefore, in order to test the
two above-mentioned methodologies, a case study was developed from previously-published information in the knowledge that the selected characteristics inevitably do not
provide a fully comprehensive description of each catchment.
The data selected for this study consisted of tabulated characteristics for gauged
catchments within Region 8 (South West England) and Region 9 (Wales), as
designated in the United Kingdom Flood Studies Report (FSR) (NERC, 1975a). The
characteristics selected included catchment area, AREA, main stream length, MSL,
main stream slope, S1085, mean annual rainfall, SAAR, and winter rain acceptance
potential or soil index, SOIL. The data sets consisting of these five features were
compiled from listings in volume II of the FREND Study (Gustard et al., 1989)
supplemented by volume IV of the FSR (NERC, 1975b). A total of 47 data sets,
including representative sites from Hydrometric Areas 45-53, were obtained for
Region 8, and 54 data sets, covering Hydrometric Areas 54-67, were extracted for
Region 9. Since the five features had different units, all data sets were standardized
prior to analysis, i.e. the standardized variate for feature n at site k:
$$x_{kn} = \frac{y_{kn} - \bar{y}_n}{\sigma_n} \tag{10}$$
where $y_{kn}$ is the nth feature at site k, and $\bar{y}_n$ and $\sigma_n$ are the mean and standard
deviation of the nth feature within the data set. These standardized characteristics were
employed as the basis for classifying the catchments using the method of fuzzy
c-means. For the latter, the weighting factor was set to two for all computations, and
calculations were continued until the elements of the partition matrices did not change
more than ε = 0.01 between iterations. However, in applying the Kohonen network, the
input data were standardized to lie within the interval zero-to-one using the alternative
formulation:
$$x_{kn} = \frac{y_{kn} - y_{n(\min)}}{y_{n(\max)} - y_{n(\min)}} \tag{11}$$
where $y_{n(\max)}$ and $y_{n(\min)}$ are the maximum and minimum values of the nth feature
within the data set.
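The two forms of standardization can be sketched in Python as below. The text does not state whether the standard deviation was computed with a divisor of K or K - 1; the sample form (`statistics.stdev`) is assumed here. For illustration the functions are applied to the eight SAAR values listed for the Representative Regional Catchments in Table 1, not to the original 101 sites.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # assumption: K - 1 divisor
    return [(y - mean) / sd for y in values]

def rescale_unit(values):
    """Rescale linearly so that the values lie in the interval zero-to-one."""
    lo, hi = min(values), max(values)
    return [(y - lo) / (hi - lo) for y in values]

# SAAR values of the eight non-empty output nodes in Table 1.
saar = [1621, 1561, 1549, 1449, 1279, 1244, 1225, 1223]

z_scores = standardize(saar)     # input form used for fuzzy c-means
unit_vals = rescale_unit(saar)   # input form used for the Kohonen network
```

Either transformation removes the effect of differing units, so that a feature with large absolute values such as SAAR does not dominate the Euclidean distance.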
Application of the Kohonen network
The principle of training a Kohonen network is the same as that for any ANN, namely
the repeated presentation of the input data sets (in this case, five catchment
characteristics per gauging site) until the output response of each input vector has
stabilized and the resultant weight changes are negligible. In this application, there
were five input nodes and 101 patterns. For the design of a Kohonen network, Melssen
et al. (1994) suggest that:
Number of patterns ≫ Number of output nodes > 2 × Number of classes
Since at least two, or possibly three, classes were expected, a set of ten output
nodes was adopted. The results from repeated presentations of the input patterns with
different randomized starting weights are summarized in Table 1, which indicates a
clustering around three distinct output nodes. These "classes" contained 25, 35 and 41
members, respectively. The weights associated with the five connections from the
input nodes to each output node, which are the standardized cluster centres in Euclidean space, define what
may be termed Representative Regional Catchments (RRCs). The de-standardized
catchment characteristics of these RRCs are presented in Table 1 for all eight non-zero
output nodes. This table demonstrates that the variations of each site characteristic are
essentially monotonic, and that the classes identified move from relatively small, steep
catchments with an average annual rainfall around 1600 mm and a high winter rain
acceptance potential to larger, flat drainage areas with an average annual rainfall of
about 1250 mm and a smaller SOIL index. For each of these three groupings, the
averages of the five catchment features were computed, giving rise to the RRCs
summarized in Table 2. For convenience, the grouping of the smallest catchments is
referred to as class I, and that of the largest catchments as class III. Table 2 shows that
class II is an intermediate case between these two. That an unsupervised learning
technique should produce classes which are supportable from a hydrological viewpoint
is a gratifying result. The numbers of sites per class are also summarized in Table 3.
Application of fuzzy c-means
Unlike the Kohonen network, the expected number of classes must be specified for the
fuzzy c-means algorithm. For comparison purposes, runs were undertaken for all 101
sets of catchment characteristics using both two and three classes. The cluster centres,
which for this approach may also be interpreted as RRCs, are presented in Table 2, and
the numbers of sites falling into each class are summarized in Table 3.
When two classes were used, the algorithm converged in 11 iterations, giving an
Fc value of 0.665. Since the latter value exceeds 0.5 by some margin, there is plainly
unshared membership in the data set between the two classes. Hardening the partition
matrix using both maximum membership and minimum Euclidean distance resulted in
the allocation of all sites to the same classes. Table 3 shows that the two classes had 67
and 34 members respectively, which according to Table 2 broadly represent smaller,
steeper catchments with 1700 mm of mean annual rainfall and a soil index of 0.43, and
larger, flatter drainage areas with lower mean annual rainfalls and lower soil indices.
Of particular interest is the number of instances of shared membership between the two
classes. When membership levels above 0.42 but below 0.58 were treated as ambiguous,
eleven such cases were identified among the 101 data sets.

Table 1 Classification of sites by Kohonen network, with numbers allocated and the characteristics of
the Representative Regional Catchments for ten potential classes.

Class           Number     Representative regional catchments
(output node)   of sites   AREA     MSL      S1085    SAAR    SOIL
1               21         72.3     15.18    16.64    1621    0.416
2               4          98.7     18.76    14.60    1561    0.408
3               0          -        -        -        -       -
4               4          100.6    18.73    14.44    1549    0.407
5               17         105.9    19.62    12.81    1449    0.381
6               14         203.3    29.61    7.65     1279    0.350
7               0          -        -        -        -       -
8               1          205.8    29.65    7.54     1244    0.344
9               17         207.7    29.82    7.42     1225    0.339
10              23         250.2    34.37    6.26     1223    0.337

Table 2 Characteristics of Representative Regional Catchments for different algorithms.

Characteristic    I        II       III
Kohonen network analysis:
AREA              58.6     127      272
MSL               13.5     22.6     35.6
S1085             21.4     9.68     5.72
SAAR              1893     1322     1175
SOIL              0.46     0.37     0.32
Fuzzy c-means analysis (two classes):
AREA              97.3     208      -
MSL               17.5     30.3     -
S1085             17.1     7.07     -
SAAR              1711     1209     -
SOIL              0.43     0.34     -
Fuzzy c-means analysis (three classes):
AREA              63.7     127      312
MSL               13.8     22.0     40.6
S1085             19.7     8.06     6.95
SAAR              1831     1148     1390
SOIL              0.45     0.33     0.36

Table 3 Numbers of sites allocated to each class for different algorithms.

Method                       I     II    III
Kohonen network              25    35    41
Fuzzy c-means (2 classes)    67    34    -
Fuzzy c-means (3 classes)    25    47    29

When the number of classes was increased to three, convergence required 35
iterations, and the Fc value was 0.546, well over the figure of 0.333 which would
indicate an ambiguous classification. Hardening by maximum membership and
minimum distance again resulted in the same class allocations. Table 3 shows that the
third class which emerged obtained almost as many members as the other two classes
combined. Moreover, Table 2 shows that the RRC characteristics are remarkably
similar to those identified by the Kohonen network. Ambiguity in the classifications,
as indicated by roughly equal membership of all classes, is obviously less likely as the
classes themselves become better differentiated, as indicated by the classification
metric. However, one site out of the 101 proved to have almost equal membership of
all three classes. Where the membership levels of two of the three classes were within
0.1 of each other, the results were regarded as ambiguous between those two classes.
Using this definition, only four cases were identified: two class II ambiguous with
class I, one class III ambiguous with class II and one class II ambiguous with class III.
Comparing the catchment characteristics of these sites with those of the RRCs showed
that the major differences tended to be associated with the SAAR and SOIL values,
which were notably either higher or lower than those associated with the designated
cluster centres. There is therefore probably insufficient variety in the possible
combinations of these features within the selected data set.
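The two working definitions of ambiguity used above can be written as simple tests on a single column of the partition matrix. The Python sketch below is an interpretation rather than the authors' code: in the three-class case it assumes that the comparison is made between the two largest membership levels, and the function names and example membership values are hypothetical.

```python
def ambiguous_two_class(column, lo=0.42, hi=0.58):
    """Two classes: ambiguous if both memberships lie between lo and hi."""
    return all(lo < u < hi for u in column)

def ambiguous_pair(column, tol=0.1):
    """Return the two leading classes if their memberships differ by less
    than tol, otherwise None (assumption: only the top two are compared)."""
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    first, second = order[0], order[1]
    if column[first] - column[second] < tol:
        return (first, second)
    return None

# Hypothetical membership columns for individual sites.
flag_a = ambiguous_two_class([0.55, 0.45])
flag_b = ambiguous_two_class([0.70, 0.30])
pair_a = ambiguous_pair([0.45, 0.40, 0.15])
pair_b = ambiguous_pair([0.80, 0.10, 0.10])
```

Restricting the comparison to the two leading classes avoids flagging a site whose two minor memberships happen to be similar while the third is dominant.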
Comparison between algorithms
Tables 2 and 3 demonstrate that, although the characteristics of the RRCs identified by
the two algorithms were in reasonable agreement, the allocations of sites differed by
about 25% in two out of the three classes. Of the 101 sites, 68 were allocated to the
same class by both the Kohonen network and the method of fuzzy c-means. More
particularly, of the 41 sites allocated to class III by the Kohonen network, 19 were also
similarly identified using fuzzy c-means, but the remaining 22 were placed in class II
by the latter algorithm. Class I (the smallest catchments) had 24 out of 25 sites in
common, the odd one within the Kohonen network list being placed in class III by
fuzzy c-means. Of the 35 Kohonen network results in class II, 25 were similarly placed
by the method of fuzzy c-means, with nine in class III and one in class I.
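The counts quoted in this paragraph can be arranged as a cross-tabulation and checked for internal consistency. The Python sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Rows: Kohonen network classes I, II, III.
# Columns: fuzzy c-means (three classes) classes I, II, III.
cross_tab = [
    [24, 0, 1],    # Kohonen class I
    [1, 25, 9],    # Kohonen class II
    [0, 22, 19],   # Kohonen class III
]

kohonen_totals = [sum(row) for row in cross_tab]
fuzzy_totals = [sum(col) for col in zip(*cross_tab)]
agreement = sum(cross_tab[i][i] for i in range(3))
total_sites = sum(kohonen_totals)
agreement_fraction = agreement / total_sites
```

The row and column totals reproduce the class sizes reported for the two methods (25, 35 and 41 for the Kohonen network; 25, 47 and 29 for fuzzy c-means), and the diagonal gives the 68 sites on which the methods agree, about two-thirds of the sample.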
A notable difference between the characteristics of the RRCs obtained by the
two methods evident in Table 2 is the variation between classes of the SAAR and
SOIL values. As noted already, the Kohonen network results are essentially
monotonic in terms of the variations in catchment characteristics across the classes,
as illustrated in Table 1. In contrast, the class II results obtained by fuzzy c-means
show SAAR and SOIL values that are smaller than those for class III. These
differences can be attributed primarily to the manner in which the fuzzy c-means
algorithm computes the cluster centres, $v_i$, taking into account the membership
levels (see equations (7) and (8)).
In summary, the most distinctive classification identified by both algorithms was
class I, the smaller, steeper catchments with high SOIL and high average annual
rainfall. However, allocations to the other two classes displayed less agreement, with
poorer demarcation between what constituted a "larger" catchment and which sites
could be regarded as "intermediate". This difficulty could obviously be alleviated
partly by increasing the number of sites to provide a better sample of the regional
variations in catchment features, and partly by the introduction of additional
characteristics, possibly relating to land use and vegetal cover.
CONCLUDING REMARKS
Although the standardization of the features prior to analysis ensured that undue
weight was not attributed to the features with the highest absolute numbers, this
precaution did not affect any correlation that might be evident between individual
features. The catchment characteristics employed in this exercise included both AREA
and MSL, which are widely known to be related by Hack's Law (see Rigon et al.,
1996). Indeed, for the data used, MSL was proportional to AREA raised to the power
0.58, with an explained variance of 85%. Nevertheless, the ambiguous membership
values for two classes produced by the fuzzy c-means analysis tend to demonstrate that
the correlations between features are less important than the need to sample as wide a
spread as possible of combinations of features within the data set used for
classification.
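A power-law relation of the Hack's Law type is conventionally fitted by linear regression in logarithmic space. The Python sketch below shows the calculation; the function name is an assumption, and the data are the AREA and MSL values of the eight Representative Regional Catchments listed in Table 1, used only because the original site data are not reproduced here. Because these are trained network weights rather than the 101 observed catchments, the fitted exponent should not be expected to reproduce the value of 0.58 quoted above.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b by regressing ln(y) on ln(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                       # exponent
    a = math.exp(my - b * mx)           # coefficient
    r2 = sxy ** 2 / (sxx * syy)         # explained variance in log space
    return a, b, r2

# AREA and MSL of the eight non-empty output nodes in Table 1.
area = [72.3, 98.7, 100.6, 105.9, 203.3, 205.8, 207.7, 250.2]
msl = [15.18, 18.76, 18.73, 19.62, 29.61, 29.65, 29.82, 34.37]

a, b, r2 = fit_power_law(area, msl)
```

A high explained variance in such a fit indicates that AREA and MSL carry largely overlapping information, which is the correlation between features referred to above.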
The reasonable agreement between the features of the RRCs defining the cluster
centres on which the classes were based is encouraging, and demonstrates once again
that, in hydrological terms, combinations of catchment characteristics are perhaps a
more logical basis for regionalization than geographical proximity. The next step in the
development of a regionalization procedure would be to relate the magnitude of flood
quantiles (or the parameters of a frequency distribution common to all sites) to the
characteristics of the catchment. As reported elsewhere by Hall & Minns (1998), such
a relationship can be developed by supervised learning with a standard three-layer,
feed-forward ANN.
Ideally, a classification algorithm should require as little subjective judgement as
possible on the part of the analyst. Of the two methodologies examined above, the
Kohonen network selects the number of clusters as well as allocating each site to a
cluster. In contrast, the fuzzy c-means method requires a priori knowledge of the
number of clusters, but draws attention to "borderline" cases having significant
membership levels in more than one class. There is therefore scope for a hybrid
approach in which the Kohonen network is employed to identify the number of
clusters, and perhaps the preliminary allocation of sites to clusters, and fuzzy c-means
is then used to refine the allocation, taking into consideration the membership levels.
Of course, such an approach raises several further questions. Should a site that
is ambiguous between two classes be included in both, or only in that for which
the membership is a maximum? And if the former, should the quantile estimates be
weighted sums of those obtained separately for each class, perhaps utilizing the
membership levels as weights? These questions are the subject of continuing study.
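One possible reading of the hybrid approach, and of the weighting question, is sketched below in Python. It is an illustration under stated assumptions and not the procedure used in this study: the initial centres merely stand in for the output of a Kohonen network, the fuzzifier m = 2 is a conventional default, and the site features and class quantile values are invented.

```python
import numpy as np


def fcm_memberships(X, centres, m=2.0):
    """Fuzzy c-means membership of each site (row of X) in each centre."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    # Euclidean distance from every site to every centre
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # guard against a site lying exactly on a centre
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1


def fuzzy_c_means(X, init_centres, m=2.0, n_iter=100, tol=1e-6):
    """Refine cluster centres by alternating membership and centre updates."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(init_centres, dtype=float).copy()
    for _ in range(n_iter):
        w = fcm_memberships(X, centres, m) ** m
        new_centres = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:
            break
    return centres, fcm_memberships(X, centres, m)


def weighted_quantile(memberships, class_quantiles):
    """Membership-weighted sum of the quantiles estimated for each class."""
    return np.asarray(memberships) @ np.asarray(class_quantiles, dtype=float)


# Hypothetical standardized features for seven sites; the last is borderline
X = np.array([[-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
              [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [0.0, 0.0]])
# Stand-in for a preliminary allocation from a Kohonen network (assumed)
init_centres = np.array([[-0.5, -0.5], [0.5, 0.5]])

centres, u = fuzzy_c_means(X, init_centres)
class_q = [120.0, 300.0]  # invented flood quantiles for the two classes
print(np.round(u, 3))
print(np.round(weighted_quantile(u, class_q), 1))
```

In this sketch a site with nearly equal membership in both classes receives an estimate close to the mean of the two class quantiles, whereas a site firmly within one class receives an estimate close to that class's value. Whether such weighting is hydrologically preferable to allocation by maximum membership is precisely the open question raised above.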
REFERENCES
Acreman, M. C. & Sinclair, C. D. (1986) Classification of drainage basins according to their physical characteristics: an
application for flood frequency analysis in Scotland. J. Hydrol. 84, 365-380.
Aleksander, I. & Morton, H. (1990) An Introduction to Neural Computing. Chapman & Hall, London.
Beale, R. & Jackson, T. (1990) Neural Computing: An Introduction. Institute of Physics, Bristol, UK.
Bhaskar, N. R. & O'Connor, C. A. (1989) Comparison of method of residuals and cluster analysis for flood
regionalization. J. Wat. Resour. Plan. Manage. 115(6), 793-808.
Burn, D. H. (1989) Cluster analysis as applied to regional flood frequency. J. Wat. Resour. Plan. Manage. 115(5), 567-582.
Burn, D. H. (1990) An appraisal of the "region of influence" approach to flood frequency analysis. Hydrol. Sci. J. 35,
149-165.
Chatfield, C. & Collins, A. J. (1980) Introduction to Multivariate Analysis. Chapman & Hall, London.
Dawson, C. W. & Wilby, R. (1998) An artificial neural network approach to rainfall-runoff modelling. Hydrol. Sci. J.
43(1), 47-66.
Gustard, A., Roald, L. A., Demuth, S., Lumadjeng, H. S. & Gross, R. (1989) Flow Regimes from Experimental and
Network Data (FREND), vol. II, Hydrological Data. Institute of Hydrology, Wallingford, UK.
Hall, M. J. (1981) A historical perspective on the Flood Studies Report. In: Flood Studies Report—Five Years On.
11-16. Thomas Telford, London.
Hall, M. J. & Minns, A. W. (1998) Regional flood frequency analysis using artificial neural networks. Proc.
Hydroinformatics '98, 3rd Int. Conf. on Hydroinformatics (Copenhagen, Denmark), vol. 2, 759-763. Balkema,
Rotterdam, The Netherlands.
Hawley, M. E. & McCuen, R. H. (1982) Water yield estimation in western United States. Proc. Am. Soc. Civ. Engrs J.
Irrig. Drain. Div. 108(IR1), 25-34.
Hosking, J. R. M. & Wallis, J. R. (1993) Some statistics useful in regional frequency analysis. Wat. Resour. Res. 29(2),
271-281.
Klir, G. J. & Yuan, B. (1995) Fuzzy Sets and Fuzzy Logic. Theory and Applications. Prentice Hall PTR, Upper Saddle
River, New Jersey, USA.
Melssen, W. J., Smits, J. R. M., Buydens, L. M. C. & Kateman, G. (1994) Using artificial neural networks for solving
chemical problems, part II: Kohonen self-organizing feature maps and Hopfield networks. Chemometrics and
Intelligent Laboratory Systems 23, 267-291.
Minns, A. W. & Hall, M. J. (1996) Artificial neural networks as rainfall-runoff models. Hydrol. Sci. J. 41(3),
399-417.
Mosley, M. P. (1981) Delimitation of New Zealand hydrologic regions. J. Hydrol. 49, 173-192.
Nathan, R. J. & McMahon, T. A. (1990) Identification of homogeneous regions for the purposes of regionalization.
J. Hydrol. 121, 217-238.
NERC (Natural Environment Research Council) (1975a) Flood Studies Report, vol. I, Hydrological Studies. NERC,
London.
NERC (Natural Environment Research Council) (1975b) Flood Studies Report, vol. IV, Hydrological Data. NERC,
London.
Rigon, R., Rodriguez-Iturbe, I., Maritan, A., Giacometti, A., Tarboton, D. G. & Rinaldo, A. (1996) On Hack's Law. Wat.
Resour. Res. 32(11), 3367-3374.
Ross, T. J. (1995) Fuzzy Logic with Engineering Applications. McGraw-Hill, New York.
Wiltshire, S. E. (1985) Grouping basins for regional flood frequency analysis. Hydrol. Sci. J. 30, 151-159.
Wiltshire, S. E. (1986a) Regional flood frequency analysis, I: homogeneity statistics. Hydrol. Sci. J. 31, 321-333.
Wiltshire, S. E. (1986b) Regional flood frequency analysis, II: multivariate classification of drainage basins in Britain.
Hydrol. Sci. J. 31, 335-346.
Received 2 October 1998; accepted 10 February 1999