definition of homogeneous regions through entropy theory

3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
DEFINITION OF HOMOGENEOUS REGIONS THROUGH ENTROPY
THEORY
By
Maura Rianna(1), Elena Ridolfi(1), Luca Lorino(1), Leonardo Alfonso(2), Valeria Montesarchio(1),
Giuliano Di Baldassarre(2), Fabio Russo(1) and Francesco Napolitano(1)
(1)
Dipartimento di Ingegneria Civile, Edile e Ambientale, Sapienza Università di Roma, Italy ([email protected],;
[email protected]; [email protected]; [email protected]; [email protected] )
(2)
Department of Integrated Water Systems and Governance - Hydroinformatics, UNESCO-IHE Institute for Water Education, Delft,
The Netherlands ([email protected]; [email protected] ).
ABSTRACT
An important topic for regional flow frequency analysis is the definition of regions, within which the
hydrological information can be transferred. In particular, a homogeneous region is a set of catchments, that
can be considered homogeneous in terms of hydrologic response and statistical distribution of the
hydrological extremes. In the scientific literature different techniques are presented and utilized to group
catchments, such as the so called region of influence, the canonical correlation analysis, the correspondence
analysis and the L-moments. The first two methods are used to determine the region of influence (ROI) of a
particular drainage basin. For these procedures each target site is assumed to have its own homogeneous
region. The last two methods are used to identify a fixed region.
In this work the issue of identifying an homogeneous region is handled by a standard method (ROI defined
by Euclidean distance) and by a new entropy based approach. Since basins are defined homogeneous if
sharing redundant information, they are characterized by an high value of total correlation (i.e. a measure of
redundancy). On the other hand the grouped basins must provide a relevant amount of information, that is
measured through their joint entropy (i.e. measure of joint information). In this framework the cluster
problem is faced through the definition of a new index (named SNI) that represents the rate between the
redundant information and the joint information provided by couple of stations. The higher the value of SNI,
the higher the relationship between the two considered stations. From the analysis of the SNI values it is
possible to associate to each station the most related ones.
Statistical tests are then carried out to establish whether the grouped data arises from a common regional
distribution.
Keywords: regionalization, entropy, homogeneous region, statistical analysis, ungauged basins.
1
INTRODUCTION
The aim of regionalization is to estimate more precise flow quantiles in correspondence of gauging stations
that are characterized by a lack of data or to calculate them in ungauged basins. The first step for the
estimation of flood probabilities at a regional scale is the identification of homogeneity of basins under
study, so that it is possible to infer information from region sites to others ungauged. A suitable choice of the
method for grouping variables is essential. Many researchers have explored various ways of delineating a set
of gauging stations that can be considered constituting a region with sufficient homogeneity in extreme flow
characteristics. Regions are generally created using some measures of similarity between catchments and
Definition of homogeneous regions through entropy theory
1
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
through different approaches such as the method of region of influence (Burn, 1990a,b), the canonical
correlation analysis (Ribeiro-Correa et al., 1995; Cavadias et al., 2001), the method of residuals (Thomas
and Benson, 1970; Glatfelter, 1984) and the cluster analysis (Mosley, 1981; Tasker, 1982; Acreman and
Sinclair, 1986; Jingyi and Hall, 2004; Rao and Srinivas, 2006a,b).
The method of Region of Influence (ROI) and the canonical correlation analysis are usually used as
alternative approaches to the fixed regions method and they are used to estimate for every site a region that
contains all sites that can be used to estimate its frequency distribution. This method is applied when a
continous transition is necessary between regions, but it is less simple to be used than the fixed regions
method, because the user must evaluate every time the homogeneous region for the site of interest and
calculate for it the growth curve. Literature has largely studied this method, as it is possible to see from the
big numer of papers on this topic (Zrinji and Burn, 1994, 1996; Tasker et al., 1996; Burn, 1997; Castellarin
et al., 2001; Holmes et al., 2002; Merz and Bloschl, 2005).
The aim of this work is to compare the reliability of the ROI standard method (Burn, 1990a,b) with a new
entropy based approach for cluster analysis. Thanks to its versatility, entropy has been applied to hydraulic
and water resources fields (Singh, 1997), a discussion about its advantages can be found in Wei (1987).
Since data collection is at the basis of hydrologic applications, several authors faced the issue of optimizing a
hydraulic network trhough an entropy based approach (e.g. Krstanovic and Singh, 1992 and Yoo et al., 2008)
and the evaluation of the maximum non redundant information content (Ridolfi et al., 2011). In the
assessment of groundwater wells network, Mogheir and Singh (2002) and Mogheir et al., (2004) evaluated
the transferred information between couple of wells, thorugh the directional information transfer index
(DIT). The DIT is the rate between the transferred information (as measure of redundancy between
variables) and the marginal entropy of one variable. It is the information rate that can be inferred from a
given variable about another one. Rajsekhar et al. (2010) and Yang and Burn (1994) applied DIT for
regionalisation tasks: the first one applied it for grouping homogeneous cells of a basin case of study, while
the latter used it for evaluating the information connection between flow gauging stations, in order to cluster
them. As suggested by Yang and Burn (1994), a direct application of DIT is the regionalisation of stations
belonging to a network. Since DIT is not symmetric, it is possible to evaluate the relationship between
couple of variables only, and in particular it is not a bijective function. In this work the shared normalized
information (SNI) index is used for evaluating the connection between variables. This index represents the
rate between the common information (i.e. the total correlation) between two flow gauging stations and their
joint information, represented by their joint entropy. The total correlation index has widely applied for
clustering purposes for monitoring location (Alfonso et al., 2010a), for analysing the mutual dependence of
attributes (Jakulin and Bratko, 2004). Once that SNI is computed for each couple of stations, the
correspondent symmetric matrix is built and it is possible to evaluate which stations are more related to a
given station. This new method, inspired by the ROI, allows the creation of region of influence in terms of
redundant information.
The paper is organized as follows: first an overview on the ROI method and on entropy theory is given.
Second, the case study is presented and gauging stations are grouped in region of influence using both the
entropy theory and the Euclidean distance method (Burn, 1990). Then statistic tests are used to evaluate the
robustness of ROIs created through the two methods.
2
ROI METHOD THROUGH EUCLIDEAN DISTANCE EVALUATION
Cluster analysis collects data into clusters by joining together objects with similar attribute data.
The ROI technique is employed as the basis to cluster catchments. The ROI method permits to form a
potentially unique collection of stations for each catchment, that can be used for the calculation of the flow
quantiles at the site of interest.
The main point of ROI approach to regionalization is the selection of a method for evaluating the similarity
between one station and others. In this work the closeness between N basins is evaluated by the Euclidean
distance (Burn, 1990) in the M space, where M represents the variables used to define spatial similarity. It is
possible to define a distance measure as:
Dij =
[∑
M
m=1
(X
i
m
− X mj
)]
2 1/ 2
Definition of homogeneous regions through entropy theory
1
2
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
where Dij is the distance between catchments i and j, and X mi and X mj are the values of attributes m for
catchments i and j. The result of this method is a N x N symmetrical matrix.
The choice of the number of sites that falls into a region is done through homogeneity tests. Then for all the
stations clustered the test is carried out from the maximum number of stations to the minimum number that
satisfies the test.
3
BRIEF OVERVIEW ON ENTROPY THEORY
The entropy theory was introduced by Shannon (1948) in the communication field, starting from its
thermodynamic definition. It is defined as the the uncertainty associated to the outcome of a certain event
(Papoulis, 1991). In this work an entropy based approach is used for identifying homogeneous regions. Since
the shared information is the key factor for determining an homogeneous region, stations are clustered on the
basis of their redundant information. The entropy of a random vector RV X can be seen as its information
content. Papoulis (1991) defines the marginal entropy of a discrete type RV as:
H ( X ) = −∑ pi log 2 pi
2
where P{X = xi}=pi. It is measured in bit as the base of the logarithm is 2. Considering 2 RVs, X and Y, their
joint entropy is:
i
H ( X , Y ) = −∑∑ pi , j log 2 pi , j
i
j
3
where pij is the joint probability of the two variables. The conditional entropy of X to Y is:
H ( X | Y ) = −∑∑ p i, j log 2
pi , j
pj
4
It represents the amount of information of Y that can be inferred from X. The redundant information among
couple of stations is measured through the coefficient of non transferred information:
i
j
T ( X , Y ) = H ( X ) − H ( X | Y ) = H ( X ) + H (Y ) − H ( X , Y )
5
where H(X) and H(Y) are marginal entropy of two RVS X and Y, while H(X,Y) is their joint entropy. It
quantifies the amount of information of a variables (e.g. Y) contained in X. (Cover and Thomas, 1991) and it
can be seen as the reduction of uncertainty of X, once that Y is known, (Alfonso et al., 2010a).
From the definition of T, several authors (e.g. Rajsekhar et al., 2010; Mogheir et al., 2004; Yang and Burn,
1994) employed a directional information transfer index (DIT) for measuring the relationship between each
pair of stations:
DIT( X , Y ) =
T ( X ,Y )
H(X )
6
DIT(X,Y) is the fraction of information about station Y that can be inferred from X. It is not symmetrical,
thus, DIT(X,Y) is not equal to DIT(Y,X).
In this work, the issue of identifying which stations are more related to another one is faced using a
coefficient of shared information normalized by the joint information provided by the considered couple.
This normalization allows to evaluate how much information is in common between stations in respect to the
maximum information at their disposal, represented by their joint entropy.
This new index, introduced for the fisrt time, is the shared normalized information and is defined as:
SNI ( X ,Y ) =
TotC( X ,Y )
H ( X ,Y )
7
where TotC is the total correlation coefficient, defined as (Watanabe, 1960):
N
TotC ( X 1 , X 2 ,... X N ) = ∑ H ( X i ) − H ( X 1 , X 2 ,.. X N )
i =1
8
It can be noted that T is a particular case of TotC for two RVs:
TotC( X , Y ) = H ( X ) + H (Y ) − H ( X , Y )
9
Definition of homogeneous regions through entropy theory
3
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
The total correlation coefficient represents the amount of information shared by N variables (i.e. number of
gauging stations), taking into account multivariate dependencies. While the physical meaning of DIT is the
fraction of information transferred from one station to another (Yang and Burn, 1994), the SNI represents the
fraction of shared information in respect to the total amount of information that the stations provide
considered as a unique element. The advantage of using ToTC istead of T consists in the possibility of
assessing the total amount of information shared by N variables, without taking into account any partial
dependencies among variables (i.e. conditional entropy). In Figure 1, the concepts of marginal and joint
entropy and ToTC are graphically explained.
Figure 1 – (a) Three circles representing the marginal entropy of variables X, Y and Z; (b) joint entropy H(X,Y,Z) of
the three variables; (c) areas of shared information. Total correlation is given by the summation of the three secondorder dependencies and the third order dependency.
In order to evaluate the jont entropy of Eq.3, the joint probability is computed subdividing each variables in
the same classes, then to each class is associated a number. In this case study X and Y are the data series of
the maximum annual discharge values registered by the the two correspondent stations.
For every variable, to each component falling into a class is assigned the numerical value associated to that
class. For instance, a vector X is given in the form: X=(1, 4, 5, 2, 7, 2). The joint probability of two vectors X
and Y is given as the marginal entropy of the conglomeration of X and Y, thus if Y=(2, 4, 3, 5, 8, 5), the
conglomerate vector is Z=(12, 44, 5,3, 25, 78, 25). The probability of each component is given by their
frequency of occurrence. For the grouping property of mutual information H(X,Y)= H(Z), (Kraskov et al.,
2005). An extensive explanation of this method is given by Alfonso et al., (2010b).
Once that the joint entropy among each station and every others is computed, it is possible to build a
symmetrical matrix. Each component is the SNI index for each couple of stations.
This matrix represents the amount of information that is shared between each station and every others in
respect to the total information at their disposal: their joint entropy. Sorting in descendent order SNI for each
station, it is possible to determine the region of influence of that particular station: it is possible to identify
which variables are more related to that one taken as the centrum of the region (i.e. that one with SNI=1).
4
HOMOGENEITY TESTS
After Euclidean method and Entropy grouping method are applied a test is requested to evaluate if data
observed at the different sites in a homogeneous region arise from a common regional distribution. If the test
fails the region cannot be considered homogeneous. The tests used are the two homogeneity tests of Hosking
and Wallis (1993).
These measures estimate the degree of heterogeneity of a group of sites. Specifically, the heterogeneity tests
compare the variation among sites in sample L-moments to the value expected for an homogeneous region
(Hosking and Wallis, 1993). To calculate the test is necessary to:
Definition of homogeneous regions through entropy theory
4
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
•
Evaluate the L-moments (t = L-CV, t3 = L-Skewness, t4 = L-Kurtosis) of the sample of the i-site of
the study region (made by k – sites);
• Estimate the regional L-moments , , , as the mean of the sites L-moments;
• Calculate the value V, that is the weighted standard deviation formulation of the at-site sample L
CVs:
(
= ∑
− / ∑
10
(
• Evaluate the 4 parameters of a kappa distribution from the regional L-moments , , , and
generate a big number of simulations, Nsim, of k samples.
The i-site samples of the simulated region is created through the kappa distribution and it has the same length
ni of the real one. From each simulated region it is possible to calculate the V value. At the end, it is possible
to obtain Nsim number of V values and then the mean and the standard deviation of the Nsim values of V (
.
From this it is possible to calculate the heterogeneity measure:
H1 = ( 11
Hosking and Wallis (1993) suggest that the region may be assumed as “acceptably homogeneous” if H1 < 1,
“possibly homogenous” if 1 < H1 < 2, and “definitely heterogeneous” if H1 > 2.
The H1 statistic measures heterogeneity only in the dispersion of the samples, since it is based uniquely on
the differences between the sample L-CVs in the region. As such, it is unaffected to heterogeneity that get up
between sites having equal L-CV but different L-Skewness. Hosking and Wallis (1993) also give an
alternative heterogeneity measure (that we call H2), in which V of Eq. 10 is replaced by:
/
(
V = ∑+ n !"t ( − t̅% + 't − t( ) *
,∑+ n
12
The test statistic in this case becomes:
H2 = (. /0. 10.
13
with similar acceptability limits as the H1 statistic. Hosking and Wallis (1997) judge H2 to be inferior to H1
and say that it rarely produces values larger than 2 even for grossly heterogeneous regions.
5
CASE STUDY
The catchments considered in this study (Fig. 2) belong to the Central Italy area.
The geological asset of the study area is the result of a long process. This geological process has influenced
the flow regime of the area. It is constituted of three different morphologic envinroments:
•
The appenninic dorsal, that divides basins with river mouth in the Adriatic sea, and in the Tirrenic sea.
It is formed by carbonatic reliefs, that arrive to the sea;
•
The Tiber river plan;
•
The volcanic area of Amiata Mount, Vulsini, Cimini, Sabatini, Albani Mounts. This area has large
permeability, with a boundary of impervious rocks, that create a big natural reservoir that feeds the springs.
For this study 23 hydrometric stations are considered. For simplicity to each station is associated an ID code
from 1 to 23, Table I. Data used, obtained by the VAPI project, are the maximum annual flows calculated by
the mean daily flows. For all the stations 31 contemporary years of data were available, starting from 1934 to
1940 and from 1950 to 1973. These stations were chosen because their registrations were contemporary,
providing data series of the same length. This characteristic is necessary for applying the entropy method,
thus in case of gaps, interpolation methods are necessary for filling missing values (Dialynas et al. 2010;
Koutsoyiannis, 2002, 2010)
Definition of homogeneous regions through entropy theory
5
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
Figure 2 – Case study area (Latium, Tuscany, Umbria, Marche, Abruzzo) and hydrometric stations.
To evaluate the region of influence for all stations with the Euclidean distance method, different variables are
used. In particular, the basin area (km2), the pervious percentage (%), the maximum and mean altitude (m),
the latitude (m), longitude (m) and the mean of the maximum annual flows (m3/s). Variables used for each
station are shown in Table II.
Table I – IDs of the considered hydrometric gauging stations.
Stations
IDs
Stations
IDs
Stations
IDs
Stations
IDs
Lunghezza
1
Subiaco
8
Maraone
15
Traponzo
22
Bettona
2
Terria
9
Molina
16
M.Amiata
23
Felcino
3
Torgiano
10
Montenero
17
Morrano
4
T.Orsina
11
S.Pellegrino
18
P.N.Tevere
5
P.Ferrovia
12
S.Vito
19
Ripetta
6
Arno
13
Teramo
20
S.Lucia
7
Cannucciaro
14
Treponti
21
Table II – Geomorphologic variables.
Station
1
2
3
Area
(km2)
1115,0
1220,0
2033,0
Pervious
(%)
76,0
56,0
4,0
Hmax
(m.s.m.)
2176,0
1570,0
1454,0
Definition of homogeneous regions through entropy theory
Hmean
(m.s.m.)
523,0
552,0
528,0
Latitude
(m)
306668,9
297429,6
291443,2
Longitude
(m)
4644528,0
4766676,5
4778275,7
Flow
(m3/s)
150,7
126,2
415,4
6
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
422,0
4147,0
16545,0
934,0
233,0
2076,0
1956,0
1445,0
1272,0
58,0
430,7
2003,0
1303,0
32,0
213,0
50,0
147,0
114,0
851,0
580,0
0,0
16,0
32,0
16,0
100,0
72,0
32,0
98,0
3,0
52,0
57,0
70,0
60,0
81,0
66,0
5,0
5,0
34,0
17,0
4,0
1148,0
1570,0
2487,0
1454,0
2176,0
2487,0
1570,0
2422,0
1056,0
2914,0
1570,0
2795,0
2532,0
1260,0
2750,0
460,0
2435,0
1616,0
1053,0
1148,0
349,0
523,0
524,0
580,0
1100,0
970,0
530,0
1014,0
409,0
1950,0
616,0
1080,0
1120,0
1080,0
1200,0
178,0
930,0
1026,0
340,0
445,0
264798,8
290894,8
290580,5
276544,8
341792,9
318798,0
292533,0
314558,2
1728495,0
872005,0
845190,0
899637,0
892978,0
923822,0
906273,0
948602,0
902195,0
848642,0
736862,0
694818,0
4738132,0
4766131,3
4642487,2
4811461,0
4643224,2
4701421,7
4767317,7
4715726,7
4816533,0
4722441,0
4798922,0
4681189,0
4676143,0
4634871,0
4708599,0
4697845,0
4754610,0
4712469,0
4693515,0
4771355,0
86,7
595,7
1214,6
217,9
37,4
123,0
274,1
65,3
172,4
19,3
53,8
65,3
39,9
11,8
28,5
10,2
38,4
13,7
99,8
115,5
For each site the similarity with other stations is evaluated in the parameters space. The homogeneity is
tested starting from the 7 more similar stations until the 2 more ones.
On the other hand, the entropy-based clustering method is performed considering as random vectors the
series of the maximum annual discharge of each gauging station. Through the evaluation of the SNI index,
Eq.7, stations are clustered forming ensenbles from 2 to 7 elements.
The choice of the number of sites that fall into a region is done through homogeneity tests. Then for all the
clustered stations, evaluated with both approaches, the test is done on the growth factor that is the maximum
annual discharge standardized by the mean regional discharge.
The test is carried out from the closest to farest station until it is passed.
6
RESULTS AND DISCUSSION
The entropy based approach has led to the SNI matrix and thus, to the determination of 23 regions, one for
each gauging station. Applying the traditional Euclidean method, the matrix of the Euclidean distance
between each station and every others is computed. In Tables III and IV the regions of influence evaluated
with the second and the first method and tests results, respectively, are presented for four stations only.
For instance in Table III are reported stations belonging to the region of influence of Lunghezza, Ripetta,
Felcino and Morrano. Concerning Lunghezza, the closer stations (in terms of SNI) is the number 6 (from
Table I, it is Ripetta), then there is the number 8 (i.e. Subiaco) and so on up to the 7th station that is the
number 2 (Bettona).
For every region of influence, the Hosking and Wallis tests are carried out, considering an increasing number
of stations from 2 to 7. In this way it is possible to know how many stations belong to a particular region.
The first column reports the ID of the main station, while the columns from the 2nd to the 7th give the IDs of
the other stations from the closest to the 7th farest (in terms of SNI or Euclidean distance). In the last two
columns the results of the two tests are presented. For instance, considering Lunghezza, the Euclidean
method preforms better: with seven stations we obtain a value lower than the threshold value (i.e. 2). On the
contrary, with the entropy method, the test is satisfied with 4 stations in the region only. The same thing
happens considering Felcino as the main station. On the other hand, considering Ripetta, the situation is
Definition of homogeneous regions through entropy theory
7
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
reversed: the SNI gives better results than the other one. Considering Morrano, the cluster analysis is much
better with the entropy method than with the classical Euclidean method: the two tests are satisfied and their
values are far from the threshold value of 2, while with the classical approach, after 5 stations the tests are
not passed.
Table III – Values of the two homogeneity tests (H1 and H2) for 4 stations taken as an example. The fisrt column
represents the main station of the ROI, the others are the stations ordered from the closest to the farest in terms of
Euclidean distance.
Lunghezza
1
1
1
1
1
1
Ripetta
6
6
6
6
6
6
Felcino
3
3
3
3
3
3
Morrano
4
4
4
4
4
4
2nd
6
6
6
6
6
6
2nd
1
1
1
1
1
1
2nd
10
10
10
10
10
10
2nd
5
5
5
5
5
5
3rd 4th 5th 6th 7th Test H1
-0.75
8
-0.95
8
9
-0.51
8
9
11
-0.64
8
9
11 4
1.28
8
9
11 4
2
1.20
3rd 4th 5th 6th 7th Test H1
-0.75
8
-0.94
8
9
-0.52
8
9
11
-0.68
8
9
11 4
1.19
8
9
11 4
5
1.52
3rd 4th 5th 6th 7th Test H1
1.43
5
0.86
5
2
1.14
5
2
7
1.23
5
2
7
4
2.36
5
2
7
4
11 2.23
3rd 4th 5th 6th 7th Test H1
2.33
10
1.38
10 2
1.00
10 2
3
1.82
10 2
3
11
2.12
10 2
3
11 9
1.79
Test H2
1.59
0.94
0.41
2.14
1.83
1.74
Test H2
1.69
0.92
0.41
2.25
1.65
3.24
Test H2
1.25
1.27
1.72
1.18
2.22
2.50
Test H2
2.67
1.52
0.83
2.03
3.10
3.30
Table IV – Values of the two homogeneity tests (H1 and H2) for 4 stations taken as an example. The fisrt column
represents the main station of the region of influence evaluated with the entropy method, the others are the stations
ordered from the closest to the farest in terms of SNI.
Lunghezza
1
1
1
2nd 3rd 4th 5th 6th 7th Test H1 Test H2
16
-1.00
-0.43
16
6
-1.27
0.70
16
6
11
-1.39
2.01
Definition of homogeneous regions through entropy theory
8
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
1
1
1
Ripetta
6
6
6
6
6
6
Felcino
3
3
3
3
3
3
Morrano
4
4
4
4
4
4
16
16
16
2nd
5
5
5
5
5
5
2nd
9
9
9
9
9
9
2nd
9
9
9
9
9
9
6
6
6
3rd
11
11
11
4th
5
-1.73
5
15
0.31
5
15 9
0.84
5th 6th 7th Test H1
-1.02
10
0.84
10 9
0.44
10 9
1
0.38
10 9
1
16
0.16
10 9
1
16 11 0.06
3rd 4th 5th 6th 7th Test H1
0.65
6
0.14
6
5
-0.16
6
5
15
2.05
6
5
15 4
3.78
6
5
15 4
18 5.30
3rd 4th 5th 6th 7th Test H1
0.43
6
1.52
6
10
1.21
6
10 2
0.85
6
10 2
8
0.82
6
10 2
8
23 1.38
2.66
2.76
2.89
Test H2
0.18
0.68
1.58
1.91
1.52
2.34
Test H2
2.53
1.76
1.92
1.96
2.94
3.56
Test H2
-0.72
0.73
0.37
0.04
0.19
0.28
Figure 3 resumes the results showed in Tables III and IV for all stations. The figure shows the percentage of
regions that satisfies the homogeneity tests considering an homogeneous region with an increasing number
of stations from 1 to 7 (expressed by the X axis). For instance, with the Euclidean method, the percentage of
stations that passes the test when the region includes only one station (i.e. the main one) is 17,4%, while
21,7%, passes the tests with maximum two stations in the considered region. With the Entropy method, the
4,3% satisfies the tests with the main station only in the region, while the 8,0% overcomes the tests with two
stations. It is important to underline that 34,8% passes the tests with maximum 7 stations using the entropy
approach, while only 13,0% passes the tests with the other method. Besides, with the standard method the
big percentage of regions passes the tests with maximum 5 stations, while with the Entropy method the
majority of stations satisfies it with 7 stations. This result shows the big potentiality of the entropy method
for regionalization purposes. Indeed this approach performs better in clustering homogeneous regions,
because it creates bigger regions: even considering ensembles of 7 stations there is a large percentage of
groups that satisfy the test. Even if tests have been carried out for each region considering all 23 stations, for
the sake of brevity in this work results are shown up to the 7th station. The choice of showing homogeneous
regions up to the 7th element, depends by the fact that were not found regions homogeneous with more
stations.
Definition of homogeneous regions through entropy theory
9
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
Figure 3 – Percentage of stations that pass the homogeneity tests with different number of stations.
7
CONCLUSIONS
In this work a new entropy based approach is compared to a standard procedure for finding the regions of
influence. The entropy approach consists in the introduction of a new index: the shared normalized
information (SNI), that represents the rate between amount of information shared by couple of stations and
the total amount of information that the two variables provide together (i.e. their joint entropy). This index
represents a novelty in respect with a directional index (DIT) that has been widely applied in literature. The
SNI contribution is the possibility of evaluating the mutual relationship between stations, because it is
symmetrical compared to DIT. Homogeneous regions evaluated through SNI are compared with those
obtained with the Euclidean distance method. Results show that the entropy approach, in respect with the
standard Euclidean one, performs better permitting to obtain homogeneous regions with a bigger ensemble of
stations.
This paper represents a pilot study: a more exhaustive study is necessary, applying the regionalisation
method up to the final step that is the flow quantile calculation for the considered site. Thus future
perspective is to compare results of the two methods in terms of quantiles.
This method requires historical data series of discharge for every considered stations. They have to be of
identical length, thus method of interpolation can be necessary for filling the gaps in data.
Furthermore this approach can be applied only if the discharge values are known for the gauging stations,
thus further studies are on going in order to use it in ungauged basins.
In despite of these limitations the SNI method is useful, because it allows a simple evaluation of similar sites,
being based on the information content of recorded discharge data and it shows good performances
compared to a traditional method.
8
REFERENCES
Acreman, M.C. and Sinclair, C.D. (1986). Classification of drainage basins according to their physical characteristics: An
application for flood frequency analysis in Scotland. Journal of Hydrology, 84:365-80.
Alfonso, L., Lobbrecht, A. and Price, R. (2010a). Information theory-based approach for location of monitoring water level
gauges in polders. Water Resources Research, 46: W03528, doi:10.1029/2009WR008101.
Alfonso, L., Lobbrecht, A. and Price, R. (2010b). Optimization of water level monitoring network in polder system using
information theory. Water Resources Research, 46: W12553, doi:10.1029/2009WR008953.
Burn, D.H. (1990a). An appraisal of the region of influence approach to flood frequency analysis. Hydrological Sciences Journal,
35(2):149-165.
Burn, D.H. (1990b). Evaluation of regional flood frequency analysis with a region of influence approach. Water Resources
Research, 26(10):2257-2265.
Definition of homogeneous regions through entropy theory
10
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
Burn, D.H. (1994). Hydrologic effects of climatic change in West-Central Canada. Journal of Hydrology, 160:53-70.
Burn, D.H. (1997). Catchments similarity for regional flood frequency analysis using seasonality measures. Journal of
Hydrology, 202(1-4): 212-230.
Castellarin, A., Burn D.H., Brath, A. (2001). Assessing the effectiveness of hydrological similarity measures for flood frequency
analysis. Journal of Hydrology, 241(2): 270-285.
Cavadias, G.S., Ouarda, T.B.M.J., Bobee, B. and Girard, C. (2001). A canonical correlation approach to the determination of
homogeneous regions for regional flood estimation of ungauged basins. Hydrological Sciences-Journal~des Sciences
Hydrologiques, 46(4): 499-512.
Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. New York,Wiley.
Cunderlik, J.M. and Burn, D.H. (2003). Non-stationary pooled flood frequency analysis. Journal of Hydrology, 276:210-223.
Dialynas, Y., P. Kossieris, K. Kyriakidis, A. Lykou, Y. Markonis, C. Pappas, S.M. Papalexiou, and D. Koutsoyiannis, Optimal
infilling of missing values in hydrometeorological time series, European Geosciences Union General Assembly 2010,
Geophysical Research Abstracts, Vol. 12, Vienna, EGU2010-9702, European Geosciences Union, 2010.
Glatfelter, D.R. (1984). Techniques for estimating magnitude and frequency of floods on streams in Indiana. US Geological
Survey. Water Resources Investigations Report, 84-4134.
Holmes, M.G.R., Young, A.R., Gustard, A.G. and Grew, R. (2002a). A region of influence approach to predicting flow duration
curves within ungauged catchments. Hydrology and Earth System Sciences, 6(4): 721-731.
Holmes, M.G.R., Young, A.R., Gustard, A.G. and Grew, R. (2002b). A new approach to estimating mean flow in the United
Kingdom. Hydrology and Earth System Sciences, 6(4): 709-720.
Holmes, M.G.R. and Young, A.R. (2002c). Estimating seasonal low flow statistics in ungauged catchments. British Hydrological
Society: Proc. 8th National Hydrology Symposium – Birmingham, 97-102.
Hosking, J., Wallis, J. (1993). Some statistics useful in regional frequency analysis. Water Resources Research, 29(2):271-281.
IPCC, (2007). Climate Change 2007: Synthesis Report. Contribution of Working Groups I, II and III to the Fourth Assessment
Report of the Intergovernmental Panel on Climate Change, Core Writing Team, Pachauri, R.K and Reisinger, A. (eds.).
IPCC, Geneva, Switzerland, 104 pp.
Jakulin, A., and I. Bratko (2004), Testing the significance of attribute interactions, in Proceedings of the Twenty‐First
International Conference on Machine Learning, ACM Int. Conf. Proc. Ser., vol. 69, p. 52, Assoc. for Comput. Mach., Banff,
Alberta, Canada.
Jingyi, Z. and Hall, M.J. (2004). Regional flood frequency analysis for the Gan-Ming river basin in China. Journal of Hydrology,
296:98-117.
Kottegoda, N.T., Natale, L. and Raiteri, E. (2007). Gibbs sampling of climatic trend and periodicities. Journal of Hydrology,
339:54-64.
Koutsoyiannis, D., The Hurst phenomenon and fractional Gaussian noise made easy. Hydrological Sciences Journal, 47(4):573–
595, 2002.
Koutsoyiannis, D. (2010). A random walk on water, Hydrology and Earth System Sciences, 14, 585-601, doi:10.5194/hess-14585-2010.
Kraskov, A., Stoegbauer, H., Andrzejak, R. G. and Grassberger, P. (2005). Hierarichical clustering based on mutual information.
Europhysics Letters 70:278.
Krstanovic, P. F. and Singh, V. P. (1992). Evaluation of rainfall networks using entropy: I. Theoretical development. Water
Resources Management, 6:279-293.
Merz, R. and Bloeschl, G. (2005). Flood frequency regionalisation—spatial proximity vs. catchment attributes. Journal of
Hydrology, 302:283–306.
Mogheir, Y. and Singh, V. P. (2002). Application of information theory to groundwater quality monitoring networks. Water
Resources Management, 16:37-49.
Mogheir, Y., de Lima J. L. M. P. and Singh, V. P. (2004). Characterizing the spartial variability of groundwater quality using the
entropy theory, I. Synthetic data. Hydrological Processes, 18:2165-2179.
Mosley, M.P. (1981). Delimitation of New Zealand hydrologie regions. Journal of Hydrology, 49(1):173-192.
North, M., (1980). Time-dependent stochastic model of floods. Journal of the Hydraulics Division, ASCE 106 (HY5), 649-665.
Papoulis, A. (1991). Probability, random variables and stochastic processes, 3rd Edn, McGraw-Hill, New York.
Perreault, L., Bernier, J.; Bobee, B., Parent, E. (2000a). Bayesian change-point analysis in hydrometeorological time series. Part
1. The normal model revised. Journal of Hydrology, 235:221-241.
Perreault, L., Bernier, J.; Bobee, B., Parent, E. (2000b). Bayesian change-point analysis in hydrometeorological time series. Part
2. Comparison of change-point models and forecasting.. Journal of Hydrology, 235:242-263.
Rao, A.R. and Srinivas, V.V. (2006a). Regionalization of watersheds by fuzzy cluster analysis. Journal of Hydrology, 318(14):37-56.
Rao, A.R. and Srinivas V.V. (2006b). Regionalization of watersheds by hybrid cluster analysis. Journal of Hydrology, 318(14):57-79.
Renard, B., Lang, M. and Bois, P. (2006). Statistical analysis of extreme events in a non-stationary context via a Bayesian
framework: case study with peak-over-threshold data. Stochastic Environamental Research and Risk Assessment, 21:97-112.
Definition of homogeneous regions through entropy theory
11
3rd STAHY International Workshop on STATISTICAL METHODS FOR HYDROLOGY AND WATER RESOURCES MANAGEMENT
October 1-2, 2012 Tunis, Tunisia
Ribeiro-Correa, B., Cavadias, G.S., Clement, B. and Rousselle, J. (1995). Identification of hydrological neighbourhoods using
canonical correlation analysis. Journal of Hydrology, 173:71-89.
Ridolfi, E., Montesarchio, V., Russo, F. and Napolitano, F. (2011). An entropy approach for evaluating the maximum
Information content achievable by an urban rainfall network. Natural Hazards and Earth System Sciences, 11:2075–2083,
doi:10.5194/nhess-11-2075-2011.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656.
Singh, V. P. (1997). The use of entropy in hydrology and water resources. Hydrological Processes, 11:587-626.
Tasker, G.D. (1982). Comparing methods of hydrologic regionalization. Water Resources Bullettin, 18(6):965-970.
Tasker, G.D., Hodge, S.A. and Barks C.S. (1996). Region of influence regression for estimating the 50-year flood at ungaged
sites. Water Resources Bulletin, 32:163-170.
Thomas, D.M. and Benson, M.A. (1970). Generalization of streamflow characteristics from drainage basin characteristics. U.S.
Geol. Survey Water Supply Paper, 1975:1-55.
Yang, Y., Burn, D.H. (1994). An entropy approach to data collection network design. Journal of Hydrology, 157:307-324.
Yoo, C., Jung, K. and Lee, J. (2008). Evaluation of rain gauge network using entropy theory: comparison of mixed and continuos
distribuition function applications. Journal of Hydrologic Engineering, 13:226-235.
Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development,
4(1): 6682.
Wei, Y. (1987). Variance, entropy and uncertainty measure. Procedings of the survey research methods section, 609-610.
American Statistical Association. Alexandria,Va.
Zrinji, Z., and. Burn, D.H. (1994). Flood frequency analysis for ungauged sites using a region of influence approach. Journal of
Hydrology, 153(1):1-21.
Zrinji, Z., Burn, D.H. (1996). Regional flood frequency with hierarchical region of influence. Journal of Water Resources
Planning and Management, 153(1):1-21.
Definition of homogeneous regions through entropy theory
12