
International Journal of Geographical Information Science, 1998, vol. 12, no. 3, 269–289
Research Article
Mean-variance analysis of the performance of spatial ordering
methods
AKHIL KUMAR
College of Business, University of Colorado, Campus Box 419, Boulder,
CO 80309, USA
WALEED A. MUHANNA
Fisher College of Business, The Ohio State University, 1775 College Road,
Columbus, OH 43210, USA
email [email protected]
and RAYMOND A. PATTERSON
School of Management, The University of Texas at Dallas, P.O. Box 830688,
Richardson, TX 75083, USA
(Received 4 July 1995; accepted 13 June 1997)
Abstract. Geographical Information Systems (GIS) involve the manipulation of
large spatial data sets, and the performance of these systems is often determined
by how these data sets are organized on secondary storage (disk). This paper
describes a simulation study investigating, in extensive detail, the performance
of two non-recursive spatial clustering methods, the Inverted Naive and the
Spiral methods, and comparing them with the Hilbert fractal method, which has
been shown in previous studies to outperform other recursive clustering methods.
The paper highlights the importance of analysing the sample variance when
evaluating the relative performance of various spatial ordering methods. The
clustering performance of the methods is examined in terms of both the mean
and variance values of the number of clusters (runs of consecutive disk blocks)
that must be accessed to retrieve a query region of a given size and orientation.
The results show that, for a blocking factor of 1, the mean values for the Spiral
method are the best, and on average about 30% better than for the other two
methods. In terms of variance, the Inverted Naive method is the best, followed by
the Spiral and Hilbert methods, in that order. We also study the impact of varying
query size and the skew ratio (between the X and Y dimensions) for each method.
While these performance results do not generalize to higher blocking factors, we
believe they are useful for both researchers and practitioners to know, because
several previous studies have also examined this special case, and because it
has important implications for the performance of GIS applications.
1. Introduction
Spatial and multi-dimensional data are commonly encountered in many
Geographical Information System (GIS) applications, and the organization of such
data for e cient retrieval has emerged as a signi® cant area of research. Two kinds
of queries arise frequently in spatial data management in GIS (Star and Estes 1990):
1. What is found at a given location, e.g.
   - What is the elevation of a specific geographical location?
   - At a specified coordinate, what is the soil type?
2. Are there examples of specified objects within a specified area, e.g.
   - Are there any fire towers in a specified region?
   - Where are the sources of potable water in a specified region?

1365–8816/98 $12.00 © 1998 Taylor & Francis Ltd.
In this paper, we focus on the second kind of query, and study how such queries
can be answered efficiently in a GIS. Geographical information systems and geographical
data modelling approaches often involve the use of relational database technology
(Goodchild 1992, Batty 1992). Consequently, geographical data are stored in relational
database systems, and a common data structure for GIS data is a raster or grid
(Star and Estes 1990). In a raster structure, a value for the parameter of interest is
developed for every cell (or pixel) in a grid. In the database system, these data are
stored in files or relations on disk in a continuous and linear manner. For example,
two simple ordering schemes are row-wise ordering and column-wise ordering. In
row-wise ordering, the grid data is stored by rows (i.e. one cell in a row followed by
another, and so on), while in column-wise ordering, it is stored by columns. The
obvious drawback of these techniques is that in row-wise ordering two neighbouring
cells in adjacent rows are stored at locations far from one another, while in
column-wise ordering the same problem occurs for two cells in neighbouring columns.
The objective of spatial ordering (or spatial clustering) is to store data about neighbouring
cells in close physical proximity on disk. This speeds up processing
of the queries described above because the time spent accessing data from the
disk is reduced when the desired data is stored more contiguously. Therefore,
the objective of this paper is to compare various spatial ordering techniques and
determine how they perform across various kinds of queries.
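To make the two linear orderings concrete, the following small sketch (ours, not from the paper; the function names are illustrative) shows the disk-address functions they imply for an n × n grid, and why vertical neighbours end up far apart under row-wise ordering.

```python
def row_wise(x: int, y: int, n: int) -> int:
    """Row-wise ordering: a full row is stored before the next row begins."""
    return y * n + x

def column_wise(x: int, y: int, n: int) -> int:
    """Column-wise ordering: a full column is stored before the next one."""
    return x * n + y

# Two vertically adjacent cells (same column, neighbouring rows):
n = 512
print(abs(row_wise(10, 11, n) - row_wise(10, 12, n)))     # 512 blocks apart
print(abs(column_wise(10, 11, n) - column_wise(10, 12, n)))  # 1 block apart
```

Exactly the symmetric problem arises for horizontally adjacent cells under column-wise ordering.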
Another potential benefit of good spatial ordering can result from positive autocorrelation
(Star and Estes 1990, Goodchild 1992). Positive autocorrelation means that
neighbouring grid cells are likely to have similar attribute values, e.g. elevation,
rainfall amounts, etc. It is possible to take advantage of this correlation by reducing
the amount of information stored, for instance when neighbouring cells have identical
values for a certain attribute. As noted by Goodchild (1992), the similarity between
variables increases as the locations converge. In this paper, we simply note this as a
side benefit and defer further investigation of it to future work.
One approach to processing spatial queries efficiently consists of indexing spatial
data. Indexing techniques, like B-trees, which work well on one-dimensional data,
do not perform as well when the data lies in multi-dimensional space, and, therefore,
new techniques have been devised for such data.
Two of the earliest methods for organizing spatial data are quadtrees and K-D
trees. Assuming two-dimensional data, quadtrees (Finkel and Bentley 1974) successively
divide the data space into four smaller quadrants at each stage by examining
the X and Y coordinate values. (A variant of quadtrees called MX-CIF quadtrees
(Abel and Smith 1983) has been implemented in a GIS software product from IBM
called GeoManager (Batty 1992).) K-D trees (Bentley 1975), on the other hand,
divide the space into two parts at each stage by examining alternating coordinates,
one coordinate at a time. However, both of these structures are primarily intended
for internal searching, where the entire tree structure is resident in main memory;
as such they are not very suitable for external searching, where the tree structure is
stored on a secondary storage device. These trees can also become unbalanced easily.
A common approach to organizing such data on secondary storage, such as
disk, is to divide a k-dimensional space or region into smaller k-dimensional partitions
or subregions, and store one or more partitions in a single disk block or bucket.
To answer a query, the relevant buckets are brought into main memory from disk
and searched. In Grid files (Nievergelt et al. 1984, Hinrichs 1985), a directory is
maintained to track the correspondence between partitions and disk buckets. In
KDB trees (Robinson 1981), the regions are organized into a tree similar to a B-tree.
The leaf pages of this tree correspond to disk buckets, while higher-level pages
encompass successively larger non-overlapping regions, which are navigated in order
to locate the appropriate leaf page. R-trees (Guttman 1984) are somewhat similar;
however, they permit overlapping regions. Both KDB-trees and R-trees can be
maintained on secondary storage. Examples of other structures are BD trees
(Dandamudi and Sorenson 1985, 1986) and zkdb trees (Orenstein and Merrett
1984, Orenstein 1986, 1989). Samet (1991) presents an excellent discussion of these
methods.
Since the disk space is a continuum of blocks, it is logically one-dimensional.
Therefore, a mapping is required from the multi-dimensional data space into the
one-dimensional disk space. A very desirable property for such a mapping is
that it should be distance preserving; i.e. points that are close in the multi-dimensional
space should also be close in the one-dimensional (disk) space. If this is the case,
then the various buckets that must be accessed to answer a query will be contiguous,
or sequentially clustered together. This reduces the number of random
disk accesses (disk seek operations and rotational delay) required, thereby reducing
the total access time for all the buckets. In the worst case, on the other hand,
the buckets could be scattered across the disk, resulting in poor access
performance. Therefore, good clustering is a very useful property, and several existing data
structures stand to benefit from it.
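To see the cost of a mapping that is not distance preserving, consider plain row-wise ordering: a q × q query region touches one separate run of blocks per row, so answering it needs q random seeks rather than one. A sketch (ours, with illustrative names and sizes):

```python
def blocks_for_query(x0: int, y0: int, q: int, n: int) -> list:
    """Disk blocks covered by a q x q query at (x0, y0) under row-wise
    ordering of an n x n grid (block of cell (x, y) is y*n + x)."""
    return [y * n + x
            for y in range(y0, y0 + q)
            for x in range(x0, x0 + q)]

def count_runs(blocks: list) -> int:
    """Number of maximal runs of consecutive block numbers."""
    s = sorted(blocks)
    return sum(1 for i, b in enumerate(s) if i == 0 or b != s[i - 1] + 1)

print(count_runs(blocks_for_query(100, 100, 32, 512)))  # 32 runs: one per row
```

A perfectly clustered (distance-preserving) layout would put the same cells into a single run.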
Several techniques for spatial clustering have been proposed. They can be classified
into two groups:
1. Non-recursive techniques, such as the inverted naive and spiral methods.
2. Recursive techniques, such as binary z-ordering, gray z-ordering, gray
nu-ordering, and the Hilbert curve.
The non-recursive techniques follow some simple pattern for traversing (ordering)
the points in the data space. The recursive techniques are based on recursively
partitioning a data space, in a pattern that repeats itself at each level of recursion.
The binary z-ordering (BZ) method was proposed and studied by Orenstein and
Merrett (1984) and Orenstein (1986). The gray z-ordering (GZ) was proposed by Faloutsos
(1988) and shown to outperform BZ. Subsequently, Hilbert (H) ordering was
described by Faloutsos and Roseman (1989) and shown to be superior to both BZ
and GZ ordering. Kumar (1994) proposed another technique called gray nu-ordering
(GN), but the study showed that Hilbert was the best among these four recursive
methods. These studies have clearly established that Hilbert is the best method
among the recursive ones. However, the case for the superiority of Hilbert over the
non-recursive methods has not so clearly been made, particularly relative to the
Spiral method, whose performance has not been investigated in the literature.
In our present study, we investigate, using stochastic Monte Carlo simulations,
the performance of the (recursive) Hilbert method and two non-recursive methods,
the Inverted Naive and Spiral, in very extensive detail. We examine the clustering
performance of the methods in terms of both the mean and variance values of the
number of clusters (runs of consecutive disk blocks) that must be accessed to retrieve
a query region. Previous studies judged relative performance solely on the basis of
sample means. This paper highlights the importance of analysing the sample variance
when evaluating the relative performance of various methods. The results are very
interesting because they suggest that the non-recursive methods generally perform
better than the Hilbert method. In the past, researchers (e.g. Jagadish 1990) have
noted that the inverted naive method has a bias in one dimension or the other, and
argued that if the orientation of a query is unknown then the naive method would
work well for queries of the 'favourable' orientation, and poorly for queries of the
'unfavourable' orientation. This has been a basis for rejecting the method in situations
where the query orientation is not known in advance, which is usually the case.
One caveat should be clearly noted at the outset. Our results and comparisons
apply to the special case where the blocking factor is 1; i.e. as in Faloutsos (1988),
we assume that each point in the multi-dimensional data space is mapped to exactly
one disk bucket (block); each grid point is a unit of access from disk. Even though
the results do not generalize to larger blocking factors (admittedly, recursive methods
are better for larger blocking factors), we believe they are worth reporting because:
(1) they advance the results of previous studies which made the same assumptions;
(2) the results are important and useful for researchers to know; and (3) we believe
that in certain situations, a blocking factor of 1 is quite realistic. For an example of
where a blocking factor of 1 might arise, consider a geographical information system
in which maps and satellite images are stored and searched electronically, and a user
can select and click on a small section of a map in order to zoom in and expand that
section. Here, each grid point will be associated with a binary large object (BLOB),
which will be stored with it in the same disk bucket. Since the BLOB is a large
object, it is reasonable to assume that one grid point would map into one disk bucket.
The balance of the paper is organized as follows. Section 2 briefly describes the
three methods being evaluated. Section 3 describes the results of a series of simulation
experiments and statistical (ANOVA) tests conducted to evaluate the relative
performance of the three methods under different conditions. Section 4 examines the
sample variance for the three methods (on queries of different sizes and orientations)
more closely. Section 5 provides a discussion of our results and a comparison with
results obtained from previous studies. Section 6 concludes the paper.
2. Clustering schemes
In this section, we briefly describe the Inverted Naive, Spiral and Hilbert methods.
2.1. Inverted Naive (IN) ordering
The clustering pattern created by the IN method, also known as column-wise
snake scan (Jagadish 1990) and column-prime order (Samet 1991), is shown in
figure 1(a) for an 8×8 data space. It traces a path of successive vertical lines, going
in alternate directions until the entire grid is covered. This ordering method covers
the entire grid by taking only one step at a time; however, there are clearly more
vertical steps than horizontal steps.
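The IN scan can be captured by a one-line mapping. A sketch (our formulation, assuming a 0-based grid with column index x and row index y):

```python
def inverted_naive(x: int, y: int, n: int) -> int:
    """Position of cell (x, y) along the column-wise snake scan of an
    n x n grid: even columns are traversed in one direction, odd columns
    in the other, so consecutive positions are always neighbouring cells."""
    return x * n + (y if x % 2 == 0 else n - 1 - y)

# The last cell of column 0 and its neighbour in column 1 are adjacent
# on disk, which is exactly the snake-scan property:
n = 8
print(inverted_naive(0, n - 1, n), inverted_naive(1, n - 1, n))  # 7 8
```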
2.2. Spiral method (S)
The clustering pattern created by the Spiral method is shown in figure 1(b). It
starts at one of the corners of the grid, and produces a spiral path of successive,
non-concentric loops that become smaller until the centre of the grid is reached.
Again, this ordering method also covers the entire grid by taking only single steps
at a time, but it is more symmetrical than the IN method because there are roughly
equal numbers of up, down, left, and right steps.

Figure 1. Inverted-Naive and Spiral ordering in an 8×8 data space.
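A sketch of the spiral traversal (our formulation; the paper's exact starting corner and turning direction may differ):

```python
def spiral_order(n: int) -> dict:
    """Visit the cells of an n x n grid in an inward spiral starting at
    (0, 0). Returns a dict mapping (x, y) -> position along the spiral."""
    order, x, y = {}, 0, 0
    dx, dy = 1, 0                        # start by moving along the first row
    for step in range(n * n):
        order[(x, y)] = step
        nx, ny = x + dx, y + dy
        if not (0 <= nx < n and 0 <= ny < n) or (nx, ny) in order:
            dx, dy = -dy, dx             # turn inward at a wall or visited cell
            nx, ny = x + dx, y + dy
        x, y = nx, ny
    return order

print(spiral_order(3)[(1, 1)])  # 8: the centre cell is visited last
```

Note that every step moves to a neighbouring cell, which is the single-step property the text describes.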
2.3. Hilbert curve (H ) clustering
Hilbert ordering is produced by taking an initial pattern and rotating and
re¯ ecting it to produce a new pattern that spans a larger region. This idea is
illustrated in ® gure 2. The ® rst pattern H 1 corresponds to the case where the X and
Y coordinates are represented by one bit. In the next two patterns, H 2 and H 3 , each
coordinate is represented by 2 and 3 bits respectively. Given the ® rst pattern, any
subsequent pattern can be visually constructed in an inductive manner as follows:
Make a copy of the previous pattern and rotate the copy by 90ß clockwise. Place
the rotated copy directly under the pattern. Connect the top-left corner of the copy
with the bottom-left corner of the pattern. Make a vertically re¯ ected mirror-image
copy of this new pattern on the right. Join the two images. Basically, as shown in
® gure 2, each successive pattern is four times the size of the previous pattern. An
algorithm for determining the Hilbert code for a given coordinate position in a
2n Ö 2n grid can be found in ( Faloutsos and Roseman 1989). The (almost perfect)
symmetry of the recursive Hilbert curve in terms of the number of steps in the `up’,
`down’, `left’ and `right’ directions can be seen in ® gure 2.
Figure 2.
Hilbert ordering (H 1 , H 2 , H 3 ).
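One common bit-manipulation formulation of this coordinate-to-code mapping (the standard rotate-and-reflect construction; not necessarily the exact algorithm of Faloutsos and Roseman 1989) is sketched below:

```python
def hilbert_d(order: int, x: int, y: int) -> int:
    """Hilbert-curve position of cell (x, y) in a 2^order x 2^order grid,
    built by processing one bit of each coordinate per level and rotating
    or reflecting the lower-order quadrant accordingly."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)     # quadrant offset at this level
        if ry == 0:                      # rotate/reflect into the sub-square
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

print(hilbert_d(1, 1, 0))  # 3: the last cell of the 2x2 pattern H1
```

Consecutive Hilbert codes always belong to neighbouring grid cells, which is the distance-preserving property that makes the curve attractive for clustering.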
3. Performance evaluation
3.1. Introduction and overview of tests
This section describes the results of a series of simulation experiments and
statistical tests conducted to evaluate the performance of the three different schemes
discussed above. These results give insights into how the various methods fare relative
to one another in a variety of situations.
Our performance criterion is the average number of clusters that must be accessed
to answer a query. As mentioned earlier, each grid point is a unit of access from
disk. For instance, say the nine grid elements that must be accessed to answer a
3×3 query are mapped to disk blocks 10, 12–13, 16–20, and 25. In this case there
are 4 clusters; on the other hand, if the grid elements were numbered 10–14 and
22–25, this would correspond to only 2 clusters. From a clustering point of view,
the second arrangement is 50% better than the first one. The simulation algorithm
and a brief overview are given next, and the results of the experiments follow
afterwards.
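The worked example above can be checked mechanically; a sketch:

```python
def count_clusters(blocks: list) -> int:
    """Number of clusters, i.e. runs of consecutive disk-block numbers."""
    s = sorted(blocks)
    return sum(1 for i, b in enumerate(s) if i == 0 or b != s[i - 1] + 1)

# Blocks 10, 12-13, 16-20, 25: four separate runs.
print(count_clusters([10, 12, 13, 16, 17, 18, 19, 20, 25]))  # 4
# Blocks 10-14 and 22-25: two runs.
print(count_clusters([10, 11, 12, 13, 14, 22, 23, 24, 25]))  # 2
```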
The IN, S and H clustering schemes were tested by simulation experiments using
the above criterion. We considered a two-dimensional 512×512 data space (we also
considered a 1024×1024 data space; the results were in general agreement with those
reported here for the 512×512 data space) and randomly generated (by
simulation) the boundaries of rectangular query regions of various sizes within it, as
follows. The X and Y coordinates of the bottom-left corner of the query rectangle
were generated randomly, and a query rectangle of certain given dimensions was
placed there. Then, the number of clusters into which the buckets lying within the
query region would fall was determined.
The steps of the simulation algorithm are listed as follows:

Algorithm Simulation
  sum_clust = 0;
  sumSQ = 0;
  repeat (500 times)
  begin
    count = 1;
    Generate a random query region of size q1 × q2;
    For each grid element g on the boundary of the region
      If (successor(g) lies outside the query region) count++;
    sum_clust = sum_clust + count;
    sumSQ = sumSQ + (count * count);
  end
  sample_mean = sum_clust / 500;
  sample_variance = ((500 * sumSQ) − (sum_clust * sum_clust)) / (500 * 499);
  sample_std_dev = sqrt(sample_variance);
The experiment is repeated 500 times and the values of the sample mean and
standard deviation are reported. (In cases involving extremely large query region
sizes where fewer than 500 distinct samples exist in the data space, complete
enumeration was performed.) The algorithm randomly places a query region of a fixed
size inside the data space, and examines each grid point on the boundary of the query
region. If the successor of a grid point lies outside the region, then the cluster count
is increased by 1. The cluster count after all the grid points on the boundary have
been examined is the total number of clusters for the given query region.
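As a concrete illustration, the simulation loop can be sketched in Python for the IN ordering. This is our reconstruction, not the authors' code: it counts runs of consecutive block numbers directly rather than via the boundary-successor test, and all names are illustrative.

```python
import random

def inverted_naive(x, y, n):
    """Column-wise snake-scan position of cell (x, y) in an n x n grid."""
    return x * n + (y if x % 2 == 0 else n - 1 - y)

def count_clusters(blocks):
    """Number of runs of consecutive disk-block numbers."""
    s = sorted(blocks)
    return sum(1 for i, b in enumerate(s) if i == 0 or b != s[i - 1] + 1)

def simulate(order_fn, n, q1, q2, trials=500, seed=1):
    """Sample mean and variance of the cluster count for randomly placed
    q1 x q2 query regions in an n x n data space under the given ordering."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        x0 = rng.randrange(n - q1 + 1)   # random bottom-left corner
        y0 = rng.randrange(n - q2 + 1)
        blocks = [order_fn(x, y, n)
                  for x in range(x0, x0 + q1)
                  for y in range(y0, y0 + q2)]
        counts.append(count_clusters(blocks))
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    return mean, var

mean, var = simulate(inverted_naive, 512, 30, 30)
print(round(mean, 2))   # close to 30, in line with table 1's 29.94 for IN
```

For IN, an interior 30×30 query spans 30 columns, each a separate run, so the mean sits just below 30 (slightly lower when the query touches a snake turn at the grid edge).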
3.2. Square query regions
The size of the data space was maintained at 512×512, while the query region
size was varied from 30×30 to 480×480. The sample means are shown in table 1
Table 1. Average number of clusters (square query regions).

Cols  Rows  Area     IN           S                H                F-stat   P-value
30    30    900      29.94 (0%)   29.98 (−0.13%)   29.24 (2.34%)    2.30     0.10
60    60    3600     59.70 (0%)   59.45 (0.42%)    58.59 (1.86%)    0.96     0.39
90    90    8100     90.00 (0%)   88.71 (1.43%)    89.64 (0.40%)    0.60     0.55
120   120   14 400   119.05 (0%)  115.66 (2.85%)   117.33 (1.44%)   1.84     0.16
150   150   22 500   149.26 (0%)  141.06 (5.49%)   149.03 (0.15%)   9.04     0.00
180   180   32 400   179.29 (0%)  161.48 (9.93%)   174.81 (2.50%)   20.95    0.00
210   210   44 100   208.95 (0%)  174.33 (16.57%)  209.75 (−0.38%)  75.30    0.00
240   240   57 600   238.80 (0%)  176.77 (25.98%)  246.51 (−3.23%)  186.69   0.00
270   270   72 900   268.93 (0%)  166.51 (38.08%)  272.53 (−1.34%)  429.70   0.00
300   300   90 000   299.10 (0%)  141.05 (52.84%)  291.34 (2.59%)   806.08   0.00
330   330   108 900  328.03 (0%)  120.42 (63.29%)  328.18 (−0.05%)  1327.01  0.00
360   360   129 600  355.33 (0%)  103.00 (70.90%)  357.08 (−0.49%)  1703.76  0.00
390   390   152 100  387.67 (0%)  83.37 (78.49%)   385.40 (0.59%)   2138.77  0.00
420   420   176 400  416.23 (0%)  61.04 (85.34%)   411.95 (1.03%)   2331.58  0.00
450   450   202 500  441.03 (0%)  42.61 (90.34%)   443.52 (−0.56%)  2848.16  0.00
480   480   230 400  466.60 (0%)  22.11 (95.26%)   467.03 (−0.09%)  2639.93  0.00
for the three methods. The percentages in parentheses show how the other methods
fare relative to IN clustering. A positive % value indicates that a method, on average,
does better than IN, while a negative % value indicates that a method does worse.
To analyse how the two factors (clustering method and query size) affect performance
(average number of clusters), we first conducted a two-factor ANOVA test. As
expected, the test revealed the presence of significant main factors as well as
interaction effects (p-value 0.0001). In view of this, additional (single-factor) ANOVA
tests were conducted to examine, at each fixed query size level, whether or not there
are real (i.e. statistically significant) performance differences among the three
clustering methods. The results, shown in the last two columns of table 1, indicate the
presence of significant performance differences for all square query regions equal to
or larger than 150×150. Additional ANOVA runs indicate that the observed differences
in the average performance between the IN and H methods are not statistically
significant (0.05 level). We therefore fail to reject the null hypothesis and conclude that
IN and H perform equally for all square query regions. In comparison, the performance
of the Spiral method continues to improve relative to the IN method as the query
region grows larger. The performance gap, in favour of S, is statistically significant
for query regions 150×150 in size or larger, and sharply widens as the query region
size increases; for a query region size of 480×480 the gap is 95%. Clearly, the Spiral
method performs better than the other two methods for square query regions.
3.3. Skewed query regions
In the next set of experiments, we considered skewed query regions where one
side of the query region was longer than the other side. Our objective here was to
see how the skew factor (i.e. the ratio between the long side and the short side of
the query region) affects the relative performance of these methods. We also wanted
to see how the total area of the query region would affect the performance and
whether there was any pattern similar to the one found for square regions, where
relative performance of the methods was a function of the area of the query region.
Consequently, we considered query regions of a fixed area at a time, but varied the
X and Y dimensions of the regions (in such a way that the area remained constant).
For a fixed area value, the lengths of the X and Y dimensions of the query region
were varied in such a way that the skew ratio (i.e. the ratio between the long side and
the short side) ranged from −25 to +25. (By convention, a minus sign for the skew
ratio indicates that the long side is along the X axis (more columns than rows),
while a plus sign indicates the opposite.) The size of the data space was again a
512×512 grid, and we show the results for three representative values for the area
of the query regions:
small query region: area of 900
medium query region: area of 22 500
large query region: area of 57 600
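For reference, the cols × rows pairs appearing in tables 2, 4 and 6 follow directly from the area and the skew ratio; a small sketch (our helper function, not from the paper):

```python
import math

def query_dims(area: int, skew: float) -> tuple:
    """(cols, rows) of a query region with the given area and skew ratio.
    By the paper's convention, a negative skew puts the long side along
    the X axis (more columns than rows); a positive skew the opposite."""
    r = abs(skew)
    long_side = round(math.sqrt(area * r))
    short_side = round(math.sqrt(area / r))
    return (long_side, short_side) if skew < 0 else (short_side, long_side)

print(query_dims(900, -2.25))   # (45, 20), as in table 2
print(query_dims(900, 6.25))    # (12, 75)
```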
Table 2 shows the mean values for small query regions. Again, the number of
clusters is computed as before and the performance gaps (relative to the IN method)
are shown in parentheses (as percentages). However, the table entries are grouped
by magnitude of the skew factor, and negative and positive values are shown together.
For example, in table 2 the skew factor magnitudes considered are: 1, 2.25, 4, 6.25,
11.1 and 25. In each case the results are shown for both positive and negative skew
factors in consecutive rows, and they are combined together in the immediately
following row. For example, rows 2 and 3 of table 2 give the results for skew factor
values of −2.25 and 2.25, respectively, while row 4 aggregates the results for these
two rows and gives an overall average for all query regions with a skew factor of
magnitude 2.25. The subsequent rows of the table are organized in the same manner.
We believe examining aggregate values is more useful because the IN method is
extremely biased towards query regions with fewer columns and more rows (i.e.
positive skew values). This bias effect can be eliminated by evaluating the aggregates,
and therefore such a comparison is more useful.
An ANOVA test on the observations summarized in table 2 revealed the presence
of both main factors and interaction effects (p-values < 0.001). In view of this, a series
of ANOVA tests were conducted to examine, at each fixed value of the skew factor,
whether or not there are real performance differences among each pair of clustering
Table 2. Average number of clusters for skewed query regions, region area = 900.

Skew ratio   Cols  Rows  IN           S                 H
1            30    30    29.94 (0)    29.98 (−0.13)     29.24 (2.34)
−2.25        45    20    44.91 (0)    30.61 (31.84)     31.60 (29.64)
2.25         20    45    19.90 (0)    31.49 (−58.24)    32.84 (−65.03)
2.25 (avg)               32.41 (0)    31.05 (4.18)      32.22 (0.57)
−4           60    15    59.70 (0)    33.91 (43.20)     37.45 (37.27)
4            15    60    15.00 (0)    35.24 (−134.93)   37.47 (−149.80)
4 (avg)                  37.35 (0)    34.58 (7.43)      37.46 (−0.29)
−6.25        75    12    74.78 (0)    40.06 (46.43)     43.79 (41.44)
6.25         12    75    11.98 (0)    39.19 (−227.13)   43.65 (−264.36)
6.25 (avg)               43.38 (0)    39.63 (8.66)      43.72 (−0.78)
−11.1        100   9     99.90 (0)    50.67 (49.28)     54.71 (45.24)
11.1         9     100   8.97 (0)     46.43 (−417.61)   54.59 (−508.58)
11.1 (avg)               54.44 (0)    48.55 (10.81)     54.65 (−0.39)
−25          150   6     149.85 (0)   64.24 (57.13)     77.02 (48.60)
25           6     150   5.98 (0)     60.59 (−912.71)   74.92 (−1152.8)
25 (avg)                 77.92 (0)    62.40 (19.91)     75.97 (2.50)
methods. The results, shown in table 3, support the null hypothesis that the aggregate
behaviour of the Hilbert and IN methods is identical (see columns labelled IN-H) for
all skew factor magnitudes. On the other hand, the Spiral method (with the exception
of the case where the skew factor is 1 or 2.25) consistently outperforms both the
IN method (see columns labelled IN-S) and the Hilbert method (see columns labelled
S-H) with respect to the mean number of clusters accessed. The Spiral method's
performance advantage (or gap) over the IN method increases as the skew factor
becomes larger, and is 19.9% for a skew factor magnitude of 25. We therefore
conclude that for small query regions and small skew factors (1 or 2.25), the three
methods yield similar aggregate performance results. Overall, however, we conclude
that for small query regions the Spiral method performs better than both the IN and
H methods.
Table 3. ANOVA results for skewed query regions, region area = 900.

                           IN-S                IN-H                 S-H
Skew        Cols  Rows  F-stat    P-value  F-stat     P-value  F-stat  P-value
1           30    30    0.82      0.37     2.17       0.14     2.44    0.12
−2.25       45    20    708.77    0.00     582.77     0.00     1.68    0.20
2.25        20    45    455.12    0.00     569.19     0.00     3.11    0.08
2.25 (avg)              3.06      0.80     0.06       0.81     2.34    0.13
−4          60    15    748.69    0.00     5472.43    0.00     13.09   0.00
4           15    60    468.36    0.00     6304.20    0.00     5.21    0.02
4 (avg)                 4.07      0.04     0.01       0.92     8.63    0.00
−6.25       75    12    729.08    0.00     1170.68    0.00     5.70    0.02
6.25        12    75    449.36    0.00     1155.49    0.00     7.91    0.00
6.25 (avg)              3.88      0.05     0.04       0.84     6.75    0.01
−11.1       100   9     765.38    0.00     16 282.85  0.00     4.99    0.03
11.1        9     100   456.33    0.00     16 266.95  0.00     20.79   0.00
11.1 (avg)              4.78      0.03     0.01       0.92     11.47   0.00
−25         150   6     1131.61   0.00     1811.43    0.00     17.45   0.00
25          6     150   462.02    0.00     1645.63    0.00     21.98   0.00
25 (avg)                14.46     0.00     0.30       0.58     19.89   0.00

Table 4. Average number of clusters for skewed query regions, region area = 22 500.

Skew ratio   Cols  Rows  IN           S                 H
1            150   150   149.3 (0)    141.1 (5.49)      149.0 (0.15)
−2.25        225   100   224.8 (0)    140.7 (37.42)     159.3 (29.12)
2.25         100   225   99.7 (0)     137.6 (−38.01)    160.0 (−60.43)
2.25 (avg)               162.3 (0)    139.2 (14.2)      159.7 (1.6)
−4           300   75    299.4 (0)    145.6 (51.38)     185.3 (38.11)
4            75    300   74.78 (0)    145.8 (−94.91)    187.3 (−150.40)
4 (avg)                  187.1 (0)    145.7 (22.1)      186.3 (0.4)
−6.25        375   60    374.3 (0)    178.0 (52.43)     209.7 (43.97)
6.25         60    375   59.41 (0)    171.7 (−188.92)   217.4 (−266.00)
6.25 (avg)               216.9 (0)    174.9 (19.4)      213.6 (1.5)
−11.1        500   45    499.5 (0)    269.5 (46.05)     270.0 (45.94)
11.1         45    500   41.61 (0)    267.7 (−543.33)   270.7 (−550.66)
11.1 (avg)               270.6 (0)    268.6 (0.7)       270.4 (0.1)

The results for medium size query regions are shown in table 4, in the same
format as table 2. All the query regions have an area of 22 500, and the skew factor
varies between −11.1 and +11.1. (As the area of the query region becomes
larger, the range of values the skew factor can assume becomes more constrained.)
Again, a two-factor ANOVA test revealed the presence of both main factors and
interaction effects (p-values < 0.001). In view of this, a series of ANOVA tests were
conducted to examine, at each fixed value of the skew factor, whether or not there
are real performance differences among each pair of clustering methods. Again, we
will consider only the aggregate values for each magnitude of skew factor in our
discussion. The results, shown in table 5, indicate that the aggregate behaviour of the
Hilbert and IN methods is identical (see IN-H columns) for all skew factor magnitudes.
The S method does better than IN throughout; however, in this case the gap between
S and IN increases at first with increasing magnitude of the skew factor, and peaks
at 22.1% for a skew factor of 4. The gap however decreases as the skew factor is
further increased, and drops to 0.7% for a skew factor of 11.1. ANOVA tests (see
table 5) suggest that the small differences observed (when the magnitude of the skew
factor is 11.1) are not statistically significant. This means that for medium size regions
with large skew factors, all three methods perform about the same. However, we again
conclude that overall, the Spiral method dominates the other two methods for medium
size queries.
The performance results for large query regions are shown in table 6, and the
results of the corresponding ANOVA tests are shown in table 7. All the query regions
have an area of 57 600 in this table, and the skew factor lies between Õ 4 and 4.
Again, the aggregate performance of the IN and H methods is nearly within 3% of
one another. However, our ANOVA tests suggest that these di€ erences are not
signi® cant. We therefore conclude that the aggregate behaviou r of the two methods is
identical . On the other hand, the relative performance between IN and S exhibits a
somewhat similar trend to the one observed in table 4. We ® nd that when the skew
factor has a magnitude of 1, S is better than IN by about 26%. This gap grows to
about 29% when the skew factor is 1´56, and then drops to about 11% when the
skew factor increases to 4. ( It is not possible to ® nd a query region of this area size
and a larger skew factor.) On the basis of the results in tables 6 and 7 we conclude
Table 5.
ANOVA results for skewed query regions, region area=22 500.
IN-S
Skew
Cols
Rows
F-stat
P-value
1
2´25
2´25
150
225
100
150
100
225
Õ
4
4
300
75
75
300
Õ
6´25
6´25
375
60
60
375
500
45
45
500
46´61
1853´66
413´42
46´49
3326´41
692´25
52´83
2112´35
702´11
26´05
1483´98
1471´53
0´03
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´87
Õ
Õ
IN-H
11´1
11´1
F-stat
0´02
622´68
507´80
0´46
6721´81
7250´23
0´02
1317´73
1287´75
0´16
17 760´89
16 803´28
0´00
S-H
P-value
F-stat
P-value
0´90
0´00
0´00
0´50
0´00
0´00
0´88
0´00
0´00
0´69
0´00
0´00
0´98
8´76
32´62
47´45
39´71
181´40
191´37
186´49
26´28
56´03
39´65
0´01
0´24
0´09
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´00
0´94
0´62
0´77
279
Mean-variance analysis of the performance of spatial ordering methods
Table 6. Average number of clusters for skewed query regions, region area = 57 600
(the value in parentheses is the percentage improvement relative to IN).

Skew ratio    Cols  Rows   IN          S                H
1              240   240   238.8 (0)   176.8 (25.98)    246.5 (-3.23)
1.56           300   192   299.4 (0)   176.6 (41.02)    243.9 (18.53)
-1.56          192   300   190.9 (0)   172.0 (9.87)     247.3 (-29.60)
±1.56 pooled               245.1 (0)   174.3 (28.9)     245.6 (-0.2)
2.25           360   160   358.9 (0)   191.5 (46.66)    249.9 (30.38)
-2.25          160   360   159.0 (0)   191.4 (-20.32)   255.5 (-60.63)
±2.25 pooled               259.0 (0)   191.4 (26.1)     252.7 (2.4)
4              480   120   479.5 (0)   265.3 (44.67)    289.9 (39.55)
-4             120   480   115.6 (0)   263.7 (-128.07)  291.0 (-151.76)
±4 pooled                  297.6 (0)   264.5 (11.1)     290.5 (2.4)

Table 7. ANOVA results for skewed query regions, region area = 57 600.

                              IN-S               IN-H               S-H
Skew      Cols  Rows    F-stat   P-value   F-stat   P-value   F-stat   P-value
1          240   240     527.76   0.00       3.61    0.06     209.55    0.00
1.56       300   192    5298.56   0.00     155.53    0.00     203.17    0.00
-1.56      192   300     127.31   0.00     160.98    0.00     256.20    0.00
±1.56 pooled             574.13   0.00       0.01    0.92     228.92    0.00
2.25       360   160   19 679.98  0.00     489.51    0.00     136.85    0.00
-2.25      160   360     885.95   0.00     444.92    0.00     188.36    0.00
±2.25 pooled             214.55   0.00       0.93    0.33     160.87    0.00
4          480   120    2763.65   0.00     974.77    0.00      11.41    0.00
-4         120   480    1350.07   0.00     828.93    0.00      14.23    0.00
±4 pooled                 13.25   0.00       0.49    0.48      12.84    0.00
that the Spiral method significantly outperforms both the Hilbert and Inverted Naive
methods for large query regions, irrespective of orientation.
3.4. Discussion
The strength of the S method lies in the fact that it is somewhat similar to, and
has the benefits of, the IN method, and yet it is symmetric. The above results show
that in terms of average performance: (1) the Spiral method clearly dominates Inverted
Naive and Hilbert; and (2) the Inverted Naive and Hilbert methods are, on average,
equal to one another, assuming no prior knowledge about the skew factor. The
results in tables 2, 4 and 6 do, however, suggest that Hilbert has a slight advantage
because it is symmetric and not sensitive to the orientation of the query region (or
the sign of the skew factor), while Inverted Naive is very sensitive to it. Based on this, one
might argue that Hilbert is relatively better. However, such a conclusion would be
premature unless a more detailed examination of the variance of each method is
carried out, which we turn to in the next section.
4. Examination of variance
So far our emphasis has been on studying the mean values for the number of clusters.
In this section, we examine the sample variance for the three methods more closely.
The purpose of studying the variance is to get a sense of how much the actual
values might deviate from the average, which gives an indication of the
possible worst cases that might arise.
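The quantity being tabulated throughout this section can be reproduced with a small simulation. The sketch below is ours, not the paper's code: `inverted_naive` uses a snake (boustrophedon) order as one plausible reading of the Inverted Naive method, which is defined earlier in the paper, and `mean_std` estimates the mean and standard deviation of the cluster count over random placements of the query region.

```python
import random
import statistics

N = 512  # data space used throughout the paper (512 x 512 blocks)

def inverted_naive(x, y):
    # Snake order: row-major with every other row reversed, so each row's
    # end is contiguous with the next row's start. A stand-in for the
    # paper's Inverted Naive ordering (defined outside this excerpt).
    return y * N + (x if y % 2 == 0 else N - 1 - x)

def count_clusters(order, x0, y0, w, h):
    """Runs of consecutive disk blocks needed for a w x h query region."""
    keys = sorted(order(x, y)
                  for x in range(x0, x0 + w)
                  for y in range(y0, y0 + h))
    return 1 + sum(b != a + 1 for a, b in zip(keys, keys[1:]))

def mean_std(order, w, h, trials=500, seed=1):
    """Monte Carlo estimate over random placements of the query region."""
    rng = random.Random(seed)
    samples = [count_clusters(order,
                              rng.randrange(N - w + 1),
                              rng.randrange(N - h + 1), w, h)
               for _ in range(trials)]
    return statistics.mean(samples), statistics.pstdev(samples)
```

Under this order an interior 90×90 region needs exactly one cluster per row, i.e. 90, which is consistent with the spike at 90 that figure 3 reports for IN.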
Table 8 gives standard deviations for the same square query regions for which
the mean values were reported in table 1. Table 8 clearly shows that the standard
deviation is lowest for the IN method and highest for the H method, with Spiral
in between. The standard deviation of both the IN and H methods grows with query
region size; in the Spiral method, however, it first increases and then begins to
drop after reaching a maximum for query regions of size 240×240. This table,
in conjunction with table 1, clearly suggests that the IN method dominates the H
method for square query regions. On the other hand, between IN and S, even though
S does not clearly dominate IN for all square query regions in terms of variance,
the considerable advantage of S in terms of mean values clearly outweighs its slightly
higher standard deviation. Therefore, it is quite clear that for square query regions,
S is the best method, followed by IN.
Tables 9 and 10 respectively show the standard deviation values for small and
medium skewed query regions. Since these two tables exhibit similar patterns, we
will only discuss the latter. The standard deviation results shown in table 10
correspond to the mean values shown in table 4 for medium size query regions
(area = 22 500). Note that for each magnitude of the skew factor, the standard deviations
are first shown separately for each orientation, and then aggregated across
both orientations to obtain the standard deviation of the pooled sample. As before,
the discussion relates to the aggregate values because that offers a more reasonable
comparison. The following observations can be made from this table. As the skew
factor increases, the standard deviation for the IN method rises very fast, and for a
ratio of 2.25 it is slightly higher than the standard deviation for the H method.
Therefore, the H method performs better for skew factors greater than 2. Between S
and H there is a similar cut-over: S has a smaller standard deviation than H
for small values of the skew factor (less than 3), and H has a smaller standard deviation
than S for larger values of the skew factor (greater than 6); in between, the two methods
are comparable. Therefore, for small values of the skew factor, S dominates H, while for
Table 8. Standard deviations for square query regions.

Cols  Rows     Area      IN      S       H
30    30         900    0.92    0.36   10.58
60    60        3600    2.95    4.05   22.51
90    90        8100    0.00    7.52   32.40
120   120     14 400    7.47   15.37   45.28
150   150     22 500    7.39   25.82   53.81
180   180     32 400    7.96   36.82   68.74
210   210     44 100   10.44   50.19   74.29
240   240     57 600   11.91   59.19   90.01
270   270     72 900   11.95   56.20   96.69
300   300     90 000   11.54   50.64  109.88
330   330    108 900   17.91   43.49  118.49
360   360    129 600   28.59   36.49  129.10
390   390    152 100   21.18   28.35  142.25
420   420    176 400   27.89   22.61  159.51
450   450    202 500   44.01   15.04  160.87
480   480    230 400   55.09    7.93  185.36
Table 9. Standard deviations for skewed query regions, region area = 900.

Skew factor    Cols  Rows     IN      S       H
1                30    30    0.92    0.36   10.58
2.25             45    20    1.39   11.93   12.25
-2.25            20    45    0.96   12.11   12.09
±2.25 pooled                12.57   12.02   12.18
4                60    15    2.30   20.95    6.32
-4               15    60    0.32   20.91    6.32
±4 pooled                   22.49   20.92    6.53
6.25             75    12    2.86   28.61   20.05
-6.25            12    75    0.35   28.70   20.83
±6.25 pooled                31.48   28.65   20.44
11.1            100     9    2.19   39.73    7.61
-11.1             9   100    0.36   39.21    7.99
±11.1 pooled                45.51   39.51    7.80
25              150     6    3.31   56.81   38.12
-25               6   150    0.20   56.81   38.00
±25 pooled                  72.01   56.09   38.06
Table 10. Standard deviations for skewed query regions, region area = 22 500.

Skew factor    Cols  Rows      IN       S       H
1               150   150     7.39   25.82   53.81
2.25            225   100     5.01   43.39   58.48
-2.25           100   225     3.87   41.50   59.71
±2.25 pooled                 62.74   42.46   59.07
4               300    75     9.48   58.87   29.64
-4               75   300     2.86   60.29   29.41
±4 pooled                   112.58   59.56   29.53
6.25            375    60    11.82   94.77  100.70
-6.25            60   375     4.12   94.67   98.36
±6.25 pooled                157.77   94.72   99.56
11.1            500    45    11.14  133.04   36.86
-11.1            45   500     7.95  131.55   38.71
±11.1 pooled                229.26  132.23   37.78
large values of the skew factor, there is a trade-off: S has a smaller mean value but a
higher standard deviation.
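The effect of this pooling on IN can be seen in a toy calculation (the sample values are hypothetical, chosen to mimic the ±4 rows of table 11): each orientation alone has a tiny spread, but the two orientation means are far apart, so the pooled standard deviation is dominated by the gap between them.

```python
import statistics

# Hypothetical IN cluster counts: a 480 x 120 region (skew +4) and its
# 120 x 480 transpose (skew -4). Each orientation alone varies little.
pos = [479, 480, 480, 479, 480, 480]
neg = [115, 116, 116, 115, 116, 116]

within_pos = statistics.pstdev(pos)    # small: tight around ~480
within_neg = statistics.pstdev(neg)    # small: tight around ~116
pooled = statistics.pstdev(pos + neg)  # large: driven by the gap in means
```

This is why the pooled IN deviations in tables 9-11 dwarf the per-orientation ones, even though IN is the steadiest method within a single orientation.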
The standard deviation results shown in table 11 correspond to the mean values
shown in table 6 for large size query regions (area = 57 600). The following
observations can be made from this table. As the skew factor increases, the standard
deviation for the IN method rises very fast, and for a ratio of 2.25 it is slightly lower than the
standard deviation for the H method. Therefore, the H method performs better for
skew factors greater than 2.25. This behaviour is somewhat consistent with that
observed above for small and medium size query regions. In conjunction
with table 6, however, table 11 clearly shows that the S method dominates both the
IN and H methods for large query regions.
To get a better perspective on the distribution pattern for a given query with the
three methods, we plotted distribution graphs for sample query region sizes. The
distribution patterns for a 90×90 query region are depicted in the histograms shown
in figure 3 for all three methods. The IN and S methods have a large spike at 90,
Table 11. Standard deviations for skewed query regions, region area = 57 600.

Skew ratio     Cols  Rows      IN       S       H
1               240   240   11.91   59.19   90.01
1.56            300   192    9.45   36.52   99.06
-1.56           192   300   10.41   35.98   98.85
±1.56 pooled                55.20   36.31   98.92
2.25            360   160   13.91   22.77  109.28
-2.25           160   360    8.68   22.74  101.93
±2.25 pooled               100.66   22.74  105.65
4               480   120   10.69   90.48  135.37
-4              120   480   15.59   88.77  135.33
±4 pooled                  182.54   89.59  135.28
while the distribution for the H method is scattered into three sub-regions, ranging
in value from 37.5 to 150. This explains the higher standard deviation observed for
H. Similar distribution plots, for a 480×480 query region, are shown in figure 4.
Again notice that the spread of values is smallest for IN, and largest for H. In fact,
in the worst case the H method accesses about 800 clusters. The spread for the S
method is quite small, and all values lie in the range from 10 to 40.
We also made similar plots for skewed query regions in order to understand the
distribution patterns for the three methods. We selected 18×450 and 450×18 query
regions because the skew factor in these extreme cases is -25 and 25, and the area
is the same as that of a 90×90 region. The graphs are shown in figures 5 and
6. It should be noted that the spread is largest for the S method (in the range
from 18 to 450), while for the H method there are two subregions clumped together
(one subregion lies between 100 and 150, and the second lies between 325 and 375).
Also notice that the distribution patterns for S and H are quite similar in both
figures 5 and 6. Again, as expected, the IN method has the least standard deviation,
but the mean value is 18 in one case and 450 in the other.
An intuitive explanation of these results is as follows. For large skew factors, the
variance of the S method is large; i.e. it is very sensitive to the location of the query
region. This agrees with our intuition: when the query region is near the
boundaries of the data space, i.e. the long side of the query region lies parallel to
and very close to one of the four boundaries, the number of clusters is small,
nearly equal to the short dimension of the query region (18 for an 18×450 or 450×18
region). On the other hand, when the long side of the query region lies away from
a boundary, the number of clusters increases. The worst case occurs when the
query region lies near the centre of the data space, where there are as many
clusters as the long side of the query region (450 for an 18×450 or 450×18 region).
For a square of the same area (say, a 90×90 square), the variance of the S
method is very small because the number of clusters is insensitive to the exact
location of the query region in the data space. Therefore, with the S method, the
variance is much higher for a skewed rectangle than for a square of the same area, and
increases with the skew factor.
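This boundary-versus-centre behaviour is easy to reproduce. The sketch below uses a concentric-ring spiral (outermost ring first) as one plausible reading of the S method, whose exact definition appears earlier in the paper: an 18×450 strip hugging the left edge cuts each of 18 rings in a single contiguous arc, while the same strip through the centre cuts hundreds of rings in two arcs each.

```python
def spiral_order(n=512):
    # Order cells ring by ring, outermost ring first, walking each ring's
    # perimeter; one plausible reading of the paper's Spiral method.
    order, d = {}, 0
    top, left, bottom, right = 0, 0, n - 1, n - 1
    while top <= bottom and left <= right:
        for x in range(left, right + 1):            # top edge, left -> right
            order[(x, top)] = d; d += 1
        for y in range(top + 1, bottom + 1):        # right edge, downwards
            order[(right, y)] = d; d += 1
        if top < bottom:
            for x in range(right - 1, left - 1, -1):  # bottom edge
                order[(x, bottom)] = d; d += 1
        if left < right:
            for y in range(bottom - 1, top, -1):      # left edge, upwards
                order[(left, y)] = d; d += 1
        top, left, bottom, right = top + 1, left + 1, bottom - 1, right - 1
    return order

def clusters(order, x0, y0, w, h):
    keys = sorted(order[(x, y)]
                  for x in range(x0, x0 + w)
                  for y in range(y0, y0 + h))
    return 1 + sum(b != a + 1 for a, b in zip(keys, keys[1:]))

S = spiral_order()
edge = clusters(S, 0, 31, 18, 450)      # strip hugging the left boundary
centre = clusters(S, 247, 31, 18, 450)  # same strip through the centre
```

With this ordering, the edge placement needs 18 clusters and the central placement several hundred, matching the wide spread that figures 5 and 6 report for S.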
In the H method also, the variance for a skewed query region increases with
increasing skew at first, but there are two differences: (1) the increase in variance is
not as pronounced as in the Spiral method; and (2) as the skew ratio increases, the
variance tends to oscillate, as evidenced by the data in tables 9 and 10. We are not
Figure 3. Distribution plot (90×90).
able to explain this behaviour fully, but we think it might partly be due to the fact
that the H method tends to perform better when the query region is aligned with
the pattern or boundaries of the Hilbert curve, which is more likely to occur when
the query region lies exactly on a corner and overlaps two boundaries of the data
space. By contrast, the performance of the method tends to be poor when the query
region is poorly aligned with the curve, which is more likely to occur when the
query region lies in the centre of the data space. A skewed query region can `hug'
the boundary even more so than a square query region of the same area, and the
Figure 4. Distribution plot (480×480).
difference between its performance at the centre versus at the boundary is much
greater. Hence, the variance tends to increase with the skew factor. However, once
the skew factor becomes `too large', the query region is always in the proximity
of some boundary! This is illustrated in table 10, where the variance drops when the
skew factor is 11.1. This is because the size of the query region is 500×45, and in a
data space of 512×512 it is always in the vicinity of a boundary, and never really
in the `centre'.
Figure 5. Distribution plot (18×450).
5. Discussion and comparison with related work
These results clearly show that deciding which method is best is a complex issue
even in the special case that we have studied, i.e. when the blocking factor is 1. We can,
however, summarize our findings in the form of a series of recommendations that
apply under various scenarios.
Scenario 1: Nothing is known about the expected workload
S is the best method.
Scenario 2: Only the skew factor is known
If skew = 1 (i.e. only square queries), then S is the best method.
Figure 6. Distribution plot (450×18).
If skew > 1, then column-wise IN is the best method.
If skew < 1, then row-wise IN is the best method.
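Expressed as code (the function and its name are ours), the recommendations so far read:

```python
def recommend_method(skew=None):
    """Scenario 1-2 recommendations for a blocking factor of 1.

    skew: expected skew factor of query regions, or None if unknown.
    """
    if skew is None or skew == 1:
        return "Spiral"                      # Scenario 1, or square queries
    if skew > 1:
        return "column-wise Inverted Naive"  # wide query regions
    return "row-wise Inverted Naive"         # tall query regions (skew < 1)
```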
In selecting the Spiral method as the default when nothing is known,
our strategy is to minimize the downside risk in the worst case, which could occur
when the area is large (say, a 480×480 query region), while minimizing the overall
mean and standard deviation of the number of clusters that must be accessed for queries
of various sizes and orientations. Here the H method could result in close to 800
clusters! (With the IN and S methods, the absolute worst case is 512 clusters under
all circumstances.) Moreover, as query regions get larger in area, the maximum skew
factor is constrained, and the relative advantage of the H method declines.
Furthermore, for large square query regions the S method has a big advantage over
H. Thus, the rationale is to minimize the error when the query regions are large,
and query processing times would be longer than in the case where query regions
are small.
A small caveat to the above generalizations should be mentioned here. We varied
the area of the query region over a wide range, and found that over a very small
part of this range, for example when the query region area is 22 500 and the skew
factor is 11.1, the means for both the S and H methods are close, but the standard
deviation of the S method is about two to three times larger than that of the
H method. In this case, we would consider the H method to be
superior to the S method. However, such cases were very few, and it was not possible
to generalize them in a meaningful manner. Therefore, we believe that in the absence
of precise knowledge about the dimensions of the query region, the Spiral method
is almost uniformly superior.
The most relevant related work that we are familiar with is by Jagadish (1990).
Some of the conclusions in that study are similar to ours (for example, that there is
no clear winner between IN and H in terms of the overall average number of
clusters), but that study did not examine sample variances, nor did it consider
the Spiral method. As our study shows, both these factors have a considerable impact
on the final conclusions. In particular, the default method (Scenario 1) in Jagadish
(1990) was found to be Hilbert, while our study suggests that Spiral would be a better
default method. In fact, we feel that even IN might be superior to Hilbert as the
default method because it reduces the downside risk for large query regions.
6. Conclusions
GIS need efficient ways of storing raster or grid data in a database system. The
data have to be clustered (or ordered) properly for efficient retrieval and fast query
processing. In this paper, we discussed several clustering methods and showed that
the choice of an appropriate method has considerable performance implications for
a GIS. In particular, we conducted a detailed investigation into the clustering
performance of the Hilbert method (which has been shown previously in various
other studies to outperform other recursive methods) and two non-recursive methods,
Inverted Naive and Spiral. The Spiral method has not been studied previously, nor
have earlier studies examined the variance issue, which we have found to be of crucial
importance.
The most important conclusion of this paper is that, in the case when the blocking
factor is 1, the Hilbert method has a fairly high variance and is inferior to the
non-recursive methods. This point has been overlooked in previous studies, where
only average values were compared and variance was not considered. This variance
naturally plays an important role in the choice of the best method for spatial
clustering. Even though these results apply to a special case and do not generalize
to higher blocking factors, we believe that they are rather surprising and,
therefore, useful for both researchers and practitioners to know.
Our main conclusion is that while all three methods perform well in their own
niches, in situations where no information is available on the nature of the query
regions, the Spiral and IN methods (in that order) are better than the Hilbert method.
Our analysis of sample variance showed that in the worst case, the performance of
the Hilbert method can be extremely poor.
We believe that a good clustering method can have an even greater impact on
performance if positive autocorrelation is present in the data (which is often the case
for spatial data). In future work, we expect to quantify the effect of this factor on
the performance of the various methods.
References
Abel, D., and Smith, J., 1983, A data structure and algorithm based on a linear key for a
rectangle retrieval problem. Computer Vision, Graphics and Image Processing, 24, 1-13.
Batty, P., 1992, Exploiting relational database technology in a GIS. Computers and
Geosciences, 18, 453-462.
Bentley, J., 1975, Multidimensional binary search trees used for associative searching.
Communications of the ACM, 18, 509-517.
Dandamundi, S., and Sorenson, P., 1985, An empirical performance comparison of some
variations of the k-d tree and BD tree. International Journal of Computer and
Information Sciences, 14, 135-159.
Dandamundi, S., and Sorenson, P., 1986, Algorithms for BD-trees. Software Practice and
Experience, 16, 1077-1096.
Faloutsos, C., 1988, Gray codes for partial match and range queries. IEEE Transactions on
Software Engineering, 14, 1381-1393.
Faloutsos, C., and Roseman, S., 1989, Fractals for secondary key retrieval. In Proceedings
of the Eighth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems
held in Austin, Texas on 29-31 March 1989 (New York: Association for Computing
Machinery), pp. 247-252.
Finkel, R., and Bentley, J., 1974, Quad trees: a data structure for retrieval on multiple keys.
Acta Informatica, 4, 1-9.
Goodchild, M., 1992, Geographical data modeling. Computers and Geosciences, 18, 401-408.
Guttman, A., 1984, R-trees: a dynamic index structure for spatial searching. In Proceedings
of the ACM SIGMOD International Conference on Management of Data held in Boston
on 18-21 June 1984, edited by Beatrice Yormark (New York: Association for
Computing Machinery), pp. 47-57.
Hinrichs, K., 1985, Implementation of the grid file: design concepts and experience. BIT,
25, 569-592.
Jagadish, H. V., 1990, Linear clustering of objects with multiple attributes. In Proceedings of
the ACM SIGMOD International Conference on Management of Data held in Atlantic
City on 23-25 May 1990, edited by H. Garcia-Molina and H. Jagadish (New York:
Association for Computing Machinery), pp. 332-341.
Kumar, A., 1994, A study of spatial clustering techniques. In Proceedings of the 5th
International Conference on Database and Expert Systems Applications held in Athens,
Greece on 7-9 September 1994, Lecture Notes in Computer Science, 856, 57-71.
Nievergelt, J., Hinterberger, H., and Sevcik, K., 1984, The grid file: an adaptable, symmetric
multikey file structure. ACM Transactions on Database Systems, 9, 38-71.
Orenstein, J., 1986, Spatial query processing in an object-oriented database system. In
Proceedings of the ACM SIGMOD International Conference on Management of Data
held in Washington, D.C. on 28-30 May 1986, edited by C. Zaniolo (New York:
Association for Computing Machinery), pp. 326-336.
Orenstein, J., 1989, Redundancy in spatial database systems. In Proceedings of the ACM
SIGMOD International Conference on Management of Data held in Portland, OR on
May 31-June 2 1989, edited by H. Garcia-Molina and H. Jagadish (New York:
Association for Computing Machinery), pp. 294-305.
Orenstein, J., and Merrett, T., 1984, A class of data structures for associative searching. In
Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database
Systems held in Waterloo, Canada on 2-4 April 1984 (New York: Association for
Computing Machinery), pp. 181-190.
Robinson, J. T., 1981, The K-D-B tree: a search structure for large multidimensional dynamic
indexes. In Proceedings of the ACM SIGMOD International Conference on Management
of Data held in Ann Arbor, Michigan, April 29-May 1, 1981, edited by Y. E. Lien (New
York: Association for Computing Machinery), pp. 10-18.
Samet, H., 1991, The Design and Analysis of Spatial Data Structures (Reading: Addison-Wesley).
Star, J., and Estes, J., 1990, Geographical Information Systems: An Introduction (Englewood
Cliffs: Prentice Hall).