int. j. geographical information science, 1998, vol. 12, no. 3, 269–289

Research Article

Mean-variance analysis of the performance of spatial ordering methods

AKHIL KUMAR
College of Business, University of Colorado, Campus Box 419, Boulder, CO 80309, USA

WALEED A. MUHANNA
Fisher College of Business, The Ohio State University, 1775 College Road, Columbus, OH 43210, USA
email [email protected]

and RAYMOND A. PATTERSON
School of Management, The University of Texas at Dallas, P.O. Box 830688, Richardson, TX 75083, USA

(Received 4 July 1995; accepted 13 June 1997)

Abstract. Geographical Information Systems (GIS) involve the manipulation of large spatial data sets, and the performance of these systems is often determined by how these data sets are organized on secondary storage (disk). This paper describes a simulation study investigating in extensive detail the performance of two non-recursive spatial clustering methods, the Inverted Naive and the Spiral methods, and comparing them with the Hilbert fractal method, which has been shown in previous studies to outperform other recursive clustering methods. The paper highlights the importance of analysing the sample variance when evaluating the relative performance of various spatial ordering methods. The clustering performance of the methods is examined in terms of both the mean and variance values of the number of clusters (runs of consecutive disk blocks) that must be accessed to retrieve a query region of a given size and orientation. The results show that, for a blocking factor of 1, the mean values for the Spiral method are the best, and on average about 30% better than for the other two methods. In terms of variance, the Inverted Naive method is the best, followed by the Spiral and Hilbert methods, in that order. We also study the impact of varying query size and the skew ratio (between the X and Y dimensions) for each method.
While these performance results do not generalize to higher blocking factors, we believe that they are useful for both researchers and practitioners to know, because several previous studies have also examined this special case, and because it has important implications for the performance of GIS applications.

1365–8816/98 $12.00 © 1998 Taylor & Francis Ltd.

1. Introduction
Spatial and multi-dimensional data are commonly encountered in many Geographical Information System (GIS) applications, and the organization of such data for efficient retrieval has emerged as a significant area of research. Two kinds of queries arise frequently in spatial data management in GIS (Star and Estes 1990):
1. What is found at a given location? For example:
   - What is the elevation of a specific geographical location?
   - At a specified coordinate, what is the soil type?
2. Are there examples of specified objects within a specified area? For example:
   - Are there any fire towers in a specified region?
   - Where are the sources of potable water in a specified region?
In this paper, we focus on the second kind of queries, and study how they can be answered efficiently in a GIS. Geographical information systems and geographical data modelling approaches often involve the use of relational database technology (Goodchild 1992, Batty 1992). Consequently, geographical data is stored in relational database systems, and a common data structure for GIS data is a raster or grid (Star and Estes 1990). In a raster structure, a value for the parameter of interest is developed for every cell (or pixel) in a grid. In the database system, this data is stored in files or relations on disk in a continuous and linear manner. For example, two simple ordering schemes are row-wise ordering and column-wise ordering. In row-wise ordering, the grid data is stored by rows (i.e. one cell in a row followed by another, and so on), while in column-wise ordering, it is stored by columns.
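For concreteness, the two linear orderings can be written as index functions mapping a grid cell to its position on disk. The sketch below uses our own naming and the convention that x indexes columns and y indexes rows; it is illustrative only:

```python
def row_wise_key(n, x, y):
    """Linear disk position of cell (x, y) in an n-by-n grid stored row by row."""
    return y * n + x

def column_wise_key(n, x, y):
    """Linear disk position when the same grid is stored column by column."""
    return x * n + y
```

Under row-wise ordering, two vertically adjacent cells end up n positions apart on disk, and column-wise ordering has the mirror-image problem for horizontally adjacent cells.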
However, the obvious drawback of these techniques is that in row-wise ordering two neighbouring cells in adjacent rows are stored at locations far from one another, while in column-wise ordering the same problem occurs for two cells in neighbouring columns. The objective of spatial ordering (or spatial clustering) is to store data about neighbouring cells in close physical proximity on disk. This speeds up processing of the queries described above, because the time spent accessing data from the disk is reduced when the desired data is stored more contiguously. Therefore, the objective of this paper is to compare various spatial ordering techniques and determine how they perform across various kinds of queries.

Another potential benefit of good spatial ordering can result from positive autocorrelation (Star and Estes 1990, Goodchild 1992). Positive autocorrelation means that neighbouring grid cells are likely to have similar attribute values (e.g. elevation, rainfall amounts, etc.). It is possible to take advantage of this correlation by reducing the amount of information stored, for instance when neighbouring cells have identical values for a certain attribute. As noted by Goodchild (1992), the similarity between variables increases as the locations converge. In this paper, we simply note this as a side benefit and defer further investigation of it to future work.

One approach for processing spatial queries efficiently consists of indexing spatial data. Indexing techniques, like B-trees, which work well on one-dimensional data, do not perform as well when the data lies in multi-dimensional space, and, therefore, new techniques have been devised for such data. Two of the earliest methods for organizing spatial data are quadtrees and K-D trees. Assuming two-dimensional data, quadtrees (Finkel and Bentley 1974) successively divide the data space into four smaller quadrants at each stage by examining the X and Y coordinate values.
(A variant of quadtrees called MX-CIF quadtrees (Abel and Smith 1983) has been implemented in a GIS software product from IBM called GeoManager (Batty 1992).) K-D trees (Bentley 1975), on the other hand, divide the space into two parts at each stage by examining alternating coordinates, one coordinate at a time. However, both of these structures are primarily intended for internal searching, where the entire tree structure is main-memory resident; as such, they are not very suitable for external searching, where the tree structure is stored on a secondary storage device. These trees can also become unbalanced easily.

A common approach for organizing such data on secondary storage, such as disk, is to divide a k-dimensional space or region into smaller k-dimensional partitions or subregions, and store one or more partitions in a single disk block or bucket. To answer a query, the relevant buckets are brought into main memory from disk and searched. In Grid files (Nievergelt et al. 1984, Hinrichs 1985), a directory is maintained to track the correspondence between partitions and disk buckets. In KDB trees (Robinson 1981), the regions are organized into a tree similar to a B-tree. The leaf pages of this tree correspond to disk buckets, while higher-level pages encompass successively larger non-overlapping regions, which are navigated in order to locate the appropriate leaf page. R-trees (Guttman 1984) are somewhat similar; however, they permit overlapping regions. Both KDB-trees and R-trees can be maintained on secondary storage. Examples of other structures are BD trees (Dandamudi and Sorenson 1985, 1986) and zkdb trees (Orenstein and Merrett 1984, Orenstein 1986, 1989). Samet (1991) presents an excellent discussion of these methods. Since the disk space is a continuum of blocks, it is logically one-dimensional.
Therefore, a mapping is required from the multi-dimensional data space into the one-dimensional disk space. A very desirable property for such a mapping is that it should be distance preserving; i.e. points that are close in the multi-dimensional space should also be close in the one-dimensional (disk) space. If this is the case, then the various buckets that must be accessed to answer a query will be contiguous, or sequentially clustered together. This reduces the number of random disk accesses (disk seek operations and rotational delay) required, thereby speeding up the total access time for all the buckets. In the worst case, on the other hand, the buckets could be scattered across the disk, resulting in poor access performance. Therefore, good clustering is a very useful property, and several existing data structures stand to benefit from it.

Several techniques for spatial clustering have been proposed. They can be classified into two groups:
1. Non-recursive techniques, such as inverted naive and spiral ordering.
2. Recursive techniques, such as binary z-ordering, gray z-ordering, gray nu-ordering and the Hilbert curve.
The non-recursive techniques follow some simple pattern for traversing (ordering) the points in the data space. The recursive techniques are based on recursively partitioning a data space, in a pattern that repeats itself at each level of recursion. The binary z-ordering (BZ) method was proposed and studied in Orenstein and Merrett (1984) and Orenstein (1986). Gray z-ordering (GZ) was proposed by Faloutsos (1988) and shown to outperform BZ. Subsequently, Hilbert (H) ordering was described by Faloutsos and Roseman (1989) and shown to be superior to both BZ and GZ ordering. Kumar (1994) proposed another technique called gray nu-ordering (GN), but that study showed that Hilbert was the best among these four recursive methods. These studies have clearly established that Hilbert is the best method among the recursive ones.
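To illustrate the recursive family, binary z-ordering can be computed by interleaving the bits of the two coordinates. The following is a minimal sketch (which coordinate contributes the more significant bits is a convention; here we let x lead):

```python
def z_key(x, y, bits):
    """Binary z-ordering (Morton code): interleave the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)  # bit i of x goes to position 2i+1
        key |= ((y >> i) & 1) << (2 * i)      # bit i of y goes to position 2i
    return key
```

Gray z-ordering and gray nu-ordering modify this interleaving using Gray codes, while the Hilbert curve replaces it with a recursive pattern of rotations and reflections.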
However, the case for the superiority of Hilbert over the non-recursive methods has not been made so clearly, particularly relative to the Spiral method, whose performance has not been investigated in the literature. In our present study, we investigate, using stochastic Monte Carlo simulations, the performance of the (recursive) Hilbert method and two non-recursive methods, the Inverted Naive and Spiral, in very extensive detail. We examine the clustering performance of the methods in terms of both the mean and variance values of the number of clusters (runs of consecutive disk blocks) that must be accessed to retrieve a query region. Previous studies judged relative performance solely on the basis of sample means. This paper highlights the importance of analysing the sample variance when evaluating the relative performance of various methods. The results are very interesting because they suggest that the non-recursive methods generally perform better than the Hilbert method. In the past, researchers (e.g. Jagadish 1990) have noted that the inverted naive method has a bias in one dimension or the other, and argued that if the orientation of a query is unknown then the naive method would work well for queries of the `favourable' orientation, and poorly for queries of the `unfavourable' orientation. This has been a basis for rejecting the method in situations where the query orientation is not known in advance, which is usually the case.

One caveat should be clearly noted at the outset. Our results and comparisons apply to the special case where the blocking factor is 1; i.e. as in Faloutsos (1988), we assume that each point in the multi-dimensional data space is mapped to exactly one disk bucket (block); each grid point is a unit of access from disk.
Even though the results do not generalize to larger blocking factors (admittedly, recursive methods are better for larger blocking factors), we believe they are worth reporting because: (1) they advance the results of previous studies which made the same assumptions; (2) the results are important and useful for researchers to know; and (3) we believe that in certain situations a blocking factor of 1 is quite realistic. For an example of where a blocking factor of 1 might arise, consider a geographical information system in which maps and satellite images are stored and searched electronically, and a user can select and click on a small section of a map in order to zoom in and expand that section. Here, each grid point will be associated with a binary large object (BLOB) which will be stored with it in the same disk bucket. Since the BLOB is a large object, it is reasonable to assume that one grid point maps into one disk bucket.

The balance of the paper is organized as follows. Section 2 briefly describes the three methods being evaluated. Section 3 describes the results of a series of simulation experiments and statistical (ANOVA) tests conducted to evaluate the relative performance of the three methods under different conditions. Section 4 examines the sample variance for the three methods (on queries of different sizes and orientations) more closely. Section 5 provides a discussion of our results and a comparison with results obtained from previous studies. Section 6 concludes the paper.

2. Clustering schemes
In this section, we briefly describe the Inverted Naive, Spiral and Hilbert methods.

2.1. Inverted Naive (IN) ordering
The clustering pattern created by the IN method, also known as column-wise snake scan (Jagadish 1990) and column-prime order (Samet 1991), is shown in figure 1(a) for an 8 × 8 data space. It traces a path of successive vertical lines, going in alternate directions, until the entire grid is covered.
This ordering method covers the entire grid by taking only one step at a time; however, there are clearly more vertical steps than horizontal steps.

2.2. Spiral method (S)
The clustering pattern created by the Spiral method is shown in figure 1(b). It starts at one of the corners of the grid and produces a spiral path of successively smaller loops until the centre of the grid is reached.

Figure 1. Inverted-Naive and Spiral ordering in an 8 × 8 data space.

Again, this ordering method also covers the entire grid by taking only single steps at a time, but it is more symmetrical than the IN method because there are roughly equal numbers of up, down, left and right steps.

2.3. Hilbert curve (H) clustering
Hilbert ordering is produced by taking an initial pattern and rotating and reflecting it to produce a new pattern that spans a larger region. This idea is illustrated in figure 2. The first pattern, H1, corresponds to the case where the X and Y coordinates are represented by one bit. In the next two patterns, H2 and H3, each coordinate is represented by 2 and 3 bits respectively. Given the first pattern, any subsequent pattern can be constructed visually in an inductive manner as follows. Make a copy of the previous pattern and rotate the copy by 90° clockwise. Place the rotated copy directly under the pattern. Connect the top-left corner of the copy with the bottom-left corner of the pattern. Make a vertically reflected mirror-image copy of this new pattern on the right. Join the two images. Basically, as shown in figure 2, each successive pattern is four times the size of the previous pattern. An algorithm for determining the Hilbert code for a given coordinate position in a 2^n × 2^n grid can be found in Faloutsos and Roseman (1989).
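The three orderings can be realized as mappings from grid coordinates to linear disk positions. The following sketch uses our own conventions (an n × n grid with n a power of two, x indexing columns and y indexing rows, origin in one corner); the Hilbert routine is the standard bitwise formulation of the curve rather than the specific algorithm of Faloutsos and Roseman (1989), so its positions may differ from theirs by a rotation or reflection of the pattern:

```python
def inverted_naive_key(n, x, y):
    """Column-wise snake scan: successive columns alternate direction."""
    return x * n + (y if x % 2 == 0 else n - 1 - y)

def spiral_order(n):
    """Return {(x, y): position} for a spiral walk from one corner
    inward to the centre of an n-by-n grid."""
    order, k = {}, 0
    x0 = y0 = 0
    x1 = y1 = n - 1
    while x0 <= x1 and y0 <= y1:
        for x in range(x0, x1 + 1):              # top edge, left to right
            order[(x, y0)] = k; k += 1
        for y in range(y0 + 1, y1 + 1):          # right edge, downwards
            order[(x1, y)] = k; k += 1
        if y0 < y1:
            for x in range(x1 - 1, x0 - 1, -1):  # bottom edge, right to left
                order[(x, y1)] = k; k += 1
        if x0 < x1:
            for y in range(y1 - 1, y0, -1):      # left edge, upwards
                order[(x0, y)] = k; k += 1
        x0 += 1; y0 += 1; x1 -= 1; y1 -= 1
    return order

def hilbert_key(n, x, y):
    """Position of (x, y) along a Hilbert curve filling an n-by-n grid."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                  # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

All three are bijections onto 0 ... n² − 1, and for the snake scan and the spiral, consecutive positions are always exactly one grid step apart, which is what makes them step-1 (non-recursive) traversals.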
The (almost perfect) symmetry of the recursive Hilbert curve in terms of the number of steps in the `up', `down', `left' and `right' directions can be seen in figure 2.

Figure 2. Hilbert ordering (H1, H2, H3).

3. Performance evaluation
3.1. Introduction and overview of tests
This section describes the results of a series of simulation experiments and statistical tests conducted to evaluate the performance of the three different schemes discussed above. These results give insights into how the various methods fare relative to one another in a variety of situations. Our performance criterion is the average number of clusters that must be accessed to answer a query. As mentioned earlier, each grid point is a unit of access from disk. For instance, say the nine grid elements that must be accessed to answer a 3 × 3 query are mapped to disk blocks 10, 12–13, 16–20 and 25. In this case there are 4 clusters; on the other hand, if the grid elements were numbered 10–14 and 22–25, this would correspond to only 2 clusters. From a clustering point of view, the second arrangement is 50% better than the first one. The simulation algorithm and a brief overview are given next, and the results of the experiments follow afterwards.

The IN, S and H clustering schemes were tested by simulation experiments using the above criterion. We considered a two-dimensional 512 × 512 data space (we also considered a 1024 × 1024 data space; the results were in general agreement with those reported here for the 512 × 512 data space) and randomly generated (by simulation) the boundaries of rectangular query regions of various sizes within it as follows. The X and Y coordinates of the bottom-left corner of the query rectangle were generated randomly, and a query rectangle of certain given dimensions was placed there. Then, the number of clusters into which the buckets lying within the query region would fall was determined.
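The cluster-counting criterion and the sampling procedure just described can be sketched in runnable form as follows. This is our own illustrative code, not the authors' listing: `key` stands for any coordinate-to-position mapping, and the mean and variance are computed with the usual unbiased sample estimators:

```python
import random

def count_clusters(blocks):
    """Number of runs of consecutive block numbers, e.g.
    {10, 12, 13, 16, 17, 18, 19, 20, 25} gives 4 clusters."""
    blocks = sorted(blocks)
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)

def simulate(key, n, qx, qy, trials=500, seed=1):
    """Sample mean and standard deviation of the cluster count for
    randomly placed qx-by-qy query regions in an n-by-n data space,
    under the ordering given by key(n, x, y)."""
    rng = random.Random(seed)
    total = total_sq = 0
    for _ in range(trials):
        x0 = rng.randrange(n - qx + 1)   # random bottom-left corner
        y0 = rng.randrange(n - qy + 1)
        region = [key(n, x, y)
                  for x in range(x0, x0 + qx)
                  for y in range(y0, y0 + qy)]
        c = count_clusters(region)
        total += c
        total_sq += c * c
    mean = total / trials
    var = (trials * total_sq - total * total) / (trials * (trials - 1))
    return mean, var ** 0.5
```

As a sanity check, with a plain row-wise key every qx × qy query (with qx < n) falls into exactly qy clusters, one run per row, so the sample mean is qy and the standard deviation is 0.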
The steps of the simulation algorithm are listed as follows:

Algorithm Simulation
  sum_clust = 0; sumSQ = 0;
  repeat (500 times)
  begin
    count = 0;
    Generate a random query region of size q1 × q2;
    For each grid element g on the boundary of the region
      If (successor(g) lies outside the query region) count++;
    sum_clust = sum_clust + count;
    sumSQ = sumSQ + (count * count);
  end
  sample_mean = sum_clust / 500;
  sample_variance = ((500 * sumSQ) - (sum_clust * sum_clust)) / (500 * 499);
  sample_std_dev = sqrt(sample_variance);

The experiment is repeated 500 times and the values of the sample mean and standard deviation are reported. (In cases involving extremely large query region sizes, where fewer than 500 distinct samples exist in the data space, complete enumeration was performed.) The algorithm randomly places a query region of a fixed size inside the data space, and examines each grid point on the boundary of the query region. If the successor of a grid point lies outside the region, then the cluster count is increased by 1. The cluster count after all the grid points on the boundary have been examined is the total number of clusters for the given query region.

3.2. Square query regions
The size of the data space was maintained at 512 × 512, while the query region size was varied from 30 × 30 to 480 × 480. The sample means are shown in table 1 for the three methods. The percentages in parentheses show how the other methods fare relative to IN clustering. A positive % value indicates that a method, on average, does better than IN, while a negative % value indicates that it does worse.

Table 1. Average number of clusters (square query regions).

Cols  Rows  Area     IN           S               H               F-stat   P-value
30    30    900      29.94 (0%)   29.98 (-0.13%)  29.24 (2.34%)   2.30     0.10
60    60    3600     59.70 (0%)   59.45 (0.42%)   58.59 (1.86%)   0.96     0.39
90    90    8100     90.00 (0%)   88.71 (1.43%)   89.64 (0.40%)   0.60     0.55
120   120   14 400   119.05 (0%)  115.66 (2.85%)  117.33 (1.44%)  1.84     0.16
150   150   22 500   149.26 (0%)  141.06 (5.49%)  149.03 (0.15%)  9.04     0.00
180   180   32 400   179.29 (0%)  161.48 (9.93%)  174.81 (2.50%)  20.95    0.00
210   210   44 100   208.95 (0%)  174.33 (16.57%) 209.75 (-0.38%) 75.30    0.00
240   240   57 600   238.80 (0%)  176.77 (25.98%) 246.51 (-3.23%) 186.69   0.00
270   270   72 900   268.93 (0%)  166.51 (38.08%) 272.53 (-1.34%) 429.70   0.00
300   300   90 000   299.10 (0%)  141.05 (52.84%) 291.34 (2.59%)  806.08   0.00
330   330   108 900  328.03 (0%)  120.42 (63.29%) 328.18 (-0.05%) 1327.01  0.00
360   360   129 600  355.33 (0%)  103.00 (70.90%) 357.08 (-0.49%) 1703.76  0.00
390   390   152 100  387.67 (0%)  83.37 (78.49%)  385.40 (0.59%)  2138.77  0.00
420   420   176 400  416.23 (0%)  61.04 (85.34%)  411.95 (1.03%)  2331.58  0.00
450   450   202 500  441.03 (0%)  42.61 (90.34%)  443.52 (-0.56%) 2848.16  0.00
480   480   230 400  466.60 (0%)  22.11 (95.26%)  467.03 (-0.09%) 2639.93  0.00

To analyse how the two factors (clustering method and query size) affect performance (average number of clusters), we first conducted a two-factor ANOVA test. As expected, the test revealed the presence of significant main factors as well as interaction effects (p-value < 0.0001). In view of this, additional (single-factor) ANOVA tests were conducted to examine, at each fixed query-size level, whether or not there are real (i.e. statistically significant) performance differences among the three clustering methods. The results, shown in the last two columns of table 1, indicate the presence of significant performance differences for all square query regions equal to or larger than 150 × 150.
Additional ANOVA runs indicate that the observed differences in the average performance between the IN and H methods are not statistically significant (0.05 level). We therefore accept the null hypothesis and conclude that IN and H perform equally for all square query regions. In comparison, the performance of the Spiral method continues to improve relative to the IN method as the query region grows larger. The performance gap, in favour of S, is statistically significant for query regions 150 × 150 in size or larger, and widens sharply as the query region size increases; for a query region size of 480 × 480 the gap is 95%. Clearly, the Spiral method performs better than the other two methods for square query regions.

3.3. Skewed query regions
In the next set of experiments, we considered skewed query regions, where one side of the query region was longer than the other. Our objective here was to see how the skew factor (i.e. the ratio between the long side and the short side of the query region) affects the relative performance of these methods. We also wanted to see how the total area of the query region would affect performance, and whether there was any pattern similar to the one found for square regions, where the relative performance of the methods was a function of the area of the query region. Consequently, we considered query regions of a fixed area at a time, but varied the X and Y dimensions of the regions in such a way that the area remained constant. For a fixed area value, the lengths of the X and Y dimensions of the query region were varied in such a way that the skew ratio (i.e. the ratio between the long side and the short side) ranged from -25 to +25. (By convention, a minus sign for the skew ratio indicates that the long side is along the X axis (more columns than rows), while a plus sign indicates the opposite.)
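The sign convention for the skew ratio can be stated compactly; the small helper below is our own illustration of it:

```python
def signed_skew(cols, rows):
    """Skew ratio of a query region: magnitude is long side / short side,
    negative when the long side runs along the X axis (more columns)."""
    if cols > rows:
        return -cols / rows
    if rows > cols:
        return rows / cols
    return 1.0
```

For example, a query region with 45 columns and 20 rows has skew ratio -2.25, while one with 20 columns and 45 rows has skew ratio +2.25; a square region has skew ratio 1.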
The size of the data space was again a 512 × 512 grid, and we show the results for three representative values for the area of the query regions:
  small query region: area of 900
  medium query region: area of 22 500
  large query region: area of 57 600
Table 2 shows the mean values for small query regions. Again, the number of clusters is computed as before, and the performance gaps (relative to the IN method) are shown in parentheses (as percentages). However, the table entries are grouped by magnitude of the skew factor, and negative and positive values are shown together. For example, in table 2 the skew factor magnitudes considered are 1, 2.25, 4, 6.25, 11.1 and 25. In each case the results are shown for both positive and negative skew factors in consecutive rows, and they are combined together in the immediately following row. For example, rows 2 and 3 of table 2 give the results for skew factor values of -2.25 and 2.25, respectively, while row 4 aggregates the results for these two rows and gives an overall average for all query regions with a skew factor of magnitude 2.25. The subsequent rows of the table are organized in the same manner. We believe examining aggregate values is more useful because the IN method is extremely biased towards query regions with fewer columns and more rows (i.e. positive skew values). This bias effect can be eliminated by evaluating the aggregates, and therefore such a comparison is more useful.

Table 2. Average number of clusters for skewed query regions, region area = 900.

Skew ratio  Cols  Rows  IN           S                  H
 1          30    30    29.94 (0)    29.98 (-0.13)      29.24 (2.34)
-2.25       45    20    44.91 (0)    30.61 (31.84)      31.60 (29.64)
+2.25       20    45    19.90 (0)    31.49 (-58.24)     32.84 (-65.03)
±2.25                   32.41 (0)    31.05 (4.18)       32.22 (0.57)
-4          60    15    59.70 (0)    33.91 (43.20)      37.45 (37.27)
+4          15    60    15.00 (0)    35.24 (-134.93)    37.47 (-149.80)
±4                      37.35 (0)    34.58 (7.43)       37.46 (-0.29)
-6.25       75    12    74.78 (0)    40.06 (46.43)      43.79 (41.44)
+6.25       12    75    11.98 (0)    39.19 (-227.13)    43.65 (-264.36)
±6.25                   43.38 (0)    39.63 (8.66)       43.72 (-0.78)
-11.1       100   9     99.90 (0)    50.67 (49.28)      54.71 (45.24)
+11.1       9     100   8.97 (0)     46.43 (-417.61)    54.59 (-508.58)
±11.1                   54.44 (0)    48.55 (10.81)      54.65 (-0.39)
-25         150   6     149.85 (0)   64.24 (57.13)      77.02 (48.60)
+25         6     150   5.98 (0)     60.59 (-912.71)    74.92 (-1152.8)
±25                     77.92 (0)    62.40 (19.91)      75.97 (2.50)

An ANOVA test on the observations summarized in table 2 revealed the presence of both main factors and interaction effects (p-values < 0.001). In view of this, a series of ANOVA tests were conducted to examine, at each fixed value of the skew factor, whether or not there are real performance differences among each pair of clustering methods. The results, shown in table 3, support the null hypotheses that the aggregate behaviour of the Hilbert and IN methods is identical (see the columns labelled IN-H) for all skew factor magnitudes. On the other hand, the Spiral method (with the exception of the cases where the skew factor is 1 or 2.25) consistently outperforms both the IN method (see the columns labelled IN-S) and the Hilbert method (see the columns labelled S-H) with respect to the mean number of clusters accessed. The Spiral method's performance advantage (or gap) over the IN method increases as the skew factor becomes larger, and is 19.9% for a skew factor magnitude of 25. We therefore conclude that for small query regions and small skew factors (1 or 2.25), the three methods yield similar aggregate performance results. Overall, however, we conclude that for small query regions the Spiral method performs better than both the IN and H methods.

Table 3. ANOVA results for skewed query regions, region area = 900.

                        IN-S               IN-H               S-H
Skew    Cols  Rows  F-stat    P-value  F-stat     P-value  F-stat  P-value
 1      30    30    0.82      0.37     2.17       0.14     2.44    0.12
-2.25   45    20    708.77    0.00     582.77     0.00     1.68    0.20
+2.25   20    45    455.12    0.00     569.19     0.00     3.11    0.08
±2.25               3.06      0.80     0.06       0.81     2.34    0.13
-4      60    15    748.69    0.00     5472.43    0.00     13.09   0.00
+4      15    60    468.36    0.00     6304.20    0.00     5.21    0.02
±4                  4.07      0.04     0.01       0.92     8.63    0.00
-6.25   75    12    729.08    0.00     1170.68    0.00     5.70    0.02
+6.25   12    75    449.36    0.00     1155.49    0.00     7.91    0.00
±6.25               3.88      0.05     0.04       0.84     6.75    0.01
-11.1   100   9     765.38    0.00     16 282.85  0.00     4.99    0.03
+11.1   9     100   456.33    0.00     16 266.95  0.00     20.79   0.00
±11.1               4.78      0.03     0.01       0.92     11.47   0.00
-25     150   6     1131.61   0.00     1811.43    0.00     17.45   0.00
+25     6     150   462.02    0.00     1645.63    0.00     21.98   0.00
±25                 14.46     0.00     0.30       0.58     19.89   0.00

The results for medium-size query regions are shown in table 4 in the same format as was used in table 2. All the query regions have an area of 22 500, and the skew factor varies between -11 and 11. (As the area of the query region becomes larger, the range of values the skew factor can assume becomes more constrained.)

Table 4. Average number of clusters for skewed query regions, region area = 22 500.

Skew ratio  Cols  Rows  IN          S                  H
 1          150   150   149.3 (0)   141.1 (5.49)       149.0 (0.15)
-2.25       225   100   224.8 (0)   140.7 (37.42)      159.3 (29.12)
+2.25       100   225   99.7 (0)    137.6 (-38.01)     160 (-60.43)
±2.25                   162.3 (0)   139.2 (14.2)       159.7 (1.6)
-4          300   75    299.4 (0)   145.6 (51.38)      185.3 (38.11)
+4          75    300   74.78 (0)   145.8 (-94.91)     187.3 (-150.40)
±4                      187.1 (0)   145.7 (22.1)       186.3 (0.4)
-6.25       375   60    374.3 (0)   178 (52.43)        209.7 (43.97)
+6.25       60    375   59.41 (0)   171.7 (-188.92)    217.4 (-266.00)
±6.25                   216.9 (0)   174.9 (19.4)       213.6 (1.5)
-11.1       500   45    499.5 (0)   269.5 (46.05)      270 (45.94)
+11.1       45    500   41.61 (0)   267.7 (-543.33)    270.7 (-550.66)
±11.1                   270.6 (0)   268.6 (0.7)        270.4 (0.1)

Again, a two-factor ANOVA test revealed the presence of both main factors and interaction effects (p-values < 0.001).
In view of this, a series of ANOVA tests were conducted to examine, at each fixed value of the skew factor, whether or not there are real performance differences among each pair of clustering methods. Again, we will consider only the aggregate values for each magnitude of skew factor in our discussion. The results, shown in table 5, indicate that the aggregate behaviour of the Hilbert and IN methods is identical (see the IN-H columns) for all skew factor magnitudes. The S method does better than IN throughout; however, in this case the gap between S and IN at first increases with increasing magnitude of the skew factor, and peaks at 22.1% for a skew factor of 4. The gap then decreases as the skew factor is increased further, and drops to 0.7% for a skew factor of 11.1. ANOVA tests (see table 5) suggest that the small differences observed when the magnitude of the skew factor is 11.1 are not statistically significant. This means that for medium-size regions with large skew factors, all three methods perform about the same. However, we again conclude that overall the Spiral method dominates the other two methods for medium-size queries.

The performance results for large query regions are shown in table 6, and the results of the corresponding ANOVA tests are shown in table 7. All the query regions have an area of 57 600 in this table, and the skew factor lies between -4 and 4. Again, the aggregate performance of the IN and H methods is nearly within 3% of one another. However, our ANOVA tests suggest that these differences are not significant. We therefore conclude that the aggregate behaviour of the two methods is identical. On the other hand, the relative performance between IN and S exhibits a somewhat similar trend to the one observed in table 4. We find that when the skew factor has a magnitude of 1, S is better than IN by about 26%. This gap grows to about 29% when the skew factor is 1.56, and then drops to about 11% when the skew factor increases to 4.
(It is not possible to find a query region of this area size and a larger skew factor.) On the basis of the results in tables 6 and 7, we conclude that the Spiral method significantly outperforms both the Hilbert and Inverted Naive methods for large query regions, irrespective of orientation.

Table 5. ANOVA results for skewed query regions, region area = 22 500. (Rows labelled `pooled' aggregate the two orientations of each skew magnitude.)

Skew       Cols   Rows    IN-S                 IN-H                 S-H
                          F-stat    P-value    F-stat    P-value    F-stat   P-value
1          150    150     46.61     0.00       0.02      0.90       8.76     0.00
2.25       225    100     1853.66   0.00       622.68    0.00       32.62    0.00
-2.25      100    225     413.42    0.00       507.80    0.00       47.45    0.00
(pooled)                  46.49     0.00       0.46      0.50       39.71    0.00
4          300    75      3326.41   0.00       6721.81   0.00       181.40   0.00
-4         75     300     692.25    0.00       7250.23   0.00       191.37   0.00
(pooled)                  52.83     0.00       0.02      0.88       186.49   0.00
6.25       375    60      2112.35   0.00       1317.73   0.00       26.28    0.00
-6.25      60     375     702.11    0.00       1287.75   0.00       56.03    0.00
(pooled)                  26.05     0.00       0.16      0.69       39.65    0.00
11.1       500    45      1483.98   0.00       17 760.89 0.00       0.01     0.94
-11.1      45     500     1471.53   0.00       16 803.28 0.00       0.24     0.62
(pooled)                  0.03      0.87       0.00      0.98       0.09     0.77

Table 6. Average number of clusters for skewed query regions, region area = 57 600. (Figures in parentheses give the percentage improvement relative to the IN method.)

Skew ratio   Cols   Rows   IN          S                 H
1            240    240    238.8 (0)   176.8 (25.98)     246.5 (-3.23)
1.56         300    192    299.4 (0)   176.6 (41.02)     243.9 (18.53)
-1.56        192    300    190.9 (0)   172 (9.87)        247.3 (-29.60)
(pooled)                   245.1 (0)   174.3 (28.9)      245.6 (-0.2)
2.25         360    160    358.9 (0)   191.5 (46.66)     249.9 (30.38)
-2.25        160    360    159.0 (0)   191.4 (-20.32)    255.5 (-60.63)
(pooled)                   259.0 (0)   191.4 (26.1)      252.7 (2.4)
4            480    120    479.5 (0)   265.3 (44.67)     289.9 (39.55)
-4           120    480    115.6 (0)   263.7 (-128.07)   291 (-151.76)
(pooled)                   297.6 (0)   264.5 (11.1)      290.5 (2.4)

Table 7. ANOVA results for skewed query regions, region area = 57 600.

Skew       Cols   Rows    IN-S                  IN-H                S-H
                          F-stat     P-value    F-stat   P-value    F-stat   P-value
1          240    240     527.76     0.00       3.61     0.06       209.55   0.00
1.56       300    192     5298.56    0.00       155.53   0.00       203.17   0.00
-1.56      192    300     127.31     0.00       160.98   0.00       256.20   0.00
(pooled)                  574.13     0.00       0.01     0.92       228.92   0.00
2.25       360    160     19 679.98  0.00       489.51   0.00       136.85   0.00
-2.25      160    360     885.95     0.00       444.92   0.00       188.36   0.00
(pooled)                  214.55     0.00       0.93     0.33       160.87   0.00
4          480    120     2763.65    0.00       974.77   0.00       11.41    0.00
-4         120    480     1350.07    0.00       828.93   0.00       14.23    0.00
(pooled)                  13.25      0.00       0.49     0.48       12.84    0.00

3.4. Discussion
The strength of the S method lies in the fact that it is somewhat similar to, and shares the benefits of, the IN method, and yet it is symmetric. The above results show that, in terms of average performance: (1) the Spiral method clearly dominates Inverted Naive and Hilbert; and (2) the Inverted Naive and Hilbert methods perform, on average, equally well, assuming no prior knowledge about the skew factor. The results in tables 2, 4 and 6 do, however, suggest that Hilbert has a slight advantage because it is symmetric, and not sensitive to the orientation of the query region (or sign of the skew factor), while Inverted Naive is very sensitive to it. On this basis one might argue that Hilbert is relatively better. However, such a conclusion would be premature unless a more detailed examination of the variance of each method is carried out, to which we turn in the next section.

4. Examination of variance
So far our emphasis has been on studying the mean values for the number of clusters. In this section, we examine the sample variance for the three methods more closely. The purpose of studying the variance is to get a sense of how much the actual values might deviate from the average, which gives an indication of the possible worst cases that might arise.
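Throughout these comparisons, the performance measure is the number of clusters, i.e. runs of consecutively numbered disk blocks, that a query region touches under a given linear ordering. As a concrete illustration (not code from the study; the grid size, function names and the row-wise example ordering are our own), the metric can be sketched as follows:

```python
# Illustrative sketch of the clustering metric: the number of clusters
# (runs of consecutive linear addresses) that a query region covers.
# The 8 x 8 grid and the row-wise example ordering are assumptions
# made for illustration only.

def row_wise_index(x, y, n):
    """Row-wise linear address of cell (x, y) in an n x n grid."""
    return y * n + x

def count_clusters(region, index_fn, n):
    """Number of runs of consecutive addresses covering `region`,
    given as a set of (x, y) cells."""
    addresses = sorted(index_fn(x, y, n) for (x, y) in region)
    if not addresses:
        return 0
    return 1 + sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)

# A 3-column by 2-row query region in an 8 x 8 grid: under row-wise
# ordering each row of the region is one contiguous run of addresses.
n = 8
region = {(x, y) for x in range(2, 5) for y in range(3, 5)}
print(count_clusters(region, row_wise_index, n))  # 2
```

Each cluster corresponds to one contiguous disk access, which is why fewer clusters means faster retrieval.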
Table 8 gives standard deviations for the same square query regions for which the mean values were reported in table 1. Table 8 clearly shows that the standard deviation is lowest for the IN method and highest for the H method, with Spiral in between. The standard deviation of both the IN and H methods grows with query region size; in the Spiral method, however, it first increases and then begins to drop after reaching a maximum for query regions of size 240 × 240. This table, in conjunction with table 1, clearly suggests that the IN method dominates the H method for square query regions. On the other hand, between IN and S, even though S does not clearly dominate IN for all square query regions in terms of variance, the considerable advantage of S in terms of mean values clearly outweighs its slightly higher standard deviation. Therefore, it is quite clear that for square query regions, S is the best method, followed by IN. Tables 9 and 10 respectively show the standard deviation values for small and medium skewed query regions. Since these two tables exhibit similar patterns, we will only discuss the latter. The standard deviation results shown in table 10 correspond to the mean values shown in table 4 for medium size query regions (area = 22 500). Note that for each magnitude of the skew factor, the standard deviations are first shown separately for each orientation, and then aggregated across both orientations to obtain the standard deviation of the pooled sample. As before, the discussion relates to the aggregate values because that offers a more reasonable comparison. The following observations can be made from this table. As the skew factor increases, the standard deviation for the IN method rises very fast, and for a ratio of 2.25 it is slightly higher than the standard deviation for the H method. Therefore, the H method performs better for skew factors greater than 2.
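The pooling step also explains why the aggregate standard deviation for IN can be far larger than either per-orientation value: each orientation has a tiny spread around a very different mean, and pooling the two samples adds the between-means spread. A minimal sketch, using made-up sample values rather than the study's raw data:

```python
# Why the pooled standard deviation for IN can dwarf the per-orientation
# values: each orientation clusters tightly around a very different mean,
# so pooling the samples adds the between-means spread.
# The sample values below are illustrative, not the paper's data.
import statistics

wide = [45.0, 45.0, 46.0, 44.0]   # e.g. 45 x 20 regions: about 45 clusters each
tall = [20.0, 20.0, 21.0, 19.0]   # e.g. 20 x 45 regions: about 20 clusters each

print(round(statistics.pstdev(wide), 2))         # small
print(round(statistics.pstdev(tall), 2))         # small
print(round(statistics.pstdev(wide + tall), 2))  # large: dominated by the 45-vs-20 gap
```

The symmetric methods (S and H) have similar means in both orientations, so their pooled standard deviations stay close to the per-orientation values.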
Between S and H there is a similar cut-over: S has a smaller standard deviation than H for small values of the skew factor (less than 3), and H has a smaller standard deviation than S for larger values (greater than 6); in between, the two methods are comparable. Therefore, for small values of the skew factor S dominates H, while for large values there is a trade-off: S has a smaller mean value but a higher standard deviation.

Table 8. Standard deviations for square query regions.

Cols   Rows   Area      IN      S       H
30     30     900       0.92    0.36    10.58
60     60     3600      2.95    4.05    22.51
90     90     8100      0.00    7.52    32.40
120    120    14 400    7.47    15.37   45.28
150    150    22 500    7.39    25.82   53.81
180    180    32 400    7.96    36.82   68.74
210    210    44 100    10.44   50.19   74.29
240    240    57 600    11.91   59.19   90.01
270    270    72 900    11.95   56.20   96.69
300    300    90 000    11.54   50.64   109.88
330    330    108 900   17.91   43.49   118.49
360    360    129 600   28.59   36.49   129.10
390    390    152 100   21.18   28.35   142.25
420    420    176 400   27.89   22.61   159.51
450    450    202 500   44.01   15.04   160.87
480    480    230 400   55.09   7.93    185.36

Table 9. Standard deviations for skewed query regions, region area = 900. (Rows labelled `pooled' aggregate the two orientations of each skew magnitude.)

Skew factor   Cols   Rows   IN      S       H
1             30     30     0.92    0.36    10.58
2.25          45     20     1.39    11.93   12.25
-2.25         20     45     0.96    12.11   12.09
(pooled)                    12.57   12.02   12.18
4             60     15     2.30    20.95   6.32
-4            15     60     0.32    20.91   6.32
(pooled)                    22.49   20.92   6.53
6.25          75     12     2.86    28.61   20.05
-6.25         12     75     0.35    28.70   20.83
(pooled)                    31.48   28.65   20.44
11.1          100    9      2.19    39.73   7.61
-11.1         9      100    0.36    39.21   7.99
(pooled)                    45.51   39.51   7.80
25            150    6      3.31    56.81   38.12
-25           6      150    0.20    56.81   38.00
(pooled)                    72.01   56.09   38.06

Table 10. Standard deviations for skewed query regions, region area = 22 500.

Skew factor   Cols   Rows   IN       S        H
1             150    150    7.39     25.82    53.81
2.25          225    100    5.01     43.39    58.48
-2.25         100    225    3.87     41.50    59.71
(pooled)                    62.74    42.46    59.07
4             300    75     9.48     58.87    29.64
-4            75     300    2.86     60.29    29.41
(pooled)                    112.58   59.56    29.53
6.25          375    60     11.82    94.77    100.70
-6.25         60     375    4.12     94.67    98.36
(pooled)                    157.77   94.72    99.56
11.1          500    45     11.14    133.04   36.86
-11.1         45     500    7.95     131.55   38.71
(pooled)                    229.26   132.23   37.78

The standard deviation results shown in table 11 correspond to the mean values shown in table 6 for large size query regions (area = 57 600). The following observations can be made from this table. As the skew factor increases, the standard deviation for the IN method rises very fast, and for a ratio of 2.25 it is slightly lower than the standard deviation for the H method. Therefore, the H method performs better for skew factors greater than 2.25. This behaviour is somewhat consistent with that observed above with respect to small and medium size query regions. In conjunction with table 6, however, table 11 clearly shows that the S method dominates both the IN and H methods for large query regions.

Table 11. Standard deviations for skewed query regions, region area = 57 600.

Skew ratio   Cols   Rows   IN       S       H
1            240    240    11.91    59.19   90.01
1.56         300    192    9.45     36.52   99.06
-1.56        192    300    10.41    35.98   98.85
(pooled)                   55.20    36.31   98.92
2.25         360    160    13.91    22.77   109.28
-2.25        160    360    8.68     22.74   101.93
(pooled)                   100.66   22.74   105.65
4            480    120    10.69    90.48   135.37
-4           120    480    15.59    88.77   135.33
(pooled)                   182.54   89.59   135.28

To get a better perspective on the distribution pattern for a given query with the three methods, we plotted distribution graphs for sample query region sizes. The distribution patterns for a 90 × 90 query region are depicted in the histograms shown in figure 3 for all three methods. The IN and S methods have a large spike at 90, while the distribution for the H method is scattered into three sub-regions, ranging in value from 37.5 to 150. This explains the higher standard deviation observed for H. Similar distribution plots for a 480 × 480 query region are shown in figure 4. Again notice that the spread of values is smallest for IN, and largest for H. In fact, in the worst case the H method accesses about 800 clusters. The spread for the S method is quite small, and all values lie in the range from 10 to 40. We also produced similar plots for skewed query regions in order to understand the distribution patterns for the three methods. We selected 18 × 450 and 450 × 18 query regions because the skew factor in these extreme cases is -25 and 25, and the area is the same as that of a 90 × 90 region. The graphs are shown in figures 5 and 6. Note that the spread is now largest for the S method (in the range from 18 to 450), while for the H method there are two subregions clumped together (one between 100 and 150, and the second between 325 and 375). Also notice that the distribution patterns for S and H are quite similar in both figures 5 and 6. Again, as expected, the IN method has the least standard deviation, but the mean value is 18 in one case and 450 in the other. An intuitive explanation of these results is as follows. For large skew factors, the variance of the S method is large; i.e. it is very sensitive to the location of the query region. This agrees with our intuition because when the query region is near the boundaries of the data space, i.e.
the long side of the query region lies parallel to, and very close to, one of the four boundaries, the number of clusters is small, nearly equal to the short dimension of the query region (18 for an 18 × 450 or 450 × 18 region). On the other hand, when the long side of the query region lies away from a boundary, the number of clusters increases. The worst case occurs when the query region lies near the centre of the data space, where there are as many clusters as the long side of the query region (450 for an 18 × 450 or 450 × 18 region). If we had a square of the same area (say, a 90 × 90 square), the variance with the S method would be very small, because the number of clusters is very insensitive to the exact location of the query region in the data space. Therefore, with the S method, the variance is much higher for a skewed rectangle than for a square of the same area, and increases with the skew factor. In the H method the variance of a skewed query region also increases with increasing skew at first, but there are two differences: (1) the increase in variance is not as pronounced as in the Spiral method; and (2) as the skew ratio increases, the variance tends to oscillate, as evidenced by the data in tables 9 and 10.

Figure 3. Distribution plot (90 × 90).

We are not able to explain this behaviour fully, but we think it might partly be due to the fact that the H method tends to perform better when the query region is aligned with the pattern or boundaries of the Hilbert curve, which is more likely to occur when the query region lies exactly on a corner and overlaps two boundaries of the data space. By contrast, the performance of the method tends to be poor when the query region is poorly aligned with the curve, which is more likely to occur when the query region lies in the centre of the data space.
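This boundary-versus-centre sensitivity is easy to reproduce numerically. The sketch below builds a simple concentric-ring (spiral) ordering of a small grid and counts address runs for a thin region hugging the top edge versus the same region in the middle. The grid size, helper names, and the inward-wound ring traversal are our own illustrative reading of a spiral ordering, not the paper's implementation:

```python
# Sketch: a concentric-ring (spiral) ordering of an n x n grid, wound
# inward from the outer boundary.  Illustrative reading of a spiral
# ordering; grid size and names are our own assumptions.

def spiral_index(n):
    """Map each cell (x, y) to its position along an inward spiral."""
    index, k = {}, 0
    top, bottom, left, right = 0, n - 1, 0, n - 1
    while top <= bottom and left <= right:
        for x in range(left, right + 1):               # top edge, left to right
            index[(x, top)] = k; k += 1
        for y in range(top + 1, bottom + 1):           # right edge, downwards
            index[(right, y)] = k; k += 1
        if top < bottom:
            for x in range(right - 1, left - 1, -1):   # bottom edge, right to left
                index[(x, bottom)] = k; k += 1
        if left < right:
            for y in range(bottom - 1, top, -1):       # left edge, upwards
                index[(left, y)] = k; k += 1
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return index

def count_clusters(region, index):
    """Runs of consecutive spiral addresses covered by `region`."""
    addresses = sorted(index[cell] for cell in region)
    return 1 + sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)

n = 16
index = spiral_index(n)
thin = lambda y0: {(x, y) for x in range(2, 14) for y in range(y0, y0 + 2)}
at_edge, at_centre = thin(0), thin(7)
# At the boundary each of the region's two rows lies within a single
# ring (one run per row); in the centre the region cuts many rings.
print(count_clusters(at_edge, index), count_clusters(at_centre, index))  # 2 10
```

The 12 × 2 region costs two clusters when it hugs the boundary but ten in the centre, mirroring the short-dimension versus long-dimension behaviour described above.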
A skewed query region can `hug' the boundary even more than a square query region of the same area, and the difference between its performance at the centre and at the boundary is much greater. Hence, the variance tends to increase with the skew factor. However, once the skew factor becomes `too large', the query region is always in the proximity of some boundary! This is illustrated in table 10, where the variance drops when the skew factor is 11.1. This is because the size of the query region is 500 × 45, and in a data space of 512 × 512 it is always in the vicinity of a boundary, and never really in the `centre'.

Figure 4. Distribution plot (480 × 480).
Figure 5. Distribution plot (18 × 450).
Figure 6. Distribution plot (450 × 18).

5. Discussion and comparison with related work
These results clearly show that deciding which method is best is a complex issue even in the special case that we have studied, i.e. a blocking factor of 1. We can, however, summarize our findings in the form of a series of recommendations that apply under various scenarios.

Scenario 1: Nothing is known about the expected workload.
S is the best method.

Scenario 2: Only the skew factor is known.
If skew = 1 (i.e. only square queries), then S is the best method.
If skew > 1, then column-wise IN is the best method.
If skew < 1, then row-wise IN is the best method.

In selecting the Spiral method as the default method when nothing is known, our strategy is to minimize the downside risk in the worst case, which could occur when the area is large (say, a 480 × 480 query region), while minimizing the overall mean and standard deviation of the number of clusters that must be accessed for queries of various sizes and orientations. Here the H method could result in close to 800 clusters! (With the IN and S methods, the absolute worst case is 512 clusters under all circumstances.)
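The two scenarios above amount to a small selection rule, sketched below. The function and method labels are our own naming, with the skew factor expressed as columns divided by rows:

```python
# The recommendations above, restated as a selection rule.
# Purely illustrative: function and method labels are our own naming.

def choose_method(skew=None):
    """Choose a clustering method given the expected skew factor
    (cols/rows), or None when nothing is known about the workload."""
    if skew is None or skew == 1:   # Scenario 1, or only square queries
        return "Spiral"
    if skew > 1:                    # wide query regions
        return "column-wise Inverted Naive"
    return "row-wise Inverted Naive"  # tall query regions (skew < 1)

print(choose_method())       # Spiral
print(choose_method(2.25))   # column-wise Inverted Naive
print(choose_method(0.25))   # row-wise Inverted Naive
```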
Moreover, as query regions get larger in area, the maximum skew factor is constrained, and the relative advantage of the H method declines. Furthermore, for large square query regions the S method has a big advantage over H. Thus, the rationale is to minimize the error when the query regions are large, when query processing times would be longer than in the case where query regions are small.

A small caveat to the above generalizations should be mentioned here. We varied the area of the query region over a wide range, and found that over a very small part of this range, for example when the query region area is 22 500 and the skew factor is 11.1, the means for the S and H methods are close, but the standard deviation of the S method is about two to three times larger than that of the H method. In this case, we would consider the H method to be superior to the S method. However, such cases were very few, and it was not possible to generalize them in a meaningful manner. Therefore, we believe that in the absence of precise knowledge about the dimensions of the query region, the Spiral method is almost uniformly superior.

The most relevant related work that we are familiar with is by Jagadish (1990). Some of the conclusions of that study are similar to ours (for example, that there is no clear winner between IN and H in terms of the overall average number of clusters), but the author did not examine sample variances, and did not consider the Spiral method. As our study shows, both these factors have a considerable impact on the final conclusions. In particular, the default method (Scenario 1) in Jagadish (1990) was found to be Hilbert, while our study suggests that Spiral would be a better default method. In fact, we feel that even IN might be superior to Hilbert as the default method because it reduces the downside risk for large query regions.

6. Conclusions
GIS need efficient ways of storing raster or grid data in a database system. The data has to be clustered (or ordered) properly for efficient retrieval and fast query processing. In this paper, we discussed several clustering methods and showed that the choice of an appropriate method has considerable performance implications for a GIS. In particular, we conducted a detailed investigation into the clustering performance of the Hilbert method (which has been shown previously in various other studies to outperform other recursive methods) and two non-recursive methods, Inverted Naive and Spiral. The Spiral method has not been studied previously, nor have earlier studies examined the variance issue, which we have found to be of crucial importance.

The most important conclusion of this paper is that, in the case where the blocking factor is 1, the Hilbert method has a fairly high variance and is inferior to the non-recursive methods. This point has been overlooked in previous studies, where only average values were compared and variance was not considered. This variance naturally plays an important role in the choice of the best method for spatial clustering. Even though these results apply to a special case and do not generalize to higher blocking factors, we believe that they are rather surprising and, therefore, useful for both researchers and practitioners to know. Our main conclusion is that while all three methods perform well in their own niches, in situations where no information is available on the nature of the query regions, the Spiral and IN methods (in that order) are better than the Hilbert method. Our analysis of sample variance showed that in the worst case, the performance of the Hilbert method can be extremely poor.

We believe that a good clustering method can have an even greater impact on performance if positive autocorrelation is present in the data (which is often the case for spatial data).
In future work, we expect to quantify the effect of this factor on the performance of the various methods.

References
Abel, D., and Smith, J., 1983, A data structure and algorithm based on a linear key for a rectangle retrieval problem. Computer Vision, Graphics and Image Processing, 24, 1-13.
Batty, P., 1992, Exploiting relational database technology in a GIS. Computers and Geosciences, 18, 453-462.
Bentley, J., 1975, Multidimensional binary search trees used for associative searching. Communications of the ACM, 18, 509-517.
Dandamudi, S., and Sorenson, P., 1985, An empirical performance comparison of some variations of the k-d tree and BD tree. International Journal of Computer and Information Sciences, 14, 135-159.
Dandamudi, S., and Sorenson, P., 1986, Algorithms for BD-trees. Software Practice and Experience, 16, 1077-1096.
Faloutsos, C., 1988, Gray codes for partial match and range queries. IEEE Transactions on Software Engineering, 14, 1381-1393.
Faloutsos, C., and Roseman, S., 1989, Fractals for secondary key retrieval. In Proceedings of the Eighth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems held in Austin, Texas on 29-31 March 1989 (New York: Association for Computing Machinery), pp. 247-252.
Finkel, R., and Bentley, J., 1974, Quad trees: a data structure for retrieval on multiple keys. Acta Informatica, 4, 1-9.
Goodchild, M., 1992, Geographical data modeling. Computers and Geosciences, 18, 401-408.
Guttman, A., 1984, R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data held in Boston on 18-21 June 1984, edited by Beatrice Yormark (New York: Association for Computing Machinery), pp. 47-57.
Hinrichs, K., 1985, Implementation of the grid file: design concepts and experience. BIT, 25, 569-592.
Jagadish, H. V., 1990, Linear clustering of objects with multiple attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data held in Atlantic City on 23-25 May 1990, edited by H. Garcia-Molina and H. V. Jagadish (New York: Association for Computing Machinery), pp. 332-341.
Kumar, A., 1994, A study of spatial clustering techniques. In Proceedings of the 5th International Conference on Database and Expert Systems Applications held in Athens, Greece on 7-9 September 1994, Lecture Notes in Computer Science, 856, 57-71.
Nievergelt, J., Hinterberger, H., and Sevcik, K., 1984, The grid file: an adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9, 38-71.
Orenstein, J., 1986, Spatial query processing in an object-oriented database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data held in Washington, D.C. on 28-30 May 1986, edited by C. Zaniolo (New York: Association for Computing Machinery), pp. 326-336.
Orenstein, J., 1989, Redundancy in spatial database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data held in Portland, OR on 31 May-2 June 1989 (New York: Association for Computing Machinery), pp. 294-305.
Orenstein, J., and Merrett, T., 1984, A class of data structures for associative searching. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems held in Waterloo, Canada on 2-4 April 1984 (New York: Association for Computing Machinery), pp. 181-190.
Robinson, J. T., 1981, The K-D-B-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the ACM SIGMOD International Conference on Management of Data held in Ann Arbor, Michigan on 29 April-1 May 1981, edited by Y. E. Lien (New York: Association for Computing Machinery), pp. 10-18.
Samet, H., 1990, The Design and Analysis of Spatial Data Structures (Reading, MA: Addison-Wesley).
Star, J., and Estes, J., 1990, Geographical Information Systems: An Introduction (Englewood Cliffs, NJ: Prentice Hall).