Web data retrieval: solving spatial range queries using k

Geoinformatica
DOI 10.1007/s10707-008-0055-2
Web data retrieval: solving spatial range queries
using k-nearest neighbor searches
Wan D. Bae · Shayma Alkobaisi · Seon Ho Kim ·
Sada Narayanappa · Cyrus Shahabi
Received: 24 October 2007 / Revised: 25 July 2008 /
Accepted: 28 August 2008
© Springer Science + Business Media, LLC 2008
Abstract As Geographic Information Systems (GIS) technologies have evolved,
more and more GIS applications and geospatial data are available on the web. Spatial
objects in a given query range can be retrieved using spatial range query − one of the
most widely used query types in GIS and spatial databases. However, it can be challenging to retrieve these data from various web applications where access to the data
The author’s work is supported in part by the National Science Foundation under award
numbers IIS-0324955 (ITR), EEC-9529152 (IMSC ERC) and IIS-0238560 (PECASE) and in
part by unrestricted cash gifts from Microsoft and Google. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
W. D. Bae (B)
Department of Mathematics, Statistics and Computer Science,
University of Wisconsin-Stout, Menomonie, WI, USA
e-mail: [email protected]
S. Alkobaisi
College of Information Technology, United Arab Emirates University,
Al-Ain, United Arab Emirates
e-mail: [email protected]
S. H. Kim
Department of Computer Science Information & Technology,
University of District of Columbia, Washington, DC, USA
e-mail: [email protected]
S. Narayanappa
Department of Computer Science, University of Denver, Denver, CO, USA
e-mail: [email protected]
C. Shahabi
Department of Computer Science, University of Southern California, Los Angeles, CA, USA
e-mail: [email protected]
Geoinformatica
is only possible through restrictive web interfaces that support certain types of queries. A typical scenario is the existence of numerous business web sites that provide
their branch locations through a limited “nearest location” web interface. For example, a chain restaurant’s web site such as McDonalds can be queried to find some of
the closest locations of its branches to the user’s home address. However, even
though the site has the location data of all restaurants in, for example, the state of
California, it is difficult to retrieve the entire data set efficiently due to its restrictive
web interface. Considering that k-Nearest Neighbor (k-NN) search is one of the most
popular web interfaces in accessing spatial data on the web, this paper investigates
the problem of retrieving geospatial data from the web for a given spatial range query
using only k-NN searches. Based on the classification of k-NN interfaces on the web,
we propose a set of range query algorithms to completely cover the rectangular shape
of the query range (completeness) while minimizing the number of k-NN searches as
possible (efficiency). We evaluated the efficiency of the proposed algorithms through
statistical analysis and empirical experiments using both synthetic and real data sets.
Keywords Range queries · k-Nearest neighbor queries · Web data · Web interfaces ·
Web integration · GIS
1 Introduction
Due to the recent advances in Geographic Information Systems (GIS) technologies
and emergence of diverse web applications, a large amount of geospatial data
has become available on the web. For example, numerous businesses release the
locations of their branches on the web. The web sites of government organizations
such as the US Postal Office provide the list of its offices close to one’s residence.
Various non-profit organizations also publicly post a large amount of geospatial data
for different purposes.
There is a growing number of GIS applications that crawl or wrap various web
sources into one integrated solution for their users, such as Information Mediators.
As an example, a fast food restaurant may need to know the distribution of other
restaurants, e.g., McDonalds, in a certain geographic region for its future business
plan. There could be a government or environmental research organization needing
to know about the characteristic of some businesses, social activities or changes
in the nature. These applications, however, have difficulties to efficiently retrieve
information for a certain region from various web sources because of their restrictive
web interfaces [4]. On the web, access to geospatial data is only possible through
the interfaces provided by the applications. These interfaces are usually designed
for one specific query type and hence cannot support general access to the data.
They could retrieve data only using some particular types of queries such as k-NN
query. For example, the McDonalds web site provides a restaurant locator service
through which one can ask for the five closest restaurants from a given location.
This type of web interface to search for a number of “nearest neighbors” from a
given geographical point (e.g., a mailing address) or an area (e.g., a zip code) is very
popular for accessing geospatial data on the web. It nicely serves the intended goal
of quick and convenient dissemination of business location information to potential
customers. However, if the consumer of the data is a computer program, as in the case
Geoinformatica
of web data integration utilities (e.g., wrappers) and search programs (e.g., crawlers),
such an interface may be a very inefficient way of retrieving all data in a given query
region. To illustrate, suppose a web crawler wants to access the McDonalds web site
to retrieve all the restaurants in the state of California. Even though the site has
the required information, the interface only allows the retrieval of five locations at a
time. Even worse, the interface needs the center of the search as input. Hence, the
crawler needs to identify a set of nearest location searches that both covers the entire
state of California (for completeness) and has minimum overlap between the result
sets (for efficiency). The number of results returned even varies depending on the
application. This problem necessitates for methods that utilize certain query types to
provide solutions to non-supported queries.
In this paper, we conceptualize the aforementioned problem into a more general
problem of supporting spatial range query to retrieve web data in a given rectangular
shaped region using k-Nearest Neighbor (k-NN) search. Besides web data integration and search applications, the solution to this more general problem is beneficial
for other application domains in the areas of sensor networks, online maps and adhoc mobile networks where their applications require great deal of data integration
for data analysis and decision support queries.
Figure 1 illustrates the challenges of this general problem through an example.
Suppose that we want to find the locations of all the points in a given region
(rectangle), and the only available interface is the k-NN search with a fixed k (k = 3
in the example of Fig. 1). The only parameter that we can vary is the query point
for the k-NN search (shown as cross marks in Fig. 1). Given a query point q, the
result of the k-NN search is a set of k nearest points to q. It defines a circle centered
at q with the radius equal to the distance from q to its kth nearest neighbor (i.e.,
the 3rd closest point in this example). The area of this circle, covering part of the
rectangle, determines the searched (i.e., covered) portion of the rectangular region.
To complete the range query, we need to identify a set of input locations of q
corresponding to a series of k-NN searches that would result in a set of circles
covering the entire rectangle.
Based on the classification of k-NN interfaces on the web, we propose a set
of range query algorithms: the Density-Based (DB) algorithm for the case with
known statistics of the data sets assuming the most flexible k-NN interface (i.e., k
Fig. 1 A range query on
the web data
P ur
q
3-NN Search
Pll
Range Query R
Geoinformatica
is a user input), the Quad Drill Down (QDD) and Dynamic Constrained Delaunay
Triangulation (DCDT) algorithms for the case with no knowledge of the distribution
of the data sets and with the most restrictive k-NN interface (i.e., no control on the
value of k), and the Hybrid (H B) of the two approaches, DB and QDD, for the
solution of the case with bounded values of k. Our algorithms guarantee the complete
coverage of any given spatial query range. The efficiencies of the algorithms are
measured by statistical analysis and empirical experiments using both synthetic and
real data sets.
The remainder of this paper is organized as follows: The problem is formally
defined in Section 2 and the related work is discussed in Section 3. In Section 4
we propose our algorithms to solve range queries using a number of k-NN queries.
Section 5 provides the performance evaluation of the proposed algorithms. We first
discuss the statistical analysis of DB and QDD. Then we present the efficiency of our
algorithms through empirical experiments. Finally, Section 6 concludes the paper.
2 Problem definition
A range query in spatial database applications can be formally defined as follows:
Let R be a given rectangular query region represented by two points, Pll and Pur ,
where Pll = (x1 , y1 ) and Pur = (x2 , y2 ). Pll and Pur refer to the lower-left corner and
the upper-right corner of R, respectively. Given a set of objects S and a region R,
spatial range query finds all objects of S in R [8].
Similarly, a nearest neighbor (NN) query can be defined. Given a set of objects S
and a query point q, an NN query searches for a point p ∈ S such that dist(q, p) ≤
dist(q, r) for any r ∈ S. The point p is the nearest neighbor of q. One can be interested
in finding more than one point, say k points, which are the nearest neighbors of the
query point q. This is called the k-Nearest Neighbor (k-NN) problem also known
as the top-k selection query. Given a set of objects S (with cardinality(S) = N),
a number k ≤ N and a query point q, a k-NN query searches for a subset S ⊆ S
with cardinality(S ) = k such that for any p ∈ S and r ∈ S − S , then distance(q, p) ≤
distance(q, r) [16].
Our problem is defined as follows: Given a rectangular shaped query region
R, find all spatial objects within the query range using k-NN search interface as
supported by various web applications. The complexity of this problem is unknown.
Therefore only approximation results are provided. In reality, there can be more
than one web site involved in the given range query. Then, the general problem is
to find all objects within the query region R using k-NN searches on each of these
web sources and to integrate the results from the sources into one final query result.
Figure 2a and b show examples of two different k-NN interfaces, k = 3 in (a) and
k = 6 in (b). Two different data sets in the given same query region can be retrieved
from the different web sources.
The optimization objective of this problem is to minimize the number of k-NN
searches while retrieving all objects in a given query region. Each k-NN search results
in communication overhead between the client and the server in addition to the
actual execution cost of the k-NN search at the server. The range query execution
is broken down into four basic steps: 1) Client sends a request to the web server with
the query point q, 2) Server executes the k-NN search using q, 3) Server returns the
Geoinformatica
Northglenn
Broomfield
Northglenn
Broomfield
H
Thornton
Westminster
Thornton
Westminster
H
Arvada
H
H
285
Wheat Ridge
Golden
H
H
H H
ARVADA
COUNTY
Denver
Lakewood
Aurora
Lakewood
H
H
Aurora
Englewood
Cherry Hills village
Indian Hills
Centennial
I 70
Glendale
I 225
285
5
q
Littleton
I 25
Denver
Englewood
Cherry Hills village
Indian Hills
ARVADA
COUNTY
Wheat Ridge
I 70
Glendale
I 225
H
285
Arvada
285
Golden
H
Littleton
Centennial
I 25
H
Highlands Ranch
Highlands Ranch
Parker
Parker
(a) A range query using 3-NN search
(b) A range query using 6-NN search
Fig. 2 Range queries on the web data (a, b)
result of the k-NN search to client, 4) Client prunes the result (client may invoke
additional requests to server). Let Cn be the total cost of the range query and Cq ,
Cknn , Cres , and C pru represent the cost associated with each of the four steps, where n
is the total number of k-NN searches required to complete a range query. Then, Cn
is defined as follows:
Cn = C pru(n) +
n
[Cq(i) + Cknn(i) + Cres(i) ].
i=1
Steps 1 and 3 both involve communication over the network. For step 2, we assume
that any of the known k-NN search algorithms [6, 7, 13, 17] is available to the web
applications; if any tree index structure is used, the complexity of k-NN search is
known to be O(logN + k) [7], where N is the total number of objects in the database,
and k is the number of objects we want to retrieve. The cost of step 4 is the CPU time
required to prune the k-NN result at the client side. If additional k-NN searches are
required to cover the query region, then additional costs of all Cq , Cknn , and Cres must
be taken into account. However, C pru is required only once for a given query. Hence,
we focus on reducing the number of k-NN searches in this paper while placing our
circles to have as few overlaps as possible. Finally, the fact that we do not know the
radius of each circle prior to the actual evaluation of its corresponding k-NN query
makes the problem even harder.
To summarize, the optimal solution should find a minimal set of query locations
that minimizes the number of k-NN searches to completely cover the given query
region. In many web applications, the client has no knowledge about the data set at
the server and no control over the value of k (i.e., k is fixed depending on the web
applications). These applications return a fixed number of k nearest objects to the
user’s query point. A typical value of k in real applications rages from 5 to 20 [4],
multiple queries are evaluated to completely cover a reasonably large region.
In this paper we consider all possible k-NN interfaces. Any web applications
supporting k-NN search can be classified based on the value of k:
1. The value of k is fixed, and the user does not have any control over k−the general
problem. Many applications always return a fixed number of k nearest objects to
the user’s query point. Different applications may have different values of k.
Geoinformatica
2. The value of k can be any positive integer determined by users. Some web
applications accept user-defined k values for their k-NN searches.
3. The value of k can be any positive integer less than or equal to a certain value.
Some web applications support k-NN search with user-defined k values but allow
only a bounded range of k values.
To simplify the discussion, we made the following assumptions: 1) a range query
will be supported by a single web source; 2) a web application supports only one type
of k-NN query (fixed, flexible or bounded k value); 3) a k-NN search takes a query
point q as an input parameter.
3 Related work
Some studies have discussed the problems of data integration and query processing
on the web [12, 22]. A survey of the general problems in the integration of web data
sources can be found in [9]. The paper discussed challenges of the heterogeneous
nature of the web and different ways of describing the same data, and proposed
general ideas of how to tackle those problems. Nie et al. [12] presented a query
processing framework for web-based data integration and implemented the evaluation of query planning and statistics gathering modules. The problem of querying web
sites through limited query interfaces has been studied in the context of information
mediation systems in [22]. Based on the capabilities of the mediator, the authors
proposed an idea to compute the set of supported queries.
Several studies have focused on performing spatial queries on the web [1, 4, 10].
In [1], the authors demonstrated an information integration application that allows
users to retrieve information about theaters and restaurants from various U.S. cities,
including an interactive map. Their system showed how to build applications rapidly
from existing web data sources and integration tools. The problem of spatial coverage
using only the k-NN interface was introduced in [4]. The paper provided a quad-tree
based approach. A quad-tree data structure was used to check the complete coverage
by marking regions that have been covered so far. The authors follow a greedy
approach using the coverage information to determine the next query’s position. Our
previous work [2] provided an efficient and complete solution for the problem in the
case when the value of k is fixed and users have no control on the web interface
and no knowledge on the data. This paper extends our previous work and provides
comprehensive solutions on the most general case based on the classification of all
possible k-NN interfaces on the web. In our proposed QDD algorithm, we follow
a systematic way of dividing the region to be covered which spare us the need of
the quad tree data structure used in [4]. This systematic division of the range query
provides the information of coverage without needing to access any data structure to
decide on the next query’s position.
The problem of supporting k-NN query using range queries was studied in [10].
The idea of successively increasing query range to find k points was described
assuming statistical knowledge of the data set. This is the complement to the problem
we are focusing on.
Many k-NN query algorithms have been proposed [6, 13, 17]. In [13], the authors
proposed branch-and-bound R-tree traversal algorithm to find the k nearest neighbors of a given query point. They also introduced two metrics for both searching
Geoinformatica
and pruning, which are MINDIST that produces the most optimistic ordering and
MINMAXDIST that produces the most pessimistic ordering possible. A new cost
model for optimizing nearest neighbor searches in low and medium dimensional
spaces was proposed in [17]. They proposed a new method that captures the
performance of nearest neighbor queries based on approximation. Their technique
was based on the vicinity rectangles and the Minkowski rectangles. In [6], given a
set of points, the authors proposed two algorithms to solve two k nearest neighbor
related problems. The first enumerates the k smallest distances between pairs of
points in nondecreasing order and the second finds the k nearest neighbors of each
point. Both are based on a Delaunay triangulation.
The Delaunay triangulation and Voronoi diagram based approaches have been
studied for various purposes in wireless sensor networks [15, 19, 21]. In [19], the
authors used the Vornonoi diagram to discover the existence of coverage holes,
assuming that each sensor knows the location of its neighbors. The region is partitioned into Voronoi cells and each cell contains one sensor. If a cell is not covered
by a sensor, the coverage holes are found. The authors use a greedy approach by
targeting the largest hole which is a Voronoi cell as the next region to be covered.
Sensors replacement that have fixe coverage radius by minimizing the movements
of the sensors is one of the main ideas in their paper. Similarly, our proposed
DCDT algorithm greedily targets the largest uncovered region, a triangle, to be
covered, however, coverage is determined by the non fixed radius of the k-NN
query. The Delaunay triangulation and Vornonoi diagram was used to determine the
maximal breach path (MBP) and the maximal support path (MSP) for a given sensor
network in [11]. In [21] proposed a wireless sensor network deployment method was
proposed based on the Delaunay triangulation. This method was applied for planning
the positions of sensors in an environment with obstacles. Their main goal was to
maximize the coverage of the sensors and not to cover the complete region. Each
candidate position of a sensor is generated from the current sensor configuration
by calculating its scored using a probabilistic sensor detection model. The proposed
approach retrieved the location information of obstacles and pre-deployed sensors,
then constructed a Delaunay triangulation to find candidate positions of new sensors.
In our DCDT algorithm, a constrained Delaunay triangulation (CDT) is used to find
uncovered region with the most coverage gains. The CDT is dynamically updated
while covering the query region with k-NN circles.
4 Range query algorithms using k-NN search
We study and investigate different approaches for all possible k-NN search interfaces
defined in Section 2. We develop four spatial range query algorithms that utilize
k-NN search to retrieve all objects in a given rectangular shaped query region:
Density Based (DB) for flexible values of k, Quad Drill Down (QDD) and Dynamic
Constrained Delaunay Triangulation (DCDT) for fixed values of k, and Hybrid
(H B) for bounded values of k.
The notations in Table 1 are used throughout this paper and Fig. 3 illustrates an
example of range query using 5-NN search using the notations. Let R be a given
query region and C Rq be the circle inscribing R having q and Rq as the center and
radius of C Rq , respectively. Notice that q is also the center point of R. Let Pk =
Geoinformatica
Table 1 Notations
Notation
Description
R
Pll
Pur
q
Rq
C Rq
Pk
A query region (rectangle)
The lower-left corner point of R
The upper-right corner point of R
The query point for a k-NN search
Half of the diagonal of R; the distance from q to Pll
Circle inscribing R; the circle of radius Rq centered at q
The set of k nearest neighbors obtained by a k-NN search ordered ascendingly
by distance from q, i.e., P5 = { p1 , p2 , p3 , p4 , p5 } in Fig. 3
The distance from q to the farthest point pk in Pk
k-NN circle; the circle of radius r centered at q
Tighter bound k-NN circle; the circle of radius · r centered at q
r
Cr
Cr
{ p1 , p2 , ..., pk } be the k nearest neighbors, ordered by distance from q obtained by
the k-NN search with q as the query point. Let qpk = r, i.e., r is the distance from q
to the farthest point pk . Let Cr be the circle of radius r centered at q, which we refer
to as a k-NN circle.
If r > Rq , then the k-NN search must return all the points inside C Rq . Therefor, we
can obtain all the points inside the region R after pruning out any points of Pk that
are outside R but inside Cr . Then, the range query is complete. Otherwise, we need
to perform some additional k-NN searches to completely cover R. For example, if
we use a 5-NN search for the data set shown in Fig. 3, then the resulting Cr is smaller
than C Rq . Note that it is not sufficient to have r ≥ Rq in order to retrieve all the
points inside R when q has more than one kth nearest neighbor since one of them will
be randomly returned as q’s kth nearest neighbor. Assuming four points in the four
corners of R, four points strictly inside of Cr , and the value of k = 5, the returned
result is P5 = { p1 , p2 , ..., p5 }. The condition, r > Rq , is required since 5-NN search
randomly picks one from four points on the corners as p5 discarding the rest of three
points. As a result, we use r > Rq as a terminating condition in our algorithms.
Fig. 3 A range query using
5-NN search
Pur
Cr
P5
r
P2
P3
q
P1
P4
Rq
Pll
Range Query R
C Rq
Geoinformatica
4.1 Algorithm using flexible values of k
Some web applications can support k-NN interface with flexible values of k so that
users can choose any integer value for k-NN search. Some also provide the total
number of data (objects, i.e., stores or branches) and their service area. By carefully
determining the k value, a range query can be evaluated with a single k-NN search if
all the objects in the given query region can be found from the results of the very first
k-NN search. Based on this assumption, we propose the Density Based (DB) range
query algorithm, a statistical approach for retrieving web data in a given region.
The main idea of DB is to find a good approximation of the k value using some
statistical knowledge of the web data, i.e., the density of the data sets. Assuming that
the density information of web data is available, let N be the total number of objects
in a web source and Aglob al be the area of the minimum bounding rectangle (MBR)
that includes all those objects. Then, two estimation methods are considered: global
and local estimations.
N
The global estimation method is as follows: The global density is Aglob
, and Rq is
al
half the diagonal of R as described in Fig. 3. We then estimate the number of objects
(k) that are inside C Rq with the parameters, Aglob al , N, and Rq :
k = π Rq2 · N/Aglob al .
(1)
k is used to call the first k-NN search to the Web data. From the k-NN result, Cr (kNN circle) is expected to match C Rq . If Cr ≤ C Rq , then we need extra k-NN searches.
Thus, we need to re-estimate the k value using the local estimation.
The local estimation method is as follows: Let Alocal be the area of the current
Cr , Area(Cr )=πr2 and k be the value used by the previous k-NN search. The local
k
density is Alocal
. Then the new estimation of k is:
k = π Rq2 · k/Alocal .
(2)
k is used to call the next k-NN search to the Web data.
In the DB algorithm we first estimate the initial k using the global density of the
data set. The center point of the query region R is the query point q (see Fig. 3).
If Cr is not larger than C Rq , the value of k is under-estimated because Cr cannot
cover the entire R. Then, it is necessary to increase the k value. A new value k is
then calculated using the local density. This local estimation of k continues until Cr
becomes larger than C Rq . If Cr is larger than C Rq , the result of k-NN includes all the
objects inside C Rq . Finally, pruning is required to eliminate the objects outside R but
inside C Rq .
Even though we determine the value of k by taking advantage of known statistics,
there can be a significant number of underestimated/overesitmated cases due to
statistical variation. Underestimation of k value incurs an expensive cost, i.e., a new
k-NN search requiring all new C pru , Cq , Cknn , and Cres , while overestimation of k
value does not.1 From the observation that a slight overestimation of k value can
absorb a significant amount of statistical variance (i.e., more range queries can be
evaluated in a single k-NN search) and it does not create a significant overhead
1 Actually,
overestimation introduce a slight overhead in each cost function but it is way lower than
the overhead caused by underestimation.
Geoinformatica
cost, we introduce the overestimation constant, , for a loosely bounded k value.
To obtain a loosely bounded k value, the k value is multiplied by ( > 1). The
value can be determined from empirical experiments and statistics based upon
the density and distribution of the data set. Algorithm 1 describes the DB range
query. The DB algorithm will be evaluated through both statistical analysis and
empirical experiments in Section 5. We will also demonstrate how the value affects
the performance of the DB algorithm.
Algorithm 1 DB rangeQuery(Pll , Pur , Aglob al , N)
1: ← a constant factor
2: Pc .x ← (Pll .x + Pur .x)/2
3: Pc .y ← (Pll .y + Pur .y)/2
√
4: Rq ← (Pur .x − Pc .x)2 + (Pur .y − Pc .y)2
5: q ← Pc
6: kinit ← estKvalue(Rq , Aglob al , N) // using Eq. (1)
7: Knn{} ← kNNSearch(q, · kinit )
8: r ← maxDistkNN(q, Knn{})
9: k ← kinit
10: while r ≤ Rq do
11:
clear Knn{}
12:
Alocal ← π · r2
13:
k ← estKvalue(Rq , Alocal , k) // using Eq. (2)
14:
Knn{} ← kNNSearch(q, · k)
15:
r ← maxDistkNN(q, Knn{})
16: end while
17: result{} ← pruneResult(Knn{}, Pll , Pur )
18: return result{}
4.2 Algorithms using fixed values of k
In many web applications, the client has no knowledge of the data set at the server
and no control over the value of k (i.e., k is fixed). Considering a typical value of k
in real applications, which ranges from 5 to 25 [4], in general, multiple queries are
required to completely cover a reasonably large R.
Our approach to this general range query problem on the web is as follows: 1)
divide the query region R into subregions, R = {R1 , R2 , ..., Rm }, such that any Ri ∈
R can be covered by a single k-NN search, 2) obtain the resulting points from the
k-NN search on each of the subregions, 3) combine the partial results to provide a
complete result to the original range query R. Then, the main focus is how to divide
R so that the total number of k-NN searches can be minimized. We consider the
following three approaches:
1. A naive approach: divide the entire R into equi-sized subregions (grid cells) such
that any cell can be covered by a single k-NN search.
2. A recursive approach: conduct a k-NN search in the current query region, divide
the current query region into subregions with same sizes if the k-NN search fails
to cover R, and call k-NN search for each of these subregions. Repeat the process
until all subregions are covered.
Geoinformatica
3. A greedy and incremental approach: divide R into subregions with different
sizes. Then select the largest subregion for the next k-NN search. Check the
covered region by the k-NN search and select the next largest subregion for
another k-NN search.
Taking the example in Fig. 3, if r > Rq , then the k-NN search must return all
the points inside C Rq . Therefore, we can obtain all the points inside the region R
after pruning out any points of Pk that are outside R but inside Cr . Then, the range
query is complete. Otherwise, we need to perform some additional k-NN searches
to completely cover R. For example, if we use a 5-NN search for the data set shown
in Fig. 3, then the resulting Cr is smaller than C Rq . For the rest of this section, we
focus on the latter case, which is a more realistic scenario. In a naive approach,
R is divided into a grid where the size of cells is small enough to be covered by
a single k-NN search. However, it might not be feasible to find the exact cell size
with no knowledge of the data sets. Even in the case that we obtain such a grid,
this approach is inefficient due to large amounts of overlapping areas among k-NN
circles and wasted time to search empty cells considering that most real data sets
are not uniformly distributed. Therefore, we provide a recursive approach and an
incremental approach in the following subsections.
4.2.1 Quad drill down (QDD) algorithm
We propose a recursive approach, the Quad Drill Down (QDD) algorithm, as a
solution to the range query problem on the web using the properties of the quadtree. A quad-tree is a tree whose nodes either are leaves or have four children, and it
is one of the most commonly used data structures in spatial databases [14]. The quadtree represents a recursive quaternary decomposition of space wherein at each level
a subregion is divided into four equal sized subregions (quadrants). The properties
of the quad-tree provide a natural framework for optimizing decomposition [20].
Hence, the quad-tree can give us an efficient way to divide the query region R. We
adopt the partitioning idea of the quad-tree, but we do not actually construct the
tree. The main idea of QDD is to divide R into equal-sized quadrants, and then
recursively divide quadrants further until each subregion is fully covered by a single
k-NN search so that all objects in it are obtained. An example of QDD is illustrated
in Fig. 4 (when k = 3), where twenty one k-NN calls are required to retrieve all the
points in R.
Algorithm 2 describes QDD. First, a k-NN search is invoked with a point q,
the center of R. Next, the k-NN circle Cr is obtained from the k-NN result. If Cr
is larger than C_Rq, it entirely covers R, and all objects in R are retrieved. Finally,
a pruning step retrieves only the objects that are inside R − a trivial
case. However, if Cr is smaller than or equal to C_Rq, the query region is partitioned
into four subregions by halving the width and the height of the region.
The previous steps are repeated for every new subregion. The algorithm recursively
partitions the query region into subregions until each subregion is covered by a k-NN
circle. The pruning step eliminates those objects that are inside a k-NN circle but
outside the subregion. Statistical analysis of QDD is presented in Section 5.2 along
with the empirical experiment results.
Fig. 4 A Quad Drill Down range query (a, b): (a) a QDD range query (3-NN searches at q, q1, ..., q4 inside R); (b) a quad-tree representation of QDD (root q with children q1, q2, q3, q4; q1 with children q11, ..., q14; q111, ..., q114 below q11)
4.2.2 Dynamic constrained Delaunay triangulation (DCDT) algorithm
We propose the Dynamic Constrained Delaunay Triangulation (DCDT) algorithm
as another approach to solve the range query using the most restricted k-NN
interface − a greedy and incremental approach using the Constrained Delaunay
Triangulation (CDT).2 DCDT uses triangulations to divide the query range and
keeps track of the triangles covered by k-NN circles using the characteristics of
constrained edges in CDT. DCDT greedily selects the largest uncovered triangle for
the next k-NN search, while QDD follows a pre-defined order.
To avoid redundant k-NN searches on the same area, DCDT includes a
propagation algorithm to cover the maximum possible area within a k-NN circle.
Thus, no k-NN search is wasted, because a portion of every k-NN circle is
added to the covered area of the query range.
2 The terms CDT and DCDT are used for both the corresponding triangulation algorithms and the
data structures that support these algorithms in this paper.
Algorithm 2 QDDrangeQuery(Pll, Pur)
1: q ← getCenterOfRegion(Pll, Pur)
2: Rq ← getHalfDiagonal(q, Pll, Pur)
3: Knn{} ← add kNNSearch(q)
4: r ← getDistToKthNN(q, Knn{})
5: if r > Rq then
6:   result{} ← pruneResult(Knn{}, Pll, Pur)
7: else
8:   clear Knn{}
9:   P0 ← (q.x, Pll.y)
10:  P1 ← (Pur.x, q.y)
11:  P2 ← (Pll.x, q.y)
12:  P3 ← (q.x, Pur.y)
13:  Knn{} ← add QDDrangeQuery(Pll, q)    // q1
14:  Knn{} ← add QDDrangeQuery(P0, P1)    // q4
15:  Knn{} ← add QDDrangeQuery(P2, P3)    // q2
16:  Knn{} ← add QDDrangeQuery(q, Pur)    // q3
17:  result{} ← Knn{}
18: end if
19: return result{}
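The recursion in Algorithm 2 can be sketched in a few lines of Python. This is an illustrative model, not the paper's implementation: `knn_search` stands in for the restrictive web interface by answering k-NN queries against an in-memory point list.

```python
import math

def knn_search(points, q, k):
    """Stand-in for the web k-NN interface: the k points nearest to q."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

def qdd_range_query(points, pll, pur, k=3):
    """QDD sketch: subdivide R until one k-NN circle covers each quadrant."""
    qx, qy = (pll[0] + pur[0]) / 2, (pll[1] + pur[1]) / 2
    rq = math.dist((qx, qy), pur)        # half diagonal Rq of the region
    knn = knn_search(points, (qx, qy), k)
    r = math.dist((qx, qy), knn[-1])     # distance to the kth nearest neighbor
    if len(knn) < k or r > rq:           # the k-NN circle covers the region:
        return {p for p in knn           # prune to the points inside R
                if pll[0] <= p[0] <= pur[0] and pll[1] <= p[1] <= pur[1]}
    result = set()                       # otherwise recurse on the 4 quadrants
    result |= qdd_range_query(points, pll, (qx, qy), k)               # q1
    result |= qdd_range_query(points, (qx, pll[1]), (pur[0], qy), k)  # q4
    result |= qdd_range_query(points, (pll[0], qy), (qx, pur[1]), k)  # q2
    result |= qdd_range_query(points, (qx, qy), pur, k)               # q3
    return result
```

On a regular grid of points, the recursion bottoms out after a few levels, since each subdivision halves the quadrant's half-diagonal while the kth-NN distance is bounded below.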
Data Structure for DCDT Given a planar graph G with a set P of n vertices
in the plane, a set of edges E, and a set of special edges C, the CDT of G is a
triangulation of the vertices of G that includes C as part of the triangulation [5], i.e.,
all edges in C appear as edges of the resulting triangulation. These edges are referred
to as constrained edges, and they are not crossed (destroyed) by any other edges
of the triangulation. CDT is also called an obstacle triangulation or a generalized
Delaunay triangulation. Figure 5 illustrates a graph G and the corresponding CDT.
In this example, Fig. 5a shows a set of vertices, P = {P1, P2, P3, P4, P5, P6, P7}, and
a set of constrained edges, C = {C1, C2}. Figure 5b shows the result of CDT that
includes the constrained edges C1 and C2 as part of the triangulation.
We define the following data structures to maintain the covered and uncovered
regions of R:
• pList: a set of all vertices of G; initially {Pll, Prl, Plr, Pur}
• cEdges: a set of all constrained edges of G; initially empty
• tList: a set of all uncovered triangles of G; initially empty
pList and cEdges are updated based on the region covered by the current k-NN
search. R is then triangulated (partitioned into subregions) using CDT with the
current pList and cEdges. For every new triangulation, we obtain a new tList.
DCDT keeps track of covered triangles (subregions) and uncovered triangles; covered
triangles are bounded by constrained edges (i.e., all three edges are constrained
edges), and uncovered triangles are kept in tList, sorted in descending
order by area. DCDT chooses the largest uncovered triangle
in tList and calls a k-NN search using its centroid (which, unlike the circumcenter,
always lies inside the triangle) as the query point. Our algorithm thus uses a
heuristic approach that picks the largest triangle from the list of uncovered triangles.
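The greedy selection step can be illustrated with a short Python sketch (an illustration, not the authors' code): the centroid is the average of the three vertices and always lies strictly inside the triangle.

```python
def tri_area(a, b, c):
    """Triangle area via the cross product (absolute value, halved)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2

def centroid(a, b, c):
    """The centroid always lies strictly inside the triangle."""
    return ((a[0] + b[0] + c[0]) / 3, (a[1] + b[1] + c[1]) / 3)

def next_query_point(t_list):
    """Greedy DCDT choice: centroid of the largest uncovered triangle."""
    largest = max(t_list, key=lambda t: tri_area(*t))
    return centroid(*largest)
```

The centroid is preferable to the circumcenter here: for an obtuse triangle, the circumcenter can fall outside the triangle, which would waste part of the k-NN circle.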
Fig. 5 Example of Constrained Delaunay triangulation (a, b): (a) constrained edges; (b) CDT
The algorithm terminates when no more uncovered triangles exist. We describe the
details of DCDT, shown in Algorithm 3 and Algorithm 4, in the following sections.

The First k-NN search in DCDT DCDT invokes the first k-NN search using the
center point of the query region R as the query point q. Figure 6a shows an example
of a 3-NN search. If the resulting k-NN circle Cr completely covers R (r > Rq), then
we prune and return the result − a trivial case. If Cr does not completely cover R,
DCDT needs to use C′r, a slightly smaller circle than Cr, for checking covered
regions in all further k-NN searches. We have previously discussed the
possibility that the query point q has more than one kth nearest neighbor, of which
one is randomly returned. In that case, DCDT may not be able to retrieve all the
points in R if it uses Cr. Hence we define C′r using r′ = δ · r, where 0 < δ < 1 (δ = 0.99
is used in our experiments).

Fig. 6 An example of the first k-NN call in query region R (a, b): (a) query region R and center point q; (b) marking a hexagon as covered region

DCDT creates an n-gon inscribed in C′r. The choice of n depends on
the tradeoff between computation time and area coverage; n = 6 (a hexagon)
is used in our experiments, as shown in Fig. 6b. All vertices of the n-gon are added
to pList. To mark the n-gon as a covered region, the n-gon is triangulated and the
resulting edges are added as constrained edges to cEdges (getConstrainedEdges(),
line 12 of Algorithm 3). The algorithm then constructs a new triangulation with the current
pList and cEdges, and a newly created tList is returned (constructDCDT(), line 13
of Algorithm 3).
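The inscribed n-gon can be generated directly from q, r and δ. A small sketch (the function name is ours, not from the paper):

```python
import math

def inscribed_ngon(q, r, delta=0.99, n=6):
    """Vertices of an n-gon inscribed in the shrunken circle C'_r, r' = delta*r."""
    rp = delta * r
    return [(q[0] + rp * math.cos(2 * math.pi * i / n),
             q[1] + rp * math.sin(2 * math.pi * i / n)) for i in range(n)]
```

Every vertex lies exactly on C′r, so the hexagon (and its triangulation) is guaranteed to be inside the region actually covered by the k-NN result.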
Covering Regions DCDT selects the largest uncovered triangle from tList for
the next k-NN search. With the new C′r, DCDT updates pList and cEdges and
creates a new triangulation. For example, if an edge lies within C′r, then DCDT adds
it to cEdges; if an edge intersects C′r, then the partially covered
edge is added to cEdges. Figure 7 shows an example of how to update and maintain
pList, cEdges and tList. △1 (△max) is the largest triangle in tList, therefore △1 is
selected for the next k-NN search. DCDT uses the centroid of △1 as the query
point q for the k-NN search. C′r partially covers △1: vertices v5 and v6 are inside C′r but
vertex v3 is outside C′r. C′r intersects △1 at k1 along v3v6 and at k2 along v3v5; hence,
k1 and k2 are added to pList (getCoveredVertices(), line 3 of Algorithm 4). Let
Rc be the covered area (polygon) of △1 by C′r (Fig. 7c). Then Rc is triangulated
and the resulting edges, k1k2, k1v6, k2v5, v5v6 and k1v5, are added to cEdges.
The updated pList and cEdges are used for constructing a new triangulation of R
(Fig. 7c).
Algorithm 3 DCDTrangeQuery(Pll, Pur, δ)
1: pList{} ← {Pll, Prl, Plr, Pur}; cEdges{} ← { }; tList{} ← { }
2: Knn{} ← { } // all objects retrieved by k-NN search
3: q ← getCenterOfRegion(Pll, Pur)
4: Knn{} ← add kNNSearch(q)
5: r ← getDistToKthNN(q, Knn{})
6: Cr ← circle centered at q with radius r
7: Rq ← getHalfDiagonal(q, Pll, Pur)
8: if r ≤ Rq then // Cr does not cover the given region R
9:   C′r ← circle centered at q with radius δ · r
10:  Ng ← create an n-gon inscribed in C′r with q as the center
11:  pList{} ← add {all vertices of Ng}
12:  cEdges{} ← getConstrainedEdges({all vertices of Ng})
13:  tList{} ← constructDCDT(pList{}, cEdges{})
14:  while tList is not empty do
15:    mark all triangles in tList as unvisited
16:    △max ← getMaxTriangle()
17:    q ← getCentroid(△max)
18:    Knn{} ← add kNNSearch(q)
19:    r ← getDistToKthNN(q, Knn{})
20:    C′r ← circle centered at q with radius δ · r
21:    checkCoverRegion(△max, C′r, pList{}, cEdges{})
22:    tList{} ← constructDCDT(pList{}, cEdges{})
23:  end while
24: end if
25: result{} ← pruneResult(Knn{}, Pll, Pur)
26: return result{}
Algorithm 4 checkCoverRegion(△, C′r, pList{}, cEdges{})
1: if △ is marked unvisited then
2:   mark △ as visited
3:   localPList{} ← getCoveredVertices(△, C′r)
4:   localCEdges{} ← getConstrainedEdges(localPList{})
5:   if localCEdges is not empty then
6:     pList{} ← add localPList{}
7:     cEdges{} ← add localCEdges{}
8:     neighbors{} ← findNeighbors(△)
9:     for every △i in neighbors{}, i = 1, 2, 3 do
10:      checkCoverRegion(△i, C′r, pList{}, cEdges{})
11:    end for
12:  end if
13: end if
As shown in Fig. 7c, the covered region of △1 can be much smaller than the
coverage area of C′r. In order to maximize the region covered by a k-NN search,
DCDT propagates to (visits) the neighboring triangles of △1 (triangles that share an
edge with △1). DCDT marks the covered regions starting from △max, recursively
visiting neighboring triangles (findNeighbors(), lines 8–11 of Algorithm 4). Figure 8
shows an example of the propagation process. In Fig. 8a, △max is completely covered,
and its neighboring triangles △11, △12 and △13 (1-hop neighbors of △max) are partially
covered in Fig. 8b, c and d, respectively. In Fig. 8e, the neighboring triangles of △max's
1-hop neighbors, i.e., 2-hop neighbors of △max, are partially covered. Finally, based
on the returned result of checkCoverRegion() shown in Fig. 8e, DCDT constructs a
new triangulation as shown in Fig. 8f. Algorithm 4 describes the propagation process.

Fig. 7 Example of pList, cEdges and tList: covering △max (a–c)
(a) step 1: pList = {v1, v2, v3, v4, v5, v6}; cEdges = { }; tList = {△1, △2, △3, △4, △5, △6}
(b) step 2: pList = {v1, ..., v6, k1, k2}; cEdges = {v5v6, v6k1, k1k2, k2v5, v5k1}; tList = {△1, ..., △6}
(c) step 3: pList = {v1, ..., v6, k1, k2}; cEdges = {v5v6, v6k1, k1k2, k2v5, v5k1}; tList = {△2, △5, △6, △7, △8, △9, △10, △11}

Fig. 8 Example of checkCoverRegion() (a–f)

Notice that DCDT covers a certain portion of R at every step, and therefore the
covered region grows as the number of k-NN searches increases. On the contrary,
QDD discards the returned result from a k-NN call when it does not cover the
whole subregion, which results in re-visiting the subdivisions of the same region
with 4 additional k-NN calls (see lines 5–16 in Algorithm 2). Also note that the cost
of triangulation is negligible compared to that of a k-NN search, because triangulation is
performed in memory and incrementally.
DCDT repeats these processes until no uncovered regions remain. The resulting
k-NN circle C′r can either completely or partially cover R. The example in
Fig. 9a shows the first step of DCDT. The shaded triangles are covered subregions,
and all the edges of these triangles are now considered constrained edges.
Figure 9b and c illustrate the DCDT result after the 5th and the 9th k-NN search,
respectively.

Fig. 9 Example of DCDT with k = 15 (coverage rate) (a–c): (a) 1st k-NN call (12%); (b) 5th k-NN call (52%); (c) 9th k-NN call (92%)

Cases of Covered Regions In this section we define the cases of the covered region Rc
and discuss the details of Algorithm 4. Let C′r be the returned (shrunken) k-NN circle
centered at q and T be the given triangle that we consider covering with C′r. Let a, b and c
be the three vertices of T. The following cases are defined for the covered region Rc,
and the corresponding examples are illustrated in Fig. 10.

Fig. 10 Examples of the cases for covered region Rc (a–h)

Case 1: All three vertices, a, b and c, are inside C′r so that T is completely covered
by a k-NN search: Rc is △abc.
Case 2: Two vertices (say a and b) are inside C′r but the third vertex (say c)
is outside C′r. Then C′r has one intersection (say b′) with bc and one
intersection (say a′) with ca: Rc is the polygon with the edges ab, aa′, b′b and a′b′.
Case 3: One vertex (say a) is inside C′r but the two other vertices (say b and c) are
outside C′r;
  Case 3.1: C′r has one intersection (say b′) with ab and one intersection
  (say c′) with ca: Rc is △ab′c′.
  Case 3.2: C′r has one intersection (say b′) with ab, one intersection (say
  c′) with ca, and either one intersection (say k1) or two intersections (say k1 and k2)
  with bc: Rc is the polygon with the edges ab′, b′k1, k1k2, k2c′ and c′a.
Case 4: All three vertices are outside C′r;
  Case 4.1: The center of C′r is inside T and C′r has either no intersection
  or only one intersection with each edge of T: create an n-gon
  inscribed in C′r; then Rc is the n-gon.
  Case 4.2: C′r has two intersections with at least two edges of T: let these
  intersection points be k1, k2, ..., ki where 4 ≤ i ≤ 6. Rc is the
  polygon with the edges k1k2, k2k3, ..., kik1.
  Case 4.3: C′r has two intersections with one edge (say ab) of T. Let these
  two intersections be k1 and k2 and let m be the center point of k1k2:
    Case 4.3a: If the center of C′r is inside T, then add k1k2,
    create an n-gon inscribed in C′r, and find the
    intersections of the n-gon and ab. Let S be a set of
    points that includes: 1) the points of the n-gon that are
    inside T, and 2) the intersection points of the n-gon
    and ab; S = {k1, k2, ..., ki}. Rc is the polygon with
    edges through the points in S.
    Case 4.3b: If the center of C′r is outside T, then find the intersection
    point of C′r and the line that is perpendicular
    to k1k2 and passes through the point m. Let this point be m′.
    Then Rc is △k1k2m′.
After Rc is defined, Tc is obtained by triangulating Rc. Then all vertices of Rc are
added to pList, and all edges of Tc are added to cEdges. The updated pList and
cEdges are used for creating a new triangulation.
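The top level of this case analysis only needs the number of triangle vertices falling inside C′r. A hedged Python sketch of the dispatch (the per-case polygon construction is omitted; names are ours):

```python
import math

def coverage_case(tri, center, radius):
    """Top-level dispatch of the case analysis: count vertices inside C'_r."""
    inside = sum(math.dist(v, center) < radius for v in tri)
    return {3: "case 1: triangle fully covered",
            2: "case 2: two vertices inside",
            1: "case 3: one vertex inside",
            0: "case 4: all vertices outside"}[inside]
```

Cases 3 and 4 then branch further on circle-edge intersections, as enumerated above.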
4.3 Algorithm using bounded values of k
Some web applications support k-NN queries with bounded values of k. We propose
the Hybrid (HB) algorithm to show how the solutions for the two previous cases
of k-NN interfaces − flexible values of k and fixed values of k − can be combined
to serve these applications. When the range of k values is bounded, not only a good
estimate of the k value but also a good placement of the query point q must be
considered.
4.3.1 Hybrid (HB) algorithm
HB first estimates the value of k based on the global density. If additional k-NN
search calls are required, the new k value is calculated based on the local density.
The estimation of the k value continues until either the k-NN circle covers the query
region R or the k value reaches the maximum value allowed. If the k value reaches
the maximum value of k defined by the application before R is covered, R is
partitioned into subregions and the previous steps are invoked for each of these regions.
The algorithm keeps recursively partitioning the query region into subregions until
every subregion is covered by its k-NN circle.
Considering QDD and DCDT as approaches for when the estimated k value
reaches the maximum before covering the entire region, QDD divides the region
into four equal-sized subregions while DCDT divides the region into irregularly
shaped triangles. Notice that the density of each subregion can be assumed to be the same as
the density of the entire region in QDD. This assumption may not hold for the
irregularly shaped subregions of DCDT, resulting in some overhead when calculating
each subregion's density. In this paper, we combine the basic approaches of DB and
QDD discussed in Section 4.1 and Section 4.2, respectively. The implementation
of HB is then straightforward, and the details of the HB algorithm are described in
Algorithm 5.
Algorithm 5 HBrangeQuery(Pll, Pur, Aglobal, N)
1: ε ← a constant factor; kmax ← a system factor
2: q ← getCenterOfRegion(Pll, Pur)
3: Rq ← getHalfDiagonal(q, Pll, Pur)
4: kinit ← estKvalue(Rq, Aglobal, N, ε)
5: Knn{} ← kNNSearch(q, kinit)
6: r ← maxDistkNN(q, Knn{}); k ← kinit
7: while r < Rq and k < kmax do
8:   clear Knn{}
9:   Alocal ← π · r²
10:  k ← estKvalue(Rq, Alocal, k, ε)
11:  Knn{} ← kNNSearch(q, k)
12:  r ← maxDistkNN(q, Knn{})
13: end while
14: if r < Rq then
15:  clear Knn{}
16:  P0 ← (q.x, Pll.y)
17:  P1 ← (Pur.x, q.y)
18:  P2 ← (Pll.x, q.y)
19:  P3 ← (q.x, Pur.y)
20:  Knn{} ← add HBrangeQuery(Pll, q, Alocal, k)
21:  Knn{} ← add HBrangeQuery(P0, P1, Alocal, k)
22:  Knn{} ← add HBrangeQuery(P2, P3, Alocal, k)
23:  Knn{} ← add HBrangeQuery(q, Pur, Alocal, k)
24:  return result{} ← Knn{}
25: else
26:  return result{} ← pruneResult(Knn{}, Pll, Pur)
27: end if
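The estimation loop of Algorithm 5 (lines 4–13) can be sketched as follows. This is a model, not the paper's code: `knn_radius` stands in for the k-NN interface by returning the radius of the k-NN circle for a given k, and the progress guard on k is our addition so the sketch always terminates.

```python
import math

def est_k(rq, area, n, eps):
    """Density estimate: expected number of points in C_Rq, inflated by eps."""
    return math.ceil(eps * n * (math.pi * rq ** 2) / area)

def hb_estimate(rq, a_global, n, eps, k_max, knn_radius):
    """Re-estimate k from the local density until the k-NN circle covers R
    (r >= rq) or k reaches k_max."""
    k = min(est_k(rq, a_global, n, eps), k_max)
    r = knn_radius(k)
    while r < rq and k < k_max:
        a_local = math.pi * r * r              # area of the last k-NN circle
        k_new = est_k(rq, a_local, k, eps)     # local-density estimate
        k = min(max(k_new, k + 1), k_max)      # force progress (our guard)
        r = knn_radius(k)
    return k, r
```

If k reaches k_max while r < rq, Algorithm 5 falls back to the QDD-style subdivision (lines 14–24), which is not modeled here.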
5 Performance evaluation
In this section, we evaluate the performance of our proposed algorithms. We first
discuss the statistical analysis of the DB and QDD algorithms, comparing it with their
experimental results. We then present the experimental results of DB, QDD and
DCDT. Based on the observations from the results of DB and QDD, the behavior of
HB and its performance can be derived.
5.1 Datasets and experimental methodology
In this paper, we evaluate the proposed range query algorithms using both synthetic
and real GIS data sets. Our synthetic data sets use both uniform (random) and
skewed (exponential) distributions. For the uniform data sets, (x, y) locations are
distributed uniformly and independently between 0 and 1. The (x, y) locations for
the skewed data sets are independently drawn from an exponential distribution with
mean 0.3 and standard deviation 0.3. The number of points in each data set varies:
1K, 2K, 4K, and 8K points. Our real data set is from the U.S. Geological Survey in
2001: Geochemistry of consolidated sediments in Colorado in the U.S. [18]. The data
set contains 5,410 objects (points).
Our performance metric is the number of k-NN calls required for a given range
query. We performed range queries using the proposed algorithms with various
data sets and measured the average number of k-NN searches to support a range
query while varying the size of range queries and the distribution of the data sets.
In our experiments, we varied the range query size between 1% and 10% of the
entire region of the data set. Different k values between 5 and 50 were used for the
QDD algorithm, and the DB algorithm used varying values from 1.0 to 2.0. Range
queries were conducted for 100 trials for each query size and the average values were
reported. We present only the most illustrative subset of our experimental results due
to space limitations. Similar qualitative and quantitative trends were observed in all
other experiments.
5.2 Statistical analysis
The data distribution in a query region may affect the performance of the algorithms.
One can devise a better solution with a complete knowledge of the data set. To
show the expected performance of the DB and QDD algorithms and evaluate the
impact of data distribution on the algorithms, this section presents statistical analysis
of the two algorithms. We first statistically analyze DB and QDD; then the expected
performances are compared with the empirical results on our data sets.
5.2.1 Statistical analysis of DB
We generated a number of data points assuming two typical distributions of their
locations, uniform and exponential. Then, while varying the size of range queries, we
analyze how many points will be distributed in a query region and how it will impact
the performance of the algorithms.
We created various sized synthetic data sets (1K, 2K, 4K, 8K, 16K points) using
both uniform and exponential distributions of their locations. The size of the query
region was varied such that C_Rq covered 3%, 5%, and 10% of the entire
region of the data set. For each experiment, 100K range queries were issued at randomly
generated query points. Then, the average number of points in a query region was
calculated.
Fig. 11 Data distribution of the number of points in a query region (uniform and exponential data sets; query sizes 3%, 5% and 10%)
Figure 11 shows the distributions of the number of points in a query region with
query sizes of 3%, 5%, and 10% for the 4K synthetic data sets. The x-axis
represents the number of points inside R, and the y-axis represents the probability
of each number of points. The distribution for the uniform data set is normal while
that for the exponential data set is skewed. For example, the 3%
query size in the uniform data set retrieved 120 points on average, and the maximum
number of points was 156. Our global estimation method used in the DB algorithm
decides the initial k value using the global estimate (i.e., k = 3 · 4000/100 = 120),
which matches the result from the statistical analysis well. Then, if DB uses ε = 1.3
(i.e., k = 120 × 1.3 = 156), it can retrieve all the points in R in a single k-NN search.
Even though the exponential data set resulted in a skewed distribution, two or three
k-NN searches can be enough to retrieve all points in R. For example, approximately
75% of the range queries returned no more points than the estimated value of k
in DB. In our experimental results of DB in Section 5.3, the value of k was over-estimated
in most cases with the exponential data set. This shows that DB can be
effective regardless of the distribution of the data set.
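The worked example above in code (a sketch; the function name and the rounding are ours − the paper simply multiplies the global estimate by ε):

```python
def db_initial_k(n_total, query_fraction, eps):
    """DB's global estimate: expected points in the query region, inflated
    by eps, rounded to the nearest integer."""
    return round(eps * n_total * query_fraction)
```

With 4,000 points and a 3% query region, ε = 1.3 inflates the expected 120 points to 156, which was enough to cover all uniform-data queries in one k-NN call.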
5.2.2 Statistical analysis of QDD
Suppose that we have a perfectly balanced quad-tree. Each node corresponds to a
region in which a range query is invoked at its center point. Let D be the diagonal
of the initial query region R and dk be the minimum of the distances between any
point in the data set and its kth nearest neighbor. The region represented by a node
becomes smaller and smaller while going down the tree. Finally, when the diagonal of
each region becomes less than or equal to dk, R can be completely covered by k-NN
circles so that all the points in R can be retrieved. When does the diagonal of every
region become less than or equal to dk? Let L be the tree level at which the diagonal
of each region is less than or equal to dk (i.e., L is the leaf level of the quad-tree).
Since the diagonal halves at every level, L = log2(D/dk). Let Nknn be the total
number of nodes of the perfect quad-tree. Then we have

$$N_{knn} \le 4^0 + 4^1 + 4^2 + \cdots + 4^L = \sum_{i=0}^{L} 4^i \qquad (3)$$
By solving (3), we get

$$N_{knn} \le \sum_{i=0}^{\log_2(D/d_k)} 4^i = \frac{4\,(D/d_k)^2 - 1}{3} \qquad (4)$$
The performance of the Naive approach is bounded by Nknn because the Naive grid
needs to be perfectly balanced, i.e., each cell must be the same size.
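Equation (4) can be checked numerically. A small sketch (our illustration; it assumes D/dk is a power of two so the leaf level is exact):

```python
import math

def naive_calls_bound(D, dk):
    """Bound (3)/(4): node count of a perfect quad-tree whose leaf level is
    L = log2(D / dk)."""
    L = round(math.log2(D / dk))
    return sum(4 ** i for i in range(L + 1))
```

The geometric sum matches the closed form of equation (4), e.g. D/dk = 8 gives 1 + 4 + 16 + 64 = 85, as in the Naive column of Table 2.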
However, the quad-tree representation of QDD does not need to be perfectly
balanced, and the total number of nodes can be smaller than Nknn (see the example
in Fig. 4b). In QDD, each level of the quad-tree has a different size of query region,
i.e., the size of its quadrant. If there are j levels from level 0 to level j − 1, then j
different sizes of query region are created. In order to analyze the performance
of QDD, let us assume that the distribution of the data sets is known.
First, we perform random range queries to get the data distribution for each size of
range query. We then calculate, for each level of the quad-tree, the
probability that its quadrant cannot be covered by a single k-NN search. A non-zero
probability means that the subregion must be further divided. Finally, we estimate
the total number of nodes of the quad-tree. Figure 12 shows a data distribution of
the number of points in a query region.
Let E be the expected number of points in C_Rq, where E = N · area(C_Rq) / A_global.
Let P_B be the probability that C_Rq contains more than k points, i.e.,
P_B = P(E > k). Let P_Bi be the probability that C_Rq contains more than k points at the ith
level of the tree. Then the number of nodes at the ith level is (∏_{j=0}^{i−1} P_Bj) · 4^i, and
the number of nodes at the 0th level is obviously one. Let N′knn denote the expected
number of nodes in a quad-tree representation of QDD; it is quantified as:

$$N'_{knn} = 1 + P_{B_0}\cdot 4^1 + \cdots + P_{B_0}P_{B_1}\cdots P_{B_{L-1}}\cdot 4^L = 1 + \sum_{i=1}^{L}\Bigl(\prod_{j=0}^{i-1} P_{B_j}\Bigr)\cdot 4^i \qquad (5)$$

Fig. 12 Data distribution in a query region
Table 2 Statistical analysis of QDD: 4K synthetic uniformly distributed data set

| Level | Query size | P_Bi (%) | k=5 | 10 | 15 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|---|---|---|
| Level 0 | 3% | P_B0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Level 1 | 1/4¹ · 3% | P_B1 | 100 | 100 | 99.96 | 96.27 | 50.00 | 3.73 | 0 |
| Level 2 | 1/4² · 3% | P_B2 | 54.48 | 45.52 | 6.35 | 0 | 0 | 0 | 0 |
| Level 3 | 1/4³ · 3% | P_B3 | 36.94 | 0 | 0 | 0 | 0 | 0 | 0 |
| Level 4 | 1/4⁴ · 3% | P_B4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Number of k-NN searches | | Nknn | 341 | 85 | 85 | 85 | 21 | 21 | 5 |
| | | N′knn | 108 | 51 | 26 | 21 | 13 | 6 | 5 |
| | | QDD | 115 | 59 | 29 | 25 | 16 | 7 | 3 |
Since P_Bi ≤ 1 for every level i, ∏_{j=0}^{i−1} P_Bj becomes smaller and smaller as the tree
level increases, and finally approaches 0. Therefore, N′knn ≤ Nknn.
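Equation (5) is easy to evaluate in code; the sketch below checks it against the k = 15 column of Table 2 (P_B values taken from the table; the small gap to the reported N′knn = 26 is presumably rounding):

```python
def expected_qdd_calls(p_b):
    """Equation (5): expected node count of the QDD quad-tree, given the
    per-level probabilities P_B0, ..., P_B(L-1) that a quadrant is not
    covered by a single k-NN search."""
    total, prod = 1.0, 1.0
    for i, p in enumerate(p_b, start=1):
        prod *= p                      # P_B0 * P_B1 * ... * P_B(i-1)
        total += prod * 4 ** i         # expected nodes at level i
    return total
```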
Table 2 shows the analysis of QDD for our 4K synthetic uniformly distributed
data set. The quad-tree for the 4K data set has six levels, from level 0 to 5.
Each probability P_Bi for level i is calculated based on the data distribution of the
given range query. We compare N′knn with Nknn as well as with the results of the
QDD algorithm. For example, when k = 15, a quadrant at level 1 is not covered by
a single k-NN search with 99.96% probability, and one at level 2 is not covered
by a k-NN search with 6.35% probability. The perfect quad-tree with four levels
has 85 nodes in total. Therefore, the total number of k-NN calls for the Naive
approach is Nknn = 85, while N′knn = 26 for this case. The number of k-NN calls with
QDD for this example was 29, as presented in Section 5.3. Overall, N′knn showed a 62%
average reduction compared to Nknn. We also observed that the maximum difference
between QDD and N′knn was 9%. This statistical analysis describes the behavior of
the QDD algorithm.
QDD discards the returned k-NN result when the k-NN circle does not cover the
given query region and calls four additional k-NN searches for the four new
subregions. Thus, we can consider skipping some levels of the QDD process
to reduce unnecessary calls. If Rc is the current query region of a k-NN search, and
Ps is the probability that the k-NN circle completely covers Rc, then 1 − Ps is the
probability that the k-NN circle fails to cover the region. From statistical analysis
based on the results of the previous k-NN searches, we estimate the probability Ps
that the k-NN search covers Rc. To make skipping the current k-NN search worthwhile,
the expected total number of k-NN searches for the current level and the next level
must be greater than four calls, since four k-NN searches suffice if we skip the current level
and directly go to the four subregions of the next level. Hence we obtain the condition on
Ps from the inequality Ps · 1 + (1 − Ps) · 5 > 4. Therefore, QDD can skip a k-NN
search for Rc iff Ps < 1/4.
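The skip rule follows from a one-step expected-cost comparison; the inequality as written here is our reading of the (garbled) original: performing the current call costs 1 plus, with probability 1 − Ps, the 4 sub-region calls, versus 4 calls when skipping outright.

```python
def should_skip(ps):
    """Skip the current k-NN call iff its expected cost exceeds going
    straight to the four subregions: 1*ps + 5*(1 - ps) > 4  <=>  ps < 1/4."""
    return ps * 1 + (1 - ps) * 5 > 4
```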
5.3 Experimental evaluation
5.3.1 Results of the DB algorithm
With the statistical knowledge of the data set, DB can decide the k value to
minimize the number of k-NN calls. As a result, DB completed any range query
in one or two k-NN calls in all experiments with both synthetic and real data
sets. The impact of was quantified by measuring the required number of k-
Geoinformatica
120%
100%
80%
1 k-NN call on uni. 4k
60%
2 k-NN calls on uni. 4k
1 k-NN call on exp. 4k
40%
2 k-NN calls on exp. 4k
20%
percentage of range queries
percentage of range queries
120%
0%
100%
80%
60%
1 k-NN call
40%
2 k-NN calls
20%
0%
1
1.2
1.4
1.6
epsilon value
1.8
2
1
(a) 4k Synthetic data set with 3% region queries
1.2
1.4
1.6
epsilon value
1.8
2
(b) Real data set with 3% region queries
Fig. 13 Percentage of k-NN calls of DB(a, b)
NN calls while varying . As increased from 1.0 to 2.0, more range queries
were completed in a single k-NN call. Figure 13a shows what percentage of
range queries were evaluated in one k-NN call (or in two) while varying . For
example, in the case of the uniformly distributed dataset, when = 1.0, 57%
of range queries were covered by one k-NN search while 43% were by two kNN searches. When ≥ 1.2, all range queries were completed in one k-NN call.
Overall, the exponential data set required a higher number of k-NN calls than
the uniform data. The real data set showed a similar trend as the exponential
data set.
DB performed all range queries in one or two k-NN calls with over-estimated
values of k. Because DB uses a loosely bounded k value, the obtained k-NN circle
Cr is larger than C_Rq in most cases. To quantify how accurately DB estimates Cr
with regard to C_Rq (Cr/C_Rq, or r/Rq) using ε, we measured the ratio of r to Rq while
varying ε (Fig. 14). The ratio was approximately 1.26 and 1.98 for the uniform and
exponential data sets, respectively, when ε = 1.5, while the ratio was 1.48 for the
uniform data set and 2.34 for the exponential data set when ε = 2.0. The ratios for
the real data set are presented in Fig. 14b: 1.51 with ε = 1.5 and 1.77 with ε = 2.0.
For both the synthetic and real data sets, the ratio of r to Rq decreased as the size of
the range query increased.

Fig. 14 Accuracy (r/Rq) of DB while varying the size of range query (a, b): (a) 4K synthetic data sets; (b) real data set

Fig. 15 Number of k-NN calls for 4K synthetic data set with 3% range query (a, b): (a) uniformly distributed data set; (b) exponentially distributed data set
5.3.2 Results of QDD and DCDT
First, we compared the performance of QDD and DCDT with the theoretical
minimum, i.e., the necessary condition (N.C.) of ⌈n/k⌉ k-NN calls, where n is the
number of points in R. Figure 15a and b show the comparisons of QDD, DCDT and
N.C. with the uniformly distributed and exponentially distributed 4K synthetic data
sets, respectively. For both DCDT and QDD, the number of k-NN calls decreased
rapidly as the value of k increased, in the range 5–15 for the uniformly distributed
data and 5–10 for the exponentially distributed data. It then decreased slowly as the
value of k became larger, approaching that of N.C. when k is over 35. The results for
the real data set show similar trends to those of the exponentially distributed synthetic
data set (Fig. 16). Note that k is determined not by the client but by the web
interface, and a typical k value in real applications ranges between 5 and 20 [4].
In both the synthetic and real data sets, DCDT needed a significantly smaller number
of k-NN calls than QDD. For example, with the exponential data set
(4K), DCDT showed a 55.3% average reduction in the required number of k-NN
searches compared to QDD.

Fig. 16 Number of k-NN calls for real data set with 3% range query

Table 3 The average percentage reduction in the number of k-NN calls (%)

| Distribution | Query size | k=5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Uniform | 3% | 51.3 | 55.9 | 51.7 | 52.0 | 50.0 | 43.8 | 27.3 | 0 | 0 | 0 |
| | 5% | 56.3 | 53.5 | 56.9 | 56.3 | 55.6 | 52.5 | 54.0 | 43.8 | 35.7 | 0 |
| | 10% | 56.2 | 53.6 | 48.2 | 45.0 | 37.5 | 30.0 | 22.2 | 22.2 | 25.0 | 20.0 |
| Exponential | 3% | 60.5 | 58.8 | 58.6 | 56.9 | 51.1 | 40.0 | 37.5 | 37.5 | 20.0 | 20.0 |
| | 5% | 68.4 | 67.2 | 68.8 | 65.7 | 60.7 | 58.3 | 50.0 | 41.7 | 25.0 | 20.0 |
| | 10% | 69.4 | 59.7 | 56.5 | 56.7 | 55.8 | 34.4 | 46.9 | 46.4 | 35.0 | 31.3 |

As discussed in Section 4.2, DCDT covers a
portion of the query region with every k-NN search while QDD revisits the same
subregion when a k-NN search fails to cover the entire subregion. DCDT still
required more k-NN calls than N.C. On the average, DCDT required 1.67 times
more k-NN calls than N.C. However, the gap became very small when k was greater
than 15.
Next, we conducted range queries with different sizes of query regions: 3%, 5%,
and 10% of the entire data set region. Table 3 shows the average percentage
reduction in the number of k-NN calls between DCDT and QDD for the 4K uniformly
and exponentially distributed synthetic data sets. On average, DCDT resulted in
a 50.4% and 58.2% reduction over QDD for the uniformly and exponentially
distributed data sets, respectively. The results show that the smaller the value
of k, the greater the reduction rate.
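The reduction figures in Table 3 are relative savings of DCDT over QDD. A small helper makes the metric explicit (the function name and example counts are ours; the paper does not state the formula, so this is a hedged reading of "percentage reduction"):

```python
def pct_reduction(qdd_calls: int, dcdt_calls: int) -> float:
    """Percentage reduction in k-NN calls achieved by DCDT relative to
    QDD for the same range query: (QDD - DCDT) / QDD * 100."""
    return 100.0 * (qdd_calls - dcdt_calls) / qdd_calls

# Example: if QDD needed 40 k-NN calls and DCDT needed 20 for the same
# query, DCDT achieves a 50% reduction.
print(pct_reduction(40, 20))  # 50.0
```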
In Fig. 17, we plotted the average coverage rate versus the number of k-NN calls
required for DCDT on a 4K uniformly distributed synthetic data set. Figure 17a
shows the average coverage rate of the DCDT algorithm while varying the value
of k. For example, when k=7, 50% of the query range can be covered by the
first 8 k-NN calls, while the entire range is covered by 40 k-NN calls. The same
range query required 24 calls for 90% coverage, approximately 60% of the total
required number of k-NN searches. Figure 17b shows the average coverage rate of
DCDT while varying the query size. For the 5% range query, 12 and 31 k-NN
searches were required for 50% and 90% coverage, respectively.
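Reading off these coverage curves amounts to finding the first k-NN call at which cumulative coverage crosses a threshold. A sketch of that lookup (the helper and the sample curve are illustrative, not data from the paper):

```python
def calls_to_reach(coverage: list[float], threshold: float) -> int:
    """Given the cumulative fraction of the query range covered after
    each successive k-NN call, return the number of calls needed to
    first reach `threshold` coverage."""
    for calls, covered in enumerate(coverage, start=1):
        if covered >= threshold:
            return calls
    raise ValueError("threshold never reached")

# Hypothetical coverage curve: fraction covered after calls 1..6
curve = [0.20, 0.38, 0.52, 0.71, 0.86, 1.00]
print(calls_to_reach(curve, 0.50))  # 3
print(calls_to_reach(curve, 0.90))  # 6
```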
Fig. 17 Percentage of coverage using 4K uniformly distributed synthetic data set: (a) 3% range query, varying k; (b) k=10, varying query size
Fig. 18 Percentage of coverage using real data set: (a) 3% range query, varying k; (b) k=10, varying query size
Figure 18a and b show the average coverage rate versus the number of k-NN calls
required for DCDT on the real data set, varying the value of k in (a) and the
query size in (b). When k=7, 50% of the query range can be covered by 20% of the
total required number of k-NN searches. The same range query required 16 calls
for 90% coverage, which is 55% of the total required number of k-NN searches.
For the 5% range query, 20% and 58% of the total required k-NN searches were
needed to cover 50% and 90% of the query range, respectively.
6 Conclusions
In this paper, we introduced the problem of evaluating spatial range queries on
web data by using only k-nearest neighbor searches. The problem of finding
a minimum number of k-NN searches to cover the query range appears to be hard
even when the locations of the objects (points) are known.
Based on the classification of k-NN interfaces, four algorithms were proposed to
achieve the completeness and efficiency requirements of spatial range queries. The
Quad Drill Down (QDD) and Dynamic Constrained Delaunay Triangulation (DCDT)
algorithms were proposed for the case where we have no knowledge about the
distribution of the data sets and the most restricted k-NN interface (i.e., no
control over the value of k). We showed that both QDD and DCDT can cover the
entire range even in the most restricted environment while providing reasonable
performance. The efficiency of the algorithms was measured using statistical
analysis and empirical experiments; DCDT provides better performance than QDD.
In the presence of data distribution knowledge and a more flexible k-NN
interface, we proposed the DB algorithm, which supported any range query with
only one or two k-NN searches in all our experiments. The Hybrid (HB) algorithm
provides the possibility of customizing our algorithms for real-world applications.
We plan to extend our range query algorithms to parallel k-NN searches on
multiple web sites, which is expected to significantly improve data integration
and search time. Consequently, a new problem of minimizing the overlapping area
can be introduced.
Acknowledgements The authors thank Prof. Petr Vojtěchovský for providing helpful suggestions
and Brandon Haenlein for the valuable discussions throughout this work.
References
1. Barish G, Chen Y, Dipasquo D, Knoblock CA, Minton S, Muslea I, Shahabi C (2000) Theaterloc:
using information integration technology to rapidly build virtual applications. In: Proceedings
of international conf. on data engineering (ICDE), 28 February–3 March 2000, San Diego,
pp 681–682
2. Bae WD, Alkobaisi S, Kim SH, Narayanappa S, Shahabi C (2007) Supporting range queries on
web data using k-nearest neighbor search. In: Proceedings of the 7th international symposium on
web and wireless GIS (W2GIS 2007), 28–29 November 2007, Cardiff, pp 61–75
3. Bae WD, Alkobaisi S, Kim SH, Narayanappa S, Shahabi C (2007) Supporting range queries on
web data using k-nearest neighbor search. Technical report DU-CS-08-01, University of Denver
4. Byers S, Freire J, Silva C (2001) Efficient acquisition of web data through restricted query
interfaces. In: Poster proceedings of the world wide web conference (WWW10), Hong Kong, 1–5
May 2001, pp 184–185
5. Chew LP (1989) Constrained Delaunay triangulations. Algorithmica 4(1):97–108
6. Dickerson M, Drysdale R, Sack J (1992) Simple algorithms for enumerating interpoint distances
and finding k nearest neighbors. Int J Comput Geom Appl 2:221–239
7. Eppstein D, Erickson J (1994) Iterated nearest neighbors and finding minimal polytopes.
Discrete Comput Geom 11:321–350
8. Gaede V, Günther O (1998) Multidimensional access methods. ACM Comput Surv 30(2):
170–231
9. Hieu LQ (2005) Integration of web data sources: a survey of existing problems. In: Proceedings
of the 17th GI-workshop on the foundations of databases (GvD), Wörlitz, 17–20 May 2005
10. Liu D, Lim E, Ng W (2002) Efficient k nearest neighbor queries on remote spatial databases using
range estimation. In: Proceedings of international conf. on scientific and statistical database
management (SSDBM), Edinburgh, 24–26 July 2002, pp 121–130
11. Megerian S, Koushanfar F (2005) Worst and best-case coverage in sensor networks. IEEE Trans
Mob Comput 4(1):84–92
12. Nie Z, Kambhampati S, Nambiar U (2005) Effectively mining and using coverage and overlap
statistics for data integration. IEEE Trans Knowl Data Eng 17(5):638–651, May
13. Roussopoulos N, Kelley S, Vincent F (1995) Nearest neighbor queries. In: Proceedings of ACM
SIGMOD, San Jose, May 1995, pp 71–79
14. Samet H (1985) Data structures for quadtree approximation and compression. Commun ACM
28(9):973–993, September
15. Sharifzadeh M, Shahabi C (2006) Utilizing voronoi cells of location data streams for accurate
computation of aggregate functions in sensor networks. GeoInformatica 10(1):9–36
16. Song Z, Roussopoulos N (2001) K-nearest neighbor search for moving query point. In: Proceedings
of international symposium on spatial and temporal databases (SSTD), Redondo Beach,
12–15 July 2001, pp 79–96
17. Tao Y, Zhang J, Papadias D, Mamoulis N (2004) An efficient cost model for optimization of
nearest neighbor search in low and medium dimensional spaces. IEEE Trans Knowl Data Eng
16(10):1169–1184, October
18. USGS (2001) USGS mineral resources on-line spatial data. http://tin.er.usgs.gov/
19. Wang G, Cao G, Porta TL (2003) Movement-assisted sensor deployment. In: Proceedings of
IEEE INFOCOM, San Francisco, 30 March–3 April 2003, pp 2469–2479
20. Wang S, Armstrong MP (2003) A quadtree approach to domain decomposition for spatial
interpolation in grid computing environments. Parallel Comput 29(3):1481–1504, April
21. Wu C, Lee K, Chung Y (2006) A Delaunay triangulation based method for wireless sensor
network deployment. In: Proceedings of ICPADS, Minneapolis, July 2006, pp 253–260
22. Yerneni R, Li C, Garcia-Molina H, Ullman J (1999) Computing capabilities of mediators. In:
Proceedings of SIGMOD, Philadelphia, 1–3 June 1999, pp 443–454
Wan D. Bae is currently an assistant professor in the Mathematics, Statistics and Computer Science
Department at the University of Wisconsin-Stout. She received her Ph.D. in Computer Science
from the University of Denver in 2007. Dr. Bae’s current research interests include online query
processing, Geographic Information Systems, digital mapping, multidimensional data analysis and
data mining in spatial and spatiotemporal databases.
Shayma Alkobaisi is currently an assistant professor at the College of Information Technology
in the United Arab Emirates University. She received her Ph.D. in Computer Science from the
University of Denver in 2008. Dr. Alkobaisi’s research interests include uncertainty management
in spatiotemporal databases, online query processing in spatial databases, Geographic Information
Systems and computational geometry.
Seon Ho Kim is currently an associate professor in the Computer Science & Information Technology
Department at the University of District of Columbia. He received his Ph.D. in Computer Science
from the University of Southern California in 1999. Dr. Kim’s primary research interests include
design and implementation of multimedia storage systems, and databases, spatiotemporal databases,
and GIS. He co-chaired the 2004 ACM Workshop on Next Generation Residential Broadband
Challenges in conjunction with the ACM Multimedia Conference.
Sada Narayanappa is currently an advanced computing technologist at Jeppesen. He received
his Ph.D. in Mathematics and Computer Science from the University of Denver in 2006. Dr.
Narayanappa’s primary research interests include computational geometry, graph theory, algorithms, design and implementation of databases.
Cyrus Shahabi is currently an Associate Professor and the Director of the Information Laboratory
(InfoLAB) at the Computer Science Department and also a Research Area Director at the NSF’s
Integrated Media Systems Center (IMSC) at the University of Southern California. He received
his Ph.D. degree in Computer Science from the University of Southern California in August 1996.
Dr. Shahabi’s current research interests include Peer-to-Peer Systems, Streaming Architectures,
Geospatial Data Integration and Multidimensional Data Analysis. He is currently on the editorial
board of ACM Computers in Entertainment magazine. He is also serving on many conference
program committees such as ICDE, SSTD, ACM SIGMOD, ACM GIS. Dr. Shahabi is the recipient
of the 2002 National Science Foundation CAREER Award and 2003 Presidential Early Career
Awards for Scientists and Engineers (PECASE). In 2001, he also received an award from the Okawa
Foundation.