D-Index: Distance Searching Index for Metric Data Sets

Multimedia Tools and Applications, 21, 9–33, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
VLASTISLAV DOHNAL
Masaryk University Brno, Czech Republic
[email protected]
CLAUDIO GENNARO
PASQUALE SAVINO
ISI-CNR, Via Moruzzi, 1, 56124, Pisa, Italy
[email protected]
[email protected]
PAVEL ZEZULA
Masaryk University Brno, Czech Republic
[email protected]
Abstract. In order to speed up retrieval in large collections of data, index structures partition the data into subsets
so that query requests can be evaluated without examining the entire collection. As the complexity of modern
data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new
index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching
strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored
in disk memories. We have qualitatively analyzed D-Index and verified its properties on an actual implementation.
We have also compared D-Index with other index structures and demonstrated its superiority on several real-life
data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high
rate of delete/insert operations.
Keywords: metric spaces, similarity search, index structures, performance evaluation
1. Introduction
The concept of similarity searching based on relative distances between a query and database
objects has become essential for a number of application areas, e.g. data mining, signal
processing, geographic databases, information retrieval, or computational biology. These applications
usually exploit data types such as sets, strings, vectors, or complex structures that can be
exemplified by XML documents. Intuitively, the problem is to find similar objects with
respect to a query object according to a domain specific distance measure. The problem
can be formalized by the mathematical notion of the metric space, so the data elements
are assumed to be objects from a metric space where only pairwise distances between the
objects can be determined.
The growing need to deal with large, possibly distributed, archives requires indexing
support to speed up retrieval. An interesting survey of indexes built on the metric distance
paradigm can be found in [4]. The common assumption is that the costs to build and to
maintain an index structure are much less important compared to the costs that are needed
to execute a query. Though this is certainly true for prevalently static collections of data,
such as those occurring in the traditional information retrieval, some other applications may
need to deal with dynamic objects that are permanently subject of change as a function of
time. In such environments, updates are inevitable and the costs of update are becoming a
matter of concern.
One important characteristic of distance based index structures is that, depending on
specific data and a distance function, the performance can be either I/O or CPU bound.
For example, let us consider two different applications: in the first one, the similarity of
text strings of one hundred characters is computed by using the edit distance, which has a
quadratic computational complexity; in the second application, one hundred dimensional
vectors are compared by the inner product, with a linear computational complexity. Since
the text objects are four times shorter than the vectors and the distance computation on the
vectors is one hundred times faster than the edit distance, the minimization of I/O accesses
for the vectors is much more important than for the text strings. By contrast, the CPU
costs are more significant for computing the distance of text strings than for the distance of
vectors. In general, the strategy of indexes based on distances should be able to minimize
both of the cost components.
An index structure supporting similarity retrieval should be able to execute similarity
range queries with any search radius. However, a significant group of applications needs
retrieval of objects which are in a very near vicinity of the query object—e.g., copy (replica)
detection, cluster analysis, DNA mutations, corrupted signals after their transmission over
noisy channels, or typing (spelling) errors. Thus in some cases, special attention may be
warranted.
In this article, we propose a similarity search structure, called D-Index, that is able to
reduce both the I/O and CPU costs. The organization stores data objects in buckets with
direct access to avoid hierarchical bucket dependencies. Such organization also results in
a very efficient insertion and deletion of objects. Though similarity constraints of queries
can be defined arbitrarily, the structure is extremely efficient for queries searching for very
close objects.
A short preliminary version of this article is available in [8]. The major extensions concern
generic search algorithms, an extensive experimental evaluation, and comparison with other
index structures. The rest of the article is organized as follows. In Section 2, we define the
general principles of the D-Index. The system structure, algorithms, and design issues are
specified in Section 3, while experimental evaluation results can be found in Section 4. The
article concludes in Section 5.
2. Search strategies
A metric space M = (D, d) is defined by a domain of objects (elements, points) D
and a total (distance) function d—a non negative and symmetric function which satisfies
the triangle inequality d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ D. Without any loss of
generality, we assume that distances are bounded by a maximum distance d+.
In general, we study the following problem: Given a set X ⊆ D in the metric space
M, pre-process or structure the elements of X so that similarity queries can be answered
efficiently.
For a query object q ∈ D, two fundamental similarity queries can be defined. A range
query retrieves all elements within distance r to q, that is the set {x ∈ X, d(q, x) ≤ r }. A
nearest neighbor query retrieves the k closest elements to q, that is a set A ⊆ X such that
|A| = k and ∀x ∈ A, y ∈ X − A, d(q, x) ≤ d(q, y).
According to the survey [4], existing approaches to search general metric spaces can be
classified in two basic categories:
Clustering algorithms. The idea is to divide, often recursively, the search space into partitions
(groups, sets, clusters) characterized by some representative information.
Pivot-based algorithms. The second trend is based on choosing t distinguished elements
from D, storing for each database element x distances to these t pivots, and using such
information for speeding up retrieval.
Though the number of existing search algorithms is impressive, see [4], the search costs are
still high, and there is no algorithm that is able to outperform the others in all situations.
Furthermore, most of the algorithms are implemented as main-memory structures and thus
do not scale well to large data files.
The access structure we propose in this paper, called D-Index, synergistically combines
more principles into a single system in order to minimize the amount of accessed data,
as well as the number of distance computations. In particular, we define a new technique
to recursively cluster data in separable partitions of data blocks and we combine it with
pivot-based strategies to decrease the I/O costs. In the following, we first formally specify
the underlying principles of our clustering and pivoting techniques and then we present
examples to illustrate the ideas behind the definitions.
2.1. Clustering through separable partitioning
To achieve our objectives, we base our partitioning principles on a mapping function, which
we call the ρ-split function, where ρ is a real number constrained as 0 ≤ ρ < d + . In order
to gradually explain the concept of ρ-split functions, we first define a first order ρ-split
function and its properties.
Definition 1. Given a metric space (D, d), a first order ρ-split function s^{1,ρ} is the mapping
s^{1,ρ}: D → {0, 1, −}, such that for arbitrary different objects x, y ∈ D, s^{1,ρ}(x) = 0 ∧
s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property), and ρ2 ≥ ρ1 ∧ s^{1,ρ2}(x) ≠ − ∧ s^{1,ρ1}(y) =
− ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).
In other words, the ρ-split function assigns to each object of the space D one of the symbols
0, 1, or −. Moreover, the function must hold the separable and symmetry properties. The
meaning and the importance of these properties will be clarified later.
We can generalize the ρ-split function by concatenating n first order ρ-split functions
with the purpose of obtaining a split function of order n.
Definition 2. Given n first order ρ-split functions s_1^{1,ρ}, …, s_n^{1,ρ} in the metric space (D, d),
a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ}, …, s_n^{1,ρ}): D → {0, 1, −}^n, is the mapping
such that for arbitrary different objects x, y ∈ D, ∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧
s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property), and ρ2 ≥ ρ1 ∧ ∀i s_i^{1,ρ2}(x) ≠
− ∧ ∃j s_j^{1,ρ1}(y) = − ⇒ d(x, y) > ρ2 − ρ1 (symmetry property).
An obvious consequence of the ρ-split function definitions, useful for our purposes, is
that by combining n ρ-split functions of order 1, s_1^{1,ρ}, …, s_n^{1,ρ}, which have the separable and
symmetric properties, we obtain a ρ-split function of order n, s^{n,ρ}, which also demonstrates
the separable and symmetric properties. We often refer to the number of symbols generated
by s^{n,ρ}, that is the parameter n, as the order of the ρ-split function. In order to obtain
an addressing scheme, we need another function that transforms the ρ-split strings into
integers, which we define as follows.
Definition 3. Given a string b = (b_1, …, b_n) of n elements 0, 1, or −, the function
⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as:

⟨b⟩ = [b_1, b_2, …, b_n]_2 = Σ_{j=1}^{n} 2^{j−1} b_j,  if ∀j b_j ≠ −
⟨b⟩ = 2^n,  otherwise

When all the elements are different from '−', the function ⟨b⟩ simply translates the string
b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise the
function returns 2^n.
By means of the ρ-split function and the ⟨·⟩ operator, we can assign an integer number
i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊂ D in
2^n + 1 disjoint sub-sets.
Assuming a ρ-split function s^{n,ρ}, we use the capital letter S^{n,ρ} to designate partitions
(sub-sets) of objects from X which the function can produce. More precisely, given a
ρ-split function s^{n,ρ} and a set of objects X, we define S_{[i]}^{n,ρ}(X) = {x ∈ X | ⟨s^{n,ρ}(x)⟩ = i}. We
call the first 2^n sets the separable sets. The last set, that is, the set of objects for which
⟨s^{n,ρ}(x)⟩ evaluates to 2^n, is called the exclusion set.
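As a minimal illustration (the function name is ours, not from the paper), the ⟨·⟩ operator of Definition 3 can be sketched in Python:

```python
def split_string_to_int(b: str) -> int:
    """Sketch of the <.> operator: map a split string over
    {'0', '1', '-'} of length n to an integer in [0, 2**n].

    Strings free of '-' are read as binary numbers (b[0] is taken as
    the least significant bit, matching the sum of 2**(j-1) * b_j);
    any string containing '-' maps to 2**n, the exclusion set.
    """
    n = len(b)
    if '-' in b:
        return 2 ** n
    return sum(2 ** j * int(bit) for j, bit in enumerate(b))
```

For example, with n = 2 the four separable sets receive the indices 0 through 3, while the exclusion set receives the index 4.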
For illustration, given two split functions of order 1, s_1^{1,ρ} and s_2^{1,ρ}, a ρ-split function of
order 2 produces strings as a concatenation of the strings returned by s_1^{1,ρ} and s_2^{1,ρ}.
The combination of the ρ-split functions can also be seen from the perspective of the
object data set. Considering again the illustrative example above, the resulting function of
order 2 generates an exclusion set as the union of the exclusion sets of the two first order
split functions, i.e. S_{1[2]}^{1,ρ}(D) ∪ S_{2[2]}^{1,ρ}(D). Moreover, the four separable sets, which are given
by the combination of the separable sets of the original split functions, are determined as
follows: {S_{1[0]}^{1,ρ}(D) ∩ S_{2[0]}^{1,ρ}(D), S_{1[0]}^{1,ρ}(D) ∩ S_{2[1]}^{1,ρ}(D), S_{1[1]}^{1,ρ}(D) ∩ S_{2[0]}^{1,ρ}(D), S_{1[1]}^{1,ρ}(D) ∩ S_{2[1]}^{1,ρ}(D)}.
The separable property of the ρ-split function allows us to partition a set of objects X into
sub-sets S_{[i]}^{n,ρ}(X), so that the distance from an object in one sub-set to another object in a
different sub-set is more than 2ρ, which is true for all i < 2^n. We say that such disjoint
separation of sub-sets, or partitioning, is separable up to 2ρ. This property will be used
during retrieval, since a range query with radius ≤ ρ requires accessing only one of the
separable sets and, possibly, the exclusion set. For convenience, we denote the set of separable
partitions {S_{[0]}^{n,ρ}(D), …, S_{[2^n−1]}^{n,ρ}(D)} as {S_{[·]}^{n,ρ}(D)}. The last partition S_{[2^n]}^{n,ρ}(D) is the exclusion
set.
When the ρ parameter changes, the separable partitions shrink or widen correspondingly.
The symmetric property guarantees a uniform reaction in all the partitions. This property
is essential in the similarity search phase as detailed in Sections 3.2 and 3.3.
2.2. Pivot-based filtering
In general, the pivot-based algorithms can be viewed as a mapping F from the original
metric space (D, d) to a t-dimensional vector space with the L ∞ distance. The mapping assumes a set T = { p1 , p2 , . . . , pt } of objects from D, called pivots, and for each
database object o, the mapping determines its characteristic (feature) vector as F(o) =
(d(o, p_1), d(o, p_2), …, d(o, p_t)). We designate the new metric space as M_T = (R^t, L∞). At
search time, we compute for a query object q the query feature vector F(q) = (d(q, p1 ),
d(q, p2 ), . . . , d(q, pt )) and discard for the search radius r an object o if
L ∞ (F(o), F(q)) > r
(1)
In other words, the object o can be discarded if for some pivot pi ,
|d(q, pi ) − d(o, pi )| > r
(2)
Due to the triangle inequality, the mapping F is contractive, that is, all discarded objects
do not belong to the result set. However, some non-discarded objects may not be relevant
and must be verified through the original distance function d(·). For more details, see for
example [3].
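As a sketch (function and variable names are ours), the filtering condition (2) can be written as:

```python
from typing import Sequence

def can_discard(q_feat: Sequence[float], o_feat: Sequence[float],
                r: float) -> bool:
    """Pivot-based filtering: object o is safely discarded for the range
    query R(q, r) when L_inf(F(o), F(q)) > r, i.e. when for some pivot
    p_i we have |d(q, p_i) - d(o, p_i)| > r, a triangle-inequality
    lower bound on d(q, o)."""
    return max(abs(dq - do) for dq, do in zip(q_feat, o_feat)) > r
```

Objects that survive this test still have to be checked with the original distance d, since F is contractive but does not preserve exact proximity.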
2.3. Illustration of the idea
Before we specify the structure and the insertion/search algorithms of D-Index, we illustrate
the ρ-split clustering and pivoting techniques by simple examples.
Though several different types of first order ρ-split functions are proposed, analyzed, and
evaluated in [7], the ball partitioning split (bps), originally proposed in [12] under the name
excluded middle partitioning, provided the smallest exclusion set. For this reason, we also
apply this approach, which can be characterized as follows.
The ball partitioning ρ-split function bps^ρ(x, x_v) uses one object x_v ∈ D and the medium
distance d_m to partition the data file into three subsets BPS_{[0]}^{1,ρ}, BPS_{[1]}^{1,ρ} and BPS_{[2]}^{1,ρ}; the
result of the following function gives the index of the set to which the object x belongs (see
figure 1).

bps^{1,ρ}(x) = 0  if d(x, x_v) ≤ d_m − ρ
bps^{1,ρ}(x) = 1  if d(x, x_v) > d_m + ρ
bps^{1,ρ}(x) = −  otherwise    (3)
Figure 1. The excluded middle partitioning.
Note that the median distance dm is relative to xv and it is defined so that the number of
objects with distances smaller than dm is the same as the number of objects with distances
larger than dm . For more details see [11].
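Assuming a generic metric d (the helper below is our sketch, not the paper's code), function (3) reads:

```python
def bps(x, x_v, d, d_m, rho):
    """First order ball-partitioning rho-split function bps^{1,rho}(x).

    x_v is the reference object, d_m the median of distances to x_v,
    and 2*rho the width of the excluded middle zone. Returns '0', '1',
    or '-' for the three subsets BPS[0], BPS[1], BPS[2].
    """
    dist = d(x, x_v)
    if dist <= d_m - rho:
        return '0'          # inner separable set
    if dist > d_m + rho:
        return '1'          # outer separable set
    return '-'              # exclusion zone
```

An order-n split function is then just the concatenation of n such one-symbol strings, one per reference object.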
In order to see that the bpsρ -split function behaves in the desired way, it is sufficient to
prove that the separable and symmetric properties hold.
Separable property. We have to prove that ∀x_0 ∈ BPS_{[0]}^{1,ρ}, x_1 ∈ BPS_{[1]}^{1,ρ}: d(x_0, x_1) > 2ρ.
If we consider Definition 1 of the split function, we can write d(x_0, x_v) ≤ d_m − ρ
and d(x_1, x_v) > d_m + ρ. Since the triangle inequality among x_0, x_1, x_v holds, we get
d(x_0, x_1) + d(x_0, x_v) ≥ d(x_1, x_v). If we combine these inequalities and simplify the
expression, we get d(x_0, x_1) > 2ρ.
Symmetric property. We have to prove that ∀x_0 ∈ BPS_{[0]}^{1,ρ2}, x_1 ∈ BPS_{[1]}^{1,ρ2}, y ∈ BPS_{[2]}^{1,ρ1}:
d(x_0, y) > ρ2 − ρ1 and d(x_1, y) > ρ2 − ρ1, with ρ2 ≥ ρ1. We prove it only for the object
x_0 because the proof for x_1 is analogous. Since y ∈ BPS_{[2]}^{1,ρ1}, then d(y, x_v) > d_m − ρ1.
Moreover, since x_0 ∈ BPS_{[0]}^{1,ρ2}, then d_m − ρ2 ≥ d(x_0, x_v). By summing both sides of
the above inequalities we obtain d(y, x_v) − d(x_0, x_v) > ρ2 − ρ1. Finally, from the triangle
inequality we have d(x_0, y) ≥ d(y, x_v) − d(x_0, x_v), and therefore d(x_0, y) > ρ2 − ρ1.
As explained in Section 2.1, once we have defined a set of first order ρ-split functions, it
is possible to combine them in order to obtain a function which generates more partitions.
This idea is depicted in figure 2, where two bps-split functions on the two-dimensional
space are used. The domain D, represented by the grey square, is divided into four regions
{S_{[·]}^{2,ρ}(D)} = {S_{[0]}^{2,ρ}(D), S_{[1]}^{2,ρ}(D), S_{[2]}^{2,ρ}(D), S_{[3]}^{2,ρ}(D)}, corresponding to the separable
partitions. The partition S_{[4]}^{2,ρ}(D), the exclusion set, is represented by the brighter region and
is formed by the union of the exclusion sets resulting from the two splits.
Figure 2. Clustering through partitioning in the two-dimensional space.

Figure 3. Example of pivots behavior.
In figure 3, the basic principle of the pivoting technique is illustrated, where q is the query
object, r the query radius, and p_i a pivot. Provided the distance between any object and p_i is
known, the gray area represents the region of objects x that do not belong to the query result.
This can easily be decided, without actually computing the distance between q and x, by using
the triangle inequalities d(x, p_i) + d(x, q) ≥ d(p_i, q) and d(p_i, q) + d(x, q) ≥ d(x, p_i),
and respecting the query radius r. Naturally, by using more pivots, we can improve the
probability of excluding an object without actually computing its distance with respect to q.
3. System structure and algorithms
In this section we specify the structure of the Distance Searching Index, D-Index, and discuss
the problems related to the definition of ρ-split functions and the choice of reference objects.
Then we specify the insertion algorithm and outline the principle of searching. Finally, we
define the generic similarity range and nearest neighbor algorithms.
3.1. Storage architecture
The basic idea of the D-Index is to create a multilevel storage and retrieval structure that
uses several ρ-split functions, one for each level, to create an array of buckets for storing
objects. On the first level, we use a ρ-split function for separating objects of the whole data
set. For any other level, objects mapped to the exclusion bucket of the previous level are
the candidates for storage in separable buckets of this level. Finally, the exclusion bucket of
the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting
that the ρ-split functions of individual levels use the same ρ. Moreover, split functions can
have different order, typically decreasing with the level, allowing the D-Index structure to
have levels with a different number of buckets. More precisely, the D-Index structure can
be defined as follows.
Definition 4. Given a set of objects X, the h-level distance searching index DI^ρ(X, m_1,
m_2, …, m_h), with the buckets at each level separable up to 2ρ, is determined by h independent
ρ-split functions s_i^{m_i,ρ} (i = 1, 2, …, h), which generate:

Exclusion bucket E_i = S_{1[2^{m_1}]}^{m_1,ρ}(X) if i = 1, and E_i = S_{i[2^{m_i}]}^{m_i,ρ}(E_{i−1}) if i > 1;

Separable buckets {B_{i,0}, B_{i,1}, …, B_{i,2^{m_i}−1}} = {S_{1[·]}^{m_1,ρ}(X)} if i = 1, and
{S_{i[·]}^{m_i,ρ}(E_{i−1})} if i > 1.
From the structure point of view, the buckets can be seen as organized in the following
two-dimensional array, consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements.

B_{1,0}, B_{1,1}, …, B_{1,2^{m_1}−1}
B_{2,0}, B_{2,1}, …, B_{2,2^{m_2}−1}
…
B_{h,0}, B_{h,1}, …, B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the E_h exclusion bucket is present; exclusion
buckets E_{i<h} are recursively re-partitioned on level i + 1. Then, for each row (i.e. the
D-Index level) i, the 2^{m_i} buckets are separable up to 2ρ, thus we are sure that there do not
exist two buckets at the same level i both containing relevant objects for any similarity range
query with radius r ≤ ρ. Naturally, there is a trade-off between the number of separable
buckets and the number of objects in the exclusion bucket: the more separable buckets there
are, the greater the number of objects in the exclusion bucket. However, the set of objects in
the exclusion bucket can recursively be re-partitioned, with a possibly different number of
bits (m_i) and certainly different ρ-split functions.
3.2. Insertion and search strategies
In order to complete our description of the D-Index, we present an insertion algorithm and
a sketch of a simplified search algorithm. Advanced search techniques are described in the
next section.
Insertion of object x ∈ X in DI ρ (X, m 1 , m 2 , . . . , m h ) proceeds according to the following
algorithm.
Algorithm 1. Insertion
for i = 1 to h
    if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
        x → B_{i,⟨s_i^{m_i,ρ}(x)⟩};
        exit;
    end if
end for
x → E_h;
Starting from the first level, Algorithm 1 tries to accommodate x into a separable bucket. If
a suitable bucket exists, the object is stored in this bucket. If it fails for all levels, the object x
is placed in the exclusion bucket E h . In any case, the insertion algorithm determines exactly
one bucket to store the object.
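A minimal in-memory sketch of Algorithm 1 follows (the data layout and names are our assumptions; disk blocks are not modeled):

```python
def insert(x, levels, exclusion_bucket):
    """Store x in the first level whose split function maps it to a
    separable bucket; otherwise x goes to the global exclusion bucket.

    `levels` is a list of (split_fn, buckets) pairs: split_fn(x) returns
    a string over {'0', '1', '-'} of length m_i, and `buckets` is a list
    of 2**m_i separable buckets (plain lists here).
    """
    for split_fn, buckets in levels:
        s = split_fn(x)
        if '-' not in s:        # the split string maps below 2**m_i
            idx = sum(2 ** j * int(c) for j, c in enumerate(s))
            buckets[idx].append(x)
            return
    exclusion_bucket.append(x)
```

As in the paper's algorithm, exactly one bucket receives the object, which is what makes insertions and deletions cheap.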
Given a query region Q = R(q, rq ) with q ∈ D and rq ≤ ρ, a simple algorithm can
execute the query as follows.
Algorithm 2. Search
for i = 1 to h
    return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,0}(q)⟩};
end for
return all objects x such that x ∈ Q ∩ E_h;
The function ⟨s_i^{m_i,0}(q)⟩ always gives a value smaller than 2^{m_i}, because ρ = 0. Consequently,
one separable bucket on each level i is determined. Objects from the query response set
cannot be in any other separable bucket on the level i, because rq is not greater than ρ and
the buckets are separable up to 2ρ. However, some of them can be in the exclusion zone,
and Algorithm 2 assumes that exclusion buckets are always accessed. That is why all levels
are considered, and also the exclusion bucket E h is accessed. The execution of Algorithm 2
requires h + 1 bucket accesses, which forms the upper bound of a more sophisticated
algorithm described in the next section.
3.3. Generic search algorithms
The search Algorithm 2 requires accessing one bucket at each level of the D-Index, plus
the exclusion bucket. In the following two situations, however, the number of accesses can
even be reduced:
– if the query region is contained in the exclusion partition of the level i, then the query
cannot have objects in the separable buckets of this level and only the next level, if it
exists, must be considered;
– if the query region is contained in a separable partition of the level i, the following levels,
as well as the exclusion bucket for i = h, need not be accessed, thus the search terminates
on this level.
Another drawback of the simple algorithm is that it works only for search radii up to ρ.
However, with additional computational effort, queries with r_q > ρ can also be executed:
it suffices to evaluate the split function with the reduced parameter r_q − ρ, that is, s^{r_q−ρ}.
In case s^{r_q−ρ} returns a string without any '−', the result is contained in a single bucket
(namely B_{⟨s^{r_q−ρ}⟩}) plus, possibly, the exclusion bucket.
Let us now consider the case where the returned string contains at least one '−'. We indicate
this string as (b_1, …, b_n) with b_i ∈ {0, 1, −}. In case there is only one b_i = '−', we must
access all buckets B whose index is obtained by substituting the '−' in s^{r_q−ρ} with 0 and 1.
In the most general case, we must substitute all the '−' symbols in (b_1, …, b_n) with zeros
and ones and generate all possible combinations. A simple example of this concept is
illustrated in figure 4.
In order to define an algorithm for this process, we need some additional terms and
notation.
Definition 5. We define an extended exclusive OR bit operator, ⊗, which is based on the
following truth table:

bit1  bit2  bit1 ⊗ bit2
 0     0       0
 0     1       1
 0     −       0
 1     0       1
 1     1       0
 1     −       0
 −     0       0
 −     1       0
 −     −       0

Figure 4. Example of use of the function G.
Notice that the operator ⊗ can be used bit-wise on two strings of the same length and that it
always returns a standard binary string (i.e., one that does not contain any '−'). Consequently,
⟨s_1 ⊗ s_2⟩ < 2^n always holds for strings s_1 and s_2 of length n (see Definition 3).
Definition 6. Given a string s of length n, G(s) denotes the sub-set of {0, 1, …, 2^n − 1}
obtained by eliminating all elements e_i for which s ⊗ e_i ≠ 0 (interpreting e_i as a binary
string of length n).

Observe that G(− − ⋯ −) = {0, 1, …, 2^n − 1}, and that the cardinality of the set is 2 raised
to the number of '−' elements; G(·) generates all partition identifications in which the
symbol '−' is alternatively substituted by zeros and ones. In fact, we can use G(·)
to generate, from a string returned by a split function, the set of partitions that need to be
accessed to execute the query. As an example, let us consider n = 2 as in figure 4. If
s^{2,r_q−ρ} = (−1), then we must access buckets B_1 and B_3; in such a case G(−1) = {1, 3}. If
s^{2,r_q−ρ} = (−−), then we must access all buckets, as G(−−) = {0, 1, 2, 3} indicates.
We only apply the function G when the search radius rq is greater than the parameter ρ.
We first evaluate the split function using rq − ρ (which is greater than 0). Next, the function
G is applied to the resulting string. In this way, G generates the partitions that intersect the
query region.
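The expansion performed by G(·) amounts to enumerating all substitutions of '−'; a sketch follows (names are ours, and the bit order is chosen so that G('-1') = {1, 3}, as in the example of figure 4):

```python
from itertools import product

def g(s: str) -> list[int]:
    """Expand a split string over {'0', '1', '-'} into the sorted list
    of bucket indices obtained by substituting every '-' with both 0
    and 1; s[0] is taken as the most significant bit here."""
    dash_positions = [i for i, c in enumerate(s) if c == '-']
    indices = []
    for bits in product('01', repeat=len(dash_positions)):
        chars = list(s)
        for pos, bit in zip(dash_positions, bits):
            chars[pos] = bit
        indices.append(int(''.join(chars), 2))
    return sorted(indices)
```

A string with k occurrences of '−' thus yields 2^k bucket indices, in line with the cardinality observation above.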
In the following, we specify how such ideas are applied in the D-Index to implement
generic range and nearest neighbor algorithms.
3.3.1. Range queries. Given a query region Q = R(q, r_q), with q ∈ D and r_q ≤ d+, an
advanced algorithm can execute the similarity range query as follows.
Algorithm 3. Range search
01. for i = 1 to h
02.   if ⟨s_i^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
03.     return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ+r_q}(q)⟩}; exit;
04.   end if
05.   if r_q ≤ ρ then (search radius up to ρ)
06.     if ⟨s_i^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
07.       return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ−r_q}(q)⟩};
08.     end if
09.   else (search radius greater than ρ)
10.     let {l1, l2, …, lk} = G(s_i^{m_i,r_q−ρ}(q))
11.     return all objects x such that x ∈ Q ∩ B_{i,l1} or x ∈ Q ∩ B_{i,l2} or … or x ∈ Q ∩ B_{i,lk};
12.   end if
13. end for
14. return all objects x such that x ∈ Q ∩ E_h;
In general, Algorithm 3 considers all D-Index levels and possibly also accesses the global
exclusion bucket. However, due to the symmetry property, the test on line 02 can detect
the exclusive containment of the query region in a separable bucket and terminate the search
earlier. Otherwise, the algorithm proceeds according to the size of the query radius. If it
is not greater than ρ, line 05, there are two possibilities. If the test on line 06 is satisfied,
one separable bucket is accessed. Otherwise no separable bucket is accessed on this level,
because the query region is, from this level point of view, exclusively in the exclusion
zone. Provided the search radius is greater than ρ, more separable buckets are accessed
on a specific level as defined by lines 10 and 11. Unless terminated earlier, the algorithm
accesses the exclusion bucket at line 14.
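Putting the pieces together, the range search can be sketched in memory as follows. All names and the in-memory layout are our assumptions: split_fn(q, p) evaluates a level's split function with exclusion-zone parameter p, expand plays the role of G(·), and the bit order follows the MSB-first convention of the G(·) example.

```python
def range_search(q, r_q, rho, levels, exclusion_bucket, d, expand):
    """In-memory sketch of the generic range search (Algorithm 3).

    `levels` is a list of (split_fn, buckets) pairs; buckets are plain
    lists. Every candidate is verified with the original metric d.
    """
    result = []
    for split_fn, buckets in levels:
        s = split_fn(q, rho + r_q)
        if '-' not in s:   # query exclusively inside one separable bucket
            return result + [x for x in buckets[int(s, 2)]
                             if d(q, x) <= r_q]
        if r_q <= rho:
            s = split_fn(q, rho - r_q)
            if '-' not in s:   # not exclusively in the exclusion zone
                result += [x for x in buckets[int(s, 2)]
                           if d(q, x) <= r_q]
        else:                  # radius above rho: expand the '-' symbols
            for idx in expand(split_fn(q, r_q - rho)):
                result += [x for x in buckets[idx] if d(q, x) <= r_q]
    return result + [x for x in exclusion_bucket if d(q, x) <= r_q]
```

Note how the early return mirrors lines 02 and 03: once the query region lies exclusively in a separable bucket, neither the remaining levels nor the exclusion bucket are touched.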
3.3.2. Nearest neighbor queries. The task of the nearest neighbor search is to retrieve the k
closest elements from X to q ∈ D, with respect to the metric of M. Specifically, it retrieves a set
A ⊆ X such that |A| = k and ∀ x ∈ A, y ∈ X − A, d(q, x) ≤ d(q, y). In case of ties, we
are satisfied with any set of k elements complying with the condition. For convenience, we
designate the distance to the k-th nearest neighbor as dk (dk ≤ d + ).
The general strategy of Algorithm 4 works as follows. At the beginning (line 01), we
assume that the response set A is empty and dk = d + . The algorithm then proceeds with
an optimistic strategy (lines 02 through 13), assuming that the k-th nearest neighbor is at
distance at most ρ. If this assumption fails, see the test on line 14, additional search steps are
performed to find the correct result. Specifically, the first phase of the algorithm determines
the buckets (maximally one separable at each level) by using the range query strategy with
radius r = min{dk , ρ}. In this way, the most promising buckets are accessed and their
identifications are remembered to avoid multiple bucket accesses. If dk ≤ ρ the search
terminates successfully and A contains the result. Otherwise, additional bucket accesses are
performed (lines 15 through 22) using the strategy for radii greater than ρ of Algorithm 3,
ignoring the buckets already accessed in the first (optimistic) phase.
Algorithm 4. Nearest neighbor search
01. A = ∅, dk = d+; (initialization)
02. for i = 1 to h (first, optimistic, phase)
03.   r = min{dk, ρ};
04.   if ⟨s_i^{m_i,ρ+r}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
05.     access bucket B_{i,⟨s_i^{m_i,ρ+r}(q)⟩}; update A and dk;
06.     if dk ≤ ρ then exit; (the response set is determined)
07.   else
08.     if ⟨s_i^{m_i,ρ−r}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
09.       access bucket B_{i,⟨s_i^{m_i,0}(q)⟩}; update A and dk;
10.     end if
11.   end if
12. end for
13. access bucket E_h; update A and dk;
14. if dk > ρ then (second phase, executed only if needed)
15.   for i = 1 to h
16.     if ⟨s_i^{m_i,ρ+dk}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
17.       access bucket B_{i,⟨s_i^{m_i,ρ+dk}(q)⟩} if not already accessed; update A and dk; exit;
18.     else
19.       let {b1, b2, …, bk} = G(s_i^{m_i,dk−ρ}(q))
20.       access buckets B_{i,b1}, B_{i,b2}, …, B_{i,bk} if not accessed; update A and dk;
21.     end if
22.   end for
23. end if
The output of the algorithm is the result set A and the distance to the last (k-th) nearest
neighbor dk . Whenever a bucket is accessed, the response set A and the distance to the k-th
nearest neighbor must be updated. In particular, the new response set is determined as the
result of the nearest neighbor query over the union of objects from the accessed bucket and
the previous response set A. The value of dk , which is the current search radius in the given
bucket, is adjusted according to the new response set. Notice that the pivots are used during
the bucket accesses, as explained in the next subsection.
3.4. Design issues
In order to apply D-Index, the ρ-split functions must be designed, which obviously requires
specification of reference (pivot) objects. An important feature of D-Index is that the same
reference objects used in the ρ-split functions for the partitioning into buckets are also used
as pivots for the internal management of buckets.
3.4.1. Choosing references. The problem of choosing reference objects (pivots) is important for any search technique in the general metric space, because all such algorithms need,
directly or indirectly, some “anchors” for partitioning and search pruning. It is well known
that the way in which pivots are selected can affect the performance of the respective algorithms.
This has been recognized and demonstrated by several researchers, e.g. [11] or [1], but
specific suggestions for choosing good reference objects are rather vague. In this article, we
use the same sets of reference objects both for the separable partitioning and the pivot-based
filtering.
Recently, the problem was systematically studied in [2], and several strategies for selecting
pivots have been proposed and tested. The generic conclusion is that the mean µ_{M_T}
of the distance distribution in the feature space M_T should be as high as possible, which is
formalized by the hypothesis that the set of pivots T_1 = {p_1, p_2, …, p_t} is better than the
set T_2 = {p_1′, p_2′, …, p_t′} if

µ_{M_{T_1}} > µ_{M_{T_2}}    (4)
However, the problem is how to find the mean for a given pivot set T. An estimate of this
quantity is proposed in [2] and computed as follows:
– at random, choose l pairs of elements {(o1, o1'), (o2, o2'), ..., (ol, ol')} from the data set D;
– for all pairs of elements, compute their distances in the feature space determined by the pivot set T;
– compute µMT as the mean of these distances.
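The three steps above can be sketched directly. This is a hedged illustration under our reading of [2]: the feature space MT maps an object to its vector of distances to the pivots, compared under the L∞ metric; the function and parameter names are ours.

```python
import random

def mu_MT(pivots, sample, dist, l=1000, rng=random):
    """Estimate the mean mu_MT of the distance distribution in the
    feature space M_T induced by the pivot set T."""
    # Feature mapping: an object is represented by its distances to the pivots.
    def F(o):
        return [dist(o, p) for p in pivots]
    # The distance in the feature space is L_infinity.
    def d_inf(x, y):
        return max(abs(a - b) for a, b in zip(x, y))
    total = 0.0
    for _ in range(l):
        o1, o2 = rng.sample(sample, 2)   # one random pair from the data set
        total += d_inf(F(o1), F(o2))
    return total / l
```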
We apply the incremental selection strategy, which was found in [2] to be the most suitable for real-world metric spaces. It works as follows. At the beginning, a set T = {p1} of one element is chosen from a sample of m database elements, such that the pivot p1 has the maximum µMT value. Then, a second pivot p2 is chosen from another sample of m elements of the database, creating a new set T = {p1, p2} for fixed p1, maximizing µMT. The third pivot p3 is chosen from another sample of m elements of the database, creating the set T = {p1, p2, p3} for fixed p1, p2, maximizing µMT. The process is repeated until t pivots are determined.
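The incremental strategy can be sketched as below. This is an illustrative, self-contained version after [2]; the parameter names, defaults, and the embedded µMT estimator are our own assumptions, not the authors' code.

```python
import random

def choose_pivots_incremental(data, dist, t, m=50, l=200, seed=0):
    """Incremental (INC) pivot selection: at each step, pick from a fresh
    sample of m candidates the pivot that maximizes the estimated mean
    mu_MT with the previously chosen pivots held fixed."""
    rng = random.Random(seed)

    def mu(pivots):
        # Estimated mean of the L_inf distance in the pivot-induced
        # feature space, over l random pairs of data elements.
        total = 0.0
        for _ in range(l):
            o1, o2 = rng.sample(data, 2)
            total += max(abs(dist(o1, p) - dist(o2, p)) for p in pivots)
        return total / l

    pivots = []
    for _ in range(t):
        candidates = rng.sample(data, min(m, len(data)))
        best = max(candidates, key=lambda c: mu(pivots + [c]))
        pivots.append(best)
    return pivots
```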
In order to verify the capability of the incremental (INC) strategy, we have also tried selecting pivots at random (RAN) and with a modified incremental strategy, called the middle search strategy (MID). The MID strategy combines the original INC approach with the following idea. The set of pivots T1 = {p1, p2, ..., pt} is better than the set T2 = {p1', p2', ..., pt'} if |µT1 − dm| < |µT2 − dm|, where dm represents the average global distance in a given data set and µT is the average distance between the pivots in the set T. In principle, this strategy supports choices of pivots with high µMT but also tries to keep the distances between pairs of pivots close to dm. The hypothesis is that pairs of pivots that are too close or too distant are not suitable for good partitioning of metric spaces.
3.4.2. Elastic bucket implementation. Since a uniform bucket occupation is difficult to guarantee, a bucket consists of a header plus a dynamic list of fixed-size blocks, whose capacity is sufficient to accommodate any object from the processed file. The header is characterized by a set of pivots and contains the distances from all objects stored in the bucket to these pivots. Furthermore, objects are sorted according to their distances to the first pivot p1 to form a non-decreasing sequence. In order to cope with dynamic data, a block overflow is solved by splitting.
Given a query object q and a search radius r, a bucket evaluation proceeds in the following steps:

– ignore all blocks containing only objects whose distances to p1 lie outside the interval [d(q, p1) − r, d(q, p1) + r];
– for each remaining block, if L∞(F(oi), F(q)) > r for every object oi in the block, ignore the block;
– access the remaining blocks and, for every oi with L∞(F(oi), F(q)) ≤ r, compute d(q, oi); if d(q, oi) ≤ r, then oi qualifies.
In this way, the number of accessed blocks and necessary distance computations are
minimized.
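The bucket evaluation above can be sketched as follows, assuming a hypothetical data layout of our own: each block stores (object, feature) pairs with feature[i] = d(object, p_i), and blocks are ordered by distance to the first pivot p1.

```python
def search_bucket(query, radius, blocks, pivots, dist):
    """Range search within one bucket, following the three filtering steps."""
    q_feat = [dist(query, p) for p in pivots]
    result = []
    lo, hi = q_feat[0] - radius, q_feat[0] + radius
    for block in blocks:
        # Step 1: skip blocks whose d(., p1) range misses
        # [d(q, p1) - r, d(q, p1) + r] (blocks are sorted on d(., p1)).
        if block[-1][1][0] < lo or block[0][1][0] > hi:
            continue
        # Steps 2 and 3 merged: the block contributes nothing when the
        # L_inf pivot filter discards every object; the real (expensive)
        # distance is computed only for objects passing the filter.
        for obj, feat in block:
            if max(abs(f - g) for f, g in zip(feat, q_feat)) <= radius:
                if dist(query, obj) <= radius:
                    result.append(obj)
    return result
```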
3.5. Properties of D-Index
On a qualitative level, the most important properties of D-Index can be summarized as
follows:
– An object is typically inserted at the cost of one block access, because objects in buckets are sorted. Since blocks are of fixed length, a block can overflow, and a split of an overloaded block costs another block access. The number of distance computations between the reference (pivot) objects and an inserted object varies from m_1 to ∑_{i=1}^{h} m_i. For an object inserted on level j, ∑_{i=1}^{j} m_i distance computations are needed.
– For all response sets such that the distance to the most dissimilar object does not exceed
ρ, the number of bucket accesses is maximally h + 1. The smaller the search radius, the
more efficient the search can be.
– For r = 0, i.e. the exact match, a successful search is typically solved with one block access; an unsuccessful search usually does not require any access—very skewed distance distributions may result in slightly more accesses.
– For search radii greater than ρ, the number of bucket accesses is higher than h + 1, and
for very large search radii, the query evaluation can eventually access all buckets.
– The number of accessed blocks in a bucket depends on the search radius and the distance
distribution with respect to the first pivot.
– The number of distance computations between the query object and pivots is determined by the highest level of an accessed bucket. Provided the level is j, the number of distance computations is ∑_{i=1}^{j} m_i, with the upper bound for any kind of query being ∑_{i=1}^{h} m_i.
– The number of distance computations between the query object and data objects stored in
accessed buckets depends on the search radii and the number of pre-computed distances
to pivots.
From the quantitative point of view, we investigate the performance of D-Index in the
next section.
4. Performance evaluation and comparison
We have implemented the D-Index and conducted numerous experiments to verify its properties on the following three metric data sets:

VEC 45-dimensional vectors of image color features compared by the quadratic distance measure respecting correlations between individual colors.
URL sets of URL addresses visited by users during work sessions with the Masaryk University information system. The distance measure used was based on set similarity, defined as the fraction of the cardinalities of intersection and union (the Jaccard coefficient).
STR sentences of a Czech national corpus compared by the edit distance measure, which counts the minimum number of insertions, deletions, or substitutions needed to transform one string into the other.
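For concreteness, the URL and STR distance functions can be written down directly (the quadratic form distance for VEC additionally needs the color correlation matrix, which is not reproduced here). This is a sketch under our reading of the measures: in particular, we read the Jaccard-based measure as 1 minus the coefficient so that it behaves as a distance.

```python
def jaccard_distance(a, b):
    """URL distance: 1 minus the Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def edit_distance(s, t):
    """STR distance (Levenshtein): minimum number of insertions, deletions,
    or substitutions needed to transform s into t."""
    prev = list(range(len(t) + 1))               # row for the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```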
We have chosen these data sets not only to demonstrate the wide range of applicability of the D-Index, but also to show its performance over data with significantly different, realistic distance distributions. For illustration, see figure 5 for the distance densities of all our data sets. Notice the practically normal distribution of VEC, the very discrete distribution of URL, and the skewed distribution of STR.
Figure 5. Distance densities for VEC, URL, and STR.
In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations and block reads. We deliberately ignore the L∞ distance computations used by the D-Index for the pivot-based filtering because, according to our tests, the costs to compute such distances are several orders of magnitude smaller than the costs needed to compute any of the distance functions of our experiments. All presented cost values are averages obtained by executing queries for 50 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors.
4.1. D-Index performance evaluation
In order to test the basic properties of the D-Index, we have considered about 11,000 objects for each of our data sets VEC, URL, and STR. Notice that the number of objects used does not affect the significance of the results, since the cost measures (i.e., the number of block reads and the number of distance computations) are not influenced by the system cache. Moreover, as shown in the following, performance scalability is linear with data set size. First, we have tested the efficiency of the incremental (INC), the random (RAN), and
the middle (MID) strategies to select good reference objects. For each set of the selected
reference objects and specific data set, we have built D-Index organizations of pre-selected
structures. In Table 1, we summarize characteristics of resulting D-Index organizations,
separately for individual data sets.
We measured the search efficiency for range queries by changing the search radius to retrieve at most 20% of the data, that is, about 2,000 objects. The results, presented in graphs
Table 1. D-Index organization parameters.

Set   h   No. of buckets   No. of blocks   Block size   ρ
VEC   9   21               2600            1 KB         1200
URL   9   21               228             13 KB        0.225
STR   9   21               289             6 KB         14
Figure 6. Search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 7. Search efficiency in the number of block reads for VEC, URL, and STR.
for all three data sets, are available in figure 6 for the distance computations and in figure 7
for the block reads.
The general conclusion is that the incremental method proved to be the most suitable for
choosing reference objects for the D-Index. Except for very few search cases, the method
was systematically better and can certainly be recommended—for the sentences, the middle
technique was nearly as good as the incremental, and for the vectors, the differences in
performance of all three tested methods were the least evident. An interesting observation
is that for vectors with distribution close to uniform, even the randomly selected reference
points worked quite well. However, when the distance distribution was significantly different
from the uniform, the performance with the randomly selected reference points was poor.
Notice also some differences between the execution costs quantified in block reads and in distance computations. The general trends are similar, and the incremental method is the best. However, the different cost quantities are not equally important for all the data sets. For example, the average time to compute a quadratic form distance between two vectors was well under 90 µs, while the average time to compute a distance between two sentences was about 4.5 ms. This implies that the reduction of block reads was quite important in the total search time for vectors, while for the sentences it was not important at all.
26
DOHNAL ET AL.
Figure 8. Nearest neighbor search on sentences using different D-Index structures.

Figure 9. Range search on sentences using different D-Index structures.
The objective of the second group of tests was to investigate the relationship between the D-Index structure and the search efficiency. We have tested this hypothesis on the data set STR, using ρ = 1, 14, 25, 60. The results of experiments for the nearest neighbor search are reported in figure 8, while the performance for range queries is summarized in figure 9. It is quite obvious that a very small ρ can only be suitable when a range search with a small radius is used. However, this was not typical for our data, where the distance to a nearest neighbor was rather high—see in figure 8 the relatively high costs to get the nearest neighbor. This implies that higher values of ρ, such as 14, are preferable. However, there are limits to increasing this value, since the structure with ρ = 25 was only exceptionally better than the one with ρ = 14, and the performance with ρ = 60 was rather poor. The experiments demonstrate that a proper choice of the structure can significantly influence the performance. The selection of optimized parameters is still an open research issue that we plan to investigate in the near future.
In the last group of tests, we analyzed the influence of the block size on system performance. We have experimentally studied the problem on the VEC data set, both for the nearest neighbor and the range search. The results are summarized in figure 10.
In order to obtain an objective comparison, we express the cost as relative block reads, i.e., the percentage of blocks read to execute a query.
Figure 10. Search efficiency versus the block size.
We can conclude from the experiments that searching with smaller blocks requires reading a smaller fraction of the entire data structure. Notice that the number of distance computations for different block sizes is practically constant and depends only on the query. However, a small block size implies a large number of blocks; this means that when the access cost to a block is significant (e.g., if blocks are allocated randomly), large block sizes may become preferable. However, when the blocks are stored in a continuous storage space, small blocks can be preferable, because an optimized strategy for reading a set of disk pages, such as that proposed in [10], can be applied. For a static file, we can easily achieve such a storage allocation of the D-Index through a simple reorganization.
4.2. D-Index performance comparison
We have also compared the performance of D-Index with other index structures under the
same workload. In particular, we considered the M-tree1 [5] and the sequential organization,
SEQ, because, according to [4], these are the only types of index structures for metric data
that use disk memories to store objects. In order to maximize objectivity of comparison,
all three types of index structures are implemented using the GIST package [9], and the
actual performance is measured in terms of the number of distance comparisons and I/O
operations—the basic access unit of disk memory for each data set was the same for all
index structures.
The main objective of these experiments was to compare the similarity search efficiency
of the D-Index with the other organizations. We have also considered the space efficiency,
costs to build an index, and the scalability to process growing data collections.
4.2.1. Search efficiency for different data sets. We have built the D-Index, M-tree, and SEQ for all the data sets VEC, URL, and STR. The M-tree was built by using the bulk load package and the Min-Max splitting strategy. This approach was found in [5] to be the most efficient way to construct the tree. We have measured average performance over 50 different query objects considering numerous similarity range and nearest neighbor queries. The
Figure 11. Comparison of the range search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 12. Comparison of the range search efficiency in the number of block reads for VEC, URL, and STR.

Figure 13. Comparison of the nearest neighbor search efficiency in the number of distance computations for VEC, URL, and STR.
results for the range search are shown in figures 11 and 12, while the performance for the
nearest neighbor search is presented in figures 13 and 14.
For all tested queries, i.e. retrieving subsets up to 20% of the database, the M-tree and
the D-Index always needed less distance computations than the sequential scan. However,
this was not true for the block accesses, where even the sequential scan was typically
Figure 14. Comparison of the nearest neighbor search efficiency in the number of block reads for VEC, URL, and STR.
more efficient than the M-tree. In general, the number of block reads of the M-tree was significantly higher than for the D-Index. Only for the URL set, when retrieving large subsets, did the number of block reads of the D-Index exceed that of the SEQ. In this situation, the M-tree accessed nearly three times more blocks than the D-Index or SEQ. On the other hand, for this search case the M-tree and the D-Index required many fewer distance computations than the SEQ.
Figure 12 demonstrates another interesting observation: to run the exact match query, i.e.
range search with rq = 0, the D-Index only needs to access one block. As a comparison, see
(in the same figure) the number of block reads for the M-tree: they are one half of the SEQ
for vectors, equal to the SEQ for the URL sets, and even three times more than the SEQ for
the sentences. Notice that the exact match search is important when a specific object is to
be eliminated—the location of the deleted object forms the main cost. In this respect, the
D-Index is able to manage deletions more efficiently than the M-tree. We did not verify this
fact by explicit experiments, because the available version of the M-tree does not support
deletions.
Concerning the block reads, the D-Index appears to be significantly better than the M-tree, while the comparison on the level of distance computations is not so clear. The D-Index certainly performed much better for the sentences and also for the range search over vectors. However, the nearest neighbor search on vectors needed practically the same number of distance computations. For the URL sets, the M-tree was on average only slightly better than the D-Index.
4.2.2. Search, space, and insertion scalability. To measure the scalability of the D-Index,
M-tree, and SEQ, we have used the 45-dimensional color feature histograms. Particularly, we
have considered collections of vectors ranging from 100,000 to 600,000 elements. For these
experiments, the structure of D-Index was defined by 37 reference objects and 74 buckets,
where only the number of blocks in buckets was increasing to deal with growing files. The
results for several typical nearest neighbor queries are reported in figure 15. The execution
costs for range queries are reported in figure 16. Except for the SEQ organization, the individual curves are labelled with a number, representing either the number of nearest neighbors
or the search radius, and a letter, where D stands for the D-Index and M for the M-tree.
Figure 15. Nearest neighbor search scalability.

Figure 16. Range search scalability.
We can observe that on the level of distance computations the D-Index is usually slightly better than the M-tree, but the differences are not significant—both the D-Index and the M-tree can save a considerable number of distance computations compared to the SEQ. To solve a query, the M-tree needs significantly more block reads than the D-Index, and for some queries, see 2000M in figure 16, this number is even higher than for the SEQ. We have also investigated the performance of range queries with zero radius, i.e. the exact match. Independently of the data set size, the D-Index required one block access and 18 distance comparisons. This was in sharp contrast with the M-tree, which needed about 6,000 block reads and 20,000 distance computations to find the exact match in the set of 600,000 vectors.
The search scale-up of the D-Index was strictly linear. In this respect, the M-tree was even slightly better, because the execution costs for processing a two times larger file were not two times higher. This sublinear behavior should be attributed to the fact that the M-tree incrementally reorganizes its structure by splitting blocks and, in this way, improves the clustering of data. On the other hand, the D-Index was using a constant bucket structure, where only the number of blocks was changing. This observation poses another research challenge, specifically, to build a D-Index structure with a dynamic number of buckets.
The disk space needed to store data was also increasing linearly in all our organizations. The D-Index needed about 20% more space than the SEQ. In contrast, the M-tree occupied two times more space than the SEQ. In order to insert one object, the D-Index evaluated the distance function 18 times on average. That means that distances to (practically) one half of the 37 reference objects were computed. For this operation, the D-Index performed 1 block read and, due to the dynamic bucket implementation using splits, a bit more than one block write. The insertion into the M-tree was more expensive and required on average 60 distance computations, 12 block reads, and 2 block writes.
5. Conclusions
Recently, metric spaces have become an important paradigm for similarity search, and many index structures supporting the execution of similarity queries have been proposed. However, most of the existing structures are limited to operating in main memory, so they do not scale up to high volumes of data. We have concentrated on the case where indexed data are stored on disk, and we have proposed a new index structure for similarity range and nearest neighbor queries. Contrary to other index structures, such as the M-tree, the D-Index stores and deletes any object at the cost of one block access, so it is particularly suitable for dynamic data environments. Compared to the M-tree, it typically needs fewer distance computations and far fewer disk reads to execute a query. The D-Index is also economical in its space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. As our experiments confirm, it scales up well to large data volumes.
Our future research will concentrate on developing methodologies that support the optimal design of D-Index structures for specific applications. The dynamic bucket management is another research challenge. We will also study parallel implementations, which the multilevel hashing structure of the D-Index inherently offers. Finally, we will carefully study two very recent proposals of index structures based on the selection of reference objects [6, 13] and compare the performance of the D-Index with these approaches.
Acknowledgments
We would like to thank the anonymous referees for their comments on the earlier version
of the paper.
Note
1. The software is available at http://www-db.deis.unibo.it/research/Mtree/
References
1. T. Bozkaya and M. Ozsoyoglu, "Indexing large metric spaces for similarity search queries," ACM TODS, Vol. 24, No. 3, pp. 361–404, 1999.
2. B. Bustos, G. Navarro, and E. Chavez, "Pivot selection techniques for proximity searching in metric spaces," in Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 2001, pp. 33–40.
3. E. Chavez, J. Marroquin, and G. Navarro, “Fixed queries array: A fast and economical data structure for proximity searching,” Multimedia Tools and Applications, Vol. 14, No. 2, pp. 113–135,
2001.
4. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, “Proximity searching in metric spaces,” ACM
Computing Surveys. Vol. 33, No. 3, pp. 273–321, 2001.
5. P. Ciaccia, M. Patella, and P. Zezula, “M-tree: An efficient access method for similarity search in metric
spaces,” in Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997, pp. 426–435.
6. R.F.S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos, “Similarity search without tears: The OMNI-family
of all-purpose access methods,” in Proceedings of the 17th ICDE Conference, Heidelberg, Germany, 2001,
pp. 623–630.
7. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula, “Separable splits in metric data sets,” in Proceedings of
9-th Italian Symposium on Advanced Database Systems, Venice, Italy, June 2001, pp. 45–62, LCM Selecta
Group—Milano.
8. C. Gennaro, P. Savino, and P. Zezula, “Similarity search in metric databases through Hashing,” in Proceedings
of ACM Multimedia 2001 Workshops, Oct. 2001, Ottawa, Canada, pp. 1–5.
9. J.M. Hellerstein, J.F. Naughton, and A. Pfeffer, “Generalized search trees for database systems,” in Proceedings
of the 21st VLDB Conference, 1995, pp. 562–573.
10. B. Seeger, P. Larson, and R. McFadyen, "Reading a set of disk pages," in Proceedings of the 19th VLDB Conference, 1993, pp. 592–603.
11. P.N. Yianilos, "Data structures and algorithms for nearest neighbor search in general metric spaces," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993, pp. 311–321.
12. P.N. Yianilos, “Excluded middle vantage point forests for nearest neighbor search,” Tech. rep., NEC Research
Institute, 1999, Presented at Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop,
Jan. 15, 1999.
13. C. Yu, B.C. Ooi, K.L. Tan, and H.V. Jagadish, “Indexing the Distance: An efficient method to KNN processing,”
in Proceedings of the 27th VLDB Conference, Roma, Italy, 2001, pp. 421–430.
Vlastislav Dohnal received the B.Sc. degree and M.Sc. degree in Computer Science from the Masaryk University,
Czech Republic, in 1999 and 2000, respectively. Currently, he is a Ph.D. student at the Faculty of Informatics,
Masaryk University. His interests include multimedia data indexing, information retrieval, metric spaces, and
related areas.
Claudio Gennaro received the Laurea degree in Electronic Engineering from the University of Pisa in 1994 and the Ph.D. degree in Computer and Automation Engineering from Politecnico di Milano in 1999. He is now a researcher at IEI, an institute of the National Research Council (CNR) located in Pisa. His Ph.D. studies were in the field of performance evaluation of computer systems and parallel applications. His current main research interests are performance evaluation, similarity retrieval, storage structures for multimedia information retrieval, and multimedia document modelling.
Pasquale Savino graduated in Physics at the University of Pisa, Italy, in 1980. From 1983 to 1995 he worked at the Olivetti Research Labs in Pisa; since 1996 he has been a member of the research staff at CNR-IEI in Pisa, working
in the area of multimedia information systems. He has participated and coordinated several EU-funded research
projects in the multimedia area, among which MULTOS (Multimedia Office Systems), OSMOSE (Open Standard
for Multimedia Optical Storage Environments), HYTEA (HYperText Authoring), M-CUBE (Multiple Media
Multiple Communication Workstation), MIMICS (Multiparty Interactive Multimedia Conferencing Services),
HERMES (Foundations of High Performance Multimedia Information Management Systems). Currently, he is
coordinating the project ECHO (European Chronicles on Line). He has published scientific papers in many
international journals and conferences in the areas of multimedia document retrieval and information retrieval. His
current research interests are multimedia information retrieval, multimedia content addressability and indexing.
Pavel Zezula is Full Professor at the Faculty of Informatics, Masaryk University of Brno. For many years, he
has been cooperating with the IEI-CNR Pisa on numerous research projects in the areas of Multimedia Systems
and Digital Libraries. His research interests concern storage and index structures for non-traditional data types,
similarity search, and performance evaluation. Recently, he has focused on index structures and query evaluation
of large collections of XML documents.