
International Journal of Advanced Intelligence
Volume 1, Number 1, pp.89-107, November, 2009.
© AIA International Advanced Information Institute
Fast Multidimensional Nearest Neighbor Search Algorithm
Based on Ellipsoid Distance
Tadashi Uemiya
Faculty of Engineering, Tokushima University
2-1,Minami-josanjima, Tokushima 770-8506, Japan
[email protected]
Yoshihide Matsumoto
Faculty of Engineering, Tokushima University
2-1,Minami-josanjima, Tokushima 770-8506, Japan
[email protected]
Daichi Koizumi
Faculty of Engineering, Tokushima University
2-1,Minami-josanjima, Tokushima 770-8506, Japan
[email protected]
Masami Shishibori
Faculty of Engineering, Tokushima University
2-1,Minami-josanjima, Tokushima 770-8506, Japan
[email protected]
Kenji Kita
Center for Advanced Information Technology, Tokushima University
2-1,Minami-josanjima, Tokushima 770-8506, Japan
[email protected]
Received (January 2009)
Revised (August 2009)
The nearest neighbor search in high-dimensional spaces is an interesting and important
problem that is relevant for a wide variety of applications, including multimedia information retrieval, data mining, and pattern recognition. For such applications, the curse of
high dimensionality tends to be a major obstacle in the development of efficient search
methods. This paper addresses the problem of designing a new and efficient algorithm
for high-dimensional nearest neighbor search based on ellipsoid distance. The proposed
algorithm uses Cholesky decomposition to perform data conversion beforehand so that calculation with the ellipsoid distance function can be replaced by Euclidean distance calculation, and it improves efficiency by omitting unnecessary operations. Experimental
results indicate that our scheme scales well even for a very large number of dimensions.
Keywords: Nearest neighbors search; high-dimensional space; ellipsoid distance; multimedia information retrieval.
1. Introduction
High-performance computing and large-capacity storage at lower prices have led to
explosive growth in multimedia information, and the need for information retrieval
technology that can handle multimedia content is growing stronger by the day.
In recent years, content-based searches, which perform similarity searches based on feature quantities obtained from multimedia data, have become the mainstream in searches of multimedia content. Such methods typically express multiple feature quantities as multidimensional vectors and judge the degree of similarity between content samples on the basis of the distance between these vectors. For example,
in text search, weight vectors of index words can be used to express text and search
queries,1 while in image search, the image content can be represented by feature
vectors involving color histograms, texture, shape, and other features.2,3 Similarity
searches of content based on feature vectors come down to the problem of a nearest
neighbor search that seeks to find targeted vectors that are close to the vectors given
as search queries. Finding nearest neighbors in high-dimensional space is one of the
important topics of current research, not only in multimedia content retrieval, but
also in data mining, pattern recognition, and other fields of application.
The challenges in nearest neighbor searches are increasing the search speed and
improving the accuracy of the search. We have already proposed a very fast search
algorithm for exploring nearest neighbors, achieved simply by improving the basic
linear search.4 The proposed algorithm speeds up the process by eliminating unnecessary operations in the computation of the distance between vectors. By using
the maximum distance among the current candidates, it is possible to cut off unnecessary calculations midway through the distance computation. Since updating the search result candidates requires removing the candidate with the maximum distance from the list and inserting a new candidate, a priority queue is used as the data structure to carry out these operations efficiently. Data conversion using dimension sorting by variance, together with principal component analysis, is proposed as pre-processing for the early detection of unnecessary operations.
For content similarity searches based on feature vectors, defining the distance
for representing the scale of similarity is important. Euclidean distance is a typical
measure of distance, but there are some problems with this distance as follows: (1)
Euclidean distance is extremely sensitive to the scales of the feature values, and
(2) Euclidean distance is blind to correlated features. In this paper, we propose a
fast multidimensional nearest neighbor search algorithm based on ellipsoid distance.
Ellipsoid distance takes into account the correlation among features in calculating
distance. By using ellipsoid distance, the problems of scale and correlation inherent
in Euclidean distance are no longer an issue. With the algorithm proposed in this paper, efficient elimination of unnecessary arithmetic operations is achieved by converting the ellipsoid distance calculation into a Euclidean distance calculation through a spatial transformation, performed using Cholesky decomposition, that preprocesses the search target data.
Below, Chapter 2 presents an overview of nearest neighbor searches of multidimensional data and describes the problems that multidimensional indexing technology faces in high-dimensional space. Chapter 3 explains the fast nearest neighbor search algorithm that we have already proposed, after which Chapter 4 proposes a fast nearest neighbor search algorithm based on ellipsoid distance. Chapter 5 describes an experiment assessing the validity of the proposed method, and finally, Chapter 6 concludes and discusses future challenges.
2. Nearest Neighbor Search of Multidimensional Data
2.1. Nearest neighbor search of multidimensional data
A nearest neighbor search in a multidimensional space is the problem of finding the nearest vector to a given vector (query vector) q among N data vectors (candidate vectors) xi (i = 1, 2, . . . , N) placed in an n-dimensional space. There are two typical varieties of nearest neighbor search:

(i) k-nearest neighbor search (search restricted by number): the search attempts to find the k vectors closest to the given query vector q;
(ii) ε-nearest neighbor search (search restricted by range): the search attempts to find vectors within a distance ε from the given query vector q; that is, vectors xi satisfying d(q, xi) ≤ ε are found.
In a linear search wherein a given vector is compared sequentially to all vectors
in a database, the computational complexity increases in direct proportion to the
database size. Therefore, the development of multidimensional indexing techniques
for efficient nearest neighbor search has been attracting much attention recently.5
There are various algorithms for multidimensional indexing in a Euclidean space,
such as R-tree,6 R+-tree,7 R*-tree,8 SS-tree,9 SS+-tree,10 CSS+-tree,11 X-tree,12
and SR-tree,13 as well as more general indexing methods for metric spaces, for
example, the VP-tree,14 MVP-tree,15 and M-tree.16 Such indexing techniques restrict the search range by hierarchically partitioning the multidimensional search space, thereby limiting the scope of the search.
2.2. VP-tree
In the experiment described in Chapter 5, we use the VP-tree as the object of
comparison with the proposed method. The following is a brief overview of the
VP-tree.
The VP-tree is a typical multidimensional indexing method for metric spaces. It aims to shrink the space explored in the search by recursively partitioning the multidimensional space based on the distances between data points. The VP-tree uses a reference point known as a vantage point, and it has the special characteristic of not allowing overlapping regions to arise in the partitioned space, since hyperspheres are used to partition the space in a top-down manner. By contrast, the M-tree, which partitions space in a bottom-up manner, has the drawback that the partitioned regions overlap in many places, with the result that search efficiency declines.
VP-tree index construction can be summarized as follows. A vantage point (hereinafter referred to as vp) is selected for a data set S consisting of N data points by means of the randomized algorithm described below:

(i) select a temporary vp randomly from the data set;
(ii) calculate the distances from the temporary vp to the rest of the N − 1 objects;
(iii) calculate the median and variance of these distances;
(iv) the point with the maximum variance, obtained by repeating (i)–(iii) above, is designated vp.
Let µ be the median distance from the vp chosen for the root node to all data in the data set S. With d(p, q) denoting the distance between points p and q, the data set S is partitioned into S1 and S2 as follows:

S1 = {s ∈ S | d(s, vp) < µ}
S2 = {s ∈ S | d(s, vp) ≥ µ}    (1)
In like manner, this partitioning operation is recursively applied to S1 and S2 to
create the index. The VP-tree index is represented by a tree structure, and subsets
such as the above-mentioned S1 and S2 each correspond to one node of the tree.
In addition, each leaf node stores a number of data points. The search starts from the root node and follows the nodes that conform to the search scope; it accesses the data stored in the leaf nodes it finally arrives at point by point, calculates the distances, and determines whether each point conforms to the search scope.
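As an illustration of the construction procedure just described, the following Python sketch builds a VP-tree under our simplified reading of the text (the names, the number of trials, and the leaf size are our own assumptions, not the authors' implementation):

```python
import random
import statistics

class VPNode:
    def __init__(self, vp, mu, inside, outside):
        self.vp = vp            # vantage point
        self.mu = mu            # median distance from vp
        self.inside = inside    # subtree for S1 = {s : d(s, vp) < mu}
        self.outside = outside  # subtree for S2 = {s : d(s, vp) >= mu}

def select_vp(points, dist, trials=10):
    # Steps (i)-(iii) repeated: sample a temporary vp, measure its
    # distances to the remaining points, and keep the candidate whose
    # distance distribution has the largest variance.
    best, best_var = None, -1.0
    for _ in range(trials):
        cand = random.choice(points)
        ds = [dist(cand, p) for p in points if p is not cand]
        var = statistics.pvariance(ds)
        if var > best_var:
            best, best_var = cand, var
    return best

def build_vptree(points, dist, leaf_size=16):
    if len(points) <= leaf_size:
        return points  # a leaf node stores a number of data points
    vp = select_vp(points, dist)
    rest = [p for p in points if p is not vp]
    mu = statistics.median(dist(vp, p) for p in rest)
    s1 = [p for p in rest if dist(vp, p) < mu]   # partition of Eq. (1)
    s2 = [p for p in rest if dist(vp, p) >= mu]
    return VPNode(vp, mu,
                  build_vptree(s1, dist, leaf_size),
                  build_vptree(s2, dist, leaf_size))
```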
2.3. Problems with multidimensional indexing technology in high
dimensions
Content searches of images and other multimedia content employ multidimensional
feature vectors that may exceed 100 dimensions. Phenomena of a kind that cannot even be imagined in two-dimensional or three-dimensional space are known to occur in
such high-dimensional space. Because the degree of spatial freedom is extremely high
in higher-dimensional space, solving various problems in computational geometry
and multivariate analysis involves an enormous amount of calculation and is hence
notoriously difficult. These difficulties are collectively referred to as the “curse of
dimensionality.”
In nearest neighbor searches in high-dimensional space, a phenomenon occurs
whereby the search becomes more and more difficult as the dimensionality becomes
higher. For example, when points are uniformly distributed in n-dimensional space,
the ratio of the distance of the k-th nearest and the (k + 1)-th nearest point to a
given point can be approximated by the following formula:17
E{d(k+1)NN} / E{dkNN} ≈ 1 + 1/(kn)    (2)
As can be seen from the above, as n becomes larger, the ratio between the distance to the k-th nearest point and that to the (k + 1)-th nearest point asymptotically approaches 1; for example, with k = 1 and n = 100, the expected ratio is only about 1.01.
Moreover, when the points are uniformly distributed, the ratio of the distance to the
nearest point to the distance to the most distant point asymptotically approaches 1
as the dimensionality becomes higher. Therefore, methods for dividing the space hierarchically entail problems in that the difference due to distance is small, making it
impossible to limit the area explored, and an amount of calculation that approaches
that of a linear search is required.
3. Fast Nearest Neighbor Search Algorithm
3.1. Basic idea
The fast nearest neighbor search algorithm that we propose aims to speed up the
process by skipping unnecessary operations in the calculation of the distance between vectors. Let us briefly explain the main idea of the proposed algorithm. We
assume that a search query vector q and search target vectors x1, x2, x3, and x4 have been given as follows:

q = [1, 2, 1]^T, x1 = [2, 2, 2]^T, x2 = [2, 3, 2]^T, x3 = [4, 1, 2]^T, x4 = [2, 5, 2]^T    (3)
Here, we consider the problem of searching for the top 2 search target vectors
closest to the search query vector q. Computation of the distance between search
query vector q and the first two search target vectors x1 and x2 is as follows (the
square of the distance value is used to simplify the explanation).
d²(q, x1) = (1 − 2)² + (2 − 2)² + (1 − 2)² = 2
d²(q, x2) = (1 − 2)² + (2 − 3)² + (1 − 2)² = 3    (4)
For now, x1 and x2 are considered as candidates for the search results. The
maximum distance among the two candidates is 3. Calculation of the distance to the third search target vector x3 is as follows.
d²(q, x3) = (1 − 4)² + (2 − 1)² + (1 − 2)²    (5)
After calculating the first term on the right-hand side, the current maximum distance among the search result candidates, 3, is already exceeded, so it is clear, even without calculating the second and third terms, that this search target vector cannot be among the search results. Similarly, the calculation of the distance to the fourth search target vector x4 is as follows.
d²(q, x4) = (1 − 2)² + (2 − 5)² + (1 − 2)²    (6)
The sum up through the second term on the right-hand side exceeds the current maximum distance among the search result candidates, 3, making calculation of the third term unnecessary. Thus, by eliminating such operations in the calculation of the distance between the search query vector and the search target vectors, it becomes possible to perform a nearest neighbor search efficiently.
3.2. Early detection of unnecessary operations by conversion of
data
If it is possible to detect unnecessary operations early on in calculation of the cumulative distances for each dimension of a vector, then those operations can be omitted,
making it possible to carry out the search that much more rapidly. The following two methods are proposed as pre-processing for the early detection of unnecessary operations; it has been shown that the approach incorporating data conversion by principal component analysis remains fast without degradation in performance, even in high dimensions.4
3.2.1. Dimension sorting by variance
The variance of the elements is found for every dimension of the search target vectors, and the elements are permuted so that the dimension with the largest variance comes first. As a result, the calculation of the cumulative distance proceeds from dimensions with large variance toward those with smaller variance; hence one can expect the cumulative distance to increase quickly, providing for early detection of unnecessary operations.
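A minimal sketch of this permutation (our illustration, assuming NumPy is available):

```python
import numpy as np

def sort_dims_by_variance(data):
    # Variance of every dimension over the search target vectors;
    # argsort of the negated variances puts the largest variance first.
    order = np.argsort(-np.var(data, axis=0))
    return data[:, order], order  # apply the same `order` to queries
```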
3.2.2. Data conversion via principal component analysis
Dimension sorting by variance involves only a permutation of the vector elements; however, we may consider applying a linear transformation as a more efficient means. An orthogonal transformation must be used to preserve the distances between vectors. There are various orthogonal transformations; in particular, in principal component analysis (the KL transform), a basis for the best representation of
multidimensional vector fluctuations can be found. The eigenvalue decomposition of the covariance matrix is performed, and the eigenvectors are taken as the new basis; the variance along each eigenvector is given by the corresponding eigenvalue. The eigenvector with the largest eigenvalue is called the first principal component, after which comes the second principal component, and so on. Early detection of unnecessary operations can be made more effective by transforming the data beforehand so that the coordinates are arranged in the order of the principal components.
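A minimal sketch of this preprocessing (our illustration; the authors' implementation is not shown in the paper):

```python
import numpy as np

def pca_transform(data):
    # Eigendecomposition of the covariance matrix; eigh returns the
    # eigenvalues in ascending order, so we reverse to put the first
    # principal component (largest eigenvalue) first.
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    # The eigenvector basis is orthogonal (and centering is a mere
    # translation), so Euclidean distances between vectors are preserved.
    return centered @ eigvecs[:, order]
```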
4. Fast Nearest Neighbor Search Algorithm Based on Ellipsoid
Distance
4.1. Ellipsoid distance
Unlike Euclidean distance, ellipsoid distance can express correlations and weights between dimensions, so a high degree of freedom in defining the distance function can be expected, along with improved search accuracy.18 Under ordinary Euclidean distance, the surface equidistant from a given point is a d-dimensional sphere. Under weighted Euclidean distance, the equidistant surface is a d-dimensional ellipsoid whose principal axes are parallel to the coordinate axes. Under ellipsoid distance, on the other hand, the principal axes of the equidistant ellipsoid can assume any direction. For this reason, ellipsoid distance can be considered a generalization of both Euclidean distance and weighted Euclidean distance.
When search query vectors in n-dimensional space are given by q = [q1 , . . . , qn ]T
and arbitrary search target vectors included in the data set are given by x =
[x1 , . . . , xn ]T , the ellipsoid distance will be represented by the following formula
(Note: T represents the transposition of a vector or matrix):
D²(x,q) = (x − q)A(x − q)^T    (7)
Here, A = [aij] signifies an n × n positive definite symmetric matrix, called the correlation matrix. By expanding the formula for the ellipsoid distance, we obtain the following formula:
D²(x,q) = Σ_{i=1}^{n} Σ_{j=1}^{n} aij (xi − qi)(xj − qj)    (8)
Because A is positive definite, the right-hand side of the ellipsoid distance formula is a nonnegative squared value, and D(x,q) therefore corresponds to a distance.
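For concreteness, Eq. (7) can be evaluated directly as in the following sketch (our illustration; the function name is our own):

```python
import numpy as np

def ellipsoid_dist2(x, q, A):
    # D^2(x, q) = (x - q) A (x - q)^T for a positive definite symmetric A.
    diff = x - q
    return diff @ A @ diff

# A = np.eye(n) recovers the squared Euclidean distance; a diagonal A
# gives the weighted Euclidean distance; a general positive definite A
# lets the principal axes of the equidistant ellipsoid point anywhere.
```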
4.2. Application of high-speed nearest neighbor search algorithm to
ellipsoid distance
Because a high degree of freedom in determining the search function can be expected, together with improved search accuracy, numerous studies have used ellipsoid distance in similarity searches of content, such as image searches.19,20 Research
has also been carried out on how to improve the efficiency of similarity searches
based on ellipsoids. Sakurai et al. proposed a method known as SST (Spatial Transformation Technique), based on the spatial transformation method, as a technique
for efficiently supporting similarity searches based on ellipsoid distance.21 SST is a
method that converts a bounding rectangle positioned in the original space, where the distance from the search query must be calculated with the ellipsoid distance, into an object in Euclidean space. In addition, Ankerst et al. proposed an efficient similarity search method for ellipsoid distance that uses a spatial transformation based on principal component analysis (the KL transform).22
In the following, we propose a method for applying the fast nearest neighbor
search algorithm described in the previous chapter to a search based on ellipsoid
distance. The basic idea is similar to Sakurai et al.’s method using SST and Ankerst
et al.’s method, and spatial transformation is used to replace calculation of the ellipsoid distance function with calculation of the Euclidean distance function. Cholesky
decomposition of the matrix is used when carrying out the spatial transformation.23
By using Cholesky decomposition to perform spatial transformation on all search
target vectors in the database beforehand, it becomes possible to carry out a similarity search efficiently, without major alteration of the fast nearest neighbor search
algorithm described in the previous chapter.
Cholesky decomposition is a special case of the LU decomposition of a square matrix, applied to a positive definite symmetric matrix; it refers to the following decomposition of a positive definite symmetric matrix A = [aij]:

A = LL^T    (9)

where L = [lij] is an n × n lower triangular matrix:

    | l11   0    ...   0   |
    | l21   l22  ...   0   |
    |  :     :    :    :   |
    | ln1   ln2  ...   lnn |
Here, if we seek the li,j satisfying the formula above, we obtain the following:

Σ_{k=1}^{min(i,j)} li,k lj,k = ai,j    (10)

Accordingly, the off-diagonal elements of L (i > j) satisfy

Σ_{k=1}^{j} li,k lj,k = ai,j    (11)

and therefore

li,j = ( ai,j − Σ_{k=1}^{j−1} li,k lj,k ) / lj,j ,    j = 1, 2, . . . , i − 1    (12)
is obtained. Next, the diagonal elements satisfy

Σ_{k=1}^{i} li,k² = ai,i    (13)

hence

li,i = √( ai,i − Σ_{k=1}^{i−1} li,k² )    (14)
can be calculated. Because A is positive definite, the quantity under the radical in the above equation is always positive, so li,i is a real number; a minimal implementation sketch of these recurrences is given below.
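The sketch uses 0-based indices rather than the 1-based indices of the text; in practice, numpy.linalg.cholesky computes the same factorization:

```python
import math

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # off-diagonal elements, Eq. (12)
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
        s = sum(L[i][k] ** 2 for k in range(i))
        L[i][i] = math.sqrt(A[i][i] - s)  # diagonal elements, Eq. (14)
    return L
```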
If we use Cholesky decomposition on the ellipsoid distance in an n-dimensional space S, the ellipsoid distance can be expressed as shown below (L is a lower triangular matrix whose diagonal elements are all positive):
D²(x,q) = (x − q)LL^T(x − q)^T    (15)

If we consider the point x′ = (x − q)L in a Euclidean space S′, we find that the Euclidean distance between the origin O and the point x′ in S′ is equal to the ellipsoid distance between x and q in the original space S. Because the correlation between features is reflected in the matrix L, it is possible, by performing data conversion on all search target vectors beforehand using L, to replace calculation based on the ellipsoid distance function with calculation based on the Euclidean distance. As a result, unnecessary operations in the calculation of the distance between vectors can be eliminated efficiently, and a fast nearest neighbor search algorithm based on ellipsoid distance is achieved.
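The preprocessing described above amounts to the following sketch (our illustration; the function names are our own):

```python
import numpy as np

def transform_dataset(X, A):
    # Factor the correlation matrix A = L L^T and map every search
    # target vector y (a row of X) to y' = y L.  For row vectors,
    # ||x' - q'||^2 = (x - q) L L^T (x - q)^T = D^2(x, q) by Eq. (15),
    # so a plain Euclidean search on the transformed data realizes the
    # ellipsoid-distance search.
    L = np.linalg.cholesky(A)  # lower triangular, positive diagonal
    return X @ L

# The query is transformed the same way (q' = q L) at search time, and
# the fast algorithm of Chapter 3 is then applied without modification.
```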
5. Evaluation Experiment
In order to evaluate the effectiveness of the nearest neighbor search algorithm based
on ellipsoid distance, we conducted an experiment using actual image data and
random data. A search specifying the number of items was used in the experiment.
Also performed in this experiment was an evaluation using Mahalanobis distance,
which is a kind of ellipsoid distance. In other words, we obtained the correlation matrix of the ellipsoid distance by calculating the covariance matrix Σ from the data set. The distance used in this experiment is defined as follows:
D²(x,q) = (x − q) Σ^{−1} (x − q)^T

Σij = (1/n) Σ_k (xik − ui)(xjk − uj)    (16)

where the sum runs over the data set and ui denotes the mean of the i-th feature.
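Under our reading of Eq. (16), the correlation matrix for this experiment can be built as in the following sketch (illustrative only):

```python
import numpy as np

def mahalanobis_matrix(X):
    # Covariance matrix estimated from the data set; np.cov normalizes
    # by N - 1 rather than n, a constant factor that does not change
    # the nearest neighbor ranking.
    cov = np.cov(X, rowvar=False)
    return np.linalg.inv(cov)  # A = Sigma^{-1}, the Mahalanobis case
```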
5.1. Experimental setup
5.1.1. Image data
In the experiments on image similarity search, 51,067 color photographs from the
Corel image database were used. Among these images, 1,000 were chosen at random
as query images. The four types of feature vector data used were created from these
images and had different numbers of dimensions, as shown below.
HSI-48 (48-dimensional data). HSI features were found for the 256-grade hue,
saturation, and intensity; every feature was compressed to 16 dimensions (total of
48 dimensions).
HSI-192 (192-dimensional data). HSI features were found for the 256-grade hue,
saturation, and intensity; every feature was compressed to 64 dimensions (total of
192 dimensions).
HSI-384 (384-dimensional data). HSI features were found for the 256-grade hue, saturation, and intensity; every feature was compressed to 128 dimensions (total of 384 dimensions).

HSI-432 (432-dimensional data). Images were partitioned into 9 (3 × 3) equal fragments; HSI features were found for each fragment, and every feature was compressed to 48 dimensions (total of 432 dimensions).
5.1.2. Random data
We used a uniform distribution data set with 51,067 items, created using a random
function with [0,1] as its range. From this, 1,000 items of data were randomly
extracted and used as search query data. Four types of random data having different
numbers of dimensions were prepared, as in the case of the image data.
5.1.3. Nearest neighbor search used in the evaluation experiment
The experiment was performed using the following three kinds of techniques.
(i) The proposed method (Fast-Ellipsoid)
(ii) The conventional method (VP-tree)
(iii) A linear search (Linear)
With respect to the proposed method (Fast-Ellipsoid) and the conventional method (VP-tree), we carried out the spatial transformation of the original data set using Cholesky decomposition when creating the index. In the experiment using the linear search
(Linear), distance calculation was performed each time in accordance with the definition of ellipsoid distance, without performing the spatial transformation operation.
5.1.4. Experimental method
The number of operations (number of distance calculations) and CPU time used per
search item were determined for a search of 1,000 items of search query data using
a PC with a 2.4 GHz Xeon CPU and 512 Kbytes of main memory, under the condition
of 100 nearest neighbor search items. For the conventional method (VP-tree) and
the proposed method (Fast-Ellipsoid), we also performed a comparison experiment,
varying the number of nearest neighbor search items from 10 to 100. In measuring the CPU time, we repeated the same experiment twice and took the average.
5.2. Experimental results
5.2.1. Results for image data
The number of operations for 100 nearest neighbor searches is shown in Table 1
and Figure 1. The number of operations in the proposed method (Fast-Ellipsoid)
represents a major reduction compared to the linear search (Linear) and the conventional method (VP-tree). Experimental results for the comparison of CPU time
are shown in Table 2 and Figure 2, and here the search time is shorter for the
proposed method (Fast-Ellipsoid), making the process faster than the linear search
(Linear) and the conventional method (VP-tree).
Table 1. The number of distance calculations for image data

Search Algorithm    HSI-48      HSI-192     HSI-384      HSI-432
Linear              2,451,216   9,804,864   19,609,728   22,060,944
VP-tree             2,194,731   8,962,259   18,192,135   21,905,091
Fast-Ellipsoid      1,179,581   5,533,128   11,445,424   13,383,128
Results obtained for the number of operations in the case where the number
of search items was varied are shown in Figure 3. Regardless of the number of
searches, and no matter how many dimensions there were to the data, the number
of operations was found to be smaller in the proposed method (Fast-Ellipsoid) than
in the conventional method (VP-tree). In terms of CPU time as well, as shown
in Figure 4, the proposed method (Fast-Ellipsoid) is faster than the conventional
method (VP-tree).
Fig. 1. The number of distance calculations for image data
Table 2. Comparison of CPU time for image data (sec)

Search Algorithm    HSI-48    HSI-192   HSI-384   HSI-432
Linear              0.54886   5.06031   29.0452   38.3658
VP-tree             0.0766    0.20039   0.36719   0.43706
Fast-Ellipsoid      0.0344    0.14233   0.27175   0.31604
Fig. 2. Comparison of CPU time for image data
Fig. 3. The number of distance calculations

Fig. 4. CPU time

5.2.2. Results for random data

The number of operations performed in the case of 100 nearest neighbor searches is shown in Table 3 and Figure 5. This number is less for the proposed method (Fast-Ellipsoid) than for the linear search (Linear) and the conventional method
(VP-tree), but as shown in Table 4 and Figure 6, the CPU time is slightly longer than in
the conventional method (VP-tree).
Results obtained for the number of operations when the number of searches was varied are shown in Figure 7. Here we find that the proposed method (Fast-Ellipsoid) did not achieve a reduction in the number of operations comparable to that obtained for the image data.
Table 3. The number of distance calculations for random data

Search Algorithm    HSI-48      HSI-192     HSI-384      HSI-432
Linear              2,451,216   9,804,864   19,609,728   22,060,944
VP-tree             2,451,120   9,804,473   19,608,055   22,060,208
Fast-Ellipsoid      1,582,790   7,950,796   16,941,416   19,219,825
Fig. 5. The number of distance calculations for random data
Table 4. Comparison of CPU time for random data (sec)

Search Algorithm    HSI-48    HSI-192   HSI-384    HSI-432
Linear              0.46533   4.92684   29.89793   38.44843
VP-tree             0.08329   0.21685   0.39378    0.42772
Fast-Ellipsoid      0.08612   0.35741   0.74444    0.84103
The CPU time was also slightly longer for the proposed method (Fast-Ellipsoid) than for the conventional method (VP-tree), as shown in Figure 8.
The reason for the above findings is believed to lie in the characteristics of the data. In cases where the data elements are uniformly distributed and there is hardly any variance, as with random data, data conversion cannot efficiently reduce the amount of calculation, and the resulting relative increase in comparison cost is believed to lengthen the search time.

Fig. 6. Comparison of CPU time for random data

On the other hand, in the case of image data, data conversion is believed to effectively reduce the amount of calculation because the degree of randomness is small. In cases such
as actual searches of multimedia content, there is believed to be a certain bias to
the data distribution, so the approach proposed in this paper is a sufficiently valid
one.
Fig. 7. The number of distance calculations
Fig. 8. CPU time
6. Conclusions
This paper proposed a fast multidimensional nearest neighbor search algorithm
for ellipsoid distance. Our proposed method can efficiently eliminate unnecessary
operations in distance calculations performed in nearest neighbor searches because
it uses Cholesky decomposition to carry out data conversion beforehand, making
it possible to replace calculations based on the ellipsoid distance function with
calculations based on Euclidean distance.
In the evaluation experiment using image data, it was possible to reduce the
search time by 26 to 55 percent, compared to the conventional VP-tree method.
Unfortunately, in the case of random data showing no data bias, the proposed
method was found to be slightly inferior to the VP-tree method. However, images
and other real-world data can be expected to have some bias in the data distribution,
so it should be possible to use the proposed method effectively in fields such as
multimedia content searches and pattern recognition.
As a future challenge, there is a need to devise search techniques that can produce good results even in metric spaces having a relatively high degree of spatial freedom. In this work, we used the inverse of the covariance matrix as the correlation matrix; in the future, we plan to develop a fast, highly accurate system by devising a matrix that achieves high-precision searches and then incorporating it into multimedia search and cross-media search systems.
Acknowledgments
This work was supported in part by grants from the Grant-in-Aid for Scientific
Research (B) numbered 21300036 and the Grant-in-Aid for Exploratory Research
numbered 20650143 from the Japan Society for the Promotion of Science.
References
1. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing, Communications of the ACM, Vol.18, No.11, pp.613-620, 1975.
2. M. Flickner et al. Query by image and video content: The QBIC system, IEEE Computer,
Vol.28, No.9, pp.23-32, 1995.
3. A. Pentland, R. Picard, and S. Sclaroff. Photobook:Content-based manipulation of image
databases, International Journal of Computer Vision, Vol.18, No.3, pp.233-254, 1996.
4. A. Shiroo, S. Tsuge, M. Shishibori, and K. Kita. Fast multidimensional nearest neighbor search algorithm using priority queue, Journal of Electronics, Vol.126, No.3, 2006.
5. V. Gaede and O. Gunther. Multidimensional access methods, ACM Computing Surveys, Vol.30,
No.2, pp.170-231, 1998.
6. A. Guttman. R-trees: A dynamic index structure for spatial searching, Proceedings of the ACM
SIGMOD International Conference on Management of Data, pp.47-57, 1984.
7. T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects, Proceedings of the 12th International Conference on Very Large Data
Bases, pp.507-518, 1987.
8. N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles, Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp.322-331, 1990.
9. D. A. White and R. Jain. Similarity indexing with the SS-tree, Proceedings of the 12th IEEE
International Conference on Data Engineering, pp.516-523, 1996.
10. R. Kurniawati,J. S. Jin, and J. A. Shepherd. The SS+-tree: An improved index structure
for similarity searches in a high dimensional feature space, Proceedings of the SPIE: Storage
and Retrieval for Image and Video Databases, pp.110-120, 1997.
11. J. S. Jin. Indexing and retrieving high dimensional visual features, Multimedia Information
Retrieval and Management, Feng, D.,Siu, W. C. and Zhang, H. J. (Eds.), Springer, pp. 178-203,
2003.
12. S. Berchtold, D. A. Keim and H. P. Kriegel. The X-tree: An index structure for high-dimensional data, Proceedings of the 22nd International Conference on Very Large Data Bases, pp.28-39, 1996.
13. N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest
neighbor queries, Proceedings of the ACM SIGMOD International Conference on Management
of Data, pp.369-380, 1997.
14. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric
spaces, Proceedings of the Fourth ACM-SIAM Symposium on Discrete Algorithms, pp. 311-321,
1993.
15. T. Bozkaya and Z. M. Ozsoyoglu. Indexing large metric spaces for similarity search queries,
ACM Transactions on Database Systems, Vol.24, No.3, pp.361-404, 1999.
16. P. Ciaccia, M. Patella and P. Zezula. M-tree: An efficient access method for similarity search
in metric spaces, Proceedings of the 23rd International Conference on Very Large Data Bases
(VLDB’97), pp.426-435, 1997.
17. N. Katayama and S. Satoh. Indexing technologies for similarity search, Information Processing (IPSJ Magazine), Vol.42, No.10, pp.958-964, 2001.
18. Y. S. Wu, Y. Ishikawa, and H. Kitagawa. Implementation and evaluation of similar image retrieval methods based on ellipsoid queries, Proceedings of the 11th IEICE Data Engineering Workshop (DEWS 2000), 2000.
19. J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient color histogram
indexing for quadratic form distance functions, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol.17, No.7, pp.729-736, 1995.
20. T. Seidl and H. P. Kriegel. Efficient user-adaptable similarity search in large multimedia
databases, Proceedings of the 23rd International Conference on Very Large Data Bases, pp.
506-515, 1997.
21. H. Sakurai, M. Yoshikawa, S. Uemura, and Y. Kataoka. A similarity search algorithm using spatial transformation for ellipsoid queries, IEICE Transactions, Vol.J85-D-I, No.3, pp.303-312, 2002.
22. M. Ankerst and H. P. Kriegel. A multistep approach for shape similarity search in image
databases, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.10, No.6,
pp.996-1004, 1998.
23. H. Yanai and A. Takeuchi. Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition, University of Tokyo Press, 1993.
Tadashi Uemiya
He received the B.E. degree in science and engineering
from Waseda University, Tokyo, Japan, in 1968. From
1968 to 2000, he worked for the Kawasaki Heavy Industries, Ltd., Kobe, Japan. From 2000 to 2006, he worked for
the Benesse Corporation, Okayama, Japan. Since 2006, he
has been with the University of Tokushima, Tokushima,
Japan. His current research interests include information
retrieval and multimedia processing.
Yoshihide Matsumoto
He received the B.E. degree in information engineering
from the Kochi University of Technology, Kochi, Japan, in
2002. Since 2002, he has been with the Laboatec in Japan
Co. Ltd, Okayama, Japan. Since 2006, he has been with
the University of Tokushima, Tokushima, Japan. His current research interests include information retrieval and
multimedia processing.
Daichi Koizumi
He received the B.E., M.E. and Dr. Eng. degrees in
information science and intelligent systems from the University of Tokushima, Japan, in 2002, 2004, and 2007, respectively. Since 2007, he has been with the Justsystems
Corporation, Tokushima, Japan. His current research interests include information retrieval and multimedia processing.
Masami Shishibori
He received the B.E., M.E. and Dr. Eng. degrees in
information science and intelligent systems from the University of Tokushima, Japan, in 1991, 1993 and 1997, respectively. Since 1995 he has been with the University
of Tokushima. He is currently an Associate Professor in
the Department of Information Science and Intelligent
Systems at the University of Tokushima. His research interests include multimedia information retrieval, natural
language processing and multimedia processing.
Kenji Kita
He received the B.S. degree in mathematics and the
Ph.D. degree in electrical engineering, both from Waseda
University, Tokyo, Japan, in 1981 and 1992, respectively.
From 1983 to 1987, he worked for the Oki Electric Industry Co. Ltd., Tokyo, Japan. From 1987 to 1992, he was a
researcher at ATR Interpreting Telephony Research Laboratories, Kyoto, Japan. Since 1992, he has been with the
University of Tokushima, Tokushima, Japan, where he is
currently a Professor at the Center for Advanced Information
Technology. His current research interests include multimedia information retrieval, natural language processing,
and speech recognition.