SLAC-PUB-1549 (Rev.) - Stanford University

SLAC-PUB-1549(Rev.)
STAN-CS-75-482
February 1975
Revised December 1975
Revised July 1976
AN ALGORITHMFOR FINDING BEST MATCHES
IN LOGARITHMIC EXPECTEDTIME
Jerome H. Friedman
Stanford Linear Accelerator
Stanford University,
Stanford,
Center
Ca. 94305
Jon Louis Bentley
Department of Computer Science
University
of North Carolina at Chapel Hill
Chapel Hill,
N.C. 27514
Raphael Ari Finkel
Department of Computer Science
Stanford University,
Stanford,
Ca. 94305
ABSTRACT
An algorithm
a file
and data structure
containing
N records,
for the m closest
record.
tional
The computation
required
The expected
each search is independent
to perform
evidence
suggests
is considerably
(Submitted
Work supported
Administration
each described
matches or nearest
to kNlogN.
tation
are presented
by k real
neighbors
to organize
the file
number of records
of the file
size.
faster
valued keys,
to a given query
is propor-
examined in
The expected
each search is proportional-to
that
for searching
1ogN.
except for very small files,
this
compu-
Empirical
algorithm
than other methods.
to ACM Transactions
on Mathematical
Software)
in part by U.S. Energy Research and Development
under contract E(O43)515
The Best Match or Nearest
Neighbor
The best match or nearest
that
store
records
blem is to find
according
file
several
those records
of N recor,ds (each
(possibly
for
post offices.
If
closest
a letter
town that
The solution
mation retrieval
similar
measure D, find
decisions
classified.
to this
might
involve
attributes
that
Multivariate
costly
searching
describe
is its
will
in space and time,
its
on all
cities
longitude
and latithe
Infor-
for those items most
would be cataloged
Classification
characteristics.
features
is closest
estimation
from each category
to the record
can be performed
ccntaining
the closest
to be
by calcu-
m neighbors.
Searching
technique
minimizes
values.
a post office,
a catalog
prototype
density
small,identically
procedure
information
each item in the file
Used for Associative
from any query record
this
attribute
to a query
might be chosen as the destination.
which of these prototypes
problem is the cell
into
records
problem is of use in many applications.
One straightforward
neighbor
specified
given a
valued attributes)
the m closest
is addressed to a town without
has a post office
Formally,
by k real
each city
the volume about a given point
Structures
vided
with
can be made by selecting
and finding
lating
with
The pro-
to a query record
measure.
example, might contain
to a given query item;
by numerical
most similar
or distance
Associated
to data files
valued keys or attributes.
in the file
not in the file)
A data file,
tude.
real
problem applies
of which is described
a dissimilarity
record
neighbor
to some dissimilarity
and
with
with
Problem
for solving
method.
sized
find
the best match or nearest
The k-dimensional
cells.
A spiral
search of the cells
the best matches of that
the number of records
especially
is large.
-l-
key space is di-
examined,
record.
it
when the dimensionality
Although
is extremely
of the space
_
I
Burkhard
cribe
and Keller
heuristic
strategies
gies use the triangle
consideration
while
these techniques
based on clustering
inequality
the file.
a substantial
These stratefrom
no calculations
experiments
fraction
the nearest
of the keys.
and Shustek [3] describe
Basket-t,
neighbor
of ex-
indfcate
that
of the records
to be
with
distance
this
[4]
deals with
that
the expected
two keys per record
of dissimilarity
inequality.
required
to search
due to Elias
after
which
on only two values;
diagram (a general
structure
the
and Euclidean
distance
for
case of
measure. He
One can search for best matches in worst
a file
organization
proportional
can perform
the search in worst
zation
requires
Unfortunately,
enough on one
is the Hamming distance.
(two dimensions)
to N and computation
that
on those
to the best match problem for the special
two algorithms.
time,
list
1
1
to kmk l$-*E .
keys. That is, each key takes
the plane)
O[(logN)2]
computation
for
a projection
the triangle
of an algorithm
Shamos [5] employs the Voroni
presents
forming
to a wide variety
method is proportional
applied
strategy
match closely
they satisfy
shows the optimality
binary
function
searching
involves
that
The method is applicable
They were able to show that
Rivest
It
only those records
measures and does not require
the file
problem.
another
onto one or more keys, keeping a linear
and searching
tional
des-
some of the records
Although
simulation
[2]
techniques.
to eliminate
are presented,
permit
of the records
keys,
Fukunaga and Narendra
from consideration.
Friedman,
solving
and later
searching
pected performance
eliminated
[l]
both storage
that
requires
to NlogN.
case O[logN]
time,
and computation
propor-
The other algorithm
after
a file
proportional
these methods have not yet been generalized
-2-
storage
case
organito N?
to higher
or more general
dimensionalities
Finkel
tree,
for
binary
the storage
tree
different
termed
for
storing
data
generalization
be applied
This
paper
blem of finding
partitioning
In his
to the best
introduces
best
measures
matches.
The storage
required
computation
is proportional
This
the search
quired
to search
the file
that
file
a
it
is
k-d trees
the average
for
best
obey the triangle
For large
The time
to logN,
number of record
matches
this
is quite
inequality.
to N, while
files,
the expected
is shown to be in-
in descending
so that
in
of dissimilarity
the search
spent
the pro-
effective
is proportional
for
matches with
for
is very
they
required
N.
algorithm
a wide variety
to kNlogN.
best
that
with
organization
is proportional
for
of the
structure;
data structure
in searching
size,
the quad
l-71 develops
suggests
k-d tree
so that
examinations
during
Bentley
Bentley
in the file
for
of the file
called
is a generalization
keys.
article,
method can be applied
number of record
It
keys.
on single
structure,
match problem.
and does not require
dependent
a tree
an optimized
the records
This
measures.
of the same one-dimensional
examinations (1) involved
small.
describe
of composite
the k-d tree.
could
[6]
and Bentley
dissimilarity
the tree
the expected
time
method is proportional
re-
to
1ogN.
Definition
of the k-d Tree
The k-d tree
sorting
subfile.
nonterminal
a subfile
of the simple
The k-d tree
and searching.
represents
that
is a generalization
of the records
The root
of the tree
is a binary
tree
in the file
represents
node has two sons or successor
-3-
binary
tree
in which
used for
each node
and a partitioning
the entire
nodes.
file.
These successor
of
Each
nodes
represent
the two subfiles
nodes represent
defined
mutually
exclusive
which collectively
form a partition
subsets
are called
of records
by the partitioning.
small subsets
.,
key and a partition
All
in a subfile
searching,
is defined
with
key values
value belong to the left
son.
assigning
to the two subfiles.
In k dimensions,
The keg variable
a record
number can range from 1 to k.
the tree;
the discriminator
the keys in order.
those with
key.
value be-
thus becomes a discriminator
by k keys.
that
The original
for
of that
a larger
for partitioning
for
is represented
less than or equal to the par-
node in the tree;
[7] chooses the discriminator
a record
is represented
these can serve as the discriminator
sented by a particular
These terminal
by some value
son, while
long to the right
records
space.
buckets.
by a single
tition
of the data records,
of the record
In the case of one-dimensional
records
The terminal
is,
Any one of
the subfile
repre-
the discriminating
key
k-d tree proposed by Bentley
each node on the basis
each level
for
is obtained
of its
level
by cycling
in
through
That is,
D=Lmodk+l
where D is the discriminating
is defined
to be at level
random key values
zero.
pected
The partition
in each particular
This paper deals with
value for
key number for
each subfile,
cost of searching
level
L and the root
values
are chosen to be
subfile.
choosing both the discriminator
as well
as the bucket
for nearest
what is termed an optimized
node
k-d tree.
-4-
neighbors.
size,
and partition
to minimize
the ex-
This process yields
The Search Algorithm
The k-d tree
examining
data structure
only those records
reducing
the
computation
The first
invocation
able as a global
boundaries
the root
is most easily
the subfile
node is defined
boundaries
above it
in the tree.
current
subfile,
accrual
of this
cell
is smaller
for
priority
dissimilarity
found to be closer
is updated.
sive procedure
If
by the node.
defined
the side of the partition
its
records
a cell
in
The volume
the records
so far
as a
is examined and
list,
the subfile
When control
to consider
the query record.
in the
encountered
is not terminal,
as the query record.
-5-
The
subfile.
member of this
is necessary
the
on the value
is always maintained
Whenever a record
opposite
at the nodes
by nodes deeper in the tree.
of the m closest
it
These
not only divides
for the node representing
if
keys.
defined
then all
the node under investigation
is made to determine
The domain of
of any node defines
than the most distant
is called
the geometric
on all
is terminal,
the search.
same side of the partition
test
is,
Avail-
in the two new subfiles.
to the query record
queue during
node; that
space containing
subfiles
A list
pro-
argument.
a lower or upper limit
in the ancestors
the node under investigation
and their
as this
by the partitions
each record
record-key
are examined.
of the tree
also defines
of these limits
greatly
is the node under investigation.
At each node, the partition
key for
thereby
as a recursive
to be plus and minus infinity
but it
the multidimensional
described
represented
are determined
the discriminator
mechanism for
the best matches.
is the domain of that
delimiting
geometric
bucket
to find
passes the root
array
an efficient
to the query record,
The argument to the procedure
cedure.
If
closest
required
The search algorithm
of
provides
It
the list
the recuron the
returns,
the records
is necessary
a
on
to
consider
that
subfile
records
overlap
only if
the ball
to the dissimilarity
is referred
ball
test
can be among the m closest
sidered
and the procedure
subfile.
determine
if
it
is necessary
tire
If
is entirely
so, the current
file
list
and ball-within-bounds
a detailed
notation.
The Optimized
k-d Tree
examined with
are the discriminating
The solution
distribution
no knowledge of this
seek a procedure
that
only uses information
will
will,
contained
all
deter-
for the en-
The bounds-overlap-ball
Appendix 2
one.
-6-
to be adjusted
value at each non-terminal
bucket.
in general,
depend upon the
key space.
Usually,
of the distribution
possible
using
the expected number of
in advance of the queries.
in the file
to
domain of the
in each terminal
in the record
is independent
for any particular
This test
The parameters
contained
distribution
be seen to be good for
be optimal
is to minimize
to the optimization
of query records
must be con-
returning
in Appendix 1.
key nwnber and partition
the
for the node representing
the geometric
the search algorithm.
node, and the number of records
If
of the complete search algorithm
The goal of the optimization
records
side of the
subtree
the search.
within
This
the bounds-overlap-
is made before
are described
description
an algorithmic
of that
need be examined.
tests
If
of m best matches is correct
and no more records
contains
test
equal
to the query record.
recursively
to continue
radius
on the opposite
then the records
is called
with
those
so far encountered.
test.
records
A "ball-within-bounds"
mines whether the ball
node.
record
then none of the records
the ball,
delimiting
at the query record
to the mth closest
bounds do overlap
that
centered
boundaries
to as the "bounds-overlap-ball"
fails,
partition
the geometric
records.
one has
Thus, we
of queries
and
Such a procedure
query distributions
but will
not
A second restriction
is that
key number and partition
subfile
represented
the k-d tree
by that
node.
the discriminating
The information
ing is the location
that
lie
we can provide
a binary
choice
It
side.
is well
known that
This criterion
dictates
at the median of the marginal
tive
of which key is chosen for the discriminator.
side of the partition
sect the current
the partition
radius
m-nearest
is greater
neighbor
query locations)
for
range in values
before
every nonterminal
discriminator,
likely.
that we locate
the parti-
of key values,
irrespec-
the subfile
if
on the opposite
does not inter-
the distance
of the ball.
By definition,
Thus, the probability
(averaged
key which exhibited
to
of being on either
That is,
is least
provided
were equally
the partition
key coordinates.
the ball
that
if
ball.
than the radius
intersecting
The prescription
searching
to the query record
is the same along all
the partition
distribution
can exclude
for
of those records
information
tion
The search algorithm
[8] and
by the partition-
and the identities
should have had equal probability
side of the partition.
tree
value at each nonterminal
is maximal when the two alternatives
Thus, each record
binary
a prescription
to the search algorithm
of the partition
on either
so that
time complexity.
key and partition
provided
a general
is known to be NP-complete
of non-polynomial
Under these two restrictions,
choosing
is necessary
avoiding
recursively,
for discriminating
node depend only on the
This restriction
Such an optimization
very likely
values
at any particular
node.
can be defined
optimization.
thus,
value
the solution
over all
the greatest
to
the
of
possible
spread or
the partitioning.
for
optimizing
node the key with
the k-d tree,
the largest
then,
spread in values
and to choose the median of the discriminator
-7-
is to choose at
as the
key values
as the partition.
is developed
presents
The optimum number of records
in the next section
an algorithm
that
on analysis
builds
for
each terminal
Appendix 3
of performance.
an optimized
k-d tree
bucket
according
to this
prescription.
Analysis
of the Performance
The storage
file
size,
stored
for
file
The discriminating
required
At each level
of the tree,
This requires
computation
logN, so the total
kNlogN.
value must be
The number of non-
b is the number of records
to build
the entire
[Here we are solving
in each term-
C',(l),
xi(2),
in the file.
" ' Xi(k)]
If
the value
space of k dimensions.
coordinate
sented as a point,
the m closest
dissimilarity
space.
is proportional
relation
TN = 2TN/2
+ m'
framework.
the set of key values
a record
file
is
to
of the search is not so easily
of each key is plotted
The entire
must be scanned.
TN = O(kNlogN).]
in a geometric
for
derived.
The depth of the tree
the tree
the recurrence
represent
then the set of key values
k-dimensional
to kN.
to have the solution
discussed
is easily
set of key values
to build
The expected time performance
is most easily
the k-d tree
proportional
computation
which is well-known
find
the
bucket.
The computation
It
is proportionalto
key number and partition
N
I; - lwhere
11
nodes is
organization
node of the k-d tree. (2)
for each nonterminal
terminal
inal
N.
required
represents
Let ?i =
the ith
record
along a coordinate
axis,
for
a point
is a collection
The query record
derived.
in a coordinate
of such points
can similarly
in
be repre-
x" in this space. The best match problem is then to
q'
points to the query point in this space by the given
measure.
-8-
The performance
of records
of the algorithm
in the file,
N, the dimensionality
number of nearest
neighbors
terminal
b, the dissimilarity
buckets,
distribution
sought,
p(x", of the file
Let S-(3-)
l”
at ?q that
Y
contains
So
q
{x”
measure D(?,?),
ball
in the record
D(x’,?q)
5
D(it,~?~)
mth nearest
where 4
is the
and the
probability
of this
content
in the
and the
key space.
space centered
to x”q'
points
k, the
employed,
in the coordinate
the m closest
1
number
(number of keys),
m, the number of records
records
be the smallest
exactly
may depend upon the total
That is,
0)
3
neighbor
region,
to ? .
q
um(zq),
The volume
is defined
as
~(8
with
dx",
J
sm(2q)
It
can be shown [g] that
follows
the probability
a beta distribution,
B(m,N);
0 I um(Xq) Il.
distribution
that
(3)
of um,(Tq)
is,
m-1[ l-;lm]N-m
dim) = (m-l):(N-m),
. ruml
.
independently
of the probability
or the dissimilarity
tribution
measure,
density
D@,% -
function
E[u,l
=
urn p[u,l
-9-
of the points,
The expected value
is
m
dum = -N+l
(4)
of this
p (3,
dis-
These results
state
has probability
that
content
m/(N+l)
exactly
the file
size,
N, is large
Sm(zq) is small and thus the probability
p(x", is approximately
constant
m points
on the average.
we assume that
To proceed further,
enough so that
any compact volume enclosing
within
the region
distribution
Sm(?q).
In this
.
eqn 3 by
we can approximate
case,
and from eqn 5
E[vm(X3,)] 3 m
N+l
Here 5 (zq) is the probability
* Note that
sm($)
Consider
gorithm
that
it
size.
reasonably
hypercubical
contain
very nearly
that
In fact,
compact.
with
the geometric
edge length
The effect
divide
the coordinate
each containing
have that
the expected
the largest
spread in values
approximately
be
is
to the coordinate
hypercubical
the same number of records.
volume of such a bucket
at
of the volume of the
k-d tree partitioning,
E[vb(‘b)I
- j& +j
will
shape of these buckets
The edges are parallel
space into
al-
where b is the maxi-
shape of these buckets
equal to the kth root
of the optimized
very nearly
b records,
the expected
space occupied by the bucket.
axes.
k-d tree partitioning
Choosing the median insures
section.
Choosing the key with
each node insures
(7)
averaged over the small region
of the optimized
in the previous
each bucket will
mum bucket
density
.
can never be zero.
now the effect
described
'
p (xi,,
then,
is to
subregions,
From eqn 7, we
is
(8)
b
- 10 -
G 5 bc
Two important
minimizing
results
with
it
follow
respect
the (upper bound on the)
buckets
G(k)]
= b{[ ;
;
+ ljk
from this
to b yields
02)
expression.
the result
number of records
should each contain
.
one record.
b=l;
to minimize
the terminal
examined,
With this
First,
provision,
kqn 12
becomes
5 2 {[ mG(k)]
The second important
amined
is independent
tribution
derived
can be easily
accumulated
and distribution
a fixed
number of records
file
containing
volume containing
leaving
as possible.
these results
by any region,
the
then
This is accomplished
the
file
buckets
to file
partitions
the k-dimensional
bucket has the same properties
m best matches.
(b and m, respectively)
As a result,
and their
geometrical
the dependence of the bucof key values
Sm(?q) containing
the m best matches.
increases,
the m best matches shrink
the number of overlapped
buckets,
- 12 -
J,
as the
Namely, each contains
size and distribution
key density
size
consequence of the prescription
This prescription
compact.
for the region
size or the local
key space.
overlapped
is a direct
each terminal
Sm(?q),
cal to that
dis-
If the goal is to minimize
of the number of overlapped
region,
ket volumes on total
in the record
ex-
as small as possible.
k-d trees.
shapes are reasonably
number of records
N, and the probability
size,
the buckets
of key values
space so that
the expected
intuitively.
should be as fine
The independence
optimizing
p (3,
03)
here in a somewhat obtuse fashion,
coverage of all
by making each bucket
record
of the file
understood
the partitioning
+ ljk.
is that
of the key values,
Although
for
result
1
E
is identiAs the
the bucket volumes and the
at exactly
constant.
the same rate,
The constancy
creases
implies
logarithmic
of the number of records
that
in file
file
search time for
I.
is directly
The amount of backtracking
N.
binary
to descend from the root to the terminal
4, which we have demonstrated
portional
is a balanced
in the number of nodes,which
size,
size
in-
to search for best matches is
The k-d tree
size.
the time required
logarithmic
the time required
examined as file
buckets
is
to the
is proportional
to
Thus, the expected
of N.
the m best matches to a prespecified
Thus,
proportional
in the tree
to be independent
tree.
query record
is pro-
to 1ogN.
Dissimilarity
Measures
The derivations
concerning
of the preceding
the particular
dissimilarity
some implicit
are, however,
A dissimilarity
section
make no explicit
measure, D(x",%,
assumptions
that
measure is defined
assumptions
employed.
There
are now discussed.,
as
k
fib(i),
D(x",,3 = F
11
i=l
where the k -i- 1 arbitrary
the basic properties
f,b,Y)
I
functions
YWI
3
F and {fi]tZl,
04)
are required
to satisfy
of symmetry
15i5k
= f,(Y,X)
J.
(15s)
and monotonicity
F(x) 2 F(Y)
fi(x,z)
The k functions,
they define
2 fi(x,y)
if
if
(15b)
x>y
‘z 2 y-l
or
x
xryr
zj
{fi(x,y)]t,L,
are called
the one-dimensional
distance
- 13 -
\!
Isick
the coordinate
.
distance
along each coordinate.
05c)
functions;
Since the
-
spread in coordinate
the center,
estimate
values
the ith
coordinate
the bounds-overlap-ball
To this
the particular
dissimilarity
will
yields
serve just
tions
are all
the linear
functions
It
described
if
identical,
that
?(x,y)
is,
fi(x,y)
depends upon
Any set of funcdistance
the coordinate
= f(x,y)
functions
distance
for
II
func-
i 5 k, then
can be used to estimate
= Ix-y/
that
The purpose
the k-d tree.
as the coordinate
For example,
in Appen-
is not necessary
is to order the key numbers.
as well.
of the
also appear in
of the k-d tree
measure employed.
the same ordering
function
the construction
tests
be used in building
from
should be used to
during
distance
the construction
extent,
of the spread estimation
that
key values
and ball-within-bounds
these functions
tions
function
(Th ese coordinate
k-d tree.
dix 1.)
to be the average distance
distance
the spread in the ith
optimized
exactly
is defined
the spread in
key values.
The properties
directly
through
Appendix
1).
measure.
of the dissimilarity
the bounds-overlap-ball
These tests
First,
nondecreasing
ordinate.
with
increasing
coordinate
by eqn 14, together
distance,
dissimilarity
set.
with
tests
along any co-
IX(i)-W)l,
of the co-
dissimilarity
based
for a dissimilarity
measure
of eqn 15, are sufficient
the restrictions
(see
must be
63,
based on any subset
The form required
algorithm
of a dissimilarity
between two points,
linear
this
to
both of these properties.
A dissimilarity
addition
only two properties
the dissimilarity
into
and ball-within-bounds
must be less than or equal to the actual
on the full
guarantee
require
Second, a partial
ordinates
measure enter
measure is said to be a metric
to symmetry and monotonicity
(eqns 14-15c),
distance
it
if,
in
obeys the tri-
angle inequality
D(x",?) + D(?,,z? 2 D(?,a.
- 14 -
06)
.-
The most common metric
distances
are the vectorspace
p-norms
rk
0,(x”,?)
= ‘1
ii=l
Of these,
IX(i)
m)
i;
- Y(i)j'
r
the most commonly used are:
p = 1:
taxicab
p = 2:
Euclidean
p=
maximum coordinate
co:
or city
block
distance
distance
distance
.
That is,
D,o,(?,?)
Since the separate
distances,
estimate
coordinate
the linear
particular
number of records
examined
pected number) as a function
.
functions
building
08)
are identical
?(x,y)
= Ix-y/,
the k-d tree. (3)
(eqn 18), the geometric
and the inequality
distance,
- Y(i)/
function,
for
case of the p = co distance
For this
distance
distance
the key spreads
9a and 13) is unity,
IX(i)
= max
llilk
for
these
can be used to
For the special
constant
G(k) (eqns
of eqn 13 becomes an equality.
we can therefore
(instead
calculate
the expected
of an upper bound on the ex-
of the number of best matches, m, and number
of keys, k :
Rco(m,k)
Note that
ball
i
= (m + l)k
for m=l, Rco(l,k)
of constant
= 2k.
The number of buckets
volume decreases with
serves as a --lower bound for
all
(1-9)
.
vector
increasing
by a
p, so the p = co result
space p norms.
- 15 -
overlapped
There is an assumption
vious
section.
optimal
It
order;
is that
that
is,
query record.
It
comes to this
ideal.
this
in the results
of increasing
does exist,
dissimilarity
inefficiency
the geometric
constant,
G(k),
eqn 19 represents
from the
geometrical,
it
in eqns 12 and 13,
However, to the extent
eqn 19 is overly
in
search algorithm
is purely
unchanged.
of the pre-
examines the buckets
how close the k-d tree
conclusions
G(k) = 1) and thus,
.,
in order
Since this
the general
inefficiency
is implicit
the search algorithm
is not clear
can be absorbed into
leaving
that
optimistic
(as it
a lower bound even for
that
assumes
the p = co
distance.
Simulation
Results
Several
simulations
mance of the algorithm
The results
eqn 19.
lation,
record
with
are presented
unit
uncertainty
two percent
in the worst
Figure
record).
tor
A similar
matrix.
and the number of record
predicted
from a normal
set of 2000 query
examinations
of these averages is quite
required
small,
with
are shown both for
in the previous
The solid
line
dimensionality
represents
of the algorithm
section.
being around
required
(number of keys per
and the p=co vec-
eqn l-9 which predicts
the
(R = 2k).
corresponds
For low dimensionality
-
examinations
the p=2 (Euclidean)
number for the p = co metric
The behavior
The
cases.
the best match (m=l) varies
Results
by
For each simu-
keys was generated
1 shows how the average number of record
space norms.
expected
1 and 2.
the perfor-
the m best matches was averaged over these 2000 queries.
statistical
to find
into
to the performance
in Figures
dispersion
keys was generated
to find
to gain insight
and to compare it
of 8192 sets of record
a file
distribution
were performed
16
-
closely
(k I6),
to that
discussed
the p=co results
-
strongly
that,
the 2k dependence.
These simulation
results
for m=l, the k-d tree
search algorithm
is not far from
exhibit
at least
optimal.
lation
(k 5 6) where N = 8192 appears to
For those dimensionalities
be big enough for the validity
results
of the large
for p = co lie
indicate
assumption, (4) the simu-
file
no more than 2C$ above that
predicted
by
eqn 19.
The Euclidean
performance
distance
of the algorithm
The increase
p=co.
results
for
shown in Figure
for
tance is to be chosen mainly
the higher
for
that
the
lower p-norms is not as good as for
in expected number of records
but becomes more pronounced
1 confirm
rapid
examined is not severe,
dimensionalities.
calculation,
If
a dis-
the p=co distance
is
a good choice.
Figure
2 shows how the number of records
The average number of record
ber of best matches sought.
quired
to find
Euclidean
(solid
the corresponding
the m-nearest
neighbor
cells,
ball
therefore,
to be linear
grows linearly
should
mately borne out by the results
of the non-optimality
a larger
large
fwa
dtiensions,
the large
thee Figure
re-
If
with
file
that
m-l and 5% for m=25.
- 17 -
increasing
One would in-
The average number of
This is approxi-
2 also shows that
the effect
becomes more pronounced for
is assumed that
assumption
2ashows
m.
similarly.
Figure
it
with
of eqn 19
since the expected volume of
of the search algorithm
number of best matches.
enough so that
than linearly.
increase
shown.
the prediction
examined rises
more slowly
expect the increase
overlapped
along with
The average number of records
number of best matches slightly
tuitively
examinations
number of best matches for both the
and p=co norms is displayed
line).
examined depends on the num-
is valid
8192 records
is
even for m=25 in
the inefficiency-‘ls-l%--r---
-
Implementation
The above discussion
has centered
amined as the sole criterion
for performance
This has the advantage that
implementation
evaluation
requirements
to the number of records
examined,
These considerations
include
and the overhead computation
required
kNlogN, as previously
stated.
3 where the actual
there
are other
required
to build
The overhead required
This calculation
node visited
in the search.
calculating
the dissimilarity
boundary of the subfile
test
point,
few keys.
then it
the subfile
If,
on the oth.erhand,the
bounds do in fact
test
becomes as expensive
suggests
that
if
a subfile
simply be investigated
This situation
several
if
must be performed
the tree
a full
is
values
of k.
at each
in Appendix
quickly
on the basis
of the keys.
keys are included
dissimilarity
to overlap
and the bounds-overlap-ball
1,
to the
The coordinate
the boundary is far
then all
to
in Figure
from the query record
is very likely
is most likely
the k-d
of only a
If
calculation
point,
the
and the
This
calculation.
the ball,
dis-
from the
boundary is close to the test
the ball,
as
as well.
is dominated by the bounds-
to examine most or all
overlap
for
As described
can be excluded
may be necessary
empirically
under consideration.
tances are compared one key at a time;
related
is proportional
needed to build
to search the tree
non-terminal
Al-
to search the tree.
number of records
calculation.
involves
to build
the k-d tree
overlap-ball
is executed.
considerations
This is illustrated
of the total
of
are strongly
required
computation (5) per record
shown as a function
closest
of the details
of the algorithm
it
should
omitted.
to occur near the bottom of the tree where
-
18
ex-
cf the algorithm.
is independent
the computation
The computation
it
evaluation
and the computer upon which the algorithm
though the computational
tree
on the expected number of records
-
the file
records
profitable
are closest
to increase
the bucket
the number of record
lation
each file
record
a bucket are relatively
records
efficient
most or all
to have larger
the number of records
This speculation
quired
for finding
creasing
the performance
simulations
it
bucket
near the-bottom
It
sizes
of
calcu-
Since the records
is very likely
pass.
that
if
in
one of
is then more computation-
ezen though this
increases
examined.
in Figure
4.
Here the computation
of the search.
bucket sizes.
reIn-
considerably
improves
This improvement is approximately
constant
size from one record
per bucket
from 4 to 32.
Figure
(not
4 shows results
shown) verify
dent of dimensionality,
records,
each bucket.
best matches is shown for various
sizes
Although
will
is confirmed
the bucket
for bucket
once for
must
a bounds-overlap-ball
per bucket,
close together,
them passes the test,
calculation
close to the query record
need only be performed
ally
it may be
sizes even at the expense of increasing
a bounds-overlap-ball
per bucket,
With several
the tree.
Therefore,
comparisons.
With one record
be made for
to the query record.
that
for
this
only a few situations,
other
behavior
indepen-
k, number of best matches,
is completely
m, and number of file
N.
Comparison to Other Methods
The only previous
various
dimensionalities,
records
is the sorting
This algorithm
the brute
for
force
method with
verified
number of best matches,
algorithm
of Friedman,
has been shown to yield
method (linear
a wide variety
expected performance
of situations.
and number of file
Baskett
a considerable
search over all
Figure
- 1-g -
for
and Shustek [?I].
improvment
the records
over
in the file)
5 shows the computation
(CPU
milliseconds
per query)
sorting
algorithm
Also
shown is the average number of records
The rate
indicates
is valid.
near-asymptotic
of sixteen
of increase
how near it
sumption
In four
buckets
by this
tree
method.
(using
required
of this
behavior
dimensions,
sizes
greater
for
files
of 16000 records.
than 2000.
file
size.
limit
increasing
file
size
where the large
file
as-
as small as 128 records.
appears reasonably
dimensions,
close for
the limit
Even for this
case, however,
examined with
file
is not near
the increase
size is only slightly
than logarithmic.
The logarithmic
size increases
except that
behavior
for eight
involved
computation
of the overall
is illustrated
Comparison of Figure
for
dimensions
3 to
in building
mensions,
the increase
represents
while
algorithm
in Figure
is slightly
faster.
the tree
is not excessive.
decreases with
(6)
compu-
The fraction
increasing
dimensions
that
fraction
computation
of
dimensionality.
is the same as the number of file
about 25% of the total
for eight
5,
5 shows that the preprocessing
spent on preprocessing
preprocessing
as the file
computation
the k-d tree
Figure
When the number of query records
five
increasing
5 show that in two dimensions
limit
In eight
in average number of records
tation
average with
in Figure
the asymptotic
for
and the k-d
examined under the k-d tree
occurs even for files
file
faster
records)
is to the asymptotic
The results
algorithm
for
records,
two di-
is between three
and
percent.
The computation
required
by the sorting algorithm has been shown
11
I; 1-I;
km N
. Although this is much worse than
[3] to be proportionalto
logN, the sorting
algorithm
very small files,
it
introduces
is faster
very little
than the k-d tree
- 20 -
overhead so that
algorithm.
for
For larger
files,
,however, the k-d tree
especially
advantage,
Implementation
Efficient
all
algorithm
for
dimensions. (7)
higher
operation
of the k-d tree
buckets
reside
essing,
these data can be arranged
records
in the same bucket
can be stored
the external
is not even necessary
Only the top levels
levels
can be stored
keeps non-terminal
the entire
of the tree
storage
so that
Buckets close together
there will
k-d tree
reside
files,
in fast
that
sons.
ACKNOWLEDGMENT
Helpful
discussions
and J.E.
with
F. Baskett,
Zolnowsky are gratefully
- 21 -
M.G.N. Hine,
acknowledged.
it
memory.
memory; the lower
device under an arrangement
nodes close to their
examines
be few accesses to
For ex&memely large
need to be in fast
on an external
device
Since the search algorithm
on the average,
that
During the.preproc-
memory.
together.
similarly.
that
does not require
on an external
for each query. 63)
storage
algorithm
in fast
are stored
a small number of buckets
..
computational
on Secondary Storage
of the terminal
in the tree
is seen to have a clear
C.T. Zahn,
APPENDIX 1
describes
This appendix
ball-within-bounds
tests
The purpose
geometric
tered
discussed
boundaries
record
query record
delimiting
is determined
partial
distance
than the radius
a ball
the
cento the mth
If
this
dissimilarity
dissimilarity
between the
is greater
from consideration.
if
as follows:
coordinate
This mini-
the query record's
of the geometric
otherwise
it
jth
key
domain, then
is set to the co-
f
then there
of the neighborhood,
(eqn 14),
there
If
The test
sum of coordinate
case of the p=co vector
whether
is no overlap
the sum of coordinate
is no overlap.
as soon as the partial
if
the smallest
is set to zero;
domain and the neighborhood.
to testing
overlap
if
(eqns 14, 15) by which the key falls outside the do3
If any of these coordinate distances is greater
coordinate.
distance
the special
of records
can be eliminated
the bounds for the jth
main in that
F-l(r)
is to determine
That is,
encountered.
then the subfile
ordinate
test
r equal to the dissimilarity
and the query record.
mal dissimilarity
the jth
radius
employed is to find
bounded region
is within
in the text.
a subfile
with
and
r = D(?m,?q) where x" is the
q
and zrn is the mth best match so far encountered in the search.
so far
The technique
than r,
for the bounds-overlap-ball
of the bounds-overlap-ball
at the query record
closest
algorithms
is greater
exceeds
with
failure
exceeds F -l(r).
space norm, this
any of the distances
distances
can terminate
distances
between the
technique
In
reduces
than the radius
and,
so, failing.
The ball-within-bounds
from the query record
compared to the radius,
ordinate
distances
test
is simpler.
to the closer
r.
Here the coordinate
boundary
The test
fails
is less than the radius.
- 22 -
distance
along each key is in turn
as soon as one of these coThe test
succeeds if
all
of these coordinate
Descriptions
distances
are greater
of these tests
than the radius.
in an algorithmic
notation
are presented
in the next appendix.
APPENDIX 2
This appendix
presents
the k-d tree
in an algorith-
search algorithm
mic notation.
global
Xq[l:k],
"key values
WD[l:ml,
"priority
of the query record"
queue of the m closest
countered
P&R[l:ml,
to the mth nearest
en
RD[ 11
at any phase of the search.
is the distance
far
distances
neighbor
encountered."
"priority
queue of the record
corresponding
numbers of the
m best matches encountered
at
any phase of the search"
B+[l:kl,
"coordinate
upper bounds"
B-b:k],
"coordinate
lower bounds"
discriminator
partition
"discriminator
[l:I],
"partition
[l:I];
at each k-d tree
value at each k-d tree
"I is the number of internal
"initialize"
P&D[l:m]
"search"
SEARCH(root);
procedure
t co; B+[l:k]
node"
t CO; B-[l:k]
node"
nodes'
+ - ~0;
SEARCH(node);
begin
local
p, a, temp;
if node is terminal
then begin
I_(examine records
-if
in bucket(node),
updating
BALL WITHIN BOUNDS--I_
then done else return
e+;
d tdiscriminator[node];
p tpartition[node];
- 23 -
P&D, p&R);
so
"recursive
call
on closer
son"
if X [d] 5 p
q
then begin
temp +-B+[dl;
B+[dl
+P;
SEARCH(leftson(node));
B+[d] ttemp;
end
else begin
-temp tB-[d];
~-[a]
+P;
SEARCH(rightson(node));
'recursive
call
on farther
B-Cd] ttemp;
son, if
necessary"
if X [d] 5 p
q
then
begin
-temp +B-[dl;
if
B-Cdl +P;
BOUNDSOVERLAPBALL then SRARCH(rightson(node));
B-b]
ttemp;
end
else begin
--
temp + B+[aJ; B+Cdl + P;
-if
BOUNDSOVERLAPBALL then SEARCH(leftson(node)
B+[d] ttemp;
a;
'see if we should return
or terminate'
i-J BALL WITHIN BOUNDS--then done else return;
- 24 -
>;
logical
procedure
BALL WITHIN BOUNDS;
begin
local
d;
d t 1 --step 1 until
for
k do
COORDINATEDISTANCE (d, Xq[d],
B-L
-or COORDINATEDISTANCE (d, Xq[d],
B+[
-if
aI>5
P&DC1-l
then return(false);
return(true);
a;
.,
logical
procedure
BOUNDSOVERLAPBALL;
begin
local
sum, d;
sum to;
for
d t 1 step 1 until
if
Xqral
k &
< B-t-d]
then begin
"lower
than low boundary"
sum +-sum + COORDINATEDISTANCE (d,Xq[d],
s
DISSIM(sum) > FQD[l]
then return
B-La]);
(true);
end
else if
Xqkd
then begin
> B+[dl
"higher
than high boundary"
sum t sum + COORDINATEDISTANCE (d,Xq[d],
if
DISSIM(sum) > KJD[l]
then return
a;
return
(false);
*;
-
25
-
(true);
B,[d]);
DISSIM (x) and COORDINATEDISTANCE (j,x,y)
The procedures
functions
F(x)
and fj(x,y)
that
appear in the definition
are the
of the dissim-
measure (eqn 14).
ilarity
APRENDIX 3
This appendix presents
the procedure
for
a description
constructing
in an algorithmic
an optimized
notation
of
k-d tree for best match file
searching.
root
tBUILD
TREE (entire
node procedure
file);
BUILD TREE (file);
begin
local
-if
j,d,
maxspread,
SIZE(subfile)
p;
5 b then return@&?,
TERMINAL(file));
maxspread to;
for
k -do "find
j tl
step 1 -until
-if
SPREADEST(j,file)
coordinate
with
greatest
spreadll
> maxspread
then begin
maxspread tSPREADEST(j,file);
d tj;
e&;
en&;
p tMEDIAN(d,file);
return
MARE NONTERMINAL(d,p,BUILMREE(LEFT SUBFILE(d,p,file)),BUILM'REE
(RIGHTSUBFILE(d,p,file));
The procedure
value spread for
SFREADEST(j,subfile)
the records
the jth
coordinate
returns
the median of the jth
are procedures
tree
and return
that
distance
store
a pointer
returns
in the subfile
function.
to that
parameters
node.
- 26 -
represented
The procedure
key values.
their
the estimated
jth
key
by the node, using
MEDIAN (j,subfile)
MfKE TERMINAL and MAKE NONTERMINAL
as values
of a node in the k-d
-
FOOTNOTES
(1)
A record
examination involves
calculating
to
the dissimilarity
the dissimilarity
Since the k-d tree
to store
(3)
(4)
updating
(5)
along each key
the trimmed variance
sures that
the estimate
Asymptotic
behavior
increasing
All
simulations
so far
of m closest
tree,
it
can be determined
file
node [11].
The trimming
empirically
by
(6)
The behavior
for
mic for
(7)
large
eight
on an IBM 370/168
optimization
dimensions
enough file
The comparison
in Figure
grows nearly
best matches,
computer.
(8)
Inspection
and eight
16000 records.
with
file
dimensions,
Increasing
the total
become logarith-
in computation
while
for
Thus, for
m.
for
k-d tree
large
for bucket
respectively,
computation
-
27
numbers of
of
size of 16 records,the
for total
file
size to 32 records
required
-
larger
algor-
accessed is 1i.56, 6.25 and 75.0 for
the bucket
is
increase.
duces the average number of accesses for eight
increasing
the IBM FORTRAN
size at which the performance
5 shows that
average number of buckets
four,
algorithm,
is comparable will
of Figure
All
two.
of course,
The increase
linearly
the crossover
the two algorithms
will,
5.
5 is for the best match (m=l) since this
on application.
T
E
m grows as m for the sorting
it
level
examined
sizes.
the most co
ithm,
observ-
in Figure
programs were coded in FORTRANIV and compiled with
with
in-
extreme outliers.
This is illustrated
were performed
compiler
by cal-
of the average number of records
size.
H (extended)
encounter-
records.
can be estimated
against
it
is not necessary
of the key values.
is robust
of increase
with
the list
record
comparing
to the sons of each nonterminal
The spread of values
ing the rate
keys from memory,
to the query record,
is a complete binary
pointers
culating
the record
to the mth closest
ed, and if necessary,
(2)
fetching
dimensions
two,
size of
(not shown) reto 44.0 while
for the search by only 8%.
REFERENCES
[1]
Burkhard,
W.A. and Keller,
16 (April
Corn. of ACM, Vol.
searching.
[2]
Some approaches to best match file
R.M.
Fukunaga, K.,
and Narendra,
computing
k-nearest
1973), 230-236.
A Branch and bound algorithm
P.M.
IEEE Trans.
neighbors.
for
~24 (1975),
Comput.,
750-753.
[3]
Friedman,
J.H.,
finding
Baskett,
nearest
F.,
and Shustek,
L.J.
IFEE Trans.
neighbors.
An algorithm
Comput.,
for
C-24(1975)
looo- 1006.
[4]
Rivest,
On the optimality
R.
Proceedings
best match searches.
Sweden (August
[5]
of Elias'
for performing
IFIP Congress 74, Stockholm,
1974), 678-681.
Computational
Shamos, M.I.
algorithm
Conference
Geometry.
record
Annual ACM Symposium of Theory of Computing,
of Seventh
Albuquerque,
N.M.,
(May 7, 1975).
[6]
Finkel,
R.A. and Bentley,
retrieval
[7]
Bentley,
[8I
Hyafil,
on composite
L.,
trees
searching.
[F]
[lo]
and Hostetler,
estimates.
Knuth,
D.E.,
The Art
Menlo Park,
L.D.,
used for
optimal
Processing
associ-
509-517.
(Sept.l975),
Optimization
IEEE Trans.
binary
decision
Letters,
Info.
of k-nearest
Theory,
Computing and.Mathematical
Research Associates,
[ll]
4(1)(1974),1-9.
search trees
Constructing
for
Vol.
5,
15-17.
S.M., Numerical
Pizer,
binary
Information
is NP-complete.
Fukunaga, K.,
density
R.L.,
- a data structure
Acta Informatica
Corn. of ACM, vol.18
and Rivest,
(May l976),
Quad trees
keys.
Multidimensional
J.L.
ative
J.L.
Palo Alto,
p 4OP.
- 28 -
(1973), 320-326.
Analysis,
Science
Ca., 1975, pp 88, eqn 87.
of Computer Programming,
Ca., 1969,
IT-19
neighbor
Vol.
1, Addison-Wesley,
I
FIGURF: CAPTIOrJS
FIGURE 1.
Variation
with
of the average number of records
dimensionality
constant
file
(number of keys per record)
Results
size.
FIGURE 2.
number of best matches sought for
Results
dictions
of eqn 19 for
Computation
per file
as a function
examined
several
dimen-
are shown for the Euclidean
The solid
(p=2) and p=co metrics.
tree
is the pre-
of the average number of records
sionalities.
FIGURE 3.
line
of eqn 19 for the p=co metric.
Variation
with
for
are shown for the Euclidean
The solid
(p=2) and p=co metrics.
diction
examined
lines
are the pre-
the p=oo metric.
record
of total
required
file
to build
size for
the k-d
several
dimensionalities.
FIGURE 4.
Computation
function
FIGURF: 5.
for
required
of terminal
Computation
bucket
of total
k-d tree
algorithms
size.
for best match searching
required
function
the best match search as a
file
as a
size for both the sorting
at several
and
dimensionalities.
Also shown is the variation
of the average number of
records
examined with
total
file
*
buckets
of 16 records
were used with
algorithm.
- 29 -
size.
Terminal
the k-d tree
IO3
I
I
8 I92 Records
0
Euclidean
q
p= co Metric
Metric
102
IO’
I
2
4
6
NUMBER OF KEYS
I I
8
IO
(dimens ional I tY)
2668Al
Figure
1
60
0
50
40
-Two Keys per Record
8192 Records
0 Euclidean Metric
30
20
IO
0
5
IO
I5
20
25
NUMBER OF BEST MATCHES
2660A2
Figure
23
200
I50
0
Four Keys per Record
8192 Records
-
0
0 Euclidean Metric
q p=a
Metric
100
50
0
IO
15
5
NUMBER OF BEST
20
25
MATCHES
2668A3
Figure
2b
600
400
I
I
I
Six Keys per Record
8192 Records
0
cl
I
I
0
Euclidean Me tric
p= ~0 Metric
0
200
0
0
I
I
I
IO
I5
5
NUMBER OF BEST
I
I
20
25
MATCHES
2668A4
Figure
2c
I .4
I
’I
I
I
I
I
IIIII~
I
lllll~
r
.,
Euclidean Metric
1.2 - 2 Two Keys per Record
4 Four Keys per Record
I .o
8 Eight Keys per Record
8
8
0.8
0.6
8
8
8
-
8
4
2
2
II
102
4
2
I
I
I
4
4
4
I
IllIll
2
2
2
2
4
I
I
4
3
llllll
I04
IO3
TOTAL NUMBER OF RECORDS IN FILE
2668A7
Figure
3
c
.
70
I I 1
I
I
I
I
IIIII
Euclidean
60
8
50
Metric
2 Two Keys per Record (x10)
8 Eight Keys per Record
8
30
-
20
-
0
I-
8
40
IO
I’
I
-
2
2
I I I
I
2
I
I
I
IIIII
8
8
2
2
8
2
I
I
50
IO
5
I
NUMBER OF RECORDS PER BUCKET
2668A5
Figure
4
N Fl
:m
:
r
0
PER QUERY
ys
cm
Iv
b-J
AVERAGE NUMBER OF RECORDS EXAMINED
0
0
MILLISECONDS
0
z-0
in
30
25
20
I
I
I
IIIII~
I
I
I
Illll~
I
Euclidean Metric
Four Keys per Record
Sorting Algorithm
k-d Tree Algorithm
Ave. Records Examined
I5
IO
5
0
TOTAL NUMBER OF RECORDS IN FILE
Figure
5b
I
300
250
200
I
I
I
I I III(
I
Euclidean
I
I
I
IllI]
I
3000
Metric
Eight Keys per Record
0 Sor ting Algorithm
•I k-d Tree Algorithm
a Ave. Records Examined
n i
2500
2000
150
1500
100
1000
50
500
0
0
TOTAL NUMBER OF RECORDS
Figure
5c
IN FILE

Download Report

SLAC-PUB-1549 (Rev.) - Stanford University

Paperzz.com

Your Paperzz