Slides - Rui Zhang

Approximate NN queries on Streams with Guaranteed Error/performance Bounds
Nick Koudas @ AT&T Labs-Research
Beng Chin Ooi, Kian-Lee Tan, Rui Zhang @ National University of Singapore
Problem
• Problem: kNN search.
• Environment: data stream (one scan; memory
constraint).
• Approximate Solution: e-approximate kNN (ekNN).
• Motivation: Applications in which absolute error is
preferable or more straightforward.
IP:
137.132.48.120
137.132.48.121
…
• Two Optimization Problems:
– memory optimization for a given error bound:
given an error bound e, use as little memory as
possible to answer ekNN queries.
– error minimization for a given memory size:
given a fixed amount of memory, achieve the
best accuracy for ekNN queries.
• Requirements:
– One scan algorithm.
– Satisfies the constraints.
– Efficient updates and query processing.
A Framework
• Divide space into equal square-shaped cells.
• Maintain at most K points in each cell.
• For any k ≤ K, the absolute error of the kNN distance is
bounded by d_M, the maximum distance within a cell.
For Euclidean distance: $d_M = \sqrt{d} / u$,
where d is the dimensionality and u is the number of cells
each dimension is divided into.
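The bound follows from the cell geometry: a d-dimensional cube of side 1/u has a diagonal of sqrt(d)/u, and no two points in the same cell can be farther apart than that. A minimal sketch, assuming the data is normalized to the unit cube [0, 1)^d; the function names `cell_of` and `max_cell_distance` are illustrative and not from the slides.

```python
import math

def cell_of(point, u):
    """Integer cell coordinates of a point in the unit cube [0, 1)^d (assumed domain)."""
    # Clamp so that a coordinate equal to 1.0 still falls in the last cell.
    return tuple(min(int(x * u), u - 1) for x in point)

def max_cell_distance(d, u):
    """Largest Euclidean distance between two points of one cell: its diagonal."""
    return math.sqrt(d) / u
```

For example, with d = 2 and u = 8, any two points that `cell_of` maps to the same cell are at most `max_cell_distance(2, 8)` apart, which is the error bound for that grid resolution.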
Maintenance of the Points
--aDaptive Indexing on Streams
by space-filling Curves (DISC)
• Cells are not explicitly maintained, only
points.
• Cells linearized according to Z-curve.
• Z-value of the cell is the key of a point.
• Points maintained in a B*-tree.
• An efficient merge-cell algorithm is possible.
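The Z-value of a cell can be obtained by interleaving the bits of its integer coordinates. The sketch below is one possible way to do this; the bit order (dimension 0 in the least significant position) and the name `z_value` are assumptions, since the slides do not fix a convention.

```python
def z_value(cell, m):
    """Interleave the m low bits of each cell coordinate into one Z-curve key.

    `cell` holds integer coordinates in [0, 2**m); dimension 0 contributes
    the least significant bit of each group (an assumed convention).
    """
    d = len(cell)
    z = 0
    for bit in range(m):
        for dim, c in enumerate(cell):
            z |= ((c >> bit) & 1) << (bit * d + dim)
    return z
```

Because the key is a single integer, points can be stored in any ordered one-dimensional index keyed by it, which is the role the B*-tree plays here.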
Algorithm: Build index
• m: the order of the Z-curve; $2^m$ cells in each dimension.
• If e is given, we need $\sqrt{d} / 2^{m_e} \le e$, so $m_e \ge \log_2(\sqrt{d} / e)$.
$m_e$ is an integer, so $m_e = \lceil \log_2(\sqrt{d} / e) \rceil$.
• If memory constraint given, set a large enough m.
• Build index
– Initialize m
– Read a record P, calculate Z-value, search the B*-tree and find out Nc:
number of existing points in the cell P belongs to.
– If Nc < K
• Insert P to the B*-tree.
– Else
• Discard one and insert P.
– If memory runs out //this only happens for the error minimization problem
• Merge cells and let m=m-1
– Go back to Step 2 (Read next record)
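The build loop can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: a list sorted by Z-value stands in for the B*-tree, the data lies in the unit cube, `capacity` counts stored points as a proxy for memory, and the point discarded from a full cell is chosen arbitrarily (the slides do not say which one to drop). The names `Disc`, `order_for_error`, and `z_key` are hypothetical.

```python
import bisect
import math

def order_for_error(d, e):
    """Smallest integer m with sqrt(d) / 2**m <= e."""
    return max(0, math.ceil(math.log2(math.sqrt(d) / e)))

def z_key(point, m):
    """Z-value of the order-m cell containing a point of [0, 1)^d."""
    u = 1 << m
    cell = [min(int(x * u), u - 1) for x in point]
    d, z = len(cell), 0
    for bit in range(m):
        for dim, c in enumerate(cell):
            z |= ((c >> bit) & 1) << (bit * d + dim)
    return z

class Disc:
    def __init__(self, d, m, K, capacity=None):
        self.d, self.m, self.K, self.capacity = d, m, K, capacity
        self.entries = []  # (z, point) pairs sorted by z; stand-in for the B*-tree

    def insert(self, point):
        z = z_key(point, self.m)
        # Entries of the cell with key z occupy the slice [lo, hi).
        lo = bisect.bisect_left(self.entries, (z,))
        hi = bisect.bisect_left(self.entries, (z + 1,))
        if hi - lo >= self.K:
            del self.entries[lo]  # cell is full: discard one (arbitrary choice)
            hi -= 1
        self.entries.insert(hi, (z, point))
        # Memory exhausted: coarsen the grid until the points fit again.
        while (self.capacity is not None
               and len(self.entries) > self.capacity and self.m > 0):
            self.merge_cells()

    def merge_cells(self):
        """Lower the order by one; each new cell keeps at most K points."""
        self.m -= 1
        merged, prev, count = [], None, 0
        for z, point in self.entries:
            parent = z >> self.d  # dropping d bits gives the coarser cell's key
            count = count + 1 if parent == prev else 1
            prev = parent
            if count <= self.K:
                merged.append((parent, point))
        self.entries = merged
```

For the memory optimization problem one would construct `Disc(d, order_for_error(d, e), K)` with no capacity; for the error minimization problem one would start from a large `m` and set `capacity`, letting the merge step lower `m` as needed.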
Algorithm: Merge Cells
• General Merge-Cell
– Applies to any structure.
– For each new cell, find all the points of
the old cells in it, and merge them.
• Bulk Merge-Cell
– Applies only to DISC.
– Scan all the leaf pages once.
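The two variants can be contrasted in a short sketch. It assumes Z-values built by bit interleaving, so that dropping the lowest d bits of a key yields the key of the enclosing cell at order m-1; it also assumes that the first K points encountered in a merged cell are the ones kept, which the slides do not specify. Both function names are illustrative.

```python
from collections import defaultdict

def general_merge(entries, d, K):
    """Regroup (z, point) entries given in any order; works for any structure."""
    cells = defaultdict(list)
    for z, point in entries:
        parent = z >> d  # key of the enclosing cell at order m-1
        if len(cells[parent]) < K:
            cells[parent].append(point)
    return dict(cells)

def bulk_merge(sorted_entries, d, K):
    """One sequential pass over entries sorted by z, as in the leaf pages.

    The 2**d old cells that form a new cell have consecutive Z-values,
    so their points are already adjacent and no regrouping is needed.
    """
    merged, prev, count = [], None, 0
    for z, point in sorted_entries:
        parent = z >> d
        count = count + 1 if parent == prev else 1
        prev = parent
        if count <= K:
            merged.append((parent, point))
    return merged
```

The output of `bulk_merge` is again sorted by key, so in this sketch it could be written back as the new leaf level without a separate sort.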
Algorithm: KNN search
• W: a window query centered
at the center of the cell containing Q,
with gradually increasing
side length s.
• Find the kNN to Q within W.
– If the kNN distance is no larger
than the distance from Q to the
nearest side of W,
the search terminates;
– Else increase s by 1/u.
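The search loop can be sketched as below. Several details are assumptions made for illustration: the window query is simulated by a linear scan over a plain list of stored points (a real implementation would issue a range query against the index), the data lies in the unit cube, the initial side length is one cell (1/u), and the loop also stops once W covers the whole space. The name `knn_search` is hypothetical.

```python
import math

def knn_search(points, q, k, u):
    """Expanding-window kNN over stored points in [0, 1)^d."""
    d = len(q)
    step = 1.0 / u
    # Center of the cell that contains the query point q.
    center = [(min(int(x * u), u - 1) + 0.5) * step for x in q]
    s = step  # assumed initial side length: one cell
    while True:
        half = s / 2
        inside = [p for p in points
                  if all(abs(p[i] - center[i]) <= half for i in range(d))]
        inside.sort(key=lambda p: math.dist(p, q))
        best = inside[:k]
        # Distance from q to the nearest side of W.
        margin = min(half - abs(q[i] - center[i]) for i in range(d))
        if len(best) == k and math.dist(best[-1], q) <= margin:
            return best  # no point outside W can be closer than the k-th one
        if s >= 2.0:
            return best  # W already covers the whole unit cube
        s += step
```

The termination test is what makes the result exact with respect to the stored points: once the k-th distance fits within the margin, the ball around q of that radius lies entirely inside W.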
Experiments
Questions?