Similarity Search
Metric Index Structures

Nikolaus Augsten
Free University of Bozen-Bolzano
Faculty of Computer Science
DIS

Unit 12 – June 7, 2012

Outline
1. The M-Tree
   Introduction
   Update and Search

Introduction
M-Trees for Similarity Queries
What is an M-tree?
disk-based index structure for metric distances
reduces search space for similarity query
Features of M-trees:
dynamic (insertion and deletion of data objects)
balanced tree (structure does not degenerate)
supports range and k-nearest neighbor queries
Literature:
M-trees were introduced by Ciaccia et al. [CPZ97] in 1997
Textbook by Zezula et al. [ZADB06] covers M-trees

M-Tree: Illustration
[Figure: example M-tree over the data objects o1 to o11]

Inner and Leaf nodes
Internal nodes: prune irrelevant subtrees
  internal node: tuple of $m$ entries $(e_1, e_2, \ldots, e_m)$
  $e_i = (p_i, r_i^c, d(p_i, p^p), \mathit{ptr}_i)$
    $p_i$: pivot (some data object)
    $r_i^c$: covering radius around $p_i$
    $d(p_i, p^p)$: distance between $p_i$ and the parent pivot $p^p$ of $p_i$
    $\mathit{ptr}_i$: pointer to a child node
  Guarantee: all objects in the subtree of $p_i$ are at most at distance $r_i^c$ from $p_i$
Leaf nodes: store data objects
  leaf node: tuple of $m$ entries $(f_1, f_2, \ldots, f_m)$
  $f_i = (o_i, d(o_i, p^p))$
    $o_i$: data object
    $d(o_i, p^p)$: distance between $o_i$ and the pivot in the parent node

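The entry layouts for internal and leaf nodes can be mirrored in a few Python dataclasses. This is only a minimal in-memory sketch: the class and field names (`LeafEntry`, `InternalEntry`, `Node`, `dist_parent`, ...) are hypothetical, the toy metric is the absolute difference on numbers, and a real M-tree would store nodes in disk pages. The helper `satisfies_guarantee` checks the covering-radius guarantee for one entry.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class LeafEntry:
    obj: Any            # data object o_i
    dist_parent: float  # d(o_i, p^p), distance to the pivot in the parent node


@dataclass
class InternalEntry:
    pivot: Any          # pivot p_i (some data object)
    radius: float       # covering radius r_i^c
    dist_parent: float  # d(p_i, p^p), distance to the parent pivot
    child: "Node"       # ptr_i, here a plain object reference


@dataclass
class Node:
    is_leaf: bool
    entries: List[Any] = field(default_factory=list)


def objects_below(node):
    """Collect all data objects stored in the leaves below node."""
    if node.is_leaf:
        return [e.obj for e in node.entries]
    found = []
    for e in node.entries:
        found.extend(objects_below(e.child))
    return found


def satisfies_guarantee(entry, dist):
    """True if every object in the subtree of entry lies within entry.radius."""
    return all(dist(entry.pivot, o) <= entry.radius
               for o in objects_below(entry.child))


def dist(a, b):
    """Toy metric (assumption for this example): absolute difference."""
    return abs(a - b)


# One leaf with the objects 1, 2, 4 below the pivot 2 (covering radius 2).
leaf = Node(is_leaf=True, entries=[LeafEntry(1.0, 1.0),
                                   LeafEntry(2.0, 0.0),
                                   LeafEntry(4.0, 2.0)])
entry = InternalEntry(pivot=2.0, radius=2.0, dist_parent=0.0, child=leaf)
print(satisfies_guarantee(entry, dist))
```

Storing `dist_parent` in every entry costs one extra number per entry, but it is what later allows subtrees and objects to be discarded without computing new distances.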
Update and Search
Insertion of New Objects
New object $o_N$ is inserted into a leaf node
At each internal node (starting with the root node):
1. find the set $E$ of entries that can store $o_N$ without increasing the covering radius (i.e., $d(o_N, p) \le r^c$)
2. if $E \ne \emptyset$, traverse into the subtree of the entry $e \in E$ with minimum distance $d(o_N, p)$
3. if $E = \emptyset$, increase the covering radius of the entry that requires the minimum increase and traverse into the respective subtree
At the leaf node:
1. compute the distance $d(o_N, p^p)$ between $o_N$ and the parent pivot
2. try to store the new entry $(o_N, d(o_N, p^p))$ in the leaf node
3. if the node is full (overflow), split the node
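The choice made at each internal node during insertion can be sketched as follows. The names `Entry`, `choose_entry`, and `dist` are hypothetical, the objects are plain numbers with the absolute difference as metric, and the fit test uses "at most the covering radius", in line with the covering-radius guarantee. The sketch covers only the subtree choice, not the leaf insertion or the split.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pivot: float
    radius: float  # covering radius r^c


def dist(a, b):
    """Toy metric (assumption for this example): absolute difference."""
    return abs(a - b)


def choose_entry(entries, o_new):
    """Return the index of the entry whose subtree should receive o_new.

    If no entry can hold o_new within its covering radius, the radius of the
    entry needing the smallest increase is enlarged in place.
    """
    distances = [dist(o_new, e.pivot) for e in entries]
    # Set E: entries that can store o_new without increasing the radius.
    fitting = [i for i, d in enumerate(distances) if d <= entries[i].radius]
    if fitting:
        # E is non-empty: take the entry with the closest pivot.
        return min(fitting, key=lambda i: distances[i])
    # E is empty: take the entry with the minimum required radius increase.
    best = min(range(len(entries)),
               key=lambda i: distances[i] - entries[i].radius)
    entries[best].radius = distances[best]
    return best


entries = [Entry(pivot=0.0, radius=2.0), Entry(pivot=10.0, radius=3.0)]
print(choose_entry(entries, 1.0))  # fits the first entry
print(choose_entry(entries, 5.0))  # fits neither, one radius grows
print(entries)
```

Calling `choose_entry` once per level, from the root down, yields the leaf that receives the new entry.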

Splitting a Node in the M-Tree
If a node $N$ overflows, it must be split
Node split
1. create a new node $N'$ at the same level
2. select two new pivots (for $N$ and $N'$)
3. redistribute the $m + 1$ objects to $N$ and $N'$
4. substitute the old pivot by the two new pivots
5. if the parent node overflows:
   a. if the parent node is non-root: split the parent node
   b. if the parent node is root: create a new root node (tree grows by one level)
How to choose new pivots?
  try to keep covering radii as small as possible to avoid overlaps
  criterion: $p_i$ and $p_j$ are used as new pivots if $\max(r_i^c, r_j^c)$ is minimal

Range Query
Definition (Range Query)
Given a set of objects $X \subseteq D$ from a domain $D$ and a query object $q \in D$ with a query radius $r$, the range query $R(q, r)$ retrieves all objects in $X$ within distance $r$ from $q$:
$R(q, r) = \{o \in X \mid d(o, q) \le r\}$
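
The range query definition $R(q, r)$ translates directly into a linear scan, which serves as a baseline that any index must agree with. In this sketch the function name `range_query` and the sample points are made up for illustration, and the Euclidean distance from Python's standard library stands in for an arbitrary metric $d$.

```python
import math


def range_query(X, q, r, dist):
    """Return R(q, r): all objects o in X with dist(o, q) <= r (full scan)."""
    return [o for o in X if dist(o, q) <= r]


points = [(0, 0), (1, 1), (3, 4), (6, 8)]
# Distances from the origin are 0, sqrt(2), 5, and 10.
print(range_query(points, q=(0, 0), r=5, dist=math.dist))
```

The scan computes one distance per object in $X$; the point of an index is to answer the same query with fewer distance computations.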

Range Query in M-Tree
Range query $R(q, r)$:
  search in the M-tree returns a candidate set $C \supseteq R(q, r)$ of data objects
  for each object $o \in C$ the criterion $d(q, o) \le r$ must be verified
Note:
  objects are always in the leaf nodes
  some subtrees with all their leaf nodes can (hopefully) be pruned
  unlike in a B-tree, multiple root-leaf paths might be traversed for a single search

Search Algorithm
1. Start at the root node
2. For each entry $(p, r^c, d(p, p^p), \mathit{ptr})$ in an internal node:
   if $|d(q, p^p) - d(p, p^p)| - r^c > r$ (criterion A), then the subtree of $p$ can be safely ignored (pruned)
   if $|d(q, p^p) - d(p, p^p)| - r^c \le r$, then compute $d(q, p)$; if $d(q, p) - r^c > r$ (criterion B), then prune the subtree, otherwise traverse the subtree pointed to by $\mathit{ptr}$
3. For each entry $(o, d(o, p^p))$ in a leaf node:
   if $|d(q, p^p) - d(o, p^p)| > r$, then ignore object $o$
   otherwise compute $d(q, o)$; if $d(q, o) \le r$, then $o$ is reported as an answer
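
The three steps of the search algorithm can be sketched in Python on a small hand-built tree. Everything named here is an assumption for illustration: nodes are plain dictionaries, the metric is the absolute difference on numbers, and the helpers `make_leaf` and `make_entry` only exist to keep the example short. The root has no parent pivot, so the sketch skips criterion A there.

```python
def dist(a, b):
    """Toy metric (assumption for this example): absolute difference."""
    return abs(a - b)


def make_leaf(parent_pivot, objects):
    """Leaf node with entries (o, d(o, p^p))."""
    return {"leaf": True,
            "entries": [{"obj": o, "d_parent": dist(o, parent_pivot)}
                        for o in objects]}


def make_entry(pivot, radius, child, parent_pivot=None):
    """Internal entry (p, r^c, d(p, p^p), ptr); d_parent is None in the root."""
    d_parent = None if parent_pivot is None else dist(pivot, parent_pivot)
    return {"pivot": pivot, "radius": radius,
            "d_parent": d_parent, "child": child}


def range_search(node, q, r, d_q_parent=None):
    """Return all objects o below node with dist(q, o) <= r.

    d_q_parent is d(q, p^p) for the pivot of the entry pointing to node,
    or None for the root, which has no parent pivot.
    """
    result = []
    for e in node["entries"]:
        if node["leaf"]:
            # Leaf filter: discard o without computing d(q, o).
            if d_q_parent is not None and abs(d_q_parent - e["d_parent"]) > r:
                continue
            if dist(q, e["obj"]) <= r:
                result.append(e["obj"])
        else:
            # Criterion A: uses only distances that are already known.
            if (d_q_parent is not None
                    and abs(d_q_parent - e["d_parent"]) - e["radius"] > r):
                continue
            d_qp = dist(q, e["pivot"])
            # Criterion B: the spheres (p, r^c) and (q, r) do not overlap.
            if d_qp - e["radius"] > r:
                continue
            result.extend(range_search(e["child"], q, r, d_qp))
    return result


# Three-level tree over the objects 1, 2, 4, 5, 6, 7, 11, 12, 14.
inner1 = {"leaf": False, "entries": [
    make_entry(2, 2, make_leaf(2, [1, 2, 4]), parent_pivot=4),
    make_entry(6, 1, make_leaf(6, [5, 6, 7]), parent_pivot=4)]}
inner2 = {"leaf": False, "entries": [
    make_entry(12, 2, make_leaf(12, [11, 12, 14]), parent_pivot=12)]}
root = {"leaf": False,
        "entries": [make_entry(4, 3, inner1), make_entry(12, 2, inner2)]}

print(range_search(root, q=7, r=1))  # objects within distance 1 of 7
```

Because every reported object passes the final check `dist(q, o) <= r`, the result is exact; the pruning tests only decide how many distances have to be computed along the way.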
Pruning criterion B: $d(q, p) - r^c > r$
  the objects in the subtree of $p$ are within radius $r^c$ from $p$
  the range query looks for objects within radius $r$ from $q$
  if $d(q, p) > r^c + r$, then the spheres defined by $(p, r^c)$ and $(q, r)$ do not overlap, thus the subtree of $p$ is pruned
  $d(q, p)$ is only computed if criterion A does not hold
Pruning criterion A is applied first, before computing $d(q, p)$: $|d(q, p^p) - d(p, p^p)| - r^c > r$
  both $d(q, p^p)$ and $d(p, p^p)$ are known
  $|d(q, p^p) - d(p, p^p)| \le d(q, p)$ follows from the triangle inequality
  hence $d(q, p) - r^c \ge |d(q, p^p) - d(p, p^p)| - r^c > r$, i.e., criterion A $\Rightarrow$ criterion B $\Rightarrow$ subtree of $p$ can be pruned

References
[CPZ97] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[ZADB06] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer, 2006.