Construction and Maintenance of Structural Indexes for XML Data

Incremental Maintenance of XML Structural Indexes
Ke Yi¹, Hao He¹, Ioana Stanoi² and Jun Yang¹
¹Department of Computer Science, Duke University
²IBM T. J. Watson Research Center
Motivation



XML has gained tremendously in popularity in recent years
Used to represent many kinds of data
Major DB vendors are rushing to incorporate solutions for native XML repositories and retrieval
    IBM DB2, Oracle, Microsoft SQL Server
    Tamino, Natix, X-Hive, …
Overview
[Figure: example data graph of a paper document with 18 numbered nodes labeled paper, section, title, algorithm, proof, uses, exp, and about, and the text values “intro”, “1-index”, “experiments”, and “A(k)-index”]
Label Path Expressions
/paper/section/algorithm
[Figure: the example data graph again, used to illustrate the path expression above]
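A minimal sketch of how a label path expression such as /paper/section/algorithm can be evaluated directly on a data graph, one step at a time. The small graph below is a hypothetical fragment, not the exact graph from the slide, and the function name evaluate_path is an assumption made for illustration.

```python
from collections import defaultdict

def evaluate_path(labels, children, root, path):
    """Return the nodes reached from `root` by following the labels in `path`."""
    steps = [s for s in path.split("/") if s]
    if not steps or labels[root] != steps[0]:
        return set()
    frontier = {root}
    for step in steps[1:]:
        # Keep only the children whose label matches the next step.
        frontier = {c for n in frontier for c in children[n] if labels[c] == step}
    return frontier

# Hypothetical fragment of a paper document (node ids are illustrative).
labels = {1: "paper", 2: "section", 3: "title", 4: "section",
          5: "title", 6: "algorithm", 7: "proof"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6), (6, 7)]
children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)

print(evaluate_path(labels, children, 1, "/paper/section/algorithm"))  # {6}
```

Every step scans the children of all nodes in the current frontier, which is the cost a structural index aims to reduce by summarizing nodes that behave alike.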
Structural Indexes

Why do we need them?
    Speed up the evaluation of path expressions
    Provide a structural summary of the data graph
Structural indexes
    DataGuide [Goldman & Widom 97]
    1-index [Milo & Suciu 99]
    A(k)-index [Kaushik et al. 02], D(k)-index [Qun et al. 03], M(k)-index [He & Yang 04]
    Integration of structural indexes and inverted lists [Kaushik et al. 04]
Focus on maintenance
    Has a major effect on index efficiency
    Remains an overlooked issue
Outline
[Figure: the example data graph, reused as the talk outline: “intro”, “1-index”, “A(k)-index”, “experiments”]
1-Index: Definition


Constructed by using bisimilarity
Definition based on stability
    Partition data nodes (dnodes) into index nodes (inodes)
    Notation: dnode v, inode I[v] (the inode that contains v)
    I[u] is v’s index parent if u is v’s parent
    An inode is stable if all of its dnodes have the same index parents
    In a 1-index, all inodes are stable
[Figure: a dnode u with child v, and the corresponding inodes I[u] and I[v]]
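The stability condition can be checked mechanically. The sketch below computes each dnode's set of index parents for a given partition and tests whether every inode is stable. The helper names and the tiny graph are assumptions for illustration, and the check covers only stability (it does not verify that the dnodes in an inode share a label).

```python
def index_parents(partition, parents):
    """Map each dnode to the set of inodes (by position) that contain its parents."""
    inode_of = {v: i for i, block in enumerate(partition) for v in block}
    return {v: frozenset(inode_of[u] for u in parents.get(v, ()))
            for v in inode_of}

def is_stable(partition, parents):
    """True if, in every inode, all dnodes have the same index parents."""
    ip = index_parents(partition, parents)
    return all(len({ip[v] for v in block}) == 1 for block in partition)

# Hypothetical data graph: R -> A, R -> B, A -> C1, A -> C2, B -> C3.
parents = {"A": ["R"], "B": ["R"], "C1": ["A"], "C2": ["A"], "C3": ["B"]}

fine = [{"R"}, {"A"}, {"B"}, {"C1", "C2"}, {"C3"}]
coarse = [{"R"}, {"A"}, {"B"}, {"C1", "C2", "C3"}]

print(is_stable(fine, parents))    # True
print(is_stable(coarse, parents))  # False: C3's index parent differs from C1's
```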
1-Index: Example
[Figure: the example data graph and its 1-index, with inodes paper {1}, section {2,4,8,13}, title {3,5,9,14}, algorithm {6,10}, exp {15,16}, proof {7}, uses {12}, proof {11}, about {17,18}; the path expression /paper/section/algorithm is evaluated on the 1-index]
1-Index: Quality
Assigning bisimilar dnodes to different inodes
    does not affect correctness,
    but does affect efficiency
The quality of an index:
    $\text{quality} = \left( \frac{\#\,\text{inodes}}{\#\,\text{inodes in the minimum 1-index}} - 1 \right) \times 100\%$
Ideal: quality = 0%
[Figure: a 1-index of the example data graph in which the section inode {2,4,8,13} is split into {2,4} and {8,13}]
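As a quick worked instance of the quality measure, the sketch below applies the formula to inode counts read off the example: 9 inodes in the minimum 1-index and 10 once the section inode is split in two. The function name index_quality is an assumption, and the counts are taken from the figure labels rather than stated in the text.

```python
def index_quality(num_inodes, num_min_inodes):
    """Quality in percent: relative excess of inodes over the minimum 1-index."""
    return (num_inodes / num_min_inodes - 1) * 100.0

print(round(index_quality(10, 9), 1))  # 11.1
print(index_quality(9, 9))             # 0.0, the ideal case
```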
Previous Results

Construction
    The PT algorithm [Paige & Tarjan 87], in time O(m log n)
        m: # edges, n: # nodes
Edge changes
    The propagate algorithm [Kaushik et al. 02]
    Quality of the 1-index after update
        No guarantee on the quality of the resulting index
        3% to 5% after 500 edge insertions in experiments
Subgraph addition
    Index reconstruction
Edge Insertion: An Example (1)
[Figure: a data graph with nodes R, A, B, C1, C2, C3, D1, D2, D3 and its 1-index with inodes R, A, B, {C1, C2}, {C3}, {D1, D2}, {D3}; step “Split 1” splits {C1, C2} into {C1} and {C2}]
Edge Insertion: An Example (2)
[Figure: “Split 2” splits {D1, D2} into {D1} and {D2}; “Merge 1” merges {C2} and {C3} into {C2, C3}; “Merge 2” merges {D2} and {D3} into {D2, D3}]
The result is indeed the minimum 1-index for the data graph after the update
Not a coincidence!
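The end result of this example can be cross-checked by recomputing the minimum 1-index from scratch before and after the insertion. The sketch below uses naive partition refinement, not the incremental split/merge algorithm of the talk. The edge list and the inserted edge B → C2 are assumptions, chosen to be consistent with the inodes shown in the figures.

```python
from collections import defaultdict

def minimum_1_index(labels, edges):
    """Coarsest stable partition, by naive refinement starting from the labels."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    block = dict(labels)  # initial partition: one block per label
    while True:
        # Signature: current block plus the set of blocks of the parents.
        sig = {v: (block[v], frozenset(block[u] for u in parents[v]))
               for v in labels}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = new_block
    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}

# Assumed data graph (labels: C1..C3 are "C", D1..D3 are "D").
labels = {"R": "R", "A": "A", "B": "B",
          "C1": "C", "C2": "C", "C3": "C",
          "D1": "D", "D2": "D", "D3": "D"}
edges = [("R", "A"), ("R", "B"), ("A", "C1"), ("A", "C2"), ("A", "C3"),
         ("B", "C3"), ("C1", "D1"), ("C2", "D2"), ("C3", "D3")]

before = minimum_1_index(labels, edges)
after = minimum_1_index(labels, edges + [("B", "C2")])  # assumed inserted edge
print(sorted(sorted(b) for b in before))
print(sorted(sorted(b) for b in after))
```

Under these assumptions the partition before the insertion contains {C1, C2}, {C3}, {D1, D2}, {D3}, and the one after contains {C1}, {C2, C3}, {D1}, {D2, D3}, matching the figures.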
Minimum & Minimal Indexes


Minimum: with the smallest number of inodes
Minimal: no two inodes can be merged
[Figure: a data graph with nodes R, A1, A2, B1, B2; its minimum 1-index with inodes R, {A1, A2}, {B1, B2}; and a minimal 1-index with inodes R, A1, A2, B1, B2]
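The gap between minimal and minimum only appears on cyclic graphs. The sketch below checks both properties on an assumed graph with two cycles (A1 ↔ B1 and A2 ↔ B2, both A nodes reachable from R). The edges are a guess consistent with the figure labels, and the helper names are hypothetical. Here "minimal" is tested as on the slide: no two same-label inodes can be merged while keeping every inode stable.

```python
from itertools import combinations

def is_stable(partition, parents):
    """True if all dnodes of each inode have the same set of index parents."""
    inode_of = {v: i for i, block in enumerate(partition) for v in block}
    for block in partition:
        seen = {frozenset(inode_of[u] for u in parents.get(v, ())) for v in block}
        if len(seen) > 1:
            return False
    return True

def is_minimal(partition, parents, labels):
    """True if merging any two same-label inodes breaks stability."""
    for a, b in combinations(partition, 2):
        if labels[next(iter(a))] != labels[next(iter(b))]:
            continue
        merged = [blk for blk in partition if blk is not a and blk is not b]
        merged.append(a | b)
        if is_stable(merged, parents):
            return False
    return True

# Assumed cyclic data graph: R -> A1, R -> A2, A1 <-> B1, A2 <-> B2.
labels = {"R": "R", "A1": "A", "A2": "A", "B1": "B", "B2": "B"}
parents = {"A1": ["R", "B1"], "A2": ["R", "B2"], "B1": ["A1"], "B2": ["A2"]}

singletons = [frozenset({v}) for v in labels]
merged = [frozenset({"R"}), frozenset({"A1", "A2"}), frozenset({"B1", "B2"})]

print(is_stable(singletons, parents), is_minimal(singletons, parents, labels))
print(is_stable(merged, parents), len(merged) < len(singletons))
```

Both partitions are stable and neither allows a single pairwise merge, yet only the 3-inode one is minimum: merging A1 with A2 is valid only if B1 and B2 are merged at the same time.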
Quality Guarantee


Theorem: The split/merge algorithm always maintains a minimal 1-index
Lemma: For acyclic data graphs, there is a unique minimal 1-index
    The minimum 1-index is always maintained
For cyclic data graphs, there could be more than one minimal 1-index
    One of them is maintained
Outline
[Figure: the example data graph, repeated as the talk outline]
A(k)-Index: Definition


k-bisimilarity
Definition based on stability
    A(0)-index: partition by label
    …
A(k)-index
    An inode in the A(k)-index is stable if all of its dnodes have the same index parents in the A(k-1)-index
    Only interested in paths of length ≤ k
    Shown to be much smaller and more efficient than the 1-index [Kaushik et al. 02]
    But no efficient maintenance algorithms are known!
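The level-by-level definition can be sketched directly: A(0) partitions by label, and each A(i) groups dnodes that share an A(i-1) block and whose parents fall into the same set of A(i-1) blocks. This is a naive from-scratch construction for illustration only, not a maintenance algorithm. The function name a_k_partitions and the example graph are assumptions.

```python
from collections import defaultdict

def a_k_partitions(labels, edges, k):
    """Return the partitions [A(0), A(1), ..., A(k)] as sets of frozensets."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    levels = [dict(labels)]  # A(0): block id of a dnode is its label
    for _ in range(k):
        prev = levels[-1]
        # A(i) block id: own A(i-1) block plus the A(i-1) blocks of the parents.
        levels.append({v: (prev[v], frozenset(prev[u] for u in parents[v]))
                       for v in labels})
    result = []
    for level in levels:
        groups = defaultdict(set)
        for v, b in level.items():
            groups[b].add(v)
        result.append({frozenset(g) for g in groups.values()})
    return result

# Assumed data graph (labels: C1..C3 are "C", D1..D3 are "D").
labels = {"R": "R", "A": "A", "B": "B",
          "C1": "C", "C2": "C", "C3": "C",
          "D1": "D", "D2": "D", "D3": "D"}
edges = [("R", "A"), ("R", "B"), ("A", "C1"), ("A", "C2"), ("A", "C3"),
         ("B", "C3"), ("C1", "D1"), ("C2", "D2"), ("C3", "D3")]

a0, a1, a2, a3 = a_k_partitions(labels, edges, 3)
print([len(p) for p in (a0, a1, a2, a3)])  # [5, 6, 7, 7]
```

In this graph the C nodes split at level 1 and the D nodes only at level 2, because each level depends on the blocks of the previous one. From level 2 on nothing changes.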
A(k)-index: Example
[Figure: a data graph with nodes R, A, B, C1 to C6 and its A(0)-, A(1)-, and A(2)-indexes; here the A(2)-index equals the 1-index]
Maintenance of the A(i)-index requires the information in the A(i-1)-index
A(k)-index: Refinement Tree
[Figure: the same data graph with its A(0)-, A(1)-, and A(2)-indexes (A(2) = 1-index), arranged as levels of a refinement tree]
A(k)-index: Refinement Tree
[Figure: the refinement tree for the same data graph, with levels A(0), A(1), and A(2)]
1. Reduce storage cost
2. Reduce maintenance cost
0.5% to 13% additional storage
Quality Guarantee
Theorem: The split/merge algorithm always maintains the minimum A(k)-index
Lemma: There is a unique minimal A(k)-index for any data graph, acyclic or cyclic

            1-index     A(k)-index
Acyclic     minimum     minimum
Cyclic      minimal     minimum
Outline
[Figure: the example data graph, repeated as the talk outline]
Experiments on Edge Changes

Datasets
    Real-life: IMDB (272,000 nodes)
    Benchmark: XMark (198,000 nodes)
Setup
    First delete a portion of the existing ID-REF links
    Then perform random mixed insertions and deletions
Compare with
    1-index: propagate (+ reconstruction)
    A(k)-index: recompute affected portion (+ reconstruction)
Experiment Results: 1-index
Experiment Results: A(k)-index
k    speedup
2    1.35
3    6.15
4    16.6
5    15.3
[Figure: running times]
Conclusions

The first solutions for the maintenance (edge and subgraph additions/deletions) of the 1-index and A(k)-index that are both effective and efficient
    Effective: quality guarantee on the resulting index
    Efficient: the algorithms themselves are fast
Thank you!
Graphical Illustration
[Figure: diagram with the labels “size”, “index”, “valid 1-index”, “merge”, and “split”]
The index can only grow in size due to splitting if merging is not enforced