A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes
Manolis Terrovitis (NTUA)
Spyros Passas (NTUA)
Panos Vassiliadis (UoI)
Timos Sellis (NTUA)
Problem
We are interested in low cardinality set-values
– Retail store transaction logs
– Web logs
– Biomedical databases etc.
We address the efficient evaluation of containment queries
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index
Terrovitis et al., CIKM '06
Outline
Problem definition
The HTI index
Query evaluation
Experiments
Conclusions
Data and queries
tid  products           tid  products
1    {f,a}              9    {a,e}
2    {a,d,c}            10   {g,c,a}
3    {c,b,a}            11   {b,a,e}
4    {f,a,c}            12   {b,d,c}
5    {c,g}              13   {c,f,a,d,b}
6    {a,b,g,c,d,e}      14   {b,d}
7    {a,d,b}            15   {e}
8    {a,e,b}            16   {b,f,a}
Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
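The three query types can be checked against the sample data with a brute-force scan. This is only an illustrative sketch, not the method the talk proposes; the function names are made up for the example.

```python
# Sample database from the slide: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(db, qs):
    """Transactions that contain every item of the query set qs."""
    return sorted(tid for tid, items in db.items() if qs <= items)


def equality_query(db, qs):
    """Transactions whose item set is exactly qs."""
    return sorted(tid for tid, items in db.items() if items == qs)


def superset_query(db, qs):
    """Transactions that contain only items drawn from qs."""
    return sorted(tid for tid, items in db.items() if items <= qs)


qs = {"a", "b", "d"}
print(subset_query(DB, qs))    # [6, 7, 13]
print(equality_query(DB, qs))  # [7]
print(superset_query(DB, qs))  # [7, 14]
```

A full scan touches every record, which is what an index is meant to avoid.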
Data and queries
Traditional methods
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality-superset)
The HTI index
Background – The inverted file
Vocabulary (I)   Inverted (postings) lists
a                1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16
b                3, 6, 7, 8, 11, 12, 13, 14, 16
c                2, 3, 4, 5, 6, 10, 12, 13
d                2, 6, 7, 12, 13, 14
e                6, 8, 9, 11, 15
f                1, 4, 13, 16
g                5, 6, 10
Database transactions (D)
14   b, d
16   b, f, a
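A minimal sketch of how such an inverted file could be built from the sample transactions. It assumes an in-memory dictionary, and `build_inverted_file` is a hypothetical name; a real inverted file would keep the lists on disk.

```python
from collections import defaultdict

# Sample database from the talk: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(db):
    """Map every vocabulary item to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):          # ascending tids keep each list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)


inverted = build_inverted_file(DB)
for item in sorted(inverted):
    print(item, inverted[item])
```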
HTI index
Inverted files - problems
The evaluation of containment queries relies on merge-joining the inverted lists
The inverted lists become very long
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed
This is often the case in the real world!
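To make the cost concrete, here is a sketch of a subset query answered by merge-joining sorted inverted lists. The lists are derived from the sample transactions; `merge_join` and `subset_query` are illustrative names, and starting with the shortest list is a common heuristic assumed here, not something the slides state.

```python
def merge_join(left, right):
    """Intersect two sorted tid lists in a single linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Inverted lists for three items of the running example.
INVERTED = {
    "a": [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    "b": [3, 6, 7, 8, 11, 12, 13, 14, 16],
    "d": [2, 6, 7, 12, 13, 14],
}


def subset_query(inverted, items):
    """Tids containing all query items; every involved list is read in full."""
    lists = sorted((inverted[item] for item in items), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = merge_join(result, postings)
    return result


print(subset_query(INVERTED, ["a", "b", "d"]))  # [6, 7, 13]
```

Each pass is linear in the lengths of its inputs, so long lists for frequent items dominate the cost.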
HTI index
Solution?
We need to break up the lists!
But how?
– Let's make a list for every combination of items!
HTI index
Solution?
We assume a total order based on the frequency of appearance for the items of the database
We order the items in each set-value and we transform it to a sequence
We create a path in the access tree for each sequence
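The three steps above can be sketched as follows. This is a rough illustration under assumptions: ties in frequency are broken alphabetically (the slides do not say how), the tree is a plain nested dictionary, and all function names are hypothetical.

```python
from collections import Counter

# Sample database from the talk: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(db):
    """Total order on items: most frequent first (ties alphabetical, an assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    return sorted(counts, key=lambda item: (-counts[item], item))


def to_sequence(items, order):
    """Turn a set-value into a sequence that follows the total order."""
    rank = {item: pos for pos, item in enumerate(order)}
    return sorted(items, key=rank.__getitem__)


def build_access_tree(db, order):
    """Insert one root-to-node path per ordered transaction (nested dicts)."""
    root = {}
    for tid in sorted(db):
        node = root
        for item in to_sequence(db[tid], order):
            node = node.setdefault(item, {})
    return root


order = frequency_order(DB)
tree = build_access_tree(DB, order)
print(order)                       # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(to_sequence(DB[13], order))  # ['a', 'b', 'c', 'd', 'f']
print(sorted(tree))                # ['a', 'b', 'c', 'e']
```

In this data set the frequency order happens to coincide with alphabetical order.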
HTI index
All combinations?
Ordered Transactions
1  {a,f}              9   {a,e}
2  {a,c,d}            10  {a,c,g}
3  {a,b,c}            11  {a,b,e}
4  {a,c,f}            12  {b,c,d}
5  {c,g}              13  {a,b,c,d,f}
6  {a,b,c,d,e,g}      14  {b,d}
7  {a,b,d}            15  {e}
8  {a,b,e}            16  {a,b,f}
[Figure: access tree rooted at Null with one path per ordered transaction; each node carries the tids of the transactions that pass through it, e.g. a: 1,2,3,4,6,7,8,9,10,11,13,16; a→b: 3,6,7,8,11,13,16; a→c: 2,4,10; b: 12,14; c: 5; e: 15]
HTI index
All combinations? Maybe not…
[Figure: the full access tree over all items of the ordered transactions, shown again]
HTI index
An access tree for the frequent items
Null
  a      tids: 1,2,3,4,6,7,8,9,10,11,13,16
    b    tids: 3,6,7,8,11,13,16
      c  tids: 3,6,13
    c    tids: 2,4,10
  b      tids: 12,14
    c    tids: 12
  c      tids: 5
The HTI index
Vocabulary: a, b, c, d, e, f, g
Access Tree (items a, b, c; each node points to its sublist)
Null
  a      1,2,3,4,6,7,8,9,10,11,13,16
    b    3,6,7,8,11,13,16
      c  3,6,13
    c    2,4,10
  b      12,14
    c    12
  c      5
Inverted Lists (items d, e, f, g)
d  2, 6, 7, 12, 13, 14
e  6, 8, 9, 11, 15
f  1, 4, 13, 16
g  5, 6, 10
HTI index
The basic points
The access tree is used only for the most frequent items
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
We keep the access tree in main memory
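A minimal sketch of how the two parts could be built together for the running example. It assumes the number of frequent items is given directly through a hypothetical `num_frequent` parameter, whereas the experiments in the talk use a percentage threshold. The access tree is represented as a dictionary from paths to sublists, purely for brevity.

```python
from collections import Counter, defaultdict

# Sample database from the talk: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_hti(db, num_frequent):
    """Return (frequent items, access-tree sublists, inverted lists of the rest)."""
    counts = Counter(item for items in db.values() for item in items)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    frequent = order[:num_frequent]
    rank = {item: pos for pos, item in enumerate(frequent)}

    tree = defaultdict(list)      # path of frequent items -> sublist of tids
    inverted = defaultdict(list)  # non-frequent item -> ordinary inverted list
    for tid in sorted(db):
        path = tuple(sorted((i for i in db[tid] if i in rank),
                            key=rank.__getitem__))
        # The tid is recorded at every node along its path.
        for depth in range(1, len(path) + 1):
            tree[path[:depth]].append(tid)
        for item in db[tid]:
            if item not in rank:
                inverted[item].append(tid)
    return frequent, dict(tree), dict(inverted)


frequent, tree, inverted = build_hti(DB, num_frequent=3)
print(frequent)                # ['a', 'b', 'c']
print(tree[("a", "b", "c")])   # [3, 6, 13]
print(tree[("b",)])            # [12, 14]
print(inverted["d"])           # [2, 6, 7, 12, 13, 14]
```

With three frequent items this reproduces the sublists shown for the example tree.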
Query Evaluation
Basic Steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists which might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items
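One possible reading of these steps for a subset query, sketched over the example index. The sublists and inverted lists are copied from the running example. The node selection rule (nodes labelled with the last frequent query item whose path contains all frequent query items) is an interpretation, not taken verbatim from the slides, and a set intersection stands in for the merge-join.

```python
FREQUENT = ["a", "b", "c"]  # items kept in the access tree, in frequency order

# Access tree as path -> sublist of tids (from the running example).
TREE = {
    ("a",): [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    ("a", "b"): [3, 6, 7, 8, 11, 13, 16],
    ("a", "b", "c"): [3, 6, 13],
    ("a", "c"): [2, 4, 10],
    ("b",): [12, 14],
    ("b", "c"): [12],
    ("c",): [5],
}

# Ordinary inverted lists for the non-frequent items.
INVERTED = {
    "d": [2, 6, 7, 12, 13, 14],
    "e": [6, 8, 9, 11, 15],
    "f": [1, 4, 13, 16],
    "g": [5, 6, 10],
}


def subset_query(qs):
    """Tids of transactions that contain every item of qs."""
    # Step 1: split the query into frequent and non-frequent items.
    freq = [item for item in FREQUENT if item in qs]
    rare = sorted(item for item in qs if item not in FREQUENT)

    # Step 2: collect the sublists of the qualifying tree nodes.
    candidates = None
    if freq:
        last = freq[-1]
        candidates = sorted(
            tid
            for path, tids in TREE.items()
            if path[-1] == last and set(freq) <= set(path)
            for tid in tids
        )

    # Step 3: combine with the inverted lists of the non-frequent items.
    for item in rare:
        postings = INVERTED[item]
        if candidates is None:
            candidates = list(postings)
        else:
            candidates = sorted(set(candidates) & set(postings))
    return candidates or []


print(subset_query({"b", "c", "d"}))  # [6, 12, 13]
```

For the query (‘b’, ‘c’, ‘d’) only the sublists 3,6,13 and 12 are read from the tree before joining with the list of ‘d’, instead of the full lists of ‘b’ and ‘c’.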
Subset - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example: vocabulary, access tree over a, b, c with its sublists, and inverted lists for d, e, f, g]
Equality - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example: vocabulary, access tree over a, b, c with its sublists, and inverted lists for d, e, f, g]
Superset - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example: vocabulary, access tree over a, b, c with its sublists, and inverted lists for d, e, f, g]
Experiments
Setup
Real Data from UCI
– web log from microsoft.com [320k records, 294 items]
– web log from msnbc.com [1M records, 17 items]
Synthetic data
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
Experiments
Query performance – DB size
[Figure: synthetic data - DB size. Series: IF, HTI-0.5%, HTI-1%, HTI-3%. x-axis: 1000's of records (0 to 1000); y-axis: disk page accesses (0 to 3000)]
Experiments
Query performance – query length
[Figure: synthetic data - query length. Series: IF, HTI-0.5%, HTI-1%, HTI-3%. x-axis: query length (2 to 22); y-axis: disk page accesses (0 to 2500)]
Experiments
Query performance – query length
[Figure: real data - subset. Series: IF, HTI-5%, HTI-20%, HTI-40%. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400)]
Experiments
Query performance – query length
[Figure: real data - equality. Series: IF, HTI-5%, HTI-20%, HTI-40%. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400)]
Experiments
Query performance – query length
[Figure: real data - superset. Series: IF, HTI-5%, HTI-20%, HTI-40%. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 1000)]
Experiments
Access tree size – DB size
[Figure: Effect of the DB size. Series: HTI-0.5%, HTI-1%, HTI-3%. x-axis: 1000's of records (0 to 1000); y-axis: 1000's of tree nodes (0 to 2500)]
Experiments
Access tree size – DB size
[Figure: Effect of the DB size. x-axis: millions of records (0 to 30); y-axis: 1000's of tree nodes (0 to 1800)]
Experiments
The HTI scales much better than the inverted file as the query size and the database size grow
A small threshold is enough for a performance gain of over an order of magnitude
The main memory requirements do not exceed 0.5M for the real data.
Conclusions
The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
It has moderate memory requirements that can be adjusted by using the right threshold
The End
Thank You!
Experiments
Vocabulary size
[Figure: Effect of the vocabulary size. Series: HTI-0.5%, HTI-1%, HTI-3%. x-axis: vocabulary size in 1000's of items (1 to 9); y-axis: 1000's of tree nodes (0 to 1600)]
Experiments
Threshold choice
[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: 1000's of tree nodes (0 to 1400)]
Experiments
Threshold choice
[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: Avg of disk page accesses (0 to 300)]