TopX: Efficient & Versatile Top-k Query Processing for Text, Semistructured & Structured Data
Martin Theobald
Max-Planck-Institut Informatik
Stanford University
//article[.//bib[about(.//item, “W3C”)]]
//sec[about(.//, “XML retrieval”)]
//par[about(.//, “native XML databases”)]
[Figure: two example XML article trees (article, title, sec, par, bib, item, inproc, and url elements with their text contents) matched against the query above, annotated with the three challenges RANKING, VAGUENESS, and PRUNING]
[Figure: TopX architecture]
Frontends
• Web Interface
• Web Service
• API
TopX Query Processor (query processing time)
• Probabilistic Candidate Pruning
• Probabilistic Index Access Scheduling
• Dynamic Query Expansion with a Thesaurus (WordNet, OpenCyc, etc.)
• Incremental XPath Engine
• Auxiliary Predicates
• Top-k Queue, Candidate Queue, Candidate Cache
• Scan Threads: Sequential Access (SA)
• Random Access (RA)
Index (indexing time)
• Indexer / Crawler
• Unified Text & XML Schema
• DBMS / Inverted Lists
• Index Metadata: Selectivities, Histograms, Correlations
Data Model
• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Element   pre   post   Full-content text
article   1     6      “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
title     2     1      “xml data manage”
abs       3     2      “xml manage system vary wide expressive power”
sec       4     5      “native xml data base native xml data base system store schemaless data”
title     5     3      “native xml data base”
par       6     4      “native xml data base system store schemaless data”

Example: ftf(“xml”, article1) = 4
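The pre-/postorder labels and the full-content term frequency from the example can be reproduced with a short traversal. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper names `label_elements` and `ftf` are hypothetical, and the tokenizer is a plain lowercase word split rather than the stemming pipeline implied by the slide.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
<title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec>
<title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par>
</sec>
</article>"""

def label_elements(root):
    """Return one row per element with its tag, pre/post labels and full-content tokens."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1              # preorder rank: assigned on entry
        row = {"tag": elem.tag, "pre": counters["pre"]}
        rows.append(row)
        for child in elem:
            visit(child)
        counters["post"] += 1             # postorder rank: assigned on exit
        row["post"] = counters["post"]
        # Full content = text of the element and of all its descendants.
        text = " ".join(elem.itertext()).lower()
        row["tokens"] = re.findall(r"[a-z0-9]+", text)

    visit(root)
    return rows

def ftf(term, row):
    """Full-content term frequency of `term` in one element row."""
    return row["tokens"].count(term)

rows = label_elements(ET.fromstring(DOC))
for row in rows:
    print(row["tag"], row["pre"], row["post"], ftf("xml", row))
```

An element a is an ancestor of an element b exactly when pre(a) < pre(b) and post(a) > post(b), which is what makes these labels convenient for structural joins.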
Scoring Model
[INEX ’06/’07]
• XML-specific extension to Okapi BM25 (originating from probabilistic text IR)
• ftf (full-content term frequency) instead of tf
• ef (element frequency) instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b
Example: bib[“transactions”] vs. par[“transactions”]
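To make the substitutions concrete, here is a minimal sketch of a BM25-style score for one tag-term pair. The slide does not spell out the formula, so this uses the standard Okapi BM25 shape with ftf in place of tf, ef in place of df, and per-tag element counts and average lengths. The function name, the default values of k1 and b, and all statistics in the example are assumptions chosen for illustration.

```python
import math

def bm25_element_score(ftf, ef, n_elements, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of a term for one element of a given tag (sketch).

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements with this tag that contain the term
    n_elements -- total number of elements with this tag
    length     -- full-content length of the element
    avg_length -- average full-content length over elements with this tag
    """
    # Length normalization is relative to elements of the same tag.
    k = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (k + ftf)
    idf_part = math.log((n_elements - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# Hypothetical statistics: "transactions" is common in bib elements, rare in par elements.
bib = bm25_element_score(ftf=2, ef=400, n_elements=1000, length=50, avg_length=50)
par = bm25_element_score(ftf=2, ef=500, n_elements=50000, length=50, avg_length=50)
print(round(bib, 3), round(par, 3))
```

With these made-up numbers the same ftf yields a much higher score in par than in bib, which is the point of keeping statistics per element type.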
TopX Query Processing
[VLDB ’05]
//sec[about(.//, “XML”) and about(.//title, “native”)]
//par[about(.//, “retrieval”)]
[Figure: three score-sorted inverted lists for sec[“xml”], title[“native”], and par[“retrieval”], each with columns eid, docid, score, pre, post, scanned by sequential accesses; an animation of the Top-2 queue and the Candidate Queue shows the worst, max-q, and min-2 values changing as the scans proceed]
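The bookkeeping behind the worst, max-q, and min-k values can be sketched with a plain NRA-style top-k loop over score-sorted lists. This is a simplified illustration, not TopX itself: it works on flat items rather than XML elements with pre/post checks, uses round-robin sorted access only, and the function name and example lists are hypothetical.

```python
def nra_top_k(lists, k):
    """NRA-style top-k over score-descending lists of (item, score); assumes k >= 1.

    Returns the top-k items with their worst scores and the number of
    sorted accesses performed.
    """
    m = len(lists)
    pos = [0] * m
    # high[i]: upper bound on any score not yet read from list i.
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    seen = {}                      # item -> {list index: score}
    sa_count = 0
    while True:
        progressed = False
        for i, lst in enumerate(lists):          # one round of sorted accesses
            if pos[i] < len(lst):
                item, score = lst[pos[i]]
                pos[i] += 1
                sa_count += 1
                seen.setdefault(item, {})[i] = score
                high[i] = score
                progressed = True
            if pos[i] >= len(lst):
                high[i] = 0.0                    # exhausted: nothing left to gain
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
        min_k = worst[top[-1]] if len(top) == k else 0.0
        # Candidates that could still overtake the current top-k.
        queue = [d for d in seen if d not in top and best[d] > min_k]
        unseen_bound = sum(high)
        if not progressed or (not queue and unseen_bound <= min_k):
            return [(d, worst[d]) for d in top], sa_count

LISTS = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)],
    [("b", 0.9), ("a", 0.7), ("d", 0.2), ("c", 0.1)],
    [("a", 0.8), ("c", 0.6), ("b", 0.5), ("d", 0.1)],
]
top, accesses = nra_top_k(LISTS, k=2)
print(top, accesses)
```

The loop stops as soon as no queued candidate and no unseen item can exceed the current min-k threshold, so it reads only part of each list.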
Index Access Scheduling
[VLDB ’06]
• SA Scheduling
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of score reduction
• RA Scheduling
  • 2-phase probing: schedule RAs “late & last”
  • Extended probabilistic cost model for integrating SA & RA scheduling
[Figure: inverted block index with score-sorted lists scanned by SA and probed by RA; example look-ahead values Δ1,3 = 0.8 and Δ3,3 = 0.2]
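One way to read the knapsack idea is as a small allocation problem: given a budget of block reads for the next batch, distribute them over the lists so that the total expected score reduction is maximal. The sketch below solves that with a multiple-choice knapsack dynamic program. The exact objective used in the [VLDB ’06] work is not given on the slide, so the formulation, the function name `schedule_sa`, and every Δ value other than Δ1,3 = 0.8 and Δ3,3 = 0.2 are assumptions.

```python
def schedule_sa(reductions, budget):
    """Choose how many blocks to read from each list (sketch).

    reductions[i][x] is the look-ahead score reduction on list i after
    reading x more blocks (x = 0, 1, 2, ...), e.g. taken from histograms.
    Returns (total reduction, blocks per list) using at most `budget` blocks.
    """
    # best[b] = (total reduction, allocation) for exactly b blocks so far.
    best = {0: (0.0, [])}
    for deltas in reductions:
        nxt = {}
        for used, (total, alloc) in best.items():
            for x, delta in enumerate(deltas):
                b = used + x
                if b > budget:
                    break
                cand = (total + delta, alloc + [x])
                if b not in nxt or cand[0] > nxt[b][0]:
                    nxt[b] = cand
        best = nxt
    return max(best.values(), key=lambda entry: entry[0])

# Rows are lists 1..3; column x is the reduction after x blocks.
# Only the 0.8 (list 1, x=3) and 0.2 (list 3, x=3) entries come from the slide.
DELTAS = [
    [0.0, 0.1, 0.3, 0.8],
    [0.0, 0.1, 0.15, 0.2],
    [0.0, 0.05, 0.1, 0.2],
]
total, blocks = schedule_sa(DELTAS, budget=3)
print(total, blocks)
```

With a budget of three blocks, all of them go to list 1, because its scores drop fastest there; a round-robin scan would have spent two of the three reads on lists that barely lower the bounds.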
Probabilistic Pruning
[VLDB ’04]
• Convolutions of score distributions (assuming independence)
• Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε
• With probabilistic guarantees for precision & recall
[Figure: index lists title[“native”] and par[“retrieval”] with their score histograms f1 and f2 (bounded by high1 and high2), sampling, and the threshold δ(d) used for P[d gets in the final top-k]; labels distinguish indexing time from query processing time]
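The pruning test can be illustrated with discrete score histograms. The sketch below convolves the histograms of the lists in which a candidate d has not been seen yet (assuming independence, as the slide states) and estimates the probability that the missing scores add up to more than δ(d). Here δ(d) is assumed to be the gap between the current min-k threshold and the worst score of d; the slide uses the symbol without defining it. The function names and histogram values are hypothetical.

```python
def convolve(hist_a, hist_b):
    """Distribution of the sum of two independent discrete score variables."""
    out = [0.0] * (len(hist_a) + len(hist_b) - 1)
    for i, pa in enumerate(hist_a):
        for j, pb in enumerate(hist_b):
            out[i + j] += pa * pb
    return out

def prob_exceeds(hists, delta, bucket_width):
    """P[sum of remaining scores > delta]; bucket i stands for score i * bucket_width."""
    dist = [1.0]
    for hist in hists:
        dist = convolve(dist, hist)
    return sum(p for i, p in enumerate(dist) if i * bucket_width > delta)

def should_prune(hists, delta, bucket_width, epsilon):
    """Drop the candidate if it is unlikely to still reach the top-k."""
    return prob_exceeds(hists, delta, bucket_width) < epsilon

# Hypothetical histograms over the scores 0.0, 0.5, 1.0 for two unseen lists.
F1 = [0.5, 0.3, 0.2]
F2 = [0.6, 0.3, 0.1]
print(prob_exceeds([F1, F2], delta=1.0, bucket_width=0.5))
print(should_prune([F1, F2], delta=1.5, bucket_width=0.5, epsilon=0.1))
```

Setting ε = 0 turns the test into the conservative check of exact top-k processing: a candidate is only dropped when it has no chance at all.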
Dynamic Query Expansion
[SIGIR ’05]
• Incrementally merge inverted lists for expansion terms ti,1 ... ti,m in descending order of s(tij, d)
• Best-match score aggregation
• Specialized expansion operators
  • Incremental Merge operator
  • Nested Top-k operator (efficient phrase matching)
  • Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML
[Figure: Top-k operator over (transport, tunnel, ~disaster) for TREC Robust Topic #363; the lists for transport (d66 d93 d95 d101 ...) and tunnel (d95 d17 d11 d99 ...) are scanned by SA, while ~disaster is served by an Incremental Merge over the inverted lists of disaster, accident, and fire]
Incremental Merge Operator
• Expansion terms ~t = { t1, t2, t3 } (thesaurus lookups / relevance feedback)
• Expansion similarities (large corpus term correlations)
• Initial high-scores (index list metadata, e.g., histograms)

Term   sim(t, ti)   Inverted list (doc: score)
t1     1.0          d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.4, d88: 0.3, ...
t2     0.9          d64: 0.8, d23: 0.8, d10: 0.7, d12: 0.2, d78: 0.1, ...
t3     0.5          d11: 0.9, d78: 0.9, d64: 0.7, d99: 0.7, d34: 0.6, ...

Merged list for ~t (scanned by SA, with meta histograms):
d78: 0.9, d23: 0.8, d10: 0.8, d64: 0.72, d23: 0.72, d10: 0.63, d11: 0.45, d78: 0.45, d1: 0.4, d88: 0.3, ...

• Seamlessly integrates Incremental Merge into probabilistic scheduling and candidate pruning
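The merge itself can be sketched as a lazy k-way merge that weights each list by its expansion similarity and always emits the next-highest weighted score. The lists and similarities below are the ones from the slide; the function name and the heap-based implementation are assumptions, and the sketch leaves out the histogram metadata and the best-match aggregation per document.

```python
import heapq
from itertools import islice

def incremental_merge(lists, sims):
    """Lazily yield (doc, sim * score) in descending order of weighted score.

    lists -- one score-descending list of (doc, score) per expansion term
    sims  -- similarity of each expansion term to the original term
    """
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            doc, score = lst[0]
            # Negated score turns Python's min-heap into a max-heap;
            # the list index breaks ties deterministically.
            heap.append((-sims[i] * score, i, 0, doc))
    heapq.heapify(heap)
    while heap:
        neg_score, i, p, doc = heapq.heappop(heap)
        yield doc, -neg_score
        if p + 1 < len(lists[i]):          # advance only the list just consumed
            nxt_doc, nxt_score = lists[i][p + 1]
            heapq.heappush(heap, (-sims[i] * nxt_score, i, p + 1, nxt_doc))

T1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
T2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
T3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
SIMS = [1.0, 0.9, 0.5]

merged = list(islice(incremental_merge([T1, T2, T3], SIMS), 9))
for doc, score in merged:
    print(doc, round(score, 2))
```

Because the generator only touches a list when its head is the current maximum, expansion terms with low similarity are opened late or not at all, which is where the savings over a static expansion come from.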
Some Experiments
• New XML-ified Wikipedia corpus (INEX 2006)
• 660,000 documents w/ 130,000,000 elements
• 125 INEX queries, each as content-only (CO) and content-and-structure (CAS) formulation
• CO: +“state machine” figure Mealy Moore
• CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
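As a small worked example of the cost metric, the sketch below counts one unit per sorted access and cR/cS units per random access. The function name, the access counts, and the cost ratio of 100 are hypothetical values chosen for illustration and are not taken from the experiments.

```python
def access_cost(num_sa, num_ra, cr_over_cs):
    """Cost = #SA + (cR/cS) * #RA, in units of one sorted access."""
    return num_sa + cr_over_cs * num_ra

# Hypothetical run: 1,000 sorted accesses, 5 random accesses, cR/cS = 100.
cost = access_cost(num_sa=1000, num_ra=5, cr_over_cs=100)
print(cost)
```

The metric makes explicit why random accesses are scheduled sparingly: with a large cR/cS ratio, a handful of them can outweigh a long sequential scan.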
TopX vs. Full-Merge
[Figure: cost in millions (0 to 40) as a function of k (10, 20, 50, 100, 500, 1,000) for six series: CAS - Full Merge, CO - Full Merge, CAS - TopX - ε=0.0, CO - TopX - ε=0.0, CAS - TopX - ε=0.1, CO - TopX - ε=0.1]
• Significant cost savings for large ranges of k
• CAS cheaper than CO!
Efficiency vs. Effectiveness
[Figure: relative precision and relative cost (0.0 to 1.0) as a function of ε (0.0 to 1.0) for four series: CAS - Rel. Precision, CO - Rel. Precision, CAS - Rel. Cost, CO - Rel. Cost]
• Very good precision/runtime ratio for probabilistic pruning
Static vs. Dynamic Expansions
[Figure: # SA and # RA in millions (0 to 120) for CAS - Full Merge, CAS - TopX - Static, and CAS - TopX - Dynamic]
• Query expansions with up to m=292 keywords & phrases
• Balanced amount of sorted vs. random disk access
• Adaptive scheduling wrt. cR/cS cost ratio
• Dynamic expansions superior to static expansions & full-merge in both efficiency & effectiveness
Thanks…
Gerhard Weikum
Ralf Schenkel
Norbert Fuhr, Michalis Vazirgiannis
Holger Bast, Debapriyo Majumdar
All the MPI & INEX folks
topx.sourceforge.net
See our SIGMOD ’07 demo!