
Integrating DB and IR Technologies:
What is the Sound of
One Hand Clapping?
Surajit Chaudhuri (Microsoft Research)
Raghu Ramakrishnan (U Wisconsin, ex QUIQ)
Gerhard Weikum (Max-Planck Institute for Computer Science)
Warning: Non-technical Content!
To Be Taken with a Grain of SALT.
DB and IR: Two Parallel Universes

                        Database Systems           Information Retrieval
canonical application:  accounting                 libraries
data type:              numbers, short strings     text
foundation:             algebraic / logic based    probabilistic / statistics based
search paradigm:        Boolean retrieval          ranked retrieval
                        (exact queries,            (vague queries,
                        result sets/bags)          result lists)

Parallel universes forever?
Take-home Message
or Food for Disagreement
Claim 1:
DB&IR applications require and justify a new platform / kernel system
with an appropriately designed API for a
Scoring Algebra for Lists and Text (SALT)
Claim 2:
One key challenge lies in reconciling
flexible scoring with query optimizability
Outline
• Top-down Motivation: DB&IR Applications
• Bottom-up Motivation: Algorithms & Tricks
• Towards SALT:
Scoring Algebra(s) for Lists and Text
• Key Problem: Query Optimization
Top-down Motivation: Applications (1): Customer Support

Typical data:
• Customers (CId, Name, Address, Area, Category, Priority, ...)
• Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
• Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)

Why customizable scoring?
• wealth of different apps within this app class
• different customer classes
• adjustment to evolving business needs
• scoring on text + structured data
  (weighted sums, language models, skyline, w/ correlations, etc.)

Typical queries (premium customer from Germany):
"A notebook, model ..., configured with ..., has a problem with the driver of
its Wave-LAN card. I already tried the fix ..., but received error message ..."
→ request classification & routing
→ find similar requests

Platform desiderata (from the app developer's viewpoint):
• Flexible ranking and scoring on text, categorical, and numerical attributes
• Incorporation of dimension hierarchies for products, locations, etc.
• Efficient execution of complex queries over text and data attributes
• Support for high update rates concurrently with high query load
Top-down Motivation: Applications (2)
More application classes:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer relationship management in banks, insurance companies, telecoms, etc.
• Bulletin boards for social communities
• P2P personalized & collaborative Web search
etc. etc.
Top-down Motivation: Applications (3)

Next wave, Text2Data:
use Information-Extraction technology (regular expressions, HMMs, lexicons,
other NLP and ML techniques) to convert text docs into relational facts,
moving up in the value chain.

Example:
"The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7,
and is organized by D.J. DeWitt, Mike Stonebraker, ..."

Conference:
Name  Year  Location  Date      Prob
CIDR  2005  Asilomar  05/01/04  0.95

ConfOrganization:
Name  Year  Chair  Prob
CIDR  2005  P68    0.9
CIDR  2005  P35    0.75

People:
Id   Name
P35  Michael Stonebraker
P68  David J. DeWitt

• facts now have confidence scores
• queries involve probabilistic inferences and result ranking
• relevant for "business intelligence"
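A query over these extracted tables already involves probabilistic inference. A minimal sketch (Python; hypothetical in-memory rows mirroring the tables above, assuming extraction confidences are independent so that they simply multiply):

    # Rank "who organized CIDR 2005, and where?" by derived confidence.
    # Independence of the extracted facts is an assumption of this sketch.
    conference = [("CIDR", 2005, "Asilomar", "05/01/04", 0.95)]
    conf_org = [("CIDR", 2005, "P68", 0.9), ("CIDR", 2005, "P35", 0.75)]
    people = {"P35": "Michael Stonebraker", "P68": "David J. DeWitt"}

    answers = []
    for name, year, loc, date, p_conf in conference:
        for oname, oyear, chair, p_org in conf_org:
            if (oname, oyear) == (name, year):        # join on (Name, Year)
                answers.append((people[chair], loc, p_conf * p_org))

    for person, loc, prob in sorted(answers, key=lambda a: -a[2]):  # result ranking
        print(f"{person} organized in {loc} (confidence {prob:.3f})")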
Top-down Motivation: Applications (4)
Essential requirements for DB&IR platform:
1) Customizable scoring and ranking
2) Composite queries incl. joins, filters & top-k
3) Optimizability of query expressions
4) Metadata and ontologies
5) Simple, sufficiently expressive data model (XML light)
6) Data preparation (entity recognition, entity resolution, etc.)
7) Personalization (profile learning)
8) Usage patterns (query logs, click streams, etc.)
Requirements 1, 2, and 3 most strongly affect the platform architecture and API.
Bottom-up Motivation: Algorithms & Tricks

B+ tree on terms, categories, values, ...

Index lists with (ID, s = tf*idf) entries, sorted by ID:
t1: 17: 0.3   44: 0.4   51: 0.6   52: 0.1   53: 0.8   ...
t2: 11: 0.4   12: 0.5   28: 0.1   44: 0.2   51: 0.6   52: 0.3   ...
t3: 11: 0.6   17: 0.1   52: 0.7   ...

Vanilla algorithm "join & sort" for query q: t1 t2 t3:
top-k (
  σ[term=t1](index) ⋈_ID σ[term=t2](index) ⋈_ID σ[term=t3](index)
  order by sum(s) desc)

Google: > 10 million terms, > 8 billion docs, > 4 TB index

Good search engines use a variety of heuristics and tricks for shortcutting:
• keeping short lists of the best docs per term in memory
• global statistics for index-list selection
• early pruning of result candidates
• bounded priority queue of candidates
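For concreteness, a minimal sketch of the vanilla "join & sort" plan (Python; the toy inverted index mirrors the lists above and is an illustrative stand-in, not a real engine):

    from collections import defaultdict

    # toy inverted index: term -> list of (docID, s) pairs, s = tf*idf
    index = {
        "t1": [(17, 0.3), (44, 0.4), (51, 0.6), (52, 0.1), (53, 0.8)],
        "t2": [(11, 0.4), (12, 0.5), (28, 0.1), (44, 0.2), (51, 0.6), (52, 0.3)],
        "t3": [(11, 0.6), (17, 0.1), (52, 0.7)],
    }

    def vanilla_top_k(terms, k):
        # "join": keep only docs that appear in every term's index list
        scores = defaultdict(dict)
        for t in terms:
            for doc, s in index[t]:
                scores[doc][t] = s
        joined = [(doc, sum(ts.values())) for doc, ts in scores.items()
                  if len(ts) == len(terms)]
        # "sort": order by aggregated score, cut off after k results
        return sorted(joined, key=lambda x: -x[1])[:k]

    print(vanilla_top_k(["t1", "t2", "t3"], k=2))   # -> [(52, 1.1)] (up to rounding)

The catch, and the motivation for the tricks above: the full lists are joined and sorted before the top-k cutoff is applied.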
Bottom-up Motivation: Algorithms & Tricks (cont'd)

TA: efficient & principled top-k query processing with monotonic score
aggregation (Fagin 01, Güntzer/Kießling/Balke 01)

TA with sorted access only (NRA):
scan index lists; consider d at position pos_i in list L_i:
  E(d) := E(d) ∪ {i};  high_i := s(t_i, d);
  worstscore(d) := aggr{s(t_ν, d) | ν ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;

• TA flavor w/ early termination is great
• implementation details are crucial
• DB&IR needs to combine it with filter, join, phrase matching, etc.
• unclear how to abstract TA and integrate it into relational algebra

Example: query q = (t1, t2, t3), k = 1,
over data items d1, ..., dn with per-term scores s(t1,d1) = 0.7, ..., s(tm,d1) = 0.2, ...

Index lists (sorted by descending score):
t1: d78: 0.9   d23: 0.8   d10: 0.8   d1: 0.7    d88: 0.2   ...
t2: d64: 0.8   d23: 0.6   d10: 0.6   d10: 0.2   d78: 0.1   ...
t3: d10: 0.7   d78: 0.5   d64: 0.4   d99: 0.2   d34: 0.1   ...

Scan depth 1:
Rank  Doc  Worstscore  Bestscore
1     d78  0.9         2.4
2     d64  0.8         2.4
3     d10  0.7         2.4

Scan depth 2:
Rank  Doc  Worstscore  Bestscore
1     d78  1.4         2.0
2     d23  1.4         1.9
3     d64  0.8         2.1
4     d10  0.7         2.1

Scan depth 3:
Rank  Doc  Worstscore  Bestscore
1     d10  2.1         2.1
2     d78  1.4         2.0
3     d23  1.4         1.8
4     d64  1.2         2.0
STOP! (threshold = max bestscore of candidates = 2.0 ≤ min-k = 2.1)
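A compact executable rendering of the NRA loop (Python; sum as the monotonic aggregation, toy lists from the example above; the bookkeeping structures are illustrative, not a proposed API):

    # NRA sketch: TA with sorted access only, sum aggregation
    lists = {
        "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
        "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d10", 0.2), ("d78", 0.1)],
        "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
    }

    def nra_top_k(lists, k):
        seen = {}                        # d -> {i: s(t_i, d)}, i.e. E(d) + partial scores
        high = {t: 1.0 for t in lists}   # high_i: score bound per list
        for depth in range(max(len(p) for p in lists.values())):
            for t, postings in lists.items():         # one sorted access per list
                if depth < len(postings):
                    doc, s = postings[depth]
                    seen.setdefault(doc, {})[t] = s
                    high[t] = s                        # last score read = upper bound
            worst = {d: sum(ts.values()) for d, ts in seen.items()}
            best = {d: worst[d] + sum(high[t] for t in lists if t not in seen[d])
                    for d in seen}
            ranked = sorted(seen, key=worst.get, reverse=True)
            if len(ranked) < k:
                continue
            min_k = worst[ranked[k - 1]]
            # best possible score of any doc outside the current top-k
            threshold = max([best[d] for d in ranked[k:]] + [sum(high.values())])
            if threshold <= min_k:
                break                                  # early termination
        return [(d, worst[d], best[d]) for d in ranked[:k]]

    print(nra_top_k(lists, k=1))   # -> [('d10', 2.1, 2.1)] (up to rounding), at scan depth 3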
SALT Algebra: Three Proposals

SALT = Scoring Algebra for Lists and Text

Goals:
• reconcile relational algebra with TA-flavor operators
• reconcile flexible scoring with query optimizability

Three proposals:
• speculative filters and stretchable operators
• operators with scoring modalities
• scoring operator Ξ

Related prior work:
probabilistic relations, approximate query processing,
query algebras on lists, SQL user-defined aggregation
Speculative Filters and Stretchable Operators
(SALT with SQL Flavor)

Rationale:
map ranked-retrieval queries to multidimensional SQL filters
such that they return approx. k results

Ex.: recent WLAN device driver problems on notebook T40 (with Debian):
σ[date > 11/30/04 ∧ class="/network/drivers"
  ∧ product="Thinkpad" ∧ software="Linux"] (Requests)

Techniques:
• ranking many answers → speculative filters:
  generate additional conjunctive conditions to approximate top-k
• finding enough answers → stretchable operators:
  relax (range or categorical) conditions to ensure at-least-k results

Proposal: choice of filters to return approx. k results, e.g.
σ~[k, date > 1/4/05 ∧ ?class="/network/drivers/wlan" ∧ product="T40"] (Requests)
(? marks a speculative filter condition)
generally: σ~[k], π~[k], ⋈~[k], ...

Properties and problems:
• similar to IR query expansion by (pseudo-)feedback, thesaurus, query log
+ can leverage multidimensional histograms
? composability of operators
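A toy rendering of a stretchable selection σ~[k] (Python; the relaxation strategy, widening the date range in fixed steps until at least k tuples qualify, is one hypothetical choice among many):

    from datetime import date, timedelta

    requests = [  # toy Requests(RId, Date, Product) sample
        (1, date(2005, 1, 5), "T40"), (2, date(2004, 12, 20), "T40"),
        (3, date(2004, 11, 10), "T40"), (4, date(2005, 1, 2), "T41"),
    ]

    def stretch_select(rows, k, cutoff, product):
        # stretchable sigma~[k]: relax the range predicate until >= k results
        while True:
            result = [r for r in rows if r[1] > cutoff and r[2] == product]
            if len(result) >= k or cutoff <= min(r[1] for r in rows):
                return result, cutoff
            cutoff -= timedelta(days=30)   # widen the date range stepwise

    rows, used = stretch_select(requests, k=2, cutoff=date(2005, 1, 4), product="T40")
    print(used, rows)   # relaxed cutoff 2004-12-05, now returning 2 tuples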
Ξ Operator (SALT with TA Flavor)
Rationale:
• all operators produce lists of tuples
• a Ξ operator encapsulates customizable scoring
• Ξ can be efficiently implemented in a relational kernel

Technique:
Ξ[Φ; σ; F; T] (R) consumes prefixes of an input list R, with
• a set Φ of simple aggregation functions ("accumulators"),
  each with O(1) space and O(|prefix|) time
• a scoring function σ: dom(R) × out(Φ) → real
• a filter condition F, referring to current tuple values and accumulators
• a stopping condition T, of the same form as F
(similar to SQL rank() with user-defined aggregation and LDL++ aggregation,
but with early termination!)

Ex.:
sort[k, Score, desc] (
  Ξ[Φ: min-k := min{Score(t) | t ∈ input}; threshold := ...;
    σ(t) := sum(R1.Score, R2.Score, C1.Score) as Score;
    F: Score > min-k ∨ |input| < k;
    T: min-k ≥ threshold ∧ |input| ≥ k]
  (merge(sort[...] (σ[...] (Requests R1 ...)),
         sort[...] (σ[...] (Requests R2 ...)),
         sort[...] (σ[...] (Customers C1 ...)))))

Properties and problems:
+ pipelined processing of list prefixes
+ can be implemented by TA with a bounded queue
? difficult to integrate into query rewriting
? difficult for cost estimation
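One way such an operator might look operationally (Python; the names xi and make_accumulators are illustrative, not the paper's API: consume a prefix of the merged, descending-sorted input, maintain O(1)-space accumulators Φ, keep tuples passing F, stop once T holds):

    import heapq

    def xi(input_list, k, score, init_acc, update_acc, keep, stop):
        # Xi[Phi; sigma; F; T](R): consumes only a prefix of R
        acc, out, n = init_acc, [], 0
        for t in input_list:
            n += 1
            s = score(t)                    # sigma: score the current tuple
            acc = update_acc(acc, s, n, k)  # Phi: O(1)-space accumulators
            if keep(s, acc, n, k):          # F: filter condition
                out.append(t)
            if stop(acc, n, k):             # T: stopping condition
                break
        return out

    def make_accumulators():
        heap = []                           # the k best scores seen so far
        def update(acc, s, n, k):
            heapq.heappush(heap, s)
            if len(heap) > k:
                heapq.heappop(heap)
            min_k = heap[0] if len(heap) >= k else float("-inf")
            threshold = s   # input sorted desc: no later tuple can score higher
            return (min_k, threshold)
        return update

    merged = [("r7", 2.4), ("r3", 2.1), ("r9", 1.7), ("r2", 1.5)]  # sorted desc
    top = xi(merged, k=2, score=lambda t: t[1],
             init_acc=(float("-inf"), float("inf")),
             update_acc=make_accumulators(),
             keep=lambda s, acc, n, k: s >= acc[0] or n < k,       # F
             stop=lambda acc, n, k: acc[0] >= acc[1] and n >= k)   # T
    print(top)   # [('r7', 2.4), ('r3', 2.1)] after consuming only a 2-prefix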
Key Problem: Query Rewriting

Goal:
establish algebraic equivalences for SALT expressions
as a basis for query rewriting

Examples:
commutativity of stretchable top-k and standard selection:
  σ~[k, date > 1/4/05] (σ[product="T40"] (R))
    ≡ σ[product="T40"] (σ~[k, date > 1/4/05] (R))      ← Wishful thinking!
commutativity of scoring operator Ξ and standard selection
distributivity of scoring operator Ξ over union
...

Technical challenge:
either work out correct & useful rewriting rules,
or establish "approximate equivalences" of the kind
  σ~[k, F] (σ[G] (R)) ≈ sort[k, ...] (σ[G] (σ~[k*, F] (R))) with a proper k*,
ideally with quantifiable error probabilities.
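Why the first equivalence is wishful thinking: pushing the selection below the top-k changes which k tuples survive the cut. A tiny demonstration (Python; plain top_k stands in for the stretchable operator with its relaxation already resolved):

    def top_k(rows, k, key):
        return sorted(rows, key=key, reverse=True)[:k]

    def select(rows, pred):
        return [r for r in rows if pred(r)]

    # toy R(product, recency): higher recency = more recent request
    R = [("T40", 5), ("T41", 9), ("T40", 7), ("T41", 8)]

    k = 2
    lhs = top_k(select(R, lambda r: r[0] == "T40"), k, key=lambda r: r[1])
    rhs = select(top_k(R, k, key=lambda r: r[1]), lambda r: r[0] == "T40")
    print(lhs)  # [('T40', 7), ('T40', 5)] -- filter first, then take the top 2
    print(rhs)  # []                       -- the top 2 are T41 tuples; filtering empties it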
Key Problem: Cost Estimation

1) usual DB cost estimation: selectivity of multidimensional filters
2) cost estimation for top-k ranked retrieval: when will we stop?
   (for Ξ: length of the input prefix; for TA: scan depth on the index lists)
We claim that 2) is harder than 1)!

Index lists (from the TA example):
t1: d78: 0.9   d23: 0.8   d10: 0.8   d1: 0.7    d88: 0.2   ...
t2: d64: 0.8   d23: 0.6   d10: 0.6   d10: 0.2   d78: 0.1   ...
t3: d10: 0.7   d78: 0.5   d64: 0.4   d99: 0.2   d34: 0.1   ...

Technical challenge:
develop a full estimator for top-k execution cost

Possible approaches (Ilyas et al.: SIGMOD'04, Theobald et al.: VLDB'04):
Probabilistically predict (a quantile of) the aggregated score of a data item d:
• precompute a score-distribution histogram for each single dimension
• compute the convolution of the histograms at query time to predict P[∑_i S_i ≥ δ]
View the scores X_1 > X_2 > ... > X_n of the n data items as samples from S = ∑_i S_i;
use order statistics to predict the score of the rank-k item
and the scan depth at stopping time.
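A minimal sketch of the convolution approach (Python/NumPy; the two 10-bucket score histograms are made-up stand-ins for precomputed per-dimension statistics, and the tail read-off is a coarse, bucketed estimate):

    import numpy as np

    # per-dimension score histograms over [0, 1), 10 buckets each,
    # normalized to probability mass per bucket
    h1 = np.array([0.0, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.05, 0.03, 0.02])
    h2 = np.array([0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01])

    # distribution of the aggregated score S1 + S2: convolve the histograms
    conv = np.convolve(h1, h2)

    def tail_prob(hist, delta, bucket_width=0.1):
        # estimate P[sum of scores >= delta] from the convolved histogram's tail
        start = round(delta / bucket_width)
        return hist[start:].sum()

    print(tail_prob(conv, delta=1.2))   # estimated P[S1 + S2 >= 1.2]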
Conclusion: Caveats and Rebuttals

DB&IR is important; the SALT algebra is one key aspect.

Is there anything new here?
→ The literature has bits and pieces, but no strategic view.

Don't eXtensible DBSs or intranet SEs cover 90%?
→ XDBSs with UDFs are too complex; SEs lack query optimization.

Do IR people believe in DB&IR?
→ Yes: probabilistic Datalog, XML IR, statistical relational learning, etc.

Do IR people believe in SALT and query optimization?
→ No: they are mostly driven by search-result quality and largely disregard performance.

Does the SE industry believe in SALT and query optimization?
→ No: simple consumer-oriented search or small content-management apps.

Is there business value in DB&IR?
→ Yes, for both individual apps and general text2data.

Where do we go from here?
→ Detailed design & implementation of SALT, with query optimization.