Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?

Surajit Chaudhuri (Microsoft Research)
Raghu Ramakrishnan (U Wisconsin, ex QUIQ)
Gerhard Weikum (Max-Planck Institute of CS)

Warning: Non-technical Content! To Be Taken with a Grain of SALT.

CIDR 2005

DB and IR: Two Parallel Universes

                        Database Systems                    Information Retrieval
canonical application:  accounting                          libraries
data type:              numbers, short strings              text
foundation:             algebraic / logic based             probabilistic / statistics based
search paradigm:        Boolean retrieval                   ranked retrieval
                        (exact queries, result sets/bags)   (vague queries, result lists)

Parallel universes forever?

Take-home Message or Food for Disagreement

Claim 1: DB&IR applications require and justify a new platform / kernel system with an appropriately designed API for a Scoring Algebra for Lists and Text (SALT).

Claim 2: One key challenge lies in reconciling flexible scoring with query optimizability.

Outline

• Top-down Motivation: DB&IR Applications
• Bottom-up Motivation: Algorithms & Tricks
• Towards SALT: Scoring Algebra(s) for Lists and Text
• Key Problem: Query Optimization

Top-down Motivation: Applications (1) - Customer Support

Typical data:
Customers (CId, Name, Address, Area, Category, Priority, ...)
Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)

Example request from a premium customer from Germany: "A notebook, model ..., configured with ..., has a problem with the driver of its Wave-LAN card. I already tried the fix ..., but received error message ..."

Typical queries: request classification & routing; find similar requests.

Why customizable scoring?
• wealth of different apps within this app class
• different customer classes
• adjustment to evolving business needs
• scoring on text + structured data (weighted sums, language models, skyline, w/ correlations, etc.)

Platform desiderata (from the app developer's viewpoint):
• Flexible ranking and scoring on text, categorical, and numerical attributes
• Incorporation of dimension hierarchies for products, locations, etc.
• Efficient execution of complex queries over text and data attributes
• Support for high update rates concurrently with high query load

Top-down Motivation: Applications (2)

More application classes:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer relationship management in banks, insurance, telecom, etc.
• Bulletin boards for social communities
• P2P personalized & collaborative Web search
etc. etc.

Top-down Motivation: Applications (3)

Next wave Text2Data: use Information-Extraction technology (regular expressions, HMMs, lexicons, other NLP and ML techniques) to convert text documents into relational facts, moving up in the value chain.

Example: "The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ..."

Conference:
Name   Year   Location   Date       Prob
CIDR   2005   Asilomar   05/01/04   0.95

ConfOrganization:
Name   Year   Chair   Prob
CIDR   2005   P68     0.9
CIDR   2005   P35     0.75

People:
Id    Name
P35   Michael Stonebraker
P68   David J. DeWitt

• facts now have confidence scores
• queries involve probabilistic inferences and result ranking (sketched below)
• relevant for "business intelligence"
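To make the "probabilistic inferences and result ranking" point concrete, here is a minimal sketch (ours, not from the talk): a join over the extracted tables above, ranking answers by the product of the contributing facts' confidence scores, under an assumed independence of extraction errors.

```python
# Hypothetical sketch (not from the talk): querying extracted, uncertain facts.
# The joint confidence of a derived answer is the product of the contributing
# facts' probabilities, assuming independent extraction errors.

conference = [("CIDR", 2005, "Asilomar", "05/01/04", 0.95)]
conf_org   = [("CIDR", 2005, "P68", 0.90), ("CIDR", 2005, "P35", 0.75)]
people     = {"P35": "Michael Stonebraker", "P68": "David J. DeWitt"}

# "Who chaired CIDR 2005, and where?" -- a join whose results are ranked
# by combined confidence rather than returned as an exact set.
answers = []
for cname, cyear, loc, date, p1 in conference:
    for oname, oyear, chair, p2 in conf_org:
        if (cname, cyear) == (oname, oyear):
            answers.append((people[chair], loc, p1 * p2))

for name, loc, conf in sorted(answers, key=lambda a: -a[2]):
    print(f"{name} @ {loc}  (confidence {conf:.3f})")
# David J. DeWitt @ Asilomar  (confidence 0.855)
# Michael Stonebraker @ Asilomar  (confidence 0.713)
```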
Top-down Motivation: Applications (4)

Essential requirements for a DB&IR platform:
1) Customizable scoring and ranking
2) Composite queries incl. joins, filters & top-k
3) Optimizability of query expressions
4) Metadata and ontologies
5) Simple, sufficiently expressive data model (XML light)
6) Data preparation (entity recognition, entity resolution, etc.)
7) Personalization (profile learning)
8) Usage patterns (query logs, click streams, etc.)

Requirements 1, 2, 3 most strongly affect platform architecture and API.

Bottom-up Motivation: Algorithms & Tricks

B+ tree on terms, categories, values, ...; index lists with (ID, s = tf*idf) entries sorted by ID:
t1: 17: 0.3, 44: 0.4, 52: 0.1, 53: 0.8, ...
t2: 12: 0.5, 11: 0.4, 28: 0.1, 44: 0.2, 51: 0.6, 52: 0.3, ...
t3: 11: 0.6, 17: 0.1, 52: 0.7, ...

Vanilla algorithm "join&sort" for query q = (t1, t2, t3):
top-k( σ[term=t1](index) ⋈_ID σ[term=t2](index) ⋈_ID σ[term=t3](index)
       order by sum(s) desc )

For scale: Google handles > 10 mio. terms, > 8 bio. docs, > 4 TB index.

Good search engines use a variety of heuristics and tricks for shortcutting:
• keeping short lists of the best docs per term in memory
• global statistics for index-list selection
• early pruning of result candidates
• bounded priority queue of candidates

Bottom-up Motivation: Algorithms & Tricks (cont.)

TA: efficient & principled top-k query processing with monotonic score aggregation (Fagin 01, Güntzer/Kießling/Balke 01).

TA with sorted access only (NRA), for data items d1, ..., dn with scores s(t1,d1) = 0.7, ..., s(tm,d1) = 0.2, etc.:

scan index lists; consider d at position pos_i in L_i:
  E(d) := E(d) ∪ {i};
  high_i := s(t_i, d);
  worstscore(d) := aggr{ s(t_ν, d) | ν ∈ E(d) };
  bestscore(d) := aggr{ worstscore(d), aggr{ high_ν | ν ∉ E(d) } };
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{ worstscore(d') | d' ∈ top-k };
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{ bestscore(d') | d' ∈ cand };
  if threshold ≤ min-k then exit;

Observations:
• The TA flavor of score aggregation with early termination is great.
• Implementation details are crucial.
• DB&IR needs to combine it with filter, join, phrase matching, etc.
• It is unclear how to abstract TA and integrate it into a relational algebra.

Example: query q = (t1, t2, t3), k = 1, with index lists sorted by descending score:
t1: d78: 0.9, d23: 0.8, d10: 0.8, ...
t2: d64: 0.8, d23: 0.6, d10: 0.6, ...
t3: d10: 0.7, d78: 0.5, d64: 0.4, ...

Scan depth 1:
Rank   Doc   Worstscore   Bestscore
1      d78   0.9          2.4
2      d64   0.8          2.4
3      d10   0.7          2.4

Scan depth 2:
Rank   Doc   Worstscore   Bestscore
1      d78   1.4          2.0
2      d23   1.4          1.9
3      d64   0.8          2.1
4      d10   0.7          2.1

Scan depth 3:
Rank   Doc   Worstscore   Bestscore
1      d10   2.1          2.1
2      d78   1.4          2.0
3      d23   1.4          1.8
4      d64   1.2          2.0
STOP! (threshold = 2.0 ≤ min-k = 2.1)
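The NRA pseudocode above can be made concrete. The following is a minimal, runnable sketch (ours, not the talk's code), assuming in-memory index lists sorted by descending score and summation as the monotonic aggregation; on the example lists it reproduces the k = 1 trace above and stops at scan depth 3.

```python
# Minimal sketch of NRA (TA with sorted access only). Illustrative only:
# in-memory lists, summation as the monotonic aggregation function.

def nra_top_k(index_lists, k):
    seen = {}                                      # doc -> {list index: score}
    high = [lst[0][1] for lst in index_lists]      # per-list upper bounds
    for depth in range(max(len(l) for l in index_lists)):
        for i, lst in enumerate(index_lists):      # one sorted access per list
            if depth < len(lst):
                doc, s = lst[depth]
                seen.setdefault(doc, {})[i] = s
                high[i] = s

        def worstscore(d):                         # sum of scores seen so far
            return sum(seen[d].values())

        def bestscore(d):                          # add high_i for unseen lists
            return worstscore(d) + sum(h for i, h in enumerate(high)
                                       if i not in seen[d])

        top_k = sorted(seen, key=worstscore, reverse=True)[:k]
        min_k = min(worstscore(d) for d in top_k) if len(top_k) == k else 0.0
        cand = [d for d in seen if d not in top_k and bestscore(d) > min_k]
        threshold = max((bestscore(d) for d in cand), default=0.0)
        if len(top_k) == k and threshold <= min_k:
            break                                  # early termination
    return [(d, worstscore(d)) for d in top_k]

lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],    # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],    # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],    # t3
]
print(nra_top_k(lists, k=1))   # top-1 is d10 with score 2.1, at scan depth 3
```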
SALT Algebra: Three Proposals

SALT = Scoring Algebra for Lists and Text.

Goals:
• reconcile relational algebra with TA-flavor operators
• reconcile flexible scoring with query optimizability

Three proposals:
• speculative filters and stretchable operators
• operators with scoring modalities
• a scoring operator Σ

Related prior work: probabilistic relations; approximate query processing; query algebras on lists; SQL user-defined aggregation.

Speculative Filters and Stretchable Operators (SALT with SQL Flavor)

Rationale: map ranked-retrieval queries to multidimensional SQL filters such that they return approximately k results.

Ex.: recent WLAN device driver problems on notebook T40 (with Debian):
σ[date > 11/30/04 ∧ class="/network/drivers" ∧ product="Thinkpad" ∧ software="Linux"] (Requests)

Techniques:
• ranking many answers: speculative filters generate additional conjunctive conditions to approximate top-k
• finding enough answers: stretchable operators relax (range or categorical) conditions to ensure at-least-k

Properties and problems:
+ similar to IR query expansion by (pseudo-)feedback, thesaurus, query log
+ can leverage multidimensional histograms
? composability of operators

Proposal: choice of filters for approximately k top-level (...) results:
σ~[k, date > 1/4/05 ∧ ?class="/network/drivers/wlan" ∧ product="T40"]
generally: stretchable variants σ~[k], ⋈~[k], ... of the standard operators

Σ Operator (SALT with TA Flavor)

Rationale:
• all operators produce lists of tuples
• a Σ operator encapsulates customizable scoring
• it can be efficiently implemented in a relational kernel

Technique: Σ[Φ; σ, F; T](R) consumes prefixes of an input list R with
• Φ: a set of simple aggregation functions ("accumulators"), each with O(1) space and O(|prefix|) time
• σ: a scoring function dom(R) × out(Φ) → real
• F: a filter condition, as in selection, referring to current tuple values
• T: a stopping condition, of the same form as F

This is similar to SQL rank() with user-defined aggregation (and LDL++ aggregation), but with early termination!

Ex.:
sort[k, Score, desc] (
  Σ[Φ: min-k := min{Score(t) | t ∈ input}; threshold := ...;
    σ(t) := sum(R1.Score, R2.Score, C1.Score) as Score;
    F: Score+ > min-k ∨ |input| < k;
    T: min-k ≥ threshold ∧ |input| ≥ k]
  (merge( sort[...] (σ[...] (Requests R1 ...)),
          sort[...] (σ[...] (Requests R2 ...)),
          sort[...] (σ[...] (Customers C1 ...)))))

Properties and problems (see the sketch below):
+ pipelined processing of list prefixes
+ can be implemented by TA with a bounded queue
? difficult to integrate into query rewriting
? difficult for cost estimation
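One way to read the Σ[Φ; σ, F; T] signature operationally is as a pipelined fold over a list prefix. The sketch below is our illustrative interpretation, not a SALT implementation; all names in it are hypothetical.

```python
# Illustrative, hypothetical reading of Sigma[Phi; sigma, F; T](R): consume a
# prefix of an input list, maintain O(1)-space accumulators, score and filter
# each tuple, and terminate early. Not an actual SALT API.

def scoring_operator(R, init_acc, update_acc, score, keep, stop):
    acc = init_acc()                  # Phi: accumulators
    out = []
    for t in R:                       # consumes a prefix of the list R
        acc = update_acc(acc, t)      # O(1) work per tuple
        s = score(t, acc)             # sigma: dom(R) x out(Phi) -> real
        if keep(t, s, acc):           # F: filter on tuple values + accumulators
            out.append((t, s))
        if stop(acc):                 # T: stopping condition ends the prefix
            break
    return out

# Toy usage: keep tuples scoring above 0.5, stop after a prefix of 3 tuples.
result = scoring_operator(
    R=[{"score": s} for s in (0.9, 0.8, 0.3, 0.1)],
    init_acc=lambda: {"n": 0},
    update_acc=lambda acc, t: {"n": acc["n"] + 1},
    score=lambda t, acc: t["score"],
    keep=lambda t, s, acc: s > 0.5,
    stop=lambda acc: acc["n"] >= 3,
)
print(result)   # [({'score': 0.9}, 0.9), ({'score': 0.8}, 0.8)]
```

A TA-style top-k would instantiate the accumulators with min-k and a threshold, and the keep predicate with a bounded priority queue, which is exactly the "implemented by TA with a bounded queue" property claimed above.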
Key Problem: Query Rewriting

Goal: establish algebraic equivalences for SALT expressions as a basis for query rewriting.

Examples:
• commutativity of stretchable top-k and standard selection:
  σ~[k, date > 1/4/05] (σ[product="T40"] (R)) ≡ σ[product="T40"] (σ~[k, date > 1/4/05] (R))   (wishful thinking!)
• commutativity of the scoring operator and standard selection
• distributivity of the scoring operator over union
• ...

Technical challenge: either work out correct & useful rewriting rules, or establish "approximate equivalences" of the kind
σ~[k, F] (σ[G] (R)) ≈ sort[k, ...] (σ[G] (σ~[k*, F] (R))) with a proper k*,
ideally with quantifiable error probabilities.

Key Problem: Cost Estimation

1) usual DB cost estimation: selectivity of multidimensional filters
2) cost estimation for top-k ranked retrieval: when will we stop? (for Σ: length of the input prefix; for TA: scan depth on the index lists)

We claim that 2) is harder than 1)!

Technical challenge: develop a full estimator for top-k execution cost.

Possible approaches (Ilyas et al.: SIGMOD'04, Theobald et al.: VLDB'04): probabilistically predict (a quantile of) the aggregated score of a data item d (see the sketch below):
• precompute a score-distribution histogram for each single dimension
• compute the convolution of the histograms at query time to predict P[Σ_i S_i ≥ x]
• view the scores X1 > X2 > ... > Xn of the n data items as samples from S = Σ_i S_i, and use order statistics to predict the score of the rank-k item and the scan depth at stopping time
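The histogram-convolution idea can be sketched in a few lines. The following is an illustration with synthetic scores; the binning, the score distributions, and all names are our assumptions, not taken from the cited papers.

```python
import numpy as np

# Sketch: estimate the distribution of the aggregated score S = S1 + S2 + S3
# from per-dimension score histograms, then predict the expected rank-k score
# among n items from the tail P[S >= x]. Illustrative assumptions throughout.

bins = 100                                          # per-dimension resolution
rng = np.random.default_rng(0)
dims = [rng.beta(1, 6, 10_000) for _ in range(3)]   # synthetic scores in [0,1]

# Normalized per-dimension histograms (probability mass per bin).
hists = [np.histogram(d, bins=bins, range=(0.0, 1.0))[0] / len(d)
         for d in dims]

# Convolving the per-dimension histograms approximates the probability mass
# function of the sum S over the support [0, 3].
pmf = hists[0]
for h in hists[1:]:
    pmf = np.convolve(pmf, h)

# Tail P[S >= x]; the rank-k score is the smallest x with n * P[S >= x] <= k.
n, k = 1_000_000, 10
tail = np.cumsum(pmf[::-1])[::-1]
xs = np.linspace(0.0, 3.0, len(pmf))                # approximate bin centers
idx = np.searchsorted(-n * tail, -k)                # first x with n*tail <= k
print(f"predicted score of the rank-{k} item: {xs[idx]:.3f}")
```

With such an estimate of the rank-k score, the same histograms can be read backwards to predict how deep TA must scan before the threshold drops below it, which is precisely the stopping-time quantity that item 2) asks for.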
Conclusion: Caveats and Rebuttals

DB&IR is important, and a SALT algebra is one key aspect of it.

Q: Is there anything new here?
A: The literature has bits and pieces, but no strategic view.

Q: Don't eXtensible DBSs or intranet search engines cover 90%?
A: XDBSs with UDFs are too complex; search engines lack query optimization.

Q: Do IR people believe in DB&IR?
A: Yes: probabilistic Datalog, XML IR, statistical relational learning, etc.

Q: Do IR people believe in SALT and query optimization?
A: No: they are mostly driven by search result quality and largely disregard performance.

Q: Does the search-engine industry believe in SALT and query optimization?
A: No: simple consumer-oriented search or small content-management apps.

Q: Is there business value in DB&IR?
A: Yes, for both individual apps and general text2data.

Q: Where do we go from here?
A: Detailed design & implementation of SALT, with query optimization.