SPARK: Top-k Keyword Query in Relational Databases
Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou
Univ. of New South Wales, Univ. of Queensland
SIGMOD 2007

2009. 02. 05.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University

Introduction
Demand for RDBs to support effective and efficient IR-style keyword queries
Features
– Assembling data collectively
– Supporting casual users
– Revealing unexpected relationships among entities
– More flexible search for back-end databases than pre-built template querying
Issues
– Search results contradictory to human perception (in previous work)
– Technical challenge: aggregating the final score of an answer (previous work relies on monotonicity of the rank aggregation function)
Contributions
– New ranking function (non-monotonic in nature)
– Techniques for avoiding unnecessary DB accesses: Skyline sweeping algorithm, Block pipeline algorithm

Copyright 2009 by CEBT

Preliminaries
Keyword queries on a set of relations
Joined Tuple Tree (JTT)
– Tree of tuples
– Foreign key to primary key relationships
Candidate Network (CN)
Relevance score
– How relevant the JTT is to the query
Top-k results
– Top-3 JTTs
Example query: "maxtor netvista"
– JTTs shown in the example figure: c3, c3->p2, c1->p1, c2->p2, c2->p2<-c3

Preliminaries: existing solutions (DISCOVER 2002, 2003)
Enumerating (union) all possible CNs
Rules (DISCOVER 2003)
– Prune duplicate CNs
– Prune non-minimal CNs
– Prune CNs of type: RQ<-S->RQ
Example (cont.)
– CQ->PQ : valid
– CQ->U : not valid
– CQ->U<-CQ : may be valid
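The pruning rules on the existing-solutions slide can be illustrated with a small check over a candidate network. This is only a sketch under stated assumptions, not DISCOVER's actual implementation: the `TupleSet` class, the edge encoding, and the function names are hypothetical, each pair of relations is assumed to be linked by a single foreign key, and duplicate-CN pruning (which needs a tree-isomorphism test) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleSet:
    """One node of a candidate network (hypothetical representation)."""
    relation: str   # relation name, e.g. "C", "P", "U"
    non_free: bool  # True for a keyword-matching tuple set such as CQ


def leaf_indexes(n_nodes, edges):
    """Indexes of nodes with at most one incident edge."""
    degree = [0] * n_nodes
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return [i for i, d in enumerate(degree) if d <= 1]


def is_minimal(nodes, edges):
    """A CN is non-minimal if some leaf is a free tuple set (it could be dropped)."""
    return all(nodes[i].non_free for i in leaf_indexes(len(nodes), edges))


def has_same_fk_fanout(nodes, edges):
    """Detect the R <- S -> R pattern.

    An edge (src, dst) means src holds a foreign key to dst's primary key.
    Assumption: one foreign key per relation pair, so a node pointing at two
    nodes of the same relation would reference the same tuple twice.
    """
    seen = set()
    for src, dst in edges:
        key = (src, nodes[dst].relation)
        if key in seen:
            return True
        seen.add(key)
    return False


def keep_cn(nodes, edges):
    """Apply the two structural rules sketched above."""
    return is_minimal(nodes, edges) and not has_same_fk_fanout(nodes, edges)
```

With this encoding, CQ->PQ and CQ->U<-CQ pass the check, CQ->U fails the minimality test, and RQ<-S->RQ is rejected by the fan-out test, which matches the labels in the slide's example.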

Preliminaries: existing solutions (DISCOVER 2002, 2003)
Upper bounding functions
– Bound the scores of potential answers from each CN
– Stop query execution earlier
Ex) Sparse algorithm, Global pipeline algorithm

  id   score      id   score
  t1   50         I1   70
  t2   40         I2   60
  t3   30         I3   40
  t4   20         I4   20
  (the two score lists are aggregated)

Focus of this paper
How to score a JTT: ranking function
How to generate and order the SQL queries for the CNs: top-k join query
– Minimal DB accesses are required before top-k results are returned.

Ranking Function
Problems with existing ranking functions
Only monotonic aggregation functions have been considered.
– SUM
– Example CN: CQ->PQ
Discordance with human perception
Side effect: overly rewarding contributions of the same keyword in different tuples in the same JTT

Ranking Function
Modeling a JTT as a virtual document
[Figure: keywords K1 and K2 over tuples P(t1) and C(t1)]
– Attenuating: same keyword in different relations
Technical issues
– Expensive cost to compute the completeness score and the size normalization score

Top-k Join algorithm
None of the existing top-k query processing methods deals with non-monotonic scoring functions
– c[i]->p[i]: max(score(p[1],c[i+1]), score(p[j+1],c[1])) X
[Figure: keywords K1 and K2 over tuples P(t2) and C(t1)]
Monotonic upper bounding function for the actual scoring function
Lemma 1.
score(T,Q) can be bounded by a function uscore(T,Q) = 1/(1-s) * min(A,B)
– max(uscore(c[i+1],p[1]), uscore(c[1],p[j+1]))

Top-k Join algorithm
Skyline Sweeping Algorithm
Avoid unnecessary join checking -> minimal number of accesses to the database
Dominance relationship among candidates
– Checking the candidate with the higher upper bound first
– Priority queue: descending order of the upper bound scores (uscore)
Technical point
– Duplicate checking

Top-k Join algorithm
Large gaps between the upper bound scores and the corresponding real scores
Harder to stop early
– Upper bound of unprocessed candidates >> real score
Block Pipeline Algorithm
Employing a local non-monotonic upper bounding function that bounds the real score of JTTs more accurately
Tighter upper bounding: bscore < uscore
Signature
– An ordered sequence of term frequencies for all the query keywords
– <tfw1(t), …, tfw2(t)>
[Figure: signature of the block]

Experiments
Dataset: IMDB, DBLP and Mondial
Oracle 10g, MySQL 5.00.18, JDK 1.5
Implementation: Sparse, Global pipeline (GP),
Skyline sweep (SS), Block pipeline (BP)
Metrics
– Number of top-1 answers that are relevant (#Rel)
– Reciprocal rank (R-Rank)
Relevant answer
– It must match all the search keywords
– Its size must be the smallest

Experiments
Effectiveness
Efficiency
Observations
– Fastest: BP
– SS outperforms Sparse and GP
– Sparse == GP (GP > Sparse for small k or easy queries)
– All algorithms are more responsive for smaller k values

Experiments
[Figures: experimental results]

Conclusion
New ranking method
– Adapts state-of-the-art IR ranking functions and principles
Query processing method
– Tailored for the non-monotonic ranking function
Extensive experiments on large-scale real databases
– High precision with high efficiency

Reviews
Good
– Detailed explanation of background and existing approaches
– Good paper organization and good examples
Short of rationale for the new algorithms
– Non-monotonicity of the Block pipeline algorithm
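
The Skyline Sweeping slide names three ingredients: a priority queue ordered by descending uscore, checking the candidate with the highest upper bound first, and duplicate checking. The sketch below shows how these might fit together for a two-relation CN. It is an illustration under assumptions, not the paper's implementation: the function name `skyline_sweep`, the `joins` predicate (standing in for a database probe), and the `score`/`uscore` callables are hypothetical, and `uscore` is assumed to be monotonic with uscore >= score.

```python
import heapq


def skyline_sweep(left, right, joins, score, uscore, k):
    """Return (top-k results, number of join probes).

    left, right: lists of (tuple_id, partial_score), sorted by descending score.
    joins(l_id, r_id): hypothetical join check, a database access in practice.
    score(ls, rs): real, possibly non-monotonic score of a joined pair.
    uscore(ls, rs): monotonic upper bound on score(ls, rs).
    """
    if not left or not right or k <= 0:
        return [], 0
    # Max-heap on the upper bound, emulated by negating it.
    candidates = [(-uscore(left[0][1], right[0][1]), 0, 0)]
    seen = {(0, 0)}  # duplicate checking: (i, j) is reachable from two parents
    top = []         # min-heap holding the k best real scores found so far
    probes = 0
    while candidates:
        neg_bound, i, j = heapq.heappop(candidates)
        # No unprocessed candidate can exceed this bound, so stop once the
        # k-th best real score already reaches it.
        if len(top) == k and top[0][0] >= -neg_bound:
            break
        (l_id, ls), (r_id, rs) = left[i], right[j]
        probes += 1
        if joins(l_id, r_id):
            result = (score(ls, rs), l_id, r_id)
            if len(top) < k:
                heapq.heappush(top, result)
            elif result[0] > top[0][0]:
                heapq.heapreplace(top, result)
        # Only the two candidates directly dominated by (i, j) are added.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                bound = uscore(left[ni][1], right[nj][1])
                heapq.heappush(candidates, (-bound, ni, nj))
    return sorted(top, reverse=True), probes


if __name__ == "__main__":
    # Scores taken from the id/score table on the existing-solutions slide;
    # the joining pairs and both scoring functions are made up for the demo.
    left = [("t1", 50), ("t2", 40), ("t3", 30), ("t4", 20)]
    right = [("I1", 70), ("I2", 60), ("I3", 40), ("I4", 20)]
    pairs = {("t1", "I3"), ("t2", "I1"), ("t2", "I2"), ("t3", "I2")}
    print(skyline_sweep(
        left, right,
        joins=lambda l, r: (l, r) in pairs,
        score=lambda ls, rs: ls + rs - 1.5 * abs(ls - rs),
        uscore=lambda ls, rs: ls + rs,
        k=2,
    ))
```

Because the heap always holds the frontier of not-yet-processed candidates, the loop can stop without probing every pair, which is the "avoid unnecessary join checking" point of the slide. A looser upper bound makes the stop condition fire later, which is the gap the Block Pipeline slide sets out to narrow.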