SPARK: Top-k Keyword Query in Relational Databases

Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou
Univ. of New South Wales, Univ. of Queensland
SIGMOD 2007
2009. 02. 05.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University
Introduction
• Demand for RDB to support effective and efficient IR-style keyword queries
  ◦ Features
    – Assembling data collectively
    – Supporting casual users
    – Revealing unexpected relationships among entities
    – More flexible search for back-end databases than pre-built template querying
• Issues
  ◦ Search results contradictory to human perception (in previous work)
  ◦ Technical challenges
    – Aggregating final score of an answer
    – Relying on monotonicity of the rank aggregation function
• Contributions
  ◦ New ranking function
    – Non-monotonic nature of ranking methods
  ◦ Techniques for avoiding unnecessary DB accesses
    – Skyline sweeping algorithm
    – Block pipeline algorithm
Copyright © 2009 by CEBT
Preliminaries
• Keyword queries on a set of relations
• Joined Tuple Tree (JTT)
  ◦ Tree of tuples
    – Top-k results
  ◦ Foreign key to primary key relationships
  ◦ Candidate Network (CN)
  ◦ Relevance score
    – How relevant the JTT is to the query
• Example query: “maxtor netvista”
  [Figure: "Top-3 JTTs" panel; JTT labels c3, c3->p2, c1->p1, c2->p2, c2->p2<-c3]
Preliminaries: existing solutions (DISCOVER 2002, 2003)
• Enumerating (Union) all possible CNs
  ◦ CQ->PQ : valid
  ◦ CQ->U : not valid
  ◦ CQ->U<-CQ : may be valid
• DISCOVER (2003) rules
  ◦ Prune duplicate CNs
  ◦ Prune non-minimal CNs
  ◦ Prune CNs of type: RQ<-S->RQ
• Example (cont.)
Preliminaries: existing solutions (DISCOVER 2002, 2003)
• Upper bounding functions
  ◦ Bound the scores of potential answers from each CN
    – Stop query execution earlier
    – Ex) Sparse algorithm, Global pipeline algorithm

  id   score     id   score
  t1   50        I1   70
  t2   40        I2   60
  t3   30        I3   40
  t4   20        I4   20
  (aggregate)

• Focus of this paper
  ◦ How to score a JTT: Ranking Function
  ◦ How to generate and order the SQL queries for the CNs: Top-k Join query
    – Minimal DB accesses are required before top-k results are returned.
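
The early-stopping idea behind upper bounding functions can be sketched as follows. This is only an illustration under stated assumptions, not the Sparse or Global pipeline algorithm itself: it assumes the aggregate is SUM, that both score lists are sorted in descending order, and that a prefix of each list has already been joined. The function name `unseen_upper_bound` is hypothetical; the scores are the ones from the two lists above.

```python
def unseen_upper_bound(left, right, i, j):
    """Upper bound on the SUM score of any pair outside left[:i] x right[:j].

    Both lists are assumed non-empty and sorted in descending order.
    """
    neg_inf = float("-inf")
    # A pair that uses an unseen left entry scores at most left[i] + right[0].
    bound_left = left[i] + right[0] if i < len(left) else neg_inf
    # A pair that uses an unseen right entry scores at most left[0] + right[j].
    bound_right = left[0] + right[j] if j < len(right) else neg_inf
    return max(bound_left, bound_right)


t_scores = [50, 40, 30, 20]   # t1..t4
l_scores = [70, 60, 40, 20]   # I1..I4

# After joining only t1 with I1, no remaining pair can exceed this bound.
bound = unseen_upper_bound(t_scores, l_scores, 1, 1)
```

If the k-th best answer found so far already scores at least `bound`, the remaining pairs need not be fetched from the database. The argument depends on SUM being monotonic.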
Ranking Function
• Problems with existing ranking functions
  ◦ Monotonic aggregation functions have been considered.
    – SUM
  [Figure: CQ->PQ]
• Discordance with human perception
  ◦ Side effect: overly rewarding contributions of the same keyword in different tuples in the same JTT
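
The side effect can be seen in a toy example. This is a sketch with made-up data, not the scoring function of any cited system: it assumes each tuple is scored by its raw count of query-keyword occurrences and that a JTT is scored by SUM over its tuples. The names `tuple_score`, `sum_score` and `covered_keywords` are hypothetical.

```python
QUERY = ("maxtor", "netvista")


def tuple_score(term_freqs):
    # Toy per-tuple relevance: total occurrences of the query keywords.
    return sum(term_freqs.get(w, 0) for w in QUERY)


def sum_score(jtt):
    # Monotonic aggregation: add up the per-tuple scores.
    return sum(tuple_score(t) for t in jtt)


def covered_keywords(jtt):
    # Query keywords that appear in at least one tuple of the JTT.
    return {w for w in QUERY if any(t.get(w, 0) > 0 for t in jtt)}


# JTT A repeats one keyword in both tuples; JTT B matches both keywords once.
jtt_a = [{"maxtor": 2}, {"maxtor": 2}]
jtt_b = [{"maxtor": 1}, {"netvista": 1}]

score_a, score_b = sum_score(jtt_a), sum_score(jtt_b)
```

Under SUM, JTT A outranks JTT B even though it covers only one of the two query keywords, which is the kind of ranking a human reader would find counterintuitive.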
Ranking Function
• Modeling a JTT as a virtual document
  [Figure: K1, K2 / P(t1), C(t1)]
• Attenuating the contribution of the same keyword in different relations
• Technical issues
  ◦ Expensive cost to compute
• Completeness score and size normalization score
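
A minimal sketch of the virtual-document idea follows. It assumes the double-logarithm term-frequency dampening 1 + ln(1 + ln(tf)) known from pivoted-normalization weighting in IR, and it leaves out idf, length normalization, and the completeness and size normalization scores, so it is not the full ranking function. The helper names are hypothetical.

```python
import math
from collections import Counter


def virtual_document(jtt):
    # Merge the term frequencies of all tuples in the JTT into one document.
    merged = Counter()
    for term_freqs in jtt:
        merged.update(term_freqs)
    return merged


def dampened_score(jtt, query):
    # Score the merged document; repeated occurrences of one keyword are
    # attenuated by the double logarithm instead of being added up linearly.
    doc = virtual_document(jtt)
    return sum(1 + math.log(1 + math.log(doc[w])) for w in query if doc[w] > 0)


query = ("maxtor", "netvista")
jtt_a = [{"maxtor": 2}, {"maxtor": 2}]      # one keyword, repeated
jtt_b = [{"maxtor": 1}, {"netvista": 1}]    # both keywords, once each

score_a = dampened_score(jtt_a, query)
score_b = dampened_score(jtt_b, query)
```

With the merged frequencies, the JTT that matches both keywords scores higher than the one that repeats a single keyword. The price is that the score of a JTT is no longer a monotonic combination of its tuples' individual scores.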
Top-k Join algorithm
• None of the existing top-k query processing methods deals with a non-monotonic scoring function
  ◦ After processing candidate c[i]->p[j], the real scores max(score(p[1], c[i+1]), score(p[j+1], c[1])) do not bound the remaining candidates (X)
  [Figure: K1, K2 / P(t2), C(t1)]
• Monotonic upper bounding function to the actual function
  ◦ Lemma 1. score(T, Q) can be bounded by a function uscore(T, Q) = 1/(1 - s) * min(A, B)
  ◦ Bound on the remaining candidates: max(uscore(c[i+1], p[1]), uscore(c[1], p[j+1]))
Top-k Join algorithm
• Skyline Sweeping Algorithm
  ◦ Avoid unnecessary join checking -> minimal number of accesses to the database
  ◦ Dominance relationship among candidates
    – Checking candidates of higher upper bound first
    – Priority queue: descending order of the upper bound scores
  ◦ Technical point
    – Duplicate checking
  [Figure: candidates with uscore labels]
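
The points above can be sketched for a two-relation CN as follows. This is a simplified illustration, not the paper's implementation: it assumes both input lists are ordered so that the upper bound never increases along either list, it uses a plain set for duplicate checking, and the callbacks `uscore` and `real_score` as well as the toy data are hypothetical.

```python
import heapq


def skyline_sweep(left, right, uscore, real_score, k):
    """Return up to k (score, i, j) results, best first.

    uscore     : monotonic upper bound on the real score of a pair.
    real_score : actual, possibly non-monotonic score; None if no join.
    """
    if k <= 0 or not left or not right:
        return []
    # Max-heap on the upper bound (heapq is a min-heap, so negate).
    queue = [(-uscore(left[0], right[0]), 0, 0)]
    enqueued = {(0, 0)}   # duplicate checking
    top = []              # min-heap holding the best k real scores
    while queue:
        neg_bound, i, j = heapq.heappop(queue)
        # Every candidate not yet processed is bounded by -neg_bound.
        if len(top) == k and top[0][0] >= -neg_bound:
            break
        score = real_score(left[i], right[j])   # the costly join check
        if score is not None:
            if len(top) < k:
                heapq.heappush(top, (score, i, j))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i, j))
        # Only the two neighbours dominated by (i, j) become candidates.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in enqueued:
                enqueued.add((ni, nj))
                heapq.heappush(queue, (-uscore(left[ni], right[nj]), ni, nj))
    return sorted(top, reverse=True)


left = [3, 2, 1]
right = [3, 2, 1]
checked = []


def toy_real_score(a, b):
    # Non-monotonic toy score: pairs with equal values are penalised.
    checked.append((a, b))
    return (a + b) / 2 if a == b else a + b


result = skyline_sweep(left, right, lambda a, b: a + b, toy_real_score, k=2)
```

In this toy run only three of the nine possible pairs are checked before the two best answers are confirmed, because the largest remaining upper bound has already dropped below the current second-best real score.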
Top-k Join algorithm
• Large gaps between the upper bound scores and the corresponding real scores
  ◦ Harder to stop early
    – Upper bound of unprocessed candidates >> real score
• Block Pipeline Algorithm
  ◦ Employing a local non-monotonic upper bounding function that bounds the real score of JTTs more accurately
  ◦ Tighter upper bounding: bscore < uscore
  ◦ Signature
    – An ordered sequence of term frequencies for all the query keywords
    – <tf_w1(t), …, tf_wm(t)>
  ◦ Signature of the block
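
The signature definition can be illustrated with a short sketch. It assumes whitespace tokenisation, made-up tuple texts, and that tuples sharing a signature are grouped into one block while tuples matching no keyword are skipped; the function names are hypothetical and the bscore computation itself is not shown.

```python
from collections import defaultdict


def signature(text, keywords):
    # Ordered sequence of term frequencies, one entry per query keyword.
    words = text.lower().split()
    return tuple(words.count(w) for w in keywords)


def partition_into_blocks(tuples, keywords):
    # Group tuple ids by signature; tuples matching no keyword are skipped.
    blocks = defaultdict(list)
    for tuple_id, text in tuples:
        sig = signature(text, keywords)
        if any(sig):
            blocks[sig].append(tuple_id)
    return dict(blocks)


keywords = ("maxtor", "netvista")
tuples = [
    ("t1", "maxtor disk failed"),
    ("t2", "netvista netvista issue"),
    ("t3", "maxtor drive in netvista"),
    ("t4", "maxtor problem"),
    ("t5", "unrelated text"),
]
blocks = partition_into_blocks(tuples, keywords)
```

Because every tuple in a block has the same term frequencies, a bound computed once from the block signatures applies to every combination of tuples drawn from those blocks.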
Experiments
• Dataset: IMDB, DBLP and Mondial
• Oracle 10g, MySQL 5.00.18, JDK 1.5
• Implementation: Sparse, Global pipeline (GP), Skyline sweep (SS), Block pipeline (BP)
• Metrics
  ◦ Number of relevant top-1 answers (#Rel)
  ◦ Reciprocal rank (R-Rank)
• Relevant answer
  ◦ It must match all the search keywords
  ◦ Its size must be the smallest
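
The two metrics can be computed as in the sketch below. It assumes each query comes with a ranked list of answer ids and a set of ids judged relevant; the data and function names are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant answer, or 0.0 if none is returned.
    for rank, answer_id in enumerate(ranked_ids, start=1):
        if answer_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def num_relevant_top1(runs):
    # Number of queries whose top-1 answer is relevant.
    return sum(1 for ranked, relevant in runs if ranked and ranked[0] in relevant)


# One (ranked answers, relevant answers) pair per query.
runs = [
    (["a", "b"], {"a"}),
    (["x", "y", "z"], {"z"}),
    (["p"], {"q"}),
]
rel = num_relevant_top1(runs)
r_ranks = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
```

Here `rel` counts only the first query, while the reciprocal ranks still give partial credit to the second query, whose relevant answer appears at rank 3.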
Experiments
• Effectiveness
• Efficiency
  ◦ Observations
    – Fastest: BP
    – SS outperforms Sparse and GP
    – Sparse and GP perform comparably (GP outperforms Sparse for small k or easy queries)
    – All algorithms are more responsive for smaller k values
Conclusion
• New ranking method
  ◦ Adapts the state-of-the-art IR ranking function and principles
• Query processing method
  ◦ Tailored for our non-monotonic ranking functions
• Extensive experiments on large scale real databases
  ◦ High precision with high efficiency
Reviews
• Good
  ◦ Detailed explanation of background and existing approach
  ◦ Good paper organization and good examples
• Insufficient rationale for the new algorithms
• Non-monotonicity of Block pipeline algorithm