Slides

Answering Top-k Queries
Using Views
Gautam Das (Univ. of Texas),
Dimitrios Gunopulos (Univ. of California Riverside),
Nick Koudas (Univ. of Toronto),
Dimitris Tsirogiannis (Univ. of Toronto)
Introduction
R
tid  X1  X2  X3
1    82   1  59
2    53  19  83
3    29  99  15
4    80  45   8
5    28  32  39

Ranked by f_Q
tid  Score
2    612
1    543
4    370
3    360
5    343

Preferences expressed as scoring functions on the attributes of a relation, e.g. $f_Q = 3X_1 + 2X_2 + 5X_3$
Top-k: k tuples with the highest score
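As a minimal sketch of the definition above, the following Python snippet scores the five tuples of R with the example function and returns the k best. The helper names `score` and `top_k` are illustrative assumptions, not part of the original slides.

```python
import heapq

# Relation R from the slide: tid -> (X1, X2, X3)
R = {
    1: (82, 1, 59),
    2: (53, 19, 83),
    3: (29, 99, 15),
    4: (80, 45, 8),
    5: (28, 32, 39),
}

def score(weights, values):
    """Linear additive score: sum of weight * attribute value."""
    return sum(w * x for w, x in zip(weights, values))

def top_k(relation, weights, k):
    """Return the k (tid, score) pairs with the highest score, best first."""
    scored = ((tid, score(weights, vals)) for tid, vals in relation.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

f_Q = (3, 2, 5)  # f_Q = 3*X1 + 2*X2 + 5*X3
print(top_k(R, f_Q, 2))  # [(2, 612), (1, 543)]
```

This full scan touches every tuple of R; the rest of the talk is about avoiding that.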
VLDB '06
Related Work

TA [Fagin et al. '96]
  - Deterministic stopping condition
  - Always the correct top-k set
PREFER [Hristidis et al. '01]
  - Stores multiple copies of base relation R
  - Utilizes only one
We complement existing approaches
Motivation




Query answering using views
  - Space-performance tradeoff
  - Improved efficiency
Can we exploit the same tradeoffs for top-k query answering?
Problem Statement
Ranking Views: Materialized results of previously
asked top-k queries
Problem: Can we answer new ad-hoc top-k queries
efficiently using ranking views?
$f_Q = 3X_1 + 2X_2 + 5X_3$, $f_{V_1} = 2X_1 + 5X_2$, $f_{V_2} = X_2 + 4X_3$

R
tid  X1  X2  X3
1    82   1  59
2    53  19  83
3    29  99  15
4    80  45   8
5    28  32  39

V1
tid  Score
3    553
4    385
5    216
2    201
1    169

V2
tid  Score
2    351
1    237
5    188
3    159
4     77
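A ranking view can be pictured as the base relation re-sorted by the view's scoring function. The sketch below materializes V1 and V2 from R under that reading; `materialize_view` is a hypothetical helper name, and the views here keep every tuple rather than only a top-k prefix.

```python
# Relation R from the slide: tid -> (X1, X2, X3)
R = {
    1: (82, 1, 59),
    2: (53, 19, 83),
    3: (29, 99, 15),
    4: (80, 45, 8),
    5: (28, 32, 39),
}

def materialize_view(relation, weights):
    """Rank every tuple by a linear scoring function, best first."""
    scored = [(tid, sum(w * x for w, x in zip(weights, vals)))
              for tid, vals in relation.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

V1 = materialize_view(R, (2, 5, 0))  # f_V1 = 2*X1 + 5*X2
V2 = materialize_view(R, (0, 1, 4))  # f_V2 = X2 + 4*X3

print(V1)                      # [(3, 553), (4, 385), (5, 216), (2, 201), (1, 169)]
print([tid for tid, _ in V2])  # [2, 1, 5, 3, 4]
```

Each view supports sorted access by walking its list from the front; the new query Q has a different ranking, so neither view answers it directly.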
Outline


LPTA Algorithm
View Selection Problem




Cost Estimation Framework
View Selection Algorithms
Experimental Evaluation
Conclusions
LPTA - Setting

Linear additive scoring functions, e.g. $f_Q = 3X_1 + 2X_2 + 5X_3$
Set of Views:
  - Materialized result of a previously executed top-k query
  - Arbitrary subset of attributes
  - Sorted access on pairs (tid, score_Q(tid))
Random access on the base table R


LPTA - Example
Top-1 query Q over R(X1, X2)
[Figure: the unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) and axes X1, X2. Views V1 and V2 are read by sorted access as (tid, s) pairs; the query Q and the stopping condition are marked.]
LPTA
Linear Programming adaptation of TA
R(X1, X2)
$f_{V_1} = 2X_1 + 5X_2$, $f_{V_2} = X_1 + 2X_2$
Q: $f_Q = 3X_1 + 10X_2$
V1 and V2 store (tid, Score) pairs; the d-th iteration reads $(tid_d^1, s_d^1)$ from V1 and $(tid_d^2, s_d^2)$ from V2.
unseenmax = $\max(f_Q)$ subject to
  $0 \le X_1, X_2 \le 1$
  $2X_1 + 5X_2 \le s_d^1$
  $X_1 + 2X_2 \le s_d^2$
Stopping condition: unseenmax $\le$ topkmin
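The loop on this slide can be sketched in Python for the two-dimensional case. This is an illustration under stated assumptions, not the authors' implementation: the linear program is solved by brute-force vertex enumeration instead of an LP solver, each view is assumed to rank all of R, and the data in `R` and the names `max_linear_2d` and `lpta_2d` are made up for the example. Only the view and query weights come from the slide.

```python
from itertools import combinations

def max_linear_2d(objective, constraints):
    """Maximize objective . (x, y) subject to a . (x, y) <= b for every (a, b).

    Brute force: a bounded 2-D linear program attains its maximum at a vertex,
    i.e. at the intersection of two constraint boundaries.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(constraints, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel boundaries never meet in a single point
        x = (b1 * a2[1] - b2 * a1[1]) / det
        y = (a1[0] * b2 - a2[0] * b1) / det
        if all(a[0] * x + a[1] * y <= b + 1e-9 for a, b in constraints):
            value = objective[0] * x + objective[1] * y
            best = value if best is None else max(best, value)
    return best

def score(weights, point):
    return weights[0] * point[0] + weights[1] * point[1]

def lpta_2d(relation, view_weights, q, k):
    """Return (top-k list, sorted accesses per view) for query weights q."""
    # Each view: all tuples ranked by the view's scoring function, best first.
    views = [
        (w, sorted(((tid, score(w, p)) for tid, p in relation.items()),
                   key=lambda pair: -pair[1]))
        for w in view_weights
    ]
    box = [((1, 0), 1), ((0, 1), 1), ((-1, 0), 0), ((0, -1), 0)]  # 0 <= X1, X2 <= 1
    seen = {}
    top = []
    for d in range(len(relation)):
        last = []
        for w, ranked in views:
            tid, s = ranked[d]    # sorted access on the view
            last.append((w, s))   # unseen tuples satisfy f_V <= s
            if tid not in seen:
                seen[tid] = score(q, relation[tid])  # random access on R
        top = sorted(seen.items(), key=lambda pair: -pair[1])[:k]
        unseen_max = max_linear_2d(q, box + last)
        if len(top) == k and unseen_max <= top[-1][1] + 1e-9:
            return top, d + 1     # stopping condition: unseenmax <= topkmin
    return top, len(relation)

# Hypothetical data with attributes in [0, 1]; weights are the ones on the slide.
R = {1: (0.9, 0.2), 2: (0.3, 0.9), 3: (0.5, 0.5), 4: (0.1, 0.3), 5: (0.8, 0.6)}
top, depth = lpta_2d(R, [(2, 5), (1, 2)], q=(3, 10), k=1)
print(top[0][0], depth)  # 2 2
```

On this toy input the first iteration cannot stop, because an unseen tuple could still score up to 10.15 while the best seen score is 9.9. After the second sorted access the bound drops to 9.2 and tid 2 is returned.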
LPTA - Example (cont’)
Top-1 query Q over R(X1, X2)
[Figure: the unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) and axes X1, X2. Views V1 and V2 are shown with their sorted-access entries (tid, s); the query Q and the stopping condition are marked.]
View Selection Problem


Given a collection of views $V = \{V_1, \dots, V_r\}$ and a query Q, determine the most efficient subset $U \subseteq V$ to execute Q on.
Conceptual discussion
  - Two dimensions
  - Higher dimensions
View Selection - 2d
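[Figure: the unit square with corners O = (0,0), P = (1,0), R = (1,1), T = (0,1) and axes X, Y, showing the query Q, the views V1 and V2, the min top-k tuple, and the points A, A1, B, B1 and M.]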
View Selection - Higher d

Theorem: If $V = \{V_1, \dots, V_r\}$ is a set of views for an m-dimensional dataset and Q a query, the optimal execution of LPTA requires a subset of views $U \subseteq V$ such that $|U| \le m$.
Question: How do we select the optimal subset of views?
Cost Estimation Framework


What is the cost of running LPTA when a specific set of views is used to answer a query?
Cost = number of sequential accesses
[Figure: views V1 and V2, query Q, the min top-k tuple, and points A and B; Cost = 6 sequential accesses.]
Can we find that cost without actually running LPTA?
Simulation of LPTA on Histograms
HQ: approximates the score distribution of the query Q
1. Use HQ to estimate the score of the k-th highest tuple (topkmin).
2. Simulate LPTA in a bucket-by-bucket lock step to estimate the cost.
[Figure: histograms HQ, HV1 and HV2, each with b buckets and n/b tuples per bucket; topkmin and the resulting cost are marked.]
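Step 1 can be illustrated with a small Python sketch. It assumes an equi-depth histogram (b buckets holding n/b tuples each, as on the slide) and, as an additional assumption not stated in the slides, that scores are spread uniformly inside a bucket. The function names are hypothetical.

```python
def equi_depth_histogram(scores, b):
    """Split the scores into b buckets of equal size, best bucket first.

    Returns (buckets, per_bucket), where buckets[i] = (highest, lowest) score
    in bucket i. Assumes b divides the number of scores.
    """
    ordered = sorted(scores, reverse=True)
    per_bucket = len(ordered) // b
    buckets = [(ordered[i * per_bucket], ordered[(i + 1) * per_bucket - 1])
               for i in range(b)]
    return buckets, per_bucket

def estimate_topkmin(buckets, per_bucket, k):
    """Estimate the k-th highest score from the histogram alone."""
    index = (k - 1) // per_bucket           # bucket that contains rank k
    high, low = buckets[index]
    rank_in_bucket = (k - 1) % per_bucket   # 0 = best tuple of the bucket
    if per_bucket == 1:
        return high
    # Assumption: linear interpolation between the bucket boundaries.
    return high - (high - low) * rank_in_bucket / (per_bucket - 1)

# Toy score distribution: the integers 1..100 in 10 buckets of 10 tuples.
buckets, per_bucket = equi_depth_histogram(range(1, 101), 10)
print(estimate_topkmin(buckets, per_bucket, 5))  # 96.0
```

Step 2 would then advance through the view histograms one bucket at a time, using each bucket's lower boundary as the last score seen by sorted access, until the estimated unseen maximum falls to topkmin.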
View Selection Algorithms


Exhaustive (E): Check all possible $\binom{r}{p}$ subsets of size $p$, for each $p \le m$.
Greedy (SV): Keep expanding the set of
views to use until the estimated cost
stops reducing.
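Both strategies can be sketched in a few lines of Python once a cost estimator is available. Here `estimate_cost` is a hypothetical callable standing in for the cost estimation framework, the toy cost table is invented for illustration, and capping the greedy search at m views is an assumption drawn from the theorem on subset size.

```python
from itertools import combinations

def select_exhaustive(views, m, estimate_cost):
    """Try every subset of size p for p = 1..m and keep the cheapest."""
    candidates = (subset
                  for p in range(1, m + 1)
                  for subset in combinations(views, p))
    return min(candidates, key=estimate_cost)

def select_greedy(views, m, estimate_cost):
    """Grow the set one view at a time while the estimated cost keeps dropping."""
    chosen, best_cost = (), float("inf")
    remaining = list(views)
    while remaining and len(chosen) < m:
        view = min(remaining, key=lambda v: estimate_cost(chosen + (v,)))
        cost = estimate_cost(chosen + (view,))
        if cost >= best_cost:
            break  # adding another view no longer reduces the estimate
        chosen, best_cost = chosen + (view,), cost
        remaining.remove(view)
    return chosen

# Invented cost table: estimated sequential accesses per subset of views.
TOY_COSTS = {
    ("V1",): 10, ("V2",): 8, ("V3",): 12,
    ("V1", "V2"): 5, ("V1", "V3"): 4, ("V2", "V3"): 9,
}

def toy_cost(subset):
    return TOY_COSTS[tuple(sorted(subset))]

views = ("V1", "V2", "V3")
print(select_exhaustive(views, 2, toy_cost))  # ('V1', 'V3')
print(select_greedy(views, 2, toy_cost))      # ('V2', 'V1')
```

In this toy table the greedy choice costs 5 while the exhaustive search finds a subset costing 4, which illustrates the usual trade: greedy evaluates far fewer subsets but may miss the best one.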




Select Views Spherical (SVS)
Requires the solution of a single linear
program.
[Figure: the unit square with corners (0,0), (1,0), (0,1), showing Q, T, the objective $\max(f_Q)$, a constraint on $f_{V_j}$ and $s$, and the selected views.]
Select Views By Angle (SVA)
Select Views By Angle (SVA): Sort the views by
increasing angle with respect to Q.
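[Figure: the unit square with corners (0,0), (1,0), (0,1), showing Q and the views V1, V2, V3, V4 with their angles to Q, ordered from smallest (V1) to largest (V4); the selected views are marked.]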
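The angle ordering can be sketched as follows, treating each scoring function as its weight vector. Keeping only the first m views after sorting is an assumption based on the theorem on subset size; the view weights for V3 and V4 and the function names are invented for the example.

```python
import math

def angle(u, v):
    """Angle in radians between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def select_views_by_angle(views, q, m):
    """views: name -> weight vector. Return the m names closest in angle to q."""
    ranked = sorted(views, key=lambda name: angle(views[name], q))
    return ranked[:m]

q = (3, 10)  # f_Q = 3*X1 + 10*X2
views = {"V1": (2, 5), "V2": (1, 2), "V3": (1, 0), "V4": (0, 1)}
print(select_views_by_angle(views, q, 2))  # ['V1', 'V2']
```

Unlike the exhaustive and greedy strategies, this ordering needs no cost estimates, only the weight vectors.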
General Queries and Views

Views that materialize their top-k tuples
  - Truncate the view histograms.
Accommodating range conditions
  - Select the views that cover the range conditions.
  - Truncate each attribute's histogram.
Experiments


Datasets (Uniform, Zipf, Real)
Experiments:
  - Performance comparison of LPTA, PREFER and TA
  - Accuracy of the cost estimation framework
  - Performance of LPTA using each of the view selection algorithms
  - Scalability of the LPTA algorithm
Performance comparison of LPTA, PREFER and TA
[Plots: real dataset, 2d; uniform dataset, 3d]
Cost Estimation Accuracy
[Plots, 2d: buckets = 0.5% of n; buckets = 1% of n]
Performance of LPTA using View Selection Algorithms
[Plots: (2d); 500K tuples, top-100 (3d)]
Scalability Experiments on LPTA
[Plots: 2d, uniform dataset; 500K tuples, top-100]
Conclusions




Using views for top-k query answering
LPTA: linear programming adaptation of TA
View selection problem, cost estimation framework, view selection algorithms
Experimental evaluation
(Thank You!)
Questions?