Optimal Aggregation Algorithms for Middleware
Ronald Fagin, Amnon Lotem, Moni Naor (PODS 2001)
All right reserved by Xuehua Shen [email protected]
Problem: Rank Aggregation

Each object is scored under m different criteria, giving m sorted
lists, one per criterion

Combined score is calculated by an aggregation
function

Problem: find top-k objects with highest combined
scores
Example: Rank Aggregation

Sorted list by mileage score:
carID  Mileage score
c      1.0
a      0.8
e      0.6
b      0.5
d      0.5

Sorted list by year score:
carID  Year score
a      0.9
b      0.7
c      0.7
d      0.7
e      0.5

Sorted list by price score:
carID  Price score
d      1.0
e      0.9
b      0.8
c      0.7
a      0.6

Combined score (e.g. weighted sum) = 0.2 * mileage score
+ 0.3 * year score + 0.5 * price score

Top 2 cars:
carID  score
d      0.81
c      0.76

Do we need to access all entries of all sorted lists?
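As a check on the example, the winning combined scores can be recomputed from the three sorted lists with the weights 0.2, 0.3 and 0.5. This is only a worked sketch; the symbol t for the combined score is borrowed from the later slides.

```latex
\begin{align*}
t(d) &= 0.2(0.5) + 0.3(0.7) + 0.5(1.0) = 0.81 \\
t(c) &= 0.2(1.0) + 0.3(0.7) + 0.5(0.7) = 0.76 \\
t(a) &= 0.2(0.8) + 0.3(0.9) + 0.5(0.6) = 0.73
\end{align*}
```

The remaining cars score lower still (e: 0.72, b: 0.71), so d and c are the top 2.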
Applications

Multimedia database system

Web search query

[Figure, from the Zhang2002 talk: the query Color='red' and Shape='round' goes to a rank aggregation engine, which reads one sorted list for color (Color = 'red') and one for shape (Shape = 'round') and returns the top k.]
Outline

Assumptions

Fagin Algorithm

Threshold Algorithm

Summary & Comments
Assumption 1: Modes of Access
Sequential access: obtain the score of the next object in a sorted
list, proceeding from the current position
Random access: obtain the score of a given object in a sorted list
with a single access

carID  Year score
a      0.8
c      0.8
e      0.7
…

Assumption: both access modes are available
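The two access modes can be pictured as two methods on one list object. The sketch below is illustrative only: the class name `ScoredList`, its methods, and the access counters are hypothetical, and the data is just the three visible rows of the Year list on this slide.

```python
class ScoredList:
    """One sorted list supporting both access modes (hypothetical helper)."""

    def __init__(self, entries):
        # Descending score order serves sequential access.
        self._entries = sorted(entries, key=lambda e: e[1], reverse=True)
        # A dictionary keyed by object id serves random access.
        self._index = dict(entries)
        self._position = 0
        self.sequential_accesses = 0
        self.random_accesses = 0

    def next_entry(self):
        """Sequential access: next (object, score) pair, or None at the end."""
        if self._position >= len(self._entries):
            return None
        entry = self._entries[self._position]
        self._position += 1
        self.sequential_accesses += 1
        return entry

    def lookup(self, obj):
        """Random access: score of a given object in this list."""
        self.random_accesses += 1
        return self._index[obj]


# Only the rows shown on the slide; the elided rows are left out.
year = ScoredList([("a", 0.8), ("c", 0.8), ("e", 0.7)])
first = year.next_entry()    # sequential access from the top
e_score = year.lookup("e")   # random access by object id
```

Counting the two kinds of access separately matters later, because the cost of FA and TA is stated in terms of these counts.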
Assumption 2: Aggregation Function

Each object gets a score in the interval [0, 1] from each subsystem

An aggregation function combines these scores into one combined
score, e.g. min, avg

Monotone: $f(x_1, x_2, \ldots, x_m) \le f(y_1, y_2, \ldots, y_m)$ if $x_i \le y_i$ for
every $i$
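Monotonicity can be spot-checked numerically. The sketch below tests the definition on a small grid of score vectors; the function name `is_monotone_on_grid` and the grid resolution are assumptions made for illustration, and passing the check is evidence rather than a proof.

```python
from itertools import product


def is_monotone_on_grid(f, m, steps=4):
    """Check f(x) <= f(y) whenever x <= y componentwise, on a grid in [0, 1]^m."""
    grid = [i / steps for i in range(steps + 1)]
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(a <= b for a, b in zip(x, y))
            if dominated and f(x) > f(y):
                return False  # found a counterexample
    return True


def average(scores):
    return sum(scores) / len(scores)


min_ok = is_monotone_on_grid(min, 2)
avg_ok = is_monotone_on_grid(average, 3)
# A difference of scores is not monotone: raising the second score lowers it.
diff_ok = is_monotone_on_grid(lambda s: s[0] - s[1], 2)
```

Here `min` and `average` pass, while the score difference fails, which is why it would not be a valid aggregation function under this assumption.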
Intuition of Algorithms

Objects near the top of an individual sorted list are good
candidates for the overall top-k

Do some accesses, and think “Can we stop now?”
Fagin Algorithm
carID  Price score
a      0.9
c      0.8
e      0.7
…

carID  Mileage score
b      1.0
e      0.8
f      0.7
…

carID  Year score
a      0.8
c      0.8
e      0.7
…

'e' appears in all three lists, so the top-1 object must be in {a, b, c, e, f}. Why?
The aggregation function is monotone, and any object not yet seen scores
no higher than 'e' in every list, so 'e' blocks all objects below.
Do random access for these 5 objects to get their missing scores and pick
the top-1.
We cannot say that 'e' must be the top-1: the other seen objects can still
have a higher combined score.
Drawbacks of Fagin Algorithm

Uses only the order of the sorted lists and the monotone property,
not the score values themselves, to decide when to stop

Has to remember every object seen: large buffer size
Threshold Algorithm (TA)
Intuition: the combined score computed by the aggregation function provides
some extra information: an upper bound (or threshold) on the combined score
of unseen objects.
When object R is seen under sequential access, immediately do
random access to get all other scores of R and compute its
combined score.
At the same time, keep track of the upper bound for the unseen objects:
the aggregation function applied to the last score seen in each list.
Halt when at least k objects have combined scores no less
than the upper bound.
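The steps of TA can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `threshold_algorithm` is hypothetical, random access is simulated with dictionaries, every list is assumed to contain the same objects, and the data is the car example from the earlier slide with its 0.2/0.3/0.5 weighted sum.

```python
def threshold_algorithm(lists, aggregate, k):
    """Return (top-k as [(object, score)], rows read per list)."""
    random_access = [dict(lst) for lst in lists]  # simulated random access
    best = {}  # object -> combined score; never more than k entries
    depth = -1
    for depth in range(min(len(lst) for lst in lists)):
        last_seen = []
        for lst in lists:
            obj, score = lst[depth]  # sequential access
            last_seen.append(score)
            if obj in best:
                continue
            combined = aggregate([ra[obj] for ra in random_access])
            if len(best) < k:
                best[obj] = combined
            else:
                worst = min(best, key=best.get)
                if combined > best[worst]:
                    del best[worst]
                    best[obj] = combined
        # Upper bound on the combined score of any unseen object.
        threshold = aggregate(last_seen)
        if len(best) == k and min(best.values()) >= threshold:
            break
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked, depth + 1


MILEAGE = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
YEAR = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
PRICE = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


ta_ranked, ta_rows = threshold_algorithm([MILEAGE, YEAR, PRICE], weighted_sum, 2)
```

On this data the thresholds after rows 1, 2 and 3 are 0.97, 0.82 and 0.73, so the run stops after three rows with d and c in the buffer. The buffer `best` holds at most k objects, which is the constant-size property claimed for TA.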
TA: Example (k = 1, AVG aggregation)

carID  Price score
a      0.9
c      0.8
e      0.7
…

carID  Mileage score
b      1.0
e      0.8
f      0.7
…

carID  Year score
a      0.8
c      0.8
e      0.7
…

Upper bound after row 1: avg(0.9, 1.0, 0.8) = 0.9
Upper bound after row 2: avg(0.8, 0.8, 0.8) = 0.8
Constant-size buffer (best combined score so far): 0.77, then 0.8

Step 1: sequential access to 'a' price score (0.9), then random access to 'a'
mileage score (0.6) and year score (0.8); avg is 0.77
Step 2: sequential access to 'b' mileage score (1.0), then random access to 'b'
price score (0.7) and year score (0.7); avg is 0.8
After row 2 the buffered score of 'b' (0.8) is no less than the upper bound (0.8),
so TA can halt with 'b' as the top-1.
Evaluation of TA

TA never stops later than FA

TA requires only a small, constant-size buffer (k objects)

However, TA may perform more random accesses than FA
Summary

FA and TA assume both sequential access and
random access

TA extends to other situations:
Approximate algorithm
No random access
Comments



Relies on objects having the same identifier across the different
lists
The assumptions are not always valid,
e.g. not every sorted list exists beforehand
Scheduling sequential accesses wisely could speed up TA on skewed
data
Backup Slides
Middleware


Middleware functions as a translation layer: it handles all
incoming requests (such as top-k queries) and replies,
interacting with the disparate back-office systems to
gather the information it needs.
Application developers do not need to know that there are several
heterogeneous systems behind the middleware.
Boolean Query Vs. Fuzzy Query

Semantics



Get all the results that satisfy the conditions vs. get the best
possible answers to the query
Size of result: constant vs. variable
Processing the query

For a Boolean query, whether a tuple belongs to the result can be decided
from the tuple alone, so each tuple can be processed individually.
For a fuzzy query it cannot: whether a tuple is in the result depends on
how its score compares with the scores of the other tuples.
Fuzzy Query Processor
(from Zhang02)

[Figure: a Boolean query processor answers a set query such as Title='database' and Price < 100 against a traditional database and returns a set. A fuzzy query processor answers a top-k query such as Color='red' and Shape='round' against a database with fuzzy data, reading one sorted list for color and one for shape.]
Cost




Goal: reduce the number of sequential accesses (Cs)
The number of random accesses is at most m-1 times the number of
sequential accesses
The overall cost is therefore within a constant factor of Cs
Really optimal?
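The bound on this slide can be written out in one line. The notation is an assumption made for this sketch: s is the number of sequential accesses (the slide's Cs), r the number of random accesses, and c_S, c_R the cost of one access of each kind.

```latex
\begin{align*}
\text{cost} &= s\,c_S + r\,c_R, \qquad r \le (m-1)\,s \\
\text{cost} &\le s\,\bigl(c_S + (m-1)\,c_R\bigr)
\end{align*}
```

With m, c_S and c_R fixed, the factor in parentheses is a constant, which is why reducing the number of sequential accesses bounds the whole cost.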
Approximation Algorithm


Approximate top-k answers are acceptable or even
desirable
θ-approximation (θ > 1)

For any object y in the answer and any object z not in the answer:
$\theta\, t(y) \ge t(z)$

Turning TA into an approximation algorithm

Halt as soon as the top k objects seen so far satisfy the inequality
with the current upper bound (threshold) in place of t(z)
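The relaxed stopping rule amounts to a one-line test. The sketch below is illustrative: the function name `can_halt` and its parameters are hypothetical, and the sample numbers are taken from running exact TA on the car example (weights 0.2/0.3/0.5), where after two rows the buffer holds 0.81 and 0.76 and the threshold is 0.82.

```python
def can_halt(buffer_scores, threshold, k, theta=1.0):
    """True when k buffered objects satisfy theta * score >= threshold.

    theta = 1.0 gives the exact TA stopping rule; theta > 1 relaxes it.
    """
    if len(buffer_scores) < k:
        return False
    top_k = sorted(buffer_scores, reverse=True)[:k]
    return all(theta * score >= threshold for score in top_k)


exact_stop = can_halt([0.81, 0.76], 0.82, k=2)             # exact rule
approx_stop = can_halt([0.81, 0.76], 0.82, k=2, theta=1.1)  # relaxed rule
```

With θ = 1.1 the run may stop one row earlier than exact TA on this data, at the price of a guarantee that holds only up to the factor θ.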
Non Random Access (NRA)


Similar to TA, except that, without random access:
- Exact combined scores may not be known
- The answers may not come in sorted order
- Only a lower bound and an upper bound are kept for each seen object
Do sequential access until there are k objects whose lower
bounds are no less than the upper bounds of all other objects
NRA cont.


Lower bound: use 0 for each score not yet seen
Upper bound: use the last score seen in that list for each score not yet seen

carID  Price score
a      0.9
c      0.8
e      0.7
…

carID  Mileage score
b      1.0
e      0.8
f      0.7
…

carID  Year score
a      0.8
c      0.8
e      0.7
…
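The bookkeeping with lower and upper bounds can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's pseudocode: the name `nra_top_k` is hypothetical, every list is assumed to hold the same objects, and the data is the complete car example with its 0.2/0.3/0.5 weighted sum, since the lists on this slide are truncated.

```python
def nra_top_k(lists, aggregate, k):
    """Return (top-k object ids, rows read per list) using sequential access only."""
    m = len(lists)
    known = {}  # object -> {list index: score seen so far}
    rows = min(len(lst) for lst in lists)
    ranked = []
    for depth in range(rows):
        for i, lst in enumerate(lists):
            obj, score = lst[depth]  # sequential access
            known.setdefault(obj, {})[i] = score
        bottom = [lst[depth][1] for lst in lists]  # last score seen per list

        def lower(obj):  # missing scores replaced by 0
            return aggregate([known[obj].get(i, 0.0) for i in range(m)])

        def upper(obj):  # missing scores replaced by the last score seen
            return aggregate([known[obj].get(i, bottom[i]) for i in range(m)])

        ranked = sorted(known, key=lower, reverse=True)
        if len(ranked) < k:
            continue
        top, rest = ranked[:k], ranked[k:]
        # Objects never seen at all are bounded by aggregate(bottom).
        best_other = max([aggregate(bottom)] + [upper(obj) for obj in rest])
        if lower(top[-1]) >= best_other:
            return top, depth + 1
    return ranked[:k], rows


MILEAGE = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
YEAR = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
PRICE = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


nra_top, nra_rows = nra_top_k([MILEAGE, YEAR, PRICE], weighted_sum, 2)
```

On this tiny dataset the bounds only separate after all five rows are read, so the sketch shows the bookkeeping rather than any saving. Note that `known` keeps partial scores for every seen object, which is the large buffer attributed to NRA.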
NRA example

Advantage: R1(1,0), others(1/3,1/3) Top 1

Top 2 vs. Top 1: R1(1,0),R2(1,1/4),others(1/3,1/3) Top 2

Lots of Bookkeeping
Optimality of FA

Assumption
t is monotone

Cost
$\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability

Optimality
Every algorithm that correctly finds the top k answers for a strict
monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are
independent, and that makes no wild guesses has cost $\Omega(N^{(m-1)/m} k^{1/m})$
with arbitrarily high probability
FA is optimal among all such algorithms in the high-probability sense
Optimality of TA

Assumption
t is monotone

Instance Optimality
- For any algorithm C that correctly finds the top k
answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without
wild guesses, and any database D:
$\mathrm{cost}(TA, D) = O(\mathrm{cost}(C, D))$
- TA is instance optimal among all such algorithms
Optimality of NRA


Assumption
- t is monotone
Instance Optimality

NRA is instance optimal among all algorithms that correctly find the top k
objects for a monotone query t on every database and make no random accesses
Algorithm Comparison
(from Zhang2002 talk)

Algorithm  Assumption  Access Model    Termination (worst case)  Termination (expected)   Buffer Space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$    N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution  k
NRA        Monotone    Sorted          N                         Depends on distribution  N
Worst Case
[Figure: worst-case database over objects O1, O2, ..., On+1, On+2, On+3, ..., O2n+1, with scores of 0.0 and 1.0 in each sorted list.]
Aggregation Function: min
n(m-1)/m + k/m
Naïve algorithm
Algorithm:
- For each criterion, do sequential access to retrieve all objects and their scores
- Calculate the combined scores of all objects
- Pick the top k
Comments:
- Accesses the entire database
- Cost is linear in the database size
- Does NOT use the fact that each list is sorted
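The naive algorithm fits in a few lines. The sketch below is only an illustration: the function name `naive_top_k` is hypothetical, and the data is the car example from the earlier slide with its 0.2/0.3/0.5 weighted sum.

```python
import heapq


def naive_top_k(lists, aggregate, k):
    """Read every entry of every list, then rank all objects."""
    scores = {}  # object -> per-list scores
    for i, sorted_list in enumerate(lists):
        for obj, score in sorted_list:  # full sequential scan
            scores.setdefault(obj, [0.0] * len(lists))[i] = score
    combined = {obj: aggregate(s) for obj, s in scores.items()}
    return heapq.nlargest(k, combined.items(), key=lambda item: item[1])


MILEAGE = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
YEAR = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
PRICE = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


naive_top = naive_top_k([MILEAGE, YEAR, PRICE], weighted_sum, 2)
```

The result agrees with the example slide (d, then c), but all 15 entries are read, which is the baseline that FA and TA try to beat.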
Fagin Algorithm
Algorithm:
Do sequential access in parallel on all sorted lists Li
until there are k “matches”.
A “match” is an object that has been seen in all sorted lists Li.
Then, for each object that has been seen, do random access to get
all of its missing scores.
Compute the combined scores and pick the top k.
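The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fagin_algorithm` is hypothetical, random access is simulated with dictionaries, every list is assumed to hold the same objects, and the data is the car example with its 0.2/0.3/0.5 weighted sum.

```python
def fagin_algorithm(lists, aggregate, k):
    """Return (top-k as [(object, score)], rows read per list, objects buffered)."""
    m = len(lists)
    random_access = [dict(lst) for lst in lists]  # simulated random access
    seen_in = {}  # object -> indices of the lists where it has been seen
    matches = 0
    depth = 0
    rows = min(len(lst) for lst in lists)
    # Phase 1: parallel sequential access until k objects are seen in all lists.
    while matches < k and depth < rows:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                matches += 1
        depth += 1
    # Phase 2: random access for every seen object, then rank.
    combined = {
        obj: aggregate([ra[obj] for ra in random_access]) for obj in seen_in
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], depth, len(seen_in)


MILEAGE = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
YEAR = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
PRICE = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


fa_top, fa_rows, fa_buffered = fagin_algorithm(
    [MILEAGE, YEAR, PRICE], weighted_sum, 2
)
```

On this data FA reads four rows per list and buffers all five cars before it can answer, which illustrates the large-buffer drawback noted for FA.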