Composite Subset Measures
Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran
University of Wisconsin - Madison
Raghu Ramakrishnan
Yahoo! Research and University of Wisconsin – Madison
09.12.2006

Motivation

Consider this query:
  “For each year and each country, compute the ratio of the average personal incomes between the richest city and the poorest city. Then find the number of countries where this ratio continuously decreases between 1990 and 2000.”
It is
  Hard to write in SQL
  Hard to optimize/understand the SQL query
This kind of query is increasingly common:
  Multi-step aggregation
  Must scale to very large datasets, often distributed

Contributions

A new framework for expressing such compositional aggregate queries
  Key contribution is how we look at the computation, in terms of aggregating over related regions in “cube space”
An efficient evaluation framework based on sorted scans that takes multiple aggregation steps into account
Experimental results

Background

Computing “measures”
  Measures summarize some characteristic of data subsets (e.g., SUM, std dev, beta-value of a portfolio)
  Approaches: Group by, data cubes, Hancock, Sawzall
Cube space
  Partition feature space using attribute values; domain hierarchies organize this space into nested collections of regions
  Regions: (2006, Korea), (2006/09, Seoul)
  Region sets: (Year, Country), (Month, City)

Composite Subset Measures

The measure of a cube region is computed by:
  Aggregating data in a region directly (e.g., sales volumes for each day), or
  Summarizing the measures for related regions, e.g.:
    The maximum of daily volumes within a year
    The ratio of average personal incomes between the richest and poorest cities in a country

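The two ways of computing a measure can be sketched in a few lines of Python. This is only an illustration under assumed data: the record layout `(year, country, city, personal_income)` and the sample values are hypothetical, not taken from the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fact records: (year, country, city, personal_income).
records = [
    (2006, "Korea", "Seoul", 30.0),
    (2006, "Korea", "Seoul", 50.0),
    (2006, "Korea", "Busan", 20.0),
    (2006, "Korea", "Busan", 20.0),
    (2006, "Japan", "Tokyo", 60.0),
    (2006, "Japan", "Osaka", 30.0),
]

# Step 1: aggregate the data in each (year, country, city) region directly.
incomes = defaultdict(list)
for year, country, city, income in records:
    incomes[(year, country, city)].append(income)
avg_income = {region: mean(vals) for region, vals in incomes.items()}

# Step 2: summarize the measures of the child regions (cities)
# for each parent region (year, country).
city_avgs = defaultdict(list)
for (year, country, _city), avg in avg_income.items():
    city_avgs[(year, country)].append(avg)
income_ratio = {
    region: max(avgs) / min(avgs) for region, avgs in city_avgs.items()
}
```

The second step never touches the raw records; it only reads the measure produced by the first step, which is what makes the measure composite.
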
What is “Related” in Cube Space

Focus on relationships which
  are commonly used
  can be efficiently evaluated
Self
Parent/Child
  E.g., Year/Day
Child/Parent
  E.g., Day/Year
Sibling
  E.g., Today/Tomorrow

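As a rough illustration of these relationships on a time hierarchy, the sketch below models Day regions as `datetime.date` values and Year regions as integers. The function names are hypothetical and chosen only for this example.

```python
from datetime import date, timedelta

def parent_year(day: date) -> int:
    """Child/Parent: the Year region that contains a Day region."""
    return day.year

def child_days(year: int) -> list[date]:
    """Parent/Child: the Day regions nested inside a Year region."""
    first = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - first).days
    return [first + timedelta(days=i) for i in range(n_days)]

def next_sibling(day: date) -> date:
    """Sibling: the neighboring Day region (Today/Tomorrow)."""
    return day + timedelta(days=1)
```

For instance, `parent_year(date(2006, 9, 12))` gives the enclosing year, and `next_sibling` crosses year boundaries without special handling.
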
Examples (Network Analysis)

Data involved:
  Stream of data records for IP packet information
  Time (t), Source (U), Destination (D), Size (s)
Queries:
  For every minute, the number of outgoing packets from each given source IP
  For every hour, the maximum number of minutely outgoing packets from a given source IP

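The two queries can be sketched directly in Python. The packet tuples below are made-up sample data, and measuring time in seconds is an assumption made for the example.

```python
from collections import Counter

# Hypothetical packet records: (t in seconds, source U, destination D, size s).
packets = [
    (5, "10.0.0.1", "10.0.0.9", 100),
    (20, "10.0.0.1", "10.0.0.9", 200),
    (70, "10.0.0.1", "10.0.0.8", 50),
    (3700, "10.0.0.1", "10.0.0.9", 80),
    (30, "10.0.0.2", "10.0.0.9", 40),
]

# Query 1: for every minute, the number of outgoing packets per source IP.
per_minute = Counter((t // 60, src) for t, src, _dst, _size in packets)

# Query 2: for every hour, the maximum of the minutely counts per source IP.
# It is computed from the result of query 1, not from the raw packets.
per_hour_max = {}
for (minute, src), count in per_minute.items():
    key = (minute // 60, src)
    per_hour_max[key] = max(per_hour_max.get(key, 0), count)
```
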
Expression Algebra

Each measure entity is defined as a collection of region/value pairs
  Regions should belong to the same region set
Fact table: $D$
Aggregation: $g_{G,\,agg}(T)$
Selection: $\sigma_{cond}(T)$
Match join: $S \bowtie_{cond,\,agg} T$
Combine join: $\bowtie_{f_c}(T_1, T_2, \ldots, T_n)$

Example: Aggregation

For every hour and every unique IP, compute the number of outgoing packets
  $S_C \leftarrow g_{(t:hour,\,U:IP),\,count(*)}\,D$

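A minimal sketch of the aggregation operator, assuming the fact table is a list of dictionaries and time is in seconds. The helper name `aggregate` and the sample records are hypothetical.

```python
from collections import defaultdict

def aggregate(records, region_of, agg):
    """Group records by region and apply `agg` to each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[region_of(rec)].append(rec)
    return {region: agg(group) for region, group in groups.items()}

# Hypothetical fact table D (t in seconds).
D = [
    {"t": 10, "U": "a"},
    {"t": 20, "U": "a"},
    {"t": 4000, "U": "a"},
    {"t": 30, "U": "b"},
]

# S_C: count(*) for each (t:hour, U:IP) region.
S_C = aggregate(D, lambda r: (r["t"] // 3600, r["U"]), len)
```

The result is a collection of region/value pairs, here a dictionary keyed by `(hour, ip)`.
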
Example: Selection

For every hour, compute the sum of outgoing packets from those source IPs with at least five packets in that hour (High traffic count)
  $S_S \leftarrow g_{(t:hour),\,count(*)}\,(\sigma_{M \geq 5}\,S_C)$
[Figure: regions laid out by Source and time]

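The sketch below follows the slide's formula literally: keep the (hour, IP) regions whose measure is at least 5, then apply count(*) per hour, which counts the qualifying source IPs. The input values for `S_C` are made up.

```python
from collections import Counter

# Hypothetical S_C: packet count for each (hour, source IP) region.
S_C = {(0, "a"): 7, (0, "b"): 2, (0, "c"): 5, (1, "a"): 4, (1, "b"): 9}

# Selection: keep regions whose measure M is at least 5.
selected = {region: m for region, m in S_C.items() if m >= 5}

# Aggregation: count(*) over the selected regions, grouped by hour.
S_S = Counter(hour for (hour, _ip) in selected)
```

Hours in which no source IP passes the selection do not appear in `S_S` at all.
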
Example: Match

For each six hour time window, compute the average of the high traffic count
  $S_{avg} \leftarrow S_S^1 \bowtie_{(S_S^1.t \,\in\, [S_S^2.t,\; S_S^2.t + 6]),\; avg(S_S^2.M)} S_S^2$
[Figure: $S_S^1$ and $S_S^2$]

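A match join can be sketched as a nested loop that, for each output region, averages the measures of the matching input regions. The sketch assumes the match condition pairs an hour `t1` with every hour `t2` such that `t2 <= t1 <= t2 + 6`; the function name and the sample values are hypothetical.

```python
from statistics import mean

# Hypothetical S_S: high traffic count per hour.
S_S = {0: 2, 1: 4, 2: 6, 8: 10}
WINDOW = 6  # window length in hours

def match_join_avg(S, window):
    """For each hour t1, average S[t2] over all t2 with t2 <= t1 <= t2 + window."""
    out = {}
    for t1 in S:
        matched = [m for t2, m in S.items() if t2 <= t1 <= t2 + window]
        out[t1] = mean(matched)
    return out

S_avg = match_join_avg(S_S, WINDOW)
```

Both sides of the join are the same measure here, so the join relates each hour to its sibling hours.
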
Example: Combine

For each hour, compute the ratio between the six hour average and the high traffic count
  $S_{ratio} \leftarrow S_{avg} \bowtie_{S_{avg}.M \,/\, S_S.M} S_S$
[Figure: $S_{avg}$ and $S_S$]

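A combine join can be sketched as applying a function to the measures that several tables hold for the same region. Keeping only regions present in every input is an assumption of this sketch, and the helper name `combine_join` and the sample values are hypothetical.

```python
# Hypothetical inputs, both defined on the (t:hour) region set.
S_avg = {0: 2.0, 1: 3.0, 2: 4.0}
S_S = {0: 2, 1: 4, 2: 8, 3: 5}

def combine_join(f, *tables):
    """Apply f to the measures that all tables hold for the same region."""
    common = set(tables[0]).intersection(*tables[1:])
    return {
        region: f(*(table[region] for table in tables))
        for region in sorted(common)
    }

# S_ratio: six hour average divided by the high traffic count.
S_ratio = combine_join(lambda avg, count: avg / count, S_avg, S_S)
```
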
Aggregation Workflows

A diagrammatic way to express multiple composite subset measure expressions
  Semantically equivalent to the algebra
  Rectangles: Region sets
  Ellipses: Measures associated with the region sets
  Arcs: Computational dependencies among measures

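Example

[Figure: aggregation workflow with region sets (U:IP, t:hour) and (t:hour), measure Count with aggregation formula count(*), and measure Savg with aggregation formula Avg(Count) and match condition Count.t=Sbase.t. Callouts mark the region set, measure name, aggregation formula, selection condition, and match condition.]
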
Example (cont.)

[Figure: aggregation workflow on region set (U:IP, t:hour) with measures MAXS = max(s), MINS = min(s), and Ratio = MAXS/MINS.]

Multi-step Execution Plan

Evaluation based on the topological order of the aggregation workflow
  Materialize non-dependent measures
  Then evaluate dependent measures following the arcs of the aggregation workflow
  May need to perform join
Problem
  Intermediate measures: extra I/O

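The multi-step plan can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The workflow, the base values, and the `compute` function below are hypothetical stand-ins; in a real system each materialized measure would be written out and read back, which is the extra I/O noted on the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each measure maps to the measures it depends on.
dependencies = {
    "Count": set(),  # computed directly from the fact table
    "MAXS": set(),
    "MINS": set(),
    "Savg": {"Count"},
    "Ratio": {"MAXS", "MINS"},
}

# Made-up values for the non-dependent measures.
base = {"Count": [3, 5, 4], "MAXS": 1500, "MINS": 60}

def compute(measure, inputs):
    """Toy evaluator: look up base measures, derive dependent ones."""
    if measure in base:
        return base[measure]
    if measure == "Savg":
        return sum(inputs["Count"]) / len(inputs["Count"])
    if measure == "Ratio":
        return inputs["MAXS"] / inputs["MINS"]
    raise KeyError(measure)

def evaluate(dependencies, compute):
    """Materialize every measure in topological order of the workflow."""
    materialized = {}
    for measure in TopologicalSorter(dependencies).static_order():
        inputs = {dep: materialized[dep] for dep in dependencies[measure]}
        materialized[measure] = compute(measure, inputs)
    return materialized

order = list(TopologicalSorter(dependencies).static_order())
result = evaluate(dependencies, compute)
```
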
Simple Scan Execution[*]

Build one hash table for each measure
“Insert” data into hash tables of low-level measures
Propagate the measures upwards after the scan is over
Distributive or algebraic aggregation function
Problem
  Each hash table keeps all the entries
  Bottleneck: Memory capacity
[*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179.

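A minimal sketch of the simple scan idea, using Python dictionaries as the hash tables. The record layout, where a day is a `(month, day_of_month)` pair, is an assumption made so that the parent month can be derived from the day.

```python
from collections import Counter

# Hypothetical records: (day, source IP), with day = (month, day_of_month).
records = [
    ((1, 1), "a"), ((1, 1), "a"), ((1, 2), "b"),
    ((2, 10), "a"), ((2, 10), "b"),
]

# One hash table per measure.
count_day_ip = Counter()    # low-level measure on (t:Day, U:IP)
count_day = Counter()       # measure on (t:Day)
count_month_ip = Counter()  # measure on (t:Month, U:IP)

# Single scan: insert raw data into the low-level hash table only.
for day, ip in records:
    count_day_ip[(day, ip)] += 1

# After the scan, propagate upwards. COUNT is distributive, so a parent's
# count is the sum of its children's counts.
for (day, ip), c in count_day_ip.items():
    count_day[day] += c
    count_month_ip[(day[0], ip)] += c
```

Every entry of `count_day_ip` stays in memory until the scan ends, which is the bottleneck the slide points out.
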
Sort/Scan Execution

Simple scan requires large memory
  For each hash table, we need to keep all the entries during the scan
When the data is ordered
  Some hash entries can be flushed out before the scan is finished
  The memory footprint can be reduced
  One pass scan becomes feasible
  CPU cost is reduced

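The effect of ordering can be sketched with a generator that counts records per region over a sorted stream. It is a simplified illustration, not the system's implementation: because the stream is sorted on the region key, an entry is complete as soon as the key changes, so it can be flushed immediately and only one entry is live at a time.

```python
def sort_scan_count(sorted_records, region_of):
    """Count records per region over a stream sorted on the region key.

    Yields (region, count) as soon as a region is finished.
    """
    current, count = None, 0
    for rec in sorted_records:
        region = region_of(rec)
        if region != current:
            if current is not None:
                yield current, count  # flush the finished entry
            current, count = region, 0
        count += 1
    if current is not None:
        yield current, count  # flush the last entry

# Hypothetical records: (day, source IP), initially unsorted.
records = [(2, "a"), (1, "b"), (1, "a"), (1, "a"), (2, "a")]
flushed = list(sort_scan_count(sorted(records), lambda r: r))
```

The flushed entries come out in the same order as the input, so a dependent measure can consume them as another sorted stream.
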
Evaluation

[Figure: raw data sorted by day, spanning month 1 and month 2. Measures: COUNT0 = count(*) on (t:Day), COUNT2 = count(*) on (t:Day, U:IP), COUNT3 = count(*) on (t:Month, U:IP).]
Output stream for each hash table is still ordered!

Evaluation

[Figure: raw data sorted by month, spanning month 1 and month 2. Measures: COUNT0 = count(*) on (t:Day), COUNT2 = count(*) on (t:Day, U:IP), COUNT3 = count(*) on (t:Month, U:IP).]
All the output streams are ordered by month!

Evaluation

[Figure: measure COUNT3 = count(*) on (t:Month, U:IP), with the data stream spanning month 1 and month 2.]
Data are sorted by (t:month, U:IP)
By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint

Order and Slack

Order
  How the records are ordered in the stream
  E.g., <t:day, U:IP>
Slack
  The gap between the output stream of the measure and the scan progress of raw data
  E.g., <t:day:[-3,+3]>
We have developed a mechanism to
  Calculate the order/slack
  Take advantage of the order/slack information during evaluation

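One way to picture slack is a sibling-window measure whose output must lag behind the scan. The sketch below is an illustrative interpretation, not the mechanism from the talk: for each day it sums the values of days in `[day, day + k]`, so an entry can only be flushed once the scan has moved more than `k` days past it. The function name and sample stream are hypothetical.

```python
from collections import deque

def forward_window_sum(day_stream, k):
    """day_stream yields (day, value) with strictly increasing day.

    Emits (day, sum of values for days in [day, day + k]) in day order.
    """
    pending = deque()  # [day, running_sum] entries whose window is still open
    for day, value in day_stream:
        # Entries whose window ended before `day` are final: flush them.
        while pending and pending[0][0] + k < day:
            yield tuple(pending.popleft())
        for entry in pending:
            entry[1] += value
        pending.append([day, value])
    while pending:
        yield tuple(pending.popleft())

stream = [(1, 1), (2, 2), (3, 3), (10, 5)]
windowed = list(forward_window_sum(stream, k=2))
```

The output stays ordered by day and never trails the scan by more than `k` days, so the buffer holds at most `k + 1` entries regardless of the stream length.
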
Evaluation Network

[Figure: evaluation network. Sorted data is scanned into the hash tables of measures M1, M2, M3, and M4. Stream annotations: order key <t:Day, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack <>.]

Optimization

How to find a good sort order?
  Enumerate all possible orders
  For each order estimate the memory usage
  Use sort orders with minimal usage
Evaluation with multiple passes
  What measure to compute during each pass?
  What order to use in each pass?

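The enumerate-and-estimate loop can be sketched as a brute-force search over attribute permutations. The memory estimate used here is a stand-in chosen for the example: it sorts a sample, treats each hash entry as live from its first to its last record, and reports the peak number of live entries. The sample data, measure definitions, and function names are all hypothetical.

```python
from itertools import permutations

def peak_entries(sample, order, measures):
    """Estimate the peak number of live hash entries for one sort order."""
    stream = sorted(sample, key=lambda rec: tuple(rec[a] for a in order))
    first, last = {}, {}
    for pos, rec in enumerate(stream):
        for name, attrs in measures.items():
            entry = (name, tuple(rec[a] for a in attrs))
            first.setdefault(entry, pos)
            last[entry] = pos
    # Sweep over the stream: +1 where an entry opens, -1 after it closes.
    delta = [0] * (len(stream) + 1)
    for entry in first:
        delta[first[entry]] += 1
        delta[last[entry] + 1] -= 1
    peak = live = 0
    for d in delta:
        live += d
        peak = max(peak, live)
    return peak

def best_order(sample, attributes, measures):
    """Enumerate all sort orders and keep the one with the smallest estimate."""
    return min(
        permutations(attributes),
        key=lambda order: peak_entries(sample, order, measures),
    )

# Hypothetical sample: 2 months x 2 days x 3 source IPs.
sample = [
    {"month": m, "day": (m, d), "ip": ip}
    for m in (1, 2) for d in (1, 2) for ip in ("a", "b", "c")
]
measures = {
    "COUNT0": ("day",),
    "COUNT2": ("day", "ip"),
    "COUNT3": ("month", "ip"),
}
chosen = best_order(sample, ("month", "day", "ip"), measures)
```

Enumerating every permutation is only practical for a handful of attributes; the sketch is meant to show the shape of the search, not a scalable optimizer.
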
Experiments

Synthetic data set
64 million records
Scenario 1
  The measures of a region are computed by combining the aggregated measures for different kinds of child region sets
Scenario 2
  The measures of a region are computed by aggregating the measures of multiple chained siblings

Experimental Results (cont.)

[Figure: execution time in seconds (axis from 0 to 9000) versus the number of dependent child measures (2 to 6), for DB and SortScan.]

Experimental Results

[Figure: execution time in seconds (axis from 0 to 3000) versus the size of the sibling chain (2 to 7), for DB and SortScan.]

Conclusions

Composite measures as building blocks for complicated analysis processes
Algebra provides the semantic foundation
Aggregation workflow offers an intuitive interface
Sort/Scan execution plan evaluates multiple dependent measures in the same run
  and hence improves the evaluation performance