Online Computation and Continuous Maintaining of Quantile Summaries
Tian Xia
Database Lab @ CCIS
Northeastern University
April 16, 2004
References

M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001.
X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, pages 362-373, 2004.
Outline of this talk

Quantile Estimation Overview
GK-quantile Summary Algorithm
    Data Structure
    Operations
    Space Complexity Analysis
Sliding Window Model
Problem Definitions

φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉.
Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream.
    Variation: the N most recent elements (sliding window model).
ε-approximate: Let r = ⌈φN⌉. Find an element whose rank lies within the interval [r - εN, r + εN].
Example of A Quantile Query
Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
The 0.5-quantile is the element of rank ⌈0.5 · 16⌉ = 8, which is 8.
A 0.25-approximate 0.5-quantile is any element with rank in [8 - 4, 8 + 4] = [4, 12], i.e. one of the elements in {4,5,6,7,8,9,10}.
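The example above can be checked with a brute-force sketch that sorts the whole sequence. The function names `exact_quantile` and `approximate_answers` are hypothetical helpers introduced only for illustration; they are not part of the GK-algorithm.

```python
import math

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approximate_answers(data, phi, eps):
    """Return every distinct value whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    return sorted({v for rank, v in enumerate(ordered, start=1)
                   if r - eps * n <= rank <= r + eps * n})

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
median = exact_quantile(stream, 0.5)                  # element of rank 8
candidates = approximate_answers(stream, 0.5, 0.25)   # ranks 4..12
```

This approach stores all N elements, which is exactly what a quantile summary is meant to avoid.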
Why Approximation?


Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm that exactly computes the φ-quantile of N data elements in p passes requires Ω(N^(1/p)) space.
Approximate quantile techniques are therefore necessary to achieve sub-linear space.
Quantile Summary

Quantile Summary: A small number of objects from the input data sequence, which can be used (by a quantile estimator) to answer quantile queries.

Other summary methods for large data sets include average, standard deviation, histogram, counting sketch (FM-sketch), etc.
Properties of A Good Quantile Estimator



Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate.
Data independent.
Use as small a memory footprint as possible, including temporary storage.
Previous Work

Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary.

    Space complexity: O((1/ε) log²(εN)).
    It requires advance knowledge of N, the size of the data set, so it does not work in a data stream environment.
Contributions of GK-algorithm


Dynamically adjusts the quantile summary as N, the total number of data elements in the data stream, grows.
Space complexity is reduced to O((1/ε) log(εN)).
Assumptions




A new data element arrives after each unit of time.
n denotes both the number of elements of the data
sequence, as well as the current time.
A data element is represented by its value v.
rmin(v) and rmax(v) denote respectively the lower and
upper bounds on the actual rank r of v among the
elements seen so far.
The Summary Data Structure


GK-algorithm maintains a summary data
structure S=S(n) at any point in time n.
S(n) consists of an ordered (non-decreasing)
sequence of tuples which corresponds to a
subset of the elements seen thus far.
The Summary Data Structure

S = {t0, t1, …, ts-1}, where ti = (vi, gi, Δi).




vi is the value of one of the elements seen so far.
gi = rmin(vi) - rmin(vi-1)
Δi = rmax(vi) - rmin(vi)
v0 and vs-1 always correspond to the
minimum and the maximum elements seen
so far.
The Summary Data Structure

Given gi = rmin(vi) - rmin(vi-1) and Δi = rmax(vi) - rmin(vi),

rmin(vi) = Σ_{j≤i} gj
rmax(vi) = Σ_{j≤i} gj + Δi
gi + Δi - 1 is an upper bound on the total number of elements that may have fallen between vi-1 and vi.
rmin(vs-1) = Σ_{j≤s-1} gj = n.
Example of A Quantile Summary
Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.
For clarity, rewrite the tuples of the above summary in the form ti = (vi, rmin(vi), rmax(vi)) as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.
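The rewriting from (vi, gi, Δi) to (vi, rmin(vi), rmax(vi)) is a running sum over the g-values. A minimal sketch, assuming the summary is held as a Python list of tuples (the name `rank_bounds` is hypothetical):

```python
from itertools import accumulate

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, rmin, rmax) tuples via prefix sums of g."""
    rmins = accumulate(g for _, g, _ in summary)
    return [(v, rmin, rmin + delta) for (v, _, delta), rmin in zip(summary, rmins)]

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
bounds = rank_bounds(summary)
```

Storing g and Δ instead of rmin and rmax means an insertion only touches the new tuple; the rank bounds of later tuples shift implicitly.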
Error Rate?


PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of maxi(gi + Δi)/2.
COROLLARY 1: If at any time n the summary S(n) satisfies the property that maxi(gi + Δi) ≤ 2εn, then we can answer any φ-quantile query to within εn precision.
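The condition in Corollary 1 is easy to test on a concrete summary. The sketch below assumes a summary stored as a list of (v, g, Δ) tuples and recovers n as the sum of the g-values; `max_gap` and `satisfies_corollary` are hypothetical names.

```python
def max_gap(summary):
    """Largest g + delta over all tuples; Proposition 1 bounds the error by half of it."""
    return max(g + delta for _, g, delta in summary)

def satisfies_corollary(summary, eps):
    """Check max_i(g_i + delta_i) <= 2 * eps * n, with n taken as the sum of all g."""
    n = sum(g for _, g, _ in summary)
    return max_gap(summary) <= 2 * eps * n

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
ok_at_025 = satisfies_corollary(summary, 0.25)  # 8 <= 8
ok_at_020 = satisfies_corollary(summary, 0.20)  # 8 <= 6.4 fails
```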
QUANTILE(φ)

QUANTILE(φ): To compute an ε-approximate φ-quantile from the summary S(n) after n data elements, compute the rank r = ⌈φn⌉. Find i such that both r - rmin(vi) ≤ εn and rmax(vi) - r ≤ εn, and return vi.

i.e. r - εn ≤ rmin(vi) ≤ rmax(vi) ≤ r + εn
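The query can be sketched as a single left-to-right scan that accumulates rmin on the fly. This is an illustrative sketch, not the authors' code; the list-of-tuples layout and the names `matching_values` and `quantile` are assumptions.

```python
import math

def matching_values(summary, phi, eps):
    """All v_i with r - eps*n <= rmin(v_i) and rmax(v_i) <= r + eps*n."""
    n = sum(g for _, g, _ in summary)
    r = math.ceil(phi * n)
    matches, rmin = [], 0
    for v, g, delta in summary:
        rmin += g                 # rmin(v_i) is the prefix sum of g
        rmax = rmin + delta
        if r - rmin <= eps * n and rmax - r <= eps * n:
            matches.append(v)
    return matches

def quantile(summary, phi, eps):
    """Return the first acceptable value, or None if no tuple qualifies."""
    matches = matching_values(summary, phi, eps)
    return matches[0] if matches else None

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
answers = matching_values(summary, 0.5, 0.25)
```

Any matching tuple is a valid answer; returning the first one is an arbitrary choice made here.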
Example of A Quantile Summary
Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream: maxi(gi + Δi) = 8 = 2 · 0.25 · 16.
A 0.25-approximate 0.5-quantile query returns the element of (4,1,6) or (10,6,0), i.e. 4 or 10.
How does their algorithm work?

Insert a tuple into the summary for each new incoming element.
Periodically sweep over the summary to “merge” some of the tuples into their neighbors.
    This bounds the space requirement.
At all times, maxi(gi + Δi) ≤ 2εn.

What to merge, and how to merge?
INSERT (v)

INSERT(v): Find the smallest i such that vi-1 ≤ v < vi, and insert the tuple (v, 1, ⌊2εn⌋) between ti-1 and ti. Increment s. As a special case, if v is the new minimum or the new maximum element seen, insert (v, 1, 0) instead.
Example of INSERT
Time:   t0  t8  t3  t4
Value:  12  6   10  1

ε = 0.25
S={(12, 1, 0)}, n=1
S={(6, 1, 0), (12, 1, 0)}, n=2
S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3
S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
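INSERT can be sketched with a binary search over the stored values. In this sketch `n` is the number of elements seen before the new one arrives, which is the reading that reproduces the example above; the function name, argument order, and list layout are assumptions.

```python
import bisect
import math

def insert(summary, v, n, eps):
    """Insert v into a sorted list of (v, g, delta) tuples; n counts elements seen before v."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)     # smallest i with v_{i-1} <= v < v_i
    if i == 0 or i == len(summary):
        delta = 0                          # new minimum or maximum: rank is known exactly
    else:
        delta = math.floor(2 * eps * n)
    summary.insert(i, (v, 1, delta))

summary = []
for n, v in enumerate([12, 6, 10, 1]):
    insert(summary, v, n, 0.25)
```

Rebuilding `values` on every call keeps the sketch short; a real implementation would search the tuple list directly.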
Merge




Space will increase with insertions.
Intuitively, two tuples (vi, gi, Δi) and (vj, gj, Δj) can be merged into a new tuple (vk, gk, Δk) as long as gk + Δk < 2εn.
An individual tuple is full if gk + Δk ≥ 2εn.
Capacity and Band are introduced.
Capacity and Band

The capacity of a tuple is the maximum number of elements that can be counted by gi before the tuple becomes full (gi ≤ 2εn - Δi).

    The merge phase frees up space by merging tuples with small capacities into tuples with similar or larger capacities.
Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, ½·2εn, ¾·2εn, …, ((2^i - 1)/2^i)·2εn, …, 2εn - 1, 2εn).
The larger the capacity (i.e. the smaller the Δ), the larger the band.
Example of A Quantile Summary
Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.
(2,1,7) and (3,1,7) are in the lowest band.
(1,1,0), (10,6,0) and (12,6,0) are in the highest band.
Band

Strictly, given α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, band α is the set of all Δ such that
    p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)).

    If two Δs are ever in the same band, they never appear in different bands as n increases.
    In band 0, Δ = ⌊2εn⌋.
A tree structure is imposed to facilitate merges between bands.
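The band definition can be turned into a small lookup function. This is a sketch under two assumptions that the slides do not spell out: logarithms are base 2, and Δ = 0 is given its own band one above the largest α, matching the earlier remark that tuples with Δ = 0 sit in the highest band.

```python
def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n), following the definition above."""
    max_alpha = max(p - 1, 0).bit_length()   # equals ceil(log2(p)) for p >= 1
    if delta == 0:
        return max_alpha + 1                 # assumed convention: delta = 0 is the highest band
    if delta == p:
        return 0                             # band 0 holds only delta = p
    for alpha in range(1, max_alpha + 1):
        lower = p - 2 ** alpha - (p % 2 ** alpha)
        upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        if lower < delta <= upper:
            return alpha
    raise ValueError("delta must satisfy 0 <= delta <= p")

# n = 16 and eps = 0.25 give p = 8
bands_for_p8 = [band(d, 8) for d in (8, 7, 6, 5, 4, 1, 0)]
```

With p = 8, Δ = 7 falls in band 1 and Δ = 6 in band 2, which agrees with (2,1,7) and (3,1,7) being in the lowest occupied band of the running example.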
Tree Representation


Given a summary S = {t0, t1, …, ts-1}, the tree T associated with S contains a node Vi for each ti and a special root node R.
The parent of a node Vi is the node Vj such that j is the least index greater than i with band(tj) > band(ti). If no such index exists, R is the parent.
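The parent rule is a "next strictly greater element to the right" computation over the band values, which a stack resolves in one pass. In this sketch the band values for the running example (p = 8) are written out by hand, with Δ = 0 assumed to be band 4; `None` stands for the root R, and `parents` is a hypothetical name.

```python
def parents(bands):
    """parents[i] = least j > i with bands[j] > bands[i], or None when the parent is R."""
    result = [None] * len(bands)
    waiting = []                       # indices still looking for a larger band to their right
    for j, b in enumerate(bands):
        while waiting and bands[waiting[-1]] < b:
            result[waiting.pop()] = j
        waiting.append(j)
    return result

# Summary {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)}
example_bands = [4, 1, 1, 2, 4, 4]
tree = parents(example_bands)
```

Here (2,1,7) and (3,1,7) become children of (4,1,6), which in turn is a child of (10,6,0); the three Δ = 0 tuples hang directly under R.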
Tree Representation
[Figure: tree T for the example summary, with root R and nodes (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)]

PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S.
PROPOSITION 4: For any node V, the set of all its descendants arranged in T forms a contiguous segment in S.
Merge Actually



GK-algorithm will merge a node and all its descendants into either its parent node or its right sibling.
The tuple that results after the merge must not be full, i.e. gi + Δi < 2εn.
The operation is called COMPRESS().
COMPRESS()

The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling.

COMPRESS()
    for i from s-2 to 0 do
        if ((BAND(Δi, 2εn) ≤ BAND(Δi+1, 2εn)) && (g*i + gi+1 + Δi+1 < 2εn)) then
            DELETE all descendants of ti and the tuple ti itself;
        end if
    end for
end COMPRESS

g*i denotes the sum of g-values of the tuple ti and all its descendants in T.
DELETE (vi)

DELETE(vi): To delete the tuple (vi, gi,Δi) from
S, replace (vi, gi,Δi) and (vi+1, gi+1,Δi+1) by the
new tuple (vi+1, gi+ gi+1,Δi+1), and decrement s.
Example of COMPRESS and DELETE
Time:   t0  t1  t2  t3  t4  t5
Value:  12  10  11  10  1   10

ε = 0.25
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6
Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0).
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6
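COMPRESS and DELETE can be sketched together as one right-to-left sweep. Several choices here are assumptions rather than something the slides state: the sweep stops at index 1 so that the minimum tuple t0 is never deleted (the worked examples keep it), the descendants of ti are found as the contiguous run of smaller-band tuples to its left (Proposition 4), and Δ = 0 is treated as the highest band.

```python
import math

def band(delta, p):
    """Band index for p = floor(2*eps*n); delta = 0 is assumed to be the highest band."""
    max_alpha = max(p - 1, 0).bit_length()
    if delta == 0:
        return max_alpha + 1
    if delta == p:
        return 0
    for alpha in range(1, max_alpha + 1):
        if p - 2 ** alpha - (p % 2 ** alpha) < delta <= p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1)):
            return alpha
    raise ValueError("delta must satisfy 0 <= delta <= p")

def compress(summary, n, eps):
    """One COMPRESS sweep over a list of (v, g, delta) tuples, modified in place."""
    threshold = 2 * eps * n
    p = math.floor(threshold)
    i = len(summary) - 2
    while i >= 1:                                  # assumption: never delete t0
        v_next, g_next, d_next = summary[i + 1]
        b_i = band(summary[i][2], p)
        if b_i <= band(d_next, p):
            start = i                              # extend left over the descendants of t_i
            while start > 1 and band(summary[start - 1][2], p) < b_i:
                start -= 1
            g_star = sum(g for _, g, _ in summary[start:i + 1])
            if g_star + g_next + d_next < threshold:
                # DELETE t_start..t_i by folding their g-values into t_{i+1}
                summary[start:i + 2] = [(v_next, g_star + g_next, d_next)]
                i = start
        i -= 1
    return summary

s6 = compress([(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)], 6, 0.25)
s14 = compress([(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)], 14, 0.25)
```

Under these assumptions the sweep reproduces the n=6 step shown above, and the n=14 step of the complete example.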
Pseudo-Code for the whole algorithm
Initial State
    S ← ∅; s ← 0; n ← 0;
Algorithm
    To add the (n+1)st element, v, to summary S(n):
        if (n ≡ 0 mod 1/(2ε)) then
            COMPRESS();
        end if
        INSERT(v);
        n = n+1;
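The control flow of the pseudo-code can be sketched as a small class. To keep it short, the compress step below uses a band-free simplification (merge ti into ti+1 whenever gi + gi+1 + Δi+1 < 2εn); this is not the banded COMPRESS described in this talk and carries no claim about the space bound. The class name, the rounding of 1/(2ε) to an integer period, and keeping t0 fixed are all assumptions.

```python
import bisect
import math

class StreamSummary:
    """Insert-then-periodically-compress loop with a simplified merge rule."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                                  # (v, g, delta), ordered by v
        self.period = max(1, round(1 / (2 * eps)))        # compress when n = 0 mod 1/(2 eps)

    def add(self, v):
        if self.n % self.period == 0:
            self._compress()
        self._insert(v)
        self.n += 1

    def _insert(self, v):
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        at_end = i == 0 or i == len(self.tuples)          # new minimum or maximum
        delta = 0 if at_end else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, (v, 1, delta))

    def _compress(self):
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:                                     # keep t0, the minimum, in place
            _, g, _ = self.tuples[i]
            v_next, g_next, d_next = self.tuples[i + 1]
            if g + g_next + d_next < limit:
                self.tuples[i:i + 2] = [(v_next, g + g_next, d_next)]
            i -= 1

summary = StreamSummary(0.25)
for value in [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]:
    summary.add(value)
```

Because the merge rule differs, the intermediate summaries need not match the ones in the complete example; what carries over is the schedule (compress, then insert, then increment n) and the fact that the g-values always sum to n.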
A Complete Example (ε = 0.25)

Time:   t0  t1  t2  t3  t4  t5  t6
Value:  12  10  11  10  1   10  11

S={(10, 1, 0), (12, 1, 0)}, n=2
S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6
Perform compress when t6 comes.
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5
A Complete Example (ε = 0.25)

Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8
Value:  12  10  11  10  1   10  11  9   6

S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7
Perform compress when t8 comes.
S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5
A Complete Example (ε = 0.25)

Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5
Perform compress.
S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4
Finally
S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6
Band Property


Observe that the number of bands and the number of elements in a band determine the space complexity.
PROPOSITION 2: At any point in time n and for any α ≥ 1, bandα(n) contains either 2^α or 2^(α-1) distinct values of Δ.
    Since no more than 1/(2ε) elements with any given Δ are inserted, band α is a summary of at most 2^α/(2ε) elements in the stream.
LEMMAs

LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value of α.

    Only a small number of nodes can have a child with band α. See Proposition 3.
LEMMAs


A full pair of tuples (ti-1, ti): band(ti-1) ≤ band(ti).
The tuple ti-1 is the left partner and ti is the right partner in this full pair.
LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from bandα(n) that are right partners in a full tuple pair.
Full Pair Example
[Figure: tree T with root R and nodes (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)]

{(2,1,7), (3,1,7)} is a full pair.
{(1,1,0), (2,1,7)} is not a full pair.
(2,1,7) can only be a left partner!
Space Efficiency

Any bandα(n) node either is a right partner of a full pair, or can only be a left partner.

    By Proposition 3, a bandα(n) node that can only be a left partner occurs only once for every parent of nodes from bandα(n).

    By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).
Space Efficiency



The number of bands is ⌈log(2εn)⌉ + 1.
THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
GK-algorithm’s space complexity is O((1/ε) log(εN)).
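To get a feel for the theorem, the bound can be evaluated numerically. The sketch assumes the logarithm is base 2, in line with the power-of-two band boundaries; `gk_tuple_bound` is a hypothetical name.

```python
import math

def gk_tuple_bound(eps, n):
    """Upper bound (11 / (2*eps)) * log2(2*eps*n) on the number of stored tuples."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

small = gk_tuple_bound(0.25, 16)      # the running example: eps = 0.25, n = 16
large = gk_tuple_bound(0.01, 10**6)   # a larger stream with a tighter eps
```

For the running example the bound is far above the 6 tuples actually stored, so it is loose for tiny inputs. Its value shows on long streams, where it grows only logarithmically in n.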
Sliding Window Model



Under the sliding window model, a summary is maintained for the N most recently seen data elements.
Eliminating outdated elements exactly requires O(N) space.
Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.
n-of-N Model


A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in the data stream seen so far.
Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.
Example of n-of-N model
Time:   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value:  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

Assume the sliding window is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16.
The 0.5-quantile returns 6 for n=12 and 3 for n=4.

FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
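The semantics of an n-of-N query can be pinned down with an exact brute-force reference that keeps all N recent elements. This is only a specification aid using O(N) space; it is not the space-efficient algorithm of Lin et al., and the class name is hypothetical.

```python
import math
from collections import deque

class ExactRecentQuantiles:
    """Brute-force n-of-N reference: stores the N most recent elements verbatim."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)    # older elements fall out automatically

    def add(self, v):
        self.recent.append(v)

    def quantile(self, phi, n):
        """Exact phi-quantile among the n most recent elements (n <= elements stored)."""
        latest = sorted(list(self.recent)[-n:])
        return latest[math.ceil(phi * len(latest)) - 1]

window = ExactRecentQuantiles(16)
for value in [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]:
    window.add(value)
median_of_last_12 = window.quantile(0.5, 12)
median_of_last_4 = window.quantile(0.5, 4)
```

A reference like this is handy for checking an approximate n-of-N summary: its answer for any n should have a rank within εn of the exact one.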
Thank you!