Online Computation and
Continuous Maintaining of
Quantile Summaries
Tian Xia
Database Lab @ CCIS
Northeastern University
April 16, 2004
References
M. Greenwald and S. Khanna. Space-Efficient
Online Computation of Quantile Summaries. In
SIGMOD, pages 58-66, 2001.
X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously
Maintaining Quantile Summaries of the Most
Recent N Elements over a Data Stream. In ICDE,
pages 362-373, 2004.
Outline of this talk
Quantile Estimation Overview
GK-quantile Summary Algorithm
Data Structure
Operations
Space Complexity Analysis
Sliding Window Model
Problem Definitions
φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered
sequence of N data elements is the element with
rank ⌈φN⌉.
Quantile Query: Given φ, find the data element
with rank ⌈φN⌉ among all elements in the stream.
Variation: the N most recent elements (sliding window model).
ε-approximate: With r = ⌈φN⌉, return any element whose rank
lies within the interval [r−εN, r+εN].
Example of A Quantile Query
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
The sorted order of the sequence is 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
0.5-quantile returns the element ranked 8,
which is 8.
0.25-approximate 0.5-quantile returns one of
the elements in {4,5,6,7,8,9,10}.
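The example above can be checked with a brute-force sketch in Python. The function names exact_quantile and approximate_answers are illustrative assumptions, not part of the talk; ranks are 1-based and the target rank is taken as ⌈φN⌉.

```python
import math

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N) in sorted order (1-based rank)."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approximate_answers(data, phi, eps):
    """Return every distinct value whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo = max(1, math.ceil(r - eps * n))
    hi = min(n, math.floor(r + eps * n))
    return sorted(set(ordered[lo - 1:hi]))

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(exact_quantile(stream, 0.5))             # 8
print(approximate_answers(stream, 0.5, 0.25))  # [4, 5, 6, 7, 8, 9, 10]
```

This sorts the whole input, so it needs linear space; it only illustrates the definitions, not a streaming algorithm.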
Why Approximation?
Munro and Paterson (Theoretical Computer Science,
1980) showed that any algorithm that computes the
exact φ-quantile of N data elements in p passes
requires Ω(N^(1/p)) space.
Approximate quantile techniques are
necessary to achieve sub-linear space
efficiency.
Quantile Summary
Quantile Summary: A small number of
objects from the input data sequence, which
a quantile estimator can use to
answer quantile queries.
Other summary methods of large data sets
include average, standard deviation,
histogram, counting sketch (FM-sketch), etc.
Properties of A Good Quantile Estimator
Provide tunable and explicit a priori guarantees
on the precision of the approximation, e.g. it is ε-approximate.
Data independent.
Use as small a memory footprint as possible,
which includes temporary storage.
Previous Work
Manku, Rajagopalan, and Lindsay (SIGMOD,
1998) proposed a single-pass algorithm that
constructs an ε-approximate quantile
summary.
Space complexity: O((1/ε) log²(εN)).
It requires advance knowledge of N, the size of
the data set, so it does not work in a data stream environment.
Contributions of GK-algorithm
Dynamically adjusts the quantile summary with the
growth of N, the total number of data
elements in the data stream.
Space complexity is reduced to O((1/ε) log(εN)).
Assumptions
A new data element arrives after each unit of time.
n denotes both the number of elements of the data
sequence, as well as the current time.
A data element is represented by its value v.
rmin(v) and rmax(v) denote respectively the lower and
upper bounds on the actual rank r of v among the
elements seen so far.
The Summary Data Structure
GK-algorithm maintains a summary data
structure S=S(n) at any point in time n.
S(n) consists of an ordered (non-decreasing)
sequence of tuples which corresponds to a
subset of the elements seen thus far.
The Summary Data Structure
S = {t0, t1, …, ts-1}, where ti = (vi, gi, Δi).
vi is the value of one of the elements seen so far.
gi = rmin(vi) - rmin(vi-1)
Δi = rmax(vi) - rmin(vi)
v0 and vs-1 always correspond to the
minimum and the maximum elements seen
so far.
The Summary Data Structure
Given gi = rmin(vi) - rmin(vi-1) and Δi = rmax(vi) - rmin(vi),
rmin(vi) = Σ(j≤i) gj
rmax(vi) = Σ(j≤i) gj + Δi
gi + Δi - 1 is an upper bound on the total number
of elements that may have fallen between
vi-1 and vi.
rmin(vs-1) = Σ(i) gi = n.
Example of A Quantile Summary
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0),
(12,6,0)} is a quantile summary consisting of
6 tuples.
For clarity, re-write the tuples of the above
summary in the form ti = (vi, rmin(vi), rmax(vi))
as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10),
(10,10,10), (12,16,16)}.
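The rewrite from (vi, gi, Δi) to (vi, rmin(vi), rmax(vi)) is a running sum over the g-values. A minimal Python sketch, where the function name rank_bounds is an assumption made for illustration:

```python
from itertools import accumulate

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, rmin, rmax) tuples.

    rmin(v_i) is the prefix sum of g up to i, and rmax(v_i) = rmin(v_i) + delta_i.
    """
    rmins = accumulate(g for _, g, _ in summary)
    return [(v, rmin, rmin + delta) for (v, _, delta), rmin in zip(summary, rmins)]

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(rank_bounds(summary))
# [(1, 1, 1), (2, 2, 9), (3, 3, 10), (4, 4, 10), (10, 10, 10), (12, 16, 16)]
```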
Error Rate?
PROPOSITION 1: Given a quantile summary S, a φ-quantile
can always be identified to within an
error of maxi(gi+Δi)/2.
COROLLARY 1: If at any time n, the summary S(n)
satisfies the property that maxi(gi+Δi) ≤ 2εn, then
we can answer any φ-quantile query to within an
εn precision.
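Both statements depend only on the largest gi+Δi in the summary, so they are easy to evaluate. The Python sketch below uses the 6-tuple summary from the earlier example; the function names are illustrative assumptions.

```python
def error_bound(summary):
    """Proposition 1: worst-case rank error, max(g_i + delta_i) / 2."""
    return max(g + delta for _, g, delta in summary) / 2

def is_eps_approximate(summary, n, eps):
    """Corollary 1: the summary answers queries within eps*n if max(g_i + delta_i) <= 2*eps*n."""
    return max(g + delta for _, g, delta in summary) <= 2 * eps * n

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(error_bound(summary))                   # 4.0
print(is_eps_approximate(summary, 16, 0.25))  # True
print(is_eps_approximate(summary, 16, 0.2))   # False
```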
QUANTILE(φ)
QUANTILE(φ): To compute an ε-approximate
φ-quantile from the summary S(n) after n data
elements, compute the rank r = ⌈φn⌉. Find i
such that both r − rmin(vi) ≤ εn and rmax(vi) − r ≤ εn, and
return vi.
i.e. r − εn ≤ rmin(vi) ≤ rmax(vi) ≤ r + εn
Example of A Quantile Summary
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0),
(12,6,0)} is 0.25-approximate with respect to
the data stream.
A 0.25-approximate 0.5-quantile query returns the value of
(4,1,6) or (10,6,0), i.e. 4 or 10.
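The QUANTILE operation can be sketched in Python as a single scan that accumulates rmin. The names candidates and quantile are assumptions for illustration; quantile simply returns the first tuple that satisfies both conditions.

```python
import math

def candidates(summary, n, phi, eps):
    """Values v_i with r - rmin(v_i) <= eps*n and rmax(v_i) - r <= eps*n."""
    r = math.ceil(phi * n)
    found, rmin = [], 0
    for v, g, delta in summary:
        rmin += g              # rmin(v_i) is the running sum of g
        rmax = rmin + delta    # rmax(v_i) = rmin(v_i) + delta_i
        if r - rmin <= eps * n and rmax - r <= eps * n:
            found.append(v)
    return found

def quantile(summary, n, phi, eps):
    """Return one eps-approximate phi-quantile, or None if no tuple qualifies."""
    found = candidates(summary, n, phi, eps)
    return found[0] if found else None

summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(candidates(summary, 16, 0.5, 0.25))  # [4, 10]
print(quantile(summary, 16, 0.5, 0.25))    # 4
```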
How does their algorithm work?
Insert a tuple in the summary corresponding to a
new incoming element.
Periodically sweep over the summary to “merge”
some of the tuples into their neighbors.
This keeps the summary within its space bound.
At all times maxi(gi + Δi) ≤ 2εn.
What to merge & how to merge?
INSERT(v)
INSERT(v): Find the smallest i such that vi-1 ≤ v < vi,
and insert the tuple (v, 1, ⌊2εn⌋) between ti-1 and ti.
Increment s. As a special case, if v is the new
minimum or the maximum element seen, then insert
(v, 1, 0).
Example of INSERT (ε = 0.25)
Time   t0  t8  t3  t4
Value  12  6   10  1
S={(12, 1, 0)}, n=1
S={(6, 1, 0), (12, 1, 0)}, n=2
S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3
S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
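The four steps above can be reproduced with a small Python sketch of INSERT. It assumes the summary is a sorted list of (v, g, Δ) tuples and that n counts the elements seen before the insertion; using bisect_right to place v after equal values is a choice made here for illustration.

```python
import math
from bisect import bisect_right

def insert(summary, n, v, eps):
    """Insert v into the sorted summary and return the new element count."""
    i = bisect_right([t[0] for t in summary], v)
    if i == 0 or i == len(summary):
        delta = 0                        # new minimum or maximum: rank is known exactly
    else:
        delta = math.floor(2 * eps * n)  # interior element: (v, 1, floor(2*eps*n))
    summary.insert(i, (v, 1, delta))
    return n + 1

summary, n = [], 0
for v in [12, 6, 10, 1]:
    n = insert(summary, n, v, 0.25)
print(summary)  # [(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)]
```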
Merge
The space used grows with insertions.
Intuitively, two tuples (vi, gi, Δi) and (vj, gj, Δj)
can be merged into a new tuple (vk, gk, Δk), as
long as gk + Δk ≤ 2εn.
An individual tuple is full if gk + Δk ≥ 2εn.
Capacity and band are introduced to decide which tuples to merge.
Capacity and Band
The capacity of a tuple is the maximum number of
elements that can be counted by gi before the tuple
becomes full (gi ≤ 2εn − Δi).
The merge phase frees up space by merging tuples with
small capacities into tuples with similar or larger capacities.
Bands: Roughly speaking, divide the Δs into bands
that lie between elements of (0, ½·2εn, ¾·2εn, …,
((2^i − 1)/2^i)·2εn, …, 2εn − 1, 2εn).
The larger the capacity (i.e. the smaller the Δ), the larger
the band.
Example of A Quantile Summary
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0),
(12,6,0)} is a quantile summary consisting of
6 tuples.
(2,1,7) and (3,1,7) are in the lowest band.
(1,1,0), (10,6,0) and (12,6,0) are in the
highest band.
Band
Strictly: given α from 1 to ⌈log(2εn)⌉ and p = ⌊2εn⌋,
bandα is the set of all Δ such that
p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1)).
If two Δs are ever in the same band, they never
appear in different bands as n increases.
In band0, Δ = ⌊2εn⌋.
A tree structure is imposed to facilitate
merges between bands.
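The band of a Δ value can be computed directly from the definition in Greenwald and Khanna's paper, with p = ⌊2εn⌋. The Python sketch below is illustrative: the function name band is an assumption, and Δ = 0 is left to fall into the highest-numbered band produced by the formula.

```python
def band(delta, p):
    """Return the band index of delta, where p = floor(2 * eps * n).

    Band 0 holds delta == p; band alpha >= 1 holds the deltas with
    p - 2^alpha - (p mod 2^alpha) < delta <= p - 2^(alpha-1) - (p mod 2^(alpha-1)).
    """
    if delta == p:
        return 0
    for alpha in range(1, p.bit_length() + 2):
        lo = p - 2**alpha - (p % 2**alpha)
        hi = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lo < delta <= hi:
            return alpha
    raise ValueError("delta must lie in [0, p]")

# n = 16, eps = 0.25, so p = 8
print([band(d, 8) for d in (8, 7, 6, 5, 4, 1, 0)])  # [0, 1, 2, 2, 3, 3, 4]
```

For the 6-tuple example summary this puts Δ = 7 in band 1 (the lowest band present) and Δ = 0 in band 4 (the highest).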
Tree Representation
Given a summary S = {t0, t1, …, ts-1}, the tree
T associated with S contains a node Vi for
each ti and a special root node R.
The parent of a node Vi is the node Vj such
that j is the least index greater than i with
band(tj) > band(ti). If no such index exists, R is the parent.
Tree Representation
R
  (1,1,0)
  (10,6,0)
    (4,1,6)
      (2,1,7)
      (3,1,7)
  (12,6,0)
PROPOSITION 3: The children of any node in T are
always arranged in non-increasing order of band in
S.
PROPOSITION 4: For any node V, the set of all its
descendants arranged in T forms a contiguous
segment in S.
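The tree can be derived from the band values alone. The Python sketch below takes the parent of a tuple to be the nearest tuple to its right with a strictly larger band, as in Greenwald and Khanna's paper, and uses None for the root R. The band list [4, 1, 1, 2, 4, 4] is an assumption obtained by applying the band formula with p = 8 to the example summary, with Δ = 0 placed in the highest band.

```python
def parents(bands):
    """parents[i] is the least j > i with bands[j] > bands[i], or None for the root R."""
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

# Tuples: (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)
bands = [4, 1, 1, 2, 4, 4]
print(parents(bands))  # [None, 3, 3, 4, None, None]
```

Here (2,1,7) and (3,1,7) hang under (4,1,6), which hangs under (10,6,0); the other three tuples are children of R, so each subtree is a contiguous segment of S as Proposition 4 states.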
Merge Actually
GK-algorithm will merge together a node and
all its descendants into either its parent node
or into its right sibling.
The tuple that results after the merge must
not be full, i.e. gi + Δi < 2εn.
The operation is called COMPRESS().
COMPRESS()
The operation COMPRESS tries to merge
together a node and all its descendants into
either its parent node or its right sibling.
COMPRESS()
  for i from s-2 to 0 do
    if ((BAND(Δi, 2εn) ≤ BAND(Δi+1, 2εn)) &&
        (g*i + gi+1 + Δi+1 < 2εn)) then
      DELETE all descendants of ti and the tuple ti itself;
    end if
  end for
end COMPRESS
g*i denotes the sum of g-values of the tuple ti and all its
descendants in T.
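A minimal Python sketch of COMPRESS under stated assumptions: the summary is a list of (v, g, Δ) tuples, the descendants of ti are found as the contiguous run to its left with a strictly smaller band (Proposition 4), and the loop stops at index 1 so that the minimum element t0 is never merged away. That last guard is an assumption added here; the pseudo-code above runs down to 0.

```python
import math

def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); delta == p is band 0."""
    if delta == p:
        return 0
    for alpha in range(1, p.bit_length() + 2):
        lo = p - 2**alpha - (p % 2**alpha)
        hi = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lo < delta <= hi:
            return alpha
    raise ValueError("delta must lie in [0, p]")

def compress(summary, n, eps):
    """Merge tuples (with their descendants) into their right neighbor, in place."""
    p = math.floor(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:  # assumption: t0 (the minimum) is kept
        b = band(summary[i][2], p)
        g_star, k = summary[i][1], i
        # Descendants of t_i: contiguous run to the left with a smaller band.
        while k - 1 >= 1 and band(summary[k - 1][2], p) < b:
            k -= 1
            g_star += summary[k][1]
        v_next, g_next, d_next = summary[i + 1]
        if b <= band(d_next, p) and g_star + g_next + d_next < 2 * eps * n:
            summary[k:i + 2] = [(v_next, g_star + g_next, d_next)]
            i = k - 1
        else:
            i -= 1
    return summary

s = [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)]
print(compress(s, 6, 0.25))
# [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)]
```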
DELETE (vi)
DELETE(vi): To delete the tuple (vi, gi,Δi) from
S, replace (vi, gi,Δi) and (vi+1, gi+1,Δi+1) by the
new tuple (vi+1, gi+ gi+1,Δi+1), and decrement s.
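DELETE can be sketched in Python as a two-element slice replacement. The list-of-tuples representation and the function name delete are assumptions for illustration; s is the length of the list, so it decrements implicitly.

```python
def delete(summary, i):
    """Fold tuple i into its right neighbor: (v_{i+1}, g_i + g_{i+1}, delta_{i+1})."""
    _, g, _ = summary[i]
    v_next, g_next, delta_next = summary[i + 1]
    summary[i:i + 2] = [(v_next, g + g_next, delta_next)]

s = [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)]
delete(s, 4)  # remove (11, 1, 1); its g is absorbed by (12, 1, 0)
print(s)      # [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)]
```

The sum of all g-values is unchanged by the operation, so rmin of the last tuple still equals n.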
Example of COMPRESS and DELETE (ε = 0.25)
Time   t0  t1  t2  t3  t4  t5
Value  12  10  11  10  1   10
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12,
1, 0)}, s=6, n=6
Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple
(12, 2, 0).
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)},
s=5, n=6
Pseudo-Code for the whole algorithm
Initial State
  S ← ∅; s = 0; n = 0;
Algorithm
  To add the (n+1)st element, v, to summary S(n):
  if (n ≡ 0 mod 1/(2ε)) then
    COMPRESS();
  end if
  INSERT(v);
  n = n+1;
A Complete Example (ε = 0.25)
Time   t0  t1  t2  t3  t4  t5  t6
Value  12  10  11  10  1   10  11
S={(10, 1, 0), (12, 1, 0)}, n=2
S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12,
1, 0)}, n=6, s=6
Perform compress when t6 comes.
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)},
n=6, s=5
A Complete Example (ε = 0.25)
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8
Value  12  10  11  10  1   10  11  9   6
S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11,
1, 3), (12, 2, 0)}, n=8, s=7
Perform compress when t8 comes.
S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)},
n=8, s=5
A Complete Example (ε = 0.25)
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14,
s=5
Perform compress
S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4
Finally
S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6,
0)}, n=16, s=6
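The complete example can be replayed with a compact Python sketch of the whole loop: COMPRESS whenever n is a multiple of 1/(2ε), then INSERT. This is a sketch under stated assumptions, not the authors' implementation: the names band, compress, insert and build_summary are illustrative, equal values are placed after existing ones, and t0 (the minimum) is never merged away.

```python
import math
from bisect import bisect_right

def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); delta == p is band 0."""
    if delta == p:
        return 0
    for alpha in range(1, p.bit_length() + 2):
        lo = p - 2**alpha - (p % 2**alpha)
        hi = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lo < delta <= hi:
            return alpha
    raise ValueError("delta must lie in [0, p]")

def compress(summary, n, eps):
    p = math.floor(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:  # assumption: t0 (the minimum) is kept
        b = band(summary[i][2], p)
        g_star, k = summary[i][1], i
        # Descendants of t_i: contiguous run to the left with a smaller band.
        while k - 1 >= 1 and band(summary[k - 1][2], p) < b:
            k -= 1
            g_star += summary[k][1]
        v_next, g_next, d_next = summary[i + 1]
        if b <= band(d_next, p) and g_star + g_next + d_next < 2 * eps * n:
            summary[k:i + 2] = [(v_next, g_star + g_next, d_next)]
            i = k - 1
        else:
            i -= 1

def insert(summary, n, v, eps):
    i = bisect_right([t[0] for t in summary], v)
    boundary = i == 0 or i == len(summary)  # new minimum or maximum
    summary.insert(i, (v, 1, 0 if boundary else math.floor(2 * eps * n)))

def build_summary(stream, eps):
    summary, n = [], 0
    period = max(1, math.floor(1 / (2 * eps)))
    for v in stream:
        if n % period == 0:
            compress(summary, n, eps)
        insert(summary, n, v, eps)
        n += 1
    return summary

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(build_summary(stream, 0.25))
# [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
```

With ε = 0.25 the sketch compresses before every second insertion and ends with the same 6-tuple summary as the example.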
Band Property
Observe that the number of bands and the number of
elements in a band determine the space
complexity.
PROPOSITION 2: At any point in time n and for
any α ≥ 1, bandα(n) contains either 2^α or 2^(α−1)
distinct values of Δ.
Since no more than 1/(2ε) elements with any
given Δ are inserted, bandα is a summary of at
most 2^α/(2ε) elements in the stream.
LEMMAs
LEMMA 3: At any time n and for any given α,
there are at most 3/(2ε) nodes in T(n) that have
a child with band value of α.
Only a small number of nodes can have a child
with band α. See Proposition 3.
LEMMAs
A full pair of tuples (ti-1, ti): band(ti-1) ≤ band(ti).
The tuple ti-1 is the left partner and ti is the right
partner in this full pair.
LEMMA 4: At any time n and for any given α,
there are at most 4/ε tuples from bandα(n)
that are right partners in a full tuple pair.
Full Pair Example
R
  (1,1,0)
  (10,6,0)
    (4,1,6)
      (2,1,7)
      (3,1,7)
  (12,6,0)
{(2,1,7), (3,1,7)} is a full pair.
{(1,1,0), (2,1,7)} is not a full pair.
(2,1,7) can only be a left partner!
Space Efficiency
Any bandα(n) node either is a right partner of a
full pair, or can only be a left partner.
By Proposition 3, a bandα(n) node that can only
be a left partner occurs only once for every
parent of nodes from bandα(n).
By Lemmas 3 and 4, the number of nodes in any
band is bounded by 3/(2ε) + 4/ε = 11/(2ε).
Space Efficiency
The number of bands is ⌈log(2εn)⌉ + 1.
THEOREM: At any time n, the total number of
tuples stored in S(n) is at most (11/(2ε)) log(2εn).
GK-algorithm’s space complexity is
O((1/ε) log(εN)).
Sliding Window Model
Under the sliding window model, a summary is
maintained for the N most recently seen data
elements.
Eliminating exactly the out-dated elements requires
O(N) space.
Lin et al. (ICDE 2004) proposed a space-efficient
one-pass summary algorithm for the
sliding window model. Their underlying
summary algorithm is the GK-algorithm.
n-of-N Model
A summary is maintained for the N most recently seen
data elements. However, quantile queries can be
issued against any n ≤ N. That is, for any φ ∈ (0,1]
and any n ≤ N, we can return φ-quantiles among the
n most recent elements in a data stream seen so far.
Lin et al. (ICDE 2004) proposed a one-pass
summary algorithm that combines the EH partitioning
technique (Datar et al., ACM-SIAM 2002) with the
GK-algorithm to solve the n-of-N model.
Example of n-of-N model
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3
Assume the sliding window size is 16 in an n-of-N
model. A quantile query can be answered for
any 1 ≤ n ≤ 16.
The 0.5-quantile is 6 for n=12 and 3 for n=4.
FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 10, 10, 11, 11, 11, 12.
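The n-of-N answers in this example can be verified with an exact brute-force sketch in Python. It stores the whole window, so it uses O(N) space and is not the algorithm of Lin et al.; the function name recent_quantile is an assumption for illustration.

```python
import math

def recent_quantile(stream, n, phi):
    """Exact phi-quantile (rank ceil(phi * n)) among the n most recent elements."""
    window = sorted(stream[-n:])
    return window[math.ceil(phi * n) - 1]

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(recent_quantile(stream, 12, 0.5))  # 6
print(recent_quantile(stream, 4, 0.5))   # 3
```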
Thank you!