
Storage Estimation for
Multidimensional Aggregates in
the Presence of Hierarchies
Parallel and Distributed Computing Laboratory
Kim Nam-hee, first-semester Master's student
Contents

Introduction
Approximating the Size of the Cube
 An Analytical Algorithm
 A Sampling-Based Algorithm
 An Algorithm Based on Probabilistic Counting
  The Probabilistic Counting Algorithm
  Approximating the Size of the Cube
Contents

Evaluating the Accuracy of the Estimates
Extensions to the PCSA-Based Algorithm
 Estimating Sub-Cube Sizes
 Incremental Estimation
 Estimation after Data Removal
Conclusion
Introduction

Virtually all OLAP products resort to some degree of precomputation of these aggregations.
The more that is precomputed, the faster queries can be answered, but this creates a storage-space problem.
The problem: estimating how much storage will be required if all possible combinations of dimensions and their hierarchies are precomputed.
When hierarchies are present, more storage is required than when they are absent.
Introduction

Three strategies:
 An analytical algorithm
 A sampling-based algorithm
 A probabilistic counting algorithm
An Analytical Algorithm (1/2)

Uniform-distribution assumption:
If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is
 $\bar{n} = n\left(1 - (1 - 1/n)^r\right)$
Using this result, the size of the group-by for any subset of attributes can be estimated.
 $h_i$ : size of the hierarchy of dimension i
 $k$ : number of dimensions
The total number of group-bys: $\prod_{i=1}^{k} (h_i + 1)$
An Analytical Algorithm (2/2)

Advantages: simple, fast
Disadvantages
 It tends to overestimate the size of the cube.
 It requires counts of distinct values.
A Sampling-Based Algorithm (1/2)

Basic idea
 Take a random subset of the database.
 Compute the cube on that subset.
 Scale up this estimate by the ratio of the database size to the sample size.
 D : database, s : sample
 |D| : size of the database, |s| : size of the sample
 CUBE(s) : size of the cube computed on the sample s
Estimated size of the cube computed on the entire database D: $CUBE(s) \cdot \frac{|D|}{|s|}$
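A small Python sketch of the sampling-based estimate; cube_size here is a naive helper that enumerates every group-by of a flat table (no hierarchies), purely for illustration:

```python
import random
from itertools import combinations

def cube_size(table):
    """Total tuples across all group-bys: one group-by per column subset."""
    k = len(table[0])
    total = 0
    for r in range(k + 1):
        for cols in combinations(range(k), r):
            total += len({tuple(t[c] for c in cols) for t in table})
    return total

def sampling_estimate(table, sample_fraction=0.01):
    """Compute the cube on a random sample s and scale by |D| / |s|."""
    size = max(1, int(len(table) * sample_fraction))
    sample = random.sample(table, size)
    return cube_size(sample) * len(table) / len(sample)
```

The bias noted on the next slide shows up directly here: a group-by that is already saturated in the sample (few distinct values, many duplicates) still gets scaled by |D| / |s|.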
A Sampling-Based Algorithm (2/2)

 Advantage: simple
 Disadvantage: biased in the case of projection, since it does not account for duplicates
The Probabilistic Counting Algorithm (1/3)

Probabilistic counting algorithm: counts the number of distinct elements in a multiset.
 Algorithm
The estimate formed from the above will typically be within a factor of 2 of the actual size.
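A sketch of the basic single-bitmap estimator (Flajolet-Martin style; the hash choice and bitmap length are assumptions):

```python
import hashlib

PHI = 0.77351  # correction constant from the analysis

def fm_estimate(values, bitmap_len=32):
    """Single-bitmap probabilistic count: hash each element, set the bit
    at the position of the lowest-order 1-bit of the hash, then estimate
    from R, the position of the leftmost zero in the bitmap."""
    bitmap = 0
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        r = (h & -h).bit_length() - 1 if h else bitmap_len - 1
        bitmap |= 1 << min(r, bitmap_len - 1)
    R = 0
    while (bitmap >> R) & 1:
        R += 1
    return 2 ** R / PHI  # typically within a factor of 2 of the true count
```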
The Probabilistic Counting Algorithm (2/3)

The simplest way to improve the accuracy of the estimate is to use a set H of m hash functions and compute m different BITMAP vectors.
 $R$ : position of the leftmost zero in the BITMAP
 $R_i$ : obtained from hash function i
Average: $A = \frac{R_1 + R_2 + \dots + R_m}{m}$
 $E(A) \approx \log_2(\varphi n), \quad \varphi \approx 0.77351$
The Probabilistic Counting Algorithm (3/3)

Stochastic averaging: a single hash function distributes each element to one of m bitmaps via
 $\alpha = h(x) \bmod m$
Estimated # of distinct values: $\frac{m}{\varphi} \, 2^{\frac{1}{m}\sum_{j=1}^{m} R_j}$
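A compact PCSA sketch (one hash function, m bitmaps; the helper names are illustrative):

```python
import hashlib

PHI = 0.77351

def pcsa_estimate(values, m=64, bitmap_len=32):
    """Stochastic averaging: alpha = h(x) mod m picks the bitmap,
    the remaining hash bits pick the bit position."""
    bitmaps = [0] * m
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        alpha, rest = h % m, h // m
        r = (rest & -rest).bit_length() - 1 if rest else bitmap_len - 1
        bitmaps[alpha] |= 1 << min(r, bitmap_len - 1)
    total_R = 0
    for bm in bitmaps:                 # R_j = leftmost zero of bitmap j
        R = 0
        while (bm >> R) & 1:
            R += 1
        total_R += R
    return (m / PHI) * 2 ** (total_R / m)

print(pcsa_estimate(range(10_000)))    # rough distinct-count estimate
```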
Approximating the Size of the Cube (1/2)

Algorithm: use the probabilistic counting algorithm to estimate the # of tuples resulting from computing the cube on the base data.
 C : a combination of hierarchy levels (one group-by)
 bitset(C, BM, b) : sets the b-th bit of the BM-th bitmap kept for combination C
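A sketch of the one-scan estimation over a flat table (hierarchies omitted; the per-combination bookkeeping mirrors bitset(C, BM, b), but the helper itself is hypothetical):

```python
import hashlib
from itertools import combinations

PHI, M, LEN = 0.77351, 64, 32

def estimate_cube_size(table):
    """Single scan: maintain M PCSA bitmaps per group-by combination C,
    then sum the per-combination distinct-count estimates."""
    k = len(table[0])
    combos = [c for r in range(k + 1) for c in combinations(range(k), r)]
    bitmaps = {c: [0] * M for c in combos}
    for t in table:
        for c in combos:
            key = repr(tuple(t[i] for i in c))
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            alpha, rest = h % M, h // M
            b = (rest & -rest).bit_length() - 1 if rest else LEN - 1
            bitmaps[c][alpha] |= 1 << min(b, LEN - 1)   # bitset(C, alpha, b)
    total = 0.0
    for c in combos:
        R_sum = 0
        for bm in bitmaps[c]:
            R = 0
            while (bm >> R) & 1:
                R += 1
            R_sum += R
        total += (M / PHI) * 2 ** (R_sum / M)
    return total
```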
Approximating the Size of the Cube (2/2)

Lemma
 The error in the sum of two estimates is $\le$ the error in a single estimate, so summing the per-group-by estimates preserves the error bound.
Advantage
 This algorithm actually guarantees an error bound on its estimate.
Disadvantage
 This comes at the cost of a complete scan of the base data table; however, even this scan is much cheaper than actually computing the cube.
Evaluating the Accuracy of the Estimates

Scheme 1

Evaluating the Accuracy of the Estimates

Scheme 2

Evaluating the Accuracy of the Estimates

Scheme 3
 D0, D1 : dimensions
 Each dimension has 100 unique values.
 The database consists of 50,000 tuples.
 There is no hierarchy on either dimension.
Extensions to the PCSA-Based Algorithm

Estimating sub-cube sizes
Incremental estimation
 Addition of new data changes the sizes of some of the group-bys.
 This change can be estimated by updating the bitmaps used by the previous estimation (see the sketch below).
 To estimate the cube size, the bitmaps corresponding to every combination of group-bys have to be stored.
 |C| : # of group-bys, L : length of each bitmap, m : # of bitmaps per group-by
 Storage needed for the bitmaps: |C| * L * m
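A minimal sketch of the incremental case, assuming the per-group-by bitmaps from the previous estimation were kept (structures match the one-scan sketch above):

```python
import hashlib

def update_bitmaps(bitmaps, new_tuples, M=64, LEN=32):
    """OR newly inserted tuples into the stored bitmaps; since bits are
    only ever set, all previous scanning work is fully reused."""
    for t in new_tuples:
        for c in bitmaps:                       # one entry per group-by C
            key = repr(tuple(t[i] for i in c))
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            alpha, rest = h % M, h // M
            b = (rest & -rest).bit_length() - 1 if rest else LEN - 1
            bitmaps[c][alpha] |= 1 << min(b, LEN - 1)
```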
Extensions to the PCSA-Based Algorithm

Estimation after data removal
 A deletion may clear a bit only when no remaining element maps to it, so for each bitmap we have to store the # of "hits" for each bit.
 |C| : # of group-bys in the cube
 L : length of each count-array
 m : # of count-arrays per group-by
 I : size of an integer
 Storage needed for the count arrays: |C| * L * m * I
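A sketch of the count-array variant that supports removal (illustrative; a bit counts as set while its hit count is positive):

```python
import hashlib

def apply_tuple(counts, t, delta, M=64, LEN=32):
    """delta = +1 on insert, -1 on delete; counts[c][alpha][b] stores
    the number of 'hits' on bit b of bitmap alpha for group-by c."""
    for c in counts:
        key = repr(tuple(t[i] for i in c))
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        alpha, rest = h % M, h // M
        b = (rest & -rest).bit_length() - 1 if rest else LEN - 1
        counts[c][alpha][min(b, LEN - 1)] += delta

def to_bitmap(count_array):
    """Rebuild one PCSA bitmap: bit b is set iff its hit count is > 0."""
    return sum(1 << b for b, n in enumerate(count_array) if n > 0)
```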
Conclusion

Three strategies to estimate the blowup:
 Algorithm based on sampling
  Overestimates the size of the cube
  Strongly dependent on the # of duplicates
 Algorithm based on assuming the data is uniformly distributed
  Works well if the data is uniformly distributed
  Inaccurate as the skew in the data increases
 The analytical estimate was more accurate than the sampling-based estimate for widely varying skew in the data.
Conclusion

 Probabilistic counting algorithm
  Performs very well under various degrees of skew, always giving an estimate with a bounded error.
  Provides a more reliable, accurate, and predictable estimate than the other algorithms.