Materialized Sample Views for Database Approximation

Shantanu Joshi
Christopher Jermaine
University of Florida
Department of Computer and Information Science and Engineering
Gainesville, FL, USA
{ssjoshi,cjermain}@cise.ufl.edu
Abstract
We consider the problem of creating a sample view of a database table. A sample view is an indexed,
materialized view that permits efficient sampling from an arbitrary range query over the view. Such “sample
views” are very useful to applications that require random samples from a database: approximate query processing,
online aggregation, data mining, and randomized algorithms are a few examples. Our core technical contribution is
a new file organization called the ACE Tree that is suitable for organizing and indexing a sample view. One of the
most important aspects of the ACE Tree is that it supports online random sampling from the view. That is, at all
times, the set of records returned by the ACE Tree constitutes a statistically random sample of the database records
satisfying the relational selection predicate over the view. Our paper presents experimental results that demonstrate
the utility of the ACE Tree.
Index Terms
H.2.2.c Indexing methods, H.2.4.h Query processing, I.4.1.f Sampling
I. INTRODUCTION
With ever-increasing database sizes, randomization and randomized algorithms [1] have become
vital data management tools. In particular, random sampling is one of the most important sources
of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either
require a randomized input ordering for data (i.e., an online random sample), or else they operate over
samples of the data to increase the speed of the algorithm.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL XX NO XX, 2006
Applications requiring randomization abound in the data management literature. For example, consider
online aggregation [2]–[4]. In online aggregation, database records are processed one-at-a-time, and used
to keep the user informed of the current “best guess” as to the eventual answer to the query. If the
records are input into the online aggregation algorithm in a randomized order, then it becomes possible
to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the
query. For another example, consider data mining algorithms. Many data mining algorithms require an
online, randomized input ordering of the data. One of the most widely-cited techniques along these
lines is the scalable K-means clustering algorithm of Bradley et al. [5]. For another example, one-pass
frequent-itemset mining algorithms are typically useful only if the data are processed in a randomized
order so that the first few records are distributed in the same way as latter ones [6], [7]. In general, it
is often possible to scale data mining and machine learning techniques by incorporating samples into a
learned model, one-at-a-time, until the marginal accuracy of adding an additional sample into the model
is small [8]–[12]. Such online algorithms could be used in a database context for building decision trees,
fitting parametric statistical models [9], [11], learning the characteristics of a certain class of data records
(including variations on PAC learning [13]), and so on.
In fact, database sampling has been recognized as an important enough problem that ISO has been
working to develop a standard interface for sampling from relational database systems [14], and significant
research efforts by vendors such as IBM are directed at providing sampling from database systems [15].
However, despite the obvious importance of random sampling in a database environment and dozens
of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences
are concerned with database sampling), there has been relatively little work towards actually supporting
random sampling with physical database file organizations. The classic work in this area (by Olken and
his co-authors [16]–[18]) suffers from a key drawback: each record sampled from a database file requires
a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that
it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing
or speeding up a data mining algorithm, this is clearly unacceptable.
The Materialized Sample View
In this paper, we propose to use the materialized sample view1 as a convenient abstraction for allowing
efficient random sampling from a database. For example, consider the following database schema:
SALE (DAY, CUST, PART, SUPP)
Imagine that we want to support fast, random sampling from this table, and most of our queries include
a temporal range predicate on the DAY attribute. This is exactly the interface provided by a materialized
sample view. A materialized sample view can be specified with the following SQL-like query:
CREATE MATERIALIZED SAMPLE VIEW MySam
AS SELECT * FROM SALE
INDEX ON DAY
In general, the range attribute or attributes referenced in the INDEX ON clause can be spatial, temporal,
or otherwise, depending on the requirements of the application.
While the materialized sample view is a straightforward concept, efficient implementation is difficult. The primary technical contribution of this paper is a novel index structure called the ACE Tree
(Appendability, Combinability, Exponentiality; see Section 4) which can be used to efficiently implement
a materialized sample view. Such a view, stored as an ACE-Tree, has the following characteristics:
• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [19] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates.
• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.
• Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.
1 This term was originally used in Olken's PhD thesis [19] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling.
We note that while the materialized sample view is a logical concept, the actual file organization used
to implement such a view can be referred to as a sample index since it is a primary index structure to
efficiently retrieve random samples.
Paper Organization
In the next Section we give an overview of three obvious approaches for supporting such sample views,
and in Section 3 we present an overview of our approach, based on the ACE Tree. Section 4 discusses
important properties of the ACE Tree while Section 5 gives the detailed design and construction of the
index structure. We present and analyze the retrieval algorithms in Section 6 and experimental results are
discussed in Section 8. Section 9 concludes the paper and discusses future avenues of research.
II. EXISTING SAMPLING TECHNIQUES
In this Section, we discuss three simple techniques that can be used to create materialized sample views
to support random sampling from a relational selection predicate.
A. Randomly Permuted Files
One option for creating a materialized sample view is to randomly shuffle or permute the records in the
view. To sample from a relational selection predicate over the view, we scan it sequentially from beginning
to end and accept those records that satisfy the predicate while rejecting the rest. This method has the
advantage that it is very simple, and using a fast external sorting algorithm, permuting the records can
be very efficient. Furthermore, since the process of scanning the file can make use of the fast, sequential
I/O provided by modern hard disks, a materialized view organized as a randomly permuted file can be
very useful for answering queries that are not very selective.
However, the major problem with such a materialized view is that the fraction of useful samples retrieved
by it is directly proportional to the selectivity of the selection predicate. For example, if the selectivity of
the query is 10%, then on average only 10% of the random samples obtained by such a view can be used
to answer the query. Hence for moderate to low selectivity queries, most of the random samples retrieved
by such a view will not be useful for answering queries. Thus, the performance of such a view quickly
degrades as selectivity of the selection predicates decreases.
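The scan-and-filter behaviour described above can be sketched as follows; the record layout and the predicate are illustrative stand-ins, not the paper's implementation. The sketch makes the cost visible: the fraction of scanned records that is useful equals the predicate's selectivity.

```python
import random

def sample_from_permuted_file(records, predicate, n):
    """Scan a randomly permuted file sequentially, accepting records that
    satisfy the predicate and rejecting the rest; the accepted records form
    a random sample of the matching records."""
    sample, scanned = [], 0
    for rec in records:              # fast sequential scan
        scanned += 1
        if predicate(rec):           # accept/reject against the predicate
            sample.append(rec)
            if len(sample) == n:
                break
    return sample, scanned

# With a 10%-selective predicate, roughly ten records must be scanned for
# every useful sample obtained.
random.seed(1)
data = list(range(1000))
random.shuffle(data)                 # the "randomly permuted file"
sample, scanned = sample_from_permuted_file(data, lambda r: r < 100, 20)
```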
Algorithm 1: Sampling from a Ranked B+-Tree

Algorithm SampleRankedB+Tree(Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number i between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.
B. Sampling from Indices
The second approach to creating a materialized sample view is to use one of the standard indexing
structures like a hashing scheme or a tree-based index structure to organize the records in the view.
In order to produce random samples from such a materialized view, we can employ iterative or batch
sampling techniques [16], [18]–[21] that sample directly from a relational selection predicate, thus avoiding
the aforementioned problem of obtaining too few relevant records in the sample. Olken [19] presents a
comprehensive analysis and comparison of many such techniques. In this Section we discuss the technique
of sampling from a materialized view organized as a ranked B+-Tree, since it has been proven to be the
most efficient existing iterative sampling technique in terms of the number of disk accesses. A ranked B+-Tree
is a regular B+-Tree whose internal nodes have been augmented with information which permits one to
find the ith record in the file.
Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+-Tree file
indexed on the attribute DAY and we want to retrieve a random sample of records whose DAY attribute
value falls between 11-28-2004 and 03-02-2005. This translates to the following SQL query:
SELECT * FROM SALE
WHERE SALE.DAY BETWEEN ’11-28-2004’ AND ’03-02-2005’
Algorithm 1 above can then be used to obtain a random sample of relevant records from the ranked
B+-Tree file.
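Algorithm 1 can be sketched in Python as follows; the `rank_to_record` callback is a hypothetical stand-in for the root-to-leaf traversal that uses the rank counts in internal nodes, and each call to it models one random disk I/O.

```python
import random

def sample_ranked_bptree(r1, r2, rank_to_record, n):
    """Sketch of Algorithm 1: sample n records without replacement from a
    ranked B+-Tree, given the ranks r1 and r2 bounding the query range."""
    seen, sample = set(), []
    while len(sample) < n:
        i = random.randint(r1, r2)        # step 3.a: uniform rank in [r1, r2]
        if i in seen:                     # step 3.b: discard repeated ranks
            continue
        seen.add(i)
        sample.append(rank_to_record(i))  # step 3.c: fetch the rank-i record
    return sample

# Hypothetical usage: a table sorted on the key, addressed by rank.
table = list(range(0, 200, 2))
sample = sample_ranked_bptree(10, 30, lambda i: table[i], 5)
```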
The drawback of the above algorithm is that whenever a leaf page is accessed, the algorithm retrieves
only that record whose rank matches with the rank being searched for. Hence for every record which
resides on a page that is not currently buffered, the retrieval time is the same as the time required for a
random disk I/O. Thus, as long as there are unbuffered leaf pages containing candidate records, the rate
of record retrieval is very slow.
C. Block-based Random Sampling
While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time, it is possible to
sample from an indexing structure such as a B+-Tree, and make use of entire blocks of records [14] [22].
The number of records per block is typically on the order of 100 to 1000, leading to a speedup of two or
three orders of magnitude in the number of records retrieved over time if all of the records in each block
are consumed, rather than a single record.
However, there are two problems with this approach. First, if the structure is used to estimate the
answer to some aggregate query, then the confidence bounds associated with any estimate provided after
N samples have been retrieved from a range predicate using a B+-Tree (or some other index structure)
may be much wider than the confidence bounds that would have been obtained had all N samples been
independent. In the extreme case where the values on each block of records are closely correlated with one
another, all of the N samples may be no better than a single sample. Second, any algorithm which makes
use of such a sample must be aware of the block-based method used to sample the index, and adjust its
estimates accordingly, thus adding complexity to the query result estimation process. For algorithms such
as Bradley’s K-means algorithm [5], it is not clear whether or not such samples are even appropriate.
III. OVERVIEW OF OUR APPROACH
We propose an entirely different strategy for implementing a materialized sample view. Our strategy
uses a new data structure called the ACE Tree to index the records in the sample view. At the highest
level, the ACE Tree partitions a data set into a large number of different random samples such that each
is a random sample without replacement from one particular range query. When an application asks to
sample from some arbitrary range query, the ACE Tree and its associated algorithms filter and combine
these samples so that very quickly, a large and random subset of the records satisfying the range query
is returned. The sampling algorithm of the ACE Tree is an online algorithm, which means that as time
progresses, a larger and larger sample is produced by the structure. At all times, the set of records retrieved
is a true random sample of all the database records matching the range selection predicate.
A. ACE Tree Leaf Nodes
The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has two components:
1) A set of h ranges, where a range is a pair of key values in the domain of the key attribute and h is
the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling
in several different ranges. The ith range associated with leaf node L is denoted by L.Ri . The h
different ranges associated with a leaf node are hierarchical; that is L.R1 ⊃ L.R2 ⊃ · · · ⊃ L.Rh .
The first range in any leaf node, L.R1 , always corresponds to the range (−∞, ∞), so the first section holds a uniform random sample of all records of the database. The hth range in any leaf node is the smallest of the ranges in that leaf node.
2) A set of h associated sections. The ith section of leaf node L is denoted by L.Si . The section L.Si
contains a random subset of all the database records with key values in the range L.Ri .
Figure 1 depicts an example leaf node in the ACE Tree with attribute range values written above each
section and section numbers marked below. Records within each section are shown as circles.
Fig. 1. Structure of a leaf node of the ACE tree. (Four sections: S1 = {75, 36, 47, 41} with R1: 0-100; S2 = {18, 10, 22} with R2: 0-50; S3 = {25, 3} with R3: 0-25; S4 = {11, 7, 1} with R4: 0-12.)
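As a concrete rendering of this layout, the leaf of Figure 1 can be written down as a small Python structure; the class is an illustrative sketch, not the on-disk format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LeafNode:
    """ACE Tree leaf: h hierarchical ranges R1 ⊃ R2 ⊃ ... ⊃ Rh and, for
    each range, a section holding a random sample of the records whose key
    falls in that range."""
    ranges: List[Tuple[int, int]]
    sections: List[List[int]]

# The leaf of Figure 1: four nested ranges and their sections.
leaf = LeafNode(
    ranges=[(0, 100), (0, 50), (0, 25), (0, 12)],
    sections=[[75, 36, 47, 41], [18, 10, 22], [25, 3], [11, 7, 1]],
)
```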
B. ACE Tree Structure
Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes used to index
leaf nodes, and leaf nodes used to store the actual data. Since the internal nodes in a binary tree are much
smaller than disk pages, they are packed and stored together in disk-page-sized units [23]. Each internal
node has the following components:
1) A range R of key values associated with the node.
2) A key value k that splits R and partitions the data on the left and right of the node.
3) Pointers ptrl and ptrr , that point to the left and right children of the node.
4) Counts cntl and cntr , that give the number of database records falling in the ranges associated with
the left and right child nodes. These values can be used, for example, during evaluation of online
aggregation queries which require the size of the population from which we are sampling [4].
Fig. 2. Structure of the ACE Tree. (Internal node I1,1 carries key 50 over range 0-100; I2,1 and I2,2 carry keys 25 and 75 over ranges 0-50 and 51-100; I3,1 through I3,4 carry keys 12, 37, 62, and 88. Leaf nodes L1 through L8 hang below, and the four sections of L4 cover the ranges 0-100, 0-50, 26-50, and 38-50.)
Figure 2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal node at level i.
The root node is labeled with a range I1,1 .R = [0-100], signifying that all records in the data set have
key values within this range. The key of the root node partitions I1,1 .R into I2,1 .R = [0-50] and I2,2 .R =
[51-100]. Similarly, each internal node divides the range of its descendants with its own key.
The ranges associated with each section of a leaf node are determined by the ranges associated with
each internal node on the path from the root node to the leaf. For example, if we consider the path from
the root node down to leaf node L4 , the ranges that we encounter along the path are 0-100, 0-50, 26-50
and 38-50. Thus for L4 , L4 .S1 has a random sample of records in the range 0-100, L4 .S2 has a random
sample in the range 0-50, L4 .S3 has a random sample in the range 26-50, while L4 .S4 has a random
sample in the range 38-50.
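Under a splitting convention consistent with the ranges quoted for L4 (integer keys, where a left turn keeps [lo, k] and a right turn keeps [k + 1, hi]; the exact boundary handling is an assumption), a leaf's section ranges can be derived from its root-to-leaf path:

```python
def path_ranges(root_range, keys, directions):
    """Derive a leaf's section ranges from the split keys and turn
    directions on its root-to-leaf path (illustrative sketch)."""
    lo, hi = root_range
    out = [(lo, hi)]
    for k, d in zip(keys, directions):
        lo, hi = (lo, k) if d == 'L' else (k + 1, hi)  # narrow the range
        out.append((lo, hi))
    return out

# Path to L4 in Figure 2: left at key 50, then right at 25, right at 37.
ranges = path_ranges((0, 100), [50, 25, 37], ['L', 'R', 'R'])
# ranges == [(0, 100), (0, 50), (26, 50), (38, 50)]
```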
C. Example Query Execution in ACE Tree
In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a large random
sample of records for any given range query. The query algorithm is formally described in Section 6.
Let Q = [30-65] be our example query postulated over the ACE Tree depicted in Figure 2. The query
algorithm starts at I1,1 , the root node. Since I2,1 .R overlaps Q, the algorithm decides to explore the left
child node labeled I2,1 in Figure 2. At this point the two range values associated with the left and right
children of I2,1 are 0-25 and 26-50. Since the left child range has no overlap with the query range, the
algorithm chooses to explore the right child next. At this child node (I3,2 ), the algorithm picks leaf node L3
to be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally encompasses
Q) are filtered for Q and returned immediately to the consumer of the sample as a random sample from
the range [30-65], while records from sections 2, 3 and 4 are stored in memory. Figure 3 shows the one
random sample from section 1 of L3 which can be used directly for answering query Q.
Fig. 3. Random samples from section 1 of L3. (L3.S1, whose range 0-100 encompasses Q, is filtered with σ30−65 to yield a record usable for the query.)
Next, the algorithm again starts at the root node and now chooses to explore the right child node
I2,2 . After performing range comparisons, it explores the left child of I2,2 which is I3,3 since I3,4 .R has
no overlap with Q. The algorithm chooses to visit the left child node of I3,3 next, which is leaf node
L5 . This is the second leaf node to be retrieved. As depicted in Figure 4, since L5 .R1 encompasses Q, the records of L5 .S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use.
Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven
randomly selected records from the query range. However, in a real index, this number would be many
times greater. Thus, the ACE Tree supports “fast first” sampling from a range predicate: a large number
of samples are returned very quickly. We contrast this with a sample taken from a B+-Tree having a
similar structure to the ACE Tree depicted in Figure 2. The B+-Tree sampling algorithm would need to
pre-select which nodes to explore. Since four leaf nodes in the tree are needed to span the query range,
there is a reasonably high likelihood that the first four samples taken would need to access all four leaf
nodes. As the ACE Tree Query Algorithm progresses, it goes on to retrieve the rest of the leaf nodes in
the order L4 , L6 , L1 , L7 , L2 , L8 .
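The emission step of this walkthrough can be sketched as follows. The buffered state and record values below are made up for illustration, and the sketch assumes the buffered ranges at each section index are contiguous, which the query algorithm's traversal order ensures.

```python
def emit_covered_samples(buffered, qlo, qhi):
    """Sketch of the 'fast first' emission step: for each section index,
    once the buffered ranges jointly cover the query, filter the pooled
    records and return the matches."""
    out = []
    for ranges, records in buffered.values():
        lo = min(r[0] for r in ranges)
        hi = max(r[1] for r in ranges)
        if lo <= qlo and qhi <= hi:          # combined range covers Q
            out.extend(x for x in records if qlo <= x <= qhi)
    return out

# Illustrative state after retrieving two leaves for Q = [30, 65]: the
# section-1 range already covers Q, and the two section-2 ranges 0-50 and
# 51-100 combine to cover it as well.
buffered = {
    1: ([(0, 100)], [55, 72, 89]),
    2: ([(0, 50), (51, 100)], [45, 18, 48, 61, 59]),
}
result = emit_covered_samples(buffered, 30, 65)
```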
Fig. 4. Combining samples from L3 and L5. (Matching sections of the two leaves, such as L3.S2 with range 0-50 and L5.S2 with range 51-100, are combined and the results filtered with σ30−65.)
D. Choice of Binary Versus k-Ary Tree
The ACE Tree as described above can also be implemented as a k-ary tree instead of a binary tree.
For example, for a ternary tree, each internal node can have two (instead of one) keys and three (instead
of two) children. If the height of the tree were h, every leaf node would still have h ranges and h sections associated with it. As in a standard complete k-ary tree, the number of leaf nodes would be k^h. However,
the big difference would be the manner in which a query is executed using a k-ary ACE Tree as opposed
to a binary ACE Tree. The query algorithm will always start at the root node and traverse down to a
leaf. However, at every internal node it will alternate between the k children in a round-robin fashion.
Moreover, since the data space would be divided into k equal parts at each level, the query algorithm
might have to make k traversals and hence access k leaf nodes before it can combine sections that can
be used to answer the query. This would mean that the query algorithm will have to wait longer (than a
binary ACE Tree) before it can combine leaf node sections and thus return useful random samples. Since
the goal of the ACE Tree is to support “fast first” sampling, use of a binary tree instead of a k-ary tree
seems to be a better choice to implement the ACE Tree.
IV. PROPERTIES OF THE ACE TREE
In this Section we describe the three important properties of the ACE Tree which facilitate the efficient
retrieval of random samples from any range query, and will be instrumental in ensuring the performance
of the algorithm described in Section 6.
A. Combinability
Fig. 5. Combining two sections of leaf nodes of the ACE tree. (The second sections of L1 and L3, each with range 0-50, are filtered with σ3−47 and the filtered samples are combined.)
The various samples produced from processing a set of leaf nodes are combinable. For example, consider
the two leaf nodes L1 and L3 , and the query “Compute a random sample of the records in the query range
Ql = [3 to 47]”. As depicted in Figure 5, first we read leaf node L1 and filter the second section in order
to produce a random sample of size n1 from Ql which is returned to the user. Next we read leaf node L3 ,
and filter its second section L3 .S2 to produce a random sample of size n2 from Ql which is also returned
to the user. At this point, the two sets returned to the user constitute a single random sample from Ql of
size n1 + n2 . This means that as more and more nodes are read from disk, the records contained in them
can be combined to obtain an ever-increasing random sample from any range query.
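Combinability can be sketched directly; the record values below are illustrative, not the exact contents of Figure 5. Two sections whose ranges each cover Ql are filtered separately, and the filtered samples are simply concatenated into one larger random sample.

```python
def filter_section(section_records, qlo, qhi):
    """Filter one leaf-node section down to the query range."""
    return [r for r in section_records if qlo <= r <= qhi]

# Two sections whose ranges each cover Ql = [3, 47], filtered separately.
s1 = filter_section([18, 10, 22, 25, 3], 3, 47)   # from L1, size n1
s2 = filter_section([39, 26, 33, 45], 3, 47)      # from L3, size n2
combined = s1 + s2   # a single random sample of size n1 + n2 from Ql
```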
B. Appendability
The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk , Lj .Si ∪ Lk .Si is always a true random sample of all records of the database with key values within the range Lj .Ri ∪ Lk .Ri . For example, reconsider the query, “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql .
This means that sections are never wasted.
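The appending step can be sketched as follows, over integer keys with ranges as (lo, hi) pairs; the record values are illustrative. Unlike combinability, which merges already-filtered samples, appending unions raw sections first and filters afterward.

```python
def append_sections(sec_j, rng_j, sec_k, rng_k):
    """Append the i-th sections of two leaves Lj and Lk; the result is a
    random sample of the records with keys in Lj.Ri ∪ Lk.Ri."""
    appended = sec_j + sec_k
    combined = (min(rng_j[0], rng_k[0]), max(rng_j[1], rng_k[1]))
    return appended, combined

# Third sections of L1 (range 0-25) and L3 (range 26-50) are appended,
# then filtered down to Ql = [3, 47].
records, rng = append_sections([25, 3], (0, 25), [45, 29, 34], (26, 50))
in_ql = [r for r in records if 3 <= r <= 47]
```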
Fig. 6. Appending two sections of leaf nodes of the ACE tree. (The third sections of L1, with range 0-25, and L3, with range 26-50, are appended and the result is filtered with σ3−47.)
C. Exponentiality
The ranges in a leaf node are exponential. The number of database records that fall in L.Ri is twice the number of records that fall in L.Ri+1 . This allows the ACE Tree to maintain the invariant that for any query Q′ over a relation R such that at least hµ database records fall in Q′, and with |R|/2^(k+1) ≤ |σQ′(R)| ≤ |R|/2^k for all k ≤ h − 1, there exists a pair of leaf nodes Li and Lj , where at least one-half of the database records falling in Li .Rk+2 ∪ Lj .Rk+2 are also in Q′. Here, µ is the average number of records in each section, and h is the height of the tree or, equivalently, the total number of sections in any leaf node.
While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q′. As an illustration, consider query Q over the
ACE Tree of Figure 2. Note that the number of database records falling in Q is greater than one-fourth,
but less than half the database size. The exponentiality property assures us that Q can be totally covered
by appending sections of two different leaf nodes. In our example, this means that Q can be covered by
appending section 3 of nodes L4 and L6 . If RC = L4 .R3 ∪ L6 .R3 , then by the invariant given above we can claim that |σQ(R)| ≥ (1/2) × |σRC(R)|.
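The arithmetic behind this choice can be sketched as a small helper that, given the database size and the number of records matching the query, finds the k of the invariant; the function name and signature are illustrative.

```python
import math

def covering_section(db_size, query_matches, h):
    """Find k with |R|/2**(k+1) <= |σQ'(R)| <= |R|/2**k; by the
    exponentiality property, section k + 2 of a suitable pair of leaves then
    covers Q', with at least half of the covered records lying inside Q'."""
    k = int(math.floor(math.log2(db_size / query_matches)))
    return min(k, h - 1)

# A query matching between 1/4 and 1/2 of a 1000-record database gives
# k = 1, so section 3 (= k + 2) suffices, as in the L4/L6 example.
k = covering_section(1000, 300, 4)
```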
V. CONSTRUCTION OF THE ACE TREE
In this Section, we present an I/O efficient, bulk construction algorithm for the ACE Tree.
A. Design Goals
The algorithm for building an ACE Tree index is designed with the following goals in mind:
1) Since the ACE Tree may index enormous amounts of data, construction of the tree should rely on
efficient, external memory algorithms, requiring as few passes through the data set as possible.
2) In the resulting data structure, the data which are placed in each leaf node section must constitute a
true random sample (without replacement) of all database records lying within the range associated
with that section.
3) Finally, the tree must be constructed in such a way as to have the exponentiality, combinability, and
appendability properties necessary for supporting the ACE Tree query algorithms.
B. Construction
The construction of the ACE Tree proceeds in two distinct phases. Each phase comprises two
read/write passes through the data set (that is, constructing an ACE-Tree from scratch requires two external
sorts of a large database table). The two phases are as follows:
1) During Phase 1, the data set is sorted based on the record key values. This sorted order of records
is used to provide the split points associated with each internal node in the tree.
2) During Phase 2, the data are organized into leaf nodes based on those key values. Disk blocks
corresponding to groups of internal nodes can easily be constructed at the same time as the final
pass through the data writes the leaf nodes to disk.
Fig. 7. Choosing keys for internal nodes. (The sorted key list 3, 7, 10, ..., 98 is split recursively at its medians: the median record 50 becomes I1,1.k; 25 and 75 become I2,1.k and I2,2.k; and 12, 37, 62, and 88 become the keys of I3,1 through I3,4.)
C. Construction Phase 1
The primary task of Phase 1 is to assign split points to each internal node of the tree. To achieve this,
the construction algorithm first sorts the data set based upon keys of the records, as depicted in Figure 7.
After the dataset is sorted, the median record for the entire data set is determined (this value is 50 in
our example). This record’s key will be used as the key associated with the root of the ACE Tree, and
will determine L.R2 for every leaf node in the tree. We denote this key value by I1,1 .k, since the value
serves as the key of the first internal node in level 1 of the tree.
After determining the key value associated with the root node, the medians of each of the two halves
of the data set partitioned by I1,1 .k are chosen as keys for the two internal nodes at the next level: I2,1 .k
and I2,2 .k, respectively. In the example of Figure 7, these values are 25 and 75. I2,1 .k and I2,2 .k, along
with I1,1 .k, will determine L.R3 for every leaf node in the tree. The process is then repeated recursively
until enough medians2 have been obtained to provide every internal node with a key value. Note that at the same time that these various key values are determined, the values cntl and cntr can also be determined.
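The recursive median selection can be sketched as follows; the exact median convention for even-length segments is an assumption, but on the sorted keys of Figure 7 the sketch reproduces the split points quoted in the text.

```python
def assign_internal_keys(sorted_keys, height):
    """Phase-1 sketch: choose internal-node keys as recursive medians of the
    sorted data set; level i of the result holds the 2**i split keys for the
    internal nodes at depth i + 1."""
    segments = [sorted_keys]
    levels = []
    for _ in range(height - 1):
        keys, nxt = [], []
        for seg in segments:
            m = len(seg) // 2
            keys.append(seg[m])              # the median becomes the key
            nxt += [seg[:m], seg[m + 1:]]    # recurse on both halves
        levels.append(keys)
        segments = nxt
    return levels

# The sorted keys of Figure 7 reproduce the paper's split points.
data = [3, 7, 10, 12, 15, 18, 22, 25, 29, 33, 36, 37, 41, 47, 50, 50,
        53, 58, 60, 62, 69, 72, 74, 75, 77, 81, 84, 88, 89, 92, 98]
levels = assign_internal_keys(data, 4)
# levels == [[50], [25, 75], [12, 37, 62, 88]]
```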
This simple strategy for choosing the various key values in the tree ensures that the exponentiality
property will hold. If the data space between Ii+1,2j−1 .k and Ii,j .k corresponds to some leaf node range
L.Rn , then the data space between Ii+1,2j−1 .k and Ii+2,4j−2 .k will correspond to some range L.Rn+1 .
2 We choose a value for the height of the tree in such a manner that the expected size of a leaf node (see Sec. V-F) does not exceed
one logical disk block. Choosing a node size that corresponds to the block size is done for the same reason it is done in most traditional
indexing structures: typically, the system disk block size has already been carefully chosen by a DBA to balance speed of sequential access
(which demands a larger block size) with the cost of accessing more data than is needed (which demands a smaller block size).
Since Ii+2,4j−2 .k is the midpoint of the data space between Ii+1,2j−1 .k and Ii,j .k, we know that two times
as many database records should fall in L.Rn , compared with L.Rn+1 .
The following example also shows how the invariant described in Section 4.3 is guaranteed by adopting
the aforementioned strategy of assigning key values to internal nodes. Consider the ACE Tree of Figure 2.
Figure 8 shows the keys of the internal nodes as medians of the dataset R. We also consider two example
queries, Q1 and Q2 such that the number of database records falling in Q2 is greater than one-fourth but
less than one-half of the database size, while the number of database records falling in Q1 is more than
half the database size.
Fig. 8. Exponentiality property of ACE Tree. (The keys 12, 25, 37, 50, 62, 75, 88 partition R; Q1 covers more than |R|/2 of the records and Q2 covers between |R|/4 and |R|/2.)
Q1 can be answered by appending section 2 of (for example) L4 and L8 (refer to Figure 2). Let RC1 = L4 .R2 ∪ L8 .R2 . Then all the database records fall in RC1 . Moreover, since |σQ1(R)| ≥ |R|/2, we have |σQ1(R)| ≥ (1/2) × |σRC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6 . If RC2 = L4 .R3 ∪ L6 .R3 , then half the database records fall in RC2 . Also, since |σQ2(R)| ≥ |R|/4, we have |σQ2(R)| ≥ (1/2) × |σRC2(R)|. This can be generalized to obtain the invariant stated in Section 4.3.
D. Construction Phase 2
The objective of Phase 2 is to construct leaf nodes with appropriate sections and populate them with
records. This can be achieved by the following three steps:
1) Assign a uniformly generated random number between 1 and h to each record as its section number.
2) Associate an additional random number with the record that will be used to identify the leaf node
to which the record will be assigned.
Fig. 9. Phase 2 of tree construction: (a) records assigned section numbers; (b) records assigned leaf numbers; (c) records organized into leaf nodes.
3) Finally, re-organize the file by performing an external sort to group records in a given leaf node
and a given section together.
Figure 9(a) depicts our example data set after we have assigned each record a randomly generated
section number, assuming four sections in each leaf node.
In Step 2, the algorithm assigns one more randomly generated number to each record, which will
identify the leaf node to which the record will be assigned. We assume for our example that the number
of leaf nodes is 2^(h−1) = 2^3 = 8. The number identifying the leaf node is assigned as follows.
1) First, the section number of the record is checked. We denote this value as s.
2) We then start at the root of the tree and traverse down by comparing the record key with s − 1 key
values. After the comparisons, if we arrive at an internal node, Ii,j , then we assign the record to
one of the leaves in the subtree rooted at Ii,j .
From the example of Figure 9(a), the first record having key value 3 has been assigned to section 1. Since
this record can be randomly assigned to any leaf from 1 through 8, we assign it to leaf 7.
The next record of Figure 9(a) has been assigned to section number 2. Referring back to Figure 7, we
see that the key of the root node is 50. Since the key of the record is 7 which is less than 50, the record
will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and
4 to this record. In our example, we randomly choose the leaf node 3.
For the next record having key value 10, we see that the section number assigned is 3. To assign a leaf
node to this record, we initially compare its key with the key of the root node. Referring to Figure 7, we
see that 10 is smaller than 50; hence we then compare it with 25 which is the key of the left child node
of the root. Since the record key is smaller than 25, we assign the record to some leaf node in the left
subtree of the node with key 25 by assigning to it a random number between 1 and 2.
The section number and leaf node identifiers for each record are written in a small amount of temporary
disk space associated with each record. Once all records have been assigned to leaf nodes and sections,
the dataset is re-organized into leaf nodes using a two-pass external sorting algorithm as follows:
• Records are sorted in ascending order of their leaf node number.
• Records with the same leaf node number are arranged in ascending order of their section number.
The re-organized data set is depicted in Figure 9(c).
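The three steps above can be sketched as follows. This is a simplified in-memory version; the function and variable names are ours rather than the paper's, and the two-pass external sort is replaced by an ordinary in-memory sort:

```python
import random

def phase2_assign(records, medians, h):
    """Assign each record key a random section in 1..h, then a leaf chosen
    by s-1 median comparisons followed by a uniform pick among the feasible
    leaves, and finally group records by (leaf, section). `medians` stores
    the internal-node keys as a complete binary tree in a list (children of
    index i at 2i+1 and 2i+2); there are 2**(h-1) leaves, numbered from 1."""
    out = []
    for key in records:
        s = random.randint(1, h)               # Step 1: section number
        node, lo, hi = 0, 0, 2 ** (h - 1)      # feasible leaf range (lo, hi]
        for _ in range(s - 1):                 # Step 2: s-1 key comparisons
            mid = (lo + hi) // 2
            if key < medians[node]:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        leaf = random.randint(lo + 1, hi)      # uniform among feasible leaves
        out.append((leaf, s, key))
    out.sort()                                 # Step 3: by leaf, then section
    return out
```

For example, a record with key 7 and section number 2 is compared once against the root key 50, and therefore lands in one of leaves 1 through 4, matching the walkthrough above.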
E. Combinability/Appendability Revisited
In Phase 2 of the tree construction, we observe that all records belonging to some section k are
segregated based upon the result of the comparison of their key with the appropriate medians, and are
then randomly assigned a leaf node number from the feasible ones. Thus, if records from section s of all
leaf nodes are merged together, we will obtain all of the section s records. This ensures the appendability
property of the ACE Tree.
Also note that the probability of assignment of one record to a section is unaffected by the probability of
assignment of some other record to that section. Since this results in each section having a random subset
of the database records, it is possible to merge a sample of the records from one section that match a range
query with a sample of records from a different section that match the same query. This will produce a
larger random sample of records falling in the range of the query, thus ensuring the combinability property.
F. Page Alignment
In Phase 2 of the construction algorithm, section numbers and leaf node numbers are randomly
generated. Hence we can only predict on expectation the number of records that will fall in each section
of each leaf node. As a result, section sizes within each leaf node can differ, and the size of a leaf node
itself is variable and will generally not be equal to the size of a disk page. Thus when the leaf nodes are
written out to disk, a single leaf node may span across multiple disk pages or may be contained within
a single disk page.
This situation could be avoided if we fix the size of each section a priori. However, this poses a serious
problem. Consider two leaf node sections Li .Sj and Li+1 .Sj . We can force these two sections to contain
the same number of records by ensuring that the set of records assigned to section j in Phase 2 of the
construction algorithm has equal representation from Li .Rj and Li+1 .Rj . However, this means that the
set of records assigned to section j is no longer random. If we fix the section size and force a set number
of records to fall in each section, we invalidate the appendability and combinability properties of the
structure. Thus, we are forced to accept a variable section size.
In order to implement variable section size, we can adopt one of the following two schemes:
1) Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.
2) Allow variable-sized leaf nodes along with variable-sized sections.
If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is fixed in advance.
However, section size is allowed to vary. This allows full sections to grow further by claiming any available
space within the leaf node. The leaf node size chosen should be large enough to prevent any leaf node
from becoming completely filled up, which prevents the partitioning of any leaf node across two disk
pages. The major drawback of this scheme is that the average leaf node space utilization will be very
low. Assuming a reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be
99% sure that no leaf node gets filled up, the average leaf node space utilization will be less than 15%.
The variable-sized leaf node, variable-sized section scheme does not impose a size limit on either the
leaf node or the section. It allows leaf nodes to grow beyond disk page boundaries, if space is required.
The important advantage of this scheme is that it is space-efficient. The main drawback of this approach
is that leaf nodes may span multiple disk pages, and hence all such pages must be accessed in order to
retrieve such a leaf node. Given that most of the cost associated with reading an arbitrary leaf page is
associated with the disk head movement needed to move the disk arm to the appropriate cylinder, this
does not pose too much of a problem. Hence we use this scheme for the construction of leaf nodes of
the ACE Tree.
VI. QUERY ALGORITHM
In this Section, we describe in detail the algorithm used to answer range queries using the ACE Tree.
A. Goals
The algorithm has been designed to meet the primary goal of achieving “fast-first” sampling from the
index structure: it attempts to be greedy with respect to the number of records relevant to the query in
the early stages of execution. In order to meet this goal, the query answering algorithm identifies the leaf
nodes which contain the maximum number of sections relevant to the query. A section Li1.Sj is relevant
for a range query Q if Li1.Rj ∩ Q ≠ ∅ and Li1.Rj ∪ Li2.Rj ∪ · · · ∪ Lin.Rj ⊇ Q, where Li1, . . . , Lin are
some leaf nodes in the tree.
The query algorithm prioritizes retrieval of leaf nodes so as to:
• Facilitate the combination of sections so as to maximize n in the above formulation, and
• Maximize the number of relevant sections in each leaf node L retrieved, such that L.Sj ∩ Q ≠ ∅
for j = (c + 1) . . . h, where L.Rc is the smallest range in L that encompasses Q.
B. Algorithm Overview
At a high level, the query answering algorithm retrieves the leaf nodes relevant to answering a query
via a series of stabs or traversals, accessing one leaf node per stab. Each stab begins at the root node
and traverses down to a leaf. The distinctive feature of the algorithm is that at each internal node that is
traversed during a stab, the algorithm chooses to access the child node that was not chosen the last time
the node was traversed. For example, imagine that for a given internal node I, the algorithm chooses to
traverse to the left child of I during a stab. The next time that I is accessed during a stab, the algorithm
will choose to traverse to the right child node. This can be seen in Figure 10, when we compare the paths
taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during
the first stab, while during the second stab it chooses to traverse to the right child of the root node.
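The alternating descent can be sketched in a few lines. The toy model below is ours, simplified to key ranges only, with the direction bit stored directly on each node rather than in a separate lookup table; all names are illustrative:

```python
# Toy model of the stab traversal: each internal node remembers which
# child to visit next and toggles that choice on every visit. Overlap
# checks steer stabs away from subtrees irrelevant to the query.
class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi = lo, hi          # key range covered by the node
        self.left, self.right = left, right
        self.next = "L"                    # direction bit, toggled per visit
        self.done = False

def overlaps(node, q):
    return not (q[1] < node.lo or q[0] > node.hi)

def shuttle(q, node, visited):
    if node.left is None:                  # reached a leaf: one stab ends
        visited.append((node.lo, node.hi))
        node.done = True
        return
    l, r = node.left, node.right
    if l.done and r.done:
        node.done = True
        return
    if l.done:
        shuttle(q, r, visited)
    elif r.done:
        shuttle(q, l, visited)
    elif overlaps(l, q) and not overlaps(r, q):
        shuttle(q, l, visited)
    elif overlaps(r, q) and not overlaps(l, q):
        shuttle(q, r, visited)
    elif node.next == "L":                 # both sides viable: alternate
        node.next = "R"
        shuttle(q, l, visited)
    else:
        node.next = "L"
        shuttle(q, r, visited)
    node.done = l.done and r.done

def answer(q, root):
    visited = []
    while not root.done:                   # one stab per iteration
        shuttle(q, root, visited)
    return visited
```

On a two-level tree over [0, 100], a query covering the whole domain retrieves the four leaves in the interleaved order (0–25), (51–75), (26–50), (76–100): exactly the disparate-first behavior described here.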
Fig. 10. Execution runs of the query answering algorithm: (a) Stab 1, 1 contributing section; (b) Stab 2, 3 contributing sections; (c) Stab 3, 7 contributing sections; (d) Stab 4, 16 contributing sections.

The advantage of retrieving leaf nodes in this back-and-forth sequence is that it allows us to quickly
retrieve a set of leaf nodes with the most disparate sections possible in a given number of stabs. The
reason that we want a non-homogeneous set of nodes is that nodes from very distant portions of a query
range will tend to have sections covering large ranges that do not overlap. This allows us to append
sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes.
The samples obtained can then be filtered and immediately returned.
This order of retrieval is implemented by associating a bit with each internal node that indicates whether
the next child node to be retrieved should be the left node or the right node. The value of this bit is toggled
every time the node is accessed. Figure 10 illustrates the choices made by the algorithm at each internal
node during four separate stabs. Note that when the algorithm reaches an internal node where the range
associated with one of the child nodes has no overlap with the query range, the algorithm always picks
the child node that has overlap with the query, irrespective of the value of the indicator bit. The only
exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query
range have been accessed. In such a case, the internal node which overlaps the query range is not chosen
and is never accessed again.
C. Data Structures
In addition to the structure of the internal and leaf nodes of the ACE Tree, the query algorithm uses
and updates the following two memory resident data structures:
1) A lookup table T , to store internal node information in the form of a pair of values (next = [LEFT] |
[RIGHT], done = [TRUE] | [FALSE]). The first value indicates whether the next node to be retrieved
should be the left child or right child. The second value is TRUE if all leaf nodes in the subtree
rooted at the current node have already been accessed, else it is FALSE.
2) An array buckets[h] to hold sections of all the leaf nodes which have been accessed so far and
whose records could not be used to answer the query. h is the height of the ACE Tree.
D. Actual Algorithm
We now present the algorithms used for answering queries using the ACE Tree. Algorithm 2 simply
calls Algorithm 3 which is the main tree traversal algorithm, called Shuttle(). Each traversal or stab begins
at the root node and proceeds down to a leaf node. In each invocation of Shuttle(), a recursive call is
made to either its left or right child with the recursion ending when it reaches a leaf node. At this point,
the sections in the leaf node are combined with previously retrieved sections so that they can be used
to answer the query. The algorithm for combining sections is described in Algorithm 4. This algorithm
determines the sections that are required to be combined with every new section s that is retrieved and then
searches for them in the array buckets[]. If all sections are found, it combines them with s and removes
them from buckets[]. If it does not find all the required sections in buckets[], it stores s in buckets[].
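A minimal sketch of this bucket bookkeeping is below. It is our own simplification, not the paper's implementation: sections are modeled as (range, records) pairs with half-open ranges, a set of sections answers the query once their ranges jointly cover it, and all names are illustrative:

```python
def covers(q, ranges):
    """True if the union of the half-open ranges contains [q0, q1)."""
    pos = q[0]
    for a, b in sorted(ranges):
        if a <= pos < b:
            pos = b
    return pos >= q[1]

def combine_tuples(q, leaf_sections, buckets):
    """leaf_sections: (sec_no, rng, records) triples from one leaf node.
    buckets: dict sec_no -> held-back (rng, records) pairs. Returns the
    records released once a section's required partners are all present."""
    out = []
    for sec_no, rng, recs in leaf_sections:
        held = buckets.setdefault(sec_no, [])
        if covers(q, [r for r, _ in held] + [rng]):
            for _, old in held:            # all required sections found:
                out.extend(old)            # combine and release them
            out.extend(recs)
            buckets[sec_no] = []
        else:
            held.append((rng, recs))       # hold s until partners arrive
    return [x for x in out if q[0] <= x < q[1]]
```

For instance, with q = (10, 60), a section covering [0, 50) is held back; once the matching section covering [50, 100) arrives, both are combined and their in-range records returned.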
E. Algorithm Analysis
We now present a lower bound on the expected performance of the ACE Tree index for sampling from
a relational selection predicate. For simplicity, our analysis assumes that the number of leaf nodes in the
tree is a power of 2.

Algorithm 2: Query Answering Algorithm
Algorithm Answer(Query Q)
  Let root be the root of the ACE Tree
  While (!T.lookup(root).done)
    T.lookup(root).done = Shuttle(Q, root);

Algorithm 3: ACE Tree traversal algorithm
Algorithm Shuttle(Query Q, Node curr_node)
  If (curr_node is an internal node)
    left_node = curr_node→get_left_node();
    right_node = curr_node→get_right_node();
    If (left_node is done AND right_node is done)
      Mark curr_node as done
    Else if (left_node is done)
      Shuttle(Q, right_node);
    Else if (right_node is done)
      Shuttle(Q, left_node);
    Else
      // Both children are not done
      If (Q overlaps only with left_node.R)
        Shuttle(Q, left_node);
      Else if (Q overlaps only with right_node.R)
        Shuttle(Q, right_node);
      Else
        // Q overlaps both sides or neither
        If (next node is LEFT)
          Shuttle(Q, left_node);
          Set next node to RIGHT;
        Else
          Shuttle(Q, right_node);
          Set next node to LEFT;
  Else
    // curr_node is a leaf node
    Combine_Tuples(Q, curr_node);
    Mark curr_node as done
Algorithm 4: Algorithm for combining sections
Algorithm Combine_Tuples(Query Q, LeafNode node)
  For each section s in node do
    Store the section numbers required to be combined with s to span Q in a list, list
    flag = true
    For each section number i in list do
      If buckets[] does not have section i
        flag = false
    If (flag == true)
      Combine all sections from list with s and use the records to answer Q
    Else
      Store s in the appropriate bucket

Lemma 1. Efficiency of the ACE Tree for query evaluation.
• Let n be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query Q
• Let p be the largest power of 2 no greater than n
• Let µ be the mean section size in the tree
• Let α be the fraction of database records falling in Q
• Let N be the size of the sample from Q that has been obtained after m ACE Tree leaf nodes have been retrieved from disk
If m is not too large (that is, if m ≤ 2αn + 2), then:

E[N] ≥ (µ/2) p log₂ p

where E[N] denotes the expected value of N (the mean value of N after an infinite number of trials).
Proof: Let Ii,j and Ii,j+1 be the two internal nodes in the ACE Tree where R = Ii,j.R ∪ Ii,j+1.R
covers Q and i is maximized. As long as the shuttle algorithm has not retrieved all the children of Ii,j
and Ii,j+1 (which is the case as long as m ≤ 2αn + 2), when the mth leaf node has been processed, the
expected number of new samples obtained is:

N_m = Σ_{k=1}^{⌊log₂ m⌋} Σ_{l=1}^{2^(k−1)} w_kl · µ

where the outer summation is over each of the h − i contributing sections of the leaf nodes, starting with
section number i up to section number h, while Σ_l w_kl represents the fraction of records of the 2^(k−1)
combined sections that satisfy Q. By the exponentiality property, Σ_l w_kl ≥ 1/2 for every k, so:

N_m ≥ (µ/2) log₂ m
Thus after m leaf nodes have been obtained, the total number of expected samples is given by:

E[N] ≥ Σ_{k=1}^{m} N_k ≥ Σ_{k=1}^{m} (µ/2) log₂ k ≥ (µ/2) m log₂ m

If m is a power of 2, the result is proven.
Lemma 2. The expected number of records µ in any leaf node section is given by:

E[µ] = |R| / (h · 2^(h−1))

where |R| is the total number of database records, h is the height of the ACE Tree, and 2^(h−1) is the number
of leaf nodes in the ACE Tree.
Proof: The probability of assigning a record to any section i, i ≤ h, is 1/h. Given that the record is
assigned to section i, it can be assigned to only one of 2^(i−1) leaf node groups after comparing with the
appropriate medians. Since each group would have 2^(h−1)/2^(i−1) candidate leaf nodes, the expected
number of records assigned to some leaf node Lj is:

E[µi,j] = Σ_{t∈R} (1/h) × [ 0 · (1 − 1/2^(i−1)) + (1/2^(i−1)) · (2^(i−1)/2^(h−1)) ] = |R| / (h · 2^(h−1))
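Lemma 2 is easy to sanity-check empirically. The sketch below simulates the Phase 2 assignment for h = 4 (eight leaves) with keys uniform on [0, 100), and confirms that every (leaf, section) cell holds roughly |R|/(h · 2^(h−1)) records; the medians, seed, and tolerance are illustrative choices of ours:

```python
import random

# Simulate the random section/leaf assignment and compare each cell's
# count against the Lemma 2 expectation |R| / (h * 2**(h-1)).
random.seed(42)
h, n_leaves, R = 4, 8, 64000
medians = [50.0, 25.0, 75.0, 12.5, 37.5, 62.5, 87.5]  # complete binary tree
counts = {}
for _ in range(R):
    key = random.uniform(0, 100)
    s = random.randint(1, h)               # uniform section in 1..h
    node, lo, hi = 0, 0, n_leaves
    for _ in range(s - 1):                 # s-1 median comparisons
        mid = (lo + hi) // 2
        if key < medians[node]:
            node, hi = 2 * node + 1, mid
        else:
            node, lo = 2 * node + 2, mid
    leaf = random.randint(lo + 1, hi)      # uniform among feasible leaves
    counts[(leaf, s)] = counts.get((leaf, s), 0) + 1
expected = R / (h * n_leaves)              # 64000 / 32 = 2000 per cell
assert all(abs(c - expected) / expected < 0.2 for c in counts.values())
```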
VII. MULTI-DIMENSIONAL ACE TREES
The ACE Tree can be easily extended to support queries that include multi-dimensional predicates.
The change needed to incorporate this extension is to use a k-d binary tree instead of the regular binary
tree for the ACE Tree. Let a1 . . . ak be the k key attributes for the k-d ACE Tree. To construct such a
tree, the key of the root node would be the median of all the a1 values in the database. Thus the root partitions the
dataset based on a1 . At the next step, we need to assign values for level 2 internal nodes of the tree. For
each of the resulting partitions of the dataset, we calculate the median of all the a2 values. These two
medians are assigned to the two internal nodes at level 2 respectively, and we recursively partition the two
halves based on a2 . This process is continued until we finish level k. At level k + 1, we again consider a1
for choosing the medians. We would then assign a randomly generated section number to every record.

Fig. 11. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
The strategy for assigning a leaf node number to the records would also be similar to the one described
in Section 5.4 except that the appropriate key attribute is used while performing comparisons with the
internal nodes. Finally, the dataset is sorted into leaf nodes as in Figure 9(c).
Query answering with the k-d ACE Tree can use the Shuttle algorithm described earlier with a few
minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all
predicates in the query should be returned. Also, the mth sections of two leaf nodes can be combined
only if they match in all m dimensions. The nth sections of two leaf nodes can be appended only if they
match in the first n − 1 dimensions and form a contiguous interval over the nth dimension.
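The dimension-cycling construction of the internal keys can be sketched as follows. This is a hypothetical helper of ours, not taken from the paper; it builds only the internal-key tree and stops at a fixed depth:

```python
# Build internal-node keys for a k-d ACE Tree: at depth d, split on
# attribute d mod k at the median of that attribute within the current
# partition, then recurse on the two halves.
def build_kd_keys(points, k, depth=0, max_depth=3):
    if depth == max_depth or not points:
        return None
    dim = depth % k                        # cycle a1 ... ak, then a1 again
    vals = sorted(p[dim] for p in points)
    med = vals[len(vals) // 2]             # median of the active attribute
    left = [p for p in points if p[dim] < med]
    right = [p for p in points if p[dim] >= med]
    return {"dim": dim, "key": med,
            "left": build_kd_keys(left, k, depth + 1, max_depth),
            "right": build_kd_keys(right, k, depth + 1, max_depth)}
```

With k = 2, the root splits on a1 and both of its children split on a2, mirroring the level-by-level alternation described above.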
VIII. BENCHMARKING
In this Section, we describe a set of experiments designed to test the ability of the ACE Tree to quickly
provide an online random sample from a relational selection predicate as well as to demonstrate that the
memory requirement of the ACE Tree is reasonable. We performed two sets of experiments. The first set
is designed to test the utility of the ACE Tree for use with one-dimensional data, where the ACE Tree
is compared with a simple sequential file scan as well as Antoshenkov’s algorithm for sampling from a
ranked B+-Tree.

Fig. 12. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

In the second set, we compare a multi-dimensional ACE Tree with the sequential file
scan as well as with the obvious extension of Antoshenkov’s algorithm to a two-dimensional R-Tree.
A. Overview
All experiments were performed on a Linux workstation having 1GB of RAM, 2.4GHz clock speed
and with two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used.
Experiment 1. For the first set of experiments, we consider the problem of sampling from a range
query of the form:
SELECT * FROM SALE
WHERE SALE.DAY >= i AND SALE.DAY <= j
We implemented and tested the following three random-order record retrieval algorithms for sampling
the range query:
1) ACE Tree Query Algorithm: The ACE Tree was implemented exactly as described in this paper. In
order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view
for the relation was created, using SALE.DAY as the indexed attribute.
Fig. 13. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
2) Random sampling from a B+-Tree: Antoshenkov’s algorithm for sampling from a ranked B+-Tree
was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary
index on the SALE relation (that is, the underlying data were actually stored within the tree), and
was constructed using the standard B+-Tree bulk construction algorithm.
3) Sampling from a randomly permuted file: We implemented this random sampling technique as
described in Section 2.1 of this paper. This is the standard sampling technique used in previous
work on online aggregation. The SALE relation was randomly permuted by assigning a random key
value k to each record. All of the records from SALE were then sorted in ascending order of each
k value using a two-phase, multi-way merge sort (TPMMS) (see Garcia-Molina et al. [24]). As the
sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file.
To sample from a range predicate using a randomly permuted file, the file is scanned from front to
back and all records matching the range predicate are immediately returned.
For the first set of experiments, we synthetically generated the SALE relation to be 20GB in size with
100B records, resulting in around 200 million records in the relation.
We began the first set of experiments by sampling from 10 different range selection predicates over
SALE using the three sampling techniques described above. 0.25% of the records from SALE satisfied
each range selection predicate.

Fig. 14. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 12 and shows results until all three sampling techniques return all the records matching the query predicate.

For each of the three random sampling algorithms, we recorded the total
number of random samples retrieved by the algorithm at each time instant. The average number of random
samples obtained for each of the ten queries was then calculated. This average is plotted as a percentage
of the total number of records in SALE along the Y-axis in Figure 11. On the X-axis, we have plotted
the elapsed time as a percentage of the time required to scan the entire relation. We chose this metric
considering the linear scan as the baseline record retrieval method. The test was then repeated with two
more sets of selection predicates that are satisfied by 2.5% and 25% of SALE's records, respectively.
The results are plotted in Figure 12 and Figure 13. For all the three figures, results are shown for the first
15 seconds of execution, corresponding to approximately 4% of the time required to scan the relation.
We show an additional graph in Figure 14 for the 2.5% selectivity case, where we plot results until all
the three record retrieval algorithms return all the records matching the query predicate.
Finally, we provide experimental results to indicate the number of records that need to be buffered
by the ACE Tree query algorithm for two different query selectivities. Figure 15(a) shows the minimum,
maximum, and average number of records stored for ten different queries having a selectivity of 0.25%,
while Figure 15(b) shows similar results for queries having selectivity 2.5%.
Experiment 2. For the second set of experiments, we add an additional attribute AMOUNT to the SALE
relation and test the following two-dimensional range query:
SELECT * FROM SALE
WHERE SALE.DAY >= d1 AND SALE.DAY <= d2
AND SALE.AMOUNT >= a1 AND SALE.AMOUNT <= a2

Fig. 15. Number of records needed to be buffered by the ACE Tree for queries with 0.25% and 2.5% selectivity. The graph shows the number of records buffered as a fraction of the total database records versus time, plotted as a percentage of the time required to scan the relation.
To generate the SALE relation, each (DAY, AMOUNT) pair in each record is generated by sampling from
a bivariate uniform distribution.
In this experiment, we again test the three random sampling options given above:
1) ACE Tree Query Algorithm: The ACE Tree for multi-dimensional data (a k-d ACE Tree) was
implemented exactly as described in Section 7. It was used to create a materialized sample view
over the DAY and AMOUNT attributes.
2) Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree
was extended in the obvious fashion for sampling from an R-Tree [25]. Just as in the case of the
B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually
stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known
Sort-Tile-Recursive [26] bulk construction algorithm.
3) Sampling from a randomly permuted file: We implemented this random sampling technique in a
similar manner to Experiment 1.
In this experiment, the SALE relation was generated so as to be about 16 GB in size. Each record in
the relation was 100B in size, resulting in approximately 160 million records.

Fig. 16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.
Just as in the first experiment, we began by sampling from 10 different range selection predicates over
SALE using the three sampling techniques described above; 0.25% of the records from SALE satisfied each
predicate. For all three random sampling algorithms, we recorded the total number of random samples
retrieved at each time instant, and then computed the average over the ten queries. This average is plotted
as a percentage of the total number of records in SALE along the Y-axis in Figure 16; the X-axis shows
the elapsed time as a percentage of the time required to scan the entire relation. The test was then repeated
with two more selection predicates, satisfied by 2.5% and 25% of the SALE relation’s records, respectively.
The results are plotted in Figure 17 and Figure 18, respectively.
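The averaging and normalization just described can be sketched as follows; the function and variable names are illustrative:

```python
def normalized_average(curves, total_records):
    """curves[i][t] is the cumulative number of random samples retrieved
    by query i at time instant t.  Returns the average over the queries at
    each instant, expressed as a percentage of the relation size."""
    n = len(curves)
    return [100.0 * sum(c[t] for c in curves) / (n * total_records)
            for t in range(len(curves[0]))]

# Two toy query runs over a relation of 1,000 records:
pct = normalized_average([[10, 20, 40], [30, 40, 60]], 1000)
```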
B. Discussion of Experimental Results
There are several important observations to be made from the experimental results. Irrespective of
the selectivity of the query, we observed that the ACE Tree provides a much faster sampling rate
during the first few seconds of query execution than the other approaches. This advantage tends
to degrade over time, but since sampling is often performed only as long as more samples are needed to
achieve a desired accuracy, the fact that the ACE Tree can provide a large, online random sample almost
immediately indicates its practical utility.

Fig. 17. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate
accepting 2.5% of the database tuples.
Another observation indicating the utility of the ACE Tree is that while it was the top performer over
the three query selectivities tested, the best alternative to the ACE Tree generally changed depending on
the query selectivity. For highly selective queries, the randomly-permuted file is almost useless because
the chance that any given record is accepted by the relational selection predicate is very low.
On the other hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively well
for highly selective queries. The reason for this is that during the sampling, if the query range is small,
then all the leaf pages of the B+-Tree (or R-Tree) containing records that match the query predicate are
retrieved very quickly. Once all of the relevant pages are in the buffer, the sampling algorithm does not
have to access the disk to satisfy subsequent sample requests and the rate of record retrieval increases
rapidly. However, for less selective queries, the randomly-permuted file works well since it can make use
of an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction of the records
retrieved match the selection predicate, the amount of waste incurred by scanning unwanted records as
well is small compared to the additional efficiency gained by the sequential scan. On the other hand,
when the range associated with a query is very large, the time required to load all of the relevant
B+-Tree (or R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run
long enough that all of the relevant pages are touched, for such a query the buffer manager cannot be
expected to buffer all of the B+-Tree (or R-Tree) pages that contain records matching the query predicate.
This is the reason that the curve for the B+-Tree in Figure 12, and for the R-Tree in Figure 17, never
leaves the y-axis for the time range plotted.

Fig. 18. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate
accepting 25% of the database tuples.
The net result of this is that if an ACE Tree were not used, it would probably be necessary to use both
a B+-Tree and a randomly-permuted file in order to ensure satisfactory performance in the general case.
Again, this is a point which seems to strongly favor use of the ACE Tree.
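The trade-off discussed above can be quantified with a simple back-of-the-envelope calculation: when a fraction p of the records matches the predicate, a scan of a randomly permuted file must read, in expectation, 1/p records per usable sample (a geometric waiting time), which is why the scan is nearly hopeless at 0.25% selectivity but competitive at 25%:

```python
def expected_records_per_sample(selectivity):
    """Expected number of records scanned from a randomly permuted file
    per record matching the predicate (geometric waiting time, mean 1/p)."""
    return 1.0 / selectivity

low_sel  = expected_records_per_sample(0.0025)  # roughly 400 records per sample
high_sel = expected_records_per_sample(0.25)    # roughly 4 records per sample
```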
Another observation, from Figure 14, is that if all three record retrieval algorithms are allowed
to run to completion, the ACE Tree is not the first to finish. Thus, there is
generally a crossover point beyond which the sampling rate of an alternative random sampling technique
is higher than the sampling rate of the ACE Tree. However, the important point is that such a transition
always occurs very late in the query execution by which time the ACE Tree has already retrieved almost
90% of the possible random samples. We found this trend for all the different query selectivities we tested
with single-dimensional as well as multi-dimensional ACE Trees. Thus, we emphasize that the existence
of such a crossover point in no way diminishes the utility of the ACE Tree, since in practical applications
where random samples are used, the number of random samples required is very small. Since the ACE
Tree provides the desired number of random samples (and many more) much faster than the other two
methods, it still emerges as the top performer among the three methods for obtaining random samples.
Finally, Figure 15 shows the memory requirement of the ACE Tree to store records that match the
query predicate but cannot yet be used to answer the query. The fluctuations in the number of records
buffered by the query algorithm at different times during query execution are as expected. This is
because the amount of buffer space required by the query algorithm can vary as newly retrieved leaf node
sections are either buffered (thus requiring more buffer space) or can be appended with already buffered
sections (thus releasing buffer space). We also note from Figure 15 that the ACE Tree has a reasonable
memory requirement since a very small fraction of the total number of records is buffered by it.
IX. CONCLUSION AND DISCUSSION
In this paper we have presented the idea of a “sample view,” an indexed, materialized view of an
underlying database relation that facilitates efficient random sampling of records satisfying a relational
range predicate. We have described the ACE Tree, a new indexing structure that we use to index the
sample view. We have shown experimentally that with the ACE Tree
index, the sample view can be used to provide an online random sample with much greater efficiency than
the obvious alternatives. For applications like online aggregation or data mining that require a random
ordering of input records, this makes the ACE Tree a natural choice for random sampling.
This is not to say that the ACE Tree is without any drawbacks. One obvious concern is that the ACE
Tree is a primary file organization as well as an index, and hence it requires that the data be stored
within the ACE Tree structure. This means that if the data are stored within an ACE Tree, then without
replication of the data elsewhere it is not possible to cluster the data in another way at the same time.
This may be a drawback for some applications. For example, it might be desirable to organize the data
as a B+-Tree if non-sampling-based range queries are asked frequently as well, and this is precluded by
the ACE Tree. This is certainly a valid concern. However, we still feel that the ACE Tree will be one
important weapon in a data analyst’s arsenal. Applications like online aggregation (where the database is
used primarily or exclusively for sampling-based analysis) already require that the data be clustered in a
randomized fashion; in such a situation, it is not possible to apply traditional structures like a B+-Tree
anyway, and so there is no additional cost associated with the use of an ACE Tree as the primary file
organization. Even if the primary purpose of the database is a more traditional or widespread application
such as OLAP, we note that it is becoming increasingly common for analysts to subsample the database
and apply various analytic techniques (such as data mining) to the subsample; if such a sample were to be
materialized anyway, then organizing the subsample itself as an ACE Tree in order to facilitate efficient
online analysis would be a natural choice.
Another potential drawback of the ACE Tree as it has been described in this paper, is that it is not an
incrementally updateable structure. The ACE Tree is relatively efficient to construct in bulk: it requires
two external sorts of the underlying data to build from scratch. The difficulty is that as new data are added,
there is not an easy way to update the structure without rebuilding it from scratch. Thus, one potential area
for future work is to add the ability to handle incremental inserts to the sample view (assuming that the
ACE Tree is most useful in a data warehousing environment, support for deletes is far less important). However,
we note that even without the ability to incrementally update an ACE-Tree, it is still easily usable in a
dynamic environment if a standard method such as a differential file [27] is applied. Specifically, one
could maintain the differential file as a randomly permuted file or even a second ACE Tree, and when
a relational selection query is posed, one draws each successive sample from either the primary ACE
Tree or the differential file with an appropriate hypergeometric
probability (for an idea of how this could be done, see the recent paper of Brown and Haas [28] for a
discussion of how to draw a single sample from multiple data set partitions). Thus, we argue that the lack
of an algorithm to update the ACE Tree incrementally may not be a tremendous drawback.
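As a simplified illustration of the scheme above, the sketch below draws each sample from one of the two partitions with probability proportional to partition size. This is the with-replacement analogue of the hypergeometric, without-replacement scheme the text refers to, and the partition contents are hypothetical stand-ins for the two structures:

```python
import random

def draw_next(primary, differential):
    """Draw one uniform random sample from the union of two partitions by
    first picking a partition with probability proportional to its size,
    then sampling uniformly within it (with replacement)."""
    n1, n2 = len(primary), len(differential)
    if random.random() < n1 / (n1 + n2):
        return random.choice(primary)      # stands in for an ACE Tree draw
    return random.choice(differential)     # stands in for the differential file

prim = list(range(9900))            # records in the primary ACE Tree
diff = list(range(9900, 10000))     # records in the differential file
draws = [draw_next(prim, diff) for _ in range(1000)]
```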
Finally, we close the paper by asserting that the importance of having indexing methods that can handle
insertions incrementally is often overstated in the research literature. In practice, most incrementally-updateable structures such as B+-Trees cannot be updated incrementally in a data warehousing environment
due to performance considerations anyway [29]. Such structures still require on the order of one random
I/O per update, rendering it impossible to efficiently process bulk updates consisting of millions of records
without simply rebuilding the structure from scratch. Thus, we feel that the drawbacks associated with
the ACE Tree do not prevent its utility in many real-world situations.
R EFERENCES
[1] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, New York, 1995.
[2] J. M. Hellerstein, P. J. Haas, and H. J. Wang, “Online Aggregation,” in SIGMOD, 1997, pp. 171–182.
[3] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth, and P. J. Haas, “Interactive Data Analysis: The Control
Project,” IEEE Computer, vol. 32, no. 8, pp. 51–59, 1999.
[4] P. J. Haas and J. M. Hellerstein, “Ripple Joins for Online Aggregation,” in SIGMOD, 1999, pp. 287 – 298.
[5] P. S. Bradley, U. M. Fayyad, and C. Reina, “Scaling Clustering Algorithms to Large Databases,” in KDD, 1998, pp. 9 – 15.
[6] D. W. Cheung, J. Han, V. Ng, and C. Y. Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental
Updating Technique,” in ICDE, 1996, pp. 106–114.
[7] J. X. Yu, Z. Chong, H. Lu, and A. Zhou, “False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional
Data Streams,” in VLDB, 2004, pp. 204 – 215.
[8] F. Provost, D. Jensen, and T. Oates, “Efficient Progressive Sampling,” in KDD, 1999, pp. 23–32.
[9] C. Meek, B. Thiesson, and D. Heckerman, “The Learning-Curve Sampling Method Applied to Model-Based Clustering,” J. Mach.
Learn. Res., vol. 2, pp. 397–418, 2002.
[10] M. Saar-Tsechansky and F. Provost, “Active Sampling for Class Probability Estimation and Ranking,” Mach. Learn., vol. 54, no. 2,
pp. 153–178, 2004.
[11] P. Domingos and G. Hulten, “A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering,” in
ICML, 2001, pp. 106–113.
[12] T. Scheffer and S. Wrobel, “Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling.” J. Mach.
Learn. Res., vol. 3, pp. 833–862, 2002.
[13] L. G. Valiant, “A Theory of the Learnable,” in STOC, 1984, pp. 436–445.
[14] P. J. Haas and C. Koenig, “A Bi-Level Bernoulli Scheme for Database Sampling,” in SIGMOD, 2004, pp. 275 – 286.
[15] P. J. Haas, “The Need for Speed: Speeding Up DB2 Using Sampling,” IDUG Solutions Journal, vol. 10, pp. 32–34, 2003.
[16] F. Olken and D. Rotem, “Random Sampling from B+ Trees,” in VLDB, 1989, pp. 269–277.
[17] F. Olken and D. Rotem, “Simple Random Sampling from Relational Databases,” in VLDB, 1986, pp. 160–169.
[18] F. Olken, D. Rotem, and P. Xu, “Random Sampling from Hash Files,” in SIGMOD, 1990, pp. 375 – 386.
[19] F. Olken, “Random Sampling from Databases,” Ph.D. dissertation, University of California, Berkeley, 1993.
[20] F. Olken and D. Rotem, “Sampling from Spatial Databases,” in ICDE, 1993, pp. 199 – 208.
[21] G. Antoshenkov, “Random Sampling from Pseudo-Ranked B+ Trees,” in VLDB, 1992, pp. 375–382.
[22] S. Chaudhuri, G. Das, and U. Srivastava, “Effective Use of Block-Level Sampling in Statistics Estimation,” in SIGMOD, 2004, pp.
287 – 298.
[23] A. A. Diwan, S. Rane, S. Seshadri, and S. Sudarshan, “Clustering Techniques for Minimizing External Path Length,” in VLDB, 1996,
pp. 342–353.
[24] H. Garcia-Molina, J. Widom, and J. D. Ullman, Database System Implementation. Prentice-Hall, Inc., 1999.
[25] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching.” in SIGMOD Conference, 1984, pp. 47–57.
[26] S. T. Leutenegger, J. M. Edgington, and M. A. Lopez, “STR: A Simple and Efficient Algorithm for R-Tree Packing.” in ICDE, 1997,
pp. 497–506.
[27] D. G. Severance and G. M. Lohman, “Differential Files: Their Application to the Maintenance of Large Databases.” ACM Trans.
Database Syst., vol. 1, no. 3, pp. 256–267, 1976.
[28] P. G. Brown and P. J. Haas, “Techniques for Warehousing of Sample Data.” in ICDE, 2006, p. 6.
[29] P. Muth, P. E. O’Neil, A. Pick, and G. Weikum, “Design, Implementation, and Performance of the LHAM Log-Structured History Data
Access Method.” in VLDB, 1998, pp. 452–463.