Computing Data Cubes without Redundant Aggregated
Nodes and Single Graph Paths: The Sequential MCG
Approach
Joubert de Castro Lima1, Celso Massaki Hirata1
1Instituto Tecnológico de Aeronáutica (ITA) – Divisão de Ciência da Computação.
Praça Marechal Eduardo Gomes, 50 - Vila das Acácias – 12.228-900 – São José dos
Campos – SP – Brazil
{joubert, hirata}@ita.br
Abstract. In this paper, we present a novel full cube computation and
representation approach, named MCG. A data cube can be defined as a lattice
of cuboids. In our approach, each cuboid is seen as a set of sub-graphs.
Redundant suffixed nodes in such sub-graphs are quite common, but their
elimination is a hard problem, as previous cube approaches demonstrate.
The MCG approach computes a data cube in two phases: first, it generates a base
cuboid from a base relation with no tuple rearrangement; second, it
generates all the remaining aggregated cells, in a top-down fashion, with a
single base-MCG scan. During both phases, the MCG cube size reduction
method keeps the entire lattice of cuboids free of common prefixed nodes and
common single graph paths. During the second phase, the reduction method also
eliminates common aggregated nodes, which are frequent when sparse relations
are computed. Our performance analysis demonstrates efficient runtime and very
low memory consumption when compared to the Star and MDAG full cube approaches.
1. Introduction
Since the introduction of the Data Warehouse (DW) and OLAP [Gray et al. 1997], the efficient
computation of data cubes has become one of the most relevant and pervasive problems
in the DW area. The problem has exponential complexity with respect to the number
of dimensions; therefore, materializing a cube involves both a huge number of
cells and a substantial amount of time for their generation.
Previous cube computation and representation approaches are classified into
two main categories: (i) full and (ii) partial cube computation and representation.
Approaches in the first category compute all the cells of all the cuboids of a given data
cube, while approaches in the second category compute a subset of a given set of
dimensions, or a smaller range of possible values for some of the dimensions [Han et al.
2006]. Of the two, the first, the efficient computation and representation of full cubes
with simple or complex measures, is recognized as the basis for the other, and any new
approach developed to address it may strongly influence developments related to the
partial cube.
The problem of cube computation can be formally defined as follows: given a
base relation D and n attributes, a cell a = (a1, a2, ..., an, c) in an n-dimensional data
cube is a "GROUP BY" with measure c and n distinct attribute values a1, a2, ..., an. A
cell a is called an m-dimensional cell if and only if exactly m (m ≤ n) values
among {a1, a2, ..., an} are not * (* is a wildcard for all values). It is called a
base cell if m = n. For example, in a 3-dimensional cube with dimensions A, B, and C,
the cell (a1, b1, *, c) is 2-dimensional, while (a1, b1, c1, c) is a base cell. Given a
base cuboid, the task is to compute a full cube where no iceberg condition exists.
The hybrid dimension-based Star approach, proposed in [Xin et al. 2007], is
considered one of the most promising full cube computation approaches, since it
outperforms the approaches presented in [Zhao et al. 1997], [Beyer et al. 1999],
and [Han et al. 2001] in dense, skewed, and sparse scenarios. The Star approach uses a
prefixed data structure, named star-tree, to represent individual cuboids.
Unfortunately, the Star approach suffers from additional star-tree traversals introduced
by unnecessary common suffixed nodes. The presence of such nodes in a star-tree also
consumes extra memory, so they can pose a serious problem to the Star
approach and may even render it useless when computing very sparse relations.
In [Lima et al. 2007], the authors propose two ideas to reduce both the number
of costly data structure traversals and the cube size of the star-tree. First, they propose
the elimination of wildcards, such as star-nodes, using the dimensional ID concatenated
with the original attribute value. Second, they propose the reduction of the cube size
using internal and external node abstractions in the lattice. The results demonstrate that
the MDAG approach computes a data cube faster than the Star approach, consuming less
memory to represent the same data cube.
However, the Star and MDAG approaches still have problems, caused by:
(i) Both approaches start their full cube computation by scanning a base relation
to generate a base cuboid. During the base cuboid generation, both approaches allow
the presence of redundant single paths in the lattice. A single path is a branch of
a data structure where no forks exist. In MDAG, the authors propose the elimination of
internal node redundancy, but in sparse scenarios this method is not effective in
reducing the cube size.
(ii) The Star and MDAG approaches implement only some single-path
optimizations during the generation of the aggregated cells. These optimizations avoid
the creation of ancestor cells and compress single paths, but they do not avoid the
presence of common aggregated nodes derived from non-single paths. Suppose that a
node has more than one descendant node. The methods implemented in the Star and MDAG
approaches generate new aggregated nodes for each descendant of the descendant nodes
(second level of descendants). These methods can produce redundant aggregated nodes
when the second-level descendant nodes are partially or totally different, a situation
that is frequent when sparse relations are computed.
In this paper, we present a novel approach, named Multidimensional Cyclic
Graph (MCG), to efficiently compute a full cube using a representation without
redundant aggregated nodes and single graph paths. During a base relation scan, we
insert the tuples by traversing the MCG representation from the root to the leaf nodes
and, if necessary, we also traverse the data structure from a leaf to the root. We call
this the double insertion method. During the MCG base cuboid construction, the double
insertion method keeps the entire lattice free of both redundant prefixed nodes and
suffixed nodes that form single graph paths.
The MCG aggregated cell generation phase adopts the single-path
optimizations proposed in [Xin et al. 2007] and [Lima et al. 2007], and also
implements an on-demand aggregated node creation method that avoids the
presence of common aggregated nodes in a data cube.
We use Figure 1 to illustrate the benefits of the proposed method. Figure 1
presents the best-case scenario, i.e., the scenario where the cube size reduction is
maximized. Initially, each intermediate node (a1, a2, b1, b2, b3, and b4) has one or more
descendant nodes and must produce one or more aggregated descendant nodes from
its non-aggregated descendant(s). Node a1 in Figures 1-b and 1-c has b1 and b2 as its
descendant nodes and c1, c2, c3, and c4 as its aggregated descendant nodes, generated, in
a top-down fashion, from the descendants of b1 and b2. Note that node a1 has a sibling node
(a2), so no single-path optimizations, as proposed in [Xin et al. 2007] and [Lima et al.
2007], can be applied. In Figure 1-b, the phase that generates the aggregated cells
produces new aggregated nodes, while in Figure 1-c no new nodes are created. Instead,
the on-demand aggregated node creation method directly references the descendants of
nodes b1 and b2 from a1. The same occurs for node a2 and the root, producing a full cube
with only 15 nodes when the MCG approach is used, against 43 nodes for the Star or
MDAG approaches.
Figure 1. The on-demand aggregated node creation method impact
These properties reduce the MCG cube computation effort and the MCG cube
size without any loss of generality. To verify the impact of the MCG reductions on
runtime and memory consumption, we test the MCG algorithm against the Star-Cubing
algorithm, proposed in [Xin et al. 2007], and the MDAG-Cubing algorithm, proposed in
[Lima et al. 2007]. The results show that the reduction methods proposed in this paper
improve the runtime performance, and the memory consumption is always kept very low
when compared to the Star and MDAG approaches.
The remainder of the paper is organized as follows. In Section 2, the MCG
properties are explained and the algorithm is described. The performance results are
presented in Section 3. Potential extensions and limitations of the MCG approach are
discussed in Section 4, and we conclude our study in Section 5.
2. MCG Approach
The MCG is a full cube computation and representation approach that computes a full
cube in two phases. First, it scans the original base relation, without tuple
rearrangement, to compute the base cuboid cells. If an iceberg cube is to be computed,
we must also compute some shared dimensions, similar to [Xin et al. 2007], in order to
implement earlier bottom-up pruning. During the base cuboid computation, the MCG
approach guarantees the absence of both prefix redundancy and single graph path
redundancy, using the root/leaf ─ leaf/root double insertion method. Second, it
generates all the remaining aggregated cells, in a top-down fashion, with a single
base-MCG scan. All the steps in the second phase preserve unique single graph
paths in the lattice and eliminate redundant aggregated nodes, using the on-demand
aggregated node creation method.
The MCG dimensions are ordered by cardinality, in descending order, since this
order permits both earlier cell pruning [Xin et al. 2007] and efficient elimination of
single-path redundancies. With this dimension order, a full cube yields the sparsest
representation and, consequently, one with many single paths and infrequent nodes.
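To make the ordering step concrete, the sketch below sorts dimension indices by cardinality in descending order. It is a minimal sketch in modern Java; the class and method names are ours, and the cardinalities are assumed to be known beforehand (e.g., collected during the base relation scan).

    import java.util.Arrays;

    // A sketch of the dimension-ordering step; names and values are illustrative.
    class DimensionOrder {
        // Returns dimension indices sorted by cardinality, highest first.
        static Integer[] order(int[] cardinality) {
            Integer[] dims = new Integer[cardinality.length];
            for (int i = 0; i < dims.length; i++) dims[i] = i;
            Arrays.sort(dims, (x, y) -> cardinality[y] - cardinality[x]);
            return dims;
        }
    }
    // Example: order(new int[]{10, 100, 30}) yields [1, 2, 0].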
In the remaining of this section, we discuss the MCG representation, and we end
the section with the detailed MCG two-phase algorithm.
2.1. MCG Cube Representation
The MCG approach represents a data cube using a Cyclic Graph (CG) with
multidimensional properties. We use a CG to represent individual cuboids. Each level in
the CG represents a dimension, and each node represents an attribute value. Tuples in the
cuboid are inserted one by one into the MCG. A path from the root to a node represents
a cube cell. Each node has four fields: pointer(s) to the possible descendant node key(s),
pointer(s) to the possible ancestor node key(s), a set of measure values, and an associated ID.
The MCG approach uses the attribute value as the node key. The associated ID
indicates whether a node is used in the lattice more than once. A sibling node can be
obtained indirectly, using an ancestor node plus its descendants. The set of measure
values permits the simultaneous computation of measures, similar to [Lima et al. 2007].
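As an illustration, a node with these four fields could be declared as below. This is a minimal sketch in modern Java with names of our own choosing, not the authors' implementation.

    import java.util.HashMap;
    import java.util.Map;

    // A minimal sketch of an MCG node: descendant and ancestor pointers keyed
    // by attribute value, a set of measure values, and the associated ID.
    class McgNode {
        final int key;                                              // attribute value (node key)
        final Map<Integer, McgNode> descendants = new HashMap<>();  // pointers to descendants
        final Map<Integer, McgNode> ancestors = new HashMap<>();    // pointers to ancestors
        final double[] measures;                                    // simultaneous measure values
        int id;                                                     // associated ID: marks reuse

        McgNode(int key, int measureCount) {
            this.key = key;
            this.measures = new double[measureCount];
        }

        // A node is reused when it is reachable from more than one ancestor.
        boolean isReused() { return ancestors.size() > 1; }
    }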
2.2. MCG Base Cuboid Algorithm
Algorithm 1 MCG Base Cuboid
Input: A base relation R
Output: A base cuboid
for each tuple in R do
    call MCG_Base_Cuboid(tuple);
procedure MCG_Base_Cuboid(tuple){
var: auxNode, LFN, LSUN;
1:  traverse from root to leaf, finding LFN, LSUN;
2:  if (LFN is a non leaf node) {// insertion
3:     traverse an auxiliary path (AP) from leaf to LFN level in the lattice, using tuple attribute values;
4:     append, if necessary, new nodes to the AP until LFN level;
5:     if (LFN is reused in the lattice){
6:        append new nodes to the AP until LSUN level;
7:        set the first node of the AP as a LSUN descendant;
8:     } else set the first node of the AP as a LFN descendant;
9:  }else{// update
10:    auxNode ← LFN;
12:    increment auxNode measure with tuple measure values;
13:    traverse an AP from leaf to LSUN level in the lattice, using tuple attribute values and auxNode measure values;
14:    append, if necessary, new nodes to the AP until LSUN level;
15:    set the first node of AP as a LSUN descendant;}}
Figure 2. MCG Base Cuboid Algorithm Execution
For each input tuple, the algorithm traverses the base cuboid to collect the last found
node (LFN) and the last single-used node (LSUN) in line 1. Intuitively, for a given tuple
to be inserted in a cuboid, the LFN is the deepest node on the path from the root of the
cuboid that shares attributes with the tuple. The LSUN is the LFN when the LFN has
only one ancestor. The AP is the sub-path from a leaf to the root of the cuboid that
shares attributes with the tuple. Formally, LFN and LSUN can be defined as follows: given
a graph G = (N, A), a subset of nodes P = {n1, n2, …, np} ⊆ N forming a graph path <n1 → n2
→ … → np>, an input tuple T = (t1, t2, …, tD, m), and a simplified node notation n = (attr, m,
Desc, Anc), where attr is an attribute value, m is a measure value, D is the number of
dimensions, |T| = D+1, |P| ≤ D, Desc is the set of descendant nodes of n, and Anc is the set
of ancestor nodes of n, the LFN is np iff (∀ ni ∈ P, 1 ≤ i ≤ p, attr(ni) = ti). The LSUN
satisfies all LFN conditions plus the condition |Anc(np)| = 1.
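Under the assumptions of the McgNode sketch of Section 2.1, line 1 of Algorithm 1 can be pictured as the walk below. The method name is our own, and the tuple is given as its sequence of attribute keys in dimension order.

    // A sketch of line 1 of Algorithm 1: descend from the root along the
    // tuple's attribute values; the LFN is the deepest matching node, and the
    // LSUN is the deepest matching node whose path so far is single-used.
    static McgNode[] findLfnAndLsun(McgNode root, int[] tupleKeys) {
        McgNode lfn = root, lsun = root;
        boolean pathSingleUsed = true;        // false once a reused node is crossed
        for (int key : tupleKeys) {
            McgNode next = lfn.descendants.get(key);
            if (next == null) break;          // no deeper node shares the tuple prefix
            lfn = next;
            if (pathSingleUsed && lfn.ancestors.size() <= 1) lsun = lfn;
            else pathSingleUsed = false;
        }
        return new McgNode[] { lfn, lsun };   // {LFN, LSUN}
    }

For the tuple a1b1c2-1 in the example discussed below, this walk would stop at the reused node b1 and return LFN = b1 and LSUN = a1.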
We use Figure 2 to illustrate what happens during a base cuboid computation.
We simulate the computation of the base relation R, also presented in Figure 2. The first
two tuples (a1b1c1-1 and a3b3c2-1) are inserted directly into the lattice, basically using
lines 3, 4, and 8, since they do not share any attribute value (Figures 2-a and 2-b,
respectively). For both insertions, we have LFN=LSUN=root.
For the tuple a2b3c2-1, we have LFN=LSUN=root, so line 3 is executed. The
algorithm traverses from the leaf (c2) to the LFN level (root level), identifying b3c2 as
the AP. Traversing from leaf to root requires arcs from descendant to ancestor nodes,
so the arcs are bidirectional. The algorithm appends an extra node (a2) to the AP
(line 4) and then sets the first node of the AP (node a2) as an LFN descendant, since
the LFN is a single-used node (line 8). The third tuple insertion is illustrated in Figure 2-c.
The insertions of the next three tuples (a3b1c1-1, a2b1c1-1, and a2b2c2-1) are similar to the
previous one. They produce the base cuboids illustrated in Figures 2-d, 2-e,
and 2-f, respectively.
The next tuple, a1b1c2-1, produces LFN=b1, but b1 is a reused node, so lines 6 and
7 are executed. First, c2 is identified as the AP (line 3) and no new nodes are added (line
4), since the AP c2 already reaches the LFN level. Second, line 6 is executed, so new nodes
are appended until the LSUN is reached. In this insertion, LSUN=a1, so a new b1 node is
created and the old b1 descendants are copied to the new one. We build the new path
b1c1 so that it does not form a cycle c1b1 as the old path does. This cycle elimination
guarantees that c1 has just one b1 ancestor node. MCG extends this condition to the
entire lattice by preventing nodes with multiple descendants from being ancestor nodes.
Finally, when line 7 is executed, the new node b1 replaces the old node b1 as a descendant
of a1. This insertion produces the base cuboid presented in Figure 2-g.
The remaining tuple insertions are presented in Figures 2-h, 2-i, 2-j, and 2-k. These
insertions reproduce one of the previously described scenarios, so due to the limited space,
we omit the details. If an update occurs, the only difference from the previous
explanations is that we first need to update an auxNode (line 12) and then
traverse from the leaf, using the auxNode, to the LSUN level. The remaining execution is
similar to an insertion.
In this section, we showed the MCG algorithm that generates a base cuboid without
single graph path redundancies. The double insertion method is efficient in runtime
and memory consumption, as our experiments in Section 3.1 demonstrate, so it can be
considered a promising solution for this costly task.
2.3. MCG Aggregation Algorithm
Algorithm 2 MCG Aggregation
Input: A base cuboid without common single graph paths
Output: A complete full cube
call MCG_Aggregation(MCG root);
procedure MCG_Aggregation(currentNode){
1:  for each descendant of currentNode do{
2:     MCG_Aggregation(currentDescendant);
3:     update currentNode measure values with currentDescendant measure values;
4:     if (currentNode has only one descendant)
5:        reference currentDescendant descendants to currentNode descendants;
6:     else copy or reference currentDescendant descendants to currentNode descendants; }//end for}
Figure 3. MCG Aggregation Algorithm Execution
We use Figure 3 to illustrate the MCG aggregation algorithm execution, with the
base cuboid presented in Figure 2-k as its input (see Figure 3-a).
For each descendant of the current node (line 1), the algorithm starts another
depth-first traversal (line 2). When a descendant has no descendants of its own, a
backtrack starts (lines 3-6).
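A compact sketch of this recursion is shown below, under the assumptions of the McgNode class of Section 2.1. It illustrates the control flow of Algorithm 2 rather than the authors' exact implementation: the bookkeeping that skips already-visited shared nodes is omitted, and the reference/update/copy decision of lines 5-6 is reduced to the reuse test on the existing descendant.

    import java.util.ArrayList;

    // A sketch of Algorithm 2: depth-first descent (lines 1-2), measure
    // update on backtrack (line 3), then reference, update, or copy of the
    // descendant's descendants (lines 4-6).
    class McgAggregator {
        static void mcgAggregation(McgNode current) {
            // iterate over a snapshot: aggregated nodes added below are not re-visited
            for (McgNode desc : new ArrayList<>(current.descendants.values())) {
                mcgAggregation(desc);                             // line 2
                addMeasures(current.measures, desc.measures);     // line 3
                for (McgNode d : desc.descendants.values()) {     // lines 4-6
                    McgNode existing = current.descendants.get(d.key);
                    if (existing == null) {                       // on-demand: reference only
                        current.descendants.put(d.key, d);
                        d.ancestors.put(current.key, current);
                    } else if (!existing.isReused()) {            // single-used: update in place
                        addMeasures(existing.measures, d.measures);
                    } else {                                      // reused elsewhere: copy
                        McgNode copy = new McgNode(d.key, d.measures.length);
                        addMeasures(copy.measures, existing.measures);
                        addMeasures(copy.measures, d.measures);
                        copy.descendants.putAll(d.descendants);
                        current.descendants.put(copy.key, copy);
                    }
                }
            }
        }

        static void addMeasures(double[] into, double[] from) {
            for (int i = 0; i < into.length; i++) into[i] += from[i];
        }
    }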
In our example, path a1b1c1 is scanned (Figure 3-a), but c1 has no descendants,
so a backtrack occurs and the second b1 descendant (c2) is computed; c2, like c1, has no
descendants. At this point, all b1 descendants have been visited, so a backtrack occurs
and b1 is computed. The b1 measure values are calculated (line 3), but the b1 descendants
(c1 and c2) have no descendants, so line 6 is not executed. The temporary cube
representation is presented in Figure 3-b.
A backtrack occurs and path a1b3c2 is scanned, but node c2 has already been
visited, so a backtrack occurs and node b3 is computed. The b3 measure values are
calculated (line 3), but the b3 descendant (c2) has no descendants, so line 6 is not executed.
The temporary cube representation is presented in Figure 3-c.
After computing b3, a backtrack occurs and a1 is computed, since all its
descendants (b1 and b3) have been visited. First, line 3 is executed, so the a1 measure
values are calculated. Second, the b1 descendants are copied to a1. Note that a1 has no
c1 and c2 descendants yet, so line 6 directly references c1 and c2 as descendants of a1.
For the last a1 descendant (b3), the algorithm must copy the b3 descendant c2 to a1, but
a1 already references the reused node c2, so instead of updating c2, a new c2 node must
be created. The new changes are presented in Figure 3-d.
Node a1 has been computed, but it has siblings (a2 and a3), so path a2b1c1 is
scanned. Node c1 has already been visited, so a backtrack occurs and the second b1
descendant (c2) is computed, but c2 has also been visited before. At this point, all b1
descendants have been visited, so a backtrack occurs and b1 is computed. The b1
measure values are calculated (line 3), but the b1 descendants (c1 and c2) have no
descendants, so line 6 is not executed. The new temporary cube representation is
presented in Figure 3-e.
A backtrack occurs and path a2b2c1 is scanned, but c1 has already been visited,
so a backtrack occurs and the second b2 descendant (c2) is computed; c2 has also
been visited before. At this point, all b2 descendants have been visited, so a backtrack
occurs and b2 is computed. The b2 measure values are calculated (line 3), but the b2
descendants (c1 and c2) have no descendants, so line 6 is not executed. The new
temporary cube representation is presented in Figure 3-f.
A backtrack occurs and node a2 is computed, since its last descendant (b3) was
previously computed when path a1b3c2 was scanned. First, all a2 descendants must
have their descendants copied (line 6). The descendants of b1 and b2 are copied to node a2,
but there is a third descendant, b3, so the b3 descendant (c2) must also be copied. Node a2
already has a descendant node c2, but it is a single-used node, so it can be updated. The new
temporary cube representation is presented in Figure 3-g.
A backtrack occurs to compute the root, but node a3 must be computed first.
Path a3b1c1 is scanned, but c1 has been visited before, so a backtrack occurs and the
second b1 descendant (c2) is computed; c2 has also been visited before. At this point,
all b1 descendants have been visited, so a backtrack occurs and b1 is computed. The b1
measure values are calculated (line 3), but the b1 descendants (c1 and c2) have no
descendants, so line 6 is not executed. The new changes are presented in Figure 3-h.
A backtrack occurs and node a3 is computed, since its last descendant, b3, was
previously computed when path a1b3c2 was scanned. First, all a3 descendants must
have their descendants copied (line 6). The descendants of b1 and b3 are copied to node
a3, producing the temporary cube representation presented in Figure 3-i.
Finally, the algorithm starts the computation of the root node. The root measure
values are calculated (line 3) and the descendant nodes of a1, a2, and a3 are copied (line 6)
to finalize the MCG cube representation, so the algorithm creates the remaining
aggregated nodes with no redundancy, as presented in Figure 3-j.
In this section, we showed an efficient algorithm to generate all aggregated cells
from a base cuboid. The algorithm implements optimizations to avoid the
creation of aggregated cells, to minimize the number of MCG traversals, and to reduce
the number of temporary nodes. In the next section, we demonstrate how efficient our
approach is in dense, sparse, and skewed scenarios.
3. Performance Analysis
A comprehensive performance study was conducted to check the efficiency and
scalability of the proposed algorithm. We test the MCG algorithm against the best
implementations we could achieve of the Star-Cubing algorithm, proposed in [Xin et al.
2007], and the MDAG-Cubing algorithm, proposed in [Lima et al. 2007]. All the
algorithms are coded in Java Tiger 32 bits (JRE 5.0 update 15). We run the algorithms
on a Pentium IV 3.73 GHz Extreme Edition system with 4 GB of RAM. The system runs
Linux Fedora Core 5 64 bits (kernel 2.6.15). All recorded times include both
computation and I/O time. As in other performance and memory studies on cube
computation and representation, all relations fit in main memory.
For the remainder of this section, D denotes the number of dimensions, C the
cardinality of each dimension, T the number of tuples in a base relation, and S the skew
(zipf) of the data. When S is equal to 0, the data is uniform; as S increases, the data
becomes more skewed. S is applied to all dimensions.
3.1. Base Cuboid Results
The first set of experiments compares the MCG base cuboid runtime against the Star
and MDAG base cuboid runtimes. We separate the I/O time (IO) from the base cuboid
runtime (base) to enable effective comparisons. The runtimes are compared with respect
to cardinality (Figure 4), number of tuples (Figure 6), dimensions (Figure 8), and skew
(Figure 10).
The small I/O difference is caused by the dimensional ID, used in MDAG and
MCG to eliminate wildcards: each tuple needs to be converted into an MDAG or MCG
tuple. The base runtimes of the Star and MCG approaches are similar and better than the
MDAG approach, which demonstrates the efficiency of the double insertion method when
compared to the MDAG internal node method.
The memory consumption experiments consider only the memory consumed by
the base cuboid. All base cuboids include prefix redundancy elimination. The
second set of experiments compares the memory consumption with respect to
cardinality (Figure 5), number of tuples (Figure 7), dimensions (Figure 9), and skew (Figure 11).
We use the same base relations in both experiments to enable runtime versus memory
consumption comparisons.
The memory consumption experiments demonstrate an enormous difference
among the Star, MDAG, and MCG base cuboids. On average, the MCG base cuboid
representation consumes 10-20% of the Star base cuboid representation and 50-60% of the
MDAG base cuboid representation.
In sparse scenarios with low skew, high cardinality, and a high number of
dimensions, the base cuboid can have many single graph paths, which favors the
MCG double insertion method and makes the MDAG internal node method a
shallow approach to reducing the base cuboid size. In dense scenarios, the number of
single graph paths can be small, making the MDAG internal node method a promising
solution to compute a base cuboid; however, we must always consider the runtime cost of
such a solution, since the reduction of a dense base cuboid can have a small impact on
memory consumption.
[Figures 4-24 (charts omitted in this text version): runtime (sec) and memory (MB/KB) comparisons of the Star, MDAG, and MCG approaches, broken into IO, base, and sim components and including the base relation size as a memory reference, plotted against cardinality, number of tuples, dimensions, skew, and number of JVM analyses. The captions are:]
Figure 4. Base Cuboid: D=5, T=1M, S=0
Figure 5. Mem Base Cuboid: D=5, T=1M, S=0
Figure 6. Base Cuboid: D=5, C=30, S=0
Figure 7. Mem Base Cuboid: D=5, C=30, S=0
Figure 8. Base Cuboid: T=1M, C=10, S=0
Figure 9. Mem Base Cuboid: T=1M, C=10, S=0
Figure 10. Base Cuboid: D=6, T=1M, C=100
Figure 11. Mem Base Cuboid: D=6, T=1M, C=100
Figure 12. Full Cube: D=5, T=1M, S=0
Figure 13. Mem Full Cube: D=5, T=1M, S=0
Figure 14. Full Cube: D=5, C=30, S=0
Figure 15. Mem Full Cube: D=5, C=30, S=0
Figure 16. Full Cube: T=1M, C=10, S=0
Figure 17. Mem Full Cube: T=1M, C=10, S=0
Figure 18. Full Cube: D=6, T=1M, C=100
Figure 19. Mem Full Cube: D=6, T=1M, C=100
Figure 20. Real Dataset Runtime
Figure 21. Real Dataset Memory
Figure 22. Experiment: D=5, T=1M, S=0, C=100
Figure 23. Experiment: D=8, T=1M, S=0, C=10
Figure 24. Experiment: D=6, T=1M, S=0.5, C=100
3.2. Full Cube Results
The third set of experiments compares the MCG full cube computation against the Star
and MDAG full cube computations. We separate the I/O (IO), base cuboid (base), and
simultaneous aggregation (sim) times to enable effective comparisons. The runtimes are
compared with respect to cardinality (Figure 12), number of tuples (Figure 14), dimensions
(Figure 16), and skew (Figure 18).
In low-skew and high-dimension scenarios, the MCG approach outperforms the
MDAG approach. This result is explained by the impact of the on-demand aggregated
node creation method on the lattice, which eliminates redundant aggregated nodes. In such
scenarios, nodes whose descendants are partially or totally different are frequent, but the
MDAG approach does not consider this particularity.
The MDAG approach is worse than Star or MCG when computing a base relation
with a high number of tuples (Figure 14). This behavior is explained by the runtime cost
of the internal node method: each external node has a set of shared internal nodes that
must be unique in the entire lattice, and maintaining this uniqueness is costly. In the
remaining scenarios, the MDAG and MCG runtimes are better than Star's.
The memory consumption experiments use the same relations as the runtime
experiments, and the results again demonstrate an enormous difference among the Star,
MDAG, and MCG full cube representations. In this set of experiments, we consider only
the memory consumed by the full cube representation.
We compare the memory consumption with respect to cardinality (Figure
13), number of tuples (Figure 15), dimensions (Figure 17), and skew (Figure 19). In all
scenarios, MCG consumes 10-15% of the Star full cube representation and 50-60% of the
MDAG full cube representation.
Another important observation is the memory consumption difference between
MCG and the base relation: the MCG approach produces a full cube that consumes less
memory than the base relation. This is an enormous improvement that opens new
possibilities to represent a data cube minimally. Finally, all the algorithms suffer from
dimension growth (Figure 17), but the low memory consumption of MCG makes it an
interesting solution to compute medium/high dimensional data cubes.
3.3. Real World Dataset Results
This dataset is derived from the HYDRO1k Elevation Derivative Database, a
geographic database developed to provide hydrologic information on a continental
scale. In our experiments, we use the dataset for South America. In the hydrologic
information base relation, there are five dimensions: slope, compound topographic
index, flow directions, latitude, and longitude. The cardinalities of the dimensions are
300, 280, 160, 78, and 53, respectively. The base relation includes one measure: Flow
Accumulations (FA). The total number of rows in the base relation is 7,845,529. For a
full description of the original dataset, see http://edcdaac.usgs.gov/gtopo30/hydro/.
In Figures 20 and 21, we see behavior similar to the synthetic datasets, i.e., the
MCG algorithm runtime is better than Star's and MDAG's, and the MCG cube
representation consumes about 10% of the memory of a full Star cube
representation. The memory consumption of MDAG and MCG is similar due to the
small number of redundant single paths and aggregated nodes in the lattice.
3.4. Work Memory during an Experiment
In this section, we test the memory consumption of the Star, MDAG, and MCG algorithms
during an experiment. We analyze the results to verify the usage of temporary work
memory by each algorithm. The results give us an idea of the supported workload, so
we can specify a base relation that can be computed using only main memory.
We use the log files generated by the Java garbage collector to verify the
memory consumption in the experiments. We test the algorithms with respect to
cardinality (Figure 22), dimensions (Figure 23), and skew (Figure 24). The results show
that the Star algorithm uses two to three times more memory (temporary or not) to
compute a full cube than the MCG algorithm. When compared to MDAG, the
MCG approach consumes, on average, half the memory to compute the same data cube.
Based on all the results of Section 3, we can conclude that the MCG approach
computes a full cube faster than the Star or MDAG approaches, consuming less memory to
represent the same cube and using less memory during the computation, regardless of
whether the base relation is skewed or has a high number of tuples, high cardinality, or a
high number of dimensions.
4. Discussion
In this section, we discuss a few issues related to MCG and point out some research
directions.
4.1. MCG Update
An update is caused by: (i) a new base cell in the lattice, (ii) a base cell measure value
update, (iii) the addition of a dimension, (iv) a dimension suppression, (v) the addition of a
measure value, or (vi) a measure value suppression.
The first two update types may demand the creation of new nodes if there is no
occurrence of such nodes in the lattice or if a node is an associated node. We believe
that these execution alternatives are not CPU bound, since they affect only specific
regions of the lattice. Note that after an update or insertion we must maintain a
representation with no redundant single graph paths and no redundant aggregated nodes.
The third and fourth update types are simple to achieve in the MCG approach. If we
need to suppress a dimension, we just remove the nodes related to that dimension and
link their descendants to their ancestors, as sketched below. This is a simple but costly
operation, yet it seems more efficient than recomputing the entire cube.
A dimension addition demands a new base relation scan and the computation of a
set of new sub-graphs, each rooted at a node formed by one attribute value of the new
dimension. The existing sub-graphs are considered aggregations of the new set of
sub-graphs; this corresponds to a base relation first scanned with D dimensions and then
with D+1 dimensions, preserving the number of tuples. The new set of sub-graphs is
computed using all the MCG heuristics, so we retain the benefits of the MCG optimizations.
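Under the same McgNode assumptions as in Section 2.1, suppressing one node of a removed dimension can be pictured as below. This is a minimal sketch of ours: measure merging on key clashes and the re-elimination of redundancies that the text requires are deliberately omitted.

    // A minimal sketch of suppressing one node of a removed dimension:
    // unlink the node and re-attach its descendants to its ancestors.
    static void suppress(McgNode node) {
        for (McgNode anc : node.ancestors.values()) {
            anc.descendants.remove(node.key);
            for (McgNode d : node.descendants.values()) {
                anc.descendants.putIfAbsent(d.key, d);   // link descendant to ancestor
                d.ancestors.put(anc.key, anc);
            }
        }
        for (McgNode d : node.descendants.values())
            d.ancestors.remove(node.key);                // drop back-links to the removed node
    }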
Finally, we have the updates related to measure values. These types of updates are
easy to incorporate, but they are very costly, since they demand a rearrangement of
the lattice to consider the new measure and new redundant single graph paths, which can
be different from the existing ones.
4.2. Computing Complex Measures
In this paper, each MCG node has a set of aggregate values, each of them associated with
one or more column(s) of the base relation and with a statistical function. Computing a full
MCG cube with complex measures, such as AVG, can be easily included in the MCG
approach. If an iceberg cube is to be computed, the technique proposed in [Han et al.
2001] can be adopted.
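For example, AVG can be kept algebraic by storing the pair (SUM, COUNT) in a node's measure slots and deriving the average on demand; the slot layout below is an assumption of ours, not the paper's.

    // AVG over the McgNode sketch: measures[0] holds SUM, measures[1] holds COUNT.
    // Partial aggregates merge by simple addition (see addMeasures in Section 2.3),
    // and the average is derived only when the cell is read.
    static double avg(McgNode n) {
        return n.measures[1] == 0 ? 0.0 : n.measures[0] / n.measures[1];
    }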
4.3. Computing Iceberg Cubes
The Star approach can compute full or iceberg data cubes. Due to the similarity between
the Star and MCG cube computations, we believe that all Star iceberg strategies can be
easily incorporated into the MCG approach.
4.4. Handling Large Databases and the Curse of Dimensionality
In this section, we separate the problem of low/medium dimension relations, which
generate a huge number of cells that cannot fit in main memory, from that of high
dimension relations, which cannot be efficiently computed by the current MCG approach.
For the first problem, we can achieve an initial solution if we use a dimension to
segment the MCG cube representation. Instead of one CG, we have a set of CGs labeled
by the attribute values of the eliminated dimension. This solution is implemented in a
parallel version of the MCG approach, but due to the limited space, we omit the details.
One may also consider the case where even the specific substructure does not fit in
memory. For this situation, the projection-based preprocessing proposed in [Han et al.
2001] can be an interesting solution.
Many approaches address the curse of dimensionality, but none of them solves
the fundamental problem of the number of cells and the runtime. In [Li et al. 2004], the
authors propose one of the most efficient approaches to compute high dimension datasets
with a medium number of tuples (around 10^6-10^8 tuples). The Frag-Cubing approach
adopts a partial materialization of the data cube. We believe that the adoption of cube
shell fragments, proposed in [Li et al. 2004], in conjunction with our ideas can produce an
efficient MCG approach for high dimensional data cube computation, but the row-based
idea needs to be reformulated to guarantee the computation of data cubes with a high
number of tuples.
4.5. Temporary Nodes
The temporary nodes generated by the simultaneous aggregation algorithm can be
considered a hard problem. We proposed an on-demand aggregated node creation
method based on the presence of different descendant nodes among sibling nodes. This
solution mitigates the creation of new temporary nodes, but in some special scenarios it
still generates a huge number of nodes.
We need to develop alternatives to minimize the number of temporary
nodes. The double insertion method, proposed to compute the base cuboid, can be an
interesting solution for computing the aggregated cells as well, but it must be improved
to avoid extra CG traversals.
5. Conclusions
For efficient cube computation under various data distributions, we propose a new approach
named MCG. The approach addresses the cube size problem through the elimination of
redundant single paths and aggregated nodes. We propose a new double insertion method
that produces a base cuboid with no prefix redundancies and no single graph path
redundancies. The method proves to be efficient in dense, sparse, and skewed scenarios.
Besides, we propose an on-demand sub-graph insertion method to reduce the number of
temporary nodes.
Our performance studies demonstrate that MCG is a promising algorithm. In
general, MCG is faster than Star and MDAG in sparse scenarios and similar to Star
when the number of tuples is high. The memory consumption decreases drastically
(80-90% less memory) when compared to a Star full cube representation, and is better
(40-60% less memory) when compared to an MDAG full cube representation. During a
full cube computation, the MCG memory consumption proves to be more efficient than the
Star and MDAG approaches, since it consumes two to three times less memory. This last
result enables MCG to compute bigger/sparser relations while consuming the same memory.
There are some interesting research directions to further extend MCG. Efficient
methods both to update the MCG data structure and to enable MCG to compute high
dimension relations are required. Parallel cube computation and discovery-driven
methods can be implemented using the MCG cube representation.
References
Beyer, K. and Ramakrishnan, R. Bottom-up computation of sparse and Iceberg CUBEs.
SIGMOD, 28(2):359–371, 1999.
Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M.,
Pellow, F. and Pirahesh, H. Data cube: A relational aggregation operator
generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov.,
1(1):29–53, 1997.
Han, J., Pei, J., Dong, G. and Wang, K. Efficient computation of iceberg cubes with
complex measures. SIGMOD Record, 30(2):1–12, 2001. ACM Press.
Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
Li, X., Han, J. and Gonzalez, H. High-dimensional OLAP: A minimal cubing approach.
In M. A. Nascimento, VLDB’04, pages 528–539. Morgan Kaufmann, 2004.
Lima, J.C. and Hirata, C.M. MDAG-Cubing: A Reduced Star-Cubing Approach. SBBD,
pages 362–376, October 2007.
Xin, D., Han, J., Li, X., Shao, Z. and Wah, B.W. Computing Iceberg Cubes by Top-Down
and Bottom-Up Integration: The StarCubing Approach. IEEE Transactions on
Knowledge and Data Engineering, 19(1):111–126, 2007.
Zhao, Y., Deshpande, P., and Naughton, J. F. An array-based algorithm for
simultaneous multidimensional aggregates. In ACM SIGMOD, pages 159–170,
Tucson, Arizona, 13–15 June 1997.