Data Mining:
Concepts and Techniques
These slides have been adapted from Han, J., Kamber, M., & Pei, J., Data Mining:
Concepts and Techniques.
7/11/2017
Data Mining: Concepts and Techniques
1
Chapter 4: Data Cube Technology
Efficient Computation of Data Cubes
Exploration and Discovery in Multidimensional
Databases
Summary
Efficient Computation of Data Cubes
General heuristics
Multi-way array aggregation
BUC
H-cubing
Star-Cubing
High-Dimensional OLAP
Cube: A Lattice of Cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time, item, location, supplier
- 2-D cuboids: (time, location), (time, item), (time, supplier), (item, location), (item, supplier), (location, supplier)
- 3-D cuboids: (time, item, location), (time, location, supplier), (time, item, supplier), (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier)
Preliminary Tricks
- Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
- Aggregates may be computed from previously computed aggregates, rather than from the base fact table:
  - Smallest-child: compute a cuboid from the smallest previously computed cuboid
  - Cache-results: cache the results of a cuboid from which other cuboids are computed, to reduce disk I/O
  - Amortize-scans: compute as many cuboids as possible at the same time to amortize disk reads
  - Share-sorts: share sorting costs across multiple cuboids when a sort-based method is used
  - Share-partitions: share the partitioning cost across multiple cuboids when hash-based algorithms are used
Multi-Way Array Aggregation
- Array-based "bottom-up" algorithm
- Uses multi-dimensional chunks
- No direct tuple comparisons
- Simultaneous aggregation on multiple dimensions
- Intermediate aggregate values are reused for computing ancestor cuboids
- Cannot do Apriori pruning: no iceberg optimization
(Figure: the cuboid lattice all; A, B, C; AB, AC, BC; ABC.)
Multi-way Array Aggregation for Cube Computation (MOLAP)
- Partition arrays into chunks (small subcubes that fit in memory)
- Compressed sparse array addressing: (chunk_id, offset)
- Compute aggregates in "multiway" fashion by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost
(Figure: a 3-D array over dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64.)
What is the best traversal order for multi-way aggregation?
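The (chunk_id, offset) addressing above can be sketched as follows. The chunk counts and widths here are assumptions chosen to match a 4 x 4 x 4 chunking of the figure; note this sketch numbers chunks from 0, while the figure numbers them 1–64.

```python
# Sketch of (chunk_id, offset) addressing for a chunked 3-D array.
# Assumes 4 chunks per dimension, each 10 cells wide (hypothetical sizes);
# the A coordinate varies fastest, as in the figure.
CHUNKS_PER_DIM = 4
CHUNK_WIDTH = 10

def cell_address(a, b, c):
    """Map global cell indices (a, b, c) to (chunk_id, offset-in-chunk)."""
    ca, cb, cc = a // CHUNK_WIDTH, b // CHUNK_WIDTH, c // CHUNK_WIDTH
    chunk_id = ca + CHUNKS_PER_DIM * (cb + CHUNKS_PER_DIM * cc)
    oa, ob, oc = a % CHUNK_WIDTH, b % CHUNK_WIDTH, c % CHUNK_WIDTH
    offset = oa + CHUNK_WIDTH * (ob + CHUNK_WIDTH * oc)
    return chunk_id, offset
```

Only the non-empty chunks need to be stored, which is what makes the addressing "compressed sparse".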
Multi-Way Array Aggregation for Cube Computation (Cont.)
- Method: the planes should be sorted and computed according to their size, in ascending order
- Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
- Limitation: the method performs well only for a small number of dimensions
- If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored
Bottom-Up Computation (BUC)
- BUC: bottom-up cube computation (note: top-down in our view!)
- Divides dimensions into partitions and facilitates iceberg pruning
  - If a partition does not satisfy min_sup, its descendants can be pruned
  - If min_sup = 1, BUC computes the full cube!
- No simultaneous aggregation
(Figure: the cuboid lattice with BUC's processing order: 1 all; 2 A; 3 AB; 4 ABC; 5 ABCD; 6 ABD; 7 AC; 8 ACD; 9 AD; 10 B; 11 BC; 12 BCD; 13 BD; 14 C; 15 CD; 16 D.)
BUC: Partitioning
- Usually the entire data set can't fit in main memory
- Sort distinct values and partition into blocks that fit; continue processing
- External sorting, hashing, or counting sort can be used
- Optimizations
  - Partitioning
  - Ordering dimensions to encourage pruning: cardinality, skew, correlation
  - Collapsing duplicates
- Can't do holistic aggregates anymore!
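A minimal in-memory sketch of the BUC recursion: partition on each remaining dimension and prune partitions below min_sup. This is illustrative only; the real algorithm works on disk-resident data with the sorting and partitioning optimizations listed above. The sample rows are hypothetical.

```python
# BUC sketch: recursively partition on each remaining dimension,
# pruning any partition whose count falls below min_sup (iceberg pruning).
def buc(tuples, dims, min_sup, prefix=(), out=None):
    if out is None:
        out = {}
    out[prefix] = len(tuples)                  # aggregate (count) for this cell
    for i, d in enumerate(dims):
        parts = {}
        for t in tuples:                       # partition on dimension d
            parts.setdefault(t[d], []).append(t)
        for val, part in parts.items():
            if len(part) >= min_sup:           # prune descendants otherwise
                buc(part, dims[i + 1:], min_sup, prefix + ((d, val),), out)
    return out

rows = [{"A": "a1", "B": "b1"}, {"A": "a1", "B": "b2"}, {"A": "a2", "B": "b1"}]
cube = buc(rows, ["A", "B"], min_sup=2)
# cells below min_sup, e.g. (A=a2), never appear in the output
```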
H-Cubing: Using H-Tree Structure
- Bottom-up computation, exploring an H-tree structure
- If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
- No simultaneous aggregation
(Figure: the cuboid lattice all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD.)
H-tree: A Prefix Hyper-tree
Base table:

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer   500   485
Jan    Tor   Hhd       TV        800  1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD        540   520

(Figure: the H-tree built from this table. A header table lists each attribute value (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with its quant-info (e.g., Sum: 2285) and a side-link into the tree. Tree paths share common prefixes; each node carries quant-info such as Sum: 1765, Cnt: 2, with bins.)
Computing Cells Involving “City”
- Computation proceeds from (*, *, Tor) to (*, Jan, Tor)
(Figure: the header table H_Tor for the Tor subtree is derived by following side-links; quant-info such as Sum: 1765, Cnt: 2 is accumulated along the side-linked nodes.)
Computing Cells Involving Month But No City
1. Roll up quant-info
2. Compute cells involving month but no city
- Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!
(Figure: the H-tree with quant-info rolled up from the city-level nodes to the month-level nodes.)
Computing Cells Involving Only Cust_grp
- Check the header table directly
(Figure: the header table with side-links into the tree; cells for Edu, Hhd, and Bus are read directly off the header entries.)
Star-Cubing: An Integrating Method
D. Xin, J. Han, X. Li, B. W. Wah, "Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration", VLDB'03
- Explore shared dimensions
  - E.g., dimension A is the shared dimension of ACD and AD
  - ABD/AB means cuboid ABD has shared dimensions AB
- Allows for shared computations
  - E.g., cuboid AB is computed simultaneously with ABD
- Aggregate in a top-down manner, but with a bottom-up sub-layer underneath that allows Apriori pruning
- Shared dimensions grow in bottom-up fashion
(Figure: the lattice annotated with shared dimensions: ABC/ABC, ABD/AB, ACD/A, BCD; AC/AC, AD/A, BC/BC, BD/B, CD, C/C, D; ABCD/all.)
Iceberg Pruning in Shared Dimensions
- Anti-monotonic property of shared dimensions: if the measure is anti-monotonic, and the aggregate value on a shared dimension does not satisfy the iceberg condition, then none of the cells extended from this shared dimension can satisfy the condition either
- Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning
- Problem: how to prune while still aggregating simultaneously on multiple dimensions?
Cell Trees
- Use a tree structure similar to the H-tree to represent cuboids
- Collapses common prefixes to save memory
- Keeps a count at each node
- Traverse the tree to retrieve a particular tuple
Star Attributes and Star Nodes
- Intuition: if a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish such values during the iceberg computation
  - E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3
- Solution: replace such attributes by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes

A    B    C    D    Count
a1   b1   c1   d1   1
a1   b1   c4   d3   1
a1   b2   c2   d2   1
a2   b3   c3   d4   1
a2   b4   c3   d4   1
Example: Star Reduction
- Suppose min_sup = 2
- Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *, and collapse all *'s together
- The resulting table has all such attributes replaced with the star attribute
- With regard to the iceberg computation, this new table is a lossless compression of the original table

After star replacement:        After collapsing:
A   B   C   D   Count          A   B   C   D   Count
a1  b1  *   *   1              a1  b1  *   *   2
a1  b1  *   *   1              a1  *   *   *   1
a1  *   *   *   1              a2  *   c3  d4  2
a2  *   c3  d4  1
a2  *   c3  d4  1
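The star reduction above can be sketched in a few lines: count each value's 1-D frequency, star out infrequent values, then merge identical rows. The code reproduces the example table.

```python
# Star reduction sketch: replace attribute values whose 1-D count < min_sup
# with '*', then collapse identical rows, summing their counts.
from collections import Counter

def star_reduce(rows, min_sup):
    # one Counter per column of 1-D value frequencies
    counts = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    reduced = Counter()
    for r in rows:
        key = tuple(v if counts[i][v] >= min_sup else "*"
                    for i, v in enumerate(r))
        reduced[key] += 1                     # collapse duplicates
    return dict(reduced)

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
compressed = star_reduce(rows, 2)
# yields the three compressed rows with counts 2, 1, 2 as in the table above
```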
Star Tree
- Given the new compressed table, it is possible to construct the corresponding cell tree, called a star tree
- Keep a star table at the side for easy lookup of star attributes
- The star tree is a lossless compression of the original cell tree

A   B   C   D   Count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
Star-Cubing Algorithm—DFS on Lattice Tree
(Figure: DFS over the lattice all; A/A, B/B, C/C, D/D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD/all, alongside the base star tree root: 5 with paths a1: 3 → b*: 1 → c*: 1 → d*: 1; a1: 3 → b1: 2 → c*: 2 → d*: 2; a2: 2 → b*: 2 → c3: 2 → d4: 2, and the BCD: 5 tree being built.)
Multi-Way Aggregation
(Figure: the trees BCD, ACD/A, ABD/AB, and ABC/ABC created while traversing the ABCD tree.)
Star-Cubing Algorithm—DFS on Star-Tree
Multi-Way Star-Tree Aggregation
- Start depth-first search at the root of the base star tree
- At each new node in the DFS, create the corresponding star trees that are descendants of the current tree, according to the integrated traversal ordering
  - E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created
  - When DFS reaches b*, the ABD/AB tree is created
- The counts in the base tree are carried over to the new trees
Multi-Way Aggregation (2)
- When DFS reaches a leaf node (e.g., d*), start backtracking
- On every backtracking branch, the counts in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed
- Example
  - When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed
  - When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed
  - When at b*, jump to b1 and repeat the same process
The Curse of Dimensionality
- None of the previous cubing methods can handle high dimensionality!
- Example: a database of 600K tuples, where each dimension has cardinality 100 and a Zipf skew factor of 2
Motivation of High-D OLAP
- Challenge to current cubing methods: the "curse of dimensionality" problem
  - Iceberg cubes and compressed cubes only delay the inevitable explosion
  - Full materialization: still significant overhead in accessing results on disk
- High-D OLAP is needed in applications
  - Science and engineering analysis
  - Bio-data analysis: thousands of genes
  - Statistical surveys: hundreds of variables
Fast High-D OLAP with Minimal Cubing
- Observation: OLAP occurs only on a small subset of dimensions at a time
- Semi-online computational model:
  1. Partition the set of dimensions into shell fragments
  2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices
  3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online
Properties of Proposed Method
- Partitions the data vertically
- Reduces a high-dimensional cube into a set of lower-dimensional cubes
- Online reconstruction of the original high-dimensional space
- Lossless reduction
- Offers tradeoffs between the amount of pre-processing and the speed of online computation
Example Computation
Let the cube aggregation function be count.

tid  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3

Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E).
1-D Inverted Indices
Build a traditional inverted index or RID list:

Attribute Value  TID List       List Size
a1               1 2 3          3
a2               4 5            2
b1               1 4 5          3
b2               2 3            2
c1               1 2 3 4 5      5
d1               1 3 4 5        4
d2               2              1
e1               1 2            2
e2               3 4            2
e3               5              1
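Building these 1-D inverted indices takes a single pass over the base table. A minimal sketch using the running example:

```python
# Build 1-D inverted (RID-list) indices: one TID list per attribute value.
from collections import defaultdict

table = {"tid": [1, 2, 3, 4, 5],
         "A": ["a1", "a1", "a1", "a2", "a2"],
         "B": ["b1", "b2", "b2", "b1", "b1"],
         "C": ["c1", "c1", "c1", "c1", "c1"],
         "D": ["d1", "d2", "d1", "d1", "d1"],
         "E": ["e1", "e1", "e2", "e2", "e3"]}

def inverted_index(table, dims="ABCDE"):
    idx = defaultdict(list)
    for i, tid in enumerate(table["tid"]):
        for dim in dims:
            idx[table[dim][i]].append(tid)   # TIDs appended in sorted order
    return dict(idx)

idx = inverted_index(table)
# e.g. idx["b1"] is [1, 4, 5], list size 3, matching the table above
```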
Shell Fragment Cubes: Ideas
- Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense
- Compute all cuboids for data cubes ABC and DE while retaining the inverted indices
- For example, shell fragment cube ABC contains 7 cuboids: A, B, C; AB, AC, BC; ABC
- This completes the offline computation stage

Cell   Intersection           TID List   List Size
a1 b1  {1, 2, 3} ∩ {1, 4, 5}  1          1
a1 b2  {1, 2, 3} ∩ {2, 3}     2 3        2
a2 b1  {4, 5} ∩ {1, 4, 5}     4 5        2
a2 b2  {4, 5} ∩ {2, 3}                   0
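The AB cuboid above is obtained purely by TID-list intersection of the 1-D indices; a minimal sketch over the running example:

```python
# Compute the AB cuboid of shell fragment (A, B, C) by intersecting
# the 1-D TID lists (inverted indices) shown earlier.
from itertools import product

index = {"a1": {1, 2, 3}, "a2": {4, 5},
         "b1": {1, 4, 5}, "b2": {2, 3}}

def two_d_cuboid(vals1, vals2, index):
    cuboid = {}
    for v1, v2 in product(vals1, vals2):
        tids = index[v1] & index[v2]        # TID-list intersection
        cuboid[(v1, v2)] = sorted(tids)
    return cuboid

ab = two_d_cuboid(["a1", "a2"], ["b1", "b2"], index)
# (a1,b1) -> [1]; (a1,b2) -> [2, 3]; (a2,b1) -> [4, 5]; (a2,b2) -> []
```

Higher-dimensional cuboids (e.g., ABC) follow the same pattern, intersecting a cell's list with one more dimension's TID list at a time.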
Shell Fragment Cubes: Size and Design
- Given a database of T tuples, D dimensions, and shell fragment size F, the fragment cubes' space requirement is O( T · ⌈D/F⌉ · (2^F − 1) )
- For F < 5, the growth is sub-linear
- Shell fragments do not have to be disjoint
- Fragment groupings can be arbitrary to allow for maximum online performance
  - Known common combinations (e.g., <city, state>) should be grouped together
- Shell fragment sizes can be adjusted for an optimal balance between offline and online computation
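A quick back-of-envelope evaluation of the space bound makes the tradeoff concrete; the T and D values below are illustrative, not from the slides.

```python
# Evaluate the shell-fragment space bound T * ceil(D/F) * (2^F - 1)
# for a few fragment sizes F (T and D chosen for illustration).
from math import ceil

def shell_space(T, D, F):
    return T * ceil(D / F) * (2 ** F - 1)

for F in (1, 2, 3, 4, 5):
    print(F, shell_space(600_000, 100, F))
# the bound grows quickly with F, which is why small F (< 5) is preferred
```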
ID_Measure Table
If measures other than count are present, store them in an ID_measure table separate from the shell fragments:

tid  count  sum
1    5      70
2    3      10
3    8      20
4    5      40
5    2      30
The Frag-Shells Algorithm
1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk)
2. Scan the base table once and do the following:
3.   insert <tid, measure> into the ID_measure table
4.   for each attribute value ai of each dimension Ai
5.     build inverted index entry <ai, tidlist>
6. For each fragment partition Pi
7.   build local fragment cube Si by intersecting tid-lists in bottom-up fashion
Frag-Shells (2)
(Figure: the dimensions A B C D E F … are split into fragments; the ABC cube and the DEF cube are built, the latter containing the D, DE, and EF cuboids among others.)

Example DE cuboid:
Cell   Tuple-ID List
d1 e1  {1, 3, 8, 9}
d1 e2  {2, 4, 6, 7}
d2 e1  {5, 10}
…      …
Online Query Computation: Query
- A query has the general form <a1, a2, …, an> : M
- Each ai has 3 possible values:
  1. Instantiated value
  2. Aggregate (*) function
  3. Inquire (?) function
- For example, <3 ? ? * 1> : count returns a 2-D data cube
Online Query Computation: Method
Given the fragment cubes, process a query as follows:
1. Divide the query into fragments, same as the shell
2. Fetch the corresponding TID list for each fragment from the fragment cube
3. Intersect the TID lists from each fragment to construct the instantiated base table
4. Compute the data cube using the base table with any cubing algorithm
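Steps 2–4 above can be sketched for a count query on the running example; the fragment cube contents below are the hypothetical cells needed for this one query, not full cubes.

```python
# Online query sketch: fetch the TID list for each instantiated fragment
# cell, intersect the lists, then aggregate (here, count).
fragment_abc = {("a2",): {4, 5}}         # fragment (A, B, C), cell (a2, *, *)
fragment_de = {("d1",): {1, 3, 4, 5}}    # fragment (D, E), cell (d1, *)

def answer_count(query_cells, fragments):
    tids = None
    for frag, cell in zip(fragments, query_cells):
        t = frag[cell]                        # step 2: fetch TID list
        tids = t if tids is None else tids & t  # step 3: intersect
    return len(tids)                          # step 4: aggregate

# count for cell (a2, *, *, d1, *): {4, 5} ∩ {1, 3, 4, 5} = {4, 5}
n = answer_count([("a2",), ("d1",)], [fragment_abc, fragment_de])
```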
Online Query Computation: Sketch
(Figure: TID lists for the instantiated dimensions among A B C D E F G H I J K L M N … are intersected into an instantiated base table, which feeds an online cube computation.)
Experiment: Size vs. Dimensionality (50 and 100 Cardinality)
- (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3
- (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2
Experiment: Size vs. Shell-Fragment Size
- (50-D): 10^6 tuples, 50 dimensions, 0 skew, 50 cardinality
- (100-D): 10^6 tuples, 100 dimensions, 2 skew, 25 cardinality
Experiment: Run-time vs. Shell-Fragment Size
- 10^6 tuples, 20 dimensions, 10 cardinality, skew 1, fragment size 3, 3 instantiated dimensions
Experiments on Real World Data
- UCI Forest CoverType data set
  - 54 dimensions, 581K tuples
  - Shell fragments of size 2 took 33 seconds and 325 MB to compute
  - 3-D subquery with 1 instantiated dimension: 85 ms to 1.4 sec
- Longitudinal Study of Vocational Rehab. data
  - 24 dimensions, 8,818 tuples
  - Shell fragments of size 3 took 0.9 seconds and 60 MB to compute
  - 5-D query with 0 instantiated dimensions: 227 ms to 2.6 sec
High-D OLAP: Further Implementation Considerations
- Incremental update
  - Append more TIDs to the inverted lists
  - Add <tid: measure> to the ID_measure table
- Incrementally adding new dimensions
  - Form new inverted lists and add new fragments
- Bitmap indexing
  - May further improve space usage and speed
- Inverted index compression
  - Store as d-gaps
  - Explore more IR compression methods
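The d-gap idea stores differences between consecutive TIDs rather than absolute TIDs, yielding small integers that standard IR codes (e.g., variable-byte) compress well. A minimal round-trip sketch:

```python
# d-gap encoding for inverted-list compression: store the first TID,
# then the difference between each TID and its predecessor.
def to_dgaps(tids):
    return [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]

def from_dgaps(gaps):
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

tids = [1, 3, 4, 5]          # TID list for d1 in the running example
gaps = to_dgaps(tids)        # [1, 2, 1, 1], small ints compress well
assert from_dgaps(gaps) == tids
```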
Chapter 4: Data Cube Technology
Efficient Computation of Data Cubes
Exploration and Discovery in Multidimensional
Databases
Discovery-Driven Exploration of Data Cubes
Sampling Cube
Prediction Cube
Regression Cube
Summary
Discovery-Driven Exploration of Data Cubes
- Hypothesis-driven: exploration by the user; huge search space
- Discovery-driven (Sarawagi, et al. '98)
  - Effective navigation of large OLAP data cubes
  - Pre-compute measures indicating exceptions, to guide the user in the data analysis at all levels of aggregation
  - Exception: a value significantly different from the value anticipated, based on a statistical model
  - Visual cues such as background color are used to reflect the degree of exception of each cell
Kinds of Exceptions and their Computation
- Parameters
  - SelfExp: surprise of a cell relative to other cells at the same level of aggregation
  - InExp: surprise beneath the cell
  - PathExp: surprise beneath the cell for each drill-down path
- Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
- Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
Examples: Discovery-Driven Data Cubes
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
- Multi-feature cubes (Ross, et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
- Ex.: Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum-price tuples:

  select item, region, month, max(price), sum(R.sales)
  from purchases
  where year = 1997
  cube by item, region, month: R
  such that R.price = max(price)

- Continuing the example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
Chapter 4: Data Cube Technology
- Efficient Computation of Data Cubes
- Exploration and Discovery in Multidimensional Databases
  - Discovery-Driven Exploration of Data Cubes
  - Sampling Cube
    - X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, "Sampling Cube: A Framework for Statistical OLAP over Sampling Data", SIGMOD'08
  - Prediction Cube
  - Regression Cube
Statistical Surveys and OLAP
- Statistical survey: a popular tool to collect information about a population based on a sample
  - Ex.: TV ratings, US Census, election polls
- A common tool in politics, health, market research, science, and many more
- An efficient way of collecting information (data collection is expensive)
- Many statistical tools are available to determine validity
  - Confidence intervals
  - Hypothesis tests
- OLAP (multidimensional analysis) on survey data is highly desirable, but can it be done well?
Surveys: Sample vs. Whole Population
Data is only a sample of the population.
(Table: sample records by Age (18, 19, 20, …) × Education (High-school, College, Graduate); cell values omitted in these slides.)
Problems for Drilling in Multidim. Space
Data is only a sample of the population, but samples can become small when drilling down to certain multidimensional subspaces.
(Table: Age × Education as before.)
Traditional OLAP Data Analysis Model
(Figure: dimensions Age, Education, Income feed a full data warehouse and its data cube.)
- Assumption: the data is the entire population
- Query semantics is population-based, e.g., "What is the average income of 19-year-old college graduates?"
OLAP on Survey (i.e., Sampling) Data
- Semantics of the query is unchanged
- The input data has changed
(Table: Age × Education as before.)
OLAP with Sampled Data
- Where is the missing link?
- We run OLAP over sampling data, but our analysis target is still the population
- Idea: integrate sampling and statistical knowledge with traditional OLAP tools

Input Data   Analysis Target   Analysis Tool
Population   Population        Traditional OLAP
Sample       Population        Not Available
Challenges for OLAP on Sampling Data
- Computing confidence intervals in the OLAP context
- No data? Not exactly: there is no data in some subspaces of the cube
  - Sparse data: causes include sampling bias and query selection bias
- Curse of dimensionality
  - Survey data can be high dimensional: over 600 dimensions in a real-world example
  - Impossible to fully materialize
Example 1: Confidence Interval
- What is the average income of 19-year-old high-school students?
- Return not only the query result but also a confidence interval
(Table: Age × Education as before.)
Confidence Interval
- Confidence interval at a given confidence level: x̄ ± t_c · σ_x̄, where
  - x is a sample of the data set; x̄ is the mean of the sample
  - t_c is the critical t-value, obtained by table look-up
  - σ_x̄ is the estimated standard error of the mean
- Example: $50,000 ± $3,000 with 95% confidence
- Treat the points in a cube cell as samples
  - Compute the confidence interval as for a traditional sample set
- Return the answer in the form of a confidence interval
  - Indicates the quality of the query answer
  - The user selects the desired confidence level
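Treating a cell's points as a sample, the interval is computable in a few lines. The income values and hard-coded critical value below are illustrative; a real system would look up t_c from a t-table or statistics library.

```python
# Mean with a t-based confidence half-width, treating the points in a
# cube cell as a sample (pure-Python sketch; sample values hypothetical).
from math import sqrt
from statistics import mean, stdev

def confidence_interval(sample, t_crit):
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))   # estimated standard error
    return m, t_crit * se                    # (mean, half-width)

incomes = [47_000, 52_000, 49_000, 55_000, 48_000]
m, h = confidence_interval(incomes, t_crit=2.776)  # t for df=4 at 95%
# the answer is reported as m ± h
```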
Efficient Computing Confidence Interval Measures
- Efficient computation for all cells in the data cube: both the mean and the confidence interval are algebraic
- Why is the confidence interval measure algebraic? σ_x̄ = s / √l is algebraic, where both s (the sample standard deviation) and l (the count) are algebraic
- Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
Example 2: Query Expansion
What is the average income of 19-year-old college students?
(Table: Age × Education as before; the queried cell (19, College) contains few samples.)
Boosting Confidence by Query Expansion
- From the example: the queried cell "19-year-old college students" contains only 2 samples
- The confidence interval is large (i.e., low confidence). Why?
  - Small sample size
  - High standard deviation among the samples
- Small sample sizes can occur even at relatively low-dimensional selections
- Collect more data? Expensive!
- Use data in other cells? Maybe, but we have to be careful
Intra-Cuboid Expansion: Choice 1
Expand the query to include 18- and 20-year-olds?
(Table: Age × Education as before.)
Intra-Cuboid Expansion: Choice 2
Expand the query to include high-school and graduate students?
(Table: Age × Education as before.)
Intra-Cuboid Expansion
- If other cells in the same cuboid satisfy both of the following:
  1. Similar semantic meaning
  2. Similar cube value
  then their data can be combined into the queried cell to "boost" confidence
- Only use if necessary: a bigger sample size will narrow the confidence interval
Intra-Cuboid Expansion (2)
- Cell segment similarity
  - Some dimensions are clear: Age
  - Some are fuzzy: Occupation
  - May need domain knowledge
- Cell value similarity
  - How to determine whether two cells' samples come from the same population?
  - Two-sample t-test (confidence-based)
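The two-sample test can be sketched with Welch's t statistic; the cell samples below are hypothetical, and a full test would also compute the degrees of freedom and compare against a t-distribution.

```python
# Welch's two-sample t statistic: a small |t| suggests the two cells'
# samples are compatible and may be combined (pure-Python sketch).
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    return (mean(x) - mean(y)) / sqrt(vx + vy)

cell_a = [50, 52, 49, 51]    # hypothetical cube-cell samples
cell_b = [48, 49, 50, 51]
t = welch_t(cell_a, cell_b)
```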
Inter-Cuboid Expansion
- If a query dimension is
  - not correlated with the cube value, but
  - is causing a small sample size by drilling down too much,
  then remove the dimension (i.e., generalize it to *) and move to a more general cuboid
- Can use the two-sample t-test to determine the similarity between two cells across cuboids
- Can also use a different method, shown later
Query Expansion
Query Expansion Experiments (2)
- Real-world sample data: 600 dimensions and 750,000 tuples
- A 0.05% sample is used to simulate the "sample" (allows error checking)
Query Expansion Experiments (3)
Sampling Cube Shell: Handling the "Curse of Dimensionality"
- Real-world data may have > 600 dimensions
- Materializing the full sampling cube is unrealistic
- Solution: only compute a "shell" around the full sampling cube
- Method: selectively compute the best cuboids to include in the shell
  - Chosen cuboids will be low dimensional
  - The size and depth of the shell depend on user preference (space vs. accuracy tradeoff)
- Use the cuboids in the shell to answer queries
Sampling Cube Shell Construction
Top-down, iterative, greedy algorithm:
1. Top-down: start at the apex cuboid and slowly expand to higher dimensions, following the cuboid lattice structure
2. Iterative: add one cuboid per iteration
3. Greedy: at each iteration, choose the best cuboid to add to the shell
Stop when either the size limit is reached or it is no longer beneficial to add another cuboid.
Sampling Cube Shell Construction (2)
- How to measure the quality of a cuboid? Cuboid Standard Deviation (CSD)
  - s(ci): the sample standard deviation of the samples in cell ci
  - n(ci): the number of samples in ci; n(B): the total number of samples in B
  - small(ci) returns 1 if n(ci) ≥ min_sup and 0 otherwise
- CSD measures the amount of variance in the cells of a cuboid
  - Low CSD indicates high correlation with the cube value: good
  - High CSD indicates little correlation: bad
Sampling Cube Shell Construction (3)
- Goal (Cuboid Standard Deviation Reduction, CSDR): reduce CSD to increase query information
- Find the cuboids with the least amount of CSD
- Overall algorithm
  - Start with the apex cuboid and compute its CSD
  - Choose the next cuboid to build that reduces the CSD the most from the apex
  - Iteratively choose more cuboids that reduce the CSD of their parent cuboids the most (greedy selection: at each iteration, choose the cuboid with the largest CSDR to add to the shell)
Computing the Sampling Cube Shell
Query Processing
- If the query matches a cuboid in the shell, use it
- If the query does not match:
  1. A more specific cuboid exists: aggregate to the query dimensions' level and answer the query
  2. A more general cuboid exists: use the value in the cell
  3. Multiple more general cuboids exist: use the most confident value
Query Accuracy
Sampling Cube: Sampling-Aware OLAP
1. Confidence intervals in query processing
   - Integration with OLAP queries
   - Efficient algebraic query processing
2. Expand queries to increase confidence
   - Solves the sparse-data problem
   - Inter-/intra-cuboid query expansion
3. Cube materialization with limited space
   - Sampling cube shell
Chapter Summary
- Efficient algorithms for computing data cubes
  - Multiway array aggregation
  - BUC
  - H-cubing
  - Star-Cubing
  - High-D OLAP by minimal cubing
- Further developments of data cube technology
  - Discovery-driven cubes
  - Multi-feature cubes
  - Sampling cubes
Explanation of Multi-way Array Aggregation: the "a0b0" chunk
(Figure: the base array laid out as chunks a0b0c0–c3, a0b1c0–c3, a0b2c0–c3, a0b3c0–c3, …; scanning the a0b0 chunks produces partial aggregates, marked x, for row b0 against c0–c3.)
The "a0b1" chunk
(Figure: scanning the a0b1 chunks adds partial aggregates y alongside the x's; done with a0b0.)
The "a0b2" chunk
(Figure: partial aggregates z are added; done with a0b1.)
Table Visualization
(Figure: partial aggregates u from a0b3 complete rows b0–b3 (x, y, z, u); done with a0b2.)
Table Visualization (cont.)
(Figure: the scan moves on to the a1 chunks; done with a0b3 and with the a0c* chunks.)
The "a3b3" chunk (the last one)
(Figure: after scanning the a3 chunks, a0b3, a0c*, and the b*c* plane cells are all done. Finish.)
Memory Used
- A: 40 distinct values, B: 400 distinct values, C: 4000 distinct values (4 chunks per dimension)
- ABC sort order:
  - Plane AB: need 1 chunk (10 × 100 × 1)
  - Plane AC: need 4 chunks (10 × 1000 × 4)
  - Plane BC: need 16 chunks (100 × 1000 × 16)
  - Total memory: 1,641,000
Memory Used
- A: 40 distinct values, B: 400 distinct values, C: 4000 distinct values
- CBA sort order:
  - Plane CB: need 1 chunk (1000 × 100 × 1)
  - Plane CA: need 4 chunks (1000 × 10 × 4)
  - Plane BA: need 16 chunks (100 × 10 × 16)
  - Total memory: 156,000
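The plane-memory arithmetic on these two slides can be reproduced directly: each 2-D plane costs its chunk area times the number of chunks that must be held simultaneously (1, 4, and 16 for the smallest, middle, and largest plane).

```python
# Reproduce the memory arithmetic above: memory for each 2-D plane is
# (chunk width x chunk height) x (number of simultaneously held chunks).
def plane_memory(plane_chunk_sizes):
    """plane_chunk_sizes: (width, height) of each plane's chunk, listed
    smallest plane first; held-chunk counts are 1, 4, 16 respectively."""
    return sum(w * h * k for (w, h), k in zip(plane_chunk_sizes, (1, 4, 16)))

# ABC scan order: planes AB (10 x 100), AC (10 x 1000), BC (100 x 1000)
abc = plane_memory([(10, 100), (10, 1000), (100, 1000)])
# CBA scan order: planes CB (1000 x 100), CA (1000 x 10), BA (100 x 10)
cba = plane_memory([(1000, 100), (1000, 10), (100, 10)])
# abc = 1,641,000 vs. cba = 156,000: scanning the largest dimension
# first (C > B > A) minimizes the memory held for the 2-D planes
```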
H-Cubing Example
- H-table (condition: ???): a1: 3, a2: 1, b1: 3, b2: 1, c1: 2, c2: 1, c3: 1
- Output: (*,*,*,*): 4
- H-Tree 1: root: 4, with paths a1: 3 → b1: 2 → (c1: 1, c2: 1); a1: 3 → b2: 1 → c3: 1; a2: 1 → b1: 1 → c1: 1
- Side-links are used to create a virtual tree
- In this example, for clarity, we print the resulting virtual tree, not the real tree
project C: ??c1
- H-table (condition: ??c1): a1: 1, a2: 1, b1: 2, b2: 1
- Output: (*,*,*,*): 4; (*,*,*,c1): 2
- H-Tree 1.1: root: 4, with paths a1: 1 → b1: 1; a2: 1 → b1: 1
project ?b1c1
- H-table (condition: ?b1c1): a1: 1, a2: 1
- Output: (*,*,*): 4; (*,*,c1): 2; (*,b1,c1): 2
- H-Tree 1.1.1: root: 4, with a1: 1; a2: 1
Aggregate (?,*,c1)
H-table (condition (?,?,c1)): a1: 1, a2: 1
H-Tree 1.1.2:
root: 4
  a1: 1
  a2: 1
Output: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2
Aggregate (?,?,*)
H-table (condition (?,?,?)): a1: 3, a2: 1, b1: 3, b2: 1
H-Tree 1.2:
root: 4
  a1: 3
    b1: 2
    b2: 1
  a2: 1
    b1: 1
Output: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2
Project (?,b1,*)
H-table (condition (?,b1,?)): a1: 2, a2: 1
H-Tree 1.2.1:
root: 4
  a1: 2
  a2: 1
Output: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2, (*,b1,*): 3, (a1,b1,*): 2
After this we also project on (a1,b1,*)
Aggregate (?,*,*)
H-table (condition (?,?,?)): a1: 3, a2: 1
H-Tree 1.2.2:
root: 4
  a1: 3
  a2: 1
Output: (*,*,*): 4, (*,*,c1): 2, (*,b1,c1): 2, (*,b1,*): 3, (a1,b1,*): 2, (a1,*,*): 3
Here we also project on a1
Finish
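The final output list can be cross-checked by brute force: enumerate every cell over the four base tuples (as read off the first H-tree) and keep those meeting the minimum count of 2, the iceberg threshold implied by the run.

```python
# Brute-force iceberg cube over the example's four base tuples.
from itertools import product
from collections import Counter

TUPLES = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
          ("a1", "b2", "c3"), ("a2", "b1", "c1")]

counts = Counter()
for t in TUPLES:
    for mask in product([True, False], repeat=3):
        # Each mask picks which dimensions to keep; the rest become "*".
        counts[tuple(v if keep else "*" for v, keep in zip(t, mask))] += 1

iceberg = {cell: n for cell, n in counts.items() if n >= 2}
for cell, n in sorted(iceberg.items()):
    print(cell, n)
```

This prints exactly the six cells produced by the H-cubing walkthrough, from (*,*,*): 4 down to (a1,b1,*): 2.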
3. Explanation of Star Cubing
Base ABCD-Tree (star-reduced); the ABCD-cuboid trees are built alongside it during the traversal that follows:
root: 5
  a1: 3
    b*: 1
      c*: 1
        d*: 1
    b1: 2
      c*: 2
        d*: 2
  a2: 2
    b*: 2
      c3: 2
        d4: 2
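The starred nodes (b*, c*, d*) come from star reduction: with the iceberg condition count >= 2, any attribute value whose total support falls below 2 is collapsed to *. A sketch; the five base tuples below are one possible relation consistent with the tree above (the actual relation appears on an earlier slide), so treat them as assumed.

```python
# Star reduction sketch: values with support below the iceberg threshold
# collapse to "*", which is what produces the shared b*/c*/d* nodes.
from collections import Counter

MIN_SUP = 2
TUPLES = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
          ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
          ("a2", "b4", "c3", "d4")]

def star_reduce(tuples, min_sup):
    support = [Counter(t[i] for t in tuples) for i in range(len(tuples[0]))]
    return [tuple(v if support[i][v] >= min_sup else "*"
                  for i, v in enumerate(t)) for t in tuples]

reduced = star_reduce(TUPLES, MIN_SUP)
print(Counter(reduced))
```

The reduced multiset is (a1,b1,*,*) twice, (a1,*,*,*) once, and (a2,*,c3,d4) twice, matching the branch counts in the tree.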
Step 1: create the cuboid-tree root BCD: 5
Step 2: visit a1: 3; create cuboid tree a1CD/a1: 3; output (a1,*,*): 3
Step 3: visit b*: 1; add b*: 1 under BCD; create cuboid tree a1b*D/a1b*: 1
Step 4: visit c*: 1; add c*: 1 under a1CD; create cuboid tree a1b*c*/a1b*c*: 1
Step 5: visit d*: 1; add d*: 1 leaves to the BCD, a1CD, and a1b*D trees
Mine subtree ABC/ABC (a1b*c*/a1b*c*: 1): nothing to do; remove
Mine subtree ABD/AB (a1b*D/a1b*: 1): nothing to do; remove
Step 6: visit b1: 2; create cuboid tree a1b1D/a1b1: 2; output (a1,b1,*): 2
Step 7: visit c*: 2; create cuboid tree a1b1c*/a1b1c*: 2; c* under a1CD aggregates to 3
Step 8: visit d*: 2; the d* counts in the BCD and a1CD trees aggregate to 3
Mine subtree ABC/ABC (a1b1c*/a1b1c*: 2): nothing to do; remove
Mine subtree ABD/AB (a1b1D/a1b1: 2): nothing to do (all interior nodes *); remove
Mine subtree ACD/A (a1CD/a1: 3): nothing to do (all interior nodes *); remove
Step 9: visit a2: 2; create cuboid tree a2CD/a2: 2; output (a2,*,*): 2
Step 10: visit b*: 2; create cuboid tree a2b*D/a2b*: 2; b* under BCD aggregates to 3
Step 11: visit c3: 2 and d4: 2; create cuboid tree a2b*c3/a2b*c3: 2; add the c3 and d4 nodes to the other trees
Mine subtree a2b*c3/a2b*c3: nothing to do; remove
Mine subtree a2b*D/a2b*: nothing to do; remove
Mine subtree ACD/A (a2CD/a2: 2, with c3: 2 and d4: 2) recursively, for AC/AC and AD/A, then remove:
  Recursive mining, step 1: create cuboid tree a2D/a2: 2
  Recursive mining, step 2: visit c3: 2; create a2c3/a2c3: 2; output (a2,*,c3): 2
  Recursive mining, backtrack: output (a2,c3,d4): 2; same as before, as we backtrack we recursively mine child trees
Mine Subtree BCD
The BCD tree (BCD: 5, with branches b*: 3 and b1: 2) is mined last, for BC/BC, BD/B, and CD, then removed. We will skip this step; you may do it as an exercise.
Finish
Final output (plus the patterns from the BCD tree):
(a1,*,*): 3
(a1,b1,*): 2
(a2,*,*): 2
(a2,*,c3): 2
(a2,c3,d4): 2
Efficient Computation of Data Cubes
General heuristics
Multi-way array aggregation
BUC
H-cubing
Star-Cubing
High-Dimensional OLAP
Computing non-monotonic measures
Compressed and closed cubes
Computing Cubes with Non-Antimonotonic Iceberg Conditions
J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01
Most cubing algorithms cannot compute cubes with non-antimonotonic iceberg conditions efficiently
Example
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50
How to push the constraint into the iceberg cube computation?
Non-Anti-Monotonic Iceberg Condition
Anti-monotonic: if a cell fails a condition, continued processing of its descendants will still fail
The cubing query with avg is non-anti-monotonic!
  (Mar, *, *, 600, 1800) fails the HAVING clause
  (Mar, *, Bus, 1300, 360) passes the clause

Month | City | Cust_grp | Prod    | Cost | Price
Jan   | Tor  | Edu      | Printer |  500 |   485
Jan   | Tor  | Hld      | TV      |  800 |  1200
Jan   | Tor  | Edu      | Camera  | 1160 |  1280
Feb   | Mon  | Bus      | Laptop  | 1500 |  2500
Mar   | Van  | Edu      | HD      |  540 |   520
…     | …    | …        | …       |  …   |   …
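The failure of anti-monotonicity is easy to reproduce with toy numbers; the prices below are made up, in the spirit of the (Mar, *, *) vs. (Mar, *, Bus) cells above:

```python
# A cell can fail AVG(price) >= 800 while one of its descendants passes,
# so failing AVG gives no license to prune.
def avg(xs):
    return sum(xs) / len(xs)

mar_all = [200, 300, 400, 1300, 1300]   # all March sales (invented)
mar_bus = [1300, 1300]                  # the Bus subgroup of those sales

print(avg(mar_all) >= 800)  # False: (Mar, *, *) fails the clause
print(avg(mar_bus) >= 800)  # True: (Mar, *, Bus) passes it
```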
From Average to Top-k Average
Let (*, Van, *) cover 1,000 records
  Avg(price) is the average price of those 1,000 sales
  Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)
Top-k average is anti-monotonic
  If the top-50 sales in Van. have avg(price) <= 800, then the top-50 sales in Van. during Feb must also have avg(price) <= 800
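A descendant cell covers a subset of its ancestor's records, so its top-k average can never exceed the ancestor's; that is the anti-monotonicity claimed above. A small sketch with invented prices:

```python
# Top-k average: mean of the k largest measure values in a cell.
def avg_k(prices, k):
    top = sorted(prices, reverse=True)[:k]
    return sum(top) / len(top)

van_all = [1500, 1200, 900, 700, 400, 300]   # (*, Van, *), invented
van_feb = [1200, 700, 300]                   # (Feb, Van, *), a subset

print(avg_k(van_all, 3))  # 1200.0
print(avg_k(van_feb, 3))  # never exceeds the ancestor's value
```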
Binning for Top-k Average
Computing top-k avg is costly with large k
Binning idea (for the condition avg50(c) >= 800)
  Large value collapsing: use a sum and a count to summarize records with measure >= 800
    If count >= 50, no need to check "small" records
  Small value binning: a group of bins
    One bin covers a range, e.g., 600~800, 400~600, etc.
    Register a sum and a count for each bin
Computing Approximate Top-k Average
Suppose for (*, Van, *), we have:

Range    | Sum    | Count
Over 800 | 28,000 | 20
600~800  | 10,600 | 15
400~600  | 15,200 | 30
…        | …      | …

Approximate avg50() = (28,000 + 10,600 + 600 × 15) / 50 = 952
(The first two bins supply 35 of the top 50 records; the remaining 15 come from the 400~600 bin, each bounded above by 600.)
The cell may pass the HAVING clause
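The 952 figure follows mechanically from the bins: take whole bins from the top until fewer than k records remain, then bound the rest by the next bin's upper limit. A sketch, assuming bins are given as (upper_bound, sum, count) tuples ordered from largest values down:

```python
# Approximate top-k average from binned summaries, as in the slide's example.
def approx_topk_avg(bins, k):
    total, need = 0.0, k
    for upper, s, cnt in bins:
        if cnt <= need:          # take the whole bin
            total += s
            need -= cnt
        else:                    # bound the remaining records by the bin cap
            total += upper * need
            need = 0
        if need == 0:
            break
    return total / k

bins = [(float("inf"), 28000, 20), (800, 10600, 15), (600, 15200, 30)]
print(approx_topk_avg(bins, 50))  # 952.0
```

Because every substituted value is an upper bound, the estimate never understates the true avg50, so a cell that fails it can be pruned safely.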
Weakened Conditions Facilitate Pushing
Accumulate quant-info for cells to compute average iceberg cubes efficiently
  Three pieces: sum, count, top-k bins
  Use top-k bins to estimate/prune descendants
  Use sum and count to consolidate the current cell
From weakest to strongest condition:
  Approximate avg50(): anti-monotonic, can be computed efficiently (weakest)
  Real avg50(): anti-monotonic, but computationally costly
  avg(): not anti-monotonic (strongest)
Computing Iceberg Cubes with Other Complex Measures
Computing other complex measures
  Key point: find a function which is weaker but ensures certain anti-monotonicity
Examples
  Avg() ≤ v: avg_k(c) ≤ v (bottom-k avg)
  Avg() ≥ v only (no count): max(price) ≥ v
  Sum(profit) (profit can be negative): p_sum(c) ≥ v if p_count(c) ≥ k; or otherwise, sum_k(c) ≥ v
  Others: conjunctions of multiple conditions
Efficient Computation of Data Cubes
General heuristics
Multi-way array aggregation
BUC
H-cubing
Star-Cubing
High-Dimensional OLAP
Computing non-monotonic measures
Compressed and closed cubes
Compressed Cubes: Condensed or Closed Cubes
This is challenging even for an iceberg cube: suppose 100 dimensions, only 1 base cell with count = 10. How many aggregate cells if count >= 10?
Condensed cube
  Only need to store one cell (a1, a2, …, a100, 10), which represents all the corresponding aggregate cells
  Efficient computation of the minimal condensed cube
Closed cube
  Dong Xin, Jiawei Han, Zheng Shao, and Hongyan Liu, "C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking", ICDE'06
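The answer to the question above is 2^100 - 1 aggregate cells, one for every way of generalizing a nonempty subset of the 100 dimensions to *, each with count 10; the condensed cube stores the single base cell instead. A quick check by enumeration in 3 dimensions (the helper is illustrative):

```python
# Every strict generalization of one base cell: each dimension either keeps
# its value or becomes "*", minus the base cell itself -> 2**d - 1 cells.
from itertools import product

def aggregates_of(cell):
    options = [(v, "*") for v in cell]
    return {c for c in product(*options) if c != cell}

cells = aggregates_of(("a1", "a2", "a3"))
print(len(cells))  # 7, i.e. 2**3 - 1
```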
Exploration and Discovery in
Multidimensional Databases
Discovery-Driven Exploration of Data Cubes
Complex Aggregation at Multiple Granularities:
Multi-Feature Cubes
Cube-Gradient Analysis
Cube-Gradient (Cubegrade)
Analysis of changes of sophisticated measures in multi-dimensional spaces
  Query: changes of average house price in Vancouver in '00 compared against '99
  Answer: Apts in West went down 20%, houses in Metrotown went up 10%
Cubegrade problem by Imielinski et al.
  Changes in dimensions → changes in measures
  Drill-down, roll-up, and mutation
From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes
Significantly more expressive than association rules
  Capture trends in user-specified measures
Serious challenges
  Many trivial cells in a cube → "significance constraint" to prune trivial cells
  Numerous pairs of cells → "probe constraint" to select a subset of cells to examine
  Only interesting changes wanted → "gradient constraint" to capture significant changes
MD Constrained Gradient Mining
Significance constraint Csig: (cnt ≥ 100)
Probe constraint Cprb: (city = "Van", cust_grp = "busi", prod_grp = "*")
Gradient constraint Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ 1.3)
Probe cell: satisfies Cprb
(c4, c2) satisfies Cgrad!

      Dimensions                          Measures
cid | Yr | City | Cst_grp | Prd_grp |   Cnt | Avg_price
c1  | 00 | Van  | Busi    | PC      |   300 | 2100       (base cell)
c2  | *  | Van  | Busi    | PC      |  2800 | 1800       (aggregated cell; sibling of c3)
c3  | *  | Tor  | Busi    | PC      |  7900 | 2350       (aggregated cell; sibling of c2)
c4  | *  | *    | busi    | PC      | 58600 | 2250       (ancestor)
Efficient Computation of Cube-Gradients
Compute probe cells using Csig and Cprb
  The set of probe cells P is often very small
Use the probe cells P and the constraints to find gradients
  Pushing selection deeply
  Set-oriented processing for probe cells
  Iceberg growing from low to high dimensionalities
  Dynamic pruning of probe cells during growth
  Incorporating an efficient iceberg cubing method