Prediction Cubes
Bee-Chung Chen, Lei Chen,
Yi Lin and Raghu Ramakrishnan
University of Wisconsin - Madison
Big Picture
• Subset analysis: Use “models” to identify
interesting subsets
• Cube space: Dimension hierarchies
– Combination of dimension-attribute values
defines a candidate subset (like regular OLAP)
• Measure of interest for a subset
– Decision/prediction behavior within the subset
– Big difference from regular OLAP!!
Example (1/5): Traditional OLAP
Goal: Look for patterns of unusually high numbers of applications.

Fact table with schema (Time, Location, # of App.): e.g., (Dec, 04; AL, USA; 7), …, (Dec, 04; WY, USA; 3).
Cell value: number of loan applications.

[Figure: A traditional OLAP cube over Location and Time. Rolling up aggregates application counts from the [State, Month] level (provinces AB…YT of CA, states AL…WY of USA) to the [Country, Year] level; drilling down disaggregates them.]
Example (2/5): Decision Analysis
Goal: Analyze a bank's loan approval process w.r.t. two dimensions: Location and Time.

Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Yes
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | No

Model: e.g., a decision tree.

Dimension hierarchies:
  Location: All → Country (Japan, USA, Norway, …) → State (AL, …, WY)
  Time: All → Year (85, 86, …, 04) → Month (Jan. 86, …, Dec. 86, …)
Example (3/5): Questions of Interest
• Goal: Analyze a bank’s loan decision process with
respect to two dimensions: Location and Time
• Target: Find discriminatory loan decisions
• Questions:
– Are there locations and times when the decision making was similar to a set of discriminatory decision examples (or similar to a given discriminatory decision model)?
– Are there locations and times during which approvals depended highly on Race or Sex?
Example (4/5): Prediction Cube
For each cell (e.g., [WI, Dec. 1985]):
1. Build a model using data from WI in Dec., 1985
2. Evaluate that model

Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Y
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | N

Model: e.g., a decision tree.

Measure in a cell:
• Accuracy of the model
• Predictiveness of Race measured based on that model
• Similarity between that model and a given model

[Figure: A prediction cube at the [State, Month] level; rows WA, WI, …; columns Jan–Dec of 2003 and 2004; cell values such as 0.4, 0.8, 0.9.]
Example (5/5): Prediction Cube
Cell value: predictiveness of Race.

[Figure: Rolling up the prediction cube from the [State, Month] level (provinces AB…YT of CA, states AL…WY of USA; e.g., WY in Dec. 2004 has value 0.9) to the [Country, Year] level, and drilling back down.]
Model-Based Subset Analysis
• Given: A data table D with schema [Z, X, Y]
– Z: Dimension attributes, e.g., {Location, Time}
– X: Predictor attributes, e.g., {Race, Sex, …}
– Y: Class-label attribute, e.g., Approval
Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Yes
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | No
Model-Based Subset Analysis
σ[USA, Dec 04](D):
  Z: Dimension        | X: Predictor      | Y: Class
  Location | Time     | Race  | Sex | …  | Approval
  AL, USA  | Dec, 04  | White | M   | …  | Yes
  …        | …        | …     | …   | …  | …
  WY, USA  | Dec, 04  | Black | F   | …  | No

• Goal: To understand the relationship between X and Y on different subsets σZ(D) of data D
– Relationship: p(Y | X, σZ(D))
• Approach:
– Build model h(X; σZ(D)) ≈ p(Y | X, σZ(D)) (see the sketch below)
– Evaluate h(X; σZ(D))
• Accuracy, model similarity, predictiveness
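A minimal sketch of this per-subset build-and-evaluate loop, assuming D is a pandas DataFrame with the columns above and predictor columns already numerically encoded; the function and column names are illustrative, not from the paper's implementation:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def subset_models(D, dim_cols, X_cols, y_col):
    """Build one model h(X; σ_Z(D)) per combination Z of dimension values."""
    models = {}
    for z, subset in D.groupby(dim_cols):   # subset is σ_Z(D)
        model = DecisionTreeClassifier()
        model.fit(subset[X_cols], subset[y_col])
        models[z] = model                   # each model is evaluated later
    return models

# e.g., subset_models(D, ["Location", "Time"], ["Race", "Sex"], "Approval")
```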
Outline
• Motivating example
• Definition of prediction cubes
• Efficient prediction cube materialization
• Experimental results
• Conclusion
Prediction Cubes
• User interface: OLAP data cubes
– Dimensions, hierarchies, roll up and drill down
• Values in the cells:
– Accuracy
– Similarity
– Predictiveness
Prediction Cubes
• Three types of prediction cubes:
– Test-set accuracy cube
– Model-similarity cube
– Predictiveness cube
Test-Set Accuracy Cube
Given:
- Data table D
- Test set Δ

Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Yes
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | No

For each cell at the current level (here [Country, Month]): build a model on that cell's subset of D, apply it to Δ, and record its accuracy.

Test set Δ with the model's predictions:
  Race  | Sex | … | Approval | Prediction
  White | F   | … | Yes      | Yes
  Black | M   | … | No       | Yes
  …     | …   | … | …        | …

[Figure: Accuracy cube at level [Country, Month]; rows CA, USA, …; columns Jan–Dec of 2003 and 2004.]
E.g., the decision model of USA during Dec 04 has high accuracy when applied to Δ.
Model-Similarity Cube
Given:
- Data table D
- Model h0(X)
- Test set Δ w/o labels

Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Yes
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | No

For each cell at the current level (here [Country, Month]): build a model on that cell's subset of D and record its similarity to h0(X), i.e., the fraction of Δ on which the two models predict the same label.

Test set Δ (unlabeled):
  Race  | Sex | …
  White | F   | …
  Black | M   | …
  …     | …   | …

[Figure: Similarity cube at level [Country, Month].]
E.g., the loan decision process in USA during Dec 04 is similar to a discriminatory decision model h0(X).
Predictiveness Cube
Given:
- Data table D
- Attributes V
- Test set Δ w/o labels

Data table D:
  Location | Time    | Race  | Sex | … | Approval
  AL, USA  | Dec, 04 | White | M   | … | Yes
  …        | …       | …     | …   | … | …
  WY, USA  | Dec, 04 | Black | F   | … | No

For each cell at the current level (here [Country, Month]): build models h(X) and h(X−V) on that cell's subset of D and record the predictiveness of V, i.e., how differently the two models predict on Δ.

Test set Δ (unlabeled):
  Race  | Sex | …
  White | F   | …
  Black | M   | …
  …     | …   | …

[Figure: Predictiveness cube at level [Country, Month]; e.g., on Δ, h(X) predicts (Yes, No, …, Yes) while h(X−V) predicts (Yes, No, …, No).]
E.g., Race is an important factor of loan approval decisions in USA during Dec 04.
Outline
• Motivating example
• Definition of prediction cubes
• Efficient prediction cube materialization
• Experimental results
• Conclusion
Roll Up and Drill Down

[Figure: The prediction cube of Example (5/5): rolling up from the [State, Month] level to the [Country, Year] level, and drilling back down; cell values are model-based measures between 0.1 and 0.9.]
Full Materialization

Level lattice: [All,All] on top; [Country,All] and [All,Year] below it; [Country,Year] at the bottom.

Full materialization table:
  Level          | Location | Time | Cell Value
  [All,All]      | ALL      | ALL  | 0.7
  [Country,All]  | CA       | ALL  | 0.4
  [Country,All]  | …        | ALL  | …
  [Country,All]  | USA      | ALL  | 0.9
  [All,Year]     | ALL      | 1985 | 0.8
  [All,Year]     | ALL      | …    | …
  [All,Year]     | ALL      | 2004 | 0.3
  [Country,Year] | CA       | 1985 | 0.9
  [Country,Year] | CA       | 1986 | 0.2
  [Country,Year] | …        | …    | …
  [Country,Year] | USA      | 2004 | 0.8
Bottom-Up Data Cube Computation

Cell values: numbers of loan applications. The roll-ups (row and column sums) are computed bottom-up from the base [Country, Year] cells:

          | 1985 | 1986 | 1987 | 1988 | All
  Norway  |  10  |  30  |  20  |  24  |  84
  …       |  23  |  45  |  14  |  32  | 114
  USA     |  14  |  32  |  42  |  11  |  99
  All     |  47  | 107  |  76  |  67  | 297
Functions on Sets
• Bottom-up computable functions: functions that can be computed using only summary information
• Distributive function: α(X) = F({α(X1), …, α(Xn)})
– X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅
– E.g., Count(X) = Sum({Count(X1), …, Count(Xn)})
• Algebraic function: α(X) = F({G(X1), …, G(Xn)})
– G(Xi) returns a fixed-length vector of values
– E.g., Avg(X) = F({G(X1), …, G(Xn)}), as in the sketch below
• G(Xi) = [Sum(Xi), Count(Xi)]
• F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})
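As a concrete illustration, here is a minimal Python sketch of the algebraic Avg: each partition is summarized by a fixed-length vector [sum, count], and the global average is recovered from the summaries alone. The partition names X1–X3 and their values are made up for illustration.

```python
partitions = {            # hypothetical disjoint base subsets X1..X3
    "X1": [3.0, 5.0],
    "X2": [2.0],
    "X3": [4.0, 6.0, 1.0],
}

def G(values):
    """Per-partition summary: a fixed-length vector [Sum(Xi), Count(Xi)]."""
    return (sum(values), len(values))

def F(summaries):
    """Avg(X) = Sum({si}) / Sum({ci}) over the per-partition summaries."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

summaries = [G(v) for v in partitions.values()]
assert F(summaries) == 3.5    # equals the average over all six values
```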
Scoring Function
• Conceptually, a machine-learning model h(X; S) is a scoring function Score(y, x; S) that gives each class y a score on test example x (see the sketch below)
– h(x; S) = argmax_y Score(y, x; S)
– Score(y, x; S) ≈ p(y | x, S)
– S: A set of training examples

[Figure: A model h(X; S), trained on a set S of loan records (Location, Time, Race, Sex, …, Approval), scores a test example x, e.g., [Yes: 80%, No: 20%].]
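This view maps directly onto probabilistic classifiers. Below is a minimal sketch using scikit-learn's GaussianNB as the model h (an illustrative choice, assuming numeric features; any classifier with class-probability estimates would do); `make_scorer` is a hypothetical helper name:

```python
from sklearn.naive_bayes import GaussianNB

def make_scorer(S_X, S_y):
    """Train h on the training subset S; return Score(y, x; S) ~ p(y | x, S)."""
    model = GaussianNB().fit(S_X, S_y)
    def score(x):
        probs = model.predict_proba([x])[0]   # one score per class y
        return dict(zip(model.classes_, probs))
    return score

def h(x, score):
    """h(x; S) = argmax_y Score(y, x; S)."""
    scores = score(x)
    return max(scores, key=scores.get)
```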
Bottom-up Score Computation
• Key observations:
– Observation 1: Having the scores for each test example is sufficient to compute the value of a cell
• Details depend on what each cell means (i.e., on the type of prediction cube), but are straightforward
– Observation 2: Fixing the class label y and test example x, Score(y, x; S) is a function of the set S of training examples; if it is distributive or algebraic, the bottom-up data cube technique can be directly applied
Algorithm
• Input: The dataset D and test set Δ
• For each finest-grained cell, which contains data bi(D):
– Build a model on bi(D)
– For each x ∈ Δ and each y, compute:
• Score(y, x; bi(D)), if distributive
• G(y, x; bi(D)), if algebraic
• Use the standard data cube computation technique to compute the scores in a bottom-up manner (by Observation 2)
• Compute the cell values from the scores (by Observation 1); a sketch follows below
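A minimal sketch of the distributive case with F = Sum (the aggregate PBE uses later). The base-cell scores and the month-to-year grouping are made-up illustrations, not the paper's implementation:

```python
from collections import defaultdict

# Hypothetical precomputed Score(y, x; b_i(D)) for ONE test example x,
# indexed by base cell at the [State, Month] level.
base_scores = {
    ("WI", "1985-01"): {"Yes": 0.6, "No": 0.1},
    ("WI", "1985-02"): {"Yes": 0.2, "No": 0.3},
    ("WA", "1985-01"): {"Yes": 0.1, "No": 0.5},
}

def roll_up(base_scores, group_key):
    """Observation 2: combine base-cell scores into coarser cells (F = Sum)."""
    out = defaultdict(lambda: defaultdict(float))
    for cell, scores in base_scores.items():
        for y, s in scores.items():
            out[group_key(cell)][y] += s
    return out

# [State, Month] -> [State, Year]
yearly = roll_up(base_scores, lambda cell: (cell[0], cell[1][:4]))

# Observation 1: turn scores into a cell value, e.g., the predicted label.
print({cell: max(s, key=s.get) for cell, s in yearly.items()})
# {('WI', '1985'): 'Yes', ('WA', '1985'): 'No'}
```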
Machine-Learning Models
• Naïve Bayes:
– Scoring function: algebraic
• Kernel-density-based classifier:
– Scoring function: distributive
• Decision tree, random forest:
– Neither distributive, nor algebraic
• PBE: Probability-based ensemble (new)
– Makes any machine-learning model distributive
– An approximation
Probability-Based Ensemble
[Figure: A decision tree trained on the whole cell [WA, 1985] vs. the PBE version: an ensemble of decision trees, each trained on one finest-grained cell [WA, Jan. 1985], …, [WA, Dec. 1985].]
Outline
• Motivating example
• Definition of prediction cubes
• Efficient prediction cube materialization
• Experimental results
• Conclusion
Experiments
• Quality of PBE on 8 UCI datasets
– The quality of the PBE version of a model is slightly worse (0–6%) than the quality of the model trained directly on the whole training data
[Figure: PBE (one model per finest-grained cell of [WA, 1985]) vs. a single model trained on all of [WA, 1985].]
• Efficiency of the bottom-up score computation technique
• Case study on demographic data
Efficiency of the Bottom-Up Score Computation
• Machine-learning models:
– J48: J48 decision tree
– RF: Random forest
– NB: Naïve Bayes
– KDC: Kernel-density-based classifier
• Bottom-up method vs. exhaustive method:
– Bottom-up: PBE-J48, PBE-RF, NB, KDC
– Exhaustive: J48ex, RFex, NBex, KDCex
Synthetic Dataset
• Dimensions: Z1, Z2 and Z3

[Figure: Dimension hierarchies: Z1 and Z2 take values A, B, C, D, E under All; Z3 takes values 0, 1, …, 9 under All.]

• Decision rule (transcribed into code below):
  Condition            | Rule
  Z1 > 1               | Y = I(4X1 + 3X2 + 2X3 + X4 + 0.4X6 > 7)
  else if Z3 mod 2 = 0 | Y = I(2X1 + 2X2 + 3X3 + 3X4 + 0.4X6 > 7)
  else                 | Y = I(0.1X5 + X1 > 1)
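A direct transcription of the label-generating rule into Python; X1–X6 are the numeric predictor values and Z1, Z3 the dimension codes (the function name is illustrative):

```python
def synthetic_label(z1, z3, x):
    """Y for one record under the deck's decision rule; x = (X1, ..., X6)."""
    x1, x2, x3, x4, x5, x6 = x
    if z1 > 1:
        return int(4*x1 + 3*x2 + 2*x3 + x4 + 0.4*x6 > 7)
    if z3 % 2 == 0:
        return int(2*x1 + 2*x2 + 3*x3 + 3*x4 + 0.4*x6 > 7)
    return int(0.1*x5 + x1 > 1)
```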
Efficiency Comparison

[Figure: Execution time (sec, up to ~2500) vs. number of records (40K–200K). The exhaustive methods (RFex, KDCex, NBex, J48ex) grow steeply, while the bottom-up score computation (NB, KDC, RFPBE, J48PBE) stays much lower.]
Conclusion
• Exploratory data analysis paradigm:
– Models built on subsets
– Subsets defined by dimension hierarchies
• Meaningful subsets
– Precomputation
– Interactive analysis
Questions
Test-Set-Based Model Evaluation
• Given a set-aside test set Δ of schema [X, Y]:
– Accuracy of h(X):
• The percentage of examples in Δ that are correctly classified
– Similarity between h1(X) and h2(X):
• The percentage of examples in Δ that are given the same class labels by h1(X) and h2(X)
– Predictiveness of V ⊆ X (based on h(X)):
• The difference between h(X) and h(X−V) measured on Δ; i.e., the percentage of examples in Δ that are predicted differently by h(X) and h(X−V)
Model Accuracy
• Test-set accuracy (TS-accuracy), transcribed into code below:
– Given a set-aside test set Δ with schema [X, Y],
  accuracy(h(X; D) | Δ) = (1/|Δ|) Σ_{(x,y)∈Δ} I(h(x; D) = y)
• |Δ|: The number of examples in Δ
• I(φ) = 1 if φ is true; otherwise, I(φ) = 0
• Alternative: Cross-validation accuracy
– This will not be discussed further!!
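The formula as a two-line Python function; `h` is any classifier mapping x to a label and `delta` a list of (x, y) pairs (names are illustrative):

```python
def ts_accuracy(h, delta):
    """accuracy(h | Δ) = (1/|Δ|) Σ_{(x,y)∈Δ} I(h(x) = y)."""
    return sum(h(x) == y for x, y in delta) / len(delta)
```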
Model Similarity
• Prediction similarity (or distance):
– Given a set-aside test set Δ with schema X:
  similarity(h1(X), h2(X)) = (1/|Δ|) Σ_{x∈Δ} I(h1(x) = h2(x))
  distance(h1(X), h2(X)) = 1 − similarity(h1(X), h2(X))
• Similarity between p_h1(Y | X) and p_h2(Y | X):
  KL-distance = (1/|Δ|) Σ_{x∈Δ} Σ_y p_h1(y | x) log (p_h1(y | x) / p_h2(y | x))
– p_hi(Y | X): Class probability estimated by hi(X)
(Both measures are transcribed into code below.)
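Direct transcriptions of both measures; `h1`, `h2` map x to a label, `p1`, `p2` map x to a dict of class probabilities, and `delta` is the unlabeled test set (all names illustrative):

```python
import math

def similarity(h1, h2, delta):
    """(1/|Δ|) Σ_{x∈Δ} I(h1(x) = h2(x))."""
    return sum(h1(x) == h2(x) for x in delta) / len(delta)

def kl_distance(p1, p2, delta):
    """(1/|Δ|) Σ_{x∈Δ} Σ_y p1(y|x) log(p1(y|x) / p2(y|x))."""
    total = 0.0
    for x in delta:
        px1, px2 = p1(x), p2(x)
        for y, p in px1.items():
            if p > 0:                 # 0 * log(0/q) is taken as 0
                total += p * math.log(p / px2[y])
    return total / len(delta)
```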
Attribute Predictiveness
• Predictiveness of V ⊆ X (based on h(X)); a sketch follows below:
– PD-predictiveness: distance(h(X), h(X−V))
– KL-predictiveness: KL-distance(h(X), h(X−V))
• Alternative: accuracy(h(X)) − accuracy(h(X−V))
– This will not be discussed further!!
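PD-predictiveness as code; `h` is trained on all predictors, `h_minus_v` on all predictors except V, and `drop_v` (a hypothetical helper) removes the V attributes from a test example:

```python
def pd_predictiveness(h, h_minus_v, delta, drop_v):
    """distance(h(X), h(X−V)) = fraction of Δ predicted differently."""
    return sum(h(x) != h_minus_v(drop_v(x)) for x in delta) / len(delta)
```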
Target Patterns
• Find a subset σ(D) such that h(X; σ(D)) has high prediction accuracy on a test set
– E.g., the loan decision process in WI during 2003 is similar to a set Δ of discriminatory decision examples
• Find a subset σ(D) such that h(X; σ(D)) is similar to a given model h0(X)
– E.g., the loan decision process in WI during 2003 is similar to a discriminatory decision model h0(X)
• Find a subset σ(D) such that V is predictive on σ(D)
– E.g., Race is an important factor of loan approval decisions in WI during 2003
Test-Set Accuracy
• We would like to discover:
– The loan decision process in WI during 2003 is similar to a set of problematic decision examples
• Given:
– Data table D: The loan decision dataset
– Test set Δ: The set of problematic decision examples
• Goal:
– Find a subset σLoc,Time(D) such that h(X; σLoc,Time(D)) has high prediction accuracy on Δ
Model Similarity
• We would like to discover:
– The loan decision process in WI during 2003 is similar to a problematic decision model
• Given:
– Data table D: The loan decision dataset
– Model h0(X): The problematic decision model
• Goal:
– Find a subset σLoc,Time(D) such that h(X; σLoc,Time(D)) is similar to h0(X)
Attribute Predictiveness
• We would like to discover:
– Race is an important factor of loan approval decisions in WI during 2003
• Given:
– Data table D: The loan decision dataset
– Attribute V of interest: Race
• Goal:
– Find a subset σLoc,Time(D) such that h(X; σLoc,Time(D)) is very different from h(X − V; σLoc,Time(D))
Dimension and Level
Z1 = Location: Z1(1) = City (Madison, WI; Green Bay, WI; …), Z1(2) = State (MA, WI, MN, …), Z1(3) = All
Z2 = Time: Z2(1) = Month (Jan. 86, …, Dec. 86, …), Z2(2) = Year (85, 86, …, 04), Z2(3) = All

Level lattice (a level is a pair [Location level, Time level]):
  [All,All] = [3,3]
  [State,All] = [2,3]   [All,Year] = [3,2]
  [City,All] = [1,3]   [State,Year] = [2,2]   [All,Month] = [3,1]
  [City,Year] = [1,2]   [State,Month] = [2,1]
  [City,Month] = [1,1]
Example: Full Materialization
[Figure: One cube is materialized per level of the lattice, from [All,All] down through [State,All], [All,Year], [City,All], [State,Year], [All,Month], [City,Year], [State,Month], to [City,Month]; states range over AL…WY and years over 85…05.]
Bottom-Up Score Computation
• Base cells: The finest-grained cells in a cube
• Base subsets bi(D): The finest-grained data subsets
– The subset of data records in a base cell is a base subset
• Properties:
– D = ∪i bi(D) and bi(D) ∩ bj(D) = ∅
– Any subset σS(D) of D that corresponds to a cube cell is the union of some base subsets
– Notation: σS(D) = bi(D) ∪ bj(D) ∪ bk(D), where S = {i, j, k}
Bottom-Up Score Computation

Data subset: σS(D) = ∪_{i∈S} bi(D)
Scores: Score(y, x; σS(D)) = F({Score(y, x; bi(D)) : i ∈ S})

[Figure: In the level lattice, the base cells (WA, 1985), (WI, 1985), (WY, 1985) at [State,Year] hold b1(D), b2(D), b3(D) and the scores Score(y, x; bi(D)); the rolled-up cell (All, 1985) holds σS(D), and its score is F applied to the base-cell scores.]
Decomposable Scoring Function
• Let σS(D) = ∪_{i∈S} bi(D)
– bi(D) is a base (finest-grained) subset
• Distributively decomposable scoring function (sketched below):
– Score(y, x; σS(D)) = F({Score(y, x; bi(D)) : i ∈ S})
– F is a distributive aggregate function
• Algebraically decomposable scoring function:
– Score(y, x; σS(D)) = F({G(y, x; bi(D)) : i ∈ S})
– F is an algebraic aggregate function
– G(y, x; bi(D)) returns a fixed-length vector of values
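A minimal sketch of a distributively decomposable score with F = Sum, in the spirit of the kernel-density-based classifier (whose scoring function the deck notes is distributive); the kernel choice and data layout are illustrative assumptions:

```python
import math

def kernel_score(y, x, subset):
    """Unnormalized KDC-style Score(y, x; b_i(D)): sum of kernel values
    over the training points of class y in one base subset."""
    return sum(math.exp(-abs(x - xp)) for xp, yp in subset if yp == y)

def union_score(y, x, base_subsets, S):
    """Score(y, x; σ_S(D)) = F({Score(y, x; b_i(D)) : i ∈ S}) with F = Sum."""
    return sum(kernel_score(y, x, base_subsets[i]) for i in S)

# base_subsets: {cell id: [(x value, label), ...]} for each base cell.
base_subsets = {1: [(0.2, "Y"), (0.9, "N")], 2: [(0.4, "Y")]}
print(union_score("Y", 0.5, base_subsets, S={1, 2}))
```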
Probability-Based Ensemble
• Scoring function (a code sketch follows below):
  h_PBE(x; σS(D)) = argmax_y Score_PBE(y, x; σS(D))
  Score_PBE(y, x; bi(D)) = h(y | x; bi(D)) · g(bi | x)
  Score_PBE(y, x; σS(D)) = Σ_{i∈S} Score_PBE(y, x; bi(D))
– h(y | x; bi(D)): Model h's estimate of p(y | x, bi(D))
– g(bi | x): A model that predicts the probability that x belongs to base subset bi(D)
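A minimal sketch assuming scikit-learn-style models: one classifier per base cell estimating h(y | x; bi(D)), plus a classifier `g` over cell ids estimating g(bi | x). All names are illustrative, not the paper's implementation:

```python
def pbe_score(y, x, h_models, g, S):
    """Score_PBE(y, x; σ_S(D)) = Σ_{i∈S} h(y | x; b_i(D)) · g(b_i | x)."""
    g_probs = dict(zip(g.classes_, g.predict_proba([x])[0]))
    total = 0.0
    for i in S:
        h_probs = dict(zip(h_models[i].classes_,
                           h_models[i].predict_proba([x])[0]))
        total += h_probs.get(y, 0.0) * g_probs.get(i, 0.0)
    return total

def pbe_predict(x, h_models, g, S, classes):
    """h_PBE(x; σ_S(D)) = argmax_y Score_PBE(y, x; σ_S(D))."""
    return max(classes, key=lambda y: pbe_score(y, x, h_models, g, S))
```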
Optimality of PBE
• Claim: Score_PBE(y, x; σS(D)) ≈ c · p(y | x, x ∈ σS(D)) for a constant c:

  p(y | x, x ∈ σS(D))
    = p(y, x ∈ σS(D) | x) / p(x ∈ σS(D) | x)
    ∝ p(y, x ∈ σS(D) | x)
    = Σ_{i∈S} p(y, x ∈ bi(D) | x)            [the bi(D)'s partition σS(D)]
    = Σ_{i∈S} p(y | x ∈ bi(D), x) · p(x ∈ bi(D) | x)
    ≈ Σ_{i∈S} h(y | x; bi(D)) · g(bi | x) = Score_PBE(y, x; σS(D))
Efficiency Comparison

[Figure: Execution time (sec, up to ~700) vs. number of records (200K–1M) for the bottom-up methods J48PBE, RFPBE, KDC and NB.]
Where the Time Is Spent

[Figure: Breakdown of execution time (%) into Training, Testing and Other, at 200K and 1M records, for J48-PBE, RF-PBE, KDC and NB.]
Accuracy of PBE
• Goal:
– To compare PBE with the gold standard
• PBE: A set of J48s/RFs each of which is trained on a
small partition of the whole dataset
• Gold standard: A J48/RF trained on the whole data
– To understand how the number of base
classifiers in a PBE affects the accuracy of the
PBE
• Datasets:
– Eight UCI datasets
Accuracy of PBE

[Figure: Adult dataset: accuracy (80–100%) vs. number of base classifiers in a PBE (2–20), for RF, J48, RF-PBE and J48-PBE.]
Accuracy of PBE

[Figure: Nursery dataset: accuracy (80–100%) vs. number of base classifiers in a PBE (2–20), for RF, J48, RF-PBE and J48-PBE.]
Accuracy of PBE
Error = the average absolute difference between a ground-truth cell value and a cell value computed by PBE.

[Figure: Error (0–0.16) vs. number of base models in a PBE, for J48-PBE and RF-PBE, on a Flat dataset (1–10,000 base models) and a Deep dataset (1–1,000 base models).]