KDD - Hong Kong University of Science and Technology

Model Evaluation
Instructor: Qiang Yang
Hong Kong University of Science and Technology
[email protected]
Thanks: Eibe Frank and Jiawei Han
INTRODUCTION

- Given a set of pre-classified examples, build a model or classifier to classify new cases.
- This is supervised learning: the classes are known for the examples used to build the classifier.
- A classifier can be a set of rules, a decision tree, a neural network, etc.
- Question: how do we know about the quality of a model?
Constructing a Classifier

- The goal is to maximize the accuracy on new cases that have a similar class distribution.
- Since new cases are not available at construction time, the given examples are divided into a training set and a testing set. The classifier is built using the training set and is evaluated using the testing set.
- The goal is to be accurate on the testing set, so it is essential to capture the "structure" shared by both sets.
- Overfitting rules that work well on the training set but poorly on the testing set must be pruned.
Example

The training data is fed to a classification algorithm, which outputs a classifier (model):

Training Data
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Example (Cont'd)

The classifier is applied to the testing data, and then to unseen data, e.g. (Jeff, Professor, 4): Tenured?

Testing Data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
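A minimal sketch (not from the slides) that encodes the learned rule above and applies it to the testing data and to the unseen case; the helper name predict_tenured is mine. On this testing set the rule misclassifies Merlisa, giving 75% accuracy.

```python
def predict_tenured(rank, years):
    """Rule-based classifier from the slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

testing_data = [  # (name, rank, years, actual tenured)
    ("Tom", "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),
    ("George", "professor", 5, "yes"),
    ("Joseph", "assistant prof", 7, "yes"),
]

correct = 0
for name, rank, years, actual in testing_data:
    predicted = predict_tenured(rank, years)
    correct += predicted == actual
    print(f"{name}: predicted {predicted}, actual {actual}")

print("Accuracy on testing set:", correct / len(testing_data))      # 0.75
print("Jeff (Professor, 4 years):", predict_tenured("professor", 4))  # yes
```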
Evaluation Criteria

- Accuracy on test set: the rate of correct classification on the testing set. E.g., if 90 are classified correctly out of the 100 testing cases, accuracy is 90%.
- Error rate on test set: the percentage of wrong predictions on the test set.
- Confusion matrix: for binary class values "yes" and "no", a matrix showing the true positive, true negative, false positive and false negative rates:

                      Predicted class
                      Yes              No
  Actual   Yes        True positive    False negative
  class    No         False positive   True negative

- Speed and scalability: the time to build the classifier and to classify new cases, and the scalability with respect to the data size.
- Robustness: handling noise and missing values.
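A small sketch computing accuracy, error rate, and the 2x2 confusion matrix from actual and predicted labels; the function name and the example labels are illustrative, not from the slides.

```python
def evaluate(actual, predicted, positive="yes"):
    """Confusion-matrix counts plus accuracy/error rate for binary labels."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = (tp + tn) / len(actual)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "accuracy": accuracy, "error_rate": 1 - accuracy}

actual    = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "yes"]
print(evaluate(actual, predicted))   # 3 of 5 correct -> accuracy 0.6, error rate 0.4
```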
Evaluation Techniques

Holdout: the training set/testing set.


k-fold Cross-validation:




Good for a large set of data.
divide the data set into k sub-samples.
In each run, use one distinct sub-sample as testing set and
the remaining k-1 sub-samples as training set.
Evaluate the method using the average of the k runs.
This method reduces the randomness of training
set/testing set.
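A minimal sketch of the k-fold procedure just described. The train and evaluate arguments are generic placeholders (any learner that can be fit on a list of instances and scored on another), not something the slides prescribe.

```python
import random

def k_fold_cv(data, k, train, evaluate, seed=0):
    """Generic k-fold cross-validation.
    `train(train_set)` returns a model; `evaluate(model, test_set)` returns
    an accuracy. Both are placeholders for whatever learner is used."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]        # k roughly equal sub-samples
    scores = []
    for i in range(k):
        test_set = folds[i]                        # one distinct sub-sample for testing
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k                         # average over the k runs
```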
Cross Validation: Holdout Method

- Break up the data into groups of the same size.
- Hold aside one group for testing and use the rest to build the model.
- Repeat, using a different group as the test set in each iteration.
Cross validation

- Natural performance measure for classification problems: error rate.
  - Success: the instance's class is predicted correctly.
  - Error: the instance's class is predicted incorrectly.
  - Error rate: proportion of errors made over the whole set of instances.
- Confidence:
  - 2% error in 100 tests vs. 2% error in 10,000 tests: which one do you trust more?
- Resubstitution error: the error rate obtained from the training data.
  - Resubstitution error is (hopelessly) optimistic!
Confidence Interval Concept

- Assume the estimated error rate (f) is 25%. How close is this to the true error rate p?
  - It depends on the amount of test data.
- Prediction is just like tossing a biased (!) coin.
  - "Head" is a "success", "tail" is an "error".
- In statistics, a succession of independent events like this is called a Bernoulli process.
  - Statistical theory provides us with confidence intervals for the true underlying proportion!
- Mean and variance for a Bernoulli trial with success probability p: p and p(1-p).
Confidence intervals

- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials.
  - Estimated success rate: f = 75%.
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
- Another example: S = 75 and N = 100.
  - Estimated success rate: 75%.
  - With 80% confidence, p ∈ [69.1%, 80.1%].
Confidence Interval for Normal Distribution

- For large enough N, the estimated rate f follows an approximately normal distribution.
- It can be modeled with a random variable X having zero mean and unit variance.
- The c% confidence interval [-z, z] for such a random variable X is given by:

    \Pr[-z \le X \le z] = c

- Here c is the area under the curve between -z_{\alpha/2} and z_{1-\alpha/2}, i.e. c = 1 - \alpha, and by symmetry:

    \Pr[-z \le X \le z] = 1 - 2\Pr[X \ge z]
Confidence Interval for Accuracy

- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  - N = 100, acc = 0.8
  - Let 1 - \alpha = 0.95 (95% confidence)
  - From the probability table, z_{\alpha/2} = 1.96

  1 - \alpha   z
  0.99         2.58
  0.98         2.33
  0.95         1.96
  0.90         1.65

- Confidence interval for the true accuracy p as N grows (acc = 0.8, 95% confidence):

  N      p(lower)   p(upper)
  50     0.670      0.888
  100    0.711      0.866
  500    0.763      0.833
  1000   0.774      0.824
  5000   0.789      0.811
Confidence limits

- Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

- Thus: Pr[-1.65 ≤ X ≤ 1.65] = 90%.
- To use this, we have to reduce our random variable f to have 0 mean and unit variance.
Transforming f

- Transformed value for f:

    \frac{f - p}{\sqrt{p(1-p)/N}}

  (i.e. subtract the mean and divide by the standard deviation).
- Resulting equation:

    \Pr\left[-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c

- Solving for p:

    p = \frac{f + \frac{z^2}{2N} \pm z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}
Examples

- f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
- f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
- Note that the normal distribution assumption is only valid for large N (i.e. N > 100).
- f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881]
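A small sketch that implements the interval formula from the previous slide directly; the function name confidence_interval is mine. Running it reproduces the three intervals listed above.

```python
from math import sqrt

def confidence_interval(f, N, z):
    """Confidence interval for the true success rate p, given observed
    success rate f on N test instances and the z-value for the chosen
    confidence level (z = 1.28 for 80% confidence)."""
    centre = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (centre - spread) / denom, (centre + spread) / denom

for N in (1000, 100, 10):
    lo, hi = confidence_interval(0.75, N, 1.28)
    print(f"N={N}: p in [{lo:.3f}, {hi:.3f}]")
# N=1000: [0.732, 0.767]; N=100: [0.691, 0.801]; N=10: [0.549, 0.881]
```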
Implications

- First, the more test data the better.
  - A large N gives a tight confidence interval, so the estimate can be trusted.
- Second, with limited data, how do we ensure a large amount of test data?
  - Use cross-validation, since then all of the data participates in the test.
- Third, which model are we testing?
  - Each fold in an N-fold cross-validation tests a different model!
  - We want these models to be close to the one trained on the whole data set.
- Thus, it is a balancing act: the number of folds in cross-validation cannot be too large or too small.
More on cross-validation

- Standard method for evaluation: stratified ten-fold cross-validation.
- Why ten folds?
  - Extensive experiments have shown that this is the best choice to get an accurate estimate.
  - There is also some theoretical evidence for this.
- Stratification reduces the estimate's variance.
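A sketch of stratified ten-fold cross-validation, assuming scikit-learn is available; the decision tree classifier and the NumPy arrays X, y are placeholders for illustration, not something the slides prescribe.

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def stratified_10_fold(X, y):
    """X, y: NumPy arrays of features and class labels."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):   # folds preserve class proportions
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return sum(scores) / len(scores)               # average accuracy over the 10 folds
```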
Leave-one-out cross-validation

- Leave-one-out cross-validation is a particular form of cross-validation:
  - The number of folds is set to the number of training instances.
  - I.e., a classifier has to be built n times, where n is the number of training instances.
- Makes maximum use of the data.
- No random sampling involved.
- Very computationally expensive.
LOO-CV and stratification

- Another disadvantage of LOO-CV: stratification is not possible.
  - It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset with two classes in equal proportions.
  - The best classifier predicts the majority class (resulting in 50% accuracy on fresh data from this domain).
  - The LOO-CV estimate of the error rate for this classifier will be 100%! Removing the test instance always leaves the opposite class as the majority of the training set, so every prediction is wrong.
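A minimal simulation of this extreme case, under the stated assumptions (balanced two-class random data, majority-class predictor); it prints an LOO-CV error rate of 1.0.

```python
import random

random.seed(0)
labels = ["A"] * 50 + ["B"] * 50          # completely balanced two-class data
random.shuffle(labels)

errors = 0
for i in range(len(labels)):
    train = labels[:i] + labels[i + 1:]   # leave one instance out
    # "best" classifier: always predict the majority class of the training set
    majority = max(set(train), key=train.count)
    errors += majority != labels[i]

print("LOO-CV error rate:", errors / len(labels))   # 1.0
```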
Example: Try on This Data

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
The bootstrap

- Cross-validation uses sampling without replacement.
  - The same instance, once selected, cannot be selected again for a particular training/test set.
- The bootstrap is an estimation method that uses sampling with replacement to form the training set:
  - A dataset of n instances is sampled n times with replacement to form a new dataset of n instances.
  - This data is used as the training set.
  - The instances from the original dataset that don't occur in the new training set are used for testing.
The 0.632 bootstrap

- This method is also called the 0.632 bootstrap.
  - In each draw, a particular instance has a probability of 1 - 1/n of not being picked.
  - Thus its probability of ending up in the test data (never selected) is:

    \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368

  - This means the training data will contain approximately 63.2% of the distinct instances.
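A quick sketch checking the 0.632/0.368 split empirically; the "dataset" here is just the integers 0..n-1, used only for illustration.

```python
import random

random.seed(0)
n = 10_000
data = list(range(n))

# Sample n times with replacement to form the bootstrap training set;
# instances that were never picked form the test set.
train = [random.choice(data) for _ in range(n)]
test = set(data) - set(train)

print("fraction of distinct instances in training set:", len(set(train)) / n)  # ~0.632
print("fraction of instances left for testing:", len(test) / n)                # ~0.368
```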
Example: Try on this data

(Use the same weather dataset shown earlier.)
Comparing data mining methods

- Frequent situation: we want to know which one of two learning schemes performs better.
- Obvious way: compare 10-fold CV estimates.
  - Problem: variance in the estimate.
  - Variance can be reduced using repeated CV.
  - However, we still don't know whether the results are reliable.
- Solution: include confidence intervals.
Taking costs into account

- The confusion matrix:

                      Predicted class
                      Yes              No
  Actual   Yes        True positive    False negative
  class    No         False positive   True negative

- There are many other types of costs!
  - E.g.: the cost of collecting training data and test data.
Lift charts

- In practice, ranking may be important.
  - Decisions are usually made by comparing possible scenarios.
  - Sort instances by their likelihood of belonging to the positive class, from high to low.
- Question: how do we know if one ranking is better than another?
Example

- Example: promotional mailout.
  - Situation 1: classifier A predicts that 0.1% of all one million households will respond.
  - Situation 2: classifier B predicts that 0.4% of the 10,000 most promising households will respond.
- Which one is better?
  - Suppose mailing a package costs 1 dollar, but each response brings in 1000 dollars.
- A lift chart allows for a visual comparison.
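A small worked check of the two scenarios, assuming exactly the costs stated above; the helper net_profit is hypothetical, used only to make the arithmetic explicit.

```python
def net_profit(households_mailed, response_rate, cost=1, revenue=1000):
    """Net profit of a mailout: revenue per response minus mailing cost."""
    responses = households_mailed * response_rate
    return responses * revenue - households_mailed * cost

print("Classifier A:", net_profit(1_000_000, 0.001))   # 1000 responses -> $0 net
print("Classifier B:", net_profit(10_000, 0.004))      # 40 responses   -> $30,000 net
```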
Generating a lift chart

- Instances are sorted according to their predicted probability of being a true positive:

  Rank  Predicted probability  Actual class
  1     0.95                   Yes
  2     0.93                   Yes
  3     0.93                   No
  4     0.88                   Yes
  ...   ...                    ...

- In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
Steps in Building a Lift Chart

1. First, produce a ranking of the data using your learned model (classifier, etc.):
   - Rank 1 means most likely in the + class,
   - Rank n means least likely in the + class.
2. For each ranked data instance, label it with its ground-truth label. This gives a list like:
   - Rank 1, +
   - Rank 2, -
   - Rank 3, +
   - etc.
3. Count the number of true positives (TP) from Rank 1 onwards:
   - Rank 1, +, TP = 1
   - Rank 2, -, TP = 1
   - Rank 3, +, TP = 2
   - etc.
4. Plot the number of TP against the % of data in ranked order (if you have 10 data instances, then each instance is 10% of the data):
   - 10%, TP = 1
   - 20%, TP = 1
   - 30%, TP = 2
   - ...

This gives a lift chart (a minimal sketch of these steps follows below).
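A minimal sketch of steps 1-4: cumulative true positives over the ranked data. The predicted probabilities here are made up for illustration; only the first four come from the previous slide.

```python
ranked = [          # (predicted probability of +, actual class), sorted descending
    (0.95, "+"), (0.93, "+"), (0.93, "-"), (0.88, "+"), (0.80, "-"),
    (0.75, "+"), (0.60, "-"), (0.55, "-"), (0.40, "+"), (0.20, "-"),
]

tp = 0
points = []
for rank, (prob, actual) in enumerate(ranked, start=1):
    tp += actual == "+"                       # step 3: cumulative TP count
    points.append((rank / len(ranked), tp))   # step 4: (% of data, cumulative TP)

for frac, count in points:
    print(f"{frac:.0%} of data -> TP = {count}")
# Plotting these points (x = % of data, y = TP) gives the lift chart.
```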
A hypothetical lift chart

[Figure: a hypothetical lift chart — number of true positives (y axis) vs. sample size (x axis).]
ROC curves

- ROC curves are similar to lift charts.
  - "ROC" stands for "receiver operating characteristic".
  - Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
- Differences to the lift chart:
  - The y axis shows the percentage of true positives in the sample (rather than the absolute number).
  - The x axis shows the percentage of false positives in the sample (rather than the sample size).
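A sketch of how the ROC points just described can be computed from a ranked list of actual classes; roc_points is a hypothetical helper, and the example input is made up.

```python
def roc_points(ranked):
    """ranked: actual classes ('+'/'-') sorted by predicted probability of '+',
    highest first. Returns (FP rate, TP rate) points tracing the ROC curve."""
    pos = ranked.count("+")
    neg = ranked.count("-")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for actual in ranked:
        if actual == "+":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # x = FP rate, y = TP rate
    return points

print(roc_points(["+", "+", "-", "+", "-", "+", "-", "-"]))
```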
A sample ROC curve

[Figure: a sample ROC curve — true positive rate (y axis) vs. false positive rate (x axis).]
Cost-sensitive learning

- Most learning schemes do not perform cost-sensitive learning.
  - They generate the same classifier no matter what costs are assigned to the different classes.
  - Example: a standard decision tree learner.
- Simple methods for cost-sensitive learning:
  - Resampling of instances according to costs.
  - Weighting of instances according to costs.
- Some schemes are inherently cost-sensitive, e.g. naïve Bayes.
Measures in information retrieval

- Percentage of retrieved documents that are relevant: precision = TP/(TP+FP)
- Percentage of relevant documents that are returned: recall = TP/(TP+FN)
- Precision/recall curves have a hyperbolic shape.
- Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
- F-measure = (2 × recall × precision)/(recall + precision)
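A small sketch computing the three measures above from confusion-matrix counts; the function name and the example counts are illustrative only.

```python
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)                 # retrieved documents that are relevant
    recall = tp / (tp + fn)                    # relevant documents that are retrieved
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

print(precision_recall_f(tp=40, fp=10, fn=20))
# precision 0.8, recall ~0.667, F-measure ~0.727
```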
Summary of measures

Chart                   Domain                 Plot (y vs. x)        Explanation
Lift chart              Marketing              TP vs. subset size    TP; subset size = (TP+FP)/(TP+FP+TN+FN)
ROC curve               Communications         TP rate vs. FP rate   TP rate = TP/(TP+FN); FP rate = FP/(FP+TN)
Recall-precision curve  Information retrieval  Recall vs. precision  Recall = TP/(TP+FN); Precision = TP/(TP+FP)
Evaluating numeric prediction

- Same strategies: independent test set, cross-validation, significance tests, etc.
- Difference: error measures.
- Actual target values: a_1, a_2, ..., a_n
- Predicted target values: p_1, p_2, ..., p_n
- Most popular measure: mean-squared error

    \frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}

  - Easy to manipulate mathematically.
Other measures

- The root mean-squared error:

    \sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}}

- The mean absolute error is less sensitive to outliers than the mean-squared error:

    \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n}

- Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
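A small sketch computing the three error measures above for a list of actual and predicted values; the function name and sample numbers are illustrative only.

```python
from math import sqrt

def error_measures(actual, predicted):
    n = len(actual)
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n   # mean-squared error
    rmse = sqrt(mse)                                                 # root mean-squared error
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n     # mean absolute error
    return {"MSE": mse, "RMSE": rmse, "MAE": mae}

actual = [500, 200, 100]
predicted = [550, 190, 120]
print(error_measures(actual, predicted))   # MSE 1000, RMSE ~31.6, MAE ~26.7
```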
The MDL principle

- MDL stands for minimum description length.
- The description length is defined as:
    space required to describe a theory
    + space required to describe the theory's mistakes
- In our case the theory is the classifier and the mistakes are the errors on the training data.
- Aim: we want a classifier with minimal description length.
- The MDL principle is a model selection criterion.
Model selection criteria

- Model selection criteria attempt to find a good compromise between:
  A. the complexity of a model, and
  B. its prediction accuracy on the training data.
- Reasoning: a good model is a simple model that achieves high accuracy on the given data.
- Also known as Occam's Razor: the best theory is the smallest one that describes all the facts.
Elegance vs. errors

- Theory 1: a very simple, elegant theory that explains the data almost perfectly.
- Theory 2: a significantly more complex theory that reproduces the data without mistakes.
- Theory 1 is probably preferable.
- Classical example: Kepler's three laws of planetary motion.
  - Less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles, but far simpler.
MDL and compression

- The MDL principle is closely related to data compression:
  - It postulates that the best theory is the one that compresses the data the most.
  - I.e., to compress a dataset we generate a model and then store the model and its mistakes.
- We need to compute (a) the size of the model and (b) the space needed for encoding the errors.
  - (b) is easy: we can use the informational loss function.
  - For (a) we need a method to encode the model.
DL and Bayes's theorem

- L[T] = "length" of the theory.
- L[E|T] = training set encoded with respect to the theory.
- Description length = L[T] + L[E|T].
- Bayes's theorem gives us the a posteriori probability of a theory given the data:

    \Pr[T \mid E] = \frac{\Pr[E \mid T]\,\Pr[T]}{\Pr[E]}

- Equivalent to:

    -\log \Pr[T \mid E] = -\log \Pr[E \mid T] - \log \Pr[T] + \log \Pr[E]

  where \log \Pr[E] is a constant.
Discussion of the MDL principle

- Advantage:
  - Makes full use of the training data when selecting a model.
- Disadvantages:
  - 1: an appropriate coding scheme/prior probabilities for theories are crucial.
  - 2: no guarantee that the MDL theory is the one which minimizes the expected error.
- Note: Occam's Razor is an axiom!