Classification: Evaluation 2

COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
February 24, 2009
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Confusion Matrix
Costs
Lift Curves
ROC Curves
Numeric Prediction
Confusion Matrix

The 'Confusion Matrix':

               Actual Yes       Actual No
Predict Yes:   True Positive    False Positive
Predict No:    False Negative   True Negative
We want to ensure that True Positive and True Negative are as high as possible. The same holds with more than two classes: you want the diagonal from top left to bottom right to be high, and the other cells to be low.
(Think of the output from WEKA for example)
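As a minimal sketch (plain Python, with made-up actual/predicted lists rather than real WEKA output), the four cells can be counted directly:

  # Count the four cells of a two-class confusion matrix from (actual, predicted) pairs.
  actual    = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
  predicted = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

  tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
  fp = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "yes")
  fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")
  tn = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "no")

  print("             Actual Yes  Actual No")
  print("Predict Yes  %10d %10d" % (tp, fp))
  print("Predict No   %10d %10d" % (fn, tn))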
Kappa Statistic
But what about random luck? An accuracy of 50% against 1000
classes is obviously better than against 2 classes.
We can derive the Kappa statistic from a confusion matrix for a
classifier and an artificial confusion matrix with the classes
divided in proportion to the overall distribution.
Kappa Statistic
Sum the diagonal in the expected by chance matrix. (82)
Sum the diagonal in the classifier's matrix (140)
Subtract expected from classifier. (58)
Subtract expected from total instances (200 – 82 = 118)
Divide and express as percentage: (58 / 118 = 49%)
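The same arithmetic as a short sketch (only the three totals from above are needed, not the full matrices):

  # Kappa = (observed successes - chance successes) / (total - chance successes)
  observed_correct = 140   # diagonal sum of the classifier's confusion matrix
  expected_correct = 82    # diagonal sum of the expected-by-chance matrix
  total            = 200   # total test instances

  kappa = (observed_correct - expected_correct) / (total - expected_correct)
  print("Kappa = %.0f%%" % (kappa * 100))   # 58 / 118, roughly 49%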
Cost
For some situations, it's a lot worse to have a false negative than a
false positive.
Example: Better to catch all true positives, with no false negatives, even at the cost of some false positives, if the application is detection of (insert un-favourite nasty medical condition).
Example 2: If there's a very skewed ratio of classes (eg 99% class A, 1% class B) then you want to tell the system that getting 99% accuracy by always predicting A is not good enough. The cost of getting it wrong for class B needs to be higher than the value of getting it right for class A.
Cost
Another example application: Mass mailed advertising.
If it costs 40 pence to send out a letter, you want to maximize the
number of letters sent to people who will buy, and minimize the
number of letters sent out to those that won't.
So the Confusion Matrix:

               Actual Yes                   Actual No
Predict Yes    Profit -40p                  -40p
Predict No     Potential profit not used    Saved money
Cost
Can use a cost matrix to determine the cost of errors of a classifier.
Default Cost Matrix:
      A   B   C
  A   0   1   1
  B   1   0   1
  C   1   1   0
But we might wish to change those values for different scenarios. Then when evaluating, we sum the values in the cells rather than just counting up the errors, and use the model with the least total cost.
This is only useful for evaluation, not for training a cost-sensitive classifier.
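A sketch of that evaluation, assuming the convention that rows of the cost matrix index the actual class and columns the predicted class (the test results listed are invented):

  # cost[actual][predicted]: the default matrix charges 1 per error and 0 on the
  # diagonal; raise individual cells to penalise particular mistakes more heavily.
  cost = {"A": {"A": 0, "B": 1, "C": 1},
          "B": {"A": 1, "B": 0, "C": 1},
          "C": {"A": 1, "B": 1, "C": 0}}

  results = [("A", "A"), ("A", "B"), ("B", "B"), ("C", "A"), ("C", "C")]  # (actual, predicted)

  total_cost = sum(cost[actual][predicted] for actual, predicted in results)
  print("Total cost:", total_cost)   # with the default matrix this simply counts the 2 errors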
Training with Costs
Can artificially inflate a 2-class training set with duplicates of the preferred class. An error-minimising classifier will then attempt to reduce the errors on the inflated counts.
Eg: Duplicate each 'false' instance 9 more times. The classifier is then biased against getting the 'no' instances wrong, as each such mistake counts as 10 errors instead of 1. Then evaluate against the correct proportion of instances.
Some classification algorithms also allow instances to be weighted directly, rather than duplicating them.
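A sketch of the duplication trick under these assumptions (tiny invented training set; 'no' is the class whose errors we want to count ten times over):

  raw_training = [([5.1, 3.5], "yes"), ([4.9, 3.0], "no"), ([6.2, 2.8], "yes")]

  inflated = []
  for features, label in raw_training:
      copies = 10 if label == "no" else 1   # 9 extra duplicates for each 'no' instance
      inflated.extend([(features, label)] * copies)

  print(len(inflated))   # 12: two 'yes' instances plus ten copies of the single 'no'
  # Train on 'inflated', but evaluate on data with the original class proportions.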
Probabilities
Some classifiers give a probability rather than a definite yes/no (eg
Bayesian techniques)
These must be taken into account when determining cost. Eg: a 51% probability on the correct class is not that much better than a 51% probability on an incorrect class.
We have some extra tricks that we can use to evaluate
probabilities...
Quadratic Loss
Quadratic Loss Function:

  ∑_j (p_j − a_j)²

where the sum runs over the j classes for a single instance: a_j is 1 for the correct class and 0 for the others, and p_j is the probability assigned to class j.
Then sum the loss over all test instances for a classifier.
You could then find the mean across different cross validation
folds... at which point you have the mean squared error.
Quadratic Loss
∑_j (p_j − a_j)²

Example:
In a 5 class problem, an instance might have:
(0.5, 0.2, 0.05, 0.15, 0.1)
When you want the first class:
(1, 0, 0, 0, 0)
= (0.5 − 1)² + 0.2² + 0.05² + 0.15² + 0.1²
= 0.25 + 0.04 + 0.0025 + 0.0225 + 0.01
= 0.325
(and then summed for all instances, and the mean taken across CV
folds)
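The same calculation as a small sketch (the probability vector is the one from the example above):

  def quadratic_loss(probs, true_index):
      # sum over classes of (p_j - a_j)^2, where a_j is 1 for the true class and 0 otherwise
      return sum((p - (1 if j == true_index else 0)) ** 2 for j, p in enumerate(probs))

  print(quadratic_loss([0.5, 0.2, 0.05, 0.15, 0.1], 0))   # 0.325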
Information Loss
The opposite of information gain: we can use the same function as a cost.

  −E_1 log(p_1) − E_2 log(p_2) − ...

where E_j is the true probability of class j and p_j is the predicted probability. If each instance belongs to exactly one true class, the only term that matters is the one for the correct class, as the rest are multiplied by 0.
Note that if you assign a 0 probability to the true class, you get an
infinite error! (Don't Do That Then)
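A sketch of the same idea (base-2 logarithm assumed, as with information gain; one true class per instance, so only its term survives):

  import math

  def information_loss(probs, true_index):
      # -log2 of the probability given to the true class; blows up if that probability is 0
      return -math.log2(probs[true_index])

  print(information_loss([0.5, 0.2, 0.05, 0.15, 0.1], 0))   # 1.0 bit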
Precision/Recall
Information Retrieval uses the same confusion matrix:
Recall: relevant and retrieved / total relevant
Precision: relevant and retrieved / total retrieved
eg 10 relevant, of which 6 are retrieved = 60% recall
100 retrieved, among them all 10 relevant documents = 10% precision.
The best result is all relevant documents retrieved, and no
irrelevant documents retrieved
False Positive: Document retrieved but not relevant
False Negative: Relevant, but not retrieved
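In terms of the matrix cells, as a sketch (the unstated cells of the two examples above are filled in with the simplest values that fit):

  def precision_recall(tp, fp, fn):
      precision = tp / (tp + fp)   # relevant and retrieved / total retrieved
      recall    = tp / (tp + fn)   # relevant and retrieved / total relevant
      return precision, recall

  print(precision_recall(tp=6, fp=0, fn=4))     # recall 0.6: 6 of the 10 relevant docs found
  print(precision_recall(tp=10, fp=90, fn=0))   # precision 0.1: 100 retrieved, only 10 relevant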
Lift Charts
To go back to the directed advertising example... A data mining tool
might predict that, given a sample of 100,000 recipients, 400 will
buy (0.4%). Given 400,000, then it predicts that 800 will buy
(0.2%).
In order to work out where the ideal point is, we need to include information about the cost of sending an advertisement vs the profit gained from someone who responds (eg will 300,000 extra ads be worth 400 extra buyers?).
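(Using the 40p figure from the earlier mailing slide as a rough guide: 300,000 extra letters cost £120,000, so the 400 extra buyers would each need to bring in about £300 of profit just to break even.)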
This can be graphed, hence a lift chart...
Lift Charts
The lift is what is gained from the baseline to the black line determined by the classification engine on the lift chart (or a Cumulative Gains chart).
This can be accomplished by ranking instances by the highest probabilities first.
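A sketch of that ranking step (the probabilities and buy/no-buy labels are invented):

  # Rank instances by predicted probability of responding, highest first, then count
  # how many actual buyers fall inside each growing prefix of the ranking.
  scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 1), (0.4, 0), (0.2, 0)]  # (probability, bought?)
  scored.sort(key=lambda pair: pair[0], reverse=True)

  found = 0
  for i, (prob, bought) in enumerate(scored, start=1):
      found += bought
      print("top %d contacted -> %d buyers" % (i, found))
  # Plotting buyers found against the fraction of the population contacted gives the
  # gains curve; the lift is how far it rises above the diagonal of random mailing.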
ROC Curves
From signal processing: Receiver Operating Characteristic.
Tradeoff between hit rate and false alarm rate when trying to find a real signal in a noisy channel.
Plot true positives vertically, and false positives horizontally.
As with Lift charts, the place to be is the top left.
Generate a list of instances ordered by predicted probability, along with whether the classifier correctly classifies each one. Then for each true positive take a step up, and for each false positive take a step to the right.
Eg...
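A sketch of that construction (invented scores; 1 marks an actual positive):

  # Sort by predicted probability; step up for each true positive, right for each false positive.
  ranked = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1), (0.5, 0), (0.3, 0)]  # (score, actual)
  ranked.sort(key=lambda pair: pair[0], reverse=True)

  pos = sum(actual for score, actual in ranked)
  neg = len(ranked) - pos

  tp = fp = 0
  points = [(0.0, 0.0)]
  for score, actual in ranked:
      if actual == 1:
          tp += 1          # step up
      else:
          fp += 1          # step right
      points.append((fp / neg, tp / pos))   # (false positive rate, true positive rate)

  print(points)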
ROC Curves
We can generate the smooth curve by the use of cross-validation.
Eg generate a curve for each fold, and then average them.
ROC Curves
We can also plot two curves on the same chart, each generated
from different classifiers. This lets us see at which point it's
better to use one classifier rather than the other.
By using both the A and B classifiers with appropriate weightings, it's possible to obtain points in between the two peaks.
Numeric Prediction
Most common is Mean Squared Error, which we have seen before (subtract the prediction from the actual value, square it, and average).
Also Mean Absolute Error – don't square it, just average the magnitude of each error.
But if there is a great difference between the numbers to be predicted we might want to use a relative error. Eg being 50 out in a prediction of 500, vs 0.2 out in a prediction of 2 – the same magnitude of error, relatively speaking.
So, we have the Relative Squared Error.
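A sketch of the three measures (invented predictions; the relative squared error here divides by the squared error of always predicting the mean of the actual values):

  predicted = [510.0, 1.8, 95.0]
  actual    = [500.0, 2.0, 100.0]

  n    = len(actual)
  mse  = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n       # mean squared error
  mae  = sum(abs(p - a) for p, a in zip(predicted, actual)) / n         # mean absolute error
  mean = sum(actual) / n
  rse  = (sum((p - a) ** 2 for p, a in zip(predicted, actual))
          / sum((a - mean) ** 2 for a in actual))                       # relative squared error

  print(mse, mae, rse)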
Further Reading
Witten, Chapter 5
Han, 6.15