COMP527: Data Mining
Classification: Evaluation 2
M. Sulaiman Khan ([email protected])
Dept. of Computer Science, University of Liverpool, 2009
February 24, 2009

Slide 2: Course Outline
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2 (this lecture); Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.

Slide 3: Today's Topics
- Confusion Matrix
- Costs
- Lift Curves
- ROC Curves
- Numeric Prediction

Slide 4: Confusion Matrix
The confusion matrix:

                    Actual Yes         Actual No
    Predict Yes     True Positive      False Positive
    Predict No      False Negative     True Negative

We want the True Positive and True Negative counts to be as high as possible. The same holds with more than two classes: the diagonal from top left to bottom right should be high and all other cells low. (Think of the output from WEKA, for example.)

Slide 5: Kappa Statistic
But what about random luck? An accuracy of 50% against 1000 classes is obviously better than 50% against 2 classes. We can derive the Kappa statistic by comparing the classifier's confusion matrix with an artificial confusion matrix in which each row's predictions are divided among the classes in proportion to the overall class distribution.

Slide 6: Kappa Statistic
- Sum the diagonal of the expected-by-chance matrix (82).
- Sum the diagonal of the classifier's matrix (140).
- Subtract the expected sum from the classifier's sum: 140 - 82 = 58.
- Subtract the expected sum from the total number of instances: 200 - 82 = 118.
- Divide and express as a percentage: 58 / 118 = 49%.
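A minimal Python sketch of this kappa calculation. The cell values in the matrix below are an assumption (the original slide's matrix is not reproduced in these notes); they are chosen so that the totals match the figures quoted above: an observed diagonal of 140, an expected-by-chance diagonal of 82, and 200 instances in all.

```python
# Rows are predicted classes, columns are actual classes. The cell values are
# assumed for illustration; only the totals (140 / 82 / 200) come from the slides.
confusion = [
    [88, 10,  2],   # predicted class A
    [14, 40,  6],   # predicted class B
    [18, 10, 12],   # predicted class C
]

total = sum(sum(row) for row in confusion)                        # 200 instances
observed = sum(confusion[i][i] for i in range(len(confusion)))    # 140 correct

# Expected-by-chance diagonal: each row total split in proportion to the
# overall actual-class distribution (the column totals).
col_totals = [sum(row[j] for row in confusion) for j in range(len(confusion))]
expected = sum(sum(confusion[i]) * col_totals[i] / total
               for i in range(len(confusion)))                    # 82 by chance

kappa = (observed - expected) / (total - expected)                # 58 / 118
print(f"kappa = {kappa:.0%}")                                     # about 49%
```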
Slide 7: Cost
In some situations a false negative is much worse than a false positive.
Example: if the application is detecting (insert your least favourite nasty medical condition), it is better to have all true positives and no false negatives, even at the cost of some misclassifications.
Example 2: if the class distribution is very skewed (e.g. 99% class A, 1% class B), you need to tell the system that reaching 99% accuracy by always predicting A is not good enough: the cost of getting class B wrong must be higher than the value of getting class A right.

Slide 8: Cost
Another example application: mass-mailed advertising. If it costs 40 pence to send out a letter, you want to maximise the number of letters sent to people who will buy, and minimise the number sent to people who won't. So the confusion matrix becomes:

                    Actual Yes                    Actual No
    Predict Yes     Profit - 40p                  -40p
    Predict No      Potential profit not used     Saved money

Slide 9: Cost
We can use a cost matrix to determine the cost of a classifier's errors. The default cost matrix:

            A   B   C
        A   0   1   1
        B   1   0   1
        C   1   1   0

We might wish to change those values for different scenarios. When evaluating, we then sum the cost values in the cells rather than just counting the errors, and choose the model with the least total cost. This is only useful for evaluation, not for training a cost-sensitive classifier.

Slide 10: Training with Costs
We can artificially inflate a two-class training set with duplicates of the preferred class; an error-minimising classifier will then try to reduce the errors on the inflated class. E.g. duplicate each 'no' instance 9 more times: the classifier is now biased against predicting 'no' wrongly, because each such mistake counts as 10 errors instead of 1. Evaluate afterwards against the correct proportion of instances. Some classification algorithms also allow instances to be weighted directly, rather than duplicated.

Slide 11: Probabilities
Some classifiers give a probability rather than a definite yes/no (e.g. Bayesian techniques). These must be taken into account when determining cost: a correct prediction made with 51% probability is not much better than an incorrect one made with 51% probability. We have some extra tricks we can use to evaluate probabilities...

Slide 12: Quadratic Loss
The quadratic loss function for a single instance is the sum over the j classes of (p_j - a_j)^2, where a_j is 1 for the correct class and 0 for the others, and p_j is the probability the classifier assigns to class j. Sum this loss over all test instances; taking the mean across the cross-validation folds then gives the mean squared error.

Slide 13: Quadratic Loss
Example: in a 5-class problem, an instance might be assigned the probabilities (0.5, 0.2, 0.05, 0.15, 0.1). If the first class is the correct one, i.e. the true vector is (1, 0, 0, 0, 0):

    (0.5 - 1)^2 + 0.2^2 + 0.05^2 + 0.15^2 + 0.1^2
      = 0.25 + 0.04 + 0.0025 + 0.0225 + 0.01 = 0.325

(then summed over all instances, and the mean taken across the cross-validation folds).

Slide 14: Information Loss
The opposite of information gain; we can use the same function as a cost:

    -E_1 log(p_1) - E_2 log(p_2) - ...

where E_j is the true probability of class j and p_j is the predicted probability. With hard class labels, only the term for the correct class matters, as the others are multiplied by 0. Note that if you assign a probability of 0 to the true class, you get an infinite error! (So don't do that.)
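A small sketch of these two loss functions, reproducing the worked five-class example from slide 13. Using log base 2 for the information loss is an assumption here (it matches the base normally used for information gain).

```python
import math

def quadratic_loss(probs, true_index):
    # sum_j (p_j - a_j)^2 for one instance, where a_j is 1 only for the true class
    return sum((p - (1.0 if j == true_index else 0.0)) ** 2
               for j, p in enumerate(probs))

def information_loss(probs, true_index):
    # -log2 of the probability given to the true class; infinite if that
    # probability is 0, so never assign the true class a zero probability
    return -math.log2(probs[true_index])

# The five-class example from slide 13, with the first class correct:
probs = [0.5, 0.2, 0.05, 0.15, 0.1]
print(quadratic_loss(probs, 0))    # 0.25 + 0.04 + 0.0025 + 0.0225 + 0.01 = 0.325
print(information_loss(probs, 0))  # -log2(0.5) = 1 bit
```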
Slide 15: Precision/Recall
Information Retrieval uses the same confusion matrix:
- Recall: relevant and retrieved / total relevant
- Precision: relevant and retrieved / total retrieved
E.g. 10 relevant documents of which 6 are retrieved gives 60% recall; 100 documents retrieved of which 10 are relevant gives 10% precision.
The best result is all relevant documents retrieved and no irrelevant documents retrieved.
- False Positive: document retrieved but not relevant
- False Negative: document relevant but not retrieved

Slide 16: Lift Charts
Back to the directed advertising example. A data mining tool might predict that, in a sample of 100,000 recipients, 400 will buy (0.4%), and that in a sample of 400,000, 800 will buy (0.2%). To work out where the ideal point is, we need to include the cost of sending an advertisement versus the profit gained from someone who responds (e.g. are 300,000 extra adverts worth 400 extra buyers?). This can be graphed, hence a lift chart...

Slide 17: Lift Charts
The lift is the gain from the baseline (random mailing) up to the classifier's curve, as determined by the classification engine. (Also called a cumulative gains chart.) It is produced by ranking the instances with the highest predicted probabilities first. [Chart not reproduced in these notes.]

Slide 18: ROC Curves
From signal processing: Receiver Operating Characteristic, the trade-off between hit rate and false-alarm rate when trying to find real data in a noisy channel. Plot true positives vertically and false positives horizontally; as with lift charts, the place to be is the top left. Generate a list of instances ordered by the classifier's confidence, together with whether each is classified correctly; then for each true positive take a step up, and for each false positive take a step to the right (see the sketch at the end of these notes).

Slide 19: ROC Curves
We can generate a smooth curve by using cross-validation, e.g. generate a curve for each fold and then average them.

Slide 20: ROC Curves
We can also plot two curves on the same chart, each generated by a different classifier. This shows at which point it is better to use one classifier rather than the other. By combining classifiers A and B with appropriate weightings, it is possible to reach points in between the two curves.

Slide 21: Numeric Prediction
The most common measure is the mean squared error, seen before: subtract the prediction from the actual value, square it, and average. There is also the mean absolute error: don't square, just average the magnitude of each error. But if the numbers to be predicted vary greatly in scale, we might want a relative error: being 50 out on a prediction of 500 is, relatively speaking, the same as being 0.2 out on a prediction of 2. For this we have the relative squared error.

Slide 22: Numeric Prediction
[The formulas for these error measures are not reproduced in these notes; see the sketch at the end.]

Slide 23: Further Reading
- Witten, Chapter 5
- Han, 6.15
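Appendix: a sketch of the ROC step construction from slide 18, referenced above. The (probability, actual label) pairs are made-up illustration data, not from the lecture.

```python
# Rank the test instances by the classifier's predicted probability of "yes",
# then step up for every actual yes and step right for every actual no.
# The data below is invented purely for illustration.
ranked = sorted(
    [(0.95, 1), (0.85, 1), (0.80, 0), (0.70, 1), (0.55, 0),
     (0.45, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 1)],
    key=lambda pair: pair[0], reverse=True)

pos = sum(label for _, label in ranked)      # number of actual positives
neg = len(ranked) - pos                      # number of actual negatives

tp = fp = 0
curve = [(0.0, 0.0)]                         # the curve starts at the origin
for _, label in ranked:
    if label == 1:
        tp += 1                              # true positive: step up
    else:
        fp += 1                              # false positive: step right
    curve.append((fp / neg, tp / pos))       # (false positive rate, true positive rate)

for x, y in curve:
    print(f"FP rate {x:.1f}  TP rate {y:.1f}")
```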
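Appendix: a sketch of the numeric error measures from slides 21-22. Since the original formulas are not reproduced in these notes, the relative squared error below follows the usual textbook definition (total squared error divided by the squared error of always predicting the mean of the actual values) and should be read as an assumption. The example values are invented.

```python
def mean_squared_error(predicted, actual):
    # average of (prediction - actual)^2
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    # average magnitude of the errors, without squaring
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def relative_squared_error(predicted, actual):
    # total squared error, normalised by the error of the trivial predictor
    # that always guesses the mean of the actual values
    mean_actual = sum(actual) / len(actual)
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual))
            / sum((a - mean_actual) ** 2 for a in actual))

predicted = [510.0, 1.8, 305.0]
actual    = [500.0, 2.0, 300.0]
print(mean_squared_error(predicted, actual))
print(mean_absolute_error(predicted, actual))
print(relative_squared_error(predicted, actual))
```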