How to Construct an ROC Curve

  Instance  P(+|A)  True Class
     1       0.95       +
     2       0.93       +
     3       0.87       -
     4       0.85       -
     5       0.85       -
     6       0.85       +
     7       0.76       -
     8       0.53       +
     9       0.43       -
    10       0.25       +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances by P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
  – TP rate, TPR = TP/(TP+FN)
  – FP rate, FPR = FP/(FP+TN)

© Tan, Steinbach, Kumar — Introduction to Data Mining, 4/18/2004

How to Construct an ROC Curve

  Instance  P(+)  True class   FPR   TPR
     1      0.95      +         0    1/5
     2      0.93      +         0    2/5
     3      0.87      -        1/5   2/5
     4      0.85      -
     5      0.85      -
     6      0.85      +        3/5   3/5
     7      0.76      -        4/5   3/5
     8      0.53      +        4/5   4/5
     9      0.43      -         1    4/5
    10      0.25      +         1     1

(Instances 4–6 share P(+|A) = 0.85, so they fall under a single threshold; FPR and TPR are updated once for the whole group.)

[Figure: the resulting ROC curve, TPR (0.0–1.0) on the vertical axis against FPR (0.0–1.0) on the horizontal axis.]

How to Construct an ROC Curve — perfect ranking

  Instance  P(+)  True class   FPR   TPR
     1      0.95      +         0    1/5
     2      0.93      +         0    2/5
     3      0.87      +         0    3/5
     4      0.85      +         0    4/5
     5      0.83      +         0     1
     6      0.80      -        1/5    1
     7      0.76      -        2/5    1
     8      0.53      -        3/5    1
     9      0.43      -        4/5    1
    10      0.25      -         1     1

How to Construct an ROC Curve — alternating ranking

  Instance  P(+)  True class   FPR   TPR
     1      0.95      +         0    1/5
     2      0.93      -        1/5   1/5
     3      0.87      +        1/5   2/5
     4      0.85      -        2/5   2/5
     5      0.83      +        2/5   3/5
     6      0.80      -        3/5   3/5
     7      0.76      +        3/5   4/5
     8      0.53      -        4/5   4/5
     9      0.43      +        4/5    1
    10      0.25      -         1     1

Model Evaluation

• Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
• Methods for Performance Evaluation
  – How to obtain reliable estimates?
• Methods for Model Comparison
  – How to compare the relative performance among competing models?
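The threshold-sweeping procedure above can be sketched in a few lines of Python. This is our own minimal illustration (the function name `roc_points` is not from the slides): sort by P(+|A), sweep a threshold over each unique score, and record (FPR, TPR) after each step, treating tied scores as one threshold.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each unique
    score, from the most permissive threshold to the most restrictive."""
    pos = sum(1 for y in labels if y == '+')
    neg = len(labels) - pos
    # Sort instances by P(+|A) in decreasing order.
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        thr = ranked[i][0]
        # Instances tied at the same score fall under a single threshold.
        while i < len(ranked) and ranked[i][0] == thr:
            if ranked[i][1] == '+':
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# The ten instances from the first table above.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
pts = roc_points(scores, labels)
```

Running this reproduces the table's (FPR, TPR) column pairs, including the single point (3/5, 3/5) for the three instances tied at 0.85.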
[Figure: ROC curves for the rankings above, TPR (0.0–1.0) plotted against FPR (0.0–1.0).]

Confidence Interval for Accuracy

• Prediction can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes
  – Possible outcomes for a prediction: correct or wrong
  – A collection of Bernoulli trials has a Binomial distribution:
      x ~ Bin(N, p),  where x is the number of correct predictions
  – e.g., toss a fair coin 50 times: how many heads turn up?
    Expected number of heads = N × p = 50 × 0.5 = 25
• Given x, or equivalently acc = x/N, can we predict p (the true accuracy of the model)?
  – For large test sets (N > 30), acc has an approximately normal distribution with mean p and variance p(1−p)/N
  – A (1−α) × 100% confidence interval for p is approximately

      acc ± Z_{α/2} √( acc(1−acc)/N )

Confidence Interval for Accuracy

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1−α = 0.95 (95% confidence)
  – From the probability table, Z_{α/2} = 1.96

      1−α:     0.99  0.98  0.95  0.90
      Z_{α/2}: 2.58  2.33  1.96  1.65

  – The interval for p at acc = 0.8 and different test-set sizes N:

      N         50    100   500   1000  5000
      p(lower)  0.689 0.722 0.765 0.775 0.789
      p(upper)  0.911 0.878 0.835 0.825 0.811

Model Evaluation

• Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
• Methods for Performance Evaluation
  – How to obtain reliable estimates?
• Methods for Model Comparison
  – How to compare the relative performance among competing models?
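The interval used in the table above can be checked directly. This is a sketch of the slide's formula acc ± Z_{α/2}·√(acc(1−acc)/N); the function name `accuracy_interval` and the small Z lookup table are our own additions.

```python
import math

# Z_{α/2} values for common confidence levels, as tabulated on the slide.
Z = {0.99: 2.58, 0.98: 2.33, 0.95: 1.96, 0.90: 1.65}

def accuracy_interval(acc, n, confidence=0.95):
    """Normal-approximation confidence interval for the true accuracy p."""
    half_width = Z[confidence] * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# Reproduce the slide's table: acc = 0.8 at several test-set sizes.
for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(f"N={n:5d}  p in ({lo:.3f}, {hi:.3f})")
```

For N = 100 this yields (0.722, 0.878), matching the table; as N grows, the interval tightens around acc.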
Confidence Interval for Accuracy

• For large test sets (N > 30), acc = x/N is approximately normal, so

    P( −Z_{α/2} < (acc − p) / √( p(1−p)/N ) < Z_{α/2} ) = 1 − α

  where the area under the standard normal density between −Z_{α/2} and Z_{α/2} equals 1 − α.

Comparing Performance of 2 Models

• Given two models, say M1 and M2, which is better?
• Usually the models are evaluated on the same test sample.
• Make use of the correlation between their predictions.
• Make a cross-table of the correct and incorrect predictions of M1 and M2:

                      Model M2
  Count          incorrect  correct
  Model M1
    incorrect        a         b
    correct          c         d

Comparing Performance of 2 Models

Errors of M1 and M2 are independent. Let X = C1 − C2, where Ci = 1 if model i is correct and Ci = 0 if it is incorrect:

  Prob.              C2 incorrect (0)  C2 correct (1)
  C1 incorrect (0)        0.04             0.16
  C1 correct (1)          0.16             0.64

  X      −1    0    +1
  P(X)  .16  .68   .16

  E(X) = 0
  VAR(X) = E(X − E(X))² = 0.16(−1)² + 0.68(0)² + 0.16(1)² = 0.32

Strong positive correlation:

  Prob.              C2 incorrect (0)  C2 correct (1)
  C1 incorrect (0)        0.18             0.02
  C1 correct (1)          0.02             0.78

  X      −1    0    +1
  P(X)  .02  .96   .02

  E(X) = 0
  VAR(X) = E(X − E(X))² = 0.02(−1)² + 0.96(0)² + 0.02(1)² = 0.04

Comparing Performance of 2 Models

• Ignore cells a and d (both incorrect, both correct).
• If the models were equally good, we would expect the counts in cells b and c to be in balance.
• Under the null hypothesis that the models have the same error rate, the count in cell b has a binomial distribution with n = n(b) + n(c) and p = 0.5.
• Larger differences between b and c are more likely if errors are independent, and less likely if errors are positively correlated.
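The two variance calculations above can be reproduced mechanically from the 2×2 probability tables. This is our own small sketch (the name `diff_moments` is not from the slides): it computes E(X) and VAR(X) for X = C1 − C2 directly from the joint distribution.

```python
def diff_moments(table):
    """table[c1][c2] = P(C1=c1, C2=c2), with 0 = incorrect, 1 = correct.
    Returns (E(X), VAR(X)) for X = C1 - C2."""
    e = sum(p * (c1 - c2)
            for c1, row in table.items() for c2, p in row.items())
    var = sum(p * ((c1 - c2) - e) ** 2
              for c1, row in table.items() for c2, p in row.items())
    return e, var

# The two cases from the slides: same marginal error rate (0.2) for each
# model, but independent vs. strongly positively correlated errors.
independent = {0: {0: 0.04, 1: 0.16}, 1: {0: 0.16, 1: 0.64}}
correlated  = {0: {0: 0.18, 1: 0.02}, 1: {0: 0.02, 1: 0.78}}

print(diff_moments(independent))  # E(X)=0, VAR(X)=0.32
print(diff_moments(correlated))   # E(X)=0, VAR(X)=0.04
```

Both cases have E(X) = 0, but the difference X fluctuates far less when the errors are positively correlated, which is exactly why the same observed difference carries more evidence there.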
Comparing Performance of 2 Models

• Hence, an observed difference may be regarded as significant for models with positively correlated errors, but not for models with independent errors.
• Our test should reflect (make use of) this property.

Comparing Performance of 2 Models

We test the null hypothesis

  H0 : e1 = e2    against    Ha : e1 ≠ e2

where ei denotes the true error rate of model i.

Errors of M1 and M2 are independent:

                      Model M2
  Count          incorrect  correct
  Model M1
    incorrect        6         14
    correct         24         56

Cell b has a binomial distribution with n = 14 + 24 = 38 and p = 0.5, giving p-value = 0.14.

[Figure: Bin(38, 0.5) probability histogram, k = 0 … 38, with the observed count marked.]

Errors of M1 and M2 are positively correlated:

                      Model M2
  Count          incorrect  correct
  Model M1
    incorrect       18          2
    correct         12         68

Cell b has a binomial distribution with n = 2 + 12 = 14 and p = 0.5, giving p-value = 0.012.

[Figure: Bin(14, 0.5) probability histogram, k = 0 … 14, with the observed count marked.]

Comparing Performance of 2 Models

• Although the difference in error rate is the same in both cases (20 vs. 30 errors out of 100), the independent case produced a p-value of 0.14 (typically not regarded as significant), leading to the conclusion that we cannot reject the null hypothesis that both models have the same error rate.
• The example with positively correlated errors produces a p-value of 0.012, leading to the conclusion that M1 has a significantly lower error rate than M2.
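The two p-values above come from an exact two-sided binomial (sign) test on the discordant cells b and c. A sketch using only the standard library — the helper name `mcnemar_pvalue` is our own; the slides do not name the test, though this construction is the exact form of McNemar's test:

```python
from math import comb

def mcnemar_pvalue(b, c):
    """Exact two-sided binomial test: under H0, b ~ Bin(b + c, 0.5).
    Doubles the probability of the smaller tail."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_indep = mcnemar_pvalue(14, 24)   # independent-errors table
p_corr  = mcnemar_pvalue(2, 12)    # positively-correlated table
print(round(p_indep, 2), round(p_corr, 3))
```

This gives p ≈ 0.14 for the independent case and p ≈ 0.013 for the correlated case (the slide reports 0.012; the exact two-sided value is 0.0129).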
Data Mining — Classification: Alternative Techniques
Lecture Notes for Chapter 5, Introduction to Data Mining
by Tan, Steinbach, Kumar

Bayes (Generative) Classifier

• A probabilistic framework for solving classification problems
• Conditional probability:

    P(C | A) = P(A, C) / P(A)
    P(A | C) = P(A, C) / P(C)

• Bayes theorem:

    P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem

• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Bayesian Classifiers

• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  – The goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers

• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

      P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

Curse of Dimensionality

• How to estimate P(A1, A2, …, An | C)?
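The meningitis calculation above is a one-line application of Bayes theorem. A minimal sketch (the helper name `posterior` is our own; the numbers are the slide's):

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(C | A) = P(A | C) * P(C) / P(A)."""
    return likelihood * prior / evidence

# P(M | S) with P(S | M) = 0.5, P(M) = 1/50,000, P(S) = 1/20.
p_m_given_s = posterior(likelihood=0.5, prior=1 / 50000, evidence=1 / 20)
print(p_m_given_s)  # ≈ 0.0002
```

Despite the strong symptom likelihood P(S|M) = 0.5, the tiny prior keeps the posterior at 0.0002.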
• If each attribute is discrete with, say, 5 possible values, then estimating every possible combination requires the estimation of 5^n probabilities per class.
• For 10 attributes (n = 10) this is about ten million probabilities. In general: m^n probabilities per class for m values per attribute.
• This simple approach runs into the curse of dimensionality.
• To be practical, we need to make some simplifying assumptions.

Conditional Independence

• X and Y are independent iff P(X, Y) = P(X) P(Y), or, equivalently, P(X | Y) = P(X).
  – Intuition: Y doesn't provide any information about X (and vice versa).
• X and Y are independent given Z iff P(X, Y | Z) = P(X | Z) P(Y | Z), or, equivalently, P(X | Y, Z) = P(X | Z).
  – Intuition: if we know the value of Z, then Y doesn't provide any information about X (and vice versa).

Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – We can estimate P(Ai | Cj) for all Ai and Cj.
  – Now we only need to estimate m × n probabilities per class.
  – A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal.
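The combinatorial savings claimed above can be checked directly. This is our own illustration of the parameter counts, not code from the slides:

```python
def joint_params(m, n):
    """Full joint P(A1..An | C): one probability per value combination."""
    return m ** n

def naive_bayes_params(m, n):
    """Naive Bayes: one m-valued distribution per attribute."""
    return m * n

# 10 attributes with 5 values each, as in the slide's example.
print(joint_params(5, 10))        # 9765625 -- "about ten million"
print(naive_bayes_params(5, 10))  # 50
```

The conditional-independence assumption collapses roughly ten million free parameters per class down to fifty.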
How to Estimate Probabilities from Data?

  Tid  Refund  Marital Status  Taxable Income  Evade
   1    Yes     Single          125K            No
   2    No      Married         100K            No
   3    No      Single           70K            No
   4    Yes     Married         120K            No
   5    No      Divorced         95K            Yes
   6    No      Married          60K            No
   7    Yes     Divorced        220K            No
   8    No      Single           85K            Yes
   9    No      Married          75K            No
  10    No      Single           90K            Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

• Class prior: P(C) = Nc / N
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

How to Estimate Probabilities from Data?

• For continuous attributes:
  – Discretize the range into bins
  – Two-way split: (A < v) or (A ≥ v)
      – choose only one of the two splits as the new attribute
  – Probability density estimation:
      – assume the attribute follows a normal distribution
      – use the data to estimate the parameters of the distribution (i.e., mean and standard deviation)
      – once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
• Normal distribution, one for each (Ai, cj) pair:

    P(Ai | cj) = 1 / √(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

• For (Income, Class=No):
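The estimates quoted above can be computed straight from the ten training records. A sketch under the slide's setup — the record list is our transcription of the table, and the variable names are our own:

```python
records = [  # (refund, marital, income_K, evade)
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

no_rows = [r for r in records if r[3] == "No"]

# Discrete attribute: P(Status=Married | No) = |A_ik| / N_c
p_married_no = sum(1 for r in no_rows if r[1] == "Married") / len(no_rows)

# Continuous attribute: Gaussian parameters of Income given Class=No
incomes = [r[2] for r in no_rows]
mean = sum(incomes) / len(incomes)
var = sum((x - mean) ** 2 for x in incomes) / (len(incomes) - 1)  # sample variance

print(p_married_no, mean, var)  # 4/7 ≈ 0.571, mean 110.0, variance 2975.0
```

This recovers exactly the numbers the slides use next: P(Married | No) = 4/7, and sample mean 110 with sample variance 2975 for Income given Class=No.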
  – Sample mean = 110, sample variance = 2975 (standard deviation ≈ 54.54)

    P(Income=120 | No) = 1 / (√(2π) · 54.54) · exp( −(120−110)² / (2 · 2975) ) = 0.0072

[Figure: histogram of Income for Class=No and Class=Yes, with fitted normal densities, Income 0–300K.]

Example of Naïve Bayes Classifier

Given a test record:  X = (Refund = No, Married, Income = 120K)

Estimates from the training data (priors: P(No) = 7/10, P(Yes) = 3/10):

  P(Refund=Yes | No) = 3/7                P(Refund=Yes | Yes) = 0
  P(Refund=No | No) = 4/7                 P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7     P(Marital Status=Single | Yes) = 2/7
  P(Marital Status=Divorced | No) = 1/7   P(Marital Status=Divorced | Yes) = 1/7
  P(Marital Status=Married | No) = 4/7    P(Marital Status=Married | Yes) = 0

  For taxable income:
    If class = No:  sample mean = 110, sample variance = 2975
    If class = Yes: sample mean = 90,  sample variance = 25

Assuming independence among the attributes given the class:

• P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
                   = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No

Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation with Laplace smoothing:

    Original: P(Ai | C) = Nic / Nc
    Laplace:  P(Ai | C) = (Nic + 1) / (Nc + a)

  where a is the number of values of Ai

Naïve Bayes (Summary)

• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• The independence assumption may not hold for some attributes
  – Use other techniques such as Bayesian Belief Networks
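The full decision for X = (Refund=No, Married, Income=120K) can be sketched as follows, combining the discrete estimates with the Gaussian density; the dictionaries below transcribe the slide's numbers, and the variable names are our own:

```python
import math

def gaussian(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

priors       = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no  = {"No": 4 / 7,  "Yes": 1.0}
p_married    = {"No": 4 / 7,  "Yes": 0.0}
income_stats = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)

score = {}
for c in ("No", "Yes"):
    mean, var = income_stats[c]
    # Naive Bayes score: P(X | C) * P(C) under the independence assumption.
    score[c] = (p_refund_no[c] * p_married[c] * gaussian(120, mean, var)
                * priors[c])

print(max(score, key=score.get))  # prints "No"
```

P(Married | Yes) = 0 zeroes out the entire Yes score, which is precisely the problem that the Laplace-smoothed estimate (Nic + 1)/(Nc + a) above is designed to avoid.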
Example of Naïve Bayes Classifier

  Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
  human              yes        no        no            yes      mammals
  python             no         no        no            no       non-mammals
  salmon             no         no        yes           no       non-mammals
  whale              yes        no        yes           no       mammals
  frog               no         no        sometimes     yes      non-mammals
  komodo             no         no        no            yes      non-mammals
  bat                yes        yes       no            yes      mammals
  pigeon             no         yes       no            yes      non-mammals
  cat                yes        no        no            yes      mammals
  leopard shark      yes        no        yes           no       non-mammals
  turtle             no         no        sometimes     yes      non-mammals
  penguin            no         no        sometimes     yes      non-mammals
  porcupine          yes        no        no            yes      mammals
  eel                no         no        yes           no       non-mammals
  salamander         no         no        sometimes     yes      non-mammals
  gila monster       no         no        no            yes      non-mammals
  platypus           no         no        no            yes      mammals
  owl                no         yes       no            yes      non-mammals
  dolphin            yes        no        yes           no       mammals
  eagle              no         yes       no            yes      non-mammals

Test record A:  Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

A: attributes, M: mammals, N: non-mammals

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

  P(A | M) P(M) > P(A | N) P(N)  =>  Mammals

Naïve Bayes (Summary)

• The independence assumption may not hold for (some) attributes, but:
  – If we evaluate on error rate, then all that matters, in the binary case, is whether the probability estimate is on the right side of 0.5.
  – With more than two classes similar reasoning applies, but the "margin of error" becomes smaller.
  – For an ROC curve, what matters is that we get the probabilities in the right order.
• Example: suppose P(Yes | A1=a1, …, An=an) = 0.7 is the true probability of class Yes for a given attribute vector. To minimize the error rate we should classify this attribute vector as Yes. As long as our estimate P̂(Yes | A1=a1, …, An=an) > 0.5, we will assign the optimal class — the probability estimate itself may be way off!
• If we evaluate on "likelihood", this doesn't fly!
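The mammal/non-mammal computation above can be verified exactly with fractions; the conditional counts are transcribed from the table (7 mammals, 13 non-mammals):

```python
from fractions import Fraction as F

# P(A | M): give birth=yes, can fly=no, live in water=yes, have legs=no
p_a_given_m = F(6, 7) * F(6, 7) * F(2, 7) * F(2, 7)
# Same attribute values conditioned on non-mammals.
p_a_given_n = F(1, 13) * F(10, 13) * F(3, 13) * F(4, 13)

score_m = p_a_given_m * F(7, 20)    # x P(M)
score_n = p_a_given_n * F(13, 20)   # x P(N)

print(float(score_m), float(score_n))  # ≈ 0.021 vs ≈ 0.0027
print("Mammals" if score_m > score_n else "Non-mammals")
```

Note the leopard shark is labelled non-mammal in the training table despite "give birth = yes", which is why P(Give Birth=yes | N) = 1/13 rather than 0.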
Example

Joint distributions of two binary attributes A1, A2 within each class, with P(C=0) = P(C=1) = 1/2:

  Class C=0:        A2=0   A2=1   P(A1)
       A1=0         0.3    0.1    0.4
       A1=1         0.1    0.5    0.6
       P(A2)        0.4    0.6

  Class C=1:        A2=0   A2=1   P(A1)
       A1=0         0.6    0.1    0.7
       A1=1         0.1    0.2    0.3
       P(A2)        0.7    0.3

Exact posterior:

  P(C=0 | A1=1, A2=1)
    = P(A1=1, A2=1 | C=0) P(C=0) / [ P(A1=1, A2=1 | C=0) P(C=0) + P(A1=1, A2=1 | C=1) P(C=1) ]
    = (0.5 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) = 0.25 / 0.35 ≈ 0.71

With Naive Bayes:

  P(C=0 | A1=1, A2=1)
    = P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) / [ P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) + P(A1=1 | C=1) P(A2=1 | C=1) P(C=1) ]
    = (0.6 × 0.6 × 0.5) / (0.6 × 0.6 × 0.5 + 0.3 × 0.3 × 0.5) = 0.18 / 0.225 = 0.8
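The gap between the exact posterior and the Naive Bayes posterior can be checked from the tables; the dictionaries below are our own transcription of the two joint distributions:

```python
joint = {  # joint[c][(a1, a2)] = P(A1=a1, A2=a2 | C=c)
    0: {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5},
    1: {(0, 0): 0.6, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.2},
}
prior = {0: 0.5, 1: 0.5}

def marginal(c, attr, value):
    """P(A_attr = value | C=c), summed out of the joint table."""
    return sum(p for cell, p in joint[c].items() if cell[attr] == value)

# Exact posterior P(C=0 | A1=1, A2=1) from the full joint.
num = joint[0][(1, 1)] * prior[0]
exact = num / (num + joint[1][(1, 1)] * prior[1])

# Naive Bayes posterior using the product of the marginals.
nb0 = marginal(0, 0, 1) * marginal(0, 1, 1) * prior[0]
nb1 = marginal(1, 0, 1) * marginal(1, 1, 1) * prior[1]
naive = nb0 / (nb0 + nb1)

print(round(exact, 2), round(naive, 2))  # 0.71 0.8
```

Both methods pick class 0, so the error-rate behaviour is unchanged, but the Naive Bayes probability estimate (0.8) overshoots the true posterior (≈ 0.71) — exactly the "estimate may be way off" point made in the summary above.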