How to Construct an ROC curve

• Use a classifier that produces a posterior probability P(+|A) for each test instance A.
• Sort the instances according to P(+|A) in decreasing order.
• Apply a threshold at each unique value of P(+|A).
• Count the number of TP, FP, TN, FN at each threshold.
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)

  Instance   P(+|A)   True Class
     1        0.95        +
     2        0.93        +
     3        0.87        -
     4        0.85        -
     5        0.85        -
     6        0.85        +
     7        0.76        -
     8        0.53        +
     9        0.43        -
    10        0.25        +
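A minimal sketch of this procedure in Python (standard library only), using the ten instances from the table above; the tie handling follows the slide's convention of thresholding at each unique value of P(+|A):

```python
# Build ROC points by applying a threshold at each unique value of P(+|A).
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

P = labels.count('+')   # total positives (TP + FN)
N = labels.count('-')   # total negatives (FP + TN)

roc = [(0.0, 0.0)]      # threshold above the highest score: nothing is predicted +
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
    roc.append((fp / N, tp / P))   # (FPR, TPR) at this threshold

print(roc)
# [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.6, 0.6),
#  (0.8, 0.6), (0.8, 0.8), (1.0, 0.8), (1.0, 1.0)]
```

These (FPR, TPR) pairs are exactly the fractions tabulated on the following slides.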
How to construct an ROC curve

Applying a threshold at each unique value of P(+|A) gives the following rates (instances 4, 5 and 6 are tied at 0.85, so their counts are taken once at that threshold):

  Instance   P(+)    True class   FPR    TPR
     1       0.95        +         0     1/5
     2       0.93        +         0     2/5
     3       0.87        -        1/5    2/5
     4       0.85        -
     5       0.85        -
     6       0.85        +        3/5    3/5
     7       0.76        -        4/5    3/5
     8       0.53        +        4/5    4/5
     9       0.43        -         1     4/5
    10       0.25        +         1      1

[ROC curve plot: TPR (0.0 to 1.0) versus FPR (0.0 to 1.0) for the table above.]
How to construct an ROC curve

The same construction for a ranking in which every positive instance is scored above every negative instance:

  Instance   P(+)    True class   FPR    TPR
     1       0.95        +         0     1/5
     2       0.93        +         0     2/5
     3       0.87        +         0     3/5
     4       0.85        +         0     4/5
     5       0.83        +         0      1
     6       0.80        -        1/5     1
     7       0.76        -        2/5     1
     8       0.53        -        3/5     1
     9       0.43        -        4/5     1
    10       0.25        -         1      1

[ROC curve plot: TPR versus FPR for the ranking above.]
How to construct an ROC curve

The same construction for a ranking in which positive and negative instances alternate:

  Instance   P(+)    True class   FPR    TPR
     1       0.95        +         0     1/5
     2       0.93        -        1/5    1/5
     3       0.87        +        1/5    2/5
     4       0.85        -        2/5    2/5
     5       0.83        +        2/5    3/5
     6       0.80        -        3/5    3/5
     7       0.76        +        3/5    4/5
     8       0.53        -        4/5    4/5
     9       0.43        +        4/5     1
    10       0.25        -         1      1
[ROC curve plot: TPR (0.0 to 1.0) versus FPR (0.0 to 1.0) for the ranking above.]

Model Evaluation

• Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
• Methods for Performance Evaluation
  – How to obtain reliable estimates?
• Methods for Model Comparison
  – How to compare the relative performance among competing models?
Confidence Interval for Accuracy

• Prediction can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes
  – Possible outcomes for prediction: correct or wrong
  – A collection of Bernoulli trials has a Binomial distribution:
      x ∼ Bin(N, p)    where x is the number of correct predictions
      e.g.: toss a fair coin 50 times, how many heads would turn up?
      Expected number of heads = N × p = 50 × 0.5 = 25
• Given x (# of correct predictions) or, equivalently, acc = x/N, and N (# of test instances),
  can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy

• For large test sets (N > 30),
  – acc has a normal distribution with mean p and variance p(1−p)/N

      P( −Zα/2 < (acc − p) / √(p(1−p)/N) < Zα/2 ) = 1 − α

  [Figure: standard normal density; the area between −Zα/2 and Zα/2 equals 1 − α.]

• The (1−α) × 100% confidence interval for p:

      acc ± Zα/2 × √( acc(1−acc) / N )

Confidence Interval for Accuracy

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1−α = 0.95 (95% confidence)
  – From the probability table, Zα/2 = 1.96

      1−α    Zα/2
      0.99   2.58
      0.98   2.33
      0.95   1.96
      0.90   1.65

      N          50     100    500    1000   5000
      p(lower)   0.689  0.722  0.765  0.775  0.789
      p(upper)   0.911  0.878  0.835  0.825  0.811
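A small Python sketch of this interval (standard library only; the Z value is hard-coded rather than read from a table), reproducing the N-versus-bounds table above:

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Confidence interval for the true accuracy p: acc +/- z * sqrt(acc*(1-acc)/n)."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# acc = 0.8 at 95% confidence (z = 1.96) for several test-set sizes
for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_ci(0.8, n)
    print(f"N={n}: p(lower)={lo:.3f}, p(upper)={hi:.3f}")
# N=50: p(lower)=0.689, p(upper)=0.911 ... N=5000: p(lower)=0.789, p(upper)=0.811
```

The interval shrinks as N grows, which is the point of the table: the same observed accuracy is a much tighter statement about p on 5000 test instances than on 50.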
Comparing Performance of 2 Models

• Given two models, say M1 and M2, which is better?
• Usually the models are evaluated on the same test sample.
• Make use of the correlation between their predictions.

Comparing Performance of 2 Models

• Make a cross-table of correct and incorrect predictions of M1 and M2.
  Let C1 and C2 denote whether M1 and M2 are correct (1) or incorrect (0) on a test instance.

Errors of M1 and M2 are independent:

  Prob.                         C2
                     incorrect (0)   correct (1)
  C1  incorrect (0)      0.04            0.16
      correct (1)        0.16            0.64

  Let X = C1 − C2:

      X      −1     0     +1
      P(X)   .16   .68    .16

  E(X) = 0
  VAR(X) = E(X − E(X))² = 0.16·(−1)² + 0.68·0² + 0.16·1² = 0.32

Strong positive correlation:

  Prob.                         C2
                     incorrect (0)   correct (1)
  C1  incorrect (0)      0.18            0.02
      correct (1)        0.02            0.78

  Let X = C1 − C2:

      X      −1     0     +1
      P(X)   .02   .96    .02

  E(X) = 0
  VAR(X) = E(X − E(X))² = 0.02·(−1)² + 0.96·0² + 0.02·1² = 0.04

Comparing Performance of 2 Models

• Larger differences are more likely if the errors are independent, and less likely if the errors are positively correlated.
• Hence, an observed difference may be regarded as significant for models with positively correlated errors, but not for models with independent errors.
• Our test should reflect (make use of) this property.
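A quick Python check of the two variance calculations above (a sketch; the cell order is (C1, C2) with 1 = correct and 0 = incorrect):

```python
# Variance of X = C1 - C2 under the two joint distributions above.
independent = {(0, 0): 0.04, (0, 1): 0.16, (1, 0): 0.16, (1, 1): 0.64}
correlated  = {(0, 0): 0.18, (0, 1): 0.02, (1, 0): 0.02, (1, 1): 0.78}

def var_of_difference(joint):
    e_x  = sum(p * (c1 - c2)      for (c1, c2), p in joint.items())
    e_x2 = sum(p * (c1 - c2) ** 2 for (c1, c2), p in joint.items())
    return e_x2 - e_x ** 2        # Var(X) = E(X^2) - E(X)^2

print(var_of_difference(independent))  # 0.32
print(var_of_difference(correlated))   # 0.04
```

The much smaller variance under positive correlation is what makes the same observed difference in error rates more significant in that case.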
Comparing Performance of 2 Models

  Count                    Model M2
                     incorrect   correct
  Model  incorrect       a          b
  M1     correct         c          d

• Ignore cells a and d (both incorrect, both correct).
• If the models were equally good, we would expect the counts in cells b and c to be in balance.
• Under the null hypothesis that the models have the same error rate, the number in cell b has a binomial distribution with n = n(b) + n(c) and p = 0.5.

Comparing Performance of 2 Models

We test the null hypothesis
    H0: e1 = e2
against
    Ha: e1 ≠ e2
where ei denotes the true error rate of model i.

Errors of M1 and M2 are independent:

  Count                    Model M2
                     incorrect   correct
  Model  incorrect       6          14
  M1     correct        24          56

  [Plot: Binomial distribution with n = 38 and p = 0.5; p-value = 0.14.]

Errors of M1 and M2 are positively correlated:

  Count                    Model M2
                     incorrect   correct
  Model  incorrect      18           2
  M1     correct        12          68

  [Plot: Binomial distribution with n = 14 and p = 0.5; p-value = 0.012.]

Comparing Performance of 2 Models

Although the difference in error rate is the same in both cases, the independent case produces a p-value of 0.14 (typically not regarded as significant), leading to the conclusion that we cannot reject the null hypothesis that both models have the same error rate.

The example with positively correlated errors produces a p-value of 0.012, leading to the conclusion that M1 has a significantly lower error rate than M2.
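A short Python sketch of this binomial (sign-test style) comparison, assuming the two-sided p-value is taken as twice the smaller tail of Bin(n(b)+n(c), 0.5); it gives values close to the 0.14 and 0.012 quoted above:

```python
from math import comb

def binomial_two_sided_p(b, c):
    """Exact two-sided p-value for H0: e1 = e2, treating the count in cell b
    as Binomial(n = b + c, p = 0.5) and doubling the smaller tail."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(binomial_two_sided_p(14, 24))  # independent errors: ~0.143
print(binomial_two_sided_p(2, 12))   # positively correlated errors: ~0.013
```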
Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 5
Introduction to Data Mining
by Tan, Steinbach, Kumar

Bayes (Generative) Classifier

• A probabilistic framework for solving classification problems
• Conditional probability:

      P(C | A) = P(A, C) / P(A)
      P(A | C) = P(A, C) / P(C)

• Bayes theorem:

      P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem

• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what is the probability he/she has meningitis?

      P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
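A tiny Python sketch of this calculation (a generic Bayes-rule helper; the function and variable names are ours, not from the slides):

```python
def bayes(p_evidence_given_h, p_h, p_evidence):
    """Bayes theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_evidence_given_h * p_h / p_evidence

# Meningitis example: P(M | S) = P(S | M) * P(M) / P(S)
p_m_given_s = bayes(p_evidence_given_h=0.5, p_h=1 / 50_000, p_evidence=1 / 20)
print(f"{p_m_given_s:.4f}")  # 0.0002
```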
Bayesian Classifiers

• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  – The goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers

• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:

        P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?

Curse of dimensionality

• If each attribute is discrete with, say, 5 possible values, then estimating each possible combination requires 5^n probabilities per class.
• For 10 attributes (n = 10) this is about ten million probabilities. In general: m^n probabilities per class.
• This simple approach runs into the curse of dimensionality.
• To be practical, we need to make some simplifying assumptions.
Conditional Independence

• X and Y are independent iff P(X,Y) = P(X)P(Y), or, equivalently, P(X|Y) = P(X).
  Intuition: Y doesn't provide any information about X (and vice versa).
• X and Y are independent given Z iff P(X,Y|Z) = P(X|Z)P(Y|Z), or, equivalently, P(X|Y,Z) = P(X|Z).
  Intuition: if we know the value of Z, then Y doesn't provide any information about X (and vice versa).

Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – Can estimate P(Ai | Cj) for all Ai and Cj.
  – Now we only need to estimate m × n probabilities per class.
  – A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal.
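A quick illustration of the parameter-count argument, using m = 5 values per attribute and n = 10 attributes as in the curse-of-dimensionality slide (just the arithmetic, nothing more):

```python
m, n = 5, 10   # values per attribute, number of attributes

full_joint = m ** n   # estimating P(A1, ..., An | C) directly: one value per combination
naive_bayes = m * n   # estimating each P(Ai | C) separately under the independence assumption

print(full_joint, naive_bayes)  # 9765625 (about ten million) vs. 50 per class
```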
How to Estimate Probabilities from Data?

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class):

  Tid   Refund   Marital Status   Taxable Income   Evade
   1    Yes      Single           125K             No
   2    No       Married          100K             No
   3    No       Single            70K             No
   4    Yes      Married          120K             No
   5    No       Divorced          95K             Yes
   6    No       Married           60K             No
   7    Yes      Divorced         220K             No
   8    No       Single            85K             Yes
   9    No       Married           75K             No
  10    No       Single            90K             Yes

• Class: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
  – where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
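A small Python sketch that estimates these class priors and discrete conditional probabilities from the table above (standard library only; fractions are used so the output matches the slide's 7/10, 4/7, and so on):

```python
from collections import Counter
from fractions import Fraction

# (Refund, Marital Status, Taxable Income in K, Evade) rows from the training table
data = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

class_counts = Counter(row[3] for row in data)          # Nc
n = len(data)

def prior(c):
    return Fraction(class_counts[c], n)                 # P(C) = Nc / N

def cond(attr_index, value, c):
    """P(Ai = value | C = c) = |Aik| / Nc for a discrete attribute."""
    match = sum(1 for row in data if row[attr_index] == value and row[3] == c)
    return Fraction(match, class_counts[c])

print(prior("No"), prior("Yes"))    # 7/10 3/10
print(cond(1, "Married", "No"))     # 4/7
print(cond(0, "Yes", "Yes"))        # 0
```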
How to Estimate Probabilities from Data?

• For continuous attributes:
  – Discretize the range into bins
  – Two-way split: (A < v) or (A ≥ v)
      choose only one of the two splits as the new attribute
  – Probability density estimation:
      Assume the attribute follows a normal distribution
      Use the data to estimate the parameters of the distribution (i.e., mean and standard deviation)
      Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)

How to Estimate Probabilities from Data?

• Normal distribution:

      P(Ai | cj) = 1 / √(2πσij²) · exp( −(Ai − µij)² / (2σij²) )

  – One for each (Ai, cj) pair
• For (Income, Class=No):
  – If Class=No: sample mean = 110, sample variance = 2975

      P(Income=120 | No) = 1 / (√(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

[Figure: estimated Income densities for Class=No and Class=Yes over Income values from 0 to about 300K.]
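A sketch of this density estimate in Python (standard library only); it plugs the slide's sample mean and variance into the normal density:

```python
import math

def normal_pdf(x, mean, variance):
    """Normal density used as the estimate of P(Ai | c) for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Income given Class=No: sample mean = 110, sample variance = 2975
print(round(normal_pdf(120, 110, 2975), 4))  # 0.0072
```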
Example of Naïve Bayes Classifier

Given a test record:

      X = (Refund = No, Married, Income = 120K)

Naïve Bayes classifier estimates from the training data:
  P(Refund=Yes | No) = 3/7          P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0           P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7
  P(Marital Status=Divorced | No) = 1/7
  P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/7
  P(Marital Status=Divorced | Yes) = 1/7
  P(Marital Status=Married | Yes) = 0
  For Taxable Income:
    If Class=No:  sample mean = 110, sample variance = 2975
    If Class=Yes: sample mean = 90,  sample variance = 25

• P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes), we have P(No|X) > P(Yes|X), so the record is classified as Class = No.

Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero.
• Probability estimation:

      Original:  P(Ai | C) = Nic / Nc
      Laplace:   P(Ai | C) = (Nic + 1) / (Nc + a)

  where a is the number of values of Ai.
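A Python sketch of this classification step (the names are ours; as on the slide, classes are compared on P(X|C)·P(C)):

```python
import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

priors = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no = {"No": 4 / 7, "Yes": 1.0}                # P(Refund=No | class)
p_married   = {"No": 4 / 7, "Yes": 0.0}                # P(Married | class)
income_params = {"No": (110, 2975), "Yes": (90, 25)}   # (mean, variance) of Income

def score(c, income=120):
    mean, var = income_params[c]
    likelihood = p_refund_no[c] * p_married[c] * normal_pdf(income, mean, var)
    return likelihood * priors[c]                      # P(X | C) * P(C)

print({c: score(c) for c in ("No", "Yes")})            # "No" wins, so X is classified as No
```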
Example of Naïve Bayes Classifier

  Name            Give Birth   Can Fly   Live in Water   Have Legs   Class
  human           yes          no        no              yes         mammals
  python          no           no        no              no          non-mammals
  salmon          no           no        yes             no          non-mammals
  whale           yes          no        yes             no          mammals
  frog            no           no        sometimes       yes         non-mammals
  komodo          no           no        no              yes         non-mammals
  bat             yes          yes       no              yes         mammals
  pigeon          no           yes       no              yes         non-mammals
  cat             yes          no        no              yes         mammals
  leopard shark   yes          no        yes             no          non-mammals
  turtle          no           no        sometimes       yes         non-mammals
  penguin         no           no        sometimes       yes         non-mammals
  porcupine       yes          no        no              yes         mammals
  eel             no           no        yes             no          non-mammals
  salamander      no           no        sometimes       yes         non-mammals
  gila monster    no           no        no              yes         non-mammals
  platypus        no           no        no              yes         mammals
  owl             no           yes       no              yes         non-mammals
  dolphin         yes          no        yes             no          mammals
  eagle           no           yes       no              yes         non-mammals

Test record:

  Give Birth   Can Fly   Live in Water   Have Legs   Class
  yes          no        yes             no          ?

A: attributes, M: mammals, N: non-mammals

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.004 × 13/20 = 0.0027

  P(A | M) P(M) > P(A | N) P(N)  =>  Mammals

Naïve Bayes (Summary)

• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• The independence assumption may not hold for some attributes
  – Use other techniques such as Bayesian Belief Networks (BBN)
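A compact Python sketch of the mammal versus non-mammal computation above, with the attribute counts taken directly from the table (a worked check, not a general implementation):

```python
from fractions import Fraction as F

# Counts of matching attribute values among the 7 mammals and 13 non-mammals
# for the test record (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no).
likelihood_m = F(6, 7) * F(6, 7) * F(2, 7) * F(2, 7)       # P(A | M)
likelihood_n = F(1, 13) * F(10, 13) * F(3, 13) * F(4, 13)  # P(A | N)

score_m = likelihood_m * F(7, 20)    # P(A | M) P(M)
score_n = likelihood_n * F(13, 20)   # P(A | N) P(N)

print(float(likelihood_m), float(likelihood_n))  # ~0.060 and ~0.0042
print(float(score_m), float(score_n))            # ~0.021 and ~0.0027
print("mammals" if score_m > score_n else "non-mammals")
```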
Naïve Bayes (Summary)

• The independence assumption may not hold for (some) attributes, but:
• If we evaluate on error rate, then all that matters, in the binary case, is whether the probability estimate is on the right side of 0.5.
• With more than two classes similar reasoning applies, but the "margin of error" becomes smaller.
• For the ROC curve, what matters is that we get the probabilities in the right order.

Example

Suppose P(Yes | A1=a1, …, An=an) = 0.7 is the true probability of class Yes for a given attribute vector.
To minimize the error rate we should classify this attribute vector as Yes.
As long as our estimate satisfies P̂(Yes | A1=a1, …, An=an) > 0.5, we will assign it to the optimal class.
The probability estimate itself may be way off!
If we evaluate on "likelihood" this doesn't fly!
Example

Joint distribution of (A1, A2) within each class, with P(C=0) = P(C=1) = 1/2:

  C=0            A2=0   A2=1   P(A1)
     A1=0        0.3    0.1    0.4
     A1=1        0.1    0.5    0.6
     P(A2)       0.4    0.6     1

  C=1            A2=0   A2=1   P(A1)
     A1=0        0.6    0.1    0.7
     A1=1        0.1    0.2    0.3
     P(A2)       0.7    0.3     1

Exact posterior:

  P(C=0 | A1=1, A2=1)
    = P(A1=1, A2=1 | C=0) P(C=0) / [ P(A1=1, A2=1 | C=0) P(C=0) + P(A1=1, A2=1 | C=1) P(C=1) ]
    = (0.5 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) = 0.25 / 0.35 ≈ 0.71

With Naive Bayes:

  P(C=0 | A1=1, A2=1)
    = P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) / [ P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) + P(A1=1 | C=1) P(A2=1 | C=1) P(C=1) ]
    = (0.6 × 0.6 × 0.5) / (0.6 × 0.6 × 0.5 + 0.3 × 0.3 × 0.5) = 0.18 / 0.225 = 0.8
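A short Python check of these two posteriors (a sketch; the joint tables are encoded as dictionaries keyed by (A1, A2)):

```python
# Joint distributions P(A1, A2 | C) from the tables above, keyed by (a1, a2).
joint = {
    0: {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5},  # C = 0
    1: {(0, 0): 0.6, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.2},  # C = 1
}
prior = {0: 0.5, 1: 0.5}

def exact_posterior(a1, a2, c):
    scores = {k: joint[k][(a1, a2)] * prior[k] for k in (0, 1)}
    return scores[c] / sum(scores.values())

def naive_posterior(a1, a2, c):
    # Naive Bayes uses the per-attribute marginals P(A1 | C) and P(A2 | C).
    def marginal(k, idx, value):
        return sum(p for cell, p in joint[k].items() if cell[idx] == value)
    scores = {k: marginal(k, 0, a1) * marginal(k, 1, a2) * prior[k] for k in (0, 1)}
    return scores[c] / sum(scores.values())

print(round(exact_posterior(1, 1, 0), 2))  # 0.71
print(round(naive_posterior(1, 1, 0), 2))  # 0.8
```

Even though naive Bayes overestimates the posterior here (0.8 versus about 0.71), both estimates put C=0 on the same side of 0.5, which is all that matters for the error rate, as the summary slide points out.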