Discrimination and Classification

Discrimination
Situation:
We have two or more populations, π1, π2, etc. (possibly p-variate normal).
The populations are known (or we have data from each population).
We have data for a new case (population unknown), and we want to identify the population to which the new case belongs.
Examples
In each example, the two populations π1 and π2 are given first, followed by the measured variables X1, X2, … , Xp.
1. Solvent and distressed property-liability insurance companies.
   Total assets, cost of stocks and bonds, market value of stocks and bonds, loss expenses, surplus, amount of premiums written.
2. Nonulcer dyspeptics (those with stomach problems) and controls ("normal").
   Measures of anxiety, dependence, guilt, perfectionism.
3. Federalist Papers written by James Madison and those written by Alexander Hamilton.
   Frequencies of different words and lengths of sentences.
4. Good and poor credit risks.
   Income, age, number of credit cards, family size, education.
5. Successful and unsuccessful (fail to graduate) college students.
   Entrance examination scores, high-school grade-point average, number of high-school activities.
6. Purchasers and nonpurchasers of a home computer.
   Income, education, family size, previous purchase of other home computers, occupation.
7. Two species of chickweed.
   Sepal length, petal length, petal cleft depth, bract length, scarious tip length, pollen diameter.
The Basic Problem
Suppose that the data from a new case, x1, … , xp, has joint density function either
  π1: f(x1, … , xp)  or
  π2: g(x1, … , xp).
We want to make one of the decisions
  D1: classify the case in π1 (f is the correct distribution), or
  D2: classify the case in π2 (g is the correct distribution).
The Two Types of Errors
1. Misclassifying the case in π1 when it actually lies in π2.
   Let P[1|2] = P[D1|π2] = the probability of this type of error.
2. Misclassifying the case in π2 when it actually lies in π1.
   Let P[2|1] = P[D2|π1] = the probability of this type of error.
This is similar to the Type I and Type II errors in hypothesis testing.
Note:
A discrimination scheme is defined by splitting p-dimensional space into two regions:
1. C1 = the region where we make the decision D1 (the decision to classify the case in π1).
2. C2 = the region where we make the decision D2 (the decision to classify the case in π2).
There are several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification, P[2|1] and P[1|2]:
1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, say P[2|1], is at some low acceptable value α. Accept the resulting level of the other probability of misclassification, P[1|2] = β.
2. Set up the regions C1 and C2 so that the total probability of misclassification,
     P[Misclassification] = P[1] P[2|1] + P[2] P[1|2],
   is minimized, where
     P[1] = P[the case belongs to π1],
     P[2] = P[the case belongs to π2].
3. Set up the regions C1 and C2 so that the total expected cost of misclassification,
     E[Cost of Misclassification] = c[2|1] P[1] P[2|1] + c[1|2] P[2] P[1|2],
   is minimized, where P[1] and P[2] are as above and
     c[2|1] = the cost of misclassifying the case in π2 when the case belongs to π1,
     c[1|2] = the cost of misclassifying the case in π1 when the case belongs to π2.
4. Set up the regions C1 and C2 so that the two types of error are equal:
     P[2|1] = P[1|2].
Computer security:
π1: valid users
π2: imposters
P[2|1] = P[identifying a valid user as an imposter]
P[1|2] = P[identifying an imposter as a valid user]
P[1] = P[valid user]
P[2] = P[imposter]
c[2|1] = the cost of identifying the user as an imposter when the user is a valid user.
c[1|2] = the cost of identifying the user as a valid user when the user is an imposter.
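
To make criterion 3 concrete in this setting, here is a minimal numeric sketch (my own illustration: the densities, prior probabilities, and costs are invented for the example, and the minimum expected cost rule it uses, classify into π1 exactly when f(x)/g(x) ≥ (c[1|2] P[2]) / (c[2|1] P[1]), is the standard solution to criterion 3, though it is not derived in these notes).

```python
# A minimal sketch of criterion 3 (minimum expected cost of
# misclassification) for the computer-security setting.
# All numbers below are hypothetical illustrations.
from scipy.stats import norm

# Hypothetical univariate score with a different mean in each population
f = norm(loc=0.0, scale=1.0)   # pi1: valid users
g = norm(loc=3.0, scale=1.0)   # pi2: imposters

P1, P2 = 0.95, 0.05            # P[valid user], P[imposter]  (assumed)
c21, c12 = 1.0, 50.0           # c[2|1]: lock out a valid user (cheap)
                               # c[1|2]: admit an imposter   (expensive)

def decide(x):
    """Classify into pi1 iff f(x)/g(x) >= (c[1|2] P[2]) / (c[2|1] P[1]),
    the standard minimum expected cost rule for two populations."""
    if f.pdf(x) / g.pdf(x) >= (c12 * P2) / (c21 * P1):
        return "pi1 (valid user)"
    return "pi2 (imposter)"

for x in (0.5, 1.5, 2.5):
    print(x, "->", decide(x))
```

Because c[1|2] is large, the boundary moves toward the valid-user population: the rule demands fairly strong evidence of validity before admitting a user.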
This problem can be viewed as a hypothesis testing problem:
H0: π1 is the correct population
HA: π2 is the correct population
P[2|1] = α, P[1|2] = β, Power = 1 − β.
The Neyman-Pearson Lemma
Suppose that the data x1, … , xn has joint density function f(x1, … , xn; θ), where θ is either θ1 or θ2. Let
  g(x1, … , xn) = f(x1, … , xn; θ1) and
  h(x1, … , xn) = f(x1, … , xn; θ2).
We want to test
  H0: θ = θ1 (g is the correct distribution) against
  HA: θ = θ2 (h is the correct distribution).
The Neyman-Pearson Lemma states that the Uniformly Most Powerful (UMP) test of size α is to reject H0 if
\[
\lambda = \frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ \ge\ k_\alpha
\]
and accept H0 if
\[
\lambda = \frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ <\ k_\alpha ,
\]
where k_α is chosen so that the test is of size α.
Proof: Let C be the critical region of any test of size α, and let C* be the critical region of the likelihood-ratio test:
\[
C^* = \left\{ (x_1, \dots, x_n) : \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ge k_\alpha \right\},
\qquad
\int_{C^*} g(x_1, \dots, x_n)\, dx_1 \cdots dx_n = \alpha ,
\]
while
\[
\int_{C} g(x_1, \dots, x_n)\, dx_1 \cdots dx_n = \alpha .
\]
We want to show that
\[
\int_{C^*} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n \ \ge\ \int_{C} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n .
\]
Note that C* = (C* ∩ C) ∪ (C* ∩ C̄) and C = (C* ∩ C) ∪ (C̄* ∩ C).

[Venn diagram: the overlapping regions C* ∩ C, C* ∩ C̄, and C̄* ∩ C.]

Hence
\[
\int_{C^*} g \; dx_1 \cdots dx_n
= \int_{C^* \cap C} g \; dx_1 \cdots dx_n
+ \int_{C^* \cap \bar{C}} g \; dx_1 \cdots dx_n = \alpha
\]
and
\[
\int_{C} g \; dx_1 \cdots dx_n
= \int_{C^* \cap C} g \; dx_1 \cdots dx_n
+ \int_{\bar{C}^* \cap C} g \; dx_1 \cdots dx_n = \alpha .
\]
Thus
\[
\int_{C^* \cap \bar{C}} g \; dx_1 \cdots dx_n
= \int_{\bar{C}^* \cap C} g \; dx_1 \cdots dx_n .
\]
Now
\[
\int_{C^* \cap \bar{C}} h \; dx_1 \cdots dx_n
\ \ge\ k_\alpha \int_{C^* \cap \bar{C}} g \; dx_1 \cdots dx_n ,
\]
since g(x1, … , xn) ≤ (1/k_α) h(x1, … , xn) in C*, and
\[
k_\alpha \int_{\bar{C}^* \cap C} g \; dx_1 \cdots dx_n
\ \ge\ \int_{\bar{C}^* \cap C} h \; dx_1 \cdots dx_n ,
\]
since g(x1, … , xn) ≥ (1/k_α) h(x1, … , xn) in C̄*, outside C*. Thus
\[
\int_{C^* \cap \bar{C}} h \; dx_1 \cdots dx_n
\ \ge\ \int_{\bar{C}^* \cap C} h \; dx_1 \cdots dx_n
\]
and
\[
\int_{C^*} h \; dx_1 \cdots dx_n
\ \ge\ \int_{C} h \; dx_1 \cdots dx_n
\]
when we add the common quantity
\[
\int_{C^* \cap C} h \; dx_1 \cdots dx_n
\]
to both sides.
Q.E.D.
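
As a small worked illustration of the lemma (my own sketch, not from the notes): if the data are n iid N(μ, σ²) observations with σ known, and we test H0: μ = μ1 against HA: μ = μ2 with μ2 > μ1, the likelihood ratio h/g is an increasing function of the sample mean x̄, so the region h/g ≥ k_α is equivalent to x̄ ≥ c_α, and c_α can be chosen directly so that the test has size α.

```python
# Sketch: Neyman-Pearson test of H0: mu = mu1 vs HA: mu = mu2 (mu2 > mu1)
# for X1, ..., Xn iid N(mu, sigma^2) with sigma known.  The likelihood
# ratio h/g is increasing in xbar, so "h/g >= k_alpha" becomes
# "xbar >= c_alpha", with c_alpha set to give size alpha.
import math
from scipy.stats import norm

mu1, mu2, sigma, n, alpha = 0.0, 1.0, 2.0, 25, 0.05

# Cut-off on the sample mean: P[xbar >= c_alpha | mu = mu1] = alpha
c_alpha = mu1 + norm.ppf(1 - alpha) * sigma / math.sqrt(n)

# Power of the test: P[xbar >= c_alpha | mu = mu2] = 1 - beta
power = 1 - norm.cdf((c_alpha - mu2) / (sigma / math.sqrt(n)))

print(f"reject H0 when xbar >= {c_alpha:.3f}; power = {power:.3f}")
```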
Fisher's Linear Discriminant Function
Suppose that x1, … , xp is data from a p-variate normal distribution with mean vector either μ1 or μ2, and that the covariance matrix Σ is the same for both populations π1 and π2:
\[
f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\,
e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)} ,
\qquad
g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\,
e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)} .
\]
The Neyman-Pearson Lemma states that we should classify into populations π1 and π2 using the likelihood ratio (the normalizing constants cancel, since Σ is common to both populations):
\[
\lambda = \frac{f(\mathbf{x})}{g(\mathbf{x})}
= \frac{e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)}}
       {e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)}}
= e^{\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)
   - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)} .
\]
That is, make the decision D1 (the population is π1) if λ ≥ k_α, or equivalently
\[
\ln \lambda = \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)
- \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \ \ge\ \ln k_\alpha
\]
or
\[
(\mathbf{x}-\boldsymbol{\mu}_2)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)
- (\mathbf{x}-\boldsymbol{\mu}_1)'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \ \ge\ 2 \ln k_\alpha .
\]
Expanding both quadratic forms (the x'Σ⁻¹x terms cancel),
\[
\mathbf{x}'\Sigma^{-1}\mathbf{x} - 2\boldsymbol{\mu}_2'\Sigma^{-1}\mathbf{x} + \boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2
- \mathbf{x}'\Sigma^{-1}\mathbf{x} + 2\boldsymbol{\mu}_1'\Sigma^{-1}\mathbf{x} - \boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1
\ \ge\ 2 \ln k_\alpha ,
\]
and hence
\[
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}
\ \ge\ \ln k_\alpha + \tfrac{1}{2}\boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1
- \tfrac{1}{2}\boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2 .
\]
Finally, we make the decision D1 (the population is π1) if
\[
\mathbf{a}'\mathbf{x} \ \ge\ K_\alpha ,
\quad\text{where}\quad
\mathbf{a} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
\quad\text{and}\quad
K_\alpha = \ln k_\alpha + \tfrac{1}{2}\boldsymbol{\mu}_1'\Sigma^{-1}\boldsymbol{\mu}_1
- \tfrac{1}{2}\boldsymbol{\mu}_2'\Sigma^{-1}\boldsymbol{\mu}_2 .
\]
The function
\[
\mathbf{a}'\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\Sigma^{-1}\mathbf{x}
\]
is called Fisher's linear discriminant function.

[Figure: the densities of the two populations, π1 with mean μ1 and π2 with mean μ2, separated by the boundary a'x = (μ1 − μ2)'Σ⁻¹x = K_α.]
In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function becomes
\[
\hat{\mathbf{a}}'\mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x} ,
\]
where x̄1 and x̄2 are the sample mean vectors of the two samples and S is the pooled sample covariance matrix,
\[
S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2} .
\]
[Figure: a pictorial representation of Fisher's procedure for two populations; a plot of x2 against x1 in which the line â'x = K divides the plane into the region where we classify the case as π1 and the region where we classify it as π2.]
Example 1

π1: Riding-mower owners               π2: Nonowners
x1 (Income    x2 (Lot size            x1 (Income    x2 (Lot size
in $1000s)    in 1000 sq ft)          in $1000s)    in 1000 sq ft)
  20.0           9.2                    25.0           9.8
  28.5           8.4                    17.6          10.4
  21.6          10.8                    21.6           8.6
  20.5          10.4                    14.4          10.2
  29.0          11.8                    28.0           8.8
  36.7           9.6                    16.4           8.8
  36.0           8.8                    19.8           8.0
  27.6          11.2                    22.0           9.2
  23.0          10.0                    15.8           8.2
  31.0          10.4                    11.0           9.4
  17.0          11.0                    17.0           7.0
  27.0          10.0                    21.0           7.4
[Figure: scatter plot of lot size (in thousands of square feet) against income (in thousands of dollars) for the riding-mower owners and nonowners.]
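
The following sketch (my own code, not part of the original notes) computes the estimated discriminant â = S⁻¹(x̄1 − x̄2) from the riding-mower data above, with S the pooled sample covariance matrix. The cut-off used is the midpoint value K = â'(x̄1 + x̄2)/2, the choice that treats the two kinds of error symmetrically; the two "new cases" at the end are hypothetical.

```python
# Fisher's sample linear discriminant a_hat = S^{-1} (xbar1 - xbar2)
# computed from the riding-mower data of Example 1.
import numpy as np

owners = np.array([   # (income in $1000s, lot size in 1000 sq ft)
    [20.0,  9.2], [28.5,  8.4], [21.6, 10.8], [20.5, 10.4],
    [29.0, 11.8], [36.7,  9.6], [36.0,  8.8], [27.6, 11.2],
    [23.0, 10.0], [31.0, 10.4], [17.0, 11.0], [27.0, 10.0]])
nonowners = np.array([
    [25.0,  9.8], [17.6, 10.4], [21.6,  8.6], [14.4, 10.2],
    [28.0,  8.8], [16.4,  8.8], [19.8,  8.0], [22.0,  9.2],
    [15.8,  8.2], [11.0,  9.4], [17.0,  7.0], [21.0,  7.4]])

n1, n2 = len(owners), len(nonowners)
xbar1, xbar2 = owners.mean(axis=0), nonowners.mean(axis=0)

# Pooled sample covariance matrix S
S = ((n1 - 1) * np.cov(owners, rowvar=False) +
     (n2 - 1) * np.cov(nonowners, rowvar=False)) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S, xbar1 - xbar2)
K = a_hat @ (xbar1 + xbar2) / 2   # midpoint cut-off

print("a_hat =", a_hat, " K =", K)
for x in ([25.0, 9.5], [18.0, 8.0]):          # hypothetical new cases
    label = "pi1 (owner)" if a_hat @ np.array(x) >= K else "pi2 (nonowner)"
    print(x, "->", label)
```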
Example 2
Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables
• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)
are given in the following tables.
Bankrupt Firms

Firm   x1 (CF/TD)   x2 (NI/TA)   x3 (CA/CL)   x4 (CA/NS)
  1     -0.4485      -0.4106       1.0865       0.4526
  2     -0.5633      -0.3114       1.5314       0.1642
  3      0.0643       0.0156       1.0077       0.3978
  4     -0.0721      -0.0930       1.4544       0.2589
  5     -0.1002      -0.0917       1.5644       0.6683
  6     -0.1421      -0.0651       0.7066       0.2794
  7      0.0351       0.0147       1.5046       0.7080
  8     -0.6530      -0.0566       1.3737       0.4032
  9      0.0724      -0.0076       1.3723       0.3361
 10     -0.1353      -0.1433       1.4196       0.4347
 11     -0.2298      -0.2961       0.3310       0.1824
 12      0.0713       0.0205       1.3124       0.2497
 13      0.0109       0.0011       2.1495       0.6969
 14     -0.2777      -0.2316       1.1918       0.6601
 15      0.1454       0.0500       1.8762       0.2723
 16      0.3703       0.1098       1.9914       0.3828
 17     -0.0757      -0.0821       1.5077       0.4215
 18      0.0451       0.0263       1.6756       0.9494
 19      0.0115      -0.0032       1.2602       0.6038
 20      0.1227       0.1055       1.1434       0.1655
 21     -0.2843      -0.2703       1.2722       0.5128

Nonbankrupt Firms

Firm   x1 (CF/TD)   x2 (NI/TA)   x3 (CA/CL)   x4 (CA/NS)
  1      0.5135       0.1001       2.4871       0.5368
  2      0.0769       0.0195       2.0069       0.5304
  3      0.3776       0.1075       3.2651       0.3548
  4      0.1933       0.0473       2.2506       0.3309
  5      0.3248       0.0718       4.2401       0.6279
  6      0.3132       0.0511       4.4500       0.6852
  7      0.1184       0.0499       2.5210       0.6925
  8     -0.0173       0.0233       2.0538       0.3484
  9      0.2169       0.0779       2.3489       0.3970
 10      0.1703       0.0695       1.7973       0.5174
 11      0.1460       0.0518       2.1692       0.5500
 12     -0.0985      -0.0123       2.5029       0.5778
 13      0.1398      -0.0312       0.4611       0.2643
 14      0.1379       0.0728       2.6123       0.5151
 15      0.1486       0.0564       2.2347       0.5563
 16      0.1633       0.0486       2.3080       0.1978
 17      0.2907       0.0597       1.8381       0.3786
 18      0.5383       0.1064       2.3293       0.4835
 19     -0.3330      -0.0854       3.0124       0.4730
 20      0.4875       0.0910       1.2444       0.1847
 21      0.5603       0.1112       4.2918       0.4443
 22      0.2029       0.0792       1.9936       0.3018
 23      0.4746       0.1380       2.9166       0.4487
 24      0.1661       0.0351       2.4527       0.1370
 25      0.5808       0.0371       5.0594       0.1268
Examples using SPSS
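
The SPSS runs themselves are not reproduced in these notes. As a rough stand-in (my substitution, not the original SPSS session), the sketch below fits a linear discriminant to the Example 2 data with scikit-learn, using only x1 = CF/TD and x3 = CA/CL to keep the listing short; the "new firm" at the end is hypothetical.

```python
# A stand-in for the SPSS analysis: linear discriminant analysis on the
# Example 2 bankruptcy data, using only x1 = CF/TD and x3 = CA/CL.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

bankrupt = np.array([          # (x1, x3) for the 21 bankrupt firms
    [-0.4485, 1.0865], [-0.5633, 1.5314], [ 0.0643, 1.0077],
    [-0.0721, 1.4544], [-0.1002, 1.5644], [-0.1421, 0.7066],
    [ 0.0351, 1.5046], [-0.6530, 1.3737], [ 0.0724, 1.3723],
    [-0.1353, 1.4196], [-0.2298, 0.3310], [ 0.0713, 1.3124],
    [ 0.0109, 2.1495], [-0.2777, 1.1918], [ 0.1454, 1.8762],
    [ 0.3703, 1.9914], [-0.0757, 1.5077], [ 0.0451, 1.6756],
    [ 0.0115, 1.2602], [ 0.1227, 1.1434], [-0.2843, 1.2722]])
sound = np.array([             # (x1, x3) for the 25 nonbankrupt firms
    [ 0.5135, 2.4871], [ 0.0769, 2.0069], [ 0.3776, 3.2651],
    [ 0.1933, 2.2506], [ 0.3248, 4.2401], [ 0.3132, 4.4500],
    [ 0.1184, 2.5210], [-0.0173, 2.0538], [ 0.2169, 2.3489],
    [ 0.1703, 1.7973], [ 0.1460, 2.1692], [-0.0985, 2.5029],
    [ 0.1398, 0.4611], [ 0.1379, 2.6123], [ 0.1486, 2.2347],
    [ 0.1633, 2.3080], [ 0.2907, 1.8381], [ 0.5383, 2.3293],
    [-0.3330, 3.0124], [ 0.4875, 1.2444], [ 0.5603, 4.2918],
    [ 0.2029, 1.9936], [ 0.4746, 2.9166], [ 0.1661, 2.4527],
    [ 0.5808, 5.0594]])

X = np.vstack([bankrupt, sound])
y = np.array([1] * len(bankrupt) + [2] * len(sound))   # 1 = pi1, 2 = pi2

lda = LinearDiscriminantAnalysis().fit(X, y)
print("apparent error rate:", 1 - lda.score(X, y))
print("new firm (x1 = 0.1, x3 = 1.5) ->", lda.predict([[0.1, 1.5]]))
```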