
Lecture Slides for
INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 3:
BAYESIAN DECISION
THEORY
Probability and Inference




Result of tossing a coin is {Heads,Tails}
Random variable $X \in \{1, 0\}$
Bernoulli: $P\{X = x\} = p_o^x (1 - p_o)^{1 - x}$
Sample: $\mathcal{X} = \{x^t\}_{t=1}^{N}$
Estimation: $p_o = \#\{\text{Heads}\} / \#\{\text{Tosses}\} = \sum_t x^t / N$
Prediction of next toss:
Heads if $p_o > 1/2$, Tails otherwise
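A minimal numeric sketch of this estimate and prediction rule, using a simulated sample (the true heads probability 0.7 below is an assumed value, for illustration only):

```python
import numpy as np

# Simulated coin-toss sample: 1 = Heads, 0 = Tails.
# The "true" probability 0.7 is an arbitrary choice for illustration.
rng = np.random.default_rng(0)
sample = rng.binomial(1, 0.7, size=100)

# Estimate: p_o = #{Heads} / #{Tosses} = sum_t x^t / N
p_o = sample.sum() / len(sample)

# Predict the next toss: Heads if p_o > 1/2, Tails otherwise
print(f"p_o = {p_o:.2f} -> predict {'Heads' if p_o > 0.5 else 'Tails'}")
```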
Classification



Credit scoring: Inputs are income and savings.
Output is low-risk vs high-risk
Input: $\mathbf{x} = [x_1, x_2]^T$, Output: $C \in \{0, 1\}$
Prediction:
choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > 0.5$, and $C = 0$ otherwise
or equivalently
choose $C = 1$ if $P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2)$, and $C = 0$ otherwise
Bayes’ Rule
posterior = prior × likelihood / evidence:
$$P(C \mid x) = \frac{P(C)\, p(x \mid C)}{p(x)}$$
$P(C = 0) + P(C = 1) = 1$
$p(x) = p(x \mid C = 1)\, P(C = 1) + p(x \mid C = 0)\, P(C = 0)$
$P(C = 0 \mid x) + P(C = 1 \mid x) = 1$
Bayes’ Rule: K>2 Classes
px |C i P C i 
P C i | x  
px 
px |C i P C i 
 K
 px|C k PC k 
k 1
K
P C i   0 and  P C i   1
i 1
choose C i if P C i | x   max k P C k | x 
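The same rule for K classes as a vectorized sketch; the priors and likelihoods at a fixed x are assumed values:

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])           # P(C_i), sums to 1
likelihoods = np.array([0.02, 0.10, 0.05])   # p(x | C_i) at some fixed x

posteriors = likelihoods * priors
posteriors /= posteriors.sum()               # divide by the evidence p(x)

print(posteriors)
print("choose C_%d" % posteriors.argmax())   # class with the maximum posterior
```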
Losses and Risks



Actions: $\alpha_i$
Loss of $\alpha_i$ when the state is $C_k$: $\lambda_{ik}$
Expected risk (Duda and Hart, 1973)
$$R(\alpha_i \mid x) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid x)$$
choose $\alpha_i$ if $R(\alpha_i \mid x) = \min_k R(\alpha_k \mid x)$
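A sketch of risk-based action selection; the loss matrix and posterior vector below are made-up values:

```python
import numpy as np

# Rows: actions alpha_i, columns: true classes C_k (illustrative losses lambda_ik).
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])
posterior = np.array([0.8, 0.2])          # P(C_k | x)

risk = loss @ posterior                   # R(alpha_i | x) = sum_k lambda_ik P(C_k | x)
print(risk)                               # expected risk of each action
print("choose alpha_%d" % risk.argmin())  # action with minimum expected risk
```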
Losses and Risks: 0/1 Loss
$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \ne k \end{cases}$$
$$R(\alpha_i \mid x) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid x) = \sum_{k \ne i} P(C_k \mid x) = 1 - P(C_i \mid x)$$
For minimum risk, choose the most probable class
Losses and Risks: Reject
$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}, \quad 0 < \lambda < 1$$
$$R(\alpha_{K+1} \mid x) = \sum_{k=1}^{K} \lambda\, P(C_k \mid x) = \lambda$$
$$R(\alpha_i \mid x) = \sum_{k \ne i} P(C_k \mid x) = 1 - P(C_i \mid x)$$
choose $C_i$ if $P(C_i \mid x) > P(C_k \mid x)$ for all $k \ne i$ and $P(C_i \mid x) > 1 - \lambda$; reject otherwise
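A sketch of the combined decision rule with a reject action; the posterior vectors and λ below are assumed values:

```python
import numpy as np

def decide(posteriors, lam):
    """Return the index of the most probable class, or 'reject'.

    posteriors: array of P(C_i | x); lam: reject loss, with 0 < lam < 1.
    """
    i = int(np.argmax(posteriors))
    # Choose C_i only if its posterior also beats the reject threshold 1 - lam.
    return i if posteriors[i] > 1.0 - lam else "reject"

print(decide(np.array([0.55, 0.30, 0.15]), lam=0.2))  # 0.55 < 0.8 -> 'reject'
print(decide(np.array([0.90, 0.05, 0.05]), lam=0.2))  # 0.90 > 0.8 -> class 0
```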
Different Losses and Reject
[Figure: equal losses, unequal losses, with reject]
Discriminant Functions
$g_i(x)$, $i = 1, \dots, K$
choose $C_i$ if $g_i(x) = \max_k g_k(x)$
$$g_i(x) = \begin{cases} -R(\alpha_i \mid x) \\ P(C_i \mid x) \\ p(x \mid C_i)\, P(C_i) \end{cases}$$
K decision regions $\mathcal{R}_1, \dots, \mathcal{R}_K$:
$$\mathcal{R}_i = \{x \mid g_i(x) = \max_k g_k(x)\}$$
K=2 Classes


Dichotomizer (K=2) vs Polychotomizer (K>2)
$g(x) = g_1(x) - g_2(x)$
choose $C_1$ if $g(x) > 0$, $C_2$ otherwise
Log odds: $\log \dfrac{P(C_1 \mid x)}{P(C_2 \mid x)}$
Utility Theory



Probability of state $S_k$ given evidence $x$: $P(S_k \mid x)$
Utility of $\alpha_i$ when the state is $S_k$: $U_{ik}$
Expected utility:
$$EU(\alpha_i \mid x) = \sum_k U_{ik}\, P(S_k \mid x)$$
Choose $\alpha_i$ if $EU(\alpha_i \mid x) = \max_j EU(\alpha_j \mid x)$
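A sketch of utility-based choice, mirroring the expected-risk example but maximizing rather than minimizing; the utility matrix and posteriors are invented values:

```python
import numpy as np

# Rows: actions alpha_i, columns: states S_k (illustrative utilities U_ik).
utility = np.array([[100.0, -50.0],
                    [  0.0,   0.0]])
posterior = np.array([0.3, 0.7])        # P(S_k | x)

eu = utility @ posterior                # EU(alpha_i | x) = sum_k U_ik P(S_k | x)
print("choose alpha_%d" % eu.argmax())  # action with maximum expected utility
```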
Association Rules



Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to
buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
Association measures

Support (X → Y):
$$P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$
Confidence (X → Y):
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$
Lift (X → Y):
$$\frac{P(X, Y)}{P(X)\, P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$
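A small sketch computing these three measures for the rule bread → milk from a toy transaction list (the baskets are made up):

```python
# Toy baskets; each set is one customer's purchases (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "butter"},
    {"bread", "milk", "jam"},
]
X, Y = "bread", "milk"
n = len(baskets)

p_x = sum(X in b for b in baskets) / n
p_y = sum(Y in b for b in baskets) / n
p_xy = sum(X in b and Y in b for b in baskets) / n

support = p_xy
confidence = p_xy / p_x
lift = p_xy / (p_x * p_y)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```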
Apriori algorithm (Agrawal et al., 1996)



For (X, Y, Z), a 3-item set, to be frequent (have enough support), (X, Y), (X, Z), and (Y, Z) should be frequent.
If (X, Y) is not frequent, none of its supersets can be frequent.
Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ... and X → Y, Z, ...
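A compact sketch of the candidate-pruning idea (not a full Apriori implementation; the baskets and the support threshold are made-up values):

```python
from itertools import combinations

baskets = [{"bread", "milk"}, {"bread", "milk", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
min_support = 0.5
n = len(baskets)

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / n

# Level 1: frequent single items.
items = sorted({i for b in baskets for i in b})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Level k: a k-item candidate is kept only if all its (k-1)-item subsets were frequent.
k = 2
while frequent:
    prev = set(frequent)
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= min_support]
    if frequent:
        print(k, [set(c) for c in frequent])
    k += 1
```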