Decision theory

CS534
Three Main Approaches to Classifier Learning
• Approach 1: Directly learn a mapping from x to the target output
  – E.g., the Perceptron, SVMs, and many others
  – Does not provide a probabilistic assessment
  – May not be appropriate when we want to explicitly consider uncertainty in our decision, e.g., medical diagnosis
  – Referred to as discriminative approaches (a minimal sketch follows below)
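To make Approach 1 concrete, here is a minimal Perceptron sketch (the function names and the `n_epochs` parameter are illustrative, not from the slides). Note that the output is a hard label with no probability attached:

```python
import numpy as np

# Minimal Perceptron sketch; assumes labels y in {-1, +1}
def perceptron_train(X, y, n_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # Misclassified if the sign of the score disagrees with the label
            if yi * (xi @ w + b) <= 0:
                w += yi * xi   # classic Perceptron update
                b += yi
                errors += 1
        if errors == 0:        # all training points correctly classified
            break
    return w, b

def perceptron_predict(X, w, b):
    # Outputs a hard label only -- no probabilistic assessment
    return np.sign(X @ w + b)
```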
• Approach 2: Learn the joint probability distribution p(x, y)
  – p(x, y) captures all of the uncertainty about x and y
  – Often achieved by learning p(x | y) and p(y), with p(x, y) = p(x | y) p(y)
  • Examples: LDA, Naïve Bayes, etc.
  • Referred to as generative approaches
    – They try to learn the generative model that is used to generate the data (see the sampling sketch below):
      • Sampling y ~ p(y)
      • Sampling x ~ p(x | y)
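A minimal sketch of sampling from a learned generative model, assuming Gaussian class-conditionals p(x | y); all parameter values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative learned parameters (not from the slides):
# class prior p(y) and Gaussian class-conditionals p(x | y)
prior = np.array([0.3, 0.7])             # p(y=0), p(y=1)
means = np.array([[0.0, 0.0],            # mean of p(x | y=0)
                  [3.0, 3.0]])           # mean of p(x | y=1)

def sample(n):
    # Step 1: sample y ~ p(y)
    ys = rng.choice(2, size=n, p=prior)
    # Step 2: sample x ~ p(x | y), here a unit-variance Gaussian
    xs = rng.normal(loc=means[ys], scale=1.0)
    return xs, ys

X, y = sample(5)
print(X, y)
```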
• Approach 3: Learn a conditional distribution p(y | x) = p(x, y) / p(x)
  – This avoids modeling the distribution of x, and just focuses on how the random variable y should behave given x
  • Example: logistic regression (a minimal sketch follows below)
  • Also referred to as discriminative approaches
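A minimal logistic-regression sketch of modeling p(y | x) directly; the weights below are arbitrary placeholders, not learned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (untrained) parameters; in practice w, b are fit by
# maximizing the conditional log-likelihood of p(y | x)
w = np.array([1.0, -2.0])
b = 0.5

def posterior(x):
    # Models p(y=1 | x) directly -- no model of p(x) is needed
    return sigmoid(x @ w + b)

print(posterior(np.array([0.2, 0.1])))   # a probability in (0, 1)
```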
Decision theory
Given p(x, y) or p(y | x), we can use decision theory to make the optimal decision considering the uncertainty about y, such that:
– the misclassification rate is minimized, or
– the expected loss is minimized

Goal 1: Minimizing misclassification rate
• We need a rule to assign each x to one of the classes
• This rule defines a decision region R_i for each class c_i, such that all points in R_i are assigned to class c_i
• Let's assume we only have two classes, c_1 and c_2
• A mistake occurs when a point belonging to c_1 is assigned to c_2, and vice versa
• The probability of this happening is:

  p(mistake) = p(x ∈ R_1, c_2) + p(x ∈ R_2, c_1)
             = ∫_{R_1} p(x, c_2) dx + ∫_{R_2} p(x, c_1) dx

• This is minimized when we assign every point x to the class c_k that has the highest p(x, c_k), or equivalently the highest p(c_k | x)
• [Figure: joint densities p(x, c_1) and p(x, c_2) with the error regions shaded red, green, and purple]
• For the current decision boundary, the probability of mistake = red + green + purple regions
• If we move the decision boundary to the point where p(x, c_1) = p(x, c_2), the red region vanishes, and the probability of mistake is minimized (see the numeric sketch below)
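A numeric sketch of this effect, with made-up Gaussian class-conditionals and priors: sweeping the boundary shows p(mistake) is smallest where the two joint densities cross:

```python
import numpy as np
from scipy.stats import norm

# Made-up example: two classes with Gaussian class-conditionals, 1-D input x
p1, p2 = 0.5, 0.5                                  # priors p(c_1), p(c_2)
f1 = lambda x: p1 * norm.pdf(x, loc=-1, scale=1)   # joint p(x, c_1)
f2 = lambda x: p2 * norm.pdf(x, loc=+2, scale=1)   # joint p(x, c_2)

xs = np.linspace(-8, 10, 20001)
dx = xs[1] - xs[0]

def p_mistake(boundary):
    # Assign x < boundary to c_1, x >= boundary to c_2: mistakes are the
    # p(x, c_2) mass left of the boundary plus the p(x, c_1) mass right of it
    left = xs < boundary
    return (f2(xs)[left].sum() + f1(xs)[~left].sum()) * dx

bs = np.linspace(-2, 3, 501)
errs = [p_mistake(b) for b in bs]
best = bs[np.argmin(errs)]
print(f"best boundary ~= {best:.2f}")   # ~= 0.5, where p(x,c_1) = p(x,c_2)
```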
Decision rule for minimizing misclassification rate

Decision rule for minimizing p(mistake):

  ŷ(x) = argmax_{c_i} p(x, c_i)

Note that since p(x, c_i) = p(c_i | x) p(x), and p(x) does not depend on c_i, it is equivalent to:

  ŷ(x) = argmax_{c_i} p(c_i | x)
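A one-line sketch of this rule, assuming we already have the posterior probabilities p(c_i | x) for each class (the array below is a made-up example):

```python
import numpy as np

# Made-up posteriors p(c_i | x) for one input x, over 3 classes
posterior = np.array([0.2, 0.5, 0.3])

# Decision rule: y_hat(x) = argmax_i p(c_i | x)
y_hat = int(np.argmax(posterior))
print(y_hat)   # -> 1, the class with the highest posterior
```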
Goal 2: Minimizing expected loss

• We often have a more complicated loss function
  – e.g., for the spam filter problem (S = spam, NS = not spam), we have the loss L(true y, ŷ):

                 true y
       ŷ       S (1)   NS (2)
       S (1)     0       10
       NS (2)    1        0
• Our goal under such scenarios will be to minimize the expected loss
• For a given x, if its true class is denoted by c_k and we assign it to c_j, the loss is L(c_k, c_j)
• The total expected loss is given by:

  E[L] = Σ_k Σ_j ∫_{R_j} L(c_k, c_j) p(x, c_k) dx

• This is minimized when we assign every point x to the class c_j that has the lowest Σ_k L(c_k, c_j) p(x, c_k), or equivalently the lowest Σ_k L(c_k, c_j) p(c_k | x)
Decision rule for minimizing loss

                 true y
       ŷ       S (1)   NS (2)
       S (1)     0       10
       NS (2)    1        0

       p(y|x)    0.6     0.4

• For a given x with p(S | x) = 0.6 and p(NS | x) = 0.4:

  Expected loss for predicting S:   0 * 0.6 + 10 * 0.4 = 4
  Expected loss for predicting NS:  1 * 0.6 + 0 * 0.4 = 0.6

• So the loss-minimizing decision is to predict NS, even though S has the higher posterior (see the sketch below)
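A short sketch of this computation using the loss matrix above (the numbers come straight from the table; the variable names are mine):

```python
import numpy as np

# Loss matrix L[true, predicted]; rows/cols ordered (S, NS), from the table
L = np.array([[0, 1],     # true S:  predict S costs 0, predict NS costs 1
              [10, 0]])   # true NS: predict S costs 10, predict NS costs 0
posterior = np.array([0.6, 0.4])   # p(S | x), p(NS | x)

# Expected loss of each prediction: sum_k L[k, j] * p(c_k | x)
expected_loss = posterior @ L
print(expected_loss)               # -> [4.0, 0.6]
print(np.argmin(expected_loss))    # -> 1, i.e. predict NS
```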
Reject option

• One could further include an option to abstain from making a prediction, which incurs a different loss
• We can easily extend the expected loss formula to consider the reject option (a sketch follows below)
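A sketch of that extension, assuming the reject action incurs a fixed loss λ (the value λ = 2 is an arbitrary illustration, not from the slides):

```python
import numpy as np

L = np.array([[0, 1],
              [10, 0]])              # loss matrix from the spam example
posterior = np.array([0.6, 0.4])     # p(S | x), p(NS | x)
reject_loss = 2.0                    # assumed fixed loss lambda for abstaining

# Expected loss of each class prediction, plus the constant reject loss
losses = np.append(posterior @ L, reject_loss)   # [4.0, 0.6, 2.0]
actions = ["predict S", "predict NS", "reject"]
print(actions[int(np.argmin(losses))])           # -> "predict NS"
```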