Multiclass Classification

Tirgul 8 (Recitation 8)
MNIST Example
• Binary Problem: 1 vs 8
MNIST Example
• Multiclass Problem – 1 vs. 7 vs. 8
Multiclass Classification
• Given:
• A training set: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
• Where:
• $x_i \in \mathbb{R}^d$ (i.e., the examples are vectors of length $d$)
• $y_i \in \mathcal{Y} = \{1, 2, \ldots, k\}$ (i.e., the labels come from a set of $k$ possible classes)
• Each training example belongs to one of k different classes.
• Goal: construct a function that, given a new data point, will correctly
predict the class to which the new point belongs.
Multiclass Classification – cont.
• Two general techniques to build a multiclass classifier:
1. Binary Reduction:
Reduction of the multiclass problem to several binary classification
problems.
2. Direct Multiclass:
Design of a single classifier that handles all $k$ classes at once.
Binary Reductions
One Against All - Intuition
[Figure: three one-against-all weight vectors $w^8$, $w^7$, $w^1$, one per digit class]
One Against All
• The One-Against-All method:
• Based on a reduction of the multiclass problem into k binary problems
• Each problem discriminates between one class and all the rest.
One Against All
• The Method:
• We now have k binary problems:
• Binary problem #1:
• Assign label +1 to all examples labeled 1
• Assign label -1 to all examples labeled 2, 3, …, k.
• The resulting weight vector is denoted $w^1$
• Binary problem #2:
• Assign label +1 to all examples labeled 2
• Assign label -1 to all examples labeled 1, 3, 4, …, k.
• The resulting weight vector is denoted $w^2$
• And so on…
One Against All
• … we end up with k weight vectors.
• Given a new instance x, predict the label that gets the highest
confidence:
$\hat{y} = \arg\max_{y \in \mathcal{Y}} \; w^y \cdot x$
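As a concrete illustration (not part of the original slides), here is a minimal Python/numpy sketch of One-Against-All, assuming labels 1,…,k and using a plain binary Perceptron as the underlying binary learner; any binary learner that returns a weight vector could be substituted.

```python
import numpy as np

def train_binary_perceptron(X, y_pm1, epochs=10):
    """A simple binary Perceptron; y_pm1 holds labels in {-1, +1}.
    Stands in for any binary learner that returns a weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, y_pm1):
            if np.sign(w @ x) != y:      # mistake -> Perceptron update
                w += y * x
    return w

def train_one_vs_all(X, y, k):
    """Train k binary problems: class r gets label +1, all other classes -1."""
    W = np.zeros((k, X.shape[1]))
    for r in range(1, k + 1):
        y_pm1 = np.where(y == r, 1.0, -1.0)
        W[r - 1] = train_binary_perceptron(X, y_pm1)
    return W

def predict_one_vs_all(W, x):
    """Predict the label whose weight vector gives the highest confidence w^y . x."""
    return int(np.argmax(W @ x)) + 1     # labels are 1..k
```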
Example
[Figure: classification example using the one-against-all weight vectors $w^8$, $w^7$, $w^1$]
All Pairs - Intuition
All Pairs
• In this approach, all pairs of classes are compared to each other.
• The Method:
• We now have $\binom{k}{2} = k(k-1)/2$ binary problems:
• Binary problem #1: 1 vs 7
• Assign label +1 to all examples labeled 1
• Assign label -1 to all examples labeled 7
• The resulting weight vector is denoted $w^{17}$
• Binary problem #2: 1 vs 8
• Assign label +1 to all examples labeled 1
• Assign label -1 to all examples labeled 8
• The resulting weight vector is denoted $w^{18}$
• And so on…
All Pairs
• … we end up with $k(k-1)/2$ weight vectors.
• Given a new instance x, predict the label that gets the highest
confidence:
$\hat{y} = \arg\max_{y \in \mathcal{Y}} \sum_{r \neq y} w^{yr} \cdot x$
(where $w^{yr} = -w^{ry}$, since each pair is trained only once)
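A sketch of the All-Pairs reduction in the same style (again an illustration, not from the slides): each pair $(r_1, r_2)$ with $r_1 < r_2$ is trained once, and $w^{r_2 r_1} = -w^{r_1 r_2}$ is used implicitly at prediction time. `train_binary` can be the Perceptron trainer from the previous sketch.

```python
import numpy as np
from itertools import combinations

def train_all_pairs(X, y, k, train_binary):
    """One weight vector per pair (r1, r2): class r1 is labeled +1, class r2 is -1.
    Only the examples of those two classes participate in each binary problem."""
    W = {}
    for r1, r2 in combinations(range(1, k + 1), 2):
        mask = (y == r1) | (y == r2)
        y_pm1 = np.where(y[mask] == r1, 1.0, -1.0)
        W[(r1, r2)] = train_binary(X[mask], y_pm1)
    return W

def predict_all_pairs(W, x, k):
    """Predict the label with the highest total confidence over its pairs."""
    scores = np.zeros(k)
    for (r1, r2), w in W.items():
        s = w @ x
        scores[r1 - 1] += s   # w^{r1 r2} votes for r1 ...
        scores[r2 - 1] -= s   # ... and against r2 (i.e., w^{r2 r1} = -w^{r1 r2})
    return int(np.argmax(scores)) + 1
```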
All Pairs - Example
Error Correction Output Codes (ECOC)
• The Method:
• Define a coding matrix $M \in \{-1, 0, +1\}^{k \times l}$, where:
• $k$ is the number of classes
• $l$ is the number of hypotheses (binary problems)
• Entries $M(r, s) = 0$ indicate that we don't care how hypothesis $f_s$ classifies examples labeled $r$.
• Given a learning algorithm A:
• For $s = 1, \ldots, l$, algorithm A is provided with labeled data of the form $(x_i, M(y_i, s))$ for all examples in the training set (except those with $M(y_i, s) = 0$).
• Algorithm A uses this data to generate $l$ hypotheses $f_1, \ldots, f_l$.
• Classify an input $x$ as the class whose row in $M$ is closest to the vector of binary predictions $\mathrm{sign}(f(x))$.
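The ECOC training loop can be sketched as follows (an illustration under the same assumptions as the earlier sketches: labels 1,…,k and some binary learner `train_binary` that accepts ±1 labels); note how examples with a zero matrix entry are dropped for that column.

```python
import numpy as np

def train_ecoc(X, y, M, train_binary):
    """Train one binary hypothesis per column of the coding matrix M (shape k x l).
    Column s relabels example (x_i, y_i) as M[y_i - 1, s]; zero ("don't care")
    entries are skipped for that column."""
    k, l = M.shape
    hypotheses = []
    for s in range(l):
        relabel = M[y - 1, s]        # entries in {-1, 0, +1}
        mask = relabel != 0
        hypotheses.append(train_binary(X[mask], relabel[mask]))
    return hypotheses
```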
ECOC
• Suppose we choose a 𝑘 × 𝑘 matrix 𝑀:
• All diagonal elements are +1
• All other elements are -1
For example, with $k = 3$:
$M = \begin{pmatrix} +1 & -1 & -1 \\ -1 & +1 & -1 \\ -1 & -1 & +1 \end{pmatrix}$
• … this is the One-Against-All approach.
ECOC
• The two approaches of One-Against-All and All-Pairs have been
unified under the framework of ECOC.
• All-Pairs Approach:
• $M$ is a $k \times \binom{k}{2}$ matrix
• Each column corresponds to a distinct pair $(r_1, r_2)$.
• For this column, 𝑀 has:
• +1 in row 𝑟1
• -1 in row 𝑟2
• Zeros in all other rows
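For illustration, building this All-Pairs coding matrix in numpy takes a few lines (the column order below is one arbitrary choice):

```python
import numpy as np
from itertools import combinations

def all_pairs_coding_matrix(k):
    """k x C(k,2) coding matrix: each column has +1 in row r1, -1 in row r2,
    and zeros in every other row."""
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for s, (r1, r2) in enumerate(pairs):
        M[r1, s] = +1
        M[r2, s] = -1
    return M
```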
ECOC Example
• Given:
• An input example 𝑥
• 𝑙 binary classifiers
• Suppose the outputs of the binary classifiers are as follows:
• $f_1(x) = 0.5$
• $f_2(x) = -7$
• $f_3(x) = -1$
• $f_4(x) = -2$
• $f_5(x) = -10$
• $f_6(x) = -12$
• $f_7(x) = 9$
⇒ In short, the vector of predictions is:
$f(x) = [0.5, -7, -1, -2, -10, -12, 9]$
ECOC Example
• Let the coding matrix $M$ be (rows = classes, columns = classifiers):
$M = \begin{array}{c|ccccccc} & f_1 & f_2 & f_3 & f_4 & f_5 & f_6 & f_7 \\ \hline \text{class \#1} & -1 & 0 & -1 & -1 & +1 & -1 & -1 \\ \text{class \#2} & +1 & -1 & 0 & +1 & +1 & +1 & -1 \\ \text{class \#3} & +1 & 0 & -1 & -1 & -1 & +1 & +1 \\ \text{class \#4} & -1 & -1 & +1 & 0 & -1 & -1 & +1 \end{array}$
• The vector of predictions was:
$f(x) = [0.5, -7, -1, -2, -10, -12, 9]$
$\mathrm{sign}(f(x)) = [+1, -1, -1, -1, -1, -1, +1]$
ECOC Example
• Given the predictions of 𝑓 𝑥 on test point 𝑥:
• Which of the 𝑘 labels in 𝒴 should be predicted?
• The basic idea:
• Predict the label 𝑟 whose row 𝑀(𝑟) is “closest” to the predictions 𝑓 𝑥 .
• In other words:
• Predict the label 𝑟 that minimizes: 𝑑(𝑀 𝑟 , 𝑓(𝑥)), for some distance function 𝑑.
ECOC Example
• How do we measure the distance between the two vectors?
• Method #1:
• Count the number of positions $s$ in which the sign of the prediction $f_s(x)$ differs from the matrix entry $M(r, s)$.
• Called: Hamming Decoding
ECOC Example
• Hamming Decoding:
$d_H(M(r), f(x)) = \sum_{s=1}^{l} \frac{1 - \mathrm{sign}(M(r, s) \cdot f_s(x))}{2}$
where
$\mathrm{sign}(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \end{cases}$
• Simply put, position $s$ contributes:
$\begin{cases} 0 & \text{if } M(r, s) = \mathrm{sign}(f_s(x)) \text{ and both are nonzero} \\ 1 & \text{if } M(r, s) \neq \mathrm{sign}(f_s(x)) \text{ and both are nonzero} \\ 1/2 & \text{if } M(r, s) = 0 \text{ or } f_s(x) = 0 \end{cases}$
ECOC Example
• For an instance 𝑥 and a matrix 𝑀, the predicted label is:
$\hat{y} = \arg\min_{r \in \mathcal{Y}} \; d_H(M(r), f(x))$
• A disadvantage:
• Hamming decoding ignores the magnitude of the predictions, which is often an indication of the classifier's confidence.
ECOC Example
(Recall: $d_H(M(r), f(x)) = \sum_{s=1}^{l} \frac{1 - \mathrm{sign}(M(r, s) \cdot f_s(x))}{2}$)
• Back to our example…
$M = \begin{pmatrix} -1 & 0 & -1 & -1 & +1 & -1 & -1 \\ +1 & -1 & 0 & +1 & +1 & +1 & -1 \\ +1 & 0 & -1 & -1 & -1 & +1 & +1 \\ -1 & -1 & +1 & 0 & -1 & -1 & +1 \end{pmatrix}, \quad f(x) = [0.5, -7, -1, -2, -10, -12, 9]$

$d_H(M(1), f(x)) = 3.5$
$d_H(M(2), f(x)) = 4.5$
$d_H(M(3), f(x)) = \mathbf{1.5}$
$d_H(M(4), f(x)) = 2.5$
• Hence, our prediction here is class 3 (the minimal Hamming distance).
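To double-check the arithmetic, here is a short numpy sketch of Hamming decoding applied to the coding matrix and predictions of this example; it reproduces the distances 3.5, 4.5, 1.5, 2.5 and therefore predicts class 3.

```python
import numpy as np

def hamming_decode(M, f_x):
    """Per position: 0 on a sign match, 1 on a mismatch, 1/2 when either the
    matrix entry or the prediction is zero (sign(0) = 0 handles both cases)."""
    return ((1 - np.sign(M * f_x)) / 2).sum(axis=1)

M = np.array([[-1,  0, -1, -1, +1, -1, -1],
              [+1, -1,  0, +1, +1, +1, -1],
              [+1,  0, -1, -1, -1, +1, +1],
              [-1, -1, +1,  0, -1, -1, +1]])
f_x = np.array([0.5, -7, -1, -2, -10, -12, 9])

d_H = hamming_decode(M, f_x)
print(d_H)                  # [3.5 4.5 1.5 2.5]
print(np.argmin(d_H) + 1)   # -> 3, i.e. class 3
```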
ECOC Example
• How do we measure the distance between the two vectors?
• Recall Hamming’s disadvantage:
• Ignores the magnitude of the predictions, which is often an indication of the classifier's confidence.
• Method #2:
• Takes into account:
• The magnitude of predictions
• A relevant loss function L
• The idea:
• Choose the label r that minimizes the loss on example x.
• Called loss-based decoding
ECOC Example
• Loss-based decoding:
$d_L(M(r), f(x)) = \sum_{s=1}^{l} L(M(r, s) \cdot f_s(x))$
• The loss:
$L(z) = \begin{cases} \max\{0, 1 - z\} & \text{for SVM (hinge loss)} \\ \exp(-z) & \text{for Boosting (exponential loss)} \\ \log(1 + \exp(-z)) & \text{for Logistic Regression (logistic loss)} \end{cases}$
ECOC Example
• For an instance 𝑥 and a matrix 𝑀, the predicted label is:
$\hat{y} = \arg\min_{r \in \mathcal{Y}} \; d_L(M(r), f(x))$
ECOC Example
(Recall: $d_L(M(r), f(x)) = \sum_{s=1}^{l} L(M(r, s) \cdot f_s(x))$, with $L(z) = \exp(-z)$ here)
• Back to our example…
$M = \begin{pmatrix} -1 & 0 & -1 & -1 & +1 & -1 & -1 \\ +1 & -1 & 0 & +1 & +1 & +1 & -1 \\ +1 & 0 & -1 & -1 & -1 & +1 & +1 \\ -1 & -1 & +1 & 0 & -1 & -1 & +1 \end{pmatrix}, \quad f(x) = [0.5, -7, -1, -2, -10, -12, 9]$

$d_L(M(1), f(x)) = \exp(-(-1 \cdot 0.5)) + \exp(-(0 \cdot (-7))) + \ldots \approx 30{,}132$
$d_L(M(2), f(x)) \approx 192{,}893$
$d_L(M(3), f(x)) \approx 162{,}757$
$d_L(M(4), f(x)) \approx \mathbf{5.4}$
• Hence, our prediction here is class 4 (the minimal loss-based distance).
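A matching sketch of loss-based decoding (illustration only): the three losses from the earlier slide are listed, and the exponential loss is used here, which again selects class 4 for this example. Note that zero matrix entries simply contribute $L(0)$ under this formula.

```python
import numpy as np

LOSSES = {
    "hinge":    lambda z: np.maximum(0.0, 1.0 - z),   # SVM
    "exp":      lambda z: np.exp(-z),                  # Boosting
    "logistic": lambda z: np.log1p(np.exp(-z)),        # Logistic Regression
}

def loss_based_decode(M, f_x, loss):
    """Sum the loss of M(r, s) * f_s(x) over all positions s, for every row r."""
    return loss(M * f_x).sum(axis=1)

M = np.array([[-1,  0, -1, -1, +1, -1, -1],
              [+1, -1,  0, +1, +1, +1, -1],
              [+1,  0, -1, -1, -1, +1, +1],
              [-1, -1, +1,  0, -1, -1, +1]])
f_x = np.array([0.5, -7, -1, -2, -10, -12, 9])

d_L = loss_based_decode(M, f_x, LOSSES["exp"])
print(np.argmin(d_L) + 1)   # -> 4, i.e. class 4
```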
Multiclass
Multiclass Perceptron
• A direct extension of the binary Perceptron.
Binary Perceptron:
$\hat{y} = \mathrm{sign}(w \cdot x_i)$
if $\hat{y} \neq y_i$: $\; w \leftarrow w + y_i x_i$

Extended Binary Perceptron:
$\hat{y} = \arg\max_{y \in \{-1, +1\}} w^y \cdot x_i$
if $\hat{y} \neq y_i$:
$\quad w^{y_i} \leftarrow w^{y_i} + x_i$
$\quad w^{\hat{y}} \leftarrow w^{\hat{y}} - x_i$

Both algorithms are mathematically the same, where $w^{+1} = -w^{-1}$.
Multiclass Perceptron
• Extended Binary Perceptron: the update rule
[Figure: after a mistake on example $x$, $w^{+1}$ is updated to $w^{+1} + x$ and $w^{-1}$ to $w^{-1} - x$]
Multiclass Perceptron
• Extension of Perceptron to the multiclass case:
$\hat{y} = \arg\max_{y \in \{1, \ldots, k\}} w^y \cdot x_i$
if $\hat{y} \neq y_i$: $\quad w^r \leftarrow w^r + \tau_r x_i \quad$ (for every class $r$)
Where:
$\tau_r = \begin{cases} +1 & \text{if } r = y_i \\ -1 & \text{if } r = \hat{y} \\ 0 & \text{otherwise} \end{cases}$
Here, we update only two classes:
• The correct class
• The predicted class
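A minimal numpy sketch of this multiclass Perceptron (an illustration, assuming labels 1,…,k):

```python
import numpy as np

def multiclass_perceptron(X, y, k, epochs=10):
    """On a mistake, add x_i to the weight vector of the correct class (tau = +1)
    and subtract it from the weight vector of the predicted class (tau = -1);
    all other classes are left unchanged (tau = 0)."""
    W = np.zeros((k, X.shape[1]))          # one weight vector per class (rows)
    for _ in range(epochs):
        for x, yi in zip(X, y):
            y_hat = int(np.argmax(W @ x)) + 1
            if y_hat != yi:
                W[yi - 1]    += x
                W[y_hat - 1] -= x
    return W
```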
Multiclass Perceptron
• A different update rule:
$\hat{y} = \arg\max_{y \in \{1, \ldots, k\}} w^y \cdot x_i$
if $\hat{y} \neq y_i$: $\quad w^r \leftarrow w^r + \tau_r x_i \quad$ (for every class $r$)
Where:
$E_{y_i} = \{r \neq y_i : w^r \cdot x_i > w^{y_i} \cdot x_i\}$
$\tau_r = \begin{cases} +1 & \text{if } r = y_i \\ -\frac{1}{|E_{y_i}|} & \text{if } r \in E_{y_i} \\ 0 & \text{otherwise} \end{cases}$
Here, we update all classes that have a
higher score than the correct one.
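Only the computation of $\tau$ changes for this variant; a sketch under the same assumptions as above, where the full update would be `W += np.outer(tau, x)`:

```python
import numpy as np

def tau_uniform_over_errors(W, x, yi):
    """Spread the negative update uniformly over E_{y_i}: every class whose
    score is higher than that of the correct class y_i."""
    scores = W @ x
    E = np.flatnonzero(scores > scores[yi - 1])   # classes r with w^r . x > w^{y_i} . x
    tau = np.zeros(W.shape[0])
    if len(E) > 0:                                # a mistake was made
        tau[yi - 1] = +1.0
        tau[E] = -1.0 / len(E)
    return tau
```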
Multiclass SVM
• The model is of the following term:
$\hat{y} = \arg\max_{y \in \{1, \ldots, k\}} w^y \cdot x$
Multiclass SVM
• We want to find the weight vectors $w^1, \ldots, w^k$ such that the probability of an error $\hat{y} \neq y$ is minimized:
$\{w^r\} = \arg\min_{\{w^r\}} P(\hat{y} \neq y) = \arg\min_{\{w^r\}} \mathbb{E}\left[\delta(\hat{y} \neq y)\right]$
• Replace the expectation with the average over the training set, and add regularization:
$\{w^r\} = \arg\min_{\{w^r\}} \frac{1}{m} \sum_{i=1}^{m} \delta(\hat{y}_i \neq y_i) + \frac{\lambda}{2} \sum_{r=1}^{k} \|w^r\|^2$
Multiclass SVM
• SVM works by upper-bounding the loss $\delta(\hat{y} \neq y)$ with a surrogate hinge loss:
$\delta(\hat{y} \neq y) = \delta(\hat{y} \neq y) - w^y \cdot x + w^y \cdot x$
$\leq \delta(\hat{y} \neq y) - w^y \cdot x + \max_{y'}\left[w^{y'} \cdot x\right]$
$\leq \max_{y'}\left[\delta(y' \neq y) - w^y \cdot x + w^{y'} \cdot x\right]$
Multiclass SVM
• Multiclass SVM optimization problem:
$\{w^r\} = \arg\min_{\{w^r\}} \frac{1}{m} \sum_{i=1}^{m} \max_{y'}\left[\delta(y' \neq y_i) - w^{y_i} \cdot x_i + w^{y'} \cdot x_i\right] + \frac{\lambda}{2} \sum_{r=1}^{k} \|w^r\|^2$
Multiclass SVM
• The Algorithm (with Stochastic Gradient Descent):
$y' = \arg\max_{y' \in \{1, \ldots, k\}} \left[\delta(y' \neq y_i) + w^{y'} \cdot x_i\right]$
if $\delta(y' \neq y_i) - w^{y_i} \cdot x_i + w^{y'} \cdot x_i > 0$:
$\quad w^{y_i} \leftarrow (1 - \lambda \eta_t)\, w^{y_i} + \eta_t x_i$
$\quad w^{y'} \leftarrow (1 - \lambda \eta_t)\, w^{y'} - \eta_t x_i$
$\quad w^{y} \leftarrow (1 - \lambda \eta_t)\, w^{y} \quad$ for $y \neq y', y_i$
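A compact sketch of this SGD loop (illustration only; the step size $\eta_t = 1/(\lambda t)$ is an assumption, since the slide does not specify a schedule):

```python
import numpy as np

def multiclass_svm_sgd(X, y, k, lam=0.01, epochs=10):
    """Stochastic gradient descent on the multiclass SVM objective above."""
    W = np.zeros((k, X.shape[1]))          # one weight vector per class
    t = 0
    for _ in range(epochs):
        for x, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)          # step size (assumed schedule)
            delta = np.ones(k)
            delta[yi - 1] = 0.0            # delta(y' != y_i)
            scores = delta + W @ x
            yp = int(np.argmax(scores))    # index of the most violating label y'
            # hinge term: delta(y' != y_i) - w^{y_i}.x_i + w^{y'}.x_i
            if scores[yp] - W[yi - 1] @ x > 0:
                W *= (1 - lam * eta)       # shrink every class (regularization)
                W[yi - 1] += eta * x       # push the correct class up
                W[yp]     -= eta * x       # push the violating class down
    return W
```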
Thank You