Linear Classifier
Team teaching
Linear Methods for Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?
[Figure: piecewise linear decision boundaries in 2D input space (features X1, X2), partitioning the space into regions R1–R4]
Linear Classification…
• There is a discriminant function δk(x) for each class k
• Classification rule:
$$R_k = \{x : k = \arg\max_j \delta_j(x)\}$$
• In higher dimensional space the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule:
$$R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$$
• So, $\Pr(G = k \mid X = x)$ can serve as δk(x)
Linear Classification…
• All we require here is that the class boundaries {x : δk(x) = δj(x)} be linear for every (k, j) pair
• One can achieve this if the δk(x) themselves are linear, or if some monotone transform of δk(x) is linear
– An example:
$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$
so that
$$\log\left[\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x \quad \text{(linear in } x\text{)}$$
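A quick numerical check of this example (a minimal sketch; the values of β0 and β below are made up for illustration):

```python
import numpy as np

# Hypothetical coefficients for the logistic example above
beta0, beta = 0.5, np.array([1.0, -2.0])

def posteriors(x):
    """Posterior probabilities of the two classes under the logistic model."""
    z = beta0 + beta @ x
    return np.exp(z) / (1 + np.exp(z)), 1 / (1 + np.exp(z))

x = np.array([0.3, 1.2])
p1, p2 = posteriors(x)
# The log-odds recovers the linear function beta0 + beta^T x
print(np.log(p1 / p2), beta0 + beta @ x)   # both equal -1.6
```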
Linear Discriminant Analysis
Essentially minimum error Bayes’ classifier
Assumes that the conditional class densities are (multivariate) Gaussian
Assumes equal covariance for every class
Posterior probability (application of Bayes' rule):
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
πk is the prior probability for class k
fk(x) is the class conditional density (likelihood):
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$
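A minimal sketch of these two formulas in code (the class means, shared covariance, and priors below are made-up placeholders, not quantities estimated from any data set):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian likelihood f_k(x) with mean mu and covariance Sigma."""
    p = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def posterior(x, mus, Sigma, priors):
    """Pr(G = k | X = x) via Bayes' rule, with one covariance shared by all classes."""
    likelihoods = np.array([gaussian_density(x, mu, Sigma) for mu in mus])
    unnormalized = likelihoods * priors
    return unnormalized / unnormalized.sum()

# Hypothetical two-class, two-feature setup
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = np.array([0.6, 0.4])
print(posterior(np.array([1.0, 0.5]), mus, Sigma, priors))
```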
LDA…
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)}$$
$$= \left(\log\pi_k + x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k\right) - \left(\log\pi_l + x^T\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l\right)$$
$$= \delta_k(x) - \delta_l(x)$$
Classification rule:
$$\hat{G}(x) = \arg\max_k \delta_k(x)$$
is equivalent to:
$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$
The good old Bayes classifier!
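A minimal sketch of the resulting classifier (it assumes the means μk, shared covariance Σ, and priors πk are already known or estimated; the function names are placeholders):

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, prior_k):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(prior_k)

def classify(x, mus, Sigma, priors):
    """Assign x to the class with the largest discriminant value delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [lda_discriminant(x, mu, Sigma_inv, p) for mu, p in zip(mus, priors)]
    return int(np.argmax(scores))
```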
LDA…
When are we going to use the training data?
Training data: $(g_i, x_i),\ i = 1:N$ (a total of N input-output pairs, with Nk pairs in class k and K classes in all)
The training data are used to estimate:
Prior probabilities: $\hat{\pi}_k = N_k / N$
Means: $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$
Covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
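A minimal sketch of these estimates in NumPy (X is an N×p feature matrix and g a length-N vector of class labels; both names are placeholders, not notation from the slides):

```python
import numpy as np

def fit_lda(X, g):
    """Estimate LDA parameters: priors, class means, and pooled covariance."""
    classes = np.unique(g)
    N, K = len(g), len(classes)
    priors = np.array([np.mean(g == k) for k in classes])          # N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in classes])    # mu_k
    # Pooled within-class covariance, normalized by N - K
    Sigma = sum(
        (X[g == k] - mu).T @ (X[g == k] - mu) for k, mu in zip(classes, means)
    ) / (N - K)
    return classes, priors, means, Sigma
```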
LDA: Example
LDA was able to avoid masking here
Study case
• Factory “ABC” produces very expensive and
high quality chip rings that their qualities are
measured in term of curvature and diameter.
Result of quality control by experts is given in
the table below.
Curvature  Diameter  Quality Control Result
2.95       6.63      Passed
2.53       7.79      Passed
3.57       5.65      Passed
3.16       5.47      Passed
2.58       4.46      Not passed
2.16       6.22      Not passed
3.27       3.52      Not passed
• As a consultant to the factory, your task is to set up the criteria for automatic quality control. The manager of the factory also wants to test your criteria on a new type of chip ring about which even the human experts disagree. The new chip ring has curvature 2.81 and diameter 5.46.
• Can you solve this problem by employing Discriminant Analysis?
Solutions
• When we plot the features, we can see that the data are linearly separable: we can draw a line to separate the two groups. The problem is to find that line, rotating the features in such a way as to maximize the distance between groups and minimize the distance within each group.
• X = features (or independent variables) of all data. Each row (denoted by xk) represents one object; each column stands for one feature.
• Y = group of the object (or dependent variable) of all data. Each row represents one object, and it has only one column.
x =
2.95  6.63
2.53  7.79
3.57  5.65
3.16  5.47
2.58  4.46
2.16  6.22
3.27  3.52

y =
1
1
1
1
2
2
2
• Xk = data of row k, for example x3 = (3.57, 5.65)
• g = number of groups in y; in our example, g = 2
• Xi = features data for group i. Each row represents one object; each column stands for one feature. We separate x into groups based on the categories in y.
x1 =
2.95  6.63
2.53  7.79
3.57  5.65
3.16  5.47

x2 =
2.58  4.46
2.16  6.22
3.27  3.52
• μi = mean of features in group i, which is the average of xi
• μ1 = (3.05, 6.38), μ2 = (2.67, 4.73)
• μ = global mean vector, that is, the mean of the whole data set. In this example, μ = (2.88, 5.676)
• xi0 = mean corrected data, that is, the features data for group i (xi) minus the global mean vector μ

x10 =
 0.060   0.951
-0.357   2.109
 0.679  -0.025
 0.269  -0.209

x20 =
-0.305  -1.218
-0.732   0.547
 0.386  -2.155
Covariance matrix of group i:
$$c_i = \frac{(x_i^0)^T x_i^0}{n_i}$$

C1 =
 0.166  -0.192
-0.192   1.349

C2 =
 0.259  -0.286
-0.286   2.142
Pooled within-group covariance matrix:
$$C(r,s) = \frac{1}{n}\sum_{i=1}^{g} n_i\, c_i(r,s)$$
It is calculated entry by entry. In our example,
4/7 * 0.166 + 3/7 * 0.259 = 0.206,
4/7 * (-0.192) + 3/7 * (-0.286) = -0.233, and
4/7 * 1.349 + 3/7 * 2.142 = 1.689, therefore

C =
 0.206  -0.233
-0.233   1.689
The inverse of the covariance matrix is:

C-1 =
5.745  0.791
0.791  0.701
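These matrices can be reproduced with a few lines of NumPy (a sketch using the group data x1 and x2 defined above; differences in the last digit are rounding):

```python
import numpy as np

x1 = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47]])
x2 = np.array([[2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])
mu = np.vstack([x1, x2]).mean(axis=0)            # global mean vector

# Group covariances of the mean-corrected data (deviations from the GLOBAL
# mean, as in this worked example), each divided by its group size n_i
C1 = (x1 - mu).T @ (x1 - mu) / len(x1)
C2 = (x2 - mu).T @ (x2 - mu) / len(x2)

# Pooled within-group covariance and its inverse
C = (len(x1) * C1 + len(x2) * C2) / (len(x1) + len(x2))
print(C1, C2, C, np.linalg.inv(C), sep="\n\n")
```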
• P = prior probability vector (each row represents the prior probability of one group). If we do not know the prior probabilities, we simply take the number of samples in each group divided by the total number of samples:

p =
0.571 (= 4/7)
0.429 (= 3/7)
• Discriminant function:
$$f_i = \mu_i C^{-1} x_k^T - \frac{1}{2}\mu_i C^{-1}\mu_i^T + \ln(p_i)$$
• We should assign object k to the group i that has the maximum fi
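As a rough end-to-end check, the whole worked example can be scripted (a sketch in NumPy; it recomputes the quantities above and scores the new chip ring with curvature 2.81 and diameter 5.46, assigning it to whichever group has the larger fi):

```python
import numpy as np

# Training data from the table: columns are (curvature, diameter)
X = np.array([[2.95, 6.63], [2.53, 7.79], [3.57, 5.65], [3.16, 5.47],
              [2.58, 4.46], [2.16, 6.22], [3.27, 3.52]])
y = np.array([1, 1, 1, 1, 2, 2, 2])              # 1 = Passed, 2 = Not passed

groups = [X[y == g] for g in (1, 2)]
mu_g = [grp.mean(axis=0) for grp in groups]      # group means mu_1, mu_2
mu = X.mean(axis=0)                              # global mean

# Pooled within-group covariance (deviations from the global mean, as above)
C = sum((grp - mu).T @ (grp - mu) for grp in groups) / len(X)
C_inv = np.linalg.inv(C)
priors = np.array([len(grp) / len(X) for grp in groups])

def f(x, mu_i, p_i):
    """Discriminant score f_i = mu_i C^-1 x^T - 1/2 mu_i C^-1 mu_i^T + ln(p_i)."""
    return mu_i @ C_inv @ x - 0.5 * mu_i @ C_inv @ mu_i + np.log(p_i)

x_new = np.array([2.81, 5.46])                   # the disputed chip ring
scores = [f(x_new, m, p) for m, p in zip(mu_g, priors)]
print(scores, "-> assign to group", int(np.argmax(scores)) + 1)
```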
Assignment
• Use Excel/MATLAB/other tools to classify the breast tissue data set with:
• Naïve Bayes
• LDA
Present next week.