Quadratic Discriminant Analysis

Xiaojun Qi

Quadratic Discriminant Analysis (QDA)
• A quadratic classifier is used in machine
learning to separate measurements of two or
more classes of objects or events by a quadric
surface. It is a more general version of the linear
classifier.
• In this lecture, I will start with the naïve Bayes
classifier (a linear discriminant analysis) and
conclude with the derivation and special cases
of the QDA (a generalized Bayes classifier).
Naïve Bayes Classifier:
Assumptions
• Naïve Bayes classifiers assume that the effect of
a variable value on a given class is independent
of the values of other variables. This assumption
is called class conditional independence. It is
made to simplify the computation and in this
sense is considered "naïve".
• This is a fairly strong assumption that often does
not hold in practice. However, the resulting bias in
the estimated probabilities often makes little
difference: it is the order of the probabilities, not
their exact values, that determines the
classification.
Naïve Bayes Classifier
Formulation: Example
Naïve Bayes Classifier:
Applications
• Studies comparing classification
algorithms have found the Naïve Bayesian
classifier to be comparable in performance
with classification trees and with neural
network classifiers. They have also
exhibited high accuracy and speed when
applied to large databases.
Posterior vs. Prior Probability
• Suppose your data consist of fruits,
described by their color and
shape. Bayesian classifiers operate by
saying "If you see a fruit that is red and
round, which type of fruit is it most likely
to be, based on the observed data
sample? In future, classify red and round
fruit as that type of fruit."
• Red, round → Apple Class
• Let X be the data record whose class label
is unknown.
• Let H be some hypothesis, such as “data
record X belongs to a specified class C.”
• For classification, we want to determine P
(H|X) -- the probability that the hypothesis
H holds, given the observed data record X.
Posterior vs. Prior Probability
(Cont.)
• P (H|X) is the posterior probability of H
conditioned on X.
• In contrast, P(H) is the prior probability, or a priori
probability, of H.
• In the fruit example, P(H) is the probability that
any given data record is an apple, regardless of
how the data record looks. The posterior
probability, P (H|X), is based on more
information (such as background knowledge)
than the prior probability, P(H), which is
independent of X.
Bayes Theorem
• Similarly, P(X|H) is the posterior probability of X
conditioned on H. That is, it is the probability that
X is red and round given that we know that it is
true that X is an apple.
• P(X) is the prior probability of X, i.e., it is the
probability that a data record from our set of
fruits is red and round.
• Bayes theorem is useful in that it provides a way
of calculating the posterior probability, P(H|X),
from P(H), P(X), and P(X|H). Bayes theorem is
P (H|X) = P(X|H) P(H) / P(X)
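As a quick illustration with made-up numbers (not from the slides): suppose
30% of the fruit records are apples, so P(H) = 0.3; 60% of apples are red and
round, so P(X|H) = 0.6; and 20% of all fruit records are red and round, so
P(X) = 0.2. Then P(H|X) = P(X|H) P(H) / P(X) = (0.6 × 0.3) / 0.2 = 0.9, and a
red, round fruit is classified as an apple with high posterior probability.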
Bayes Theorem
• Given a set of variables, X = {x1, x2, …, xd}, we want to
construct the posterior probability for the class Cj among
a set of possible classes C = {c1, c2, …, cJ}:

p(C_j \mid x_1, x_2, \ldots, x_d) = \frac{p(C_j)\, p(x_1, x_2, \ldots, x_d \mid C_j)}{p(x_1, x_2, \ldots, x_d)}

• where p(Cj | x1, x2, …, xd) is the posterior probability of
class membership, i.e., the probability that X belongs to
Cj. Since Naive Bayes assumes that the conditional
probabilities of the independent variables are statistically
independent, we can decompose the likelihood into a
product of terms and rewrite the posterior as:

p(C_j \mid x_1, x_2, \ldots, x_d) \propto p(C_j) \prod_{i=1}^{d} p(x_i \mid C_j)
• Using Bayes' rule above, we label a new
case X with the class label Cj that
achieves the highest posterior
probability → the Maximum a Posteriori
(MAP) decision rule:
classify(x_1, x_2, \ldots, x_d) = \arg\max_{c} P(C = c) \prod_{i=1}^{d} p(X_i = x_i \mid C = c)
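As an illustrative sketch of this MAP rule for categorical features (not from
the lecture; the probability tables and names below are made up), assuming the
class priors and per-feature conditional probabilities are already estimated:

```python
# Naive Bayes MAP classification for categorical features (illustrative sketch).
# `priors` maps class -> P(C = c); `cond` maps (class, feature index, value)
# -> P(X_i = value | C = c). Both tables are assumed to be estimated already.

def classify(x, priors, cond):
    """Return the class c maximizing P(C = c) * prod_i P(X_i = x_i | C = c)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            # Unseen (class, feature, value) combinations get probability 0 here;
            # in practice Laplace smoothing would be used instead.
            score *= cond.get((c, i, value), 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy fruit example with made-up probabilities.
priors = {"apple": 0.3, "banana": 0.7}
cond = {
    ("apple", 0, "red"): 0.6, ("apple", 1, "round"): 0.9,
    ("banana", 0, "red"): 0.05, ("banana", 1, "round"): 0.1,
}
print(classify(["red", "round"], priors, cond))  # -> apple
```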
Bayes Theorem
• Naive Bayes can be modeled in several
different ways, including with normal, lognormal,
gamma, and Poisson density functions.
Bayes Theorem
• In a supervised learning setting, we want
to estimate the parameters of the
probability model.
• Because of the independent feature
assumption, it suffices to estimate the
class prior and the conditional feature
models independently, using the method
of maximum likelihood, Bayesian inference
or other parameter estimation procedures.
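A minimal sketch of this estimation step under a Gaussian model for each
feature (the function and variable names are illustrative, not from the
lecture):

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Maximum-likelihood estimates of the class priors and the per-class,
    per-feature means and variances. X is an (N, d) array, y a length-N
    array of class labels."""
    priors, means, variances = {}, {}, {}
    N = len(y)
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / N               # P(C = c)
        means[c] = Xc.mean(axis=0)            # per-feature means
        variances[c] = Xc.var(axis=0) + 1e-9  # per-feature variances (small floor)
    return priors, means, variances
```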
Naïve Bayes Classifier: Example
It can be easily expanded to the
multi-dimensional case based on the
independence assumption.

Introduction to Bayes Classifier
• The Bayes classifier uses Bayesian decision
theory to separate measurements of two or more
classes of objects or events.
• Bayesian decision theory is a fundamental
statistical approach to the problem of pattern
classification.
• This approach is based on quantifying the
tradeoffs between various classification
decisions using probability and the costs that
accompany such decisions.
Introduction to Bayes Classifier
• Probability considerations become
important in pattern recognition because of
the randomness under which pattern
classes normally are generated.
• It is possible to derive a classification
approach that is optimal in the sense that,
on average, its use yields the lowest
probability of committing classification
errors.

Bayes Rules -- Conditional Probability
• The conditional probability of an event B in
relationship to an event A is the probability that
event B occurs given that event A has already
occurred. The notation for conditional probability
(posterior probability) is P(B|A).

P(B_j \mid A) = \frac{P(A \mid B_j)\, P(B_j)}{P(A)} = \frac{P(A \mid B_j)\, P(B_j)}{\sum_{k=1}^{n} P(A \mid B_k)\, P(B_k)}
Derivation Foundation
The probability that a particular pattern x comes
from class wi is denoted p(wi/x). If the pattern
classifier decides that x came from wj when it
actually came from wi, it incurs a loss, denoted Lij.
As pattern x may belong to any one of W classes
under consideration, the average loss incurred in
assigning x to class wj is:

r_j(\mathbf{x}) = \sum_{k=1}^{W} L_{kj}\, p(\omega_k \mid \mathbf{x})        Equation 1

This equation often is called the conditional
average risk or loss in decision-theory terminology.

From basic probability theory, we know that

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

Using this expression, we write Equation 1 in the form:

r_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \sum_{k=1}^{W} L_{kj}\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k)        Equation 2

where p(x/wk) is the probability density function of the
patterns from class wk and P(wk) is the probability of
occurrence of class wk.
• Because 1/p(x) is positive and common to all the
rj(x), j = 1, 2, …, W, it can be dropped from
Equation 2 without affecting the relative order of
these functions from the smallest to the largest
value.
• The expression for the average loss then reduces
to:

r_j(\mathbf{x}) = \sum_{k=1}^{W} L_{kj}\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k)        Equation 3
• The classifier has W possible classes to choose
from for any given unknown pattern. If it computes
r1(x), r2(x), …, rW(x) for each pattern x and assigns
the pattern to the class with the smallest loss, the
total average loss with respect to all decisions will
be minimized. The classifier that minimizes the total
average loss is called the Bayes classifier.
• Thus the Bayes classifier assigns an unknown
pattern x to class wi if ri(x) < rj(x) for j = 1, 2, …, W;
j ≠ i.
• In other words, x is assigned to class wi if

\sum_{k=1}^{W} L_{ki}\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k) < \sum_{q=1}^{W} L_{qj}\, p(\mathbf{x} \mid \omega_q)\, P(\omega_q)        Equation 4

for j = 1, 2, …, W; j ≠ i.
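A minimal sketch of this minimum-risk rule (Equations 3 and 4), assuming the
class-conditional densities and priors are already available as Python
callables and an array; the names below are illustrative:

```python
import numpy as np

def bayes_min_risk(x, densities, priors, L):
    """Assign x to the class with the smallest conditional risk.

    densities: list of W callables, densities[k](x) = p(x | w_k)
    priors:    length-W array with priors[k] = P(w_k)
    L:         (W, W) loss matrix, L[k, j] = loss of deciding w_j when the true class is w_k
    """
    W = len(priors)
    # r_j(x) = sum_k L[k, j] * p(x | w_k) * P(w_k)   (the common 1/p(x) factor is dropped)
    weighted = np.array([densities[k](x) * priors[k] for k in range(W)])
    risks = np.array([np.dot(L[:, j], weighted) for j in range(W)])
    return int(np.argmin(risks))  # index of the chosen class
```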
• The “loss” for a correct decision generally is assigned a
value of zero, and the loss for any incorrect decision
usually is assigned the same nonzero value (say 1).
• Under these conditions, the loss function becomes

L_{ij} = 1 - \delta_{ij}        Equation 5

where δij = 1 if i = j, and δij = 0 if i ≠ j.
• Equation 5 indicates a loss of unity for incorrect
decisions and a loss of zero for correct decisions.
Substituting Equation 5 into Equation 3 yields

r_j(\mathbf{x}) = \sum_{k=1}^{W} (1 - \delta_{kj})\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k) = p(\mathbf{x}) - p(\mathbf{x} \mid \omega_j)\, P(\omega_j), \quad j = 1, 2, \ldots, W        Equation 6

• The Bayes classifier then assigns a pattern x to class wi
if, for all j ≠ i,

p(\mathbf{x}) - p(\mathbf{x} \mid \omega_i)\, P(\omega_i) < p(\mathbf{x}) - p(\mathbf{x} \mid \omega_j)\, P(\omega_j)        Equation 7

• Or, equivalently, if

p(\mathbf{x} \mid \omega_i)\, P(\omega_i) > p(\mathbf{x} \mid \omega_j)\, P(\omega_j), \quad j = 1, 2, \ldots, W;\; j \neq i        Equation 8

• The Bayes classifier for a 0-1 loss function is nothing
more than computation of decision functions of the form:

d_j(\mathbf{x}) = p(\mathbf{x} \mid \omega_j)\, P(\omega_j), \quad j = 1, 2, \ldots, W        Equation 9

where a pattern vector x is assigned to the class whose
decision function yields the largest numerical value.
• The decision functions in Equation 9 are optimal
in the sense that they minimize the average loss
in misclassification.
• For this optimality to hold, the probability density
functions of the patterns in each class, as well
as the probability of occurrence of each class
must be known.
• The latter requirement usually is not a problem
since it can generally be inferred from
knowledge of the problem.
• Estimation of the probability density function
p(x/wj) is another matter. If the pattern vectors,
x, are n dimensional, then p(x/wj) is a function of
n variables, which requires methods from
multivariate probability theory for its estimation.
• These methods are difficult to apply in practice,
especially if the number of representative
patterns from each class is not large, or if the
underlying form of the probability density
functions is not well behaved.
• Use of the Bayes classifier generally is based on
the assumption of an analytic expression for the
various density functions and then an estimation
of the necessary parameters from sample
patterns from each class.
• By far, the most prevalent form assumed for
p(x/wj) is the Gaussian probability density
function.
• The closer this assumption is to the reality, the
closer the Bayes classifier approaches the
minimum average loss in classification.
Bayes Classifier for Gaussian
Pattern Classes
• Let us consider a 1-D problem (n=1) involving two
pattern classes (W=2) governed by Gaussian densities,
with means m1 and m2 and standard deviations σ1 and
σ2, respectively.
• From Equation 9, the Bayes decision functions have the
form:

d_j(x) = p(x \mid \omega_j)\, P(\omega_j) = \frac{1}{\sqrt{2\pi}\, \sigma_j}\, e^{-\frac{(x - m_j)^2}{2\sigma_j^2}}\, P(\omega_j), \quad j = 1, 2        Equation 10

where the patterns are now scalars.
This figure shows a plot of the probability density functions for the
two classes. The boundary between the classes is a single point, denoted x0,
such that d1(x0) = d2(x0).
If the two classes are equally likely to occur, then P(w1) = P(w2) =
½, and the decision boundary is the value of x0 for which
p(x0/w1) = p(x0/w2). This point is the intersection of the two
probability density functions.
• Any pattern (point) to the right of x0 is classified
as belonging to class w1. Similarly, any pattern
to the left of x0 is classified as belonging to class
w2.
• When the classes are not equally likely to occur,
x0 moves to the left if class w1 is more likely to
occur or, conversely, to the right if class w2 is
more likely to occur, since the classifier is
trying to minimize the loss of misclassification.
• For instance, in the extreme case, if class w2
never occurs, the classifier would never make a
mistake by always assigning all patterns to class
w1 (that is, x0 would move to negative infinity).
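As a concrete check (worked out here; it is implied but not written on the
slide): with equal variances σ1 = σ2 = σ, setting d1(x0) = d2(x0) and taking
logarithms gives

x_0 = \frac{m_1 + m_2}{2} + \frac{\sigma^2 \ln[P(\omega_2)/P(\omega_1)]}{m_1 - m_2}

so for equal priors x0 is simply the midpoint of the two means, and increasing
P(ω1) relative to P(ω2) moves x0 toward m2, enlarging the region assigned
to w1.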
• In the n-dimensional case, the Gaussian
density of the vectors in the jth pattern class
has the form:
p(\mathbf{x} \mid \omega_j) = \frac{1}{(2\pi)^{n/2}\, |C_j|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)}        Equation 11
• Where each density is specified completely by
its mean vector mj and covariance matrix Cj,
which are defined as
\mathbf{m}_j = E_j\{\mathbf{x}\}, \qquad C_j = E_j\{(\mathbf{x} - \mathbf{m}_j)(\mathbf{x} - \mathbf{m}_j)^T\}

with the sample estimates

\mathbf{m}_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}, \qquad C_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}\mathbf{x}^T - \mathbf{m}_j \mathbf{m}_j^T
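A brief sketch of computing these sample estimates for one class (variable
names are illustrative):

```python
import numpy as np

def class_statistics(X_j):
    """Sample mean vector m_j and covariance matrix C_j for an (N_j, n) array
    of patterns belonging to class w_j, following the estimates above."""
    N_j = X_j.shape[0]
    m_j = X_j.mean(axis=0)                          # (1/N_j) * sum of x
    C_j = (X_j.T @ X_j) / N_j - np.outer(m_j, m_j)  # (1/N_j) * sum of x x^T - m_j m_j^T
    return m_j, C_j
```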
• Because of the exponential form of the Gaussian
density, working with the natural logarithm of this
decision function is more convenient. That is:

d_j(\mathbf{x}) = \ln[\,p(\mathbf{x} \mid \omega_j)\, P(\omega_j)\,] = \ln p(\mathbf{x} \mid \omega_j) + \ln P(\omega_j)        Equation 12

• Substituting Equation 11 into Equation 12 yields

d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln|C_j| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)        Equation 13
• The term (n/2) ln 2π is the same for all classes, so it
can be eliminated from Equation 13:

d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{1}{2} \ln|C_j| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)        Equation 14

for j = 1, 2, …, W.
• It is equivalent to Equation 9 in terms of
classification performance because the logarithm is a
monotonically increasing function. In other words, the
numerical order of the decision functions in Equation 9
and Equation 12 is the same.
• Equation 14 represents the Bayes decision
functions for Gaussian pattern classes under the
condition of a 0-1 loss function. This equation is
also called a quadratic discriminant function.
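A minimal sketch of evaluating Equation 14 for one class, assuming the mean
vector m_j, covariance matrix C_j, and prior P_j have already been estimated
(names are illustrative):

```python
import numpy as np

def quadratic_discriminant(x, m_j, C_j, P_j):
    """d_j(x) = ln P(w_j) - 0.5*ln|C_j| - 0.5*(x - m_j)^T C_j^{-1} (x - m_j)."""
    diff = x - m_j
    _, logdet = np.linalg.slogdet(C_j)               # numerically stable ln|C_j|
    mahalanobis = diff @ np.linalg.solve(C_j, diff)  # (x - m_j)^T C_j^{-1} (x - m_j)
    return np.log(P_j) - 0.5 * logdet - 0.5 * mahalanobis

# A pattern x is assigned to the class whose discriminant value is largest.
```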
Several Cases -- Case 1
• The features are statistically independent, with
the same variance for all classes: Ci = σ²I.

d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{1}{2} \ln|C_j| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)

which, after dropping terms common to all classes, reduces to

d_j(\mathbf{x}) = -\frac{1}{2\sigma^2} \left( -2\,\mathbf{m}_j^T \mathbf{x} + \mathbf{m}_j^T \mathbf{m}_j \right) + \ln P(\omega_j)

• Since the discriminant is linear, the decision
boundaries will be hyper-planes.

Case 1 Example
Case 2
• The classes have the same covariance matrix,
but the features are allowed to have different
variances.

d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{1}{2} \sum_{k=1}^{N} \frac{-2\,x[k]\, m_j[k] + m_j[k]^2}{\sigma_k^2} - \frac{1}{2} \log \prod_{k=1}^{N} \sigma_k^2

• This discriminant is linear, so the decision
boundaries will be hyper-planes.

Case 2 Example
Case 3
• All the classes have the same covariance
matrix, but this is no longer diagonal.

d_j(\mathbf{x}) = \ln P(\omega_j) + \mathbf{x}^T C^{-1} \mathbf{m}_j - \frac{1}{2} \mathbf{m}_j^T C^{-1} \mathbf{m}_j

• This discriminant is linear, so the decision
boundaries will also be hyper-planes.

Case 3 Example
Case 4
• Each class has a different covariance matrix,
which is proportional to the identity matrix.

d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{N}{2} \ln \sigma_j^2 - \frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T \sigma_j^{-2} (\mathbf{x} - \mathbf{m}_j)

• This expression cannot be reduced further, so:
  – The decision boundaries are quadratic: hyperellipses

Case 4 Example
Case 5 -- The Most General Case
• The covariance matrices of different classes are
not the same.
d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{1}{2} \ln|C_j| - \frac{1}{2} (\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)

• Reorganizing terms in a quadratic form yields

d_j(\mathbf{x}) = \mathbf{x}^T W_j \mathbf{x} + \mathbf{w}_j^T \mathbf{x} + w_{j0}

where

W_j = -\frac{1}{2} C_j^{-1}, \qquad \mathbf{w}_j = C_j^{-1} \mathbf{m}_j, \qquad w_{j0} = -\frac{1}{2} \mathbf{m}_j^T C_j^{-1} \mathbf{m}_j - \frac{1}{2} \ln|C_j| + \ln P(\omega_j)
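A short sketch of this reorganization, assuming C_j, m_j, and P(ω_j) are
given (the helper functions are illustrative):

```python
import numpy as np

def quadratic_form_parameters(C_j, m_j, P_j):
    """Return (W_j, w_j, w_j0) so that d_j(x) = x^T W_j x + w_j^T x + w_j0."""
    C_inv = np.linalg.inv(C_j)
    W_j = -0.5 * C_inv
    w_j = C_inv @ m_j
    _, logdet = np.linalg.slogdet(C_j)
    w_j0 = -0.5 * m_j @ C_inv @ m_j - 0.5 * logdet + np.log(P_j)
    return W_j, w_j, w_j0

def d_j(x, W_j, w_j, w_j0):
    # Quadratic decision function for Case 5.
    return x @ W_j @ x + w_j @ x + w_j0
```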
Case 5 Example
Other QDA
• While QDA is the most commonly used method for
obtaining a quadratic classifier, other methods are also
possible. One such method is to create a longer
measurement vector from the old one by adding all
pairwise products of individual measurements. For
instance, a vector (x1, x2, …, xn) would be extended
with all of the products xi·xj.
• Finding a quadratic classifier for the original
measurements would then become the same as finding
a linear classifier based on the expanded measurement
vector.
• For linear classifiers based only on dot products, these
expanded measurements do not have to be actually
computed, since the dot product in the higher-
dimensional space is simply related to that in the
original space → the kernel trick.
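A small sketch of this feature expansion (illustrative; the slide gives no
code):

```python
import numpy as np

def expand_with_pairwise_products(x):
    """Append all pairwise products x_i * x_j (i <= j) to the original vector,
    so that a linear classifier on the expanded vector is quadratic in x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    products = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, products])

# Example: a 3-D measurement becomes a 9-D expanded vector (3 original values
# plus 6 pairwise products).
print(expand_with_pairwise_products([1.0, 2.0, 3.0]))
```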
A New Approximation Method of the
Quadratic Discriminant Function
-- By S. Omachi, F. Sun, and H. Aso
• In order to avoid the disadvantages of the
quadratic discriminant function, they have
proposed a new approximation method of the
quadratic discriminant function.
• This approximation is done by replacing the values
of small eigenvalues by a constant which is
estimated by the maximum likelihood estimation.
By applying this approximation, a new discriminant
function, the simplified quadratic discriminant function
(SQDF), has been defined.
• This function not only reduces the computational
cost but also improves the classification accuracy.
A Quadratic Discriminant Function Based on
Bias Rectification of Eigenvalues
-- By M. Sakai, M. Yoneda, H. Hase, et al.
They propose a new quadratic discriminant function.
1) They show that the eigenvalues of a covariance matrix
obtained from samples are biased.
2) They describe how to rectify them and how to use the
rectified eigenvalues.
3) In order to derive the relation between sample
eigenvalues and true eigenvalues, they analyze the
biases of the expected sample eigenvalues by using
the perturbation method and obtain approximate
simultaneous linear equations to rectify sample
eigenvalues in multidimensional normal cases.
4) They show by a Monte Carlo method that their
discriminant function works effectively in eight-
dimensional normal cases, especially in the case of a
small sample size.
Summary
• The various discriminant methods are developed
to have optimal properties under various
distributional assumptions.
• However, the purpose of discriminant analysis is
to discriminate. Therefore, the proper
assessment of a discriminant procedure for a
particular data set is not how well the data
fit the assumptions, but how well the procedure
works on a validation data set.