ECE 6430 Pattern Recognition and Analysis
Fall 2011
Lecture Notes - 2
• What does Bayes' theorem give us?
• Let's revisit the ball-in-the-box example.
Figure 1: Boxes with colored balls
• Last class we answered the question, what is the overall probability that
the selection procedure will pick a green ball?
• Now let's look at another problem:
suppose we have drawn a green ball; what is the probability that it came
from the blue box? Or the red box?
• We can solve the problem of reversing the conditional probability by using
Bayes' theorem:
p(B = b \mid F = g) = \frac{p(F = g \mid B = b)\, p(B = b)}{p(F = g)}    (1)
• Note that we know all the probabilities on the RHS from earlier!
• Prior probability: p(B = b) or p(B = r)
it is the probability available before we observe the identity of the
ball.
• Posterior probability: p(B | F)
it is the probability available after we observe the identity of the ball.
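To make the reversal concrete, here is a small Python sketch of Eq. (1) for the box example. The proportions below (the prior of picking each box and the fraction of green balls in each) are made-up placeholders, not the actual numbers from Figure 1.

    # Hypothetical numbers, for illustration only (not the ones in Figure 1).
    p_box = {"blue": 0.6, "red": 0.4}                 # prior p(B = b)
    p_green_given_box = {"blue": 0.75, "red": 0.25}   # likelihood p(F = g | B = b)

    # Evidence p(F = g): sum of likelihood x prior over the boxes.
    p_green = sum(p_green_given_box[b] * p_box[b] for b in p_box)

    # Posterior p(B = b | F = g) from Bayes' theorem, Eq. (1).
    posterior = {b: p_green_given_box[b] * p_box[b] / p_green for b in p_box}
    print(posterior)   # {'blue': 0.818..., 'red': 0.181...}

Under these made-up numbers, observing a green ball raises the probability of the blue box above its prior of 0.6.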
• How does Bayes' theorem relate to training data?
• Prior probabilities, p (Ck ), can be estimated from the proportions of the
training data which fall into each class.
What does this mean?
How many times was each class chosen? (What is the probability of
choosing the blue box?)
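A minimal sketch of that estimate, assuming the training labels simply sit in a Python list (the labels themselves are placeholders):

    from collections import Counter

    labels = ["C1", "C2", "C1", "C1", "C2"]          # hypothetical training labels
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    # priors == {'C1': 0.6, 'C2': 0.4}: the fraction of training points per class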
• Class-conditional probability, p(x_n | C_k): estimated from histograms of the
feature values for each class.
Why do we need this?
Take the example of handwritten letters 'a' and 'b'. We tried to
classify based on height alone; in that case, there was a lot of overlap
between classes C1 and C2.
While height is the most obvious example, such overlap between classes
occurs even when more features are used.
Both boxes (classes) have both orange and green balls!
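A rough sketch of the histogram estimate of p(x_n | C_k) for a single feature such as letter height; the synthetic data and bin edges below are assumptions chosen only to show the overlap.

    import numpy as np

    rng = np.random.default_rng(0)
    heights_a = rng.normal(2.0, 0.5, 500)       # synthetic heights for class 'a'
    heights_b = rng.normal(2.6, 0.5, 500)       # synthetic heights for class 'b'

    bins = np.linspace(0.0, 5.0, 26)            # shared bin edges
    p_x_given_a, _ = np.histogram(heights_a, bins=bins, density=True)
    p_x_given_b, _ = np.histogram(heights_b, bins=bins, density=True)
    # Bins where both histograms are non-zero are exactly the overlap discussed
    # above: height alone cannot cleanly separate 'a' from 'b'.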
• What about the denominator, p (xn )?
Once we have the prior and the class-conditional probabilities, we can
calculate this as
p(x_n) = p(x_n \mid C_1)\, p(C_1) + \cdots + p(x_n \mid C_k)\, p(C_k)    (2)
This is just a normalizing value! Why?
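Continuing the sketches above, p(x_n) in Eq. (2) is just the likelihood-weighted sum over classes, which is why it only rescales the posteriors. The dictionaries p_x_given_c and priors below are assumed to hold the per-class estimates for one value of x_n.

    # Hypothetical per-class values at one particular x_n.
    p_x_given_c = {"C1": 0.30, "C2": 0.10}
    priors = {"C1": 0.60, "C2": 0.40}

    # Total probability, Eq. (2): p(x_n) = sum_k p(x_n | C_k) p(C_k).
    p_x = sum(p_x_given_c[c] * priors[c] for c in priors)

    # Dividing by p_x only normalizes the posteriors so that they sum to 1.
    posterior = {c: p_x_given_c[c] * priors[c] / p_x for c in priors}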
• Summarizing,
posterior ∝ prior × likelihood
(3)
• Decision making :
Given a new data value x_n, the probability of misclassification is
minimized if we assign the data to the class C_k for which the posterior
probability p(C_k | x_n) is largest:
assign x_n to C_k if \; p(C_k \mid x_n) > p(C_j \mid x_n) \quad \forall\, j \neq k    (4)
• Rejection threshold in Bayesian context
\begin{cases} \text{classify } x_n \text{ as } C_k, & \text{if } \max_k\, p(C_k \mid x_n) \ge \theta \\ \text{reject } x_n, & \text{if } \max_k\, p(C_k \mid x_n) < \theta \end{cases}    (5)
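A minimal decision-rule sketch combining Eqs. (4) and (5): pick the class with the largest posterior, but reject the input if even that largest posterior is below the threshold theta. The posterior dictionary is assumed to come from a computation like the one sketched earlier.

    def decide(posterior, theta=0.8):
        """Return the most probable class, or None to signal rejection."""
        best = max(posterior, key=posterior.get)
        if posterior[best] >= theta:
            return best          # classify x_n as C_k   (Eq. 4)
        return None              # reject x_n: no class is confident enough (Eq. 5)

    print(decide({"C1": 0.55, "C2": 0.45}))   # None (rejected at theta = 0.8)
    print(decide({"C1": 0.92, "C2": 0.08}))   # 'C1'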
• Note that in the textbook, discrete probabilities are denoted by a capital
P(), while I have not made that distinction.
Discriminant functions
• Discriminant functions y_1(x), ..., y_M(x) are defined such that an input
vector x is assigned to class C_k if y_k(x) > y_j(x) \; \forall\, j \neq k.
• If we compare this to our earlier rule of minimizing the probability of
misclassification, we would have,
yk (x) = p (Ck |x)
(6)
• Applying Bayes' theorem, we will have,
yk (x) = p (x|Ck ) p (Ck )
(7)
• Note that when defining the discriminant function, we can discard the
denominator p(x_n).
Figure 2: Joint probabilities compared
• p (x, C1 ) = p (x|C1 ) p (C1 )
• In general, the decision boundaries are given by the surfaces where the
discriminant functions are equal: y_k(x) = y_j(x).
• Since we only compare relative magnitudes, we can replace y_k(x) with any
monotonically increasing function of it and the decisions stay the same, e.g.
z_k(x) = \ln p(x \mid C_k) + \ln p(C_k)    (8)
Curve fitting revisited
Figure 3: Probability in curve fitting!
• Remember: curve fitting involves finding one set of values for w that
minimizes the error between y(x_n, w) and the desired output or target
values, t_n.
• Error function: measures the misfit between the function y(x_n, w), for
any given value of w, and the training-set data points.
Sum of the squares of the errors:
E = \frac{1}{2} \sum_{n=1}^{N} \{\, y(x_n; w) - t_n \,\}^2    (9)
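A small NumPy sketch of Eq. (9) for a polynomial y(x_n; w); the data and the weight vector are arbitrary placeholders, not a fitted solution.

    import numpy as np

    x = np.linspace(0.0, 1.0, 10)          # inputs x_n (placeholder data)
    t = np.sin(2 * np.pi * x)              # target values t_n
    w = np.array([0.1, 0.5, -0.3])         # y(x; w) = w0 + w1*x + w2*x**2

    y = np.polyval(w[::-1], x)             # np.polyval wants the highest power first
    E = 0.5 * np.sum((y - t) ** 2)         # sum-of-squares error, Eq. (9)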
• What does p(t | x_0) mean?
• What are we doing here? Choosing a specific estimate y(x) of the value
of t for each input x.
• The regression function y(x) minimizes the expected squared loss:
E(L) = \iint \{\, y(x) - t \,\}^2\, p(x, t)\, dx\, dt    (10)
• Choose y (x) to minimize E (L).
• Taking the partial derivative with respect to y(x) and setting it to zero, we
end up with,
y (x) = E (t|x)
(11)
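For reference, the intermediate steps between Eqs. (10) and (11): differentiating Eq. (10) with respect to y(x) for each x and setting the result to zero gives

    \frac{\partial E(L)}{\partial y(x)} = 2 \int \{\, y(x) - t \,\}\, p(x, t)\, dt = 0
    \;\Rightarrow\; y(x)\, p(x) = \int t\, p(x, t)\, dt
    \;\Rightarrow\; y(x) = \int t\, p(t \mid x)\, dt = E(t \mid x),

which is exactly Eq. (11): the conditional mean of t given x.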
Figure 4: Bayesian curve fit
• Generalizing to vector-valued inputs and targets, we have
y(x) = E(t | x)    (12)
Minimizing risk
• Sometimes, misclassifying in one direction can be more detrimental than in the other.
e.g. identification of a tumor: it would be riskier to classify a real tumor as
a non-tumor than the other way around.
• Loss matrix: the element l_{kj} is the penalty associated with assigning a pattern to
class C_j when it belongs to C_k. The risk for patterns from class C_k is then
R_k = \sum_{j=1}^{c} l_{kj} \int_{R_j} p(x \mid C_k)\, dx    (13)
where R_j denotes the decision region for class C_j. Minimizing the overall risk
leads to the rule: assign x to class C_j if
\sum_k l_{kj}\, p(x \mid C_k)\, p(C_k) < \sum_k l_{ki}\, p(x \mid C_k)\, p(C_k) \quad \forall\, i \neq j    (14)
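A sketch of the rule in Eq. (14) for a two-class tumor example. The likelihoods, priors, and loss matrix below are made-up numbers chosen so that missing a real tumor is penalized far more heavily than a false alarm.

    import numpy as np

    classes = ["tumor", "normal"]
    p_x_given_c = np.array([0.6, 0.4])     # p(x | C_k) for this particular x
    priors = np.array([0.01, 0.99])        # p(C_k)

    # loss[k, j]: penalty for assigning to C_j when the pattern belongs to C_k.
    loss = np.array([[0.0, 100.0],         # calling a real tumor 'normal' is costly
                     [1.0,   0.0]])        # a false alarm costs far less

    # Expected loss of each assignment j: sum_k loss[k, j] p(x | C_k) p(C_k), Eq. (14).
    expected_loss = loss.T @ (p_x_given_c * priors)
    print(classes[int(np.argmin(expected_loss))])   # 'tumor'

Even though the posterior here strongly favors 'normal', the heavy penalty on missed tumors makes 'tumor' the lower-risk assignment.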
• END