Probability theory
retro
14/03/2017
Probability
(Atomic) events (A) and the probability space (Ω)
Axioms:
- 0 ≤ P(A) ≤ 1
- P(Ω) = 1
- If A1, A2, … are mutually exclusive events (Ai ∩ Aj = ∅, i ≠ j), then
  P(∪k Ak) = ∑k P(Ak)
Consequences:
- P(∅) = 0
- P(¬A) = 1 − P(A)
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- P(A) = P(A ∩ B) + P(A ∩ ¬B)
- If A ⊆ B, then P(A) ≤ P(B) and P(B − A) = P(B) − P(A)
Conditional probability
Conditional probability is the probability
of some event A, given the occurrence
of some other event B.
P(A|B) = P(A∩B)/P(B)
Chain rule:
P(A∩B) = P(A|B)·P(B)
Example:
A: headache, B: influenza
P(A) = 1/10, P(B) = 1/40, P(A|B)=?
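The slide leaves P(A|B) open. As an illustration of the two formulas above (not part of the original slides), here is a minimal Python sketch that assumes a hypothetical joint probability P(A∩B) = 1/80:

```python
# Hypothetical numbers: P(A) = 1/10, P(B) = 1/40, and an assumed joint P(A and B) = 1/80.
p_a = 1 / 10        # P(headache)
p_b = 1 / 40        # P(influenza)
p_a_and_b = 1 / 80  # assumed joint probability (not given on the slide)

p_a_given_b = p_a_and_b / p_b          # P(A|B) = P(A ∩ B) / P(B)
p_a_and_b_chain = p_a_given_b * p_b    # chain rule recovers the joint

print(p_a_given_b)       # 0.5
print(p_a_and_b_chain)   # 0.0125
```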
Independence of events
A and B are independent iff
P(A|B) = P(A)
Corollary:
P(A ∩ B) = P(A) · P(B)
P(B|A) = P(B)
Product rule
For arbitrary events A1, A2, …, An:
P(A1 ∩ A2 ∩ … ∩ An) = P(An | A1 … An−1) · P(An−1 | A1 … An−2) · … · P(A2 | A1) · P(A1)
Law of total probability
If the events A1, A2, …, An form a complete probability space (a partition of Ω) and P(Ai) > 0 for each i, then
P(B) = ∑i=1..n P(B | Ai) · P(Ai)
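A minimal Python sketch of the law of total probability, with hypothetical numbers for three disjoint events A1, A2, A3 covering Ω:

```python
# Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i); all numbers are hypothetical.
priors = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3); they must sum to 1
likelihoods = [0.9, 0.5, 0.1]     # P(B | A1), P(B | A2), P(B | A3)

p_b = sum(p_b_given_a * p_a for p_b_given_a, p_a in zip(likelihoods, priors))
print(p_b)  # 0.62
```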
Bayes rule
P(A|B) = P(A∩B)/P(B) = P(B|A) · P(A) / P(B)
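Continuing the hypothetical numbers above, a one-line application of Bayes rule to get the posterior P(A1 | B):

```python
# Bayes rule: P(A1 | B) = P(B | A1) * P(A1) / P(B), with the evidence P(B) taken from the
# law-of-total-probability sketch above (all numbers hypothetical).
p_b_given_a1 = 0.9
p_a1 = 0.5
p_b = 0.62
p_a1_given_b = p_b_given_a1 * p_a1 / p_b
print(round(p_a1_given_b, 3))  # 0.726
```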
Random variable
ξ: Ω → R
Vectors of random variables …
Cumulative distribution function (CDF)
F(x) = P(ξ ≤ x)
F(x1) ≤ F(x2), if x1 < x2
limx→−∞ F(x) = 0, limx→∞ F(x) = 1
F(x) is non-decreasing and right-continuous
Discrete vs. continuous random variables
Discrete: its set of possible values is a finite or countably infinite sequence
Continuous: it has a probability density function f(x), so probabilities over an interval (a, b) are obtained by integrating f(x)
Probability density function (pdf)
F(b) − F(a) = P(a < ξ < b) = ∫_a^b f(x) dx
f(x) = F′(x) and F(x) = ∫_{−∞}^x f(t) dt
Histogram: an empirical estimate of a density
Independence of random variables
ξ and η are independent iff for any a ≤ b, c ≤ d
P(a ≤ ξ ≤ b, c ≤ η ≤ d) = P(a ≤ ξ ≤ b) · P(c ≤ η ≤ d).
Composition of random variables
Discrete case: ζ = ξ + η
If ξ and η are independent, then
rn = P(ζ = n) = ∑k=−∞..∞ P(ξ = n − k, η = k) = ∑k=−∞..∞ P(ξ = n − k) · P(η = k)
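As an illustration (not from the slides), a sketch of this discrete convolution for the sum of two independent fair dice:

```python
# Distribution of zeta = xi + eta for two independent fair dice, via
# r_n = sum_k P(xi = n - k) * P(eta = k).
from collections import defaultdict

die = {face: 1 / 6 for face in range(1, 7)}   # P(xi = k) = P(eta = k) = 1/6

r = defaultdict(float)
for i, p_i in die.items():
    for j, p_j in die.items():
        r[i + j] += p_i * p_j                  # accumulate P(zeta = i + j)

print(round(r[7], 4))              # 0.1667, the most likely sum
print(round(sum(r.values()), 6))   # 1.0 (sanity check)
```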
Expected value
If ξ can take the values x1, x2, … with probabilities p1, p2, …, then
M(ξ) = ∑i xi pi
Continuous case:
M(ξ) = ∫_{−∞}^{∞} x f(x) dx
Properties of expected value
M(cξ) = c·M(ξ)
M(ξ + η) = M(ξ) + M(η)
If ξ and η are independent random variables, then M(ξη) = M(ξ)·M(η)
Standard deviation
D(ξ) = (M[(ξ − M(ξ))²])^{1/2}
D²(ξ) = M(ξ²) − M²(ξ)
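A quick numerical check of the identity D²(ξ) = M(ξ²) − M²(ξ) on a hypothetical discrete distribution:

```python
# Compare the variance computed via M(xi^2) - M(xi)^2 with the direct definition.
values = [0, 1, 2]
probs = [0.2, 0.5, 0.3]   # hypothetical probabilities; they sum to 1

m = sum(x * p for x, p in zip(values, probs))              # M(xi)
m2 = sum(x ** 2 * p for x, p in zip(values, probs))        # M(xi^2)
var_identity = m2 - m ** 2                                 # D^2(xi) = M(xi^2) - M^2(xi)
var_direct = sum((x - m) ** 2 * p for x, p in zip(values, probs))
print(round(var_identity, 4), round(var_direct, 4))        # both 0.49
```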
Properties of standard deviation
- D²(aξ + b) = a²·D²(ξ)
- If ξ1, ξ2, …, ξn are independent random variables, then
  D²(ξ1 + ξ2 + … + ξn) = D²(ξ1) + D²(ξ2) + … + D²(ξn)
Correlation
Covariance:
c = M[(ξ − M(ξ))(η − M(η))]
c is 0 if ξ and η are independent
Correlation coefficient:
r = c / (D(ξ)·D(η)),
the covariance normalised into [−1, 1]
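A sketch of the covariance and correlation-coefficient formulas on hypothetical data (NumPy assumed to be available):

```python
# Empirical covariance c and correlation coefficient r for two samples.
import numpy as np

xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
eta = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2 * xi, so r should be close to +1

cov = np.mean((xi - xi.mean()) * (eta - eta.mean()))   # c = M[(xi - M(xi)) (eta - M(eta))]
r = cov / (xi.std() * eta.std())                       # r = c / (D(xi) D(eta))
print(round(cov, 3), round(r, 4))                      # r comes out close to 1
```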
Well-known distributions
Normal/Gaussian
Binomial:
ξ ~ B(n, p)
M(ξ) = np
D(ξ) = √(np(1 − p))
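A sketch that checks the binomial mean and standard deviation formulas against a simulated sample; n and p are hypothetical:

```python
# M(xi) = n*p and D(xi) = sqrt(n*p*(1-p)) for xi ~ B(n, p), checked by simulation.
import math
import random

n, p = 20, 0.3
mean_formula = n * p
std_formula = math.sqrt(n * p * (1 - p))

samples = [sum(random.random() < p for _ in range(n)) for _ in range(100_000)]
mean_emp = sum(samples) / len(samples)
std_emp = math.sqrt(sum((s - mean_emp) ** 2 for s in samples) / len(samples))

print(mean_formula, round(mean_emp, 2))           # both around 6.0
print(round(std_formula, 3), round(std_emp, 3))   # both around 2.05
```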
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Bayes classification
Classification
Supervised learning: based on training examples (E), learn a model that performs well on previously unseen examples.
Classification: a supervised learning task of categorising entities into a predefined set of classes.
Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) · P(ωj) / P(x)
Posterior = (Likelihood · Prior) / Evidence
where, in the case of two categories,
P(x) = ∑j=1..2 P(x | ωj) · P(ωj)
Bayes Classifier
• Decision given the posterior probabilities:
x is an observation for which
  if P(ω1 | x) > P(ω2 | x), the decision is: true state of nature = ω1
  if P(ω1 | x) < P(ω2 | x), the decision is: true state of nature = ω2
This rule minimizes the probability of error.
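A minimal sketch (not the book's code) of this two-category decision rule; the likelihoods and priors below are hypothetical:

```python
# Decide omega_1 if P(omega_1 | x) > P(omega_2 | x), otherwise omega_2.
def bayes_decide(likelihoods, priors):
    """likelihoods[i] = P(x | omega_i), priors[i] = P(omega_i); returns the chosen class index."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
    return max(range(len(posteriors)), key=lambda i: posteriors[i])

# Class 0 is twice as likely a priori, but the observation favours class 1.
print(bayes_decide(likelihoods=[0.1, 0.4], priors=[2 / 3, 1 / 3]))  # 1
```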
Classifiers, Discriminant Functions
and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• The classifier assigns a feature vector x to class ωi if:
  gi(x) > gj(x) for all j ≠ i
For the minimum error rate, we take
gi(x) = P(ωi | x)
(maximum discriminant corresponds to maximum posterior!)
Equivalent choices:
gi(x) = P(x | ωi) · P(ωi)
gi(x) = ln P(x | ωi) + ln P(ωi)
(ln: natural logarithm!)
• Feature space divided into c decision regions:
  if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  (Ri means: assign x to ωi)
• The two-category case
  • A classifier is a "dichotomizer" that has two discriminant functions g1 and g2
  Let g(x) ≡ g1(x) − g2(x)
  Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x):
g(x) = P(ω1 | x) − P(ω2 | x)
g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
Discriminant functions
of the Bayes Classifier
with Normal Density
The Normal Density
• Univariate density
  • Analytically tractable
  • Continuous density
  • Many processes are asymptotically Gaussian
  • Handwritten characters or speech sounds can be viewed as an ideal prototype corrupted by a random process (central limit theorem)

P(x) = (1 / (√(2π)·σ)) · exp[ −½ ((x − μ)/σ)² ]

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
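A small sketch evaluating this univariate normal density; the parameters μ = 0 and σ = 1 are hypothetical:

```python
# P(x) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma) ** 2)
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

print(round(normal_pdf(0.0, mu=0.0, sigma=1.0), 4))  # 0.3989, the peak of the standard normal
```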
• Multivariate density
• The multivariate normal density in d dimensions is:

P(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) · exp[ −½ (x − μ)ᵗ Σ⁻¹ (x − μ) ]

where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
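A sketch of the d-dimensional normal density with a hypothetical 2-D mean vector and covariance matrix (NumPy assumed):

```python
# P(x) = exp(-0.5 * (x-mu)^T Sigma^-1 (x-mu)) / ((2*pi)^(d/2) * |Sigma|^(1/2))
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # d x d covariance matrix
print(round(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, sigma), 4))
```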
Discriminant Functions for the Normal Density
• We saw that minimum error-rate classification can be achieved by the discriminant function
gi(x) = ln P(x | ωi) + ln P(ωi)
• Case of the multivariate normal:
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)
• Case Σi = σ²·I (I stands for the identity matrix)
gi(x) = wiᵗ x + wi0  (linear discriminant function)
where:
wi = μi / σ²;  wi0 = −(1 / (2σ²)) μiᵗ μi + ln P(ωi)
(wi0 is called the threshold for the i-th category!)
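A sketch of this linear discriminant with hypothetical 2-D means, a shared σ², and equal priors:

```python
# g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 * sigma^2) + ln P(omega_i).
import numpy as np

def linear_discriminant(x, mu_i, sigma_sq, prior_i):
    w_i = mu_i / sigma_sq
    w_i0 = -mu_i @ mu_i / (2 * sigma_sq) + np.log(prior_i)
    return float(w_i @ x + w_i0)

x = np.array([1.5, 1.0])
g1 = linear_discriminant(x, np.array([2.0, 2.0]), sigma_sq=1.0, prior_i=0.5)
g2 = linear_discriminant(x, np.array([0.0, 0.0]), sigma_sq=1.0, prior_i=0.5)
print("class 1" if g1 > g2 else "class 2")   # class 1 (x is closer to the first mean)
```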
• A classifier that uses linear discriminant functions
is called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
The hyperplane is always orthogonal to the line linking the means!
• The hyperplane separating Ri and Rj passes through the point
x0 = ½ (μi + μj) − (σ² / ‖μi − μj‖²) · ln[P(ωi) / P(ωj)] · (μi − μj)
and is always orthogonal to the line linking the means!
If P(ωi) = P(ωj), then x0 = ½ (μi + μj)
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
• Hyperplane separating Ri and Rj:
x0 = ½ (μi + μj) − [ln(P(ωi) / P(ωj)) / ((μi − μj)ᵗ Σ⁻¹ (μi − μj))] · (μi − μj)
(the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
• Case Σi = arbitrary
• The covariance matrices are different for each category
gi(x) = xᵗ Wi x + wiᵗ x + wi0
where:
Wi = −½ Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −½ μiᵗ Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi)
(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)
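A sketch of the general quadratic discriminant with hypothetical class parameters:

```python
# g_i(x) = x^T W_i x + w_i^T x + w_i0 with W_i, w_i, w_i0 as defined above.
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    sigma_inv = np.linalg.inv(sigma_i)
    W_i = -0.5 * sigma_inv
    w_i = sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))
    return float(x @ W_i @ x + w_i @ x + w_i0)

x = np.array([1.0, 0.0])
g1 = quadratic_discriminant(x, np.array([2.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5)
print("class 1" if g1 > g2 else "class 2")
```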
Exercise
Select the optimal decision, where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5)  (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3
P(ω2) = 1/3
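One way to inspect the exercise numerically; this sketch assumes the second parameter of N(·, ·) is the variance σ², in line with the book's N(μ, σ²) notation:

```python
# Compare g_i(x) = ln P(x | omega_i) + ln P(omega_i) over a grid of x values.
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def decide(x):
    g1 = math.log(normal_pdf(x, 2.0, 0.5)) + math.log(2 / 3)   # class omega_1
    g2 = math.log(normal_pdf(x, 1.5, 0.2)) + math.log(1 / 3)   # class omega_2
    return 1 if g1 > g2 else 2

for x in [0.5, 1.0, 1.5, 2.0, 2.5]:
    print(x, decide(x))   # shows which region of x is assigned to which class
```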
Parameter estimation
(Pattern Classification, Chapter 3)
• Data availability in a Bayesian framework
• We could design an optimal classifier if we knew:
  • P(ωi) (priors)
  • P(x | ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
• Design a classifier from a training sample
  • No problem with prior estimation
  • Samples are often too small for class-conditional estimation (large dimension of the feature space!)
• A priori information about the problem
  • E.g., assume normality of P(x | ωi):
    P(x | ωi) ~ N(μi, Σi)
    characterized by 2 parameters
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but
unknown!
• Best parameters are obtained by maximizing the
probability of obtaining the samples observed
• Bayesian methods view the parameters as
random variables having some known distribution
• In either approach, we use P(ωi | x) for our classification rule!
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category
• Suppose that D contains n samples, x1, x2, …, xn:
  P(D | θ) = ∏k=1..n P(xk | θ)
  P(D | θ) is called the likelihood of θ w.r.t. the set of samples
• The ML estimate of θ is, by definition, the value that maximizes P(D | θ):
  "It is the value of θ that best agrees with the actually observed training sample"
• Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ be the gradient operator
  ∇θ = (∂/∂θ1, ∂/∂θ2, …, ∂/∂θp)ᵗ
• We define l(θ) as the log-likelihood function
  l(θ) = ln P(D | θ)
• New problem statement:
  determine the θ that maximizes the log-likelihood
  θ̂ = argmaxθ l(θ)
Example: univariate normal density, μ and σ² are unknown,
i.e. θ = (θ1, θ2) = (μ, σ²)

ln P(xk | θ) = −½ ln(2π θ2) − (1 / (2θ2)) (xk − θ1)²

∇θ (ln P(xk | θ)) = ( ∂(ln P(xk | θ))/∂θ1 , ∂(ln P(xk | θ))/∂θ2 )ᵗ

Setting the gradient of the log-likelihood to zero gives, per sample,
(1/θ2)(xk − θ1) = 0
−1/(2θ2) + (xk − θ1)² / (2θ2²) = 0
Summing over the sample:
∑k=1..n (1/σ̂²)(xk − μ̂) = 0                            (1)
−∑k=1..n (1/σ̂²) + ∑k=1..n (xk − μ̂)²/σ̂⁴ = 0            (2)

Solving (1) and (2):
μ̂ = (1/n) ∑k=1..n xk
σ̂² = (1/n) ∑k=1..n (xk − μ̂)²
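A sketch of these ML estimates (the sample mean and the biased sample variance) on a hypothetical 1-D sample:

```python
# mu_hat = (1/n) * sum(x_k); sigma_hat^2 = (1/n) * sum((x_k - mu_hat)^2)
import numpy as np

samples = np.array([1.9, 2.4, 2.1, 1.7, 2.6, 2.3])   # hypothetical training sample D

mu_hat = samples.mean()
var_hat = np.mean((samples - mu_hat) ** 2)

print(round(mu_hat, 3), round(var_hat, 3))
```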
Bayesian Estimation
• In MLE, θ was supposed to be fixed
• In BE, θ is a random variable
• The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
• Goal: compute P(ωi | x, D)
  Given the sample D, Bayes formula can be written as
  P(x | D) = ∫ P(x | θ) · P(θ | D) dθ
• Bayesian Parameter Estimation: the Gaussian case
  Goal: estimate μ using the a-posteriori density P(μ | D)
• The univariate case: P(μ | D)
  μ is the only unknown parameter
  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)
  (μ0 and σ0 are known!)
P(μ | D) = P(D | μ) · P(μ) / ∫ P(D | μ) · P(μ) dμ        (1)
         ∝ ∏k=1..n P(xk | μ) · P(μ)
• Reproducing density:
P(μ | D) ~ N(μn, σn²)                                    (2)
Identifying (1) and (2) yields:
μn = (n σ0² / (n σ0² + σ²)) · μ̂n + (σ² / (n σ0² + σ²)) · μ0
σn² = σ0² σ² / (n σ0² + σ²)
where μ̂n = (1/n) ∑k=1..n xk is the sample mean.
• The univariate case: P(x | D)
  • P(μ | D) has been computed
  • P(x | D) remains to be computed:
  P(x | D) = ∫ P(x | μ) · P(μ | D) dμ  is Gaussian
  It provides:
  P(x | D) ~ N(μn, σ² + σn²)
  (the desired class-conditional density P(x | Dj, ωj))
  Therefore, using P(x | Dj, ωj) together with P(ωj) and the Bayes formula, we obtain the Bayesian classification rule:
  maximize P(ωj | x, D) over ωj, i.e. maximize P(x | ωj, Dj) · P(ωj)
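A sketch of the Bayesian update above, with a hypothetical sample, prior (μ0, σ0²), and known σ²:

```python
# mu_n and sigma_n^2 from the reproducing density, then the predictive P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2).
import numpy as np

samples = np.array([1.9, 2.4, 2.1, 1.7, 2.6, 2.3])   # hypothetical sample D
sigma_sq = 0.25                                       # known variance of P(x | mu)
mu_0, sigma_0_sq = 0.0, 1.0                           # prior P(mu) ~ N(mu_0, sigma_0^2)

n = len(samples)
mu_hat_n = samples.mean()

mu_n = (n * sigma_0_sq / (n * sigma_0_sq + sigma_sq)) * mu_hat_n \
     + (sigma_sq / (n * sigma_0_sq + sigma_sq)) * mu_0
sigma_n_sq = sigma_0_sq * sigma_sq / (n * sigma_0_sq + sigma_sq)

print(round(mu_n, 3), round(sigma_n_sq, 4), round(sigma_sq + sigma_n_sq, 4))
```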
• Bayesian Parameter Estimation: General Theory
• The computation of P(x | D) can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are:
  • The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  • Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn according to P(x)
MLE vs. Bayesian estimation
• As n → ∞, the two estimates become equal!
• MLE
  • Simple and fast (convex optimisation vs. numerical integration)
• Bayesian estimation
  • We can express our prior uncertainty about θ via P(θ)