Probability theory
retro
14/03/2017
Probability
(Atomic) events (A) and the probability space (Ω)
Axioms:
- 0 ≤ P(A) ≤ 1
- P(Ω) = 1
- If A1, A2, … are mutually exclusive events (Ai ∩ Aj = ∅, i ≠ j), then
  P(∪k Ak) = ∑k P(Ak)
Consequences:
- P(∅) = 0
- P(¬A) = 1 − P(A)
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- P(A) = P(A ∩ B) + P(A ∩ ¬B)
- If A ⊆ B, then P(A) ≤ P(B) and P(B − A) = P(B) − P(A)
Conditional probability
Conditional probability is the probability
of some event A, given the occurrence
of some other event B.
P(A|B) = P(A∩B)/P(B)
Chain rule:
P(A∩B) = P(A|B)·P(B)
Example:
A: headache, B: influenza
P(A) = 1/10, P(B) = 1/40, P(A|B)=?
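The slide leaves P(A|B) open. As an illustration of the two formulas above (not part of the original slides), here is a minimal Python sketch that assumes a hypothetical joint probability P(A∩B) = 1/80:

```python
# Hypothetical numbers: P(A) = 1/10, P(B) = 1/40, and an assumed joint P(A and B) = 1/80.
p_a = 1 / 10        # P(headache)
p_b = 1 / 40        # P(influenza)
p_a_and_b = 1 / 80  # assumed joint probability (not given on the slide)

p_a_given_b = p_a_and_b / p_b          # P(A|B) = P(A ∩ B) / P(B)
p_a_and_b_chain = p_a_given_b * p_b    # chain rule recovers the joint

print(p_a_given_b)       # 0.5
print(p_a_and_b_chain)   # 0.0125
```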
Independence of events
A and B are independent iff
P(A|B) = P(A)
Corollary:
P(A ∩ B) = P(A) · P(B)
P(B|A) = P(B)
Product rule
For arbitrary events A1, A2, …, An:
P(A1 ∩ A2 ∩ … ∩ An) = P(An | A1 … An−1) · P(An−1 | A1 … An−2) · … · P(A2 | A1) · P(A1)
Law of total probability
If the events A1, A2, …, An form a complete probability space (a partition of Ω) and P(Ai) > 0 for each i, then
P(B) = ∑i=1..n P(B | Ai) · P(Ai)
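A minimal Python sketch of the law of total probability, with hypothetical numbers for three disjoint events A1, A2, A3 covering Ω:

```python
# Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i); all numbers are hypothetical.
priors = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3); they must sum to 1
likelihoods = [0.9, 0.5, 0.1]     # P(B | A1), P(B | A2), P(B | A3)

p_b = sum(p_b_given_a * p_a for p_b_given_a, p_a in zip(likelihoods, priors))
print(p_b)  # 0.62
```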
Bayes rule
P(A|B) = P(A∩B)/P(B) = P(B|A) · P(A) / P(B)
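Continuing the hypothetical numbers above, a one-line application of Bayes rule to get the posterior P(A1 | B):

```python
# Bayes rule: P(A1 | B) = P(B | A1) * P(A1) / P(B), with the evidence P(B) taken from the
# law-of-total-probability sketch above (all numbers hypothetical).
p_b_given_a1 = 0.9
p_a1 = 0.5
p_b = 0.62
p_a1_given_b = p_b_given_a1 * p_a1 / p_b
print(round(p_a1_given_b, 3))  # 0.726
```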
Random variable
ξ: Ω → R
Vectors of random variables …
Cumulative distribution function (CDF)
F(x) = P(ξ ≤ x)
F(x1) ≤ F(x2), if x1 < x2
limx→−∞ F(x) = 0, limx→∞ F(x) = 1
F(x) is non-decreasing and right-continuous
Discrete vs. continuous random variables
Discrete: its set of possible values is a finite or countably infinite sequence
Continuous: it has a probability density function f(x), so probabilities over an interval (a, b) are obtained by integrating f(x)
Probability density function (pdf)
F(b) − F(a) = P(a < ξ < b) = ∫_a^b f(x) dx
f(x) = F′(x) and F(x) = ∫_{−∞}^x f(t) dt
Histogram: an empirical estimate of a density
Independence of random variables
ξ and η are independent iff for any a ≤ b, c ≤ d
P(a ≤ ξ ≤ b, c ≤ η ≤ d) = P(a ≤ ξ ≤ b) · P(c ≤ η ≤ d).
Composition of random variables
Discrete case: ζ = ξ + η
If ξ and η are independent, then
rn = P(ζ = n) = ∑k=−∞..∞ P(ξ = n − k, η = k) = ∑k=−∞..∞ P(ξ = n − k) · P(η = k)
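As an illustration (not from the slides), a sketch of this discrete convolution for the sum of two independent fair dice:

```python
# Distribution of zeta = xi + eta for two independent fair dice, via
# r_n = sum_k P(xi = n - k) * P(eta = k).
from collections import defaultdict

die = {face: 1 / 6 for face in range(1, 7)}   # P(xi = k) = P(eta = k) = 1/6

r = defaultdict(float)
for i, p_i in die.items():
    for j, p_j in die.items():
        r[i + j] += p_i * p_j                  # accumulate P(zeta = i + j)

print(round(r[7], 4))              # 0.1667, the most likely sum
print(round(sum(r.values()), 6))   # 1.0 (sanity check)
```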
Expected value
If ξ can take the values x1, x2, … with probabilities p1, p2, …, then
M(ξ) = ∑i xi pi
Continuous case:
M(ξ) = ∫_{−∞}^{∞} x f(x) dx
Properties of expected value
M(cξ) = c·M(ξ)
M(ξ + η) = M(ξ) + M(η)
If ξ and η are independent random variables, then M(ξη) = M(ξ)·M(η)
Standard deviation
D(ξ) = (M[(ξ − M(ξ))²])^{1/2}
D²(ξ) = M(ξ²) − M²(ξ)
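A quick numerical check of the identity D²(ξ) = M(ξ²) − M²(ξ) on a hypothetical discrete distribution:

```python
# Compare the variance computed via M(xi^2) - M(xi)^2 with the direct definition.
values = [0, 1, 2]
probs = [0.2, 0.5, 0.3]   # hypothetical probabilities; they sum to 1

m = sum(x * p for x, p in zip(values, probs))              # M(xi)
m2 = sum(x ** 2 * p for x, p in zip(values, probs))        # M(xi^2)
var_identity = m2 - m ** 2                                 # D^2(xi) = M(xi^2) - M^2(xi)
var_direct = sum((x - m) ** 2 * p for x, p in zip(values, probs))
print(round(var_identity, 4), round(var_direct, 4))        # both 0.49
```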
Properties of standard deviation
- D²(aξ + b) = a²·D²(ξ)
- If ξ1, ξ2, …, ξn are independent random variables, then
  D²(ξ1 + ξ2 + … + ξn) = D²(ξ1) + D²(ξ2) + … + D²(ξn)
Correlation
Covariance:
c = M[(ξ − M(ξ))(η − M(η))]
c is 0 if ξ and η are independent
Correlation coefficient:
r = c / (D(ξ)·D(η)),
the covariance normalised into [−1, 1]
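A sketch of the covariance and correlation-coefficient formulas on hypothetical data (NumPy assumed to be available):

```python
# Empirical covariance c and correlation coefficient r for two samples.
import numpy as np

xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
eta = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2 * xi, so r should be close to +1

cov = np.mean((xi - xi.mean()) * (eta - eta.mean()))   # c = M[(xi - M(xi)) (eta - M(eta))]
r = cov / (xi.std() * eta.std())                       # r = c / (D(xi) D(eta))
print(round(cov, 3), round(r, 4))                      # r comes out close to 1
```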
Well-known distributions
Normal/Gaussian
Binomial:
ξ ~ B(n, p)
M(ξ) = np
D(ξ) = √(np(1 − p))
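A sketch that checks the binomial mean and standard deviation formulas against a simulated sample; n and p are hypothetical:

```python
# M(xi) = n*p and D(xi) = sqrt(n*p*(1-p)) for xi ~ B(n, p), checked by simulation.
import math
import random

n, p = 20, 0.3
mean_formula = n * p
std_formula = math.sqrt(n * p * (1 - p))

samples = [sum(random.random() < p for _ in range(n)) for _ in range(100_000)]
mean_emp = sum(samples) / len(samples)
std_emp = math.sqrt(sum((s - mean_emp) ** 2 for s in samples) / len(samples))

print(mean_formula, round(mean_emp, 2))           # both around 6.0
print(round(std_formula, 3), round(std_emp, 3))   # both around 2.05
```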
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Bayes classification
Classification
Supervised learning: based on training examples (E), learn a model that performs well on previously unseen examples.
Classification: a supervised learning task of categorising entities into a predefined set of classes.
Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) · P(ωj) / P(x)
Posterior = (Likelihood · Prior) / Evidence
where, in the case of two categories,
P(x) = ∑j=1..2 P(x | ωj) · P(ωj)
Bayes Classifier
• Decision given the posterior probabilities:
x is an observation for which
  if P(ω1 | x) > P(ω2 | x), the decision is: true state of nature = ω1
  if P(ω1 | x) < P(ω2 | x), the decision is: true state of nature = ω2
This rule minimizes the probability of error.
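A minimal sketch (not the book's code) of this two-category decision rule; the likelihoods and priors below are hypothetical:

```python
# Decide omega_1 if P(omega_1 | x) > P(omega_2 | x), otherwise omega_2.
def bayes_decide(likelihoods, priors):
    """likelihoods[i] = P(x | omega_i), priors[i] = P(omega_i); returns the chosen class index."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
    return max(range(len(posteriors)), key=lambda i: posteriors[i])

# Class 0 is twice as likely a priori, but the observation favours class 1.
print(bayes_decide(likelihoods=[0.1, 0.4], priors=[2 / 3, 1 / 3]))  # 1
```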
Classifiers, Discriminant Functions
and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• The classifier assigns a feature vector x to class ωi if:
  gi(x) > gj(x) for all j ≠ i
For the minimum error rate, we take
gi(x) = P(ωi | x)
(maximum discriminant corresponds to maximum posterior!)
Equivalent choices:
gi(x) = P(x | ωi) · P(ωi)
gi(x) = ln P(x | ωi) + ln P(ωi)
(ln: natural logarithm!)
• Feature space divided into c decision regions:
  if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  (Ri means: assign x to ωi)
• The two-category case
  • A classifier is a "dichotomizer" that has two discriminant functions g1 and g2
  Let g(x) ≡ g1(x) − g2(x)
  Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x):
g(x) = P(ω1 | x) − P(ω2 | x)
g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
Discriminant functions
of the Bayes Classifier
with Normal Density
The Normal Density
• Univariate density
  • Analytically tractable
  • Continuous density
  • Many processes are asymptotically Gaussian
  • Handwritten characters or speech sounds can be viewed as an ideal prototype corrupted by a random process (central limit theorem)

P(x) = (1 / (√(2π)·σ)) · exp[ −½ ((x − μ)/σ)² ]

where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance
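A small sketch evaluating this univariate normal density; the parameters μ = 0 and σ = 1 are hypothetical:

```python
# P(x) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma) ** 2)
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

print(round(normal_pdf(0.0, mu=0.0, sigma=1.0), 4))  # 0.3989, the peak of the standard normal
```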
• Multivariate density
• The multivariate normal density in d dimensions is:

P(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) · exp[ −½ (x − μ)ᵗ Σ⁻¹ (x − μ) ]

where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
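A sketch of the d-dimensional normal density with a hypothetical 2-D mean vector and covariance matrix (NumPy assumed):

```python
# P(x) = exp(-0.5 * (x-mu)^T Sigma^-1 (x-mu)) / ((2*pi)^(d/2) * |Sigma|^(1/2))
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # d x d covariance matrix
print(round(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, sigma), 4))
```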
Discriminant Functions for the Normal Density
• We saw that minimum error-rate classification can be achieved by the discriminant function
gi(x) = ln P(x | ωi) + ln P(ωi)
• Case of the multivariate normal:
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)
• Case Σi = σ²·I (I stands for the identity matrix)
gi(x) = wiᵗ x + wi0  (linear discriminant function)
where:
wi = μi / σ²;  wi0 = −(1 / (2σ²)) μiᵗ μi + ln P(ωi)
(wi0 is called the threshold for the i-th category!)
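A sketch of this linear discriminant with hypothetical 2-D means, a shared σ², and equal priors:

```python
# g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 * sigma^2) + ln P(omega_i).
import numpy as np

def linear_discriminant(x, mu_i, sigma_sq, prior_i):
    w_i = mu_i / sigma_sq
    w_i0 = -mu_i @ mu_i / (2 * sigma_sq) + np.log(prior_i)
    return float(w_i @ x + w_i0)

x = np.array([1.5, 1.0])
g1 = linear_discriminant(x, np.array([2.0, 2.0]), sigma_sq=1.0, prior_i=0.5)
g2 = linear_discriminant(x, np.array([0.0, 0.0]), sigma_sq=1.0, prior_i=0.5)
print("class 1" if g1 > g2 else "class 2")   # class 1 (x is closer to the first mean)
```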
• A classifier that uses linear discriminant functions
is called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
The hyperplane is always orthogonal to the line linking the means!
• The hyperplane separating Ri and Rj passes through the point
x0 = ½ (μi + μj) − (σ² / ‖μi − μj‖²) · ln[P(ωi) / P(ωj)] · (μi − μj)
and is always orthogonal to the line linking the means!
If P(ωi) = P(ωj), then x0 = ½ (μi + μj)
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
• Hyperplane separating Ri and Rj:
x0 = ½ (μi + μj) − [ln(P(ωi) / P(ωj)) / ((μi − μj)ᵗ Σ⁻¹ (μi − μj))] · (μi − μj)
(the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
• Case Σi = arbitrary
• The covariance matrices are different for each category
gi(x) = xᵗ Wi x + wiᵗ x + wi0
where:
Wi = −½ Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −½ μiᵗ Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi)
(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)
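A sketch of the general quadratic discriminant with hypothetical class parameters:

```python
# g_i(x) = x^T W_i x + w_i^T x + w_i0 with W_i, w_i, w_i0 as defined above.
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    sigma_inv = np.linalg.inv(sigma_i)
    W_i = -0.5 * sigma_inv
    w_i = sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))
    return float(x @ W_i @ x + w_i @ x + w_i0)

x = np.array([1.0, 0.0])
g1 = quadratic_discriminant(x, np.array([2.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5)
print("class 1" if g1 > g2 else "class 2")
```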
Exercise
Select the optimal decision, where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5)  (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3
P(ω2) = 1/3
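One way to inspect the exercise numerically; this sketch assumes the second parameter of N(·, ·) is the variance σ², in line with the book's N(μ, σ²) notation:

```python
# Compare g_i(x) = ln P(x | omega_i) + ln P(omega_i) over a grid of x values.
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def decide(x):
    g1 = math.log(normal_pdf(x, 2.0, 0.5)) + math.log(2 / 3)   # class omega_1
    g2 = math.log(normal_pdf(x, 1.5, 0.2)) + math.log(1 / 3)   # class omega_2
    return 1 if g1 > g2 else 2

for x in [0.5, 1.0, 1.5, 2.0, 2.5]:
    print(x, decide(x))   # shows which region of x is assigned to which class
```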
Parameter estimation
(Pattern Classification, Chapter 3)
• Data availability in a Bayesian framework
• We could design an optimal classifier if we knew:
  • P(ωi) (priors)
  • P(x | ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
• Design a classifier from a training sample
  • No problem with prior estimation
  • Samples are often too small for class-conditional estimation (large dimension of the feature space!)
• A priori information about the problem
  • E.g., assume normality of P(x | ωi):
    P(x | ωi) ~ N(μi, Σi)
    characterized by 2 parameters
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but
unknown!
• Best parameters are obtained by maximizing the
probability of obtaining the samples observed
• Bayesian methods view the parameters as
random variables having some known distribution
• In either approach, we use P(ωi | x) for our classification rule!
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category
• Suppose that D contains n samples, x1, x2, …, xn:
  P(D | θ) = ∏k=1..n P(xk | θ)
  P(D | θ) is called the likelihood of θ w.r.t. the set of samples
• The ML estimate of θ is, by definition, the value that maximizes P(D | θ):
  "It is the value of θ that best agrees with the actually observed training sample"
• Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ be the gradient operator
  ∇θ = (∂/∂θ1, ∂/∂θ2, …, ∂/∂θp)ᵗ
• We define l(θ) as the log-likelihood function
  l(θ) = ln P(D | θ)
• New problem statement:
  determine the θ that maximizes the log-likelihood
  θ̂ = argmaxθ l(θ)
Example: univariate normal density, μ and σ² are unknown,
i.e. θ = (θ1, θ2) = (μ, σ²)

ln P(xk | θ) = −½ ln(2π θ2) − (1 / (2θ2)) (xk − θ1)²

∇θ (ln P(xk | θ)) = ( ∂(ln P(xk | θ))/∂θ1 , ∂(ln P(xk | θ))/∂θ2 )ᵗ

Setting the gradient of the log-likelihood to zero gives, per sample,
(1/θ2)(xk − θ1) = 0
−1/(2θ2) + (xk − θ1)² / (2θ2²) = 0
Summing over the sample:
∑k=1..n (1/σ̂²)(xk − μ̂) = 0                            (1)
−∑k=1..n (1/σ̂²) + ∑k=1..n (xk − μ̂)²/σ̂⁴ = 0            (2)

Solving (1) and (2):
μ̂ = (1/n) ∑k=1..n xk
σ̂² = (1/n) ∑k=1..n (xk − μ̂)²
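A sketch of these ML estimates (the sample mean and the biased sample variance) on a hypothetical 1-D sample:

```python
# mu_hat = (1/n) * sum(x_k); sigma_hat^2 = (1/n) * sum((x_k - mu_hat)^2)
import numpy as np

samples = np.array([1.9, 2.4, 2.1, 1.7, 2.6, 2.3])   # hypothetical training sample D

mu_hat = samples.mean()
var_hat = np.mean((samples - mu_hat) ** 2)

print(round(mu_hat, 3), round(var_hat, 3))
```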
Bayesian Estimation
• In MLE, θ was supposed to be fixed
• In BE, θ is a random variable
• The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
• Goal: compute P(ωi | x, D)
  Given the sample D, Bayes formula can be written as
  P(x | D) = ∫ P(x | θ) · P(θ | D) dθ
• Bayesian Parameter Estimation: the Gaussian case
  Goal: estimate μ using the a-posteriori density P(μ | D)
• The univariate case: P(μ | D)
  μ is the only unknown parameter
  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)
  (μ0 and σ0 are known!)
P(μ | D) = P(D | μ) · P(μ) / ∫ P(D | μ) · P(μ) dμ        (1)
         ∝ ∏k=1..n P(xk | μ) · P(μ)
• Reproducing density:
P(μ | D) ~ N(μn, σn²)                                    (2)
Identifying (1) and (2) yields:
μn = (n σ0² / (n σ0² + σ²)) · μ̂n + (σ² / (n σ0² + σ²)) · μ0
σn² = σ0² σ² / (n σ0² + σ²)
where μ̂n = (1/n) ∑k=1..n xk is the sample mean.
• The univariate case: P(x | D)
  • P(μ | D) has been computed
  • P(x | D) remains to be computed:
  P(x | D) = ∫ P(x | μ) · P(μ | D) dμ  is Gaussian
  It provides:
  P(x | D) ~ N(μn, σ² + σn²)
  (the desired class-conditional density P(x | Dj, ωj))
  Therefore, using P(x | Dj, ωj) together with P(ωj) and the Bayes formula, we obtain the Bayesian classification rule:
  maximize P(ωj | x, D) over ωj, i.e. maximize P(x | ωj, Dj) · P(ωj)
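A sketch of the Bayesian update above, with a hypothetical sample, prior (μ0, σ0²), and known σ²:

```python
# mu_n and sigma_n^2 from the reproducing density, then the predictive P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2).
import numpy as np

samples = np.array([1.9, 2.4, 2.1, 1.7, 2.6, 2.3])   # hypothetical sample D
sigma_sq = 0.25                                       # known variance of P(x | mu)
mu_0, sigma_0_sq = 0.0, 1.0                           # prior P(mu) ~ N(mu_0, sigma_0^2)

n = len(samples)
mu_hat_n = samples.mean()

mu_n = (n * sigma_0_sq / (n * sigma_0_sq + sigma_sq)) * mu_hat_n \
     + (sigma_sq / (n * sigma_0_sq + sigma_sq)) * mu_0
sigma_n_sq = sigma_0_sq * sigma_sq / (n * sigma_0_sq + sigma_sq)

print(round(mu_n, 3), round(sigma_n_sq, 4), round(sigma_sq + sigma_n_sq, 4))
```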
• Bayesian Parameter Estimation: General Theory
• The computation of P(x | D) can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are:
  • The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  • Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn according to P(x)
MLE vs. Bayesian estimation
• As n → ∞, the two estimates become equal!
• MLE
  • Simple and fast (convex optimisation vs. numerical integration)
• Bayesian estimation
  • We can express our prior uncertainty about θ via P(θ)