
CS1512
Foundations of
Computing Science 2
Lecture 4
Bayes Law; Gaussian Distributions
1
© J R W Hunter, 2006; C J van Deemter 2007
(Revd. Thomas) Bayes Theorem
P( E1 and E2 ) = P( E1 )* P( E2 | E1 )
Order of E1 and E2 is not important.
2
(Revd. Thomas) Bayes Theorem
P( E1 and E2 ) = P( E1 )* P( E2 | E1 )
Order of E1 and E2 is not important.
P( E2 and E1 ) = P( E2 )* P( E1 | E2 ) .
3
(Revd. Thomas) Bayes Theorem
P( E1 and E2 ) = P( E1 )* P( E2 | E1 )
Order of E1 and E2 is not important.
P( E2 and E1 ) = P( E2 )* P( E1 | E2 ) . So,
P( E2 )* P( E1 | E2 ) = P( E1 )* P( E2 | E1 ) . So,
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
4
(Revd. Thomas) Bayes Theorem
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
E.g., once you know P(E1) and P(E2),
knowing one of the two conditional probabilities
(that is, knowing either P(E1|E2) or P(E2|E1))
implies knowing the other of the two.
5
(Revd. Thomas) Bayes Theorem
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
A simple example, without replacement:
Two red balls and one black: RRB
P(R)=2/3
P(B)=
P(R|B)=
P(B|R)=
6
(Revd. Thomas) Bayes Theorem
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
A simple example:
Two red balls and one black: RRB
P(R)=2/3
P(B)=1/3
P(R|B)=1
P(B|R)=1/2
7
(Revd. Thomas) Bayes Theorem
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
A simple example:
Two red balls and one black: RRB
P(R)=2/3
P(B)=1/3
P(R|B)=1
P(B|R)=1/2. Now calculate P(R|B) using Bayes:
P(R|B)=
8
(Revd. Thomas) Bayes Theorem
P( E1 | E2 ) = P( E1 ) * P( E2 | E1 ) / P( E2 )
A simple example:
Two red balls and one black: RRB
P(R)=2/3
P(B)=1/3
P(R|B)=1
P(B|R)=1/2. Now calculate P(R|B) using Bayes:
P(R|B) = (2/3 * 1/2) / (1/3) = (2/6) / (1/3) = 1 --- correct!
9
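The check on this slide can be reproduced in a few lines of Python (the code is not part of the slides; it is just an illustrative sketch of the same calculation):

```python
# Check Bayes' theorem on the RRB example: two red balls, one black,
# drawing without replacement.
p_r = 2 / 3          # P(R): first ball drawn is red
p_b = 1 / 3          # P(B): first ball drawn is black
p_b_given_r = 1 / 2  # P(B|R): one black among the two remaining balls
p_r_given_b = 1.0    # P(R|B): both remaining balls are red

# Bayes: P(R|B) = P(R) * P(B|R) / P(B)
bayes_r_given_b = p_r * p_b_given_r / p_b
print(bayes_r_given_b)  # 1.0, matching P(R|B) counted directly
```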
Bayes Theorem
Why bother?
• Suppose E1 is a disease (say measles)
• Suppose E2 is a symptom (say spots)
• P( E1 ) is the probability of a given patient having measles
(before you see them)
• P( E2 ) is the probability of a given patient having spots
(for whatever reason)
• P( E2 | E1 ) is the probability that someone who has measles has spots
Bayes theorem allows us to calculate the probability:
P( E1 | E2 ) = P (patient has measles given that the patient has spots)
10
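The measles/spots version of the theorem can be sketched the same way. The numbers below are invented purely for illustration (the slide gives no figures):

```python
# Hypothetical numbers (not from the slides) for the measles/spots
# example of Bayes' theorem.
p_measles = 0.01             # P(E1): prior probability of measles
p_spots = 0.05               # P(E2): probability of spots, any cause
p_spots_given_measles = 0.9  # P(E2|E1): spots given measles

# P(E1|E2) = P(E1) * P(E2|E1) / P(E2)
p_measles_given_spots = p_measles * p_spots_given_measles / p_spots
print(p_measles_given_spots)  # ~0.18: measles given spots
```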
Discrete random variables
Discrete random variable
• a discrete variable where the values taken are associated with a
probability
Discrete probability distribution
• formed by the possible values and their probabilities
• P(X = <some value>) = ...
11
Simple example – fair coin
Let variable (X) be number of heads in one trial (toss).
Values are 0 and 1
Probability distribution:

  Value of X   P(X = i)
  0            0.5
  1            0.5

[Bar chart: two equal bars of height 0.5 at X = 0 and X = 1]
12
Example – total of two dice
Let variable (X) be the total when two fair dice are tossed.
Its values are 2, 3, 4, ... 11, 12
Are these “combination” values equally probable?
13
Example – total of two dice
Let variable (X) be the total when two fair dice are tossed.
Its values are 2, 3, 4, ... 11, 12
Not all these “combination” values are equally probable!
Probability distribution:
P(X = 2) = 1/36
P(X = 3) = 2/36 (because 3 = 1+2 = 2+1)
P(X = 4) = 3/36 (because 4 = 1+3 = 3+1 = 2+2)
...
Addition table (rows = first die, columns = second die):

      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12
14
Example – total of two dice
Let variable (X) be the total when two fair dice are tossed.
Its values are 2, 3, 4, ... 11, 12
Probability distribution:
P(X = 2) = 1/36
P(X = 3) = 2/36
P(X = 4) = 3/36
P(X = 5) = 4/36
P(X = 6) = 5/36
P(X = 7) = 6/36
P(X = 8) = 5/36
P(X = 9) = 4/36
P(X = 10) = 3/36
P(X = 11) = 2/36
P(X = 12) = 1/36

[Bar chart of this distribution over X = 2..12: triangular shape, peak 6/36 ≈ 0.167 at X = 7]
15
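The whole distribution can be derived by enumerating the 36 equally likely outcomes, as in this short sketch (not from the slides):

```python
from collections import Counter

# Distribution of the total of two fair dice, by enumerating all 36
# equally likely (die1, die2) outcomes.
totals = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for total in sorted(totals):
    print(f"P(X = {total}) = {totals[total]}/36")
```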
Bernoulli probability distribution
Bernoulli trial:
• one in which there are only two possible and exclusive
outcomes;
• call the outcomes ‘success’ and ‘failure’;
• consider the variable X which is the number of successes in one trial

Probability distribution:

  Value of X   P(X = i)
  0            (1-p)
  1            p

(fair coin: p = 1/2; graph: p = 4/10)
[Bar chart of the two probabilities for p = 4/10, y-axis 0 to 0.7]
16
Binomial distribution
Carry out n Bernoulli trials. X is the number of successes in n trials.
Possible values are: 0, 1, 2, ... i, ... n-1, n
What is P(X=r) ? (= the probability of scoring exactly r successes in n trials)
17
Binomial distribution
Carry out n Bernoulli trials. X is the number of successes in n trials.
Possible values are: 0, 1, 2, ... i, ... n-1, n
What is P(X=r) ? (= the probability of scoring exactly r successes in n trials)
Let s=success and f=failure. Then a run of n trials is a tuple, e.g. <s,s,f,s,f,f,f> (n=7)
Generally, a run of n trials is an n-tuple consisting of s and f.
Consider the 7-tuple <f,s,f,s,f,f,f> . Its probability is
P(f)*P(s)*P(f)*P(s)*P(f)*P(f)*P(f). That’s P(s)^2 * P(f)^5.
Probability of a given tuple with r s-es and n-r f-s in it = P(s)^r * P(f)^(n-r)
18
Binomial distribution
Probability of a given tuple with r s-es and n-r f-s in it = P(s)^r * P(f)^(n-r)
We’re interested in P(X=r), that’s the probability of all tuples together
that have r s-es and n-r f-s in them (i.e. the sum of their probabilities).
How many ways are there of picking exactly r elements out of a set of n?
Combinatorics says there are
  n! / (r! (n-r)!)
(Def: n! = n * (n-1) * (n-2) * ... * 2 * 1)
Try a few values. E.g., if n=9 and r=1 then 9! / (1! (9-1)!) = 9! / 8! = 9.
So, we need to multiply P(s)^r * P(f)^(n-r) by this factor.
19
Binomial distribution
Combining what you’ve seen in the previous slides
Carry out n Bernoulli trials; in each trial the probability of success is P(s)
and the probability of failure is P(f)
The distribution of X (i.e., the number of successes) is:
P(X=r) = nCr * P(s)^r * P(f)^(n-r)
where nCr = n! / (r! (n-r)!)
20
Binomial distribution
(slightly different formulation)
P(X=r) = nCr * p^r * (1-p)^(n-r)
Remember CS1015, combinations & permutations
21
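The formula above translates directly into code. This sketch (not from the slides) uses Python's built-in `math.comb` for nCr:

```python
from math import comb

# Binomial pmf: probability of exactly r successes in n Bernoulli
# trials, each with success probability p.
def binomial_pmf(n, r, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Sanity checks against the slides:
print(binomial_pmf(1, 1, 0.5))  # 0.5: one fair-coin toss
print(comb(9, 1))               # 9: the n=9, r=1 example
# The probabilities over all possible r must sum to 1:
print(sum(binomial_pmf(20, r, 0.5) for r in range(21)))  # ~1.0
```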
Example of binomial distribution
Probability of r heads in n trials (n = 20)
P(X = r)
[Bar chart over r = 0..20: symmetric, peak ≈ 0.176 at r = 10]
22
Example of binomial distribution
Probability of r heads in n trials (n = 100)
P(X = r)
[Bar chart over r = 0..100: symmetric, narrower relative to its range, peak ≈ 0.08 at r = 50]
23
Probability of being in a given
range
Probability of r heads in n trials (n = 100)
P(0 ≤ X < 5)
P(5 ≤ X < 10)
...
[Bar chart: probability of X lying in each range, X from 0 to 100, peaked around X = 50]
24
Probability of being in a given
range
P(55 ≤ X < 65) = P(55 ≤ X < 60) + P(60 ≤ X < 65) (mutually exclusive events)
Add the heights of the columns
[Same bar chart as the previous slide, with the columns covering 55 ≤ X < 65 highlighted]
25
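Because the events X = r are mutually exclusive, a range probability is just a sum of pmf values. A sketch (not from the slides):

```python
from math import comb

def binomial_pmf(n, r, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

# P(a <= X < b) for X ~ Binomial(n, p): sum the pmf over the range.
def prob_in_range(n, p, a, b):
    return sum(binomial_pmf(n, r, p) for r in range(a, b))

n, p = 100, 0.5
# Splitting the range gives the same answer, since the sub-ranges
# are mutually exclusive:
print(prob_in_range(n, p, 55, 65))
print(prob_in_range(n, p, 55, 60) + prob_in_range(n, p, 60, 65))
```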
Continuous random variable
Consider a large population of individuals
e.g. all males in the UK over 16
Consider a continuous attribute
e.g. Height: X
Select an individual at random so that any individual
is as likely to be selected as any other, observe
his/her Height
X is said to be a continuous random variable
26
Continuous random variable
Suppose the heights in your sample range
from 31cm (a baby) to 207cm
Continuous ⇒ can’t list every height separately ⇒ need bins.
If you are satisfied with a rough division into a few
dozen categories, you might decide to use bins as follows:
• 19.5 - 39.5cm: relative frequency = 3/100
• 39.5 - 59.5cm: relative frequency = 7/100
• 59.5 - 79.5cm: relative frequency = 23/100
• etc.
Relative frequencies may be plotted as follows:
27
Relative Frequency Histogram
relative frequency
[Histogram: bins of width 20cm starting at 19.5cm; the height of the bar gives the relative frequency; tallest bar ≈ 0.23]
28
Probability density function
The probability distribution of X is the function P such that, for all a and b :
P( {x : a ≤ x ≤ b} ) = the area under the curve between a and b
Total area under the curve = 1.0
29
Increase the number of samples
... and decrease the width of the bin ...
30
The ‘normal’ or Gaussian distribution
Very common distribution:
• variable measured for large number of nominally identical objects;
• variation assumed to be caused by a large number of factors;
• each factor exerts a small random positive or negative influence;
• e.g. height: age, diet, genes
-- Symmetric about mean μ
-- Monotonically increasing where x < μ
-- Monotonically decreasing where x > μ
-- (Unimodal: only one peak)
[Graph: bell-shaped curve, x from 2 to 18, peak ≈ 0.2 at x = 10]
The greater the number of trials (n),
the more closely does the binomial
distribution approximate a normal
distribution
31
Mean
Mean determines the centre of the curve:
[Two bell curves of identical shape: one centred at Mean = 10, the other at Mean = 30]
32
Standard deviation
Standard deviation determines the ‘width’ of the curve:
[Two bell curves with the same mean: one with Std. Devn. = 2 (wider, peak ≈ 0.2), one with Std. Devn. = 1 (narrower, peak ≈ 0.4)]
33
Normal distribution
Given a perfectly normal distribution, mean and standard deviation
together determine the function uniquely.
Certain things can be proven about any normal distribution
This is why statisticians often assume that their population
is approximately normally distributed .... even though this is
usually very difficult to be sure about!
Example (without proof): In normally distributed data, the
probability of lying within 2 standard deviations of the mean,
on either side, is approximately 0.95
34
Probability of sample lying within
mean ± 2 standard deviations
Example:
Mean (μ)= 10.0
Std. devn (σ)= 2.0
P(X < μ – σ) = 15.866%
P(X < μ – 2σ) = 2.275%
P(μ – 2σ < X < μ + 2σ)
= (100 – 2 * 2.275)%
= 95.45%
  x      Prob Dist     Cum Prob Dist
  2.0    0.00013383     0.003%
  2.4    0.000291947    0.007%
  2.8    0.000611902    0.016%
  3.2    0.001232219    0.034%
  3.6    0.002384088    0.069%
  4.0    0.004431848    0.135%
  4.4    0.007915452    0.256%
  4.8    0.013582969    0.466%
  5.2    0.02239453     0.820%
  5.6    0.035474593    1.390%
  6.0    0.053990967    2.275%
  6.4    0.078950158    3.593%
  6.8    0.110920835    5.480%
  7.2    0.149727466    8.076%
  7.6    0.194186055   11.507%
  8.0    0.241970725   15.866%
  8.4    0.289691553   21.186%
  8.8    0.333224603   27.425%
  9.2    0.36827014    34.458%
  9.6    0.391042694   42.074%
  10.0   0.39894228    50.000%
35
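The cumulative probabilities in the table can be checked against the normal CDF, which Python's standard library lets us express via the error function. A sketch (not from the slides):

```python
from math import erf, sqrt

# Cumulative distribution function of a normal distribution with
# mean mu and standard deviation sigma, via the error function.
def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 10.0, 2.0
print(normal_cdf(mu - sigma, mu, sigma))      # ~0.15866 = P(X < mu - sigma)
print(normal_cdf(mu - 2 * sigma, mu, sigma))  # ~0.02275 = P(X < mu - 2 sigma)

# Probability of lying within 2 standard deviations of the mean:
within_2sd = (normal_cdf(mu + 2 * sigma, mu, sigma)
              - normal_cdf(mu - 2 * sigma, mu, sigma))
print(within_2sd)  # ~0.9545, the "approximately 0.95" of the slides
```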
Probability of sample lying within
mean ± 2 standard deviations
[Bell curve with μ = 10, σ = 2: the tail below μ – 2σ and the tail above μ + 2σ each hold 2.275% of the area; the central region between μ – 2σ and μ + 2σ holds 95.45%]
36
Non-normal distributions
Some non-normal distributions are so close to ‘normal’ that it doesn’t
harm much to pretend that they are normal
But some distributions are very far from ‘normal’
As a stark reminder, suppose we’re considering the variable X= “error
when a person’s age is stated”. Error =def real age – age last birthday
Range(X)= 0 to 12 months
Which of these values is most probable?
37
Non-normal distributions
Some non-normal distributions are so close to ‘normal’ that it doesn’t
harm much to pretend that they are normal
But some distributions are very far from ‘normal’
As a stark reminder, suppose we’re considering the variable X= “error
when a person’s age is stated”. Error =def real age – age last birthday
Range(X)= 0 to 12 months
Which of these values is most probable? None: always P(x)=1/12
How would you plot this?
38
Uniform probability distribution
Also called rectangular distribution
[Graph: P(X) constant at height 1.0 for x between 0.0 and 1.0]
P(X < y) = 1.0 * y = y
39
Uniform probability distribution
P(X < a) = a
P(X < b) = b
P(a ≤ X < b) = b - a
[Graph as before, with the interval from a to b marked on the x-axis]
40
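The rule P(a ≤ X < b) = b - a for a uniform variable on [0, 1) is easy to confirm by simulation. A Monte Carlo sketch (not from the slides):

```python
import random

# Monte Carlo check of P(a <= X < b) = b - a for X uniform on [0, 1).
random.seed(0)  # fixed seed so the run is reproducible
a, b = 0.2, 0.7
n = 100_000
hits = sum(1 for _ in range(n) if a <= random.random() < b)
print(hits / n)  # close to b - a = 0.5
```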
Towards Inferential Statistics
If this was a course on inferential statistics, we would now discuss the
role of normal distributions in the statistical testing of hypotheses.
The rough idea (in one representative type of situation):
Your hypothesis: Brits are taller than Frenchmen
• Take two samples: heights of Brits, heights of Frenchmen.
• Assume both populations (as a whole) are normally distributed.
• Compute the mean values, μb and μf, for these two samples.
• Suppose you find μb > μf. This is promising … but it could be an accident!
• Now compute the probability p of measuring these means if, in fact,
  μb = μf in the two populations as a whole (i.e., your finding was an accident).
• If p < 0.05 then conclude that Brits are taller than Frenchmen.
41
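The slides do not prescribe a particular test; one simple way to realise the idea above is a permutation test, sketched here with made-up height data (all numbers are invented for illustration):

```python
import random

random.seed(1)
# Hypothetical samples of heights in cm (not real data):
brits = [180, 178, 182, 176, 181, 179, 183, 177]
french = [176, 175, 179, 174, 178, 173, 177, 175]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(brits) - mean(french)  # the promising difference

# Under the null hypothesis (no real difference), the group labels are
# interchangeable: shuffle the pooled data many times and count how
# often a difference at least as large as the observed one arises.
pooled = brits + french
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:8]) - mean(pooled[8:]) >= observed:
        count += 1

p = count / trials
print(p)  # if p < 0.05, conclude the difference is probably not an accident
```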
Summing up this lecture
• Bayes’ Theorem (first encounter)
• Discrete random variable
• Bernoulli trial; series of Bernoulli trials
• Binomial distribution
• Probability of r successes out of n trials
• Continuous random variable
• Normal (= Gaussian) distribution
42