Learning about a coin

Observed sequence of coin flips:

H T H H T H H H T H
What is the probability that the next coin flip results in a tail?
- ∼ 0
- ∼ 0.3
- ∼ 0.38
- ∼ 0.5
- ∼ 0.76
- ∼ 1
30%?
Every flip is random. So every sequence of flips is random, i.e. it has some probability of being observed.
For the i-th coin flip we write

p_i(F_i = "T") = θ_i

To denote that the probability distribution depends on θ_i, we write

p_i(F_i = "T"|θ_i)

Note the i in the index!
p("HTHHTHHHTH"|θ_1, θ_2, . . . , θ_10)

All the randomness of a sequence of flips is governed (modeled) by the parameters θ_1, . . . , θ_10.

What about θ_11?

Find θ_i's such that p("HTHHTHHHTH"|θ_1, . . . , θ_10) is as high as possible.
Maximize the likelihood of our observation
(Maximum Likelihood)
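To see where this leads, here is a minimal Python sketch (the code and names are my own, not from the slides): with one θ_i per flip, each factor of the likelihood can be maximized independently, and the "optimal" parameters simply memorize the data.

```python
# Sketch: maximum likelihood with one theta_i per flip.
seq = "HTHHTHHHTH"

# p_i(F_i = f_i | theta_i) is theta_i for "T" and 1 - theta_i for "H",
# so each factor is maximized by theta_i = 1 ("T") or theta_i = 0 ("H").
thetas = [1.0 if f == "T" else 0.0 for f in seq]
print(thetas)   # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# The resulting likelihood is exactly 1 -- a perfect "fit" that tells us
# nothing at all about theta_11, i.e. about the next flip.
lik = 1.0
for f, t in zip(seq, thetas):
    lik *= t if f == "T" else 1 - t
print(lik)      # 1.0
```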
How do we evaluate p("HTHHTHHHTH"|θ_1, . . . , θ_10)? The flips are modeled as independent, so the probability factorizes flip by flip:

p("HTHHTHHHTH"|θ_1, . . . , θ_10) = p_1(F_1 = "H"|θ_1) · p_2(F_2 = "T"|θ_2) · p_3(F_3 = "H"|θ_3) · p_4(F_4 = "H"|θ_4) · p_5(F_5 = "T"|θ_5) · p_6(F_6 = "H"|θ_6) · p_7(F_7 = "H"|θ_7) · p_8(F_8 = "H"|θ_8) · p_9(F_9 = "T"|θ_9) · p_10(F_10 = "H"|θ_10)
Compact:

∏_{i=1}^{10} p_i(F_i = f_i|θ_i)
Notice the i in p_i, θ_i! It indicates that the coin flip at time 1 is (potentially) different from the one at time 2, and so on.
∏_{i=1}^{10} p_i(F_i = f_i|θ_i) = ∏_{i=1}^{10} p(F_i = f_i|θ)
Summary of the last steps: independent and identically distributed (i.i.d.).

Remember: p(F_i = f_i|θ) is θ if the i-th flip is "T", and 1 − θ if it is "H".

Read closely! There is no longer an i attached to θ. Probabilistically, all coin flips behave identically.
Remember θ_11?
Now we can write down the probability of our sequence as a function of θ:
∏_{i=1}^{10} p(F_i = f_i|θ) = (1 − θ) · θ · (1 − θ) · (1 − θ) · θ · (1 − θ) · (1 − θ) · (1 − θ) · θ · (1 − θ)

Compact (what property of multiplication is used here?):

∏_{i=1}^{10} p(F_i = f_i|θ) = θ^3 (1 − θ)^7
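As a quick numeric sanity check (the grid search is my own sketch, not part of the slides), we can evaluate θ^3 (1 − θ)^7 on a grid and locate its maximum:

```python
# Sketch: evaluate L(theta) = theta^3 * (1 - theta)^7 on a grid
# and find the grid point with the highest likelihood.
thetas = [i / 1000 for i in range(1001)]
lik = [t**3 * (1 - t)**7 for t in thetas]
best = thetas[lik.index(max(lik))]
print(best)   # 0.3
```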
p("HTHHTHHHTH"|θ) = θ^3 (1 − θ)^7

This is a function of θ. We want to find the maximum of this function.

The logarithm is a monotone function, so we can just as well maximize the log-likelihood, which is easier to differentiate.
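Spelled out, since the slide only hints at this step: take the log, differentiate, and set the derivative to zero.

```latex
% Log-likelihood of L(theta) = theta^3 (1 - theta)^7 and its maximizer
\log L(\theta) = 3\log\theta + 7\log(1-\theta)

\frac{d}{d\theta}\log L(\theta)
  = \frac{3}{\theta} - \frac{7}{1-\theta} \overset{!}{=} 0
\quad\Longrightarrow\quad 3(1-\theta) = 7\theta
\quad\Longrightarrow\quad \theta_{\mathrm{MLE}} = \frac{3}{10}
```

This also resolves the "30%?" teaser from the beginning.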
What is the Maximum Likelihood Estimate (MLE) for an arbitrary coin sequence?

θ_MLE = |T| / (|T| + |H|)

(|T| and |H| denote the number of tails and heads.)
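In code (a minimal sketch; `theta_mle` is a hypothetical helper name, not from the slides):

```python
# Sketch: MLE for the tail probability, theta_MLE = |T| / (|T| + |H|).
def theta_mle(seq: str) -> float:
    n_tails = seq.count("T")
    return n_tails / len(seq)

print(theta_mle("HTHHTHHHTH"))  # 0.3
print(theta_mle("HH"))          # 0.0 -- the "just for fun" sequence below
```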
Just for fun, a totally different sequence (same coin!): H H

Here θ_MLE = 0/2 = 0: after observing two heads, MLE declares a tail impossible.
θ is itself a random variable
For any x, I want to be able to express p(θ = x|D), where D is a random
variable that models the observed sequence at hand (D = Data).
Because we are talking about coin flips, we know that p(θ = x|D) only
makes sense for x ∈ [0, 1].
p(θ = x|D) = p(D|θ = x) · p(θ = x) / p(D)
p(D|θ = x): We know this one! It is the expression that we were working
with just before in MLE. Only now, θ is fixed to a certain value x!
p(θ = x): θ is a real value, so this should denote a density. But if we multiply by a very tiny δθ (on both sides of the equality sign!), we can read it as the probability that θ lies (in a small interval) around x, before any observations.
p(θ = x): The probability that θ is x, but before data is available.
How should we choose p(θ = x)? There are no observations yet, so there are no assumptions/constraints to guide us . . . except that it must be a proper density:
∫_X p(θ = x) dx = 1

(We prefer to write ∫ p(θ) dθ = 1.)
Beta(θ|a, b) = Γ(a + b) / (Γ(a)Γ(b)) · θ^{a−1} (1 − θ)^{b−1},  θ ∈ [0, 1]
Decode!
- Γ(n) = (n − 1)!, if n ∈ N
- We have seen this part before: θ^{a−1} (1 − θ)^{b−1}
θ^{a−1} (1 − θ)^{b−1}

In the MLE section we had the same functional form, but with the numbers of tails and heads in place of (a − 1) and (b − 1)!

Interpretation: a − 1 and b − 1 are the numbers of tails and heads we think we would see if we made a + b − 2 coin flips.
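A small sketch of this pseudo-count reading, assuming scipy is available; the choice a = b = 2 (one imagined tail, one imagined head) is mine:

```python
# Sketch: the Beta prior as pseudo-counts.
from scipy.stats import beta

a, b = 2, 2            # hypothetical: 1 imagined tail, 1 imagined head
prior = beta(a, b)
print(prior.pdf(0.5))  # 1.5 -- theta = 0.5 is the mode for a = b = 2
print(prior.mean())    # 0.5 -- a priori, tails and heads equally likely
```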
Beta(θ|a, b) = Γ(a + b) / (Γ(a)Γ(b)) · θ^{a−1} (1 − θ)^{b−1},  θ ∈ [0, 1]
It is a distribution, so:
∫_0^1 Beta(θ|a, b) dθ = 1
But this implies:
∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ = Γ(a)Γ(b) / Γ(a + b)
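A numeric check of this identity (a sketch assuming scipy; the values a = 4, b = 8 are arbitrary):

```python
# Sketch: verify integral_0^1 t^(a-1) (1-t)^(b-1) dt = Gamma(a)Gamma(b)/Gamma(a+b)
from scipy.integrate import quad
from scipy.special import gamma

a, b = 4, 8  # hypothetical values
lhs, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0, 1)
rhs = gamma(a) * gamma(b) / gamma(a + b)
print(lhs, rhs)  # both ~0.0007576
```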
How did we end up here?
Bayes’ Theorem
p(θ = x|D) = p(D|θ = x) · p(θ = x) / p(D)
Let’s put together the numerator!
p(D|θ = x) = x^{|T|} (1 − x)^{|H|}

p(θ = x|a, b) = Γ(a + b) / (Γ(a)Γ(b)) · x^{a−1} (1 − x)^{b−1}

So we get for the numerator:

Γ(a + b) / (Γ(a)Γ(b)) · x^{|T|+a−1} (1 − x)^{|H|+b−1}

(|T| and |H| denote the number of tails and heads.)
We are missing one term:
p(θ = x|D) = p(D|θ = x) · p(θ = x|a, b) / p(D)
p(D), that’s the probability of the observed data.
p(D) = ∫ p(D, θ) dθ = ∫ p(D|θ) p(θ|a, b) dθ
We have seen the expression in the integral already!
It looks like the numerator! Only, θ is not fixed to an x!
The integral over the numerator in our case:
p(D) = ∫_0^1 Γ(a + b) / (Γ(a)Γ(b)) · θ^{|T|+a−1} (1 − θ)^{|H|+b−1} dθ
We can solve this by reusing the "reverse engineered" Beta integral from before:

∫_0^1 θ^{|T|+a−1} (1 − θ)^{|H|+b−1} dθ = Γ(|T| + a)Γ(|H| + b) / Γ(|H| + |T| + a + b)
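Putting the prefactor and the solved integral together in code (a sketch; `evidence` is a hypothetical helper name, and the Beta(2, 2) prior is my choice):

```python
# Sketch: the evidence p(D) in closed form, assuming scipy for Gamma.
from scipy.special import gamma

def evidence(n_tails: int, n_heads: int, a: float, b: float) -> float:
    # prior prefactor times the solved Beta integral
    return (gamma(a + b) / (gamma(a) * gamma(b))
            * gamma(n_tails + a) * gamma(n_heads + b)
            / gamma(n_tails + n_heads + a + b))

print(evidence(3, 7, 2, 2))  # p(D) for our sequence under a Beta(2, 2) prior
```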
p(θ = x|D) = p(D|θ = x) · p(θ = x) / p(D)
Put all pieces together (on your own . . . )
p(θ = x|D) is itself a Beta distribution: Beta(θ| |T| + a, |H| + b)!

Use it to find the θ that is most probable given the observations!

Algebraically, do the same things as with MLE.
θ_MAP = (|T| + a − 1) / (|H| + |T| + a + b − 2)
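As a sketch (`theta_map` is a hypothetical name; the prior values are illustrative):

```python
# Sketch: MAP estimate -- same algebra as MLE, applied to the posterior.
def theta_map(n_tails: int, n_heads: int, a: float, b: float) -> float:
    return (n_tails + a - 1) / (n_tails + n_heads + a + b - 2)

print(theta_map(3, 7, 2, 2))  # 4/12 = 0.333... -- pulled toward the prior
print(theta_map(3, 7, 1, 1))  # 0.3 -- a flat Beta(1, 1) prior recovers MLE
```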
The probability that the next coin flip is a tail, given observations D:

p(F = "T"|D)
A flip depends on θ, but we don’t know what specific value for θ we
should use. So integrate over all possible values of θ!
p(F = "T"|D) = ∫_0^1 p(F = "T", θ|D) dθ
p(F = f|θ) = p(f|θ) = θ^f (1 − θ)^{1−f}

Convince yourself that this is true by plugging in the values for f (1 for "T", 0 for "H")!
p(f|D, a, b) = ∫_0^1 p(f, θ|D, a, b) dθ = ∫_0^1 p(f|θ) p(θ|D, a, b) dθ
Product Rule + conditional independence assumption!
If we know θ, then the result for f depends only on that knowledge; all other information is no longer important, i.e.:

p(f|θ, D, a, b) = p(f|θ)
Continue by plugging in all formulae we have . . .
∫_0^1 p(f|θ) p(θ|D, a, b) dθ

= ∫_0^1 θ^f (1 − θ)^{1−f} · Γ(|T| + a + |H| + b) / (Γ(|T| + a)Γ(|H| + b)) · θ^{|T|+a−1} (1 − θ)^{|H|+b−1} dθ

= Γ(|T| + a + |H| + b) / (Γ(|T| + a)Γ(|H| + b)) · ∫_0^1 θ^{f+|T|+a−1} (1 − θ)^{|H|+b−f} dθ

= Γ(|T| + a + |H| + b) / (Γ(|T| + a)Γ(|H| + b)) · Γ(f + |T| + a)Γ(|H| + b − f + 1) / Γ(|T| + a + |H| + b + 1)
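For f = 1 (tail), this whole expression collapses, via Γ(x + 1) = x · Γ(x), to (|T| + a) / (|T| + |H| + a + b). A sketch that evaluates both forms (the helper name and prior values are my choices, assuming scipy):

```python
# Sketch: posterior predictive p(F = "T"|D) from the Gamma expression above.
from scipy.special import gamma

def p_tail(n_tails: int, n_heads: int, a: float, b: float) -> float:
    f = 1  # f = 1 encodes "T"
    return (gamma(n_tails + a + n_heads + b)
            / (gamma(n_tails + a) * gamma(n_heads + b))
            * gamma(f + n_tails + a) * gamma(n_heads + b - f + 1)
            / gamma(n_tails + a + n_heads + b + 1))

print(p_tail(3, 7, 2, 2))         # 5/14 ~ 0.357
print((3 + 2) / (3 + 7 + 2 + 2))  # simplified form, same value
print(p_tail(3, 7, 3, 3))         # 6/16 = 0.375
```

With a = b = 3 this gives 6/16 = 0.375, which may be where the ∼ 0.38 option in the opening poll comes from.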
How many flips?
For MLE we had
θ_MLE = |T| / (|T| + |H|)
Clearly, we get the same result for |T| = 1, |H| = 4 and for |T| = 10, |H| = 40. Which one is better? Why?
Hoeffding's inequality gives a sample complexity bound:

p(|θ_MLE − θ| ≥ ε) ≤ 2e^{−2Nε^2},  where N = |T| + |H|
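A sketch of using the bound to choose N (the function name and the ε, δ values are illustrative):

```python
# Sketch: smallest N with 2 * exp(-2 * N * eps**2) <= delta.
import math

def n_flips_needed(eps: float, delta: float) -> int:
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(n_flips_needed(0.1, 0.05))  # 185 flips for error < 0.1 w.p. >= 95%
```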
Note!
The idea here was to introduce some basic concepts informally, while staying within acceptable limits (mathematically). If you think I violated some of these too much, let me know. Also, please point out plain mistakes.