Learning about a coin

Observed sequence: H T H H T H H H T H

What is the probability that the next coin flip results in a tail? Options: $\sim 0$, $\sim 0.3$, $\sim 0.38$, $\sim 0.5$, $\sim 0.76$, $\sim 1$.

30%? Every flip is random, so every sequence of flips is random, i.e. it has some probability of being observed. For the $i$-th coin flip we write
\[ p_i(F_i = \text{"T"}) = \theta_i \]
and, to denote that the probability distribution depends on $\theta_i$, we write
\[ p_i(F_i = \text{"T"} \mid \theta_i). \]
Note the $i$ in the index!

How do we handle $p(\text{"HTHHTHHHTH"} \mid \theta_1, \theta_2, \ldots, \theta_{10})$? All the randomness of a sequence of flips is governed (modeled) by the parameters $\theta_1, \ldots, \theta_{10}$. What about $\theta_{11}$? Find $\theta_i$'s such that $p(\text{"HTHHTHHHTH"} \mid \theta_1, \ldots, \theta_{10})$ is as high as possible, i.e. maximize the likelihood of our observation (Maximum Likelihood).

Assuming independent flips, the sequence probability factorizes:
\[ p(\text{"HTHHTHHHTH"} \mid \theta_1, \ldots, \theta_{10}) = p_1(F_1 = \text{"H"} \mid \theta_1) \cdot p_2(F_2 = \text{"T"} \mid \theta_2) \cdot p_3(F_3 = \text{"H"} \mid \theta_3) \cdots p_{10}(F_{10} = \text{"H"} \mid \theta_{10}) \]
Compactly: $\prod_{i=1}^{10} p_i(F_i = f_i \mid \theta_i)$. Notice the $i$ in $p_i$, $\theta_i$! This indicates that the coin flip at time 1 is different from the one at time 2, and so on.

Now assume
\[ \prod_{i=1}^{10} p_i(F_i = f_i \mid \theta_i) = \prod_{i=1}^{10} p(F_i = f_i \mid \theta). \]
Summary of the last steps: the flips are independent and identically distributed (i.i.d.). Remember: $p(F_i \mid \theta)$ is $\theta$ if the $i$-th flip is "T", and $1 - \theta$ if it is "H". Read closely! There is no more $i$ associated with $\theta$: probabilistically, all coin flips behave equally. Remember $\theta_{11}$? Now we can write down the probability of our sequence with respect to $\theta$:
\[ \prod_{i=1}^{10} p(F_i = f_i \mid \theta) = (1-\theta)\,\theta\,(1-\theta)\,(1-\theta)\,\theta\,(1-\theta)\,(1-\theta)\,(1-\theta)\,\theta\,(1-\theta) \]
Compactly (which property of multiplication is used here?):
\[ \prod_{i=1}^{10} p(F_i = f_i \mid \theta) = \theta^3 (1-\theta)^7 \]

$p(\text{"HTHHTHHHTH"} \mid \theta) = \theta^3 (1-\theta)^7$ is a function with $\theta$ as the variable, and we want to find its maxima. Since the logarithm is a monotone function, we may equivalently maximize the log-likelihood.

Maximum Likelihood Estimation (MLE) for any coin sequence:
\[ \theta_{\text{MLE}} = \frac{|T|}{|T| + |H|}, \]
where $|T|$ and $|H|$ denote the number of tails and heads.

Just for fun, a totally different sequence (same coin!): H H. Here MLE yields $\theta_{\text{MLE}} = 0$, which motivates treating $\theta$ itself as a random variable.

For any $x$ we want to be able to express $p(\theta = x \mid D)$, where $D$ is a random variable that models the observed sequence at hand ($D$ = data). Because we are talking about coin flips, $p(\theta = x \mid D)$ only makes sense for $x \in [0, 1]$.

By Bayes' theorem,
\[ p(\theta = x \mid D) = \frac{p(D \mid \theta = x) \cdot p(\theta = x)}{p(D)}. \]
$p(D \mid \theta = x)$: we know this one! It is the expression we were working with just before in MLE; only now $\theta$ is fixed to a certain value $x$. $p(\theta = x)$: $\theta$ is a real value, so this should denote a density. But if we think of a very tiny $\delta\theta$ (on both sides of the equality sign!), then it is the probability that $\theta$ is (around) $x$, but without any observations.

$p(\theta = x)$: the probability that $\theta$ is $x$ before data is available. How should we choose $p(\theta = x)$? There are no observations, so there are no assumptions/constraints/...; the only requirement is
\[ \int_X p(\theta = x)\,dx = 1 \]
(we prefer to write $\int p(\theta)\,d\theta = 1$).

A natural choice is the Beta distribution:
\[ \mathrm{Beta}(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}, \qquad \theta \in [0, 1] \]
Decode!
- $\Gamma(n) = (n-1)!$, if $n \in \mathbb{N}$
- We have seen this part before: $\theta^{a-1}(1-\theta)^{b-1}$

In the MLE section we had a similar functional form, but with $(a-1)$ and $(b-1)$ in the roles of the number of tails and heads respectively. Interpretation: $a-1$ and $b-1$ are the numbers of tails and heads we think we would see if we made $a + b - 2$ coin flips.
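To make this concrete, here is a minimal Python sketch (mine, not part of the original slides) that recomputes $\theta_{\text{MLE}}$ for the observed sequence and evaluates the Beta density exactly as defined above; all variable names are my own.

```python
from math import gamma

# Observed sequence from the slides: H T H H T H H H T H
flips = "HTHHTHHHTH"
n_tails = flips.count("T")  # |T| = 3
n_heads = flips.count("H")  # |H| = 7

# Likelihood of the i.i.d. sequence: p(D | theta) = theta^|T| * (1 - theta)^|H|
def likelihood(theta):
    return theta**n_tails * (1 - theta)**n_heads

# Closed-form maximum likelihood estimate: |T| / (|T| + |H|)
theta_mle = n_tails / (n_tails + n_heads)
print(theta_mle)  # 0.3

# Beta prior density, straight from the definition above
def beta_pdf(theta, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

# Sanity check: Beta(theta | 1, 1) is the uniform density on [0, 1]
print(beta_pdf(0.3, 1, 1))  # 1.0
```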
Since the Beta is a distribution,
\[ \int_0^1 \mathrm{Beta}(\theta \mid a, b)\,d\theta = 1, \]
which implies
\[ \int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}. \]

How did we end up here? Bayes' theorem:
\[ p(\theta = x \mid D) = \frac{p(D \mid \theta = x) \cdot p(\theta = x)}{p(D)} \]
Let's put together the numerator. With
\[ p(D \mid \theta = x) = x^{|T|}(1-x)^{|H|}, \qquad p(\theta = x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,x^{a-1}(1-x)^{b-1}, \]
we get for the numerator
\[ \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,x^{|T|+a-1}(1-x)^{|H|+b-1}, \]
where $|T|$ and $|H|$ again denote the number of tails and heads.

We are missing one term:
\[ p(\theta = x \mid D) = \frac{p(D \mid \theta = x) \cdot p(\theta = x \mid a, b)}{p(D)} \]
$p(D)$ is the probability of the observed data:
\[ p(D) = \int p(D, \theta)\,d\theta = \int p(D \mid \theta)\,p(\theta \mid a, b)\,d\theta \]
We have seen the expression inside the integral already: it looks like the numerator, only $\theta$ is not fixed to an $x$!

The integral over the numerator in our case:
\[ p(D) = \int_0^1 \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\theta^{|T|+a-1}(1-\theta)^{|H|+b-1}\,d\theta \]
We can solve that by remembering the reverse engineering from above:
\[ \int_0^1 \theta^{|T|+a-1}(1-\theta)^{|H|+b-1}\,d\theta = \frac{\Gamma(|T|+a)\,\Gamma(|H|+b)}{\Gamma(|H|+|T|+a+b)} \]

Put all the pieces together (on your own ...).

$p(\theta = x \mid D)$ is itself a Beta distribution! Use it to find the $\theta$ that is most probable given the observation. Algebraically, do the same things as with MLE:
\[ \theta_{\text{MAP}} = \frac{|T| + a - 1}{|H| + |T| + a + b - 2} \]

The probability that the next coin flip is a tail, given observations $D$, is $p(F = \text{"T"} \mid D)$. A flip depends on $\theta$, but we don't know what specific value for $\theta$ we should use, so integrate over all possible values of $\theta$:
\[ p(F = \text{"T"} \mid D) = \int_0^1 p(F = \text{"T"}, \theta \mid D)\,d\theta \]

With $f \in \{0, 1\}$ encoding the outcome,
\[ p(F = f \mid \theta) = p(f \mid \theta) = \theta^f (1-\theta)^{1-f}. \]
Convince yourself that this is true by plugging in the values 1 or 0 for $f$! Then
\[ p(f \mid D, a, b) = \int_0^1 p(f, \theta \mid D, a, b)\,d\theta = \int_0^1 p(f \mid \theta)\,p(\theta \mid D, a, b)\,d\theta \]
by the product rule plus a conditional independence assumption: if we know $\theta$, then the result for $f$ depends only on that knowledge; all other information is no longer important, i.e.
\[ p(f \mid \theta, D, a, b) = p(f \mid \theta). \]

Continue by plugging in all the formulae we have:
\begin{align*}
\int_0^1 p(f \mid \theta)\,p(\theta \mid D, a, b)\,d\theta
&= \int_0^1 \theta^f (1-\theta)^{1-f}\,\frac{\Gamma(|T|+a+|H|+b)}{\Gamma(|T|+a)\,\Gamma(|H|+b)}\,\theta^{|T|+a-1}(1-\theta)^{|H|+b-1}\,d\theta \\
&= \frac{\Gamma(|T|+a+|H|+b)}{\Gamma(|T|+a)\,\Gamma(|H|+b)} \int_0^1 \theta^{f+|T|+a-1}(1-\theta)^{|H|+b-f}\,d\theta \\
&= \frac{\Gamma(|T|+a+|H|+b)}{\Gamma(|T|+a)\,\Gamma(|H|+b)} \cdot \frac{\Gamma(f+|T|+a)\,\Gamma(|H|+b-f+1)}{\Gamma(|T|+a+|H|+b+1)}
\end{align*}
(A small numerical sketch of these quantities follows after the closing note below.)

How many flips? For MLE we had
\[ \theta_{\text{MLE}} = \frac{|T|}{|T|+|H|}. \]
Clearly, we get the same result for $|T| = 1, |H| = 4$ and for $|T| = 10, |H| = 40$. Which one is better? Why? Hoeffding's inequality gives a sampling complexity bound:
\[ p(|\theta_{\text{MLE}} - \theta| \geq \varepsilon) \leq 2e^{-2N\varepsilon^2}, \qquad N = |T| + |H|. \]

Note! The idea here was to introduce some basic concepts informally, while staying within acceptable limits (mathematically). If you think I violated some of these too much, let me know. Also, please point out plain mistakes.
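As the promised closing illustration, here is a short Python sketch (mine, not the author's) that puts the MAP estimate, the posterior predictive, and the Hoeffding bound side by side. The prior hyperparameters $a = b = 2$ are an arbitrary choice for the example; the slides leave them open.

```python
from math import exp

# Counts from the sequence H T H H T H H H T H
n_tails, n_heads = 3, 7
N = n_tails + n_heads

# Hypothetical prior hyperparameters (not fixed anywhere in the slides)
a, b = 2.0, 2.0

# MAP estimate: mode of the Beta(theta | |T| + a, |H| + b) posterior
theta_map = (n_tails + a - 1) / (n_heads + n_tails + a + b - 2)

# Posterior predictive p(F = "T" | D, a, b): for f = 1 the Gamma-ratio
# expression above telescopes to the posterior mean (|T| + a) / (|T| + a + |H| + b)
p_next_tail = (n_tails + a) / (n_tails + a + n_heads + b)

# Hoeffding bound on p(|theta_MLE - theta| >= eps) for this N
eps = 0.1
bound = 2 * exp(-2 * N * eps**2)

print(theta_map)    # 0.333...
print(p_next_tail)  # 0.357...
print(bound)        # 1.637... (vacuous for N = 10; it shrinks with more flips)
```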