COMMON DERIVATIONS OF MLE
NIR AILON
1. MLE for Biased Coin Flip
Let X1, . . . , Xm ∈ {0, 1} be the outcomes of iid random coin flips with unknown probability Θ of 1. For any Θ̂ ∈ [0, 1], the log-likelihood of the observation is
\[
L(X_1,\dots,X_m;\hat\Theta) = \sum_{i=1}^m \Big( X_i \log\hat\Theta + (1-X_i)\log(1-\hat\Theta) \Big) = n_1 \log\hat\Theta + n_0 \log(1-\hat\Theta) ,
\]
where ni = #{j : Xj = i} for i = 0, 1. Differentiating with respect to Θ̂:
\[
\frac{dL(X_1,\dots,X_m;\hat\Theta)}{d\hat\Theta} = \frac{n_1}{\hat\Theta} - \frac{n_0}{1-\hat\Theta} .
\]
The derivative vanishes when n1 − n1 Θ̂ = n0 Θ̂, equivalently when Θ̂ = n1 /(n0 + n1 ), in other words when Θ̂ is the empirical frequency of the outcome 1. It remains as an exercise to show that this value of Θ̂ is indeed the maximizer of the log-likelihood, i.e. the MLE.
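The closed form Θ̂ = n1 /(n0 + n1 ) can be sanity-checked numerically. The sketch below (plain Python; the data and function names are illustrative, not from the text) compares the closed-form estimate against a grid search over the log-likelihood.

```python
import math

def bernoulli_log_likelihood(xs, theta):
    """Log-likelihood of iid Bernoulli(theta) observations xs in {0, 1}."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    return n1 * math.log(theta) + n0 * math.log(1 - theta)

def bernoulli_mle(xs):
    """Closed-form MLE: the empirical frequency n1 / m."""
    return sum(xs) / len(xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # n1 = 7, m = 10
theta_hat = bernoulli_mle(xs)

# A grid search over (0, 1) should land on the same maximizer.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(xs, t))
```

Both the closed form and the grid search return 0.7 here, consistent with n1 = 7 out of m = 10.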
2. MLE for 1d Gaussian
The density function of a one dimensional Gaussian with expectation µ and variance σ² is
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2} .
\]
Given m iid observations X1, . . . , Xm, the corresponding log-likelihood function is
\[
L(X_1,\dots,X_m;\hat\mu,\hat\sigma^2) = -\sum_{i=1}^m \frac{(X_i-\hat\mu)^2}{2\hat\sigma^2} - m\log\hat\sigma - \frac{m}{2}\log 2\pi .
\]
Differentiating with respect to µ̂:
\[
\frac{\partial L(X_1,\dots,X_m;\hat\mu,\hat\sigma^2)}{\partial\hat\mu} = \sum_{i=1}^m \frac{X_i-\hat\mu}{\hat\sigma^2} .
\]
Setting ∂L/∂µ̂ = 0 implies µ̂ = \sum_{i=1}^m X_i / m. Differentiating with respect to σ̂:
\[
\frac{\partial L(X_1,\dots,X_m;\hat\mu,\hat\sigma^2)}{\partial\hat\sigma} = \frac{\sum_{i=1}^m (X_i-\hat\mu)^2}{\hat\sigma^3} - \frac{m}{\hat\sigma} .
\]
Setting ∂L/∂σ̂ = 0 implies σ̂² = \sum_{i=1}^m (X_i-\hat\mu)^2 / m.
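As a quick numerical check (a plain-Python sketch with synthetic data, not part of the original text), the closed-form µ̂ and σ̂² land near the true parameters, and perturbing either estimate can only lower the log-likelihood, as expected of a maximizer.

```python
import math
import random

def gaussian_mle(xs):
    """Closed-form MLE for a 1d Gaussian: sample mean and biased sample variance."""
    m = len(xs)
    mu_hat = sum(xs) / m
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / m   # divides by m, not m - 1
    return mu_hat, sigma2_hat

def log_likelihood(xs, mu, sigma2):
    """The 1d Gaussian log-likelihood from the derivation above."""
    m = len(xs)
    return (-sum((x - mu) ** 2 for x in xs) / (2 * sigma2)
            - 0.5 * m * math.log(sigma2)
            - 0.5 * m * math.log(2 * math.pi))

random.seed(0)
xs = [random.gauss(3.0, 2.0) for _ in range(10_000)]   # true mu = 3, sigma^2 = 4
mu_hat, sigma2_hat = gaussian_mle(xs)
best = log_likelihood(xs, mu_hat, sigma2_hat)
```

Evaluating `log_likelihood` at `mu_hat + 0.1` or at `1.2 * sigma2_hat` gives strictly smaller values than `best`.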
3. MLE for multivariate Gaussian
A Gaussian r.v. in Rn has density function, for x ∈ Rn,
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\Big( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \Big) ,
\]
where Σ is the covariance matrix, |Σ| is its determinant and µ ∈ Rn is the expectation. Note that Σ is positive semi-definite, which means that z^T Σ z ≥ 0 for all z ∈ Rn. Also recall from standard probability that the covariance Σ(i, j) of the i'th and j'th coordinates is given as
\[
\Sigma(i,j) = E[x_i x_j] - E[x_i]E[x_j] .
\]
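The identity Σ(i, j) = E[xᵢxⱼ] − E[xᵢ]E[xⱼ] can be illustrated on synthetic data (a plain-Python sketch; the 2d distribution below is a made-up example): the empirical version of the right-hand side recovers the covariance 0.5 built into the data by construction.

```python
import random

random.seed(1)

def draw_pair():
    # x1 = 0.5 * x0 + independent noise, so Cov(x0, x1) = 0.5 by construction.
    g = random.gauss(0, 1)
    return (g, 0.5 * g + random.gauss(0, 1))

pairs = [draw_pair() for _ in range(50_000)]
x0 = [p[0] for p in pairs]
x1 = [p[1] for p in pairs]

def mean(vals):
    return sum(vals) / len(vals)

# Empirical version of Sigma(i, j) = E[x_i x_j] - E[x_i] E[x_j].
cov01 = mean([a * b for a, b in pairs]) - mean(x0) * mean(x1)
```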
Now assume that we observe m iid draws X1, . . . , Xm ∈ Rn from the same multivariate Gaussian. The log-likelihood function is given as
\[
L(X_1,\dots,X_m;\hat\mu,\hat\Sigma) = C - \frac{m}{2}\log|\hat\Sigma| - \frac{1}{2}\sum_{i=1}^m (X_i-\hat\mu)^T \hat\Sigma^{-1} (X_i-\hat\mu) ,
\]
where C depends on n, m only. Now we apply a “trace trick”: a scalar equals its own trace, and the trace is invariant under cyclic permutations, so
\[
L(X_1,\dots,X_m;\hat\mu,\hat\Sigma) = C - \frac{m}{2}\log|\hat\Sigma| - \frac{1}{2}\sum_{i=1}^m \operatorname{Trace}\Big( \hat\Sigma^{-1} (X_i-\hat\mu)(X_i-\hat\mu)^T \Big)
= C - \frac{m}{2}\log|\hat\Sigma| - \frac{1}{2}\operatorname{Trace}\Big( \hat\Sigma^{-1} \sum_{i=1}^m (X_i-\hat\mu)(X_i-\hat\mu)^T \Big) .
\]
Now we use the facts (without proof) that ∇_A log|A| = (A^{-1})^T and ∇_A Trace(AB) = B^T to obtain that the optimal Σ̂ is attained at
\[
\hat\Sigma = \frac{1}{m} \sum_{i=1}^m (X_i-\hat\mu)(X_i-\hat\mu)^T .
\]
Technion
E-mail address: [email protected]