
Lecture Notes 5
Likelihood
Statistical Models
We focus for now on parametric models. A parametric model is a set of distributions
$$\mathcal{P} = \{ f(x; \theta) : \theta \in \Theta \} \quad \text{where } \Theta \subset \mathbb{R}^d.$$
The model comes from assumptions about the data. Certain conventions often help; for
instance,
• Time until something fails is usually modeled by an exponential distribution.
• Number of rare events is usually modeled by a Poisson distribution.
• Lengths and weights are usually modeled by a normal distribution.
Likelihood
Definition Let $(X_1, \ldots, X_n)$ have joint density $f(x^n; \theta) = f(x_1, \ldots, x_n; \theta)$. The likelihood
function $L : \Theta \to [0, \infty)$ is defined by
$$L(\theta) \equiv L(\theta; x^n) = f(x^n; \theta)$$
where $x^n$ is fixed and $\theta$ varies in $\Theta$.
• The likelihood function is a function of θ.
• The likelihood function is not a probability density function.
• If the data are iid then the likelihood becomes
$$L(\theta) = \prod_{i=1}^n f(x_i; \theta).$$
• The likelihood is only defined up to a constant of proportionality.
• The likelihood function is used (i) to generate estimators (the maximum likelihood
estimator) and (ii) as a key ingredient in Bayesian inference.
Example Let $X_1, \ldots, X_n$ be iid Bernoulli random variables with parameter $\theta$. Then for
$\theta \in [0, 1]$,
$$L(\theta) = \theta^S (1 - \theta)^{n - S}$$
where $S = \sum_{i=1}^n X_i$.
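Since the likelihood is a function of $\theta$ with the data held fixed, it can simply be evaluated on a grid. Below is a minimal sketch in Python (assuming `numpy` and `matplotlib` are available; the simulated sample and the true parameter 0.3 are illustrative, not part of the notes) that plots this Bernoulli likelihood.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=20)   # hypothetical iid Bernoulli(0.3) sample
S, n = x.sum(), x.size

theta = np.linspace(0.001, 0.999, 500)
L = theta**S * (1 - theta)**(n - S)  # L(theta) = theta^S (1 - theta)^(n - S)

plt.plot(theta, L)
plt.xlabel("theta")
plt.ylabel("L(theta)")
plt.title("Bernoulli likelihood for fixed data")
plt.show()
```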
Example 1 $X_1, \ldots, X_n \sim N(\mu, 1)$. Then,
$$L(\mu; x^n) = \left( \frac{1}{2\pi} \right)^{n/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \right\} \propto \exp\left\{ -\frac{n(\bar{x} - \mu)^2}{2} \right\}.$$
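As a quick numerical check of the proportionality claim (a sketch only, with simulated data and `numpy` assumed), the exact log-likelihood and the log of the proportional form should differ by a quantity that does not depend on $\mu$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # hypothetical N(2, 1) sample
n, xbar = x.size, x.mean()

def loglik(mu):
    # exact log-likelihood of N(mu, 1)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

def log_prop(mu):
    # log of the proportional form exp{-n(xbar - mu)^2 / 2}
    return -0.5 * n * (xbar - mu) ** 2

mus = np.linspace(0.0, 4.0, 9)
diffs = np.array([loglik(m) - log_prop(m) for m in mus])
print(np.allclose(diffs, diffs[0]))   # True: the difference is constant in mu
```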
Exercise. These two samples have the same likelihood function, up to constants:
$$(X_1, X_2, X_3) \sim \mathrm{Multinomial}(n = 6,\ \theta, \theta, 1 - 2\theta)$$
$$X = (1, 3, 2) \implies L(\theta) = \frac{6!}{1!\,3!\,2!}\, \theta^1 \theta^3 (1 - 2\theta)^2 \propto \theta^4 (1 - 2\theta)^2$$
$$X = (2, 2, 2) \implies L(\theta) = \frac{6!}{2!\,2!\,2!}\, \theta^2 \theta^2 (1 - 2\theta)^2 \propto \theta^4 (1 - 2\theta)^2$$
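One way to convince yourself is a small numerical check (a sketch using `numpy`; the grid of $\theta$ values is arbitrary): the ratio of the two likelihoods should be constant in $\theta$.

```python
import numpy as np
from math import factorial

def multinomial_lik(counts, theta):
    # Multinomial(n = 6; theta, theta, 1 - 2*theta) likelihood for observed counts
    n = sum(counts)
    coef = factorial(n) / np.prod([factorial(c) for c in counts])
    probs = np.array([theta, theta, 1 - 2 * theta])
    return coef * np.prod(probs ** np.array(counts))

thetas = np.linspace(0.05, 0.45, 9)
ratio = [multinomial_lik((1, 3, 2), t) / multinomial_lik((2, 2, 2), t) for t in thetas]
print(np.allclose(ratio, ratio[0]))   # True: the ratio does not depend on theta
```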
(Parametric) Point Estimation
$X_1, \ldots, X_n \sim F(x \mid \theta)$. We want to estimate $\theta = (\theta_1, \ldots, \theta_k)$.
Definition 2 A point estimator
$$\hat{\theta} = \hat{\theta}_n = w(X_1, \ldots, X_n)$$
is a function of the sample; that is, any statistic is a point estimator.
Methods: (1) Method of Moments, (2) Maximum Likelihood (MLE), (3) Bayesian estimators.
Evaluating Estimators: Bias and Variance, Mean squared error (MSE) and Minimax Theory,
Large sample theory (later).
Some Terminology
• Bias: $E_\theta(\hat{\theta}) - \theta$
• the distribution of $\hat{\theta}_n$ is called its sampling distribution
• the standard deviation of $\hat{\theta}_n$ is called the standard error, denoted by $\mathrm{se}(\hat{\theta}_n)$
• $\hat{\theta}_n$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$
• later we will see that if $\mathrm{bias} \to 0$ and $\mathrm{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$ then $\hat{\theta}_n$ is consistent
• an estimator is robust if it is not strongly affected by perturbations in the data (more
later)
2 Maximum Likelihood
Definition 3 For each sample point $x^n$, let $\hat{\theta}(x^n)$ be a parameter value at which
$L(\theta)$ attains its maximum as a function of $\theta$, with $x^n$ held fixed. A maximum likelihood
estimator (MLE) of the parameter $\theta$ based on a sample $X^n$ is $\hat{\theta}(X^n)$. Equivalently, $\hat{\theta}$
maximizes $\ell(\theta)$.
Let $\hat{\theta}$ maximize
$$L(\theta) = f(X_1, \ldots, X_n; \theta).$$
This is the same as maximizing
$$\ell(\theta) = \log L(\theta).$$
Often it suffices to solve
$$\frac{\partial \ell(\theta)}{\partial \theta_j} = 0, \qquad j = 1, 2, \ldots.$$
Example 4 Bernoulli. $L(p) = \prod_i p^{X_i} (1 - p)^{1 - X_i} = p^S (1 - p)^{n - S}$ where $S = \sum_i X_i$. So
$$\ell(p) = S \log p + (n - S) \log(1 - p),$$
and setting $\ell'(p) = S/p - (n - S)/(1 - p) = 0$ gives $\hat{p} = S/n = \bar{X}$.
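As a check on this closed form, the following sketch (with simulated data; `numpy` and `scipy` assumed) maximizes the Bernoulli log-likelihood numerically and compares the maximizer with $\bar{X}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.6, size=100)   # hypothetical Bernoulli(0.6) sample
S, n = x.sum(), x.size

def neg_loglik(p):
    # negative Bernoulli log-likelihood: -(S log p + (n - S) log(1 - p))
    return -(S * np.log(p) + (n - S) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # the numerical maximizer agrees with the sample mean
```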
Example 5 Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Then,
$$L_n(\theta) = \left( \frac{1}{2\pi} \right)^{n/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^n (X_i - \theta)^2 \right\}$$
$$\ell_n(\theta) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \sum_{i=1}^n (X_i - \theta)^2.$$
Then
$$S(\theta) = \frac{\partial}{\partial \theta} \ell(\theta) = \sum_{i=1}^n (X_i - \theta).$$
Now we obtain $\hat{\theta}_n$ by setting $S(\theta) = 0$:
$$S(\theta) = \frac{\partial}{\partial \theta} \ell(\theta) = \sum_{i=1}^n (X_i - \theta) = 0,$$
which has the unique solution
$$\hat{\theta}_n = \bar{X}_n.$$
Since $\ell_n$ is strictly concave in $\theta$, this stationary point is the global maximum, so $\hat{\theta}_n = \bar{X}_n$ is also the MLE. Clearly $\hat{\theta}_n \xrightarrow{P} \theta$ by the Weak Law of Large Numbers.
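The consistency claim can be illustrated by simulation. The sketch below (simulated data; `numpy` assumed; the true value $\theta = 1.5$ is arbitrary) computes the MLE $\bar{X}_n$ for increasing sample sizes and shows it concentrating around the true $\theta$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 1.5   # hypothetical true parameter, chosen for illustration

for n in [10, 100, 1000, 10000, 100000]:
    x = rng.normal(loc=theta_true, scale=1.0, size=n)
    print(n, x.mean())   # the MLE (sample mean) approaches theta_true as n grows
```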
Exercise 1: Try the example above again. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where we assume
$\mu$ is known. We now estimate $\sigma^2$ by solving $S(\sigma^2) = 0$ (you should write it out). Show that
the likelihood equation now has a unique solution $\hat{\sigma}^2$ which is both a local maximum and
the MLE.
Example 6 (CB Example 7.2.11) For $N(\theta, \sigma^2)$ we have
$$L(\theta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \theta)^2 \right\} \propto \frac{1}{(\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \theta)^2 \right\}$$
and
$$\ell(\theta, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \theta)^2.$$
The partial derivatives with respect to $\theta$ and $\sigma^2$ are
$$\frac{\partial \ell}{\partial \theta} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \theta) \quad \text{and} \quad \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (X_i - \theta)^2.$$
We set
$$\frac{\partial \ell}{\partial \theta} = 0, \qquad \frac{\partial \ell}{\partial \sigma^2} = 0.$$
Solving the first equation gives $\hat{\theta}$, which we then plug into the second equation to solve for $\hat{\sigma}^2$; we obtain
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2.$$
Exercise 2: Verify that the $\hat{\theta}$ and $\hat{\sigma}^2$ we thus obtain indeed achieve the global maximum, as
required by Definition 3 of the MLE.
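Exercise 2 asks for an analytic argument, but a numerical sanity check is easy. The sketch below (simulated data; `numpy` and `scipy` assumed; not a substitute for the proof) maximizes $\ell(\theta, \sigma^2)$ numerically and compares the result with the closed-form MLEs.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(loc=3.0, scale=2.0, size=200)   # hypothetical N(3, 4) sample
n = x.size

def neg_loglik(params):
    theta, log_sigma2 = params        # optimize log(sigma^2) to keep sigma^2 positive
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - theta) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
theta_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(theta_hat, x.mean())                        # close to the closed-form MLE of theta
print(sigma2_hat, np.mean((x - x.mean()) ** 2))   # close to the closed-form MLE of sigma^2
```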
Example 7 Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0, \theta)$. Then
$$L(\theta) = \frac{1}{\theta^n} I(\theta > X_{(n)}) \quad \text{where } X_{(n)} := \max(X_1, \ldots, X_n)$$
where $I(A)$ is the indicator function, which equals 1 when $A$ holds and 0 otherwise. You
should redraw the pictures. So $\hat{\theta}_n = X_{(n)} = \max(X_1, \ldots, X_n)$, which we have proved by
picture. In fact, we have $\hat{\theta}_n \xrightarrow{P} \theta$, which we will not show here (maybe later).
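In place of the picture, here is a small sketch (simulated data; `numpy` assumed) that evaluates $L(\theta)$ on a grid: the likelihood is zero for $\theta \le X_{(n)}$ and strictly decreasing above it, so it is maximized at $X_{(n)}$.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = 2.0
x = rng.uniform(0, theta_true, size=30)   # hypothetical Uniform(0, 2) sample
x_max, n = x.max(), x.size

def lik(theta):
    # L(theta) = theta^{-n} * I(theta > max X_i)
    return theta ** (-n) * (theta > x_max)

for t in np.linspace(0.5, 3.0, 11):
    print(round(t, 2), lik(t))   # zero below x_max, largest just above x_max
print("MLE:", x_max)
```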
3 MSE
The mean squared error (MSE) is
$$E_\theta(\hat{\theta} - \theta)^2 = \int \cdots \int \left( \hat{\theta}(x_1, \ldots, x_n) - \theta \right)^2 f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n.$$
The bias is
$$B = E_\theta(\hat{\theta}) - \theta$$
and the variance is
$$V = \mathrm{Var}_\theta(\hat{\theta}).$$
Theorem 8 We have
$$\mathrm{MSE} = B^2 + V.$$
Proof. Let $m = E_\theta(\hat{\theta})$. Then
$$\begin{aligned}
\mathrm{MSE} = E_\theta(\hat{\theta} - \theta)^2 &= E_\theta(\hat{\theta} - m + m - \theta)^2 \\
&= E_\theta(\hat{\theta} - m)^2 + (m - \theta)^2 + 2 E_\theta(\hat{\theta} - m)(m - \theta) \\
&= E_\theta(\hat{\theta} - m)^2 + (m - \theta)^2 = V + B^2,
\end{aligned}$$
where the cross term vanishes because $E_\theta(\hat{\theta} - m) = 0$.
An estimator is unbiased if the bias is 0. In that case, MSE = Variance; that is,
$$E_\theta(\hat{\theta} - \theta)^2 = \mathrm{Var}_\theta(\hat{\theta}).$$
There is often a tradeoff between bias and variance: low bias can imply high variance
and vice versa.
• We now show that if $\mathrm{bias} \to 0$ and $\mathrm{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent.
How do we prove that $\hat{\theta}_n$ is consistent? Show that, for all $\varepsilon > 0$,
$$P(|\hat{\theta}_n - \theta| \geq \varepsilon) \longrightarrow 0 \quad \text{as } n \to \infty.$$
Or, prove convergence in quadratic mean:
$$\mathrm{MSE}(\hat{\theta}_n) = E_\theta(\hat{\theta}_n - \theta)^2 = \mathrm{Bias}^2(\hat{\theta}_n) + \mathrm{Var}(\hat{\theta}_n) \longrightarrow 0 \quad \text{as } n \to \infty.$$
If $\mathrm{bias} \to 0$ and $\mathrm{var} \to 0$ then $\hat{\theta}_n \xrightarrow{qm} \theta$, which implies that $\hat{\theta}_n \xrightarrow{P} \theta$.
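The decomposition $\mathrm{MSE} = B^2 + V$ can also be checked by Monte Carlo. The sketch below (simulated data; `numpy` assumed; the sample size and true variance are arbitrary) estimates the bias, variance, and MSE of the MLE $\hat{\sigma}^2 = \frac{1}{n} \sum_i (X_i - \bar{X})^2$ over repeated samples.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2_true, reps = 20, 4.0, 50000

# repeatedly draw samples and compute the MLE of sigma^2 in each
estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=n)
    estimates[r] = np.mean((x - x.mean()) ** 2)

bias = estimates.mean() - sigma2_true
var = estimates.var()
mse = np.mean((estimates - sigma2_true) ** 2)
print(bias)                 # negative: this MLE of sigma^2 is biased downward
print(bias**2 + var, mse)   # equal (up to floating point): MSE = Bias^2 + Variance
```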
4 Method of Moments
Define
$$m_1 = \frac{1}{n} \sum_{i=1}^n X_i, \qquad \mu_1(\theta) = E(X_i)$$
$$m_2 = \frac{1}{n} \sum_{i=1}^n X_i^2, \qquad \mu_2(\theta) = E(X_i^2)$$
$$\vdots$$
$$m_k = \frac{1}{n} \sum_{i=1}^n X_i^k, \qquad \mu_k(\theta) = E(X_i^k).$$
Let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ solve:
$$m_j = \mu_j(\hat{\theta}), \qquad j = 1, \ldots, k.$$
Example 9 $N(\beta, \sigma^2)$ with $\theta = (\beta, \sigma^2)$. Then $\mu_1 = \beta$ and $\mu_2 = \sigma^2 + \beta^2$. Equate:
$$\frac{1}{n} \sum_{i=1}^n X_i = \hat{\beta}, \qquad \frac{1}{n} \sum_{i=1}^n X_i^2 = \hat{\sigma}^2 + \hat{\beta}^2$$
to get
$$\hat{\beta} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
Example 10 Suppose
$$X_1, \ldots, X_n \sim \mathrm{Binomial}(k, p)$$
where
$$P(X_i = x \mid k, p) = \binom{k}{x} p^x (1 - p)^{k - x}, \qquad x = 0, 1, \ldots, k,$$
and where both $k$ and $p$ are unknown. Recall that $E(X_i) = kp$ and $\mathrm{Var}(X_i) = kp(1 - p)$. This
somewhat unusual application of the binomial model has been used to estimate crime rates for
crimes that are known to have many unreported occurrences. For such a crime, both the true
reporting rate, $p$, and the total number of occurrences, $k$, are unknown. We get
$$kp = \bar{X}_n, \qquad \frac{1}{n} \sum_{i=1}^n X_i^2 = kp(1 - p) + k^2 p^2,$$
giving
$$\hat{p} = \frac{\bar{X}}{\hat{k}}, \qquad \hat{k} = \frac{\bar{X}^2}{\bar{X} - \frac{1}{n} \sum_i (X_i - \bar{X})^2}.$$
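A sketch of these estimators in code (simulated data; `numpy` assumed). Note that $\hat{k}$ can be unstable, or even negative, when the denominator $\bar{X} - \frac{1}{n} \sum_i (X_i - \bar{X})^2$ is small:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.binomial(n=12, p=0.4, size=1000)   # hypothetical Binomial(k=12, p=0.4) sample

xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)    # (1/n) * sum (X_i - Xbar)^2

k_hat = xbar**2 / (xbar - s2)    # method of moments estimate of k
p_hat = xbar / k_hat             # method of moments estimate of p
print(k_hat, p_hat)              # roughly 12 and 0.4 for this simulated sample
```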
Method of moments estimators are typically consistent. Consider one parameter. Recall
that $\mu(\hat{\theta}) = m$ where $m = n^{-1} \sum_{i=1}^n X_i$. Assume that $\mu^{-1}$ exists and is continuous. So
$\hat{\theta} = \mu^{-1}(m)$. By the WLLN, $m \xrightarrow{P} \mu(\theta)$. So, by the continuous mapping theorem,
$$\hat{\theta}_n = \mu^{-1}(m) \xrightarrow{P} \mu^{-1}(\mu(\theta)) = \theta.$$