Introduction to Maximum Likelihood
The King of Estimators

Bo Sjö
Department of Management and Engineering
Linköping University, Sweden

October 2014

The ML Estimator

Consider a random variable X̃_t with k moments in the vector θ = (θ_1, θ_2, ..., θ_k). I add a subscript to make it clear that X̃_t and its observations move with t (= time here).

We can now ask which estimates of the θ_i (the various moments) are the most likely, given the observations and the functional form of the density function. "Most likely" means, in this context, finding the values of θ that maximize the density function.

The sample from X̃_t has T independent observations: x_t = {x_1, x_2, ..., x_T}.

In this case, since we explore MLE, we know from the beginning that all observations come from the same random variable and have one joint density function. The joint density function is

f(x_t; θ) = f(x_1, x_2, ..., x_T; θ)

The Log Likelihood

Since the observations are independent, the joint density can be written as the product of the marginal densities:

f(x_1, x_2, ..., x_T; θ) = f(x_1; θ) f(x_2; θ) ... f(x_T; θ) = ∏_{t=1}^{T} f(x_t; θ)

We can also state that the observations on X̃_t are independently and identically distributed: x_t ∼ iid(θ).

iid means that each observation (and thereby also the estimated moments) is independent of every other observation: knowing any observation x_t gives no information about any other outcome. Identical distribution means just that: each observation is described by the same density function and the same moments.

We can restate the problem in terms of the likelihood function,

L(θ; x_t)

Next, form the log likelihood,

log L(θ; x_t) = l(θ; x_t)
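Because the joint density factors into a product, the log likelihood is simply a sum of log densities over the observations. A minimal sketch in Python (the density and the sample are placeholders; the normal density is used purely for illustration):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, x):
    """l(theta; x) = sum_t log f(x_t; theta) for an iid sample.

    Illustration only: f is taken here to be the normal density with
    theta = (mu, sigma2); any density with the iid property works the same way.
    """
    mu, sigma2 = theta
    return np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

x = np.array([0.5, -1.2, 0.3, 2.1, 0.0])   # a small artificial sample
print(log_likelihood((0.0, 1.0), x))
```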
To find the maximum of the function

Take the derivative of l(θ; x_t) with respect to each moment θ_i.
Set each derivative equal to zero => k equations.
Solve for the k unknown parameters (= the moments θ_i):

∂l(θ; x_t) / ∂θ_i = 0,   i = 1, ..., k

The solution gives k estimates θ̂_i. The vector of first derivatives just formed is called the score, or the efficient score, for the parameter vector θ.
Condition for a Maximum

The condition for a maximum is that the matrix of second derivatives (the Hessian) is a negative definite matrix:

∂²l(θ; x_t) / ∂θ_i ∂θ_j   negative definite

The negative of the expected value of the matrix of second derivatives is called the information matrix I(θ):

I(θ) = −E[ ∂²l(θ; x_t) / ∂θ ∂θ′ ]
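In practice the second-order condition can be checked numerically: evaluate the Hessian of the log likelihood at the candidate estimate and verify that all its eigenvalues are negative. A small sketch using simple central finite differences (the log_likelihood function and theta_hat in the commented usage are the illustrative names from the sketch above, not part of the slides):

```python
import numpy as np

def numerical_hessian(fun, theta, h=1e-5):
    """Central finite-difference Hessian of a scalar function fun(theta)."""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            tpp = theta.copy(); tpp[i] += h; tpp[j] += h
            tpm = theta.copy(); tpm[i] += h; tpm[j] -= h
            tmp = theta.copy(); tmp[i] -= h; tmp[j] += h
            tmm = theta.copy(); tmm[i] -= h; tmm[j] -= h
            H[i, j] = (fun(tpp) - fun(tpm) - fun(tmp) + fun(tmm)) / (4 * h * h)
    return H

# Hypothetical usage: at a maximum, all eigenvalues of the Hessian are negative.
# H = numerical_hessian(lambda th: log_likelihood(th, x), theta_hat)
# assert np.all(np.linalg.eigvalsh(H) < 0)
```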
The Information Matrix

The information matrix is important for demonstrating the properties of the ML estimator and the so-called classical tests: the Wald test, the likelihood ratio test, and the Lagrange multiplier test.

It can be shown that the inverse of the information matrix is the covariance matrix of the estimates θ̂:

[I(θ)]⁻¹ = var(θ̂)

The Normal Distribution

Let us assume a sample of T independent normal random variables {X̃_t}. We want to estimate the first two moments, thus θ = (µ, σ²). The likelihood is,

L(θ; x) = (2πσ²)^(−T/2) exp[ −(1/(2σ²)) ∑_{t=1}^{T} (x_t − µ)² ].   (1)

Taking logs of this expression yields,

l(θ; x) = −(T/2) log 2π − (T/2) log σ² − (1/(2σ²)) ∑_{t=1}^{T} (x_t − µ)².   (2)

The partial derivatives with respect to µ and σ² are,

∂l/∂µ = (1/σ²) ∑_{t=1}^{T} (x_t − µ),   (3)

and,

∂l/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ∑_{t=1}^{T} (x_t − µ)².   (4)
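Equations (2)-(4) translate directly into code. A minimal sketch (argument names are illustrative):

```python
import numpy as np

def normal_loglik(theta, x):
    """Equation (2): log likelihood of an iid normal sample."""
    mu, sigma2 = theta
    T = x.size
    return (-T / 2 * np.log(2 * np.pi)
            - T / 2 * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

def normal_score(theta, x):
    """Equations (3)-(4): partial derivatives w.r.t. mu and sigma2."""
    mu, sigma2 = theta
    T = x.size
    d_mu = np.sum(x - mu) / sigma2
    d_sigma2 = -T / (2 * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2 ** 2)
    return np.array([d_mu, d_sigma2])
```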
Set to Zero and Solve

If these equations are set to zero, the result is,

∑_{t=1}^{T} x_t − Tµ = 0,   ∑_{t=1}^{T} (x_t − µ)² − Tσ² = 0.   (5)

If this system is solved for µ and σ², we get the estimates of the mean and the variance as

µ̂_x = (1/T) ∑_{t=1}^{T} x_t,   (6)

σ̂²_x = (1/T) ∑_{t=1}^{T} (x_t − µ̂_x)² = (1/T) ∑_{t=1}^{T} x_t² − [ (1/T) ∑_{t=1}^{T} x_t ]².   (7)

Estimation

[I(θ)]⁻¹ = var(θ̂)

The information matrix is where the estimation programs break down.
Estimation is done by iteration.
You shouldn't change starting values or convergence criteria.
Observed data and density that do not match at all => no convergence.
Perfect multicollinearity => large standard errors and/or "no convergence".
Notice that usually you do not maximize the likelihood: you minimize the negative of the log likelihood instead ≡ minimize the residual sum of squares.
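The closed-form estimates (6)-(7) can be computed directly, and the same answer should come out of a numerical optimizer that minimizes the negative log likelihood by iteration, which is how estimation programs typically work. A sketch under those assumptions (the simulated sample and starting values are only for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)   # artificial data
T = x.size

# Closed-form ML estimates, equations (6) and (7)
mu_hat = np.sum(x) / T
sigma2_hat = np.sum((x - mu_hat) ** 2) / T

# Numerical route: minimize the negative log likelihood by iteration,
# starting from some reasonable starting values.
def neg_loglik(theta):
    mu, sigma2 = theta
    return (T / 2 * np.log(2 * np.pi) + T / 2 * np.log(sigma2)
            + np.sum((x - mu) ** 2) / (2 * sigma2))

res = minimize(neg_loglik, x0=np.array([0.0, 1.0]),
               method="L-BFGS-B", bounds=[(None, None), (1e-8, None)])

print(mu_hat, sigma2_hat)   # closed form
print(res.x)                # should agree closely with the closed form
```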
Is it a Maximum?

To answer that question we have to look at the sign of the Hessian of the log likelihood function, the second order conditions, evaluated at the estimated values of the parameters in θ,

D²l(θ; x) = [ ∂²l/∂µ∂µ     ∂²l/∂µ∂σ²  ]
            [ ∂²l/∂σ²∂µ    ∂²l/∂σ²∂σ² ]

          = [ −T/σ²                   −(1/σ⁴) ∑(x_t − µ)            ]
            [ −(1/σ⁴) ∑(x_t − µ)      T/(2σ⁴) − (1/σ⁶) ∑(x_t − µ)²  ].   (8)

If we substitute from the solutions of the estimates of µ and σ², we get,

E[D²l(θ; x)] = −[ T/σ̂²_x    0          ]
                [ 0          T/(2σ̂⁴_x) ]  =  −I(θ̂),   (9)

Since the variance σ̂²_x is always positive, we have a negative definite matrix, and a maximum value for the function at µ̂_x and σ̂²_x.
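Equation (9) is what makes [I(θ)]⁻¹ = var(θ̂) operational: inverting the estimated information matrix gives the variances, and hence standard errors, of the estimates. A small sketch for the normal model, with placeholder values for T and σ̂²:

```python
import numpy as np

def normal_info_matrix(sigma2_hat, T):
    """I(theta_hat) for the iid normal model, taken from equation (9)."""
    return np.array([[T / sigma2_hat, 0.0],
                     [0.0, T / (2 * sigma2_hat ** 2)]])

T, sigma2_hat = 500, 4.1                  # placeholder values for illustration
I_hat = normal_info_matrix(sigma2_hat, T)
var_theta = np.linalg.inv(I_hat)          # [I(theta)]^{-1} = var(theta_hat)
std_err = np.sqrt(np.diag(var_theta))     # standard errors of mu_hat, sigma2_hat
print(std_err)
```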
Are the Estimates Unbiased and Efficient?

Look at the expected values of the estimated mean and variance. Replace the observations, in the solutions for µ and σ²_x, by the random variable X̃ and take expectations.

E(µ̂_x) = (1/T) ∑_{t=1}^{T} E(X̃_t) = (1/T) ∑_{t=1}^{T} µ = µ,   (10)

which proves that µ̂_x is an unbiased estimator of the mean. The calculations for the variance are more complex, but the idea is the same. The expected variance is,

E[σ̂²_x] = E[ (1/T) ∑_{t=1}^{T} X̃_t² − ( (1/T) ∑_{t=1}^{T} X̃_t )² ]
        = (1/T) [ T E(X̃_t²) − (1/T) E( ∑_{t=1}^{T} ∑_{s=1}^{T} X̃_t X̃_s ) ]   (11)
        = (1/T) [ T E(X̃_t²) − E(X̃_t²) − (1/T) E( ∑_{t≠s} X̃_t X̃_s ) ]
        = (1/T) [ (T − 1) E(X̃_t²) − (1/T) ∑_{t≠s} E(X̃_t X̃_s) ] = ((T − 1)/T) σ².   (12)

Thus, σ̂² is not an unbiased estimate of σ². The bias, given by (T − 1)/T, goes to zero as T → ∞. This is a typical result from MLE: the mean is correct and the variance is biased.

To get an unbiased estimate, adjust the estimated variance:

σ̂² = ((T − 1)/T) s²,   where   s² = (1/(T − 1)) [ ∑_{t=1}^{T} X̃_t² − (1/T) ( ∑_{t=1}^{T} X̃_t )² ].   (13)
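The bias factor (T − 1)/T is easy to see in a small simulation: average the ML variance estimate over many artificial samples and compare it with the unbiased version in (13). A sketch (sample size and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma2, reps = 10, 4.0, 20000

ml_var = np.empty(reps)
for r in range(reps):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=T)
    ml_var[r] = np.sum((x - x.mean()) ** 2) / T     # sigma2_hat, equation (7)

print(ml_var.mean())                 # close to (T-1)/T * sigma2 = 3.6
print(ml_var.mean() * T / (T - 1))   # bias-corrected, close to sigma2 = 4.0
```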
Linear Regression

Look at the single variable case above.
Create ε̃_t = g(X̃) so ε_t = g(x_t).
The new random variable ε_t is a (linear) function of a normal random variable (X̃_t). It follows that ε_t is also a normal random variable.
Set ε_t = y_t − βz_t.
If we can assume that ε_t is normal, independently and identically distributed, Niid(0, σ²), the ML estimator will work. And so will any function or system of equations that has the Niid property.
Notice that it is the property of ε_t that we focus on, not y_t or z_t.

We have two unknowns, β and σ²_ε. The log likelihood function is

l(β, σ²_ε; y, z) = −(T/2) log 2π − (T/2) log σ²_ε − (1/(2σ²_ε)) ∑_{t=1}^{T} (y_t − z_t β)².   (14)

The last factor in this expression can be identified as the (objective) sum of squares function, S(β). In matrix form we have,

S(β) = ∑_{t=1}^{T} (y_t − z_t β)² = (Y − Zβ)′(Y − Zβ),   (15)

and

l(β, σ²_ε; y, z) = −(T/2) log 2π − (T/2) log σ²_ε − (1/(2σ²_ε)) (Y − Zβ)′(Y − Zβ).   (16)

Solution

Differentiation of S(β), first with respect to β, yields

∂S/∂β = −2Z′(Y − Zβ),   (17)

which, if set to zero, solves to

β̂ = (Z′Z)⁻¹(Z′Y).   (18)

The variance estimate becomes,

σ̂²_ε = ε̂′ε̂ / T.   (19)
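Equations (18) and (19) translate directly into a few lines of linear algebra. A minimal sketch with simulated data (variable names and the data-generating values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
Z = np.column_stack([np.ones(T), rng.normal(size=T)])   # constant + one regressor
beta_true = np.array([1.0, 0.5])
Y = Z @ beta_true + rng.normal(scale=0.8, size=T)        # Niid(0, sigma2) errors

# Equation (18): beta_hat = (Z'Z)^{-1} Z'Y (solve() is more stable than inv())
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# Equation (19): sigma2_hat = e'e / T, using the residuals e = Y - Z beta_hat
resid = Y - Z @ beta_hat
sigma2_hat = resid @ resid / T

print(beta_hat, sigma2_hat)
```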
A Multidimensional System

Extended to a p-dimensional vector of normal random variables,

x_t ∼ N(µ, Σ),   D(x_t) = D(x_{1t}, x_{2t}, ..., x_{pt}),   (20)

The random variables in the vector X̃ will have a mean vector µ and a covariance matrix Σ. The density function for the multivariate normal is,

D(x_t) = [ (2π)^{p/2} |Σ|^{1/2} ]⁻¹ exp[ −(1/2)(x_t − µ)′ Σ⁻¹ (x_t − µ) ].   (21)

With multivariate densities it is possible to handle systems of equations. The bivariate normal is an often-used device for deriving models with two variables. Set X̃ = (X̃_1, X̃_2), and

Σ = [ σ₁²   σ₁₂ ]
    [ σ₂₁   σ₂² ],   with |Σ| = σ₁² σ₂² (1 − ρ²),   (22)

where ρ is the correlation coefficient. As can be seen, |Σ| > 0 unless ρ² = 1. If σ₁₂ = σ₂₁ = 0, the two processes are independent and can be estimated individually without losing any important information. In principle, if σ₁₂ = σ₂₁ ≠ 0, the two equations are dependent, and it will be necessary to estimate a complete system of equations to get estimates that are unbiased and efficient.
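A small sketch of the density (21) for the bivariate case, evaluated at one observation; comparing it against scipy's multivariate normal is a useful sanity check (the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_logdensity(x, mu, Sigma):
    """Log of equation (21): multivariate normal density for one observation x."""
    p = x.size
    dev = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = dev @ np.linalg.solve(Sigma, dev)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])        # sigma12 = sigma21 != 0: a dependent system
x = np.array([0.3, 1.4])

print(mvn_logdensity(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))   # should match
```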
The Classical Tests (graph on the whiteboard)

Wald test
  Example: the t-test.
  Estimate one equation.
Likelihood ratio test
  Example: the F-test of exclusion restrictions.
  Estimate two equations: one unrestricted, one restricted.
Lagrange multiplier test
  Example: the Breusch-Godfrey test for autocorrelation, the ARCH test.
  Estimate an auxiliary equation (say, from the residuals).
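The likelihood ratio idea from the list above is straightforward to code: fit the unrestricted and the restricted model, then compare twice the difference in maximized log likelihoods with a χ² distribution. A sketch for an exclusion restriction in the normal regression model of (14)-(19) (simulated data, illustrative names):

```python
import numpy as np
from scipy.stats import chi2

def reg_loglik_max(Y, Z):
    """Maximized normal regression log likelihood, using (18) and (19)."""
    T = Y.size
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    resid = Y - Z @ beta_hat
    sigma2_hat = resid @ resid / T
    return -T / 2 * (np.log(2 * np.pi) + np.log(sigma2_hat) + 1)

rng = np.random.default_rng(3)
T = 300
z1, z2 = rng.normal(size=T), rng.normal(size=T)
Y = 1.0 + 0.5 * z1 + 0.0 * z2 + rng.normal(size=T)   # z2 truly excluded

Z_unrestricted = np.column_stack([np.ones(T), z1, z2])
Z_restricted = np.column_stack([np.ones(T), z1])      # restriction: coefficient on z2 = 0

LR = 2 * (reg_loglik_max(Y, Z_unrestricted) - reg_loglik_max(Y, Z_restricted))
p_value = chi2.sf(LR, df=1)                           # one restriction
print(LR, p_value)
```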
Learn more

See my Lecture Notes and other proper textbooks.