Applied Microeconometrics with Stata
Discrete Choice Models
Spring Term 2011
1 / 54
Contents
- Introduction and examples
- Maximum likelihood estimation, computation, and hypothesis tests
- Binary choice models
- Multinomial models
- Ordered choice models
- Models for count data
- Summary and references
2 / 54
Introduction
- If the dependent variable is discrete, linear regression models may be inappropriate or lead to results which are difficult to interpret.
- Various kinds of discrete dependent variables may occur in empirical studies:
  - Binary dependent variables.
  - Discrete variables with more than two values which can be ordered.
  - Discrete variables without ordering.
  - Count data.
- Continuous dependent variables with censoring and truncation will be covered in the following part of the lecture.
- The econometric models for discrete dependent variables considered in the following are based on the maximum likelihood approach.
- In general, the estimated coefficients cannot be interpreted directly as in the linear regression model.
- For the interpretation of the results, covariate effects on the probability of the occurrence of some outcome value are considered.
3 / 54
Examples
- Binary dependent variables take on only two values and describe binary choices.
- For example, occupational or educational choices, or participation in training measures, can be described by binary dependent variables.
- Discrete ordered variables are discrete variables with more than two values which can be related by some order.
- Examples are successive educational degrees or the proportion of risky assets within a portfolio.
- An example of a discrete variable with more than two values which cannot be ordered is the travel mode of a commuter.
- Binary variables are a special case of unordered discrete variables.
- Count data consist of discrete nonnegative variables. Examples are any number of items, individuals, or events, like the number of children, patents, consultations, visits, or further trainings.
- Count variables with a few different values may also be modelled as ordered data. For count variables with many different values, linear models may be appropriate.
4 / 54
Maximum Likelihood Estimation
- An important framework for specification, estimation, and testing of nonlinear econometric models is the maximum likelihood (ML) approach.
- The basic idea of the ML approach is to specify a distribution of the data and to choose the parameters of the model such that the joint density of the data is maximized.
- In classical maximum likelihood theory, a parametric specification of the data distribution is used.
- Formally, consider a pair of random variables (Y, X) and an i.i.d. sample {(Y_i, X_i)}_{i=1}^n from it.
- The random variables (Y, X) have the distribution F_{Y|X}(y|x; θ_0).
- The distribution F_{Y|X}(y|x; θ_0) (and the density f_{Y|X}(y|x; θ_0)) is assumed to be known except for the parameter θ_0.
- Therefore, for the estimation of the unknown parameter θ_0, the joint density of the sample, f_{Y|X}(y|x; θ), is viewed as a function of the parameter θ and is maximized with respect to this parameter.
5 / 54
Maximum Likelihood Estimation
- The conditional likelihood function of the unknown parameter θ is defined as
      ℓ(θ; Y|X) ≡ f(Y|X; θ).
- The idea is to treat the data (Y, X) as known and use ℓ(θ; Y|X) to estimate the unknown parameter θ_0.
- Denote the logarithm of the likelihood function by L(θ; Y|X):
      L(θ; Y|X) ≡ log ℓ(θ; Y|X).
  L(θ; Y|X) is called the conditional log-likelihood function.
- The sample log-likelihood function is defined using the joint density of the sample:
      L(θ; {Y_i, X_i}_{i=1}^n) = log( ∏_{i=1}^n f(Y_i|X_i; θ) ) = ∑_{i=1}^n log f(Y_i|X_i; θ).
6 / 54
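As a small illustration of forming and maximizing a sample log-likelihood (a hedged sketch, not part of the original slides), consider i.i.d. Bernoulli data, where the MLE has the known closed form p̂ = ȳ. The simulated data and the grid of candidate parameter values are assumptions for illustration.

```python
import math
import random

def sample_loglik(p, y):
    # Sample log-likelihood of i.i.d. Bernoulli(p) data:
    # sum_i [ y_i*log(p) + (1 - y_i)*log(1 - p) ]
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

random.seed(0)
y = [1 if random.random() < 0.3 else 0 for _ in range(1000)]  # assumed data

# Maximize the sample log-likelihood over a grid of candidate values for p
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: sample_loglik(p, y))

print(p_hat, sum(y) / len(y))  # the grid maximizer equals the sample mean
```

Because the sample mean of 1000 binary observations lies exactly on the grid, the numerical maximizer coincides with the analytical MLE here.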
Maximum Likelihood Estimation
- As the single sample realisations are independent of each other, the joint density is the product of the single densities.
- To see this, recall the general definition of independence of two random variables A and B: A and B are independent if
      Pr(A ∩ B) = Pr(A)Pr(B).
- Using the density functions f_A and f_B of A and B, the definition of independence is expressed as
      f_{A,B}(x, y) = f_A(x)f_B(y),
  where f_{A,B} is the joint density of A and B, and x and y are values taken on by A and B.
- The second equality follows by the following general result on the logarithm of products:
      log(ab) = log(a) + log(b).
7 / 54
Maximum Likelihood Estimation
- The maximum of L(θ; U) with respect to θ is attained at the same value of θ as the maximum of ℓ(θ; U), as the logarithm is a strictly monotone transformation.
- For a strictly monotone transformation g(x) it holds that either
      g(x_1) > g(x_2)  ∀ x_1 > x_2,   or
      g(x_1) < g(x_2)  ∀ x_1 > x_2.
- Intuitively, a monotone transformation does not change the order of its arguments. The values of the arguments are changed, however.
- As a simple example, consider the linear transformation g(x) = ax + b.
- Due to the monotonicity property, the log-likelihood function is used for the analysis of the ML estimator. Here, g(x) = log(x), and the argument is the joint density of the sample.
- The advantages of using the log-likelihood function are easier computation and the possibility to use results for sums of random variables, like laws of large numbers and central limit theorems.
8 / 54
Maximum Likelihood Estimation
- The maximum likelihood estimator (MLE) θ̂ is defined as the value which maximizes the empirical expectation (i.e., the sample mean) of the log-likelihood function:
      θ̂ ≡ arg max_{θ∈Θ} E_n[L(θ; Y|X)].
  Here and in the following, E_n[X] denotes n^{−1} ∑_{i=1}^n X_i.
- Θ is the set of all possible values of the parameters of the ML model. For example, Θ = R^d if the dimension of θ is equal to d, where d ∈ N (i.e., if the model has d unknown parameters).
- It is assumed that the joint density of (Y, X) is maximized by a unique value θ_0.
- To state this assumption in a formal way, consider first the following definition of global identification: the parameter θ_0 is (globally) identified in Θ if for all θ_1 ∈ Θ, θ_0 ≠ θ_1, it holds that
      Pr( f(Y|X; θ_0) ≠ f(Y|X; θ_1) ) > 0.
  Note that f(Y|X) is itself a random variable, as it depends on the random variable X.
9 / 54
Maximum Likelihood Estimation
- To analyze the MLE further, assume global identification of the true parameter θ_0 ∈ Θ of the ML model.
- Additionally, a dominance assumption is imposed: the expectation E[sup_{θ∈Θ} |L(θ; Y|X)|] exists, i.e., it is finite.
- Using these assumptions, it can be shown that for the true value θ_0 and some other θ_1 ∈ Θ, the fact that θ_0 ≠ θ_1 implies that
      E[L(θ_1; Y|X)] < E[L(θ_0; Y|X)].
- This is the strict expected log-likelihood inequality. For a proof, see Lemma 14.2 of Ruud (2000) (where a slightly different notation is used).
- This result states that the maximal value of the expectation of the log-likelihood function is attained at the true value of the parameter of the ML model.
- Furthermore, this maximal value is unique.
- Put differently, the MLE is identified.
10 / 54
Maximum Likelihood Estimation
- Having established the uniqueness of the MLE, it can be analyzed further.
- The expectation of the log-likelihood function is assumed to be twice differentiable.
- Define now the score function L_θ(θ) as the derivative of the log-likelihood function with respect to θ:
      L_θ(θ) ≡ ∂L(θ)/∂θ,
  where L(θ) is a shortcut for L(θ; Y|X).
- It can be shown that for the true value θ_0 of the parameter θ of the ML model it holds that
      E[L_θ(θ_0)|X = x] = 0.
  This result is called the score identity (Lemma 14.3 of Ruud (2000)).
- That is, a value of θ with E[L_θ(θ)|X = x] = 0 is equal to the true parameter θ_0.
11 / 54
Maximum Likelihood Estimation
- To see that the score identity holds, note first that
      1 = ∫ f(y|x; θ) dy,
  as f(y|x; θ) is a probability density (which integrates to one).
- Differentiate both sides with respect to θ:
      0 = ∫ ∂f(y|x; θ)/∂θ dy,
  which can be rewritten as
      0 = ∫ [1/f(y|x; θ)] [∂f(y|x; θ)/∂θ] f(y|x; θ) dy     (1)
  by simultaneous multiplication and division by f(y|x; θ). This relationship holds for all values of θ.
- Note further that it holds that
      [1/f(Y|X; θ)] [∂f(Y|X; θ)/∂θ] = L_θ(θ; Y|X).
12 / 54
Maximum Likelihood Estimation
- This last equality can be seen as follows:
      L_θ(θ; Y|X) = ∂L(θ; Y|X)/∂θ = ∂ ln f(Y|X; θ)/∂θ = [∂f(Y|X; θ)/∂θ] / f(Y|X; θ).
- Consider now the conditional expectation of L_θ(θ; Y|X):
      E[ [1/f(Y|x; θ)] [∂f(Y|x; θ)/∂θ] | X = x ] = ∫ [1/f(y|x; θ)] [∂f(y|x; θ)/∂θ] f(y|x; θ_0) dy.
- Note that the expectation uses the true density, i.e., the density f(y|x; θ_0) which weights the integrand [1/f(y|x; θ)] × [∂f(y|x; θ)/∂θ] depends on θ_0.
- By the general relationship (1) derived above, evaluated at θ = θ_0 (so that the integrand and the weighting density depend on the same parameter value), it follows that
      ∫ [1/f(y|x; θ_0)] [∂f(y|x; θ)/∂θ]|_{θ=θ_0} f(y|x; θ_0) dy = E[L_θ(θ_0; Y|X)|X = x] = 0.
  Note that f(x)|_{x=x̄} = f(x̄).
- This shows the validity of the score identity.
13 / 54
Maximum Likelihood Estimation
- An implication of the score identity is the following equivalence:
      θ_0 = arg max_{θ∈Θ} E[L(θ)]  ⟺  ∂E[L(θ)]/∂θ |_{θ=θ_0} = 0,
  which holds provided E[L(θ)] is concave.
- Therefore, an estimate of θ_0 can be obtained as a solution θ of the equation E_n[L_θ(θ)] = 0.
- A condition for concavity of E[L(θ)] is the negative definiteness of the second derivative of the expected sample log-likelihood function, i.e., for all vectors c ≠ 0 it holds that
      c′E_n[L_θθ(θ)]c < 0,
  where E_n[L_θθ(θ)] ≡ ∂²E_n[L(θ)]/∂θ∂θ′.
- Concavity of E[L(θ)] follows by the information identity (Lemma 14.4 of Ruud (2000)), which states that the conditional expectation of the second derivative of the log-likelihood function is the negative of a variance, i.e., is negative definite:
      E[L_θθ(θ_0)|X = x] = −Var[L_θ(θ_0)|X = x].
14 / 54
Maximum Likelihood Estimation
- To see this result, note first that by the score identity it holds that
      0 = ∫ L_θ(θ_0; y|x) f(y|x; θ_0) dy.
- Consider now the derivative of the following expression:
      ∂/∂θ′ [ L_θ(θ; y|x) f(y|x; θ) ]
        = [∂L_θ(θ; y|x)/∂θ′] f(y|x; θ) + L_θ(θ; y|x) [∂f(y|x; θ)/∂θ′]
        = L_θθ(θ; y|x) f(y|x; θ) + L_θ(θ; y|x) f_θ′(y|x; θ)
        = [ L_θθ(θ; y|x) + L_θ(θ; y|x) L_θ′(θ; y|x) ] f(y|x; θ),
  where the second term of the rhs of the last equality follows by
      L_θ(θ; y|x) f_θ′(y|x; θ) = L_θ(θ; y|x) [f_θ′(y|x; θ)/f(y|x; θ)] f(y|x; θ)
                               = L_θ(θ; y|x) L_θ′(θ; y|x) f(y|x; θ),
  as L_θ(θ) = ∂L(θ)/∂θ = ∂ ln f(θ)/∂θ = f_θ(θ)/f(θ) (as already seen).
15 / 54
Maximum Likelihood Estimation
- Now, differentiate the score identity and consider θ = θ_0:
      0 = ∫ [ L_θθ(θ_0; y|x) + L_θ(θ_0; y|x) L_θ′(θ_0; y|x) ] f(y|x; θ_0) dy.
- As the integral is additive wrt the integrand, this is equivalent to
      ∫ L_θθ(θ_0; y|x) f(y|x; θ_0) dy = − ∫ L_θ(θ_0; y|x) L_θ′(θ_0; y|x) f(y|x; θ_0) dy,
      E[L_θθ(θ_0; Y|X)|X = x] = −E[L_θ(θ_0; Y|X) L_θ′(θ_0; Y|X)|X = x].
- As the expectation of L_θ(θ_0; Y|X) is zero, the rhs of the last expression is equal to the negative of the variance of L_θ(θ_0; Y|X):
      −E[L_θ(θ_0; Y|X) L_θ′(θ_0; Y|X)|X = x] = −Var[L_θ(θ_0; Y|X)|X = x],
  which completes the proof of the information identity.
- The variance is called the conditional information matrix ℑ(θ_0|x), i.e., ℑ(θ_0|x) = Var[L_θ(θ_0; Y|X)|X = x]. The population information matrix ℑ(θ_0) is defined as E[ℑ(θ_0|x)].
- The variance of every unbiased estimator of θ_0 is bounded from below by ℑ(θ_0|x)^{−1} (Cramér–Rao lower bound).
16 / 54
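The score and information identities can be checked numerically. The following hedged sketch (not from the slides) simulates Bernoulli(p_0) data: the sample mean of the score should be near zero, and the sample variance of the score should match the negative of the sample mean of the second derivative, both estimating 1/(p_0(1 − p_0)). The value p_0 = 0.4 and the sample size are assumptions for illustration.

```python
import random

random.seed(1)
p0 = 0.4
n = 200_000
y = [1 if random.random() < p0 else 0 for _ in range(n)]

# Per-observation score and second derivative of the Bernoulli log-likelihood at p0
score = [yi / p0 - (1 - yi) / (1 - p0) for yi in y]           # d/dp log f(y; p)
hess = [-yi / p0**2 - (1 - yi) / (1 - p0) ** 2 for yi in y]   # d^2/dp^2 log f(y; p)

mean_score = sum(score) / n
var_score = sum((s - mean_score) ** 2 for s in score) / n
neg_mean_hess = -sum(hess) / n

print(round(mean_score, 3))                          # near 0 (score identity)
print(round(var_score, 2), round(neg_mean_hess, 2))  # both near 1/(p0*(1-p0)) ≈ 4.17
```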
Maximum Likelihood Estimation
- Until now, only properties of the log-likelihood function were derived.
- Of central interest, however, are the properties of the MLE θ̂ itself.
- One important property of the MLE is consistency:
      θ̂ →p θ_0.
- An estimator θ̂ is consistent for a parameter θ_0 if for n → ∞ and any ε > 0 it holds that
      Pr( |θ̂ − θ_0| > ε ) → 0.
  Note that ε is viewed as very small. Another notation for the consistency of an estimator θ̂ is plim θ̂ = θ_0.
- The proof of consistency is less straightforward here than in the OLS case, as the MLE is defined implicitly as the maximum of the empirical expectation of the log-likelihood function.
- In the following, the basic steps of the proof will be outlined. For a complete derivation, see the proof of Proposition 16 and chapter 15 of Ruud (2000), or section 2 of Newey and McFadden (1994).
17 / 54
Maximum Likelihood Estimation
- To show consistency of the MLE, note first that the sample log-likelihood function converges uniformly over θ ∈ Θ to the expected log-likelihood function:
      E_n[L(θ)] →p E[L(θ)].
  [Lemma 15.1 of Ruud (2000).]
- By definition it holds that θ̂ = arg max E_n[L(θ)], and by the log-likelihood inequality it follows that θ_0 = arg max E[L(θ)].
- If a sequence of functions converges to a continuous limit function, the maximizing values of the sequence of functions converge to the maximizing value of the limit function (Lemma 15.2 of Ruud (2000)).
- Therefore, as E_n[L(θ)] converges uniformly to E[L(θ)], the maximum of E_n[L(θ)] (i.e., θ̂) converges to that of E[L(θ)] (which equals θ_0 by the strict expected log-likelihood inequality).
- From this it follows directly that θ̂ →p θ_0, which shows the consistency of the MLE.
18 / 54
Maximum Likelihood Estimation
- Consider now the asymptotic distribution of the MLE, which is given by
      √n (θ̂ − θ_0) →d N( 0, ℑ(θ_0)^{−1} ),
  where ℑ(θ_0) denotes the population information:
      ℑ(θ_0) = E[L_θ(θ_0; Y|X) L_θ(θ_0; Y|X)′].
  [Proposition 16 of Ruud (2000).]
- Consistency of the MLE was shown by properties of the sample and expected log-likelihood functions, i.e., no explicit expression of the MLE was used.
- In contrast, to derive the asymptotic distribution of the MLE, an explicit expression of the MLE is needed.
- This expression is obtained by an application of the mean value theorem.
- Again, only the basic steps will be presented. A complete derivation is contained in section 15.3 of Ruud (2000).
19 / 54
Maximum Likelihood Estimation
- The mean value theorem states that for a differentiable function f(x) with x ∈ [a, b] there exists at least one x̄ ∈ (a, b) such that
      [f(b) − f(a)] / (b − a) = df(x)/dx |_{x=x̄},
  where a and b are elements of the domain of f(x). The value x̄ can be written as x̄ = αa + (1 − α)b for some α ∈ (0, 1).
- This expression can be rewritten as
      f(b) = f(a) + df(x)/dx |_{x=x̄} (b − a).
- Consider now the usage of the mean value theorem for deriving the asymptotic distribution of the MLE.
- Let b be an estimator and a its true value, and let the function f(·) be the derivative of the empirical expectation of the log-likelihood function.
- By applying the mean value theorem to this derivative, the difference between the (implicitly defined) MLE and its true value can be separated.
20 / 54
Maximum Likelihood Estimation
- The aim is to obtain an expression for √n (θ̂ − θ_0).
- Apply the mean value theorem to E_n[L_θ(θ̂)] for the points θ_0 and θ̂ and rearrange the terms:
      E_n[L_θ(θ̂)] = E_n[L_θ(θ_0)] + E_n[L_θθ(θ̄)](θ̂ − θ_0),
  where θ̄ = αθ̂ + (1 − α)θ_0, α ∈ [0, 1].
- Note that the function which is approximated is a derivative itself.
- Recall that E_n[L_θ(θ̂)] = 0 by definition. Therefore, the expression just stated equals
      0 = E_n[L_θ(θ_0)] + E_n[L_θθ(θ̄)](θ̂ − θ_0).
- This can be rewritten as follows:
      −E_n[L_θθ(θ̄)](θ̂ − θ_0) = E_n[L_θ(θ_0)],
      θ̂ − θ_0 = ( −E_n[L_θθ(θ̄)] )^{−1} E_n[L_θ(θ_0)].
- Multiply by √n to obtain the desired expression:
      √n (θ̂ − θ_0) = ( −E_n[L_θθ(θ̄)] )^{−1} √n E_n[L_θ(θ_0)].
21 / 54
Maximum Likelihood Estimation
- Consider the second term of the rhs of the last equation. By a central limit theorem, it follows that
      √n E_n[L_θ(θ_0)] →d N(0, ℑ(θ_0)).
  [Lemma 15.3 of Ruud (2000).]
- By the consistency of the MLE it holds that θ̄ →p θ_0:
      θ̄ = αθ̂ + (1 − α)θ_0 →p αθ_0 + (1 − α)θ_0 = θ_0.
- By a law of large numbers and by the convergence of θ̄ just shown, it holds (using the information identity) that
      −E_n[L_θθ(θ̄_n)] →p ℑ(θ_0).
  [Lemma 15.6 of Ruud (2000).]
- Consider now the following result: if b_n →d N(0, Ω) and H_n →p H, then it holds for the product H_n b_n that
      H_n b_n →d N(0, HΩH′).
22 / 54
Maximum Likelihood Estimation
- Combine now all the results to derive the asymptotic distribution of
      √n (θ̂ − θ_0) = ( −E_n[L_θθ(θ̄)] )^{−1} √n E_n[L_θ(θ_0)].
- First,
      √n E_n[L_θ(θ_0)] →d N(0, ℑ(θ_0)).
  This expression is multiplied by ( −E_n[L_θθ(θ̄)] )^{−1}, which converges in probability to ℑ(θ_0)^{−1}.
- Therefore, the asymptotic distribution of √n (θ̂ − θ_0) is given by
      √n (θ̂ − θ_0) →d N( 0, ℑ(θ_0)^{−1} ℑ(θ_0) ℑ(θ_0)^{−1} ) = N( 0, ℑ(θ_0)^{−1} ).
- This completes the derivation of the asymptotic distribution of the maximum likelihood estimator.
- Finally, note that E_n[−L_θθ(θ̂)], Var_n[L_θ(θ̂)], and E_n[ℑ(θ̂)] are all (asymptotically equivalent) estimators of the information matrix ℑ(θ_0).
23 / 54
Maximum Likelihood Computation
- General problem: find some θ̂ ∈ Θ which maximizes an objective function L(θ), i.e.,
      θ̂ = arg max_{θ∈Θ} L(θ).
- In general, no closed-form solutions exist for nonlinear models.
- Therefore, iterative numerical procedures are used.
- General algorithm of an iterative numerical procedure: choose θ̂_{s+1} as a function of the solution θ̂_s of the previous step until a convergence criterion is satisfied.
- As criteria of convergence, the conditions for a maximum should be used:
  - The gradient vector of L(θ_s) should be small (for example, below 10^{−4}).
  - The Hessian matrix of L(θ_s) has to be negative definite.
  - The difference θ̂_s − θ̂_{s−1} should not be used as a criterion, as it might be small due to slow convergence of the optimization, although the global optimum is not reached.
24 / 54
Maximum Likelihood Computation
- The standard methods for ML computation are gradient search algorithms.
- For these algorithms, the general form of the update formula (which relates the solution θ̂_{s+1} to the previous solution θ̂_s) is given by
      θ̂_{s+1} = θ̂_s + A_s g_s,
  where A_s is a weighting matrix and g_s is the gradient of the objective function.
- In the ML context, g_s is the score function evaluated at θ = θ̂_s:
      g_s = ∂E_n[L(θ)]/∂θ |_{θ=θ̂_s}.
- The gradient g_s can be viewed as the search direction, and A_s as the 'step length' of the search.
- The search direction is chosen such that the objective function at the new parameter value (i.e., in the next step) is increased most.
- The search direction which increases the objective function most is equal to the derivative of the objective function.
25 / 54
Maximum Likelihood Computation
- A simple gradient search algorithm uses the derivative without step length adaption, i.e., sets A_s = I, where I is the identity matrix (method of steepest ascent).
- Other gradient search algorithms use different choices of A_s.
- The standard method of ML computation, the Newton–Raphson method, uses the Hessian of the log-likelihood function in determining search direction and step length.
- This approach results in the following update function:
      θ̂_{s+1} = θ̂_s − ( ∂²E_n[L(θ)]/∂θ∂θ′ |_{θ=θ̂_s} )^{−1} ∂E_n[L(θ)]/∂θ |_{θ=θ̂_s}.
- Other algorithms based on the same principle (but on different variance expressions) are the modified scoring and the BHHH algorithms.
- Various further algorithms have been proposed in the literature.
26 / 54
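The Newton–Raphson update can be sketched for a one-parameter logit model, p_i = Λ(βx_i). This is a hedged illustration, not code from the course: the data are simulated with an assumed true coefficient of 1.5, and the convergence criterion is a small gradient, as recommended above.

```python
import math
import random

random.seed(2)
beta_true = 1.5  # assumed true coefficient for the simulation
x = [random.uniform(-2, 2) for _ in range(5000)]
y = [1 if random.random() < 1 / (1 + math.exp(-beta_true * xi)) else 0 for xi in x]

beta = 0.0  # starting value
for _ in range(25):
    p = [1 / (1 + math.exp(-beta * xi)) for xi in x]
    score = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))     # gradient of log-likelihood
    hessian = -sum(pi * (1 - pi) * xi**2 for pi, xi in zip(p, x))  # second derivative (negative)
    beta = beta - score / hessian                                  # Newton-Raphson update
    if abs(score) < 1e-8:  # convergence criterion: gradient close to zero
        break

print(round(beta, 2))  # estimate should be near the assumed true value 1.5
```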
Maximum Likelihood Computation
- Possible problems of gradient-based optimization algorithms:
  - Discontinuous objective functions: no derivative exists.
  - Multiple maxima: the algorithm converges to a local maximum.
  - Flat areas of the objective function: the search direction cannot be determined clearly.
- If one of these problems occurs, grid search algorithms may be useful.
  - Basic idea: a grid (i.e., a discrete set) of values of θ is defined, L(θ) is evaluated for each grid point, and θ̂ is chosen as the grid value which maximizes L(θ).
  - A finer grid may be used in a neighborhood of the solution to refine the estimate of θ.
  - Disadvantage: impractically large grid sizes are needed for k > 1.
  - May be of practical interest if only one parameter of the model has to be estimated in this way.
- Finally, optimization heuristics (like simulated annealing, threshold accepting, and evolutionary (or genetic) algorithms) may yield more reliable results when gradient methods fail.
27 / 54
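The two-stage grid search idea (coarse grid, then a finer grid around the coarse solution) can be sketched as follows. The quadratic objective is an assumed stand-in for a one-parameter sample log-likelihood; its maximizer 0.7314 is an arbitrary value for illustration.

```python
def objective(theta):
    # Assumed stand-in for a sample log-likelihood, maximized at 0.7314
    return -(theta - 0.7314) ** 2

def grid_search(f, lo, hi, points=101):
    # Evaluate f on an equally spaced grid and return the maximizing grid value
    grid = [lo + (hi - lo) * k / (points - 1) for k in range(points)]
    return max(grid, key=f)

# Coarse grid over [-5, 5] (step 0.1), then a finer grid around the coarse solution
coarse = grid_search(objective, -5.0, 5.0)
fine = grid_search(objective, coarse - 0.1, coarse + 0.1)

print(round(fine, 3))  # within one fine grid step of the true maximizer
```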
Maximum Likelihood Computation
- For the models presented in this lecture, standard optimization methods usually work well.
- Problems like failing convergence might occur for complicated ML models, for example those taking into account individual heterogeneity (like the negative binomial model for count data).
- Possible solutions for (suspected) numerical problems:
  - Different starting values of the iterative algorithms.
  - If the explanatory variables differ with respect to their range (for example, when using indicator variables and squared experience terms), the larger variables can be divided by 100 or 1000, such that their range is similar to those of the other variables.
  - Use of large samples (i.e., no analysis of subsamples if the problems occur in small samples).
- Different algorithms (like BHHH) may be used if the problem seems to stem from the standard approach used.
- Details of the optimization algorithm (like convergence criteria) should not be changed.
28 / 54
Maximum Likelihood Hypothesis Tests
- There are three commonly used test procedures in the ML context.
- The Wald test uses the deviation of the estimated parameters from their values under the null hypothesis as the basis of the test.
- The score test uses the derivative of the restricted likelihood function and tests whether it is zero.
- The likelihood ratio test compares the likelihood functions of the restricted and unrestricted models and tests whether the difference is zero.
- All three test procedures are asymptotically equivalent.
- For the description of the tests in the following, partition the parameter vector θ as
      θ = (θ_1′, θ_2′)′
  and consider the following hypothesis about the subset θ_2 of θ:
      H_0: θ_2 = 0,
  that is, θ_1 are the unrestricted and θ_2 are the restricted parameters.
29 / 54
Maximum Likelihood Hypothesis Tests
- The Wald test uses the estimated parameters and compares them to the hypothesized values.
- If the (normalized) differences between estimated and hypothesized parameters are large when taking into account estimation uncertainty, the null hypothesis is rejected.
- To implement the Wald test, the unrestricted MLE is computed:
      θ̂ = arg max_{θ∈Θ} E_n[L(θ)].
- The variance of the MLE is computed by one of the estimators of ℑ(θ_0)^{−1}.
- The Wald test statistic W is defined as
      W = n θ̂_2′ V̂_W^{−1} θ̂_2,
  where V̂_W is the block of the variance matrix related to θ̂_2.
- H_0 is rejected if W exceeds the critical value of a χ²_{k−m} distribution, where k = dim(θ) and m = dim(θ_1) (i.e., k − m is the number of restrictions).
30 / 54
Maximum Likelihood Hypothesis Tests
- The test statistic of the score test is based on the derivative of the restricted log-likelihood function.
- A large value of the derivative is a hint that the value of the likelihood function for the parameter values under the null hypothesis differs from the global (unconstrained) maximum.
- Idea: if H_0 is true, it holds approximately that L_2(θ̂_R) = 0, where L_2(θ̂_R) is the derivative of L(θ̂_R) wrt θ_2 and
      θ̂_R = arg max_{θ∈Θ, θ_2=0} E_n[L(θ)] = ( arg max_{θ_1} E_n[L(θ_1, 0)]′, 0′ )′.
- To implement the test, the restricted MLE θ̂_R and the score E_n[L_2(θ̂_R)] are computed first.
- The variance of √n E_n[L_θ(θ_0)] is estimated by ℑ(θ̂_R), for example.
- The test statistic equals
      S = n E_n[L_2(θ̂_R)]′ V̂_S^{−1} E_n[L_2(θ̂_R)],
  where V̂_S is the estimated variance of √n E_n[L_2(θ̂_R)].
- H_0 is rejected if S exceeds the critical value of a χ²_{k−m} distribution.
31 / 54
Maximum Likelihood Hypothesis Tests
- The likelihood ratio test is based on the likelihood functions of the restricted and the unrestricted models.
- The values of the likelihood functions at θ̂ and θ̂_R are used as measures of goodness of fit, and the difference is used as the test statistic.
- If the difference of the likelihood functions is large, the null hypothesis seems not to be compatible with the observed data set.
- To implement the test, the restricted and the unrestricted MLEs θ̂_R and θ̂ are computed.
- The values of the restricted and unrestricted log-likelihood functions L(θ̂_R) and L(θ̂) are used to compute the test statistic
      LR = 2( L(θ̂) − L(θ̂_R) ).
- Again, H_0 is rejected if LR exceeds the critical value of a χ²_{k−m} distribution.
- Note that in contrast to the Wald and score tests, two models have to be estimated for the LR test.
32 / 54
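The LR statistic can be illustrated in the simplest possible setting, a Bernoulli model with H_0: p = 0.5 (one restriction). This is a hedged sketch, not from the slides; the true parameter 0.55 and sample size are assumed values for the simulation.

```python
import math
import random

def loglik(p, y):
    # Bernoulli sample log-likelihood
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

random.seed(3)
y = [1 if random.random() < 0.55 else 0 for _ in range(1000)]  # assumed true p = 0.55

p_hat = sum(y) / len(y)  # unrestricted MLE; the restricted model fixes p = 0.5
lr = 2 * (loglik(p_hat, y) - loglik(0.5, y))

# One restriction, so compare LR with the chi-squared(1) 5% critical value 3.84
print(round(lr, 2), lr > 3.84)
```

Note that, as stated above, both the restricted and the unrestricted log-likelihood values enter the statistic; LR is nonnegative by construction, since p̂ maximizes the log-likelihood.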
Binary Choice Models
- The dependent variable y takes on only two values:
      y = 1 with probability p,   y = 0 with probability 1 − p.
- The basic idea of regression models for binary outcomes is to specify the conditional probability of y = 1:
      p_i = Pr(y_i = 1|x_i) = F(x_i′β),
  where F(·) usually is a function with codomain [0, 1].
- ML model: the conditional probability mass function f(y_i|x_i) of a single value y_i is given by
      f(y_i|x_i) = p_i^{y_i} (1 − p_i)^{1−y_i},
  where the p_i are functions of the covariates.
- With p_i = F(x_i′β), the joint probability of the data is given by
      ∏_{i=1}^n f(y_i|x_i) = ∏_{i=1}^n F(x_i′β)^{y_i} (1 − F(x_i′β))^{1−y_i}
                           = ∏_{i: y_i=1} F(x_i′β) ∏_{i: y_i=0} (1 − F(x_i′β)).
33 / 54
Binary Choice Models
- The log-likelihood function is given by
      L(β) = ln( ∏_{i=1}^n F(x_i′β)^{y_i} (1 − F(x_i′β))^{1−y_i} )
           = ∑_{i=1}^n [ y_i ln F(x_i′β) + (1 − y_i) ln(1 − F(x_i′β)) ].
- The MLE β̂ solves E_n[L_β(β)] = 0, which leads to
      ∑_{i=1}^n [ (y_i / F(x_i′β)) F_β(x_i′β) − ((1 − y_i) / (1 − F(x_i′β))) F_β(x_i′β) ] = 0
  or, equivalently,
      ∑_{i=1}^n [ (y_i − F(x_i′β)) / ( F(x_i′β)(1 − F(x_i′β)) ) ] F_β(x_i′β) = 0.
  F_β(·) denotes the derivative of the cdf F(·) with respect to β.
34 / 54
Binary Choice Models
- Various specifications of the link function p_i = F(x_i′β) are possible.
- The probit model uses the cumulative distribution function of a standard normal random variable as the link function:
      p_i = Φ(x_i′β) = ∫_{−∞}^{x_i′β} φ(z) dz = ∫_{−∞}^{x_i′β} (1/√(2π)) exp(−z²/2) dz.
- The logit model uses the logistic cumulative distribution function as the link function:
      p_i = Λ(x_i′β) = exp(x_i′β) / (1 + exp(x_i′β)) = 1 / (1 + exp(−x_i′β)),
  where the last equality follows by multiplying both numerator and denominator by exp(−x_i′β).
- The linear probability model uses simply the identity mapping, i.e.,
      p_i = x_i′β.
  This corresponds to a linear regression of the binary variable y on x.
- Note that the probabilities obtained by the linear probability model are not restricted to lie between zero and one.
35 / 54
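The three link functions can be compared numerically for a given linear index. This hedged sketch (not part of the slides) uses an arbitrary assumed index value of 0.8; the normal cdf is written via the error function, Φ(z) = (1 + erf(z/√2))/2.

```python
import math

def probit_link(z):
    # Standard normal cdf Phi(z), expressed via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logit_link(z):
    # Logistic cdf Lambda(z)
    return 1 / (1 + math.exp(-z))

index = 0.8  # assumed value of the linear index x'beta
print(round(probit_link(index), 3))  # Phi(0.8) ≈ 0.788
print(round(logit_link(index), 3))   # Lambda(0.8) ≈ 0.690
print(index)                         # linear probability model: the index itself
```

The two nonlinear links give similar but not identical probabilities; the linear probability model returns the index directly, which is why its fitted values can fall outside [0, 1].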
Binary Choice Models
- Consider now the interpretation of the parameter estimates of a (nonlinear) binary choice model.
- Sign and significance of a parameter can be interpreted similarly as in the linear model. That is, a variable with a significantly negative parameter decreases the probability that the dependent variable is equal to one.
- The estimated parameters of a nonlinear binary choice model cannot be interpreted directly as marginal effects, however.
- To see this, consider the expression of the marginal effect of a continuous regressor x_j for a nonlinear binary choice model:
      ∂Pr(y_i = 1|x_i)/∂x_{ji} = F_β(x_i′β)β_j.
- The marginal effect depends on x_i and therefore differs for each individual.
- For an indicator variable x_j, the marginal effect (given the other regressors x_{−j}) is defined as the difference
      Pr(y_i = 1|x_{−j,i}, x_{ji} = 1) − Pr(y_i = 1|x_{−j,i}, x_{ji} = 0).
36 / 54
Binary Choice Models
- To obtain a summary measure of the marginal effects of a variable, some kind of average marginal effect may be computed.
- The first possibility to compute a summary measure is the average marginal effect of variable j:
      (1/n) ∑_{i=1}^n F_β(x_i′β̂)β̂_j.
- Interpretation: the average marginal effect of the jth variable shows how Pr(y_i = 1|x_i) changes on average when the jth variable is marginally increased.
- A second possibility is to evaluate the marginal effect at the average of the covariates, i.e., by F_β(x̄′β̂)β̂_j, where x̄ is the vector of averages of the regressors.
- Interpretation in this case: this average effect states how Pr(y_i = 1|x_i) changes when the jth variable is marginally increased for an individual with 'average' characteristics.
- As x̄ may not be representative for the sample, the average marginal effect is preferable.
37 / 54
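The two summary measures can be computed side by side for a logit model. This is a hedged sketch, not from the slides: the coefficients and the single regressor's values are small assumed numbers for illustration.

```python
import math

def logistic_pdf(z):
    # Derivative of the logistic cdf: Lambda(z) * (1 - Lambda(z))
    p = 1 / (1 + math.exp(-z))
    return p * (1 - p)

beta0, beta1 = -0.5, 1.2        # assumed estimated coefficients
x = [-1.0, 0.0, 0.5, 1.5, 2.0]  # assumed sample values of the single regressor

# Average marginal effect: average the individual marginal effects over the sample
ame = sum(logistic_pdf(beta0 + beta1 * xi) * beta1 for xi in x) / len(x)

# Marginal effect at the mean: evaluate at the sample average of the regressor
x_bar = sum(x) / len(x)
mem = logistic_pdf(beta0 + beta1 * x_bar) * beta1

print(round(ame, 3), round(mem, 3))  # the two summary measures generally differ
```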
Binary Choice Models
- For the logit model, an additional possibility for the interpretation of the parameters exists.
- Recall that it holds for the logit model that
      p = exp(x′β) / (1 + exp(x′β)),   and   1 − p = 1 − exp(x′β)/(1 + exp(x′β)) = 1 / (1 + exp(x′β)).
- From this the definition of the odds ratio (or relative risk) follows:
      p / (1 − p) = [ exp(x′β)/(1 + exp(x′β)) ] / [ 1/(1 + exp(x′β)) ] = exp(x′β).
- The odds ratio is the ratio of the probability for y = 1 relative to the probability for y = 0. An odds ratio of two means that the probability for y = 1 is twice as large as that for y = 0.
- Consider now a change of the jth variable by one (i.e., x̃_j = x_j + 1). The odds ratio is then given by exp(x̃′β) = exp(x′β) exp(β_j).
- This means that a change of the jth variable by one unit changes the odds ratio (i.e., the relative probability) by the factor exp(β_j). If exp(β_j) > 1, the probability for y = 1 increases relative to the probability for y = 0.
38 / 54
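The odds-ratio interpretation can be verified numerically. This hedged sketch (not from the slides) uses assumed coefficients β = (0.4, 0.7) and a one-unit change of the regressor: the ratio of the odds before and after the change equals exp(β_1).

```python
import math

def odds(index):
    # Odds p/(1-p) of a logit model with linear index x'beta; equals exp(index)
    p = 1 / (1 + math.exp(-index))
    return p / (1 - p)

beta0, beta1 = 0.4, 0.7  # assumed coefficients
x = 1.0                  # assumed regressor value

ratio = odds(beta0 + beta1 * (x + 1)) / odds(beta0 + beta1 * x)
print(round(ratio, 3), round(math.exp(beta1), 3))  # both equal exp(0.7) ≈ 2.014
```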
Multinomial Models
- A generalization of binary data are dependent variables which take on more than two discrete values which cannot be ordered.
- Let y ∈ {1, . . . , J} and define p_j = Pr(y = j) as the probability that option j is chosen.
- Define the following indicators:
      y_j = 1 if y = j,   y_j = 0 if y ≠ j.
- The density for one observation and the log-likelihood function for the entire sample are given by
      f(y) = p_1^{y_1} × . . . × p_J^{y_J} = ∏_{j=1}^J p_j^{y_j},
      ln L = ln ∏_{i=1}^n ∏_{j=1}^J p_{ij}^{y_{ij}} = ∑_{i=1}^n ∑_{j=1}^J y_{ij} ln p_{ij}.
- A regression model specifies p_{ij} = Pr(y_{ij} = 1|x_i) by some cdf (which depends on some unknown parameters).
39 / 54
Multinomial Models
- Different models are used for alternative-varying and alternative-invariant regressors.
- Consider first the case of alternative-varying regressors.
- Alternative-varying regressors may take different values for each alternative (example: the price of an alternative).
- In this case, the conditional logit model is used, which defines the probabilities p_{ij} as follows:
      p_{ij} = exp(x_{ij}′β) / ∑_{ℓ=1}^J exp(x_{iℓ}′β).
- Note that in this model the covariates may differ for each option, whereas the parameters β are the same for all options.
- With δ_{jk} ≡ 1{j = k} (and omitting i), the marginal effect of x_k is
      ∂p_j/∂x_k = p_j (δ_{jk} − p_k) β.
- If the parameter estimate is positive, an increase of the covariate of option k (j ≠ k) decreases the probability of choosing option j.
40 / 54
Multinomial Models
- Next, consider the second case of alternative-invariant regressors, where the regressors have the same values for all alternatives (for example, gender).
- In this case, a multinomial logit model is used, which specifies p_{ij} as follows:
      p_{ij} = exp(x_i′β_j) / ∑_{ℓ=1}^J exp(x_i′β_ℓ).
- Consider three unordered choices, i.e., y ∈ {1, 2, 3}.
- The probabilities for choosing one of these possibilities are given by
      Pr(y = 1) = exp(x′β_1) / [ exp(x′β_1) + exp(x′β_2) + exp(x′β_3) ],
      Pr(y = 2) = exp(x′β_2) / [ exp(x′β_1) + exp(x′β_2) + exp(x′β_3) ],
      Pr(y = 3) = exp(x′β_3) / [ exp(x′β_1) + exp(x′β_2) + exp(x′β_3) ].
41 / 54
Multinomial Models
- Two of the probabilities just stated determine the third, as
      Pr(y = 1) + Pr(y = 2) + Pr(y = 3) = 1.
- Therefore, not all parameters are identified, i.e., different combinations of β_1, β_2, and β_3 may lead to the same probabilities.
- To identify the effects, one parameter vector is restricted to zero (in the following, β_1 = 0), which solely changes the interpretation, but not the model as such:
      Pr(y = 1) = 1 / [ 1 + exp(x′β_2) + exp(x′β_3) ],
      Pr(y = 2) = exp(x′β_2) / [ 1 + exp(x′β_2) + exp(x′β_3) ],
      Pr(y = 3) = exp(x′β_3) / [ 1 + exp(x′β_2) + exp(x′β_3) ].
- To interpret the results for this model, define the relative risk (for the second option, i.e., for y = 2) by the following ratio:
      Pr(y = 2) / Pr(y = 1) = exp(x′β_2).
42 / 54
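The three-choice probabilities with the normalization β_1 = 0 can be computed directly. This hedged sketch (not from the slides) uses assumed index values x′β_j; it checks that the probabilities sum to one and that the relative risk Pr(y = 2)/Pr(y = 1) equals exp(x′β_2).

```python
import math

def mnl_probs(indices):
    # indices[j] = x'beta_j for each option; the softmax gives choice probabilities
    e = [math.exp(v) for v in indices]
    total = sum(e)
    return [ej / total for ej in e]

x_beta = [0.0, 0.5, -0.3]  # assumed indices, with beta_1 normalized to zero
p = mnl_probs(x_beta)

print([round(pj, 3) for pj in p])
print(round(sum(p), 6))                                # probabilities sum to one
print(round(p[1] / p[0], 3), round(math.exp(0.5), 3))  # relative risk = exp(x'beta_2)
```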
Multinomial Models
I
Considering a unit change of the jth covariate, the relative risk is equal to exp(x′β2) exp(β2,j), where β2,j is the jth element of β2.
I
This means that a change of the jth covariate by one unit leads to a
change of the risk ratio by the factor exp(β2,j ).
I
For example, a value of exp(β2,j ) of three means that a change of
the jth covariate by one unit changes the ratio Pr (y = 2)/Pr (y = 1)
by the factor three.
I
The marginal effect in the multinomial logit model is given by

    ∂p_ij / ∂x_ik = p_ij (β_jk − Σ_ℓ p_iℓ β_ℓk),

where β_jk denotes the kth element of β_j.
I
Note that a direct interpretation of the sign is not possible in this model: the sign of the coefficient alone does not determine the sign of the marginal effect.
43 / 54
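A numerical sketch of the multinomial logit marginal effect (Python, illustrative values only, not part of the lecture) verifies the formula against a numerical derivative and shows why the sign of a coefficient need not match the sign of the effect:

```python
import numpy as np

def mnl_probs(x, B):
    """B is a (J, K) matrix with one coefficient vector per option."""
    u = np.exp(B @ x)
    return u / u.sum()

B = np.array([[0.0, 0.0],          # beta_1 = 0 (base option)
              [0.9, -0.2],
              [0.4, 0.8]])
x = np.array([0.5, 1.0])
p = mnl_probs(x, B)

# dp_j/dx_k = p_j (beta_{jk} - sum_l p_l beta_{lk}), elementwise:
bbar = p @ B                       # probability-weighted average coefficients
ME = p[:, None] * (B - bbar)       # (J, K) matrix of marginal effects

# Numerical check for option j = 1, covariate k = 0:
h = 1e-6
xh = x.copy()
xh[0] += h
num = (mnl_probs(xh, B)[1] - p[1]) / h
print(np.isclose(ME[1, 0], num, atol=1e-4))

# For a fixed covariate, the effects sum to zero across options, so a
# positive coefficient cannot translate into positive effects everywhere.
print(np.allclose(ME.sum(axis=0), 0.0))
```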
Ordered Choice Models
I
Ordered choice variables take on discrete values which can be ordered, but which cannot be viewed as continuous.
I
The choice of some option is modelled by a latent index:
    y_i* = x_i′β + u_i.
I
The values of y depend on y ∗ and some unknown threshold values
αj (for the example j ∈ {1, 2, 3}):
    y = Option 1 if y* ≤ α1,
        Option 2 if α1 < y* ≤ α2,
        Option 3 if α2 < y*.
I
For the probability that option j is chosen it holds that:
    Pr(y_i = j) = Pr(α_{j−1} < y_i* ≤ α_j)
                = Pr(α_{j−1} < x_i′β + u_i ≤ α_j)
                = Pr(α_{j−1} − x_i′β < u_i ≤ α_j − x_i′β)
                = F(α_j − x_i′β) − F(α_{j−1} − x_i′β).
I
F (·) is usually specified by F (z) = Λ(z) or F (z) = Φ(z).
44 / 54
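These probabilities are easy to sketch for the logistic choice of F (ordered logit). The following Python fragment is purely illustrative (thresholds and coefficients are made up); the boundary options use α_0 = −∞ and α_3 = +∞:

```python
import numpy as np

def F(z):                          # logistic CDF, i.e. F = Lambda
    return 1.0 / (1.0 + np.exp(-z))

alpha = np.array([-0.5, 1.0])      # thresholds alpha_1 < alpha_2
beta = np.array([0.8, -0.3])
x = np.array([1.0, 0.5])
xb = x @ beta

# Pr(y = j) = F(alpha_j - x'b) - F(alpha_{j-1} - x'b),
# with alpha_0 = -inf and alpha_3 = +inf for the two boundary options.
cuts = np.concatenate(([-np.inf], alpha, [np.inf]))
probs = F(cuts[1:] - xb) - F(cuts[:-1] - xb)
print(probs.sum())                 # the three probabilities sum to one
```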
Ordered Choice Models
I
The sign of the estimated parameters indicates whether the latent
index increases with the regressor.
I
The marginal effect of the kth variable on the probability for
choosing option j is given by
    ∂Pr(y_i = j|x_i) / ∂x_k = (F′(α_{j−1} − x_i′β) − F′(α_j − x_i′β)) β_k.
I
Note that the order of the terms containing α_{j−1} and α_j is reversed compared to the definition of Pr(y_i = j) itself, as the linear index x_i′β enters F(·) with a negative sign.
I
Note also that for each possible option a separate average marginal
effect can be computed.
I
Interpretation: The average marginal effect of the kth variable
denotes the average change of Pr (y = j|x) when the kth variable is
marginally increased.
45 / 54
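The marginal-effect formula can be verified against a numerical derivative; the Python sketch below (illustrative values, logistic F, not part of the lecture) computes the effect for one covariate across all three options:

```python
import numpy as np

def F(z):
    return 1.0 / (1.0 + np.exp(-z))

def dF(z):                         # F'(z) for the logistic CDF
    return F(z) * (1.0 - F(z))

def ordered_probs(x, beta, alpha):
    cuts = np.concatenate(([-np.inf], alpha, [np.inf]))
    return F(cuts[1:] - x @ beta) - F(cuts[:-1] - x @ beta)

alpha = np.array([-0.5, 1.0])      # made-up thresholds and coefficients
beta = np.array([0.8, -0.3])
x = np.array([1.0, 0.5])
xb = x @ beta
cuts = np.concatenate(([-np.inf], alpha, [np.inf]))

# dPr(y = j)/dx_k = (F'(alpha_{j-1} - x'b) - F'(alpha_j - x'b)) beta_k
k = 0
ME = (dF(cuts[:-1] - xb) - dF(cuts[1:] - xb)) * beta[k]

# Numerical check, all three options at once:
h = 1e-6
xh = x.copy()
xh[k] += h
num = (ordered_probs(xh, beta, alpha) - ordered_probs(x, beta, alpha)) / h
print(np.allclose(ME, num, atol=1e-4))
```

Because the probabilities sum to one, the three marginal effects sum to zero: a covariate shifts probability mass between options rather than creating it.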
Count Data Models
I
In count data models, the dependent variable y contains the number
of persons, items or decisions, i.e., y ∈ N0 (where N0 = N ∪ {0}).
I
Empirical count data often has the following structure:
I y may take on only a few values, for example 90% of the
sample consists of the values 0, 1, or 2 (for example, if y is the
number of children).
I y is zero for many observations and very large for some others
(for example, if y is the number of patents).
I
Difference to ordered choice models: Count data may take a larger
number of values compared with ordered data discussed before.
I
When the outcome variable takes only few different values, however,
count data may also be analyzed using ordered choice models.
I
Two commonly used parametric count data regression models:
I Poisson regression model
I Negative Binomial regression model
I
These models take into account the nonnegativity and discreteness
of count data.
46 / 54
Count Data Models
I
A basic model for count data is the Poisson regression model.
I
To construct the likelihood function, consider first the probability
mass function of one observation:
    Pr(y_i = y) = exp(−µ) µ^y / y!.

y! is the factorial of y, defined as y! = 1 × 2 × · · · × y for y ∈ N, with 0! = 1.
I
It holds that E [y ] = V (y ) = µ, which is the equidispersion property
of the Poisson distribution.
I
By modelling the mean µ by exp(x_i′β), the log-likelihood function of the Poisson regression is given by

    L(β) = log Π_{i=1}^n [ exp(−exp(x_i′β)) exp(x_i′β)^{y_i} / y_i! ]
         = Σ_{i=1}^n ( y_i x_i′β − exp(x_i′β) ),

where it was used that log(ab) = log a + log b, log exp(a) = a, and log(a^b) = b log a. The term log y_i! was dropped, as it does not depend on β.
47 / 54
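The simplification of the log-likelihood can be checked numerically: the compact expression differs from the full log pmf only by the β-free term Σ log y_i!. A Python sketch with simulated data (design, parameters, and seed are arbitrary; SciPy is assumed to be available):

```python
import numpy as np
from scipy.special import gammaln          # gammaln(y + 1) = log(y!)

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.2, 0.5])
mu = np.exp(X @ beta)
y = rng.poisson(mu)
xb = X @ beta

# Full log-likelihood: sum_i log[ exp(-mu_i) mu_i^{y_i} / y_i! ]
full = np.sum(-mu + y * xb - gammaln(y + 1))

# Compact form from the slide, which drops the beta-free term log y_i!:
compact = np.sum(y * xb - np.exp(xb))

print(np.isclose(full, compact - gammaln(y + 1).sum()))
```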
Count Data Models
I
The conditional mean of the dependent variable is specified as µ = exp(x′β).
I
Therefore, the marginal effects of the Poisson model are given by

    ∂E[y|x] / ∂x_j = β_j exp(x′β).
I
Note that the marginal effect here refers to a conditional mean, not a conditional probability.
I
As for the discrete choice models presented before, the marginal
effect depends on the covariates and differs therefore for each
individual.
I
Again, an average marginal effect can be used for summarizing the
effect of a covariate.
I
Interpretation: The average marginal effect of the jth variable
denotes how the expected value of the dependent variable changes
on average when the jth variable is marginally increased.
48 / 54
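The individual marginal effects and their average can be sketched as follows (Python, simulated illustrative data, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.2, 0.5])
mu = np.exp(X @ beta)              # E[y|x] = exp(x'beta)

j = 1
me = beta[j] * mu                  # individual effects beta_j exp(x'beta)
ame = me.mean()                    # average marginal effect

# Numerical check for the first individual:
h = 1e-6
Xh = X.copy()
Xh[0, j] += h
num = (np.exp(Xh[0] @ beta) - mu[0]) / h
print(np.isclose(me[0], num, atol=1e-4))
```

Because the effect β_j exp(x′β) depends on x, it differs across individuals; averaging over the sample yields the single summary number ame.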
Count Data Models
I
Two problems of the Poisson model are the excess-zeros problem (far fewer zeros predicted than observed in the sample) and overdispersion (V(y) > E[y]).
I
A possible reason for overdispersion is unobserved heterogeneity.
Therefore, an approach like the negative binomial regression model
with a less restrictive parameterization may be appropriate.
I
Here, as for the Poisson regression model, the dependent variable y
follows a Poisson distribution with parameter λ:
    f(y|λ) = exp(−λ) λ^y / y!.
I
Contrary to the Poisson regression model, the mean parameter λ is assumed to be random and is specified as

    λ = νµ,

where µ is a deterministic function of the covariates, for example µ = exp(x′β), and ν is a positive random variable (also called the mixing variable).
49 / 54
Count Data Models
I
Assume that ν has distribution g (ν|δ), where δ are some unknown
parameters of the distribution.
I
The marginal density of the dependent variable is obtained by
integrating over all possible values of the mixing variable ν:
    h(y|µ, δ) = ∫ f(y|µ, ν) g(ν|δ) dν.
I
This expression does not depend on ν, but only on the parameters
of its distribution (and on µ, which depends on parameters β).
I
This expression can be used to construct a likelihood function,
which depends on the parameters δ and β.
I
A suitable choice of the distribution of ν leads to a tractable
expression of the likelihood function. Therefore, assume that ν
follows a Gamma distribution which is defined by
    g(ν|δ) = ν^{δ−1} exp(−νδ) δ^δ / Γ(δ),

where E[ν] is restricted to one, Var(ν) = 1/δ, and Γ(δ) is the Gamma function, defined by Γ(a) ≡ ∫_0^∞ z^{a−1} exp(−z) dz.
50 / 54
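The moment restrictions of this Gamma mixing density (unit mass, E[ν] = 1, Var(ν) = 1/δ) can be confirmed by numerical integration; a Python sketch with an arbitrary illustrative δ:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

delta = 2.5                        # arbitrary illustrative value

def g(v):                          # Gamma density with shape delta, rate delta
    return v**(delta - 1) * np.exp(-v * delta) * delta**delta / Gamma(delta)

mass, _ = quad(g, 0, np.inf)
mean, _ = quad(lambda v: v * g(v), 0, np.inf)
second, _ = quad(lambda v: v**2 * g(v), 0, np.inf)
var = second - mean**2

print(mass, mean, var)             # close to 1, 1, and 1/delta
```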
Count Data Models
I
Therefore, assuming that y follows a Poisson distribution whose mean contains a multiplicative random term with Gamma density, the marginal density of y is given by

    h(y|µ, δ) = ∫_0^∞ [exp(−µν)(µν)^y / y!] · [ν^{δ−1} exp(−νδ) δ^δ / Γ(δ)] dν
              = ∫_0^∞ exp(−(µ+δ)ν) µ^y ν^{y+δ−1} δ^δ / (y! Γ(δ)) dν
              = [µ^y δ^δ / (Γ(δ) y!)] ∫_0^∞ exp(−(µ+δ)ν) ν^{y+δ−1} dν
              = [µ^y δ^δ / (Γ(δ) y!)] · Γ(y+δ) / (µ+δ)^{y+δ}
              = [Γ(y+δ) / (Γ(δ) Γ(y+1))] · (δ/(µ+δ))^δ · (µ/(µ+δ))^y.
I
Here it was used that e^a e^b = e^{a+b}, Γ(y+1) = y!, and that

    Γ(a) / b^a = (1/b^a) ∫_0^∞ z^{a−1} exp(−z) dz = ∫_0^∞ z^{a−1} exp(−bz) dz.
51 / 54
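The closed form at the end of this derivation can be checked against direct numerical integration of the Poisson–Gamma mixture. A Python sketch (µ, δ, and y are arbitrary illustrative values; SciPy assumed available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

mu, delta, y = 1.7, 2.5, 3         # arbitrary illustrative values

def integrand(v):
    # Poisson(y | mu*v) probability times the Gamma mixing density g(v|delta)
    pois = np.exp(-mu * v) * (mu * v)**y / np.exp(gammaln(y + 1))
    g = v**(delta - 1) * np.exp(-v * delta) * delta**delta / np.exp(gammaln(delta))
    return pois * g

numeric, _ = quad(integrand, 0, np.inf)

# Closed form: Gamma(y+d)/(Gamma(d)Gamma(y+1)) (d/(mu+d))^d (mu/(mu+d))^y
closed = np.exp(gammaln(y + delta) - gammaln(delta) - gammaln(y + 1)) \
         * (delta / (mu + delta))**delta * (mu / (mu + delta))**y

print(np.isclose(numeric, closed))
```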
Count Data Models
I
Given this reformulation of the marginal density of y , the likelihood
function can be constructed.
I
The deterministic part of the mean term µ is specified (as in the Poisson model) by µ = exp(x′β).
I
The likelihood function is given by
    ℓ(β, δ) = Π_{i=1}^n [Γ(y_i+δ) / (Γ(δ) Γ(y_i+1))] · (δ/(exp(x_i′β)+δ))^δ · (exp(x_i′β)/(exp(x_i′β)+δ))^{y_i}.
I
The log-likelihood function equals

    L(β, δ) = log Π_{i=1}^n [Γ(y_i+δ) / (Γ(δ) Γ(y_i+1))] · (δ/(exp(x_i′β)+δ))^δ · (exp(x_i′β)/(exp(x_i′β)+δ))^{y_i}
            = Σ_{i=1}^n [ log(Γ(y_i+δ) / (Γ(δ) Γ(y_i+1))) + δ log(δ/(exp(x_i′β)+δ)) + y_i log(exp(x_i′β)/(exp(x_i′β)+δ)) ].
I
This expression is maximized numerically with respect to β and δ.
52 / 54
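The marginal density derived above is the negative binomial pmf with size δ and success probability δ/(µ+δ), so the log-likelihood can be cross-checked against a standard implementation. A Python sketch (simulated illustrative data; SciPy assumed available, and the check uses scipy.stats.nbinom):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.2, 0.5])
delta = 2.0
mu = np.exp(X @ beta)

# Simulate y from the negative binomial with size delta and
# success probability delta / (mu + delta):
y = nbinom.rvs(delta, delta / (mu + delta), random_state=rng)

def nb_loglik(beta, delta, y, X):
    """Term-by-term log-likelihood from the slide."""
    mu = np.exp(X @ beta)
    return np.sum(gammaln(y + delta) - gammaln(delta) - gammaln(y + 1)
                  + delta * np.log(delta / (mu + delta))
                  + y * np.log(mu / (mu + delta)))

L = nb_loglik(beta, delta, y, X)
L_check = nbinom.logpmf(y, delta, delta / (mu + delta)).sum()
print(np.isclose(L, L_check))
```

In practice this log-likelihood would be handed to a numerical optimizer (e.g. scipy.optimize.minimize on its negative) to obtain the estimates of β and δ.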
Summary
I
Basic idea of the ML approach
I
Basic properties of the likelihood functions, idea of the proofs of
consistency and asymptotic normality of the ML estimator
I
Definition and intuition of ML hypothesis tests
I
Definition of binary, multinomial, and ordered choice models
I
Interpretation of the results
I
Definition of count data models and interpretation of the results
I
Unobserved heterogeneity in ML models
53 / 54
References
I
ML theory: P. Ruud, An Introduction to Classical Econometric
Theory, Oxford University Press, 2000, ch. 14 and 15, Cameron and
Trivedi (2005), sec. 5.6.
W. Newey and D. McFadden, “Large sample estimation and hypothesis testing”, Handbook of Econometrics, vol. IV, 1994, ch. 36.
I
ML computation: Cameron and Trivedi (2005), ch. 10.
I
ML hypothesis tests: Ruud (2000), sec. 17.2.
I
Binary choice models: Cameron and Trivedi (2005), sec. 14.1-14.3,
Cameron and Trivedi (2009), sec. 14.1-14.7.
I
Multinomial choice models: Cameron and Trivedi (2005), sec.
15.1-15.4, Cameron and Trivedi (2009), sec. 15.1-15.5.
I
Ordered choice models: Cameron and Trivedi (2005), sec. 15.9.1,
Cameron and Trivedi (2009), sec. 15.9.
I
Count data models: Cameron and Trivedi (2005), sec. 20.1-20.4.1,
Cameron and Trivedi (2009), sec. 17.1-17.3.3.
54 / 54