Chapter 5: Discrete Response Models

Microeconomic Econometrics I
5.1 Introduction
• In binary response models, interest lies primarily in the response probability,
p(x) ≡ P (y = 1|x) = P (y = 1|x1 , x2 , . . . , xK )
for various values of x. The response y is binary, for example
y = 1 if a person is employed, and y = 0 otherwise;
more generally, y = 1 denotes a success and y = 0 otherwise.
♠The response probability,
p(x) = P(y = 1|x) = P(y = 1|x1, x2, ..., xk)
ex: y: an employment indicator
x: various individual characteristics such as age, marital status, a recent job training program, and measures of past criminal behavior
♠For a continuous variable xj, the partial effect of xj is
∂P(y = 1|x)/∂xj = ∂p(x)/∂xj ← holding all other variables fixed.
If xk is a binary variable, interest lies in
p(x1, x2, ..., xk−1, 1) − p(x1, x2, ..., xk−1, 0),
the difference in response probabilities between xk = 1 and xk = 0.
Basic statistics:
P (y = 1|x) = p(x)
P (y = 0|x) = 1 − p(x)
E(y|x) = 1 · p(x) + 0 · [1 − p(x)] = p(x)
Var(y|x) = E(y²|x) − [E(y|x)]² = E(y|x) − [E(y|x)]² = p(x) − p(x)² = p(x)[1 − p(x)]
5.2 The Linear Probability Model for Binary Response
• The linear probability model (LPM) for binary response y is specified as
P (y = 1|x) = β0 + β1 x1 + β2 x2 + · · · + βK xK
• Example 5.1 (Married Women's Labor Force Participation):
♠5.2 Linear probability model (LPM) for binary response y
P (y = 1|x) = β0 + β1 x1 + β2 x2 + ... + βk xk
Interpretation of β1 : the change in the probability of success given a one-unit increase in x1
Using functions such as quadratics, logarithms, and so on among the independent variables causes no new difficulties.
♠Conditional mean and variance of y:
E(y|x) = β0 + β1 x1 + β2 x2 + ... + βk xk
→ OLS produces consistent and even unbiased estimators of the βj
Var(y|x) = xβ(1 − xβ)
← heteroskedasticity is present
→ use the standard heteroskedasticity-robust standard errors and t statistics, and robust tests of multiple restrictions
♠Applying weighted least squares (WLS)
Let β̂ be the OLS estimator and ŷi the OLS fitted values (the predicted probabilities), and suppose 0 < ŷi < 1 for all observations i
→ the estimated standard deviation is σ̂i ≡ [ŷi(1 − ŷi)]^(1/2)
⇒ the WLS estimator β* is obtained from the weighted regression
yi/σ̂i on 1/σ̂i, xi1/σ̂i, ..., xik/σ̂i, i = 1, 2, ..., N
→ If some of the OLS fitted values are not between zero and one, WLS analysis is not possible without ad hoc adjustments to bring deviant fitted values into the unit interval.
→ If the main purpose of estimating a binary response model is to approximate the partial effects of the explanatory variables, averaged across the distribution of x, then the LPM often does a very good job.
→ In that case, predicted probabilities below 0 or above 1 are not a serious concern.
♠Ex: Married Women's Labor Force Participation
ŷ = 0.586 − 0.0034 nwifeinc − 0.262 kidslt6 + other factors
nwifeinc: nonwife income
kidslt6: number of children less than six years of age
→ just use OLS and report heteroskedasticity-robust standard errors
→ to allow a diminishing effect of young children on the probability of labor force participation, break kidslt6 into three binary indicators:
no young children, one young child, and two or more young children
5.3 Index Models for Binary Response: Probit and Logit
We now study binary response models of the form
P (y = 1|x) = G(xβ) ≡ p(x)
where x is 1 × K, β is K × 1, and we take the first element of x to be unity.
• The model in equation (5.8) is generally called an index model because it restricts the
way in which the response probability depends on x: p(x) is a function of x only through
the index xβ = β1 + β2 x2 + · · · + βK xK .
• Index models where G is a cdf can be derived more generally from an underlying latent
variable model, as in Example 13.1:
y ∗ = xβ + e,
y = 1[y ∗ > 0]
where e is a continuously distributed variable independent of x and the distribution of
e is symmetric about zero; recall from Chapter 13 that 1[·] is the indicator function.
• The probit model is the special case of equation (5.8) with
G(z) ≡ Φ(z) ≡ ∫_{−∞}^{z} φ(v) dv,
where φ(z) = (2π)^(−1/2) exp(−z²/2) is the standard normal density
• The logit model is a special case of equation (5.8) with
G(z) = Λ(z) ≡ exp(z)/[1 + exp(z)]
♠5.3 "Index Models" for Binary Response: Probit and Logit
Binary response model of the form
P(y = 1|x) = G(xβ) ≡ p(x)
Assume that G(·) takes on values in the open unit interval: 0 < G(z) < 1 for all z ∈ ℝ
The index: xβ = β1 + β2x2 + ... + βkxk
In most applications, G is a cumulative distribution function (cdf) whose specific form can sometimes be derived from an underlying economic model. More generally, G can be derived from an underlying latent variable model:
y* = xβ + e, y = 1[y* > 0]
1. e is a continuously distributed variable independent of x
2. the distribution of e is symmetric about zero, so that 1 − G(−z) = G(z)
→ P(y = 1|x) = P(y* > 0|x) = P(e > −xβ|x) = 1 − G(−xβ) = G(xβ)
♠In most applications of binary response models, the primary goal is to explain the effects of the xj on the response probability P(y = 1|x)
→ the latent variable y* rarely has a well-defined unit of measurement
→ the magnitude of βj is not especially meaningful except in special cases
♠Probit model:
G(z) = Φ(z) = ∫_{−∞}^{z} φ(v) dv,
where φ(z) = (2π)^(−1/2) exp(−z²/2): standard normal density
♠Logit model:
G(z) = Λ(z) = exp(z)/[1 + exp(z)], e has a standard logistic distribution
♠xj: continuous
∂p(x)/∂xj = g(xβ)βj, where g(z) ≡ dG(z)/dz
The partial effect of xj on p(x) depends on x through g(xβ).
1. If G(·) is a strictly increasing cdf, then g(z) > 0 for all z, so the sign of the effect is given by the sign of βj.
2. The ratio of partial effects is constant:
[∂p(x)/∂xj] / [∂p(x)/∂xh] = βj/βh
3. If g is a symmetric density about zero with a unique mode at zero, the largest effect occurs when xβ = 0.
♠xk: a binary explanatory variable
The partial effect of changing xk from zero to one, holding the other variables fixed, is
G(β1 + β2x2 + ... + βk−1xk−1 + βk) − G(β1 + β2x2 + ... + βk−1xk−1)
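The partial-effect formulas for index models can be checked numerically. The following is a minimal sketch using only the Python standard library; the coefficient vector `beta` and the evaluation point `x` are hypothetical values chosen for illustration, not estimates from any data set in these notes.

```python
import math

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cdf Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Logistic cdf Lambda(z) = exp(z) / [1 + exp(z)]."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_pdf(z):
    """Logistic density, which equals Lambda(z)[1 - Lambda(z)]."""
    p = logit_cdf(z)
    return p * (1.0 - p)

def index(x, beta):
    """The index x*beta for a row vector x whose first element is 1."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def partial_effects(x, beta, pdf):
    """g(x*beta) * beta_j for every j, treating each x_j as continuous."""
    scale = pdf(index(x, beta))
    return [scale * bj for bj in beta]

def discrete_effect(x, beta, k, cdf):
    """G(index with x_k = 1) - G(index with x_k = 0) for a binary x_k."""
    x_one = list(x)
    x_one[k] = 1.0
    x_zero = list(x)
    x_zero[k] = 0.0
    return cdf(index(x_one, beta)) - cdf(index(x_zero, beta))

# Hypothetical coefficients (constant first) and evaluation point.
beta = [0.2, 0.5, -0.3, 0.8]
x = [1.0, 1.2, 0.4, 0.0]

pe_probit = partial_effects(x, beta, norm_pdf)
pe_logit = partial_effects(x, beta, logit_pdf)
ratio_probit = pe_probit[1] / pe_probit[2]   # equals beta[1] / beta[2]
ratio_logit = pe_logit[1] / pe_logit[2]      # same ratio under logit
effect_binary = discrete_effect(x, beta, 3, norm_cdf)
```

The ratio of two partial effects does not depend on G or on x, while the level of each effect does. The scale factor is largest at a zero index, where `norm_pdf(0)` is about 0.4 and `logit_pdf(0)` is 0.25.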
5.4 Maximum Likelihood Estimation of Binary Response Index Models
f(y|xi; β) = [G(xiβ)]^y [1 − G(xiβ)]^(1−y), y = 0, 1
• If G(·) is the standard normal cdf, then β̂ is the probit estimator; if G(·) is the logistic
cdf, then β̂ is the logit estimator.
♠5.4 Maximum Likelihood Estimation of binary Response Index Models
N: i.i.d. observations
The density of yi given xi
f(y|xi; β) = [G(xiβ)]^y [1 − G(xiβ)]^(1−y), y = 0, 1
⇒ The log likelihood for observation i
li(β) = yi log G(xiβ) + (1 − yi) log[1 − G(xiβ)]
and the log likelihood for the sample is
L(β) = Σ_{i=1}^{N} li(β)
The score of the log likelihood for observation i (general case, with θ a P × 1 vector)
si(θ) = ∇θ li(θ)′ = (∂li/∂θ1 (θ), ∂li/∂θ2 (θ), ..., ∂li/∂θP (θ))′
The Hessian is the symmetric matrix of second partial derivatives of li(θ) (general case)
Hi(θ) ≡ ∇θ si(θ) = ∇²θ li(θ)
⇒ 1. the score for the binary response index model is
si(β) = g(xiβ) xi′ [yi − G(xiβ)] / {G(xiβ)[1 − G(xiβ)]}
2. the expected value of the Hessian conditional on xi is
−E[Hi(β)|xi] = g(xiβ)² xi′xi / {G(xiβ)[1 − G(xiβ)]} ≡ A(xi, β)
3. the estimated asymptotic variance of β̂ is
Âvar(β̂) = {Σ_{i=1}^{N} g(xiβ̂)² xi′xi / (G(xiβ̂)[1 − G(xiβ̂)])}⁻¹ ≡ V̂
As usual, we treat β̂ as being approximately normally distributed with mean β and variance matrix V̂.
Inference should be based on the sandwich estimator in equation (13.71)
Âvar(θ̂) = (Σi Hi(θ̂))⁻¹ (Σi si(θ̂)si(θ̂)′) (Σi Hi(θ̂))⁻¹
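The score and expected Hessian translate directly into a Newton-type iteration. The sketch below is one way to compute the logit estimator with only the Python standard library; the simulated data, the "true" coefficients, and all function names are assumptions made for illustration. For logit, g = G(1 − G), so the score reduces to xi′[yi − G(xiβ)] and A(xi, β) to G(1 − G)xi′xi.

```python
import math
import random

def logistic(z):
    """Logistic cdf Lambda(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def score_and_info(X, y, beta):
    """Sum over i of the scores s_i(beta) and of A(x_i, beta) for logit."""
    K = len(beta)
    score = [0.0] * K
    info = [[0.0] * K for _ in range(K)]
    for xi, yi in zip(X, y):
        G = logistic(sum(a * b for a, b in zip(xi, beta)))
        g = G * (1.0 - G)  # logistic density
        for j in range(K):
            score[j] += xi[j] * (yi - G)
            for h in range(K):
                info[j][h] += g * xi[j] * xi[h]
    return score, info

def logit_mle(X, y, iterations=25):
    """Newton iterations: beta <- beta + (sum of A_i)^(-1) (sum of s_i)."""
    beta = [0.0] * len(X[0])
    for _ in range(iterations):
        score, info = score_and_info(X, y, beta)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulated data with hypothetical true coefficients (constant, slope).
random.seed(0)
true_beta = [-0.5, 1.0]
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(2000)]
y = [1 if random.random() < logistic(sum(a * b for a, b in zip(xi, true_beta))) else 0
     for xi in X]

beta_hat = logit_mle(X, y)
score, info = score_and_info(X, y, beta_hat)
K = len(beta_hat)
# Standard errors: square roots of the diagonal of V_hat = (sum of A_i)^(-1).
std_err = [math.sqrt(solve(info, [1.0 if h == j else 0.0 for h in range(K)])[j])
           for j in range(K)]
```

At the estimate the summed score is numerically zero, which is the first-order condition of the log likelihood. Replacing `logistic` and the density `g` with the normal cdf and density, and restoring the g/[G(1 − G)] weights, would give the probit version.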
5.5 Testing in Binary Response Index Models
5.5.1 Testing Multiple Exclusion Restrictions
P (y = 1|x, z) = G(xβ + zγ)
♠5.5 Testing in Binary Response Index Models
Any of the three tests from general MLE analysis can be used:
1. the Wald test → easy to obtain, but not invariant to reparameterizations when testing nonlinear hypotheses
2. the likelihood ratio (LR) test
3. the Lagrange multiplier (LM) test
♠Testing Multiple Exclusion Restrictions
P (y = 1|x, z) = G(xβ + zγ)
The null hypothesis H0: γ = 0 (Q exclusion restrictions)
(i) the Wald statistic: a simple command in Stata
(ii) the LR statistic: LR = 2(Lur − Lr) ∼ χ²_Q under H0,
where Lur is the value of the log-likelihood function in the unrestricted model and Lr is its value in the restricted model
(iii) the LM statistic is numerically identical to the following:
(1) Define ûi ≡ yi − G(xiβ̂), Ĝi ≡ G(xiβ̂), and ĝi ≡ g(xiβ̂) from the restricted model
(2) Run the auxiliary OLS regression of
ûi/√(Ĝi(1 − Ĝi)) on ĝi xi/√(Ĝi(1 − Ĝi)), ĝi zi/√(Ĝi(1 − Ĝi))
LM = N · R²u ∼ χ²_Q under H0, where R²u is the uncentered R-squared from this regression
The LM approach can be an attractive alternative to the LR statistic if z has large dimension
5.5.2 Testing Nonlinear Hypotheses about β
H0: c(β) = 0, where c(β) is a Q × 1 vector
W = c(β̂)′[∇βc(β̂) V̂ ∇βc(β̂)′]⁻¹ c(β̂)
♠A more complicated binary choice model:
y* = xβ + e, e|x ∼ Normal(0, exp(2x1δ)), where x1 is 1 × K1 (a subset of x that excludes a constant) and δ is K1 × 1
Define r = e/exp(x1δ)
⇒ r is independent of x with a standard normal distribution
⇒ P(y = 1|x) = P(e > −xβ|x) = P[exp(−x1δ)e > −exp(−x1δ)xβ] = P[r > −exp(−x1δ)xβ] = Φ[exp(−x1δ)xβ]
When δ = 0, we obtain the standard probit model. Therefore, a test of the probit functional form for the response probability is a test of H0: δ = 0
5.5.3 Tests against More General Alternatives
P (y = 1|x) = P (e > −xβ|x) = P [exp(−x1 δ)e > − exp(−x1 δ)xβ]
= P [r > − exp(−x1 δ)xβ] = Φ[exp(−x1 δ)xβ]
♠The LM test for an index model against a more general alternative (a variant of the standard probit or logit test)
P(y = 1|x) = m(xβ, x, δ)
Test: H0: δ = δ0, where m(xβ, x, δ0) = G(xβ)
Let β̂ be the probit or logit estimator of β obtained under δ = δ0.
(1) Define ûi = yi − G(xiβ̂), Ĝi = G(xiβ̂), and ĝi = g(xiβ̂) → estimated under the restriction
Set ∇δm̂i = ∇δm(xiβ̂, xi, δ0)
(2) Run the regression of
ûi/√(Ĝi(1 − Ĝi)) on ĝi xi/√(Ĝi(1 − Ĝi)), ∇δm̂i/√(Ĝi(1 − Ĝi))
LM = N R²u ∼ χ²_Q under H0, where Q is the dimension of δ
→ to apply this test to the preceding probit example, take
G = Φ(·), δ0 = 0, m(xβ, x, δ) = Φ[exp(−x1δ)xβ]
5.6 Reporting the Results for Probit and Logit
• One measure of goodness of fit that is usually reported is the percent correctly predicted.
• Various pseudo R-squared measures have been proposed for binary response.
Often we want to estimate the effects of the variables xj on the response probabilities P (y =
1|x). If xj is (roughly) continuous then
ΔP̂(y = 1|x) ≈ [g(xβ̂)β̂j]Δxj
for small changes in xj. (As usual when using calculus, the notion of small here is somewhat vague.)
• Example 5.2 (Married Women's Labor Force Participation):
♠One measure of goodness of fit that is sometimes reported is the percent correctly predicted.
Define the binary variable
ỹi = 1 if G(xiβ̂) ≥ 0.5
ỹi = 0 if G(xiβ̂) < 0.5
Given {ỹi: i = 1, 2, ..., N}, we can see how well ỹi predicts yi across all observations.
There are four possible outcomes on each pair, (yi, ỹi).
When both are zero or both are one, we make the correct prediction.
→ The percent correctly predicted is the percent of times that ỹi = yi.
Ex: 160 obs. with yi = 0 → 140 correctly predicted
40 obs. with yi = 1 → none correctly predicted
→ we correctly predict 70% of all outcomes, yet none of the outcomes with yi = 1
The overall percent correctly predicted is the weighted average of the percent correctly predicted for y = 0 and y = 1, with the weights being the fractions of zero and one outcomes on y, respectively.
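The 160/40 example can be reproduced in a few lines. In this sketch the fitted probabilities in `p_hat` are hypothetical values chosen only so that 140 of the 160 zeros and none of the 40 ones fall on the correct side of the 0.5 cutoff.

```python
def percent_correctly_predicted(y, p_hat, cutoff=0.5):
    """Return the overall, y = 0, and y = 1 percent correctly predicted."""
    y_tilde = [1 if p >= cutoff else 0 for p in p_hat]
    pairs = list(zip(y, y_tilde))

    def pct(subset):
        return 100.0 * sum(1 for yi, yt in subset if yi == yt) / len(subset)

    overall = pct(pairs)
    pct_zero = pct([pair for pair in pairs if pair[0] == 0])
    pct_one = pct([pair for pair in pairs if pair[0] == 1])
    return overall, pct_zero, pct_one

# 160 observations with y = 0 (140 predicted as 0), 40 with y = 1 (none predicted as 1).
y = [0] * 160 + [1] * 40
p_hat = [0.3] * 140 + [0.6] * 20 + [0.4] * 40   # hypothetical fitted probabilities

overall, pct_zero, pct_one = percent_correctly_predicted(y, p_hat)

# The overall figure is the outcome-share-weighted average of the two group figures.
share_zero = y.count(0) / len(y)
weighted = share_zero * pct_zero + (1.0 - share_zero) * pct_one
```

Here `overall` is 70, `pct_zero` is 87.5, and `pct_one` is 0, which shows why the overall percentage alone can hide poor prediction of the less frequent outcome.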
♠Various pseudo-R-squared:
McFadden: 1 − Lur/L0, where
(1) Lur: the log-likelihood function for the estimated model
L0: the log-likelihood function for the model with only an intercept
(2) |Lur| ≤ |L0| ⇒ the pseudo-R-squared is always between zero and one
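McFadden's measure needs only the two log-likelihood values. The sketch below uses hypothetical data with two groups of 50 observations whose fitted probabilities equal the group frequencies of y = 1; the intercept-only log likelihood uses the fact that the intercept-only fitted probability is the sample mean of y.

```python
import math

def log_likelihood(y, p_hat):
    """Sum of y_i log(p_i) + (1 - y_i) log(1 - p_i)."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, p_hat))

def mcfadden_r2(y, p_hat):
    """1 - L_ur / L_0, with L_0 from the intercept-only model."""
    y_bar = sum(y) / len(y)
    l_ur = log_likelihood(y, p_hat)
    l_0 = log_likelihood(y, [y_bar] * len(y))
    return 1.0 - l_ur / l_0

# Hypothetical sample: group A has 40 ones out of 50, group B has 10 ones out of 50.
y = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
p_hat = [0.8] * 50 + [0.2] * 50

r2 = mcfadden_r2(y, p_hat)
```

With these numbers the pseudo-R-squared is roughly 0.28; it would be zero if the fitted probabilities all equaled the sample mean.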
♠The effects of the variables xj on the response probabilities P(y = 1|x)
ΔP̂(y = 1|x) ≈ [g(xβ̂)β̂j]Δxj
[g(xβ̂)β̂j]: the estimated partial effect (depends on x), with scale factor g(xβ̂)
♠We could report this scale factor at
1. medians of the explanatory variables
2. different quantiles
3. the sample averages of the xj → the partial effect at the average (PEA), with scale factor g(x̄β̂)
For discrete variables, the average need not even be a possible outcome of the variable.
Ex: for a binary x2 (x2 = 1 or x2 = 0)
⇒ x̄2: the fraction of ones in the sample.
⇒ One way to overcome this conceptual problem is to compute partial effects separately for x2 = 1 and x2 = 0
♠Standard errors of g(xβ̂)β̂j :
Use the delta method:
Consider the case j = k for notational simplicity.
Define δk = βk g(xβ) = ∂P(y = 1|x)/∂xk ≡ h(β)
The gradient of h(β) is
∇βh(β) = [βk (dg/dz)(xβ), βk x2 (dg/dz)(xβ), ..., βk xk−1 (dg/dz)(xβ), βk xk (dg/dz)(xβ) + g(xβ)]
The delta method implies that the asymptotic variance of δ̂k is estimated by
[∇βh(β̂)] V̂ [∇βh(β̂)]′
← V̂: the asymptotic variance estimate of β̂
Stata command: mfx
♠xk : a discrete variable
the change in the predicted probability when xk goes from ck to ck + 1:
δ̂k = G(β̂1 + β̂2x̄2 + ... + β̂k−1x̄k−1 + β̂k(ck + 1)) − G(β̂1 + β̂2x̄2 + ... + β̂k−1x̄k−1 + β̂kck)
An alternative way to summarize the estimated marginal effect is to estimate the average value
of βk g(xβ) across the population or βk E[g(xβ)]
This quantity is the average partial effect (APE)
Here we are averaging across the distribution of all observable covariates.
A consistent estimator of the APE is
xk continuous:
β̂k [N⁻¹ Σ_{i=1}^{N} g(xiβ̂)]
xk binary:
N⁻¹ Σ_{i=1}^{N} [G(β̂1 + β̂2xi2 + ... + β̂k−1xi,k−1 + β̂k) − G(β̂1 + β̂2xi2 + ... + β̂k−1xi,k−1)]
→ the (counterfactual) effect of the policy for each i: regardless of whether unit i was subject to the policy, we can obtain the predicted probability in each regime.
In either of the above equations, we can fix some of the explanatory variables at specific values and average across the remaining ones.
♠PEA vs. APE
PEA scale factor g(x̄β̂): a nonlinear function of the average
APE scale factor N⁻¹ Σ_{i=1}^{N} g(xiβ̂): the average of a nonlinear function
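The difference between the PEA and the APE is easy to see numerically. The following sketch treats `beta_hat` as given probit estimates (hypothetical values, not results from the notes) and uses a simulated covariate sample with one continuous and one binary regressor.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def index(x, beta):
    return sum(xj * bj for xj, bj in zip(x, beta))

# Hypothetical probit estimates: constant, continuous x2, binary x3.
beta_hat = [0.1, 0.7, 0.5]

# Simulated covariates (an assumption for illustration only).
random.seed(1)
X = [[1.0, random.gauss(0.0, 1.5), 1.0 if random.random() < 0.4 else 0.0]
     for _ in range(5000)]
N = len(X)
x_bar = [sum(row[j] for row in X) / N for j in range(3)]

# Continuous x2: PEA uses g at the average, APE averages g over the sample.
pea_x2 = beta_hat[1] * norm_pdf(index(x_bar, beta_hat))
ape_x2 = beta_hat[1] * sum(norm_pdf(index(xi, beta_hat)) for xi in X) / N

# Binary x3: average the counterfactual difference in predicted probabilities.
def with_x3(xi, value):
    return [xi[0], xi[1], value]

ape_x3 = sum(norm_cdf(index(with_x3(xi, 1.0), beta_hat))
             - norm_cdf(index(with_x3(xi, 0.0), beta_hat)) for xi in X) / N
```

In this design the average index is close to the mode of the normal density, so the PEA exceeds the APE; with other covariate distributions the ordering can differ.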
♠Unlike the magnitude of the coefficient estimates, the APEs can be compared across models.
♠Different scale factors:
LPM: unity; probit: g(0) ≈ 0.4; logit: g(0) = 0.25
Therefore, we expect the logit slope coefficients to have the largest magnitude followed by the
probit estimates, followed by the LPM estimates.
♠Generally, it is a good idea to compute the PEAs and APEs, as well as partial effects at
other interesting values of x, or at values that allow us to determine policy effect of different
groups in the population.
In one case, we can show that the LPM estimates are consistent for the APEs regardless of
the actual function G(·)
♠Stoker (1986) let m(x) = E[y|x] be the conditional mean
λ = E[∇x m(x)0 ]
Assumptions: 1. x has convex, unbounded support
2. the distribution of x
E[∇xm(x)′] = −E[m(x) × ∇x log f(x)′] = −E[y × ∇x log f(x)′], where f(x) is the density of x
→ this follows from integration by parts: 0 = E[∇xm(x)′] + E[m(x) × ∇x log f(x)′]
If x ∼ Normal(µx, Σx) (multivariate normal distribution), then ∇x log f(x) = −(x − µx)Σx⁻¹
⇒ λ = E[y × Σx⁻¹(x − µx)′] = {E[(x − µx)′(x − µx)]}⁻¹ E[(x − µx)′y]
→ the average derivative
→ equals the vector of slope coefficients from the linear projection of y on 1, x
→ the estimation of an LPM can consistently estimate average partial effects
♠Potentially the biggest difference:
The LPM implies constant marginal effects, while the logit and probit models allow for a
diminishing effect for continuous and discrete variables
5.7 Specification Issues in Binary Response Models
5.7.1 Neglected Heterogeneity
• We begin by studying the consequences of omitting variables when those omitted variables are independent of the included explanatory variables. This is also called the
neglected heterogeneity problem. The (structural) model of interest is
P (y = 1|x, c) = Φ(xβ + γc)
where x is 1 × K with x1 ≡ 1 and c is a scalar.
♠Those omitted variables are independent of the included explanatory variables
The model of interest:
P(y = 1|x, c) = Φ(xβ + γc) ↔ latent variable form: y* = xβ + γc + e, y = 1[y* > 0], e|x, c ∼ Normal(0, 1)
♠Now suppose that c is independent of x and c ∼ Normal(0, τ²)
Given these assumptions, the composite term, γc + e, is independent of x and has a Normal(0, γ²τ² + 1) distribution.
P(y = 1|x) = P(γc + e > −xβ|x) = Φ(xβ/σ), where σ² ≡ γ²τ² + 1
→ probit of y on x consistently estimates β/σ
⇒ plim β̂j = βj/σ, and |βj/σ| < |βj| (for βj ≠ 0, since σ > 1 when γ ≠ 0 and τ² > 0)
→ attenuation bias
♠In probit analysis, neglected heterogeneity is a much more serious problem than in linear
models because, even if the omitted heterogeneity is independent of x, the probit coefficients
are inconsistent.
→ In nonlinear models, we usually want to estimate partial effects and not just parameters.
♠”the structural partial effect”
∂P(y = 1|x, c)/∂xj = βjφ(xβ + γc)
← can be evaluated at c = 0
Take the expected value w.r.t. the distribution of c at a given x°:
E[βjφ(x°β + γc)] = (βj/σ)φ(x°β/σ)
→ probit of y on x consistently estimates the average partial effects
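The claim that averaging the structural partial effect over c gives (βj/σ)φ(x°β/σ) can be verified by numerical integration. This is a minimal sketch with hypothetical parameter values; the trapezoid rule and its integration range are choices made here for illustration.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def averaged_partial_effect(xb, beta_j, gamma, tau, points=20001, width=10.0):
    """E_c[beta_j * phi(xb + gamma*c)] for c ~ Normal(0, tau^2), by the trapezoid rule."""
    lower, upper = -width * tau, width * tau
    step = (upper - lower) / (points - 1)
    total = 0.0
    for i in range(points):
        c = lower + i * step
        weight = 0.5 if i in (0, points - 1) else 1.0
        density_c = norm_pdf(c / tau) / tau
        total += weight * beta_j * norm_pdf(xb + gamma * c) * density_c
    return total * step

# Hypothetical values: index x°β, coefficient β_j, and heterogeneity parameters.
xb, beta_j, gamma, tau = 0.5, 1.0, 0.8, 1.5

sigma = math.sqrt(gamma ** 2 * tau ** 2 + 1.0)
numerical = averaged_partial_effect(xb, beta_j, gamma, tau)
closed_form = (beta_j / sigma) * norm_pdf(xb / sigma)
attenuated_coefficient = beta_j / sigma   # what probit of y on x estimates
```

The two numbers agree to numerical precision, so the attenuated coefficient βj/σ is exactly what is needed for the average partial effect, even though it understates βj.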
If c is correlated with x or otherwise dependent on x (for example, if V ar(c|x) depends on x)
then omission of c is serious
Ex: If c|x ∼ Normal(xδ, η²), then probit of y on x gives consistent estimates of (β + γδ)/ρ, where ρ² = γ²η² + 1
5.7.2 Continuous Endogenous Explanatory Variables
y1∗ = z1 δ1 + α1 y2 + u1
y2 = z1 δ21 + z2 δ22 + v2 = zδ2 + v2
y1 = 1[y1* > 0]
• Example 5.3 (Testing for Exogeneity of Education in the Women's LFP Model):
• Comparing the Rivers-Vuong approach to the MLE shows that the former is a limited
information procedure.
♠One possibility is to estimate an LPM by 2SLS and it might provide a good estimate of
the average effect.
Write the model as
y1∗ = z1 δ1 + α1 y2 + u1
y2 = z1 δ21 + z2 δ22 + v2 = zδ2 + v2
y1 = 1[y1* > 0]
where (u1, v2) has a zero-mean, bivariate normal distribution and is independent of z.
The model is applicable when y2 is correlated with u1 because of 1. omitted variables or 2. measurement error. It can also be applied to the case where y2 is determined jointly with y1.
♠The normalization that gives the parameters in equation (5.39) an average partial effect
interpretation, at least in the omitted variable and simultaneity contexts is V ar(u1 ) = 1
Holding u1 fixed, the difference in responses is
1[z1◦ δ1 + α1 (y2◦ + 1) + u1 ≥ 0] − 1[z1◦ δ1 + α1 y2◦ + u1 ≥ 0]
Because u1 is unobserved, we average across the distribution of u1 ,
Φ[z1◦ δ1 + α1 (y2◦ + 1)] − Φ[z1◦ δ1 + α1 y2◦ ]
Alternatively, if we allow σ1² = Var(u1) > 0 to be unrestricted, the APE would depend on δ1/σ1 and α1/σ1
→ we only need to consistently estimate δ1 and α1 up to scale.
♠A control function approach due to Rivers and Vuong (1988)
→ it leads to a simple test for endogeneity of y2
u1 = θ1v2 + e1, where θ1 = η1/τ2², η1 = Cov(v2, u1), τ2² = Var(v2), and e1 is independent of z and v2
Because of joint normality of (u1, v2), e1 is also normally distributed with E(e1) = 0 and
Var(e1) = Var(u1) − η1²/τ2² = 1 − ρ1², where ρ1 = Corr(v2, u1)
y1* = z1δ1 + α1y2 + θ1v2 + e1
e1|z, y2, v2 ∼ Normal(0, 1 − ρ1²)
A standard calculation shows that
P(y1 = 1|z, y2, v2) = Φ[(z1δ1 + α1y2 + θ1v2)/(1 − ρ1²)^(1/2)]
When v2 is observed, probit of y1 on z1, y2, and v2 consistently estimates
δρ1 ≡ δ1/(1 − ρ1²)^(1/2)
αρ1 ≡ α1/(1 − ρ1²)^(1/2)
θρ1 ≡ θ1/(1 − ρ1²)^(1/2)
♠Procedure 5.1
(a) Run the OLS regression y2 on z and save the residuals v̂2
(b) Run the probit y1 on z1 , y2 , v̂2 to get consistent estimators of the ”scaled coefficients”
δρ1 , αρ1 and θρ1
A nice feature of Procedure 5.1 is that the usual probit t-statistic on v̂2 is a valid test of the
null hypothesis that y2 is exogenous, that is H0 : θ1 = 0
Ex: Testing Exogeneity of Education in the Women’s LFP model
♠We can easily obtain estimates of the unscaled parameters, β1 = (δ1′, α1)′, from consistent estimators of δ2 and τ2² (first stage) and of δρ1, αρ1, and θρ1 (second stage)
Straightforward algebra shows that 1 + θρ1²τ2² = 1/(1 − ρ1²), so that
δ1 = (1 − ρ1²)^(1/2) δρ1 = δρ1/(1 + θρ1²τ2²)^(1/2)
α1 = (1 − ρ1²)^(1/2) αρ1 = αρ1/(1 + θρ1²τ2²)^(1/2)
β̂1 = β̂ρ1/(1 + θ̂ρ1²τ̂2²)^(1/2) = (δ̂1′, α̂1)′
← the original coefficients
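The rescaling step is a one-line computation once the two stages have been run. The sketch below checks it with a round trip: it starts from hypothetical structural values, forms the scaled coefficients that the second-stage probit would estimate, and then recovers the originals. The function name `unscale` and all numbers are illustrative assumptions.

```python
import math

def unscale(delta_rho, alpha_rho, theta_rho, tau2_sq):
    """Divide the scaled coefficients by (1 + theta_rho^2 * tau2^2)^(1/2)."""
    scale = math.sqrt(1.0 + theta_rho ** 2 * tau2_sq)
    delta = [d / scale for d in delta_rho]
    alpha = alpha_rho / scale
    return delta, alpha, scale

# Hypothetical structural values with Var(u1) = 1.
delta1 = [0.3, -0.6]
alpha1 = 0.5
tau2_sq = 4.0            # Var(v2)
eta1 = 0.8               # Cov(v2, u1)

rho1 = eta1 / math.sqrt(tau2_sq)      # Corr(v2, u1), since Var(u1) = 1
theta1 = eta1 / tau2_sq
shrink = math.sqrt(1.0 - rho1 ** 2)

# Scaled coefficients, as consistently estimated by the second-stage probit.
delta_rho = [d / shrink for d in delta1]
alpha_rho = alpha1 / shrink
theta_rho = theta1 / shrink

delta_back, alpha_back, scale = unscale(delta_rho, alpha_rho, theta_rho, tau2_sq)
```

In practice the inputs would be the second-stage probit estimates and the first-stage residual variance; standard errors for the unscaled estimates need a separate delta-method or bootstrap step that is not shown here.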
Given δ̂1 and α̂1 , we can compute derivatives and differences in Φ(z1 δ̂1 + α̂1 y2 ) at interesting
values of z1 and y2 .
For the average partial effects of continuous explanatory variables (including y2), the scale factor for multiplying, say, α̂1 is
N⁻¹ Σ_{i=1}^{N} φ(zi1δ̂1 + α̂1yi2)
An alternative method of computing the APEs does not exploit the normality assumption for v2.
APEs are obtained by taking derivatives or differences of
E_v2[Φ(z1δρ1 + αρ1y2 + θρ1v2)] with respect to elements of (z1, y2)
A consistent estimator is given by
N⁻¹ Σ_{i=1}^{N} Φ(z1δ̂ρ1 + α̂ρ1y2 + θ̂ρ1v̂i2) ... (5.47)
For example, the estimated APE of y2 is
α̂ρ1 [N⁻¹ Σ_{i=1}^{N} φ(zi1δ̂ρ1 + α̂ρ1yi2 + θ̂ρ1v̂i2)]
♠We plug in the reduced form for y2 to get y1 = 1[z1δ1 + α1(zδ2) + α1v2 + u1 > 0]
⇒ P(y1 = 1|z) = Φ{[z1δ1 + α1(zδ2)]/ω1},
where ω1² ≡ Var(α1v2 + u1) = α1²τ2² + 1 + 2α1Cov(v2, u1)
A two-step procedure:
(1) the same first-step OLS regression: get the fitted values, ŷi2 = ziδ̂2
(2) followed by a probit of yi1 on zi1, ŷi2
1. Does not provide a simple test of the null hypothesis that y2 is exogenous
2. the coefficients cannot be directly compared to the usual probit estimates
3. equation (5.47) is not available for the fitted values approach.
♠(Endogeneity of Nonwife Income in the Women’s LFP model)
The null hypothesis: nwifeinc is exogenous in the probit model
IV: husband’s years of schooling → The model is just identified
♠If we have overidentifying restrictions, these are easily tested using the CF approach.
The key restriction: D(u1|y2, z) = D(u1|v2)
i.e. the conditional distribution depends on (y2, z) only through the linear combination y2 − zδ2 = v2, where zδ2 = z1δ21 + z2δ22
♠Rather than use a two-step procedure, we can estimate equations (5.39)-(5.41) by conditional maximum likelihood estimation (CMLE)
The joint distribution of (y1 , y2 ), conditional on z,
f (y1 , y2 |z) = f (y1 |y2 , z)f (y2 |z)
f (y1 |y2 , z) → the conditional density of y1 given (y2 , z)
f (y2 |z) → y2 |z ∼ N ormal(zδ2 , τ22 )
♠v2 = y2 − zδ2 and y1 = 1[y1* > 0]
⇒ P(y1 = 1|y2, z) = Φ{[z1δ1 + α1y2 + (ρ1/τ2)(y2 − zδ2)]/(1 − ρ1²)^(1/2)},
where θ1 = ρ1/τ2
Let w denote the term inside Φ(·)
Then we have derived
f(y1, y2|z) = Φ(w)^y1 [1 − Φ(w)]^(1−y1) (1/τ2)φ[(y2 − zδ2)/τ2]
Summing log f(yi1, yi2|zi) across all i and maximizing with respect to all parameters gives the MLEs of δ1, α1, ρ1, δ2, τ2²
→ instrumental variables probit or IV probit
♠MLE has some decided advantages over two-step procedures.
First, MLE is more efficient than any two-step procedure.
Second, we get direct estimates of δ1 and α1, the parameters of interest for computing partial effects.
Testing that y2 is exogenous:
Just test H0 : ρ1 = 0 using an asymptotic t test.
♠Comparing the Rivers-Vuong approach to MLE shows that the former is a limited information procedure.
I. Rivers and Vuong focus on f (y1 |y2 , z)
1. replace the unknown δ2 with the OLS estimator δ̂2
2. ignore the rescaling problem by taking e1 to have unit variance
II. MLE estimates the parameters using the information in f(y1|y2, z) and f(y2|z) simultaneously
♠The CF approach has a somewhat subtle robustness property.
→ CF approach is consistent if we assume that
1. D(u1 |z, v2 ) = D(u1 |v2 )
2. D(u1 |v2 ) is normal with mean linear in v2 and constant variance
or [sufficient]
1. Independence between (u1 , v2 ) and z along with
2. bivariate normality of (u1 , v2 )
♠For notational and interpretational simplicity, our previous analysis has assumed that a single endogenous explanatory variable appears additively inside the index function. More generally,
y1 = 1[g1(z1, y2)β1 + u1 > 0]
y2 = g2(z)δ2 + v2
u1 = θ1v2 + e1 ← (key)
where g1(z1, y2) and g2(z) are known vector functions
Because y2 is a function of (z, v2), if we maintain independence between (u1, v2) and z, then D(u1|z, y2, v2) = D(u1|z, v2) = D(u1|v2), and the previous analysis, whether it is the Rivers-Vuong CF approach or MLE, goes through by replacing x1 ≡ (z1, y2) with x1 ≡ g1(z1, y2)
♠Another nice feature of the CF approach is that we can allow for a strictly monotonic
transformation of the endogenous explanatory variable in the reduced form.
Ex: y2 = log(income) → income = exp(y2) → the reduced form may have a better functional form, while any function of income (its level, a quadratic, or interaction terms) can appear in the structural equation
→ The fact that including the reduced-form residuals in the CF approach accounts for endogeneity of y2, even if we have general functions of y2 and z1 in the model, hinges crucially on independence between (u1, v2) and z
5.7.3 A Binary Endogenous Explanatory Variable
y1 = 1[z1δ1 + α1y2 + u1 > 0]
y2 = 1[zδ2 + v2 > 0]
where (u1, v2) is independent of z and distributed as bivariate normal with 1. mean zero, 2. unit variances, and 3. ρ1 = Corr(u1, v2)
Note: The model (5.51)-(5.52) applies primarily to omitted variable problems. It rules out:
1. simultaneity: (5.52) is not the reduced form for y2 if the structural model has y1 as a determinant of y2
2. measurement error: measurement error in binary responses does not lead to this model either.
♠Often, the effect of y2 is of primary interest
ex: y1 : employment status
y2 : participation in some sort of program, such as job training
♠The average treatment effect (for a given value of z1 )
Φ(z1 δ1 + α1 ) − Φ(z1 δ1 ) → can be computed for different subgroups or averaged across the
distribution of z1 .
♠To derive the likelihood function, we again need the joint distribution of (y1, y2) given z, which we obtain from f(y1, y2|z) = f(y1|y2, z)f(y2|z)
Because y1 is binary, we start from
P(y1 = 1|v2, z) = Φ[(z1δ1 + α1y2 + ρ1v2)/(1 − ρ1²)^(1/2)]
y2 = 1 if and only if v2 > −zδ2
1. v2 has a standard normal distribution
2. v2 is independent of z
⇒ the density of v2 given v2 > −zδ2 is
φ(v2)/P(v2 > −zδ2) = φ(v2)/Φ(zδ2)
Therefore,
P(y1 = 1|y2 = 1, z) = E[P(y1 = 1|v2, z)|y2 = 1, z]
= E{Φ[(z1δ1 + α1 + ρ1v2)/(1 − ρ1²)^(1/2)]|y2 = 1, z}
= [1/Φ(zδ2)] ∫_{−zδ2}^{∞} Φ[(z1δ1 + α1 + ρ1v2)/(1 − ρ1²)^(1/2)] φ(v2) dv2
(with y2 = 1, the term α1y2 equals α1)
P(y1 = 0|y2 = 1, z) = 1 − P(y1 = 1|y2 = 1, z)
Similar derivations give P(y1 = 1|y2 = 0, z) and P(y1 = 0|y2 = 0, z)
→ Combining the four possible outcomes of (y1 , y2 ), along with the probit model for y2 and
taking the log gives the log-likelihood function for maximum likelihood analysis.
♠The bivariate probit model
y1 = 1[x1 β1 + e1 > 0]
y2 = 1[x2 β2 + e2 > 0]
where x1 is 1 × k1 and x2 is 1 × k2
e ≡ (e1 , e2 ) is assumed to be independent of (x1 , x2 ) with a bivariate normal distribution
→ e|x ∼ Normal(0, Ω), where x consists of all exogenous variables and Ω is the 2 × 2 matrix with ones on the diagonal and ρ = Corr(e1, e2) off the diagonal:
Ω = [1 ρ; ρ 1]
→ y1 and y2 each follow probit models conditional on x, so β1 and β2 can be consistently estimated by estimating separate probit models
Not surprisingly, if e1 and e2 are correlated, a joint maximum likelihood procedure is more
efficient than the separate probit.
♠A simple way to obtain the log-likelihood function is to construct the joint density as f(y1|y2, x)f(y2|x).
The model can contain a binary endogenous variable y2 → x1 can be any function of (z1, y2)
♠Because computing the MLE requires nonlinear optimization, it is tempting to use some seemingly "obvious" two-step procedures that inappropriately mimic 2SLS.
E(y2|z) = Φ(zδ2), and δ2 is consistently estimated by probit of y2 on z
→ However, the probit of y1 on z1, Φ̂2, where Φ̂2 ≡ Φ(zδ̂2), does not produce consistent parameter estimators (and probably not consistent APE estimators either)
Example 5.4 (Women’s Labor Force Participation and Having More than Two Children)
1. The endogenous explanatory variable is y2 = morekids, which is unity if a woman has three or more children (49% of the sample)
2. The response variable is y1 = worked; roughly 59% of the women report being in the labor force at the time of the survey
3. As an instrumental variable for y2 = morekids we use samesex, which is a binary variable equal to one if the first two children are of the same sex.
4. If we drop samesex from the probit model for morekids (which, in a linear context, would lead to a lack of identification), we estimate an even larger effect: the APE is −0.349
The estimated value of ρ increases substantially when we drop samesex from the model for morekids, suggesting that including samesex in the model for morekids helps reduce the correlation between unobservables that affect both morekids and worked
5.7.4 Heteroskedasticity and Nonnormality in the Latent Variable Model
♠Consider the underlying latent variable formulation, y* = xβ + e,
and the response probability P(y = 1|x) = G(xβ)
First, consider the problem of nonnormality of e in a probit model.
If e is independent of x, we can write
P(y = 1|x) = G(xβ) ≠ Φ(xβ), where, generally, G(z) = 1 − F(−z)
F(·): the cdf of e
"Nonnormality in a probit model causes bias and inconsistency in the parameter estimates"
→ Correct, but it largely misses the point.
→ In Section 5.6 we noted that when x has a multivariate normal distribution, the LPM consistently estimates the APEs for any smooth function G(·).
→ The probit model is likely to do a reasonable job of approximating the APEs in lots of cases where G ≠ Φ.
→ Ex 5.2:
probit and logit give very similar estimated partial effects
the logit parameter estimates are larger than the probit estimates by roughly a factor of 1.6
→ a consequence of the different implicit scale factors
♠Worrying about the choice of G when the response probability is of interest is no different
from worrying about the functional form E(y|x) in a regression context.
→ Not in the language of "biased" or "inconsistent"
→ The issue is whether a linear or exponential functional form provides the best fit, and whether the estimated partial effects on E(y|x) are very different across the models.
♠Nonnormality of e in y = 1[xβ + e > 0] means that the probit response probability is misspecified.
Even if we could estimate β consistently, we could not obtain magnitudes of the partial effects.
So our focus should be on how well different methods approximate partial effects and not
whether they estimate parameters consistently.
→ partial effects v.s. parameters
→ relative partial effects of continuous variables can be identified without specifying G(·).
Once we view nonnormality of e as a functional form problem for p(x), we can evaluate more
general parametric models in a sensible way. It can be a good idea to replace Φ(xβ) with a
function such as G(xβ, γ), where γ is an extra set of parameters, especially if G(xβ, γ) is
chosen to nest the probit model.
→ And we definitely should not reject the standard models just because the estimates of β
seem to change a lot: the basis for comparison should be partial effects at various values of x,
APEs, and goodness-of-fit measures such as the 1. values of the log-likelihood functions and
2. the percent correctly predicted
♠Similar comments can be made about heteroskedasticity in the latent error e. Yatchew and Griliches (1984) study the inconsistency of the probit MLE when Var(e|x) depends on x, even if D(e|x) is normal.
If D(e|x) = Normal(0, h(x)), the response probability takes the form P(y = 1|x) = Φ(xβ/h(x)^(1/2)),
and so it is pretty obvious that probit of y on x could not consistently estimate β.
”heteroskedasticity is prevalent with microeconomic data”
→ Even if Var(e|x) = 1, Var(y|x) = Φ(xβ)[1 − Φ(xβ)], which depends on x
→ In fact, most observed discrete random variables y will have a conditional distribution such that Var(y|x) is not constant, but that is due to the discreteness in y, not necessarily because of heteroskedasticity in an underlying latent error.
♠Is there any reason we should consider possible heteroskedasticity in Var(y*|x) when y* is an unknown latent variable?
→ to generalize the functional form of the response probability
Ex: Var(e|x) = exp(2x1δ)
⇒ P(y = 1|x) = Φ[xβ/exp(x1δ)]: a much more flexible response probability
⇒ a cost to the more general model:
1. one way of computing partial effects: ∂P(y = 1|x)/∂xj is more complicated and need not have the same sign as βj
♠As discussed in Wooldridge (2005c), the partial effects in the model yi = 1[xiβ + ei > 0], averaged across the distribution of ei, have the same sign as the corresponding βj. In fact, the ratios of APEs for the continuous variables are equal to the ratios of the coefficients, even when ei contains heteroskedasticity
Ex: ASF(x) = E_ei{1[xβ + ei > 0]} = P(ei > −xβ) = 1 − F(−xβ), where F(·) is the cdf of e
⇒ 1. the APE of xj has the same sign as βj
2. for a continuous xj, the APE is βj f(−xβ), where f(·) is the density of e
3. the relative APE for two continuous variables xj and xh is βj/βh
♠In the heteroskedastic probit model, we can easily recover the APEs through the ASF.
We use iterated expectations, because what we have modeled is D(ei|xi) = D(ei|xi1)
ASF(x) = E_xi1(E{1[xβ + ei > 0]|xi1}) = E_xi1{Φ[exp(−xi1δ)xβ]}
which follows from 1[xβ + ei > 0] = 1[exp(−xi1δ)ei > −exp(−xi1δ)xβ]
For a continuous covariate xj, the partial effect on the ASF is βj E_xi1{exp(−xi1δ)φ[exp(−xi1δ)xβ]}
Given the maximum likelihood estimators β̂ and δ̂ from heteroskedastic probit, a consistent estimator is
2. the other way of computing the APE:
β̂j [N⁻¹ Σ_{i=1}^{N} exp(−xi1δ̂)φ(exp(−xi1δ̂)xβ̂)] ...... (5.60)
♠We have two ways of computing partial effects, and they can give conflicting answers, not just on the magnitude of the effects but also on the direction of the effects
→ equation (5.60) shows that the APEs obtained from averaging out ei are proportional to the β̂j
⇒ NO convincing way of choosing between methods 1. and 2.
5.7.5 Estimation under Weaker Assumptions
• There are several semiparametric estimators of the slope parameters, up to scale, that
do not require knowledge of G.
• Obtaining Ĝ requires nonparametric regression of yi on xi β̂, where β̂ are the scaled slope estimators.
♠Parametric models: P (y = 1|x) depends on a finite number of parameters
→ relax the parametric assumptions
→ estimate the directions and relative sizes of the partial effects, not the response probabilities
♠x: multivariate normal
⇒ the slope coefficients in the linear projection of y on 1, x, say λ, are the partial effects of P (x) averaged across the distribution of x, regardless of the true response probability
♠An alternative approach is to explicitly recognize that we do not know the function G(·), but that the response probability has the index form P (y = 1|x) = G(xβ).
→ y∗ = xβ + e, where e is independent of x but the distribution of e is not known
→ There are several semiparametric estimators of the slope parameters, up to scale, that do
not require knowledge of G.
→ the response probabilities, as well as the partial effects on these probabilities can be consistently estimated for unknown G
Remarkably, it is possible to estimate β up to scale without assuming that e and x are independent.
Manski (1975, 1988) shows how to consistently estimate β, subject to a scaling, under the assumption that the median of e given x is zero.
→ some mild restrictions: at least one element of x with nonzero coefficient is essentially
→ 1. e is allowed to have any distribution
2. e and x can be dependent → V ar(e|x) is unrestricted
♠Manski’s maximum score estimator:
(A least absolute deviations estimator)
Since the median of y given x is 1[xβ > 0], the estimator solves
\[ \min_{\beta}\ \sum_{i=1}^{N} |y_i - 1[x_i\beta > 0]| \quad \text{over all } \beta \text{ with } \beta'\beta = 1 \]
→ The resulting estimator is consistent, but its limiting distribution is nonnormal and it converges at rate N^{1/3}
Horowitz (1992) proposes a smoothed version of the maximum score estimator that converges at a rate close to √N
Drawback: the maximum score estimator does not allow estimation of the APEs for either continuous or discrete covariates, because the unconditional distribution of ei is not identified, and without making further assumptions we cannot find P (yi = 1|xi )
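A minimal sketch of the maximum score idea, under assumptions chosen for illustration: two regressors without an intercept, so that the normalization β′β = 1 reduces the search to one angle, and a brute-force grid search rather than any optimizer used in practice. The simulated error is heteroskedastic and non-normal but has median zero given x.

```python
import numpy as np

def max_score(y, X, n_grid=720):
    """Grid search over unit vectors b = (cos t, sin t) minimizing
    sum_i |y_i - 1[x_i b > 0]|. Works for exactly two regressors."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    B = np.column_stack([np.cos(angles), np.sin(angles)])   # n_grid x 2
    pred = (X @ B.T > 0).astype(float)                      # N x n_grid
    loss = np.abs(y[:, None] - pred).sum(axis=0)            # LAD objective
    return B[np.argmin(loss)]

# Simulated data (hypothetical design).
rng = np.random.default_rng(0)
N = 5000
X = rng.normal(size=(N, 2))
beta = np.array([np.cos(0.6), np.sin(0.6)])                 # true direction, unit norm
e = 0.3 * (1.0 + np.abs(X[:, 0])) * rng.standard_t(df=3, size=N)
y = (X @ beta + e > 0).astype(float)

b_hat = max_score(y, X)
cosine = float(b_hat @ beta)   # close to 1 when the direction is recovered
```

Only the direction of β is recovered. Consistent with the drawback noted above, nothing in this procedure estimates P (yi = 1|xi ) or the APEs.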
♠Lewbel (2000) offers a different approach to estimating the coefficients up to scale when 1. e is not independent of x but 2. e and x are uncorrelated.
Key assumption: the existence of a continuous variable, say xk , with βk ≠ 0 such that
D(e|x) = D(e|x1 , x2 , ..., xk−1 )
→ conditional on (x1 , ..., xk−1 ), e and xk are independent
♠As pointed out by Lewbel, one case where his assumption holds is when the latent variable model is a random coefficient model of the form
y∗i = bi1 xi1 + ... + bi,k−1 xi,k−1 + βk xik + ai
where bi1 , ..., bi,k−1 are random slopes and βk ≠ 0
If the vector (ai , bi1 , ..., bi,k−1 ) is independent of xi , then Lewbel’s key assumption holds.
♠Some progress has been made in estimating parameters up to scale in the model y1 = 1[z1 δ1 + α1 y2 + u1 > 0], where y2 might be correlated with u1 and z1 is a 1 × L1 vector of exogenous variables.
Lewbel’s (2000) general approach applies to this situation as well.
Let z be the vector of all exogenous variables uncorrelated with u1 . Then Lewbel requires a continuous element of z1 with nonzero coefficient, say zL1 , that does not appear in D(u1 |y2 , z).
(Clearly, y2 cannot play the role of the variable excluded from D(u1 |y2 , z) if y2 is thought to be endogenous.)
When might Lewbel’s exclusion restriction hold? → rule out the dependence zL1 and u1 through
Sufficient is y2 = g2 (z2 ) + v2 , where 1. (u1 , v2 ) is indep. of z and 2. z2 does not contain zL1 .
5.8 Binary Response Models for Panel Data and Cluster Samples
♠A linear probability model for binary outcomes has the same problems as in the cross section case.
I. It is less appealing for unobserved effects models, as it implies the unnatural restrictions −xit β ≤ ci ≤ 1 − xit β, t = 1, ..., T , on the unobserved effect
→ an unobserved effects LPM is P (yit = 1|xi , ci ) = P (yit = 1|xit , ci ) = xit β + ci
II. But FE or FD estimation of the LPM might provide reasonable estimates of APEs.
Advantages: 1. It does not require a distributional assumption on D(ci |xi )
2. Nor do we have to assume independence of the responses {yi1 , ..., yiT } conditional on (xi , ci )
5.8.1 Pooled Probit and Logit
P (yit = 1|xit ) = G(xit β),
t = 1, 2, . . . , T
♠P (yit = 1|xit ) = G(xit β)......(5.62)
1. G is a known function taking on values in the open unit interval
2. xit can contain a variety of factors, including (a) time dummies, (b) interactions of time dummies with time-constant or time-varying variables, and (c) lagged dependent variables
As with many pooled MLE problems, we have only specified a model for D(yit |xit )
For binary response, we maximize the partial log likelihood
\[ \sum_{i=1}^{N}\sum_{t=1}^{T} \{ y_{it}\log G(x_{it}\beta) + (1 - y_{it})\log[1 - G(x_{it}\beta)] \} \]
The model (5.62) is dynamically complete if
P (yit = 1|xit , yi,t−1 , xi,t−1 , ...) = P (yit = 1|xit )......(5.63)
or, equivalently, D(yit |xit , yi,t−1 , xi,t−1 , ...) = D(yit |xit )
To test for dynamic completeness, we can always add 1. a lagged dependent variable and 2. possibly lagged explanatory variables.
A simple one-degree-of-freedom test for dynamic completeness:
Define uit ≡ yit − Φ(xit β)
Under dynamic completeness, E[uit |xit , yi,t−1 , xi,t−1 , ...] = 0
→ uit is uncorrelated with any function of the variables (xit , yi,t−1 , xi,t−1 , ...), including ui,t−1
Let ûit ≡ yit − Φ(xit β̂)
Use pooled probit to estimate the artificial model
P (yit = 1|xit , ûi,t−1 ) = Φ(xit β + γ1 ûi,t−1 ), t = 2, ..., T
The null hypothesis is H0 : γ1 = 0
→ If H0 is rejected, then so is dynamic completeness, P (yit = 1|xit , yi,t−1 , xi,t−1 , ...) = P (yit = 1|xit )
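The one-degree-of-freedom test can be sketched on simulated data. Everything below is an illustrative assumption: the helper `probit_fit` is a bare-bones Fisher scoring routine written for this sketch (not a library function), the standard errors are the usual nonrobust ones, and the data are generated with an unobserved effect ci so that dynamic completeness fails by construction.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def probit_fit(X, y, n_iter=50):
    """Probit MLE by Fisher scoring; returns (coefficients, nonrobust std. errors)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = X @ b
        P = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
        d = norm_pdf(idx)
        score = X.T @ (d * (y - P) / (P * (1 - P)))
        info = (X * (d**2 / (P * (1 - P)))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b, np.sqrt(np.diag(np.linalg.inv(info)))

# Simulated panel with heterogeneity c_i (hypothetical design).
rng = np.random.default_rng(0)
N, T = 500, 5
x = rng.normal(size=(N, T))
c = rng.normal(size=(N, 1))
y = (0.5 * x + c + rng.normal(size=(N, T)) > 0).astype(float)

# Step 1: pooled probit of y_it on (1, x_it); residuals u_hat_it.
X_all = np.column_stack([np.ones(N * T), x.ravel()])
b_pooled, _ = probit_fit(X_all, y.ravel())
u_hat = y - norm_cdf(b_pooled[0] + b_pooled[1] * x)

# Step 2: pooled probit of y_it on (1, x_it, u_hat_{i,t-1}), t = 2, ..., T.
X_aug = np.column_stack([
    np.ones(N * (T - 1)), x[:, 1:].ravel(), u_hat[:, :-1].ravel()
])
b_aug, se_aug = probit_fit(X_aug, y[:, 1:].ravel())
t_stat = b_aug[2] / se_aug[2]   # t statistic for H0: gamma_1 = 0
```

In this design the lagged residual picks up ci , so the t statistic is far from zero and dynamic completeness is rejected.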
5.8.2 Unobserved Effects Probit Models under Strict Exogeneity
• A popular model for binary outcomes with panel data is the unobserved effects probit model. The main assumption of this model is
P (yit = 1|xi , ci ) = P (yit = 1|xit , ci ) = Φ(xit β + ci ),
t = 1, . . . , T
• In this spirit, a fixed effects probit analysis treats the ci as parameters to be estimated
along with β, as this treatment obviates the need to make assumptions about the distribution of ci given xi .
• Unfortunately, in addition to being computationally difficult, estimation of the ci along
with β introduces an incidental parameters problem.
• The traditional random effects probit model adds, to assumptions (5.60) and (5.61), the assumption
ci |xi ∼ Normal(0, σc²)
• (In other branches of applied statistics, such as biostatistics and education, the coefficients indexing the APEs (βc in our notation) are called the population-averaged parameters.)
• The conditional MLE in this context is typically called the random effects probit estimator, and the theory in Section 13.9 can be applied directly to obtain asymptotic
standard errors and test statistics.
• An alternative to simply calculating robust standard errors for β̂c after pooled probit, or
using the full random effects assumptions and obtaining the MLE, is what is called the
generalized estimating equations (GEE) approach (see Zeger, Liang, and Albert, 1988).
• Unfortunately, the model is then called a population-averaged model. As we just saw, we
can estimate the population-averaged parameters, or APEs, even if we have used random
effects probit estimation. It is best to think of assumption (5.60) as the unobserved
effects model of interest. The relevant question is, What effects are we interested in,
and how can we consistently estimate them under various assumptions? We are usually
interested in averaging across the distribution of ci , but the basic model has not changed.
• Chamberlain (1980) called model (5.60) under assumption (5.67) a random effects probit
model, so we refer to the model as Chamberlain random effects probit model. While
assumption (5.67) is restrictive in that it specifies a distribution for ci given xi , it at
least allows for some dependence between ci and xi .
♠A key assumption is that the observed covariates are strictly exogenous conditional on the unobserved effect
The assumption can be stated in terms of conditional distributions:
D(yit |xi , ci ) ≡ D(yit |xi1 , xi2 , ..., xiT , ci ) = D(yit |xit , ci ), t = 1, ..., T (5.65)
where ci is the unobserved effect
I. In stating this assumption, we are not restricting in any way the joint distribution of yi = (yi1 , ..., yiT ) conditional on (xi , ci ), which we write as D(yi |xi , ci )
II. the strict exogeneity assumption conditional on ci is in terms of the conditional expectation.
III. Assumption (5.65) rules out models with 1. lagged dependent variables in xit and 2. other
situations where one or more elements of xit may react in the future to idiosyncratic changes
in yit
IV. Assumption (5.65) requires that xit includes enough lags of underlying explanatory variables if distributed lag dynamics are present
the unobserved effects probit model
P (yit = 1|xit , ci ) = Φ(xit β + ci )
, t = 1, ..., T ......(5.66)
→ the response probability
Most analyses are not convincing unless xit contains a full set of time dummies.
Without more assumptions, β is not identified
→ specify how ci relates to the covariates
♠What happens if we try to proceed without placing restrictions on D(ci |xi )?
Assume yi1 , ..., yiT are independent conditional on (xi , ci )......(5.67)
⇒ the density of (yi1 , ..., yiT ) conditional on (xi , ci ) is
\[ f(y_1, \ldots, y_T \mid x_i, c_i; \beta) = \prod_{t=1}^{T} f(y_t \mid x_{it}, c_i; \beta), \qquad f(y_t \mid x_t, c; \beta) = \Phi(x_t\beta + c)^{y_t}[1 - \Phi(x_t\beta + c)]^{1-y_t} \]
Ideally, we could estimate the quantities of interest without restricting the relationship between ci and the xit .
⇒ The log-likelihood function is
\[ \sum_{i=1}^{N} \ell_i(c_i, \beta), \qquad \ell_i(c_i, \beta) = \log f(y_{i1}, \ldots, y_{iT} \mid x_i, c_i; \beta) \]
Estimation of the ci along with β introduces an incidental parameters problem
→ inconsistent estimation of β with T fixed and N → ∞
♠In linear models, the logit model, and count data models, we can consistently estimate the parameters β without specifying a distribution for ci given xi .
→ The "fixed effects probit" estimator, by contrast, has serious biases
→ need an assumption about the relationship between ci and xi
♠We always treat ci as an unobservable random variable drawn along with (xi , yi )
The traditional random effects probit model
ci |xi ∼ Normal(0, σc²)......(5.69)
1. ci and xi are independent
2. ci has a normal distribution
♠What do we want to estimate?
1. Consistent estimation of β means that we can consistently estimate the partial effects of the elements of xt on the response probability P (yt = 1|xt , c) at the average value of c in the population, c = 0.
2. Since ci ∼ Normal(0, σc²), the APE for a continuous xtj is
\[ \frac{\beta_j}{(1+\sigma_c^2)^{1/2}}\; \phi\!\left( \frac{x_t\beta}{(1+\sigma_c^2)^{1/2}} \right) \]
Therefore, we only need to estimate \( \beta_c \equiv \beta/(1+\sigma_c^2)^{1/2} \) to estimate the APEs
3. Because the ci are not observed, they cannot appear in the likelihood function
This step requires us to integrate out ci :
\[ f(y_1, \ldots, y_T \mid x_i; \theta) = \int_{-\infty}^{\infty} \Big[ \prod_{t=1}^{T} f(y_t \mid x_{it}, c; \beta) \Big] \frac{1}{\sigma_c}\, \phi(c/\sigma_c)\, dc \]
where \( f(y_t \mid x_t, c; \beta) = \Phi(x_t\beta + c)^{y_t}[1 - \Phi(x_t\beta + c)]^{1-y_t} \) and θ contains β and σc² → original parameters
⇒ the conditional log likelihood li (θ) = log f (yi1 , ..., yiT |xi ; θ) for each i
The conditional MLE is called the random effects probit estimator
4. Since the variance of the idiosyncratic error in the latent variable model is unity, the relative importance of the unobserved effect is measured as ρ = σc²/(σc² + 1), which is also the correlation between the composite latent error, say ci + eit , across any two time periods.
First consider relaxing the assumption that yi1 , ..., yiT are independent conditional on (xi , ci ), assumption (5.67).
We can still write (under assumptions (5.66) and (5.69))
P (yit = 1|xit ) = Φ(xit βc ), where βc = β/(1 + σc²)^{1/2}
→ estimate βc from pooled probit of yit on xit , t = 1, ..., T , meaning that we directly estimate the APEs.
→ If ci is truly present, {yit : t = 1, ..., T } will not be independent conditional on xi
Robust inference is needed to account for the serial dependence.
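The scaling by (1 + σc²)^{1/2} comes from integrating ci out of Φ(xit β + ci ). A short numerical check, using Gauss-Hermite quadrature as one possible way to approximate the integral (the index value and σc below are hypothetical):

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def averaged_prob(index, sigma_c, n_nodes=40):
    """E_c[Phi(index + c)] for c ~ Normal(0, sigma_c^2), by Gauss-Hermite quadrature.
    With c = sqrt(2)*sigma_c*t the normal density becomes the weight exp(-t^2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    c = math.sqrt(2.0) * sigma_c * nodes
    return float(np.sum(weights * norm_cdf(index + c)) / math.sqrt(math.pi))

sigma_c = 1.0   # hypothetical standard deviation of c_i
index = 0.7     # hypothetical value of x_t beta

lhs = averaged_prob(index, sigma_c)                          # integral over c
rhs = float(norm_cdf(index / math.sqrt(1.0 + sigma_c**2)))   # Phi(x_t beta_c)
rho = sigma_c**2 / (sigma_c**2 + 1.0)                        # share of latent variance due to c_i
```

The two numbers agree up to quadrature error, which is why pooled probit recovers βc rather than β. The same quadrature idea is a common way to evaluate the integral in the random effects probit likelihood.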
♠A different way to relax assumption (5.67) is to assume a particular correlation structure and then use full CMLE.
For each t, write the latent variable model as
y∗it = xit β + ci + eit , yit = 1[y∗it > 0]
and assume that the T × 1 vector ei = (ei1 , ..., eiT )′ is 1. multivariate normal with 2. unit variances and 3. an unrestricted correlation matrix
For moderate T , computation of the CMLE can be very difficult. Recent advances in simulation methods of estimation make it possible to estimate such models for fairly large T .
♠Explicitly allow the unobservables ci to be correlated with some elements of xit .
Chamberlain’s assumption is
ci |xi ∼ Normal(ψ + x̄i ξ, σa²)......(5.73)
where x̄i is the average of xit , t = 1, ..., T , and σa² is the variance of ai in ci = ψ + x̄i ξ + ai , with ai independent of xi
→ Although it specifies a distribution for ci given xi , it at least allows for some dependence between ci and xi .
→ refer to the model as Chamberlain’s correlated random effects probit model
If assumptions (5.65), (5.66), (5.67), and (5.73) hold, then
y∗it = ψ + xit β + x̄i ξ + ai + eit , where 1. eit is independent Normal(0, 1) conditional on (xi , ai )
2. ai |xi ∼ Normal(0, σa²)
→ Adding x̄i as a set of controls for unobserved heterogeneity is very intuitive: we are estimating the effect of changing xitj while holding the time average fixed
→ A test of the usual RE probit model is easily obtained as a test of H0 : ξ = 0
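♠Given estimates of ψ and ξ, we can estimate E(ci ) = ψ + E(x̄i )ξ by µ̂c = ψ̂ + x̄ ξ̂, where x̄ is the sample average of the x̄i , and the response probability at the mean of ci by Φ(µ̂c + xt β̂)
Taking differences or derivatives (with respect to the elements of xt ) allows us to estimate the partial effects on the response probabilities for any value of xt .
♠If we drop the conditional independence assumption (5.67), it still follows immediately that 1. ψa , βa , and ξa (the original parameters multiplied by (1 + σa²)^{−1/2}) can be consistently estimated using a pooled probit analysis of yit on 1, xit , x̄i , t = 1, ..., T , i = 1, ..., N
2. Because the yit will be dependent conditional on (xit , x̄i ), inference that is robust to arbitrary time dependence is required.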
♠Estimating the APEs
Average P (yt = 1|xt = x◦ , ci ) across the distribution of ci
→ E[P (yt = 1|xt = x◦ , ci )] = E[Φ(x◦ β + ci )] for any given value x◦
← integration w.r.t. the marginal distribution of ci
In what follows, x◦ is a nonrandom vector of numbers that we choose as interesting values of the explanatory variables
Writing ci = ψ + x̄i ξ + ai and using iterated expectations,
\[ E[\Phi(\psi + x^{\circ}\beta + \bar{x}_i\xi + a_i)] = E\big[ E\{\Phi(\psi + x^{\circ}\beta + \bar{x}_i\xi + a_i) \mid x_i\} \big] \]
\[ E\{\Phi(\psi + x^{\circ}\beta + \bar{x}_i\xi + a_i) \mid x_i\} = \Phi\!\left[ \frac{\psi + x^{\circ}\beta + \bar{x}_i\xi}{(1+\sigma_a^2)^{1/2}} \right] \equiv \Phi(\psi_a + x^{\circ}\beta_a + \bar{x}_i\xi_a) \]
\[ E[\Phi(x^{\circ}\beta + c_i)] = E[\Phi(\psi_a + x^{\circ}\beta_a + \bar{x}_i\xi_a)], \]
with consistent estimator
\[ N^{-1}\sum_{i=1}^{N} \Phi(\hat\psi_a + x^{\circ}\hat\beta_a + \bar{x}_i\hat\xi_a) \qquad (5.76) \]
APEs can be estimated by evaluating expression (5.76) at two different values for x◦ and forming the difference or, for a continuous variable xj , by using the average across i of β̂aj φ(ψ̂a + x◦ β̂a + x̄i ξ̂a ) to get the approximate APE of a one-unit increase in xj .
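A minimal sketch of how expression (5.76) and the associated APE might be computed once scaled estimates are in hand. The numbers standing in for ψ̂a , β̂a , ξ̂a and the simulated time averages x̄i are hypothetical placeholders, not estimates from any data set.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def norm_pdf(z):
    return np.exp(-0.5 * np.asarray(z, dtype=float)**2) / math.sqrt(2.0 * math.pi)

def asf_hat(x0, xbar, psi_a, beta_a, xi_a):
    """Expression (5.76): average over i of Phi(psi_a + x0 beta_a + xbar_i xi_a)."""
    return float(np.mean(norm_cdf(psi_a + x0 @ beta_a + xbar @ xi_a)))

def ape_continuous(j, x0, xbar, psi_a, beta_a, xi_a):
    """APE of continuous x_j: beta_aj times the average of phi(.) across i."""
    return float(beta_a[j] * np.mean(norm_pdf(psi_a + x0 @ beta_a + xbar @ xi_a)))

rng = np.random.default_rng(0)
xbar = rng.normal(size=(300, 2))          # stand-in for the time averages xbar_i
psi_a = -0.2                              # hypothetical scaled estimates
beta_a = np.array([0.6, -0.3])
xi_a = np.array([0.25, 0.10])
x0 = np.array([1.0, 0.5])                 # chosen evaluation point

ape_x1 = ape_continuous(0, x0, xbar, psi_a, beta_a, xi_a)

# APE of a discrete change: move the first covariate from 1.0 to 2.0.
x1 = np.array([2.0, 0.5])
discrete_change = asf_hat(x1, xbar, psi_a, beta_a, xi_a) - asf_hat(x0, xbar, psi_a, beta_a, xi_a)
```

The evaluation point x◦ is held fixed while x̄i varies over the sample, which is what averages out the heterogeneity.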
♠How does treating the ci ’s as parameters to estimate in a "fixed effects probit" analysis affect estimation of the APEs?
Given ĉi , i = 1, ..., N , and β̂, the APEs could be based on the average across i of Φ(ĉi + x◦ β̂), even though β̂ does not consistently estimate β and the ĉi are estimates of the incidental parameters.
Fernández-Val (2009) shows that the inconsistency in APEs constructed using the ĉi and the accompanying β̂ is on the order of T^{−1} (if there is no heterogeneity, the inconsistency is only O(T^{−2}))
Hahn and Newey (2004) found small biases in the usual probit APEs; they suggest bias-corrected estimators that have inconsistency on the order of T^{−2}.
→ All of these papers maintain 1. the conditional independence assumption (5.67) and 2. that the covariates are independent and identically distributed across t.
Strict exogeneity of the covariates conditional on ci is critical for the previous analysis. As
mentioned earlier, this assumption rules out lagged dependent variables, a case we consider
explicitly in Section 18.5.4
To obtain a test of strict exogeneity, let
wit : a 1 × G subset of xit that we suspect of failing strict exogeneity
⇒ A simple test is to add wi,t+1 as an additional set of covariates
Under the null hypothesis, wi,t+1 should be insignificant; obtain x̄i from all T time periods
⇒ The test uses 1. (RE probit) the distribution of (yi1 , ..., yi,T −1 ) given (xi1 , ..., xiT )
2. (pooled probit) the marginal distribution of yit given (xi1 , ..., xiT ), t = 1, ..., T − 1
If the test does not reject, it provides at least some justification for the strict exogeneity assumption.
5.8.3 Unobserved Effects Logit Models under Strict Exogeneity
• If we replace the standard normal cdf Φ in assumption (5.60) with the logistic function
Λ, and also maintain assumptions (5.61) and (5.63), we arrive at what is usually called
the random effects logit model.
• The conditional MLE from equation (5.71) is usually called the fixed effects logit estimator.
♠The response probability
P (yit = 1|xit , ci ) = Λ(xit β + ci ), t = 1, ..., T , where Λ is the logistic function
The full set of assumptions:
1. strict exogeneity (5.65)
2. conditional independence (5.67)
3. normality (5.69)
⇒ the random effects logit model
♠The response probability obtained by integrating out ci ,
\[ P(y_{it} = 1 \mid x_{it}) = \int_{-\infty}^{\infty} \Lambda(x_{it}\beta + c)\, \frac{1}{\sigma_c}\, \phi(c/\sigma_c)\, dc, \]
has no closed form
Allow D(ci |xi ) to depend on x̄i
P (yit = 1|xi ) = Λ(ψa + xit βa + x̄i ξa )
⇒ ASFs: \( N^{-1}\sum_{i=1}^{N} \Lambda(\hat\psi_a + x_t\hat\beta_a + \bar{x}_i\hat\xi_a) \) for given xt
→ average structural function (ASF): at a given value x◦ , defined as Eq [u1 (x◦ , q)], where u1 (x, q) = E[y|x, q] ← conditional expectation of interest
♠The real advantage of the logit specification is that we can obtain a √N -consistent, asymptotically normal estimator of β without any assumptions on D(ci |xi ), provided that each element of xit is time varying
How can we allow ci and xi to be arbitrarily related in the unobserved effects logit model?
Use some transformation to eliminate ci from the estimating equation
→ what we do is find the joint distribution of yi = (yi1 , ..., yiT )′ conditional on xi , ci , and \( n_i = \sum_{t=1}^{T} y_{it} \)
→ This conditional distribution does not depend on ci ← a feature of the logit functional form
Use standard CMLE methods to estimate β
First consider the T = 2 case, where ni = yi1 + yi2 takes a value in {0, 1, 2}
⇒ the conditional distribution cannot be informative for β when ni = 0 or ni = 2
→ these values completely determine the outcome, (yi1 , yi2 ) = (0, 0) or (yi1 , yi2 ) = (1, 1) → such data points have no information for estimating β and they drop out of the estimation
For ni = 1,
\[ P(y_{i2}=1 \mid x_i, c_i, n_i=1) = \frac{P(y_{i2}=1, n_i=1 \mid x_i, c_i)}{P(n_i=1 \mid x_i, c_i)} = \frac{P(y_{i1}=0, y_{i2}=1 \mid x_i, c_i)}{P(y_{i1}=0, y_{i2}=1 \mid x_i, c_i) + P(y_{i1}=1, y_{i2}=0 \mid x_i, c_i)} \]
\[ = \frac{[1-\Lambda(x_{i1}\beta+c_i)]\,\Lambda(x_{i2}\beta+c_i)}{[1-\Lambda(x_{i1}\beta+c_i)]\,\Lambda(x_{i2}\beta+c_i) + \Lambda(x_{i1}\beta+c_i)\,[1-\Lambda(x_{i2}\beta+c_i)]} = \Lambda[(x_{i2}-x_{i1})\beta] \]
Similarly, P (yi1 = 1|xi , ci , ni = 1) = Λ(−(xi2 − xi1 )β) = 1 − Λ((xi2 − xi1 )β)
⇒ The conditional log likelihood for observation i is
li (β) = 1[ni = 1](wi log Λ((xi2 − xi1 )β) + (1 − wi ) log(1 − Λ((xi2 − xi1 )β)))......(5.81)
where wi = 1 if (yi1 = 0, yi2 = 1)
and wi = 0 if (yi1 = 1, yi2 = 0)
The CMLE is obtained by maximizing the sum of the li (β) across i
→ equation (5.81) is just a standard cross-sectional logit of wi on (xi2 − xi1 ) using observations for which ni = 1
→ often called the fixed effects logit estimator or the conditional logit estimator
→ the name "fixed effects" is somewhat confusing, because the estimator does not treat the ci as parameters to estimate
⇒ by contrast, the MLE of β that is obtained by maximizing the log likelihood over β and the ci has probability limit 2β when T = 2
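The T = 2 conditional logit can be sketched in a few lines. The simulation design is hypothetical: a scalar covariate that is correlated with ci by construction, and a small Newton routine `logit_fit` written for this sketch (not a library function). The estimator is the cross-sectional logit of wi on (xi2 − xi1 ), without an intercept, using only observations with ni = 1.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_fit(X, w, n_iter=50):
    """Logit MLE by Newton's method; no intercept is added."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ b)
        grad = X.T @ (w - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

# Simulated two-period panel (hypothetical design).
rng = np.random.default_rng(0)
N, beta = 10000, 1.0
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, 2))          # x_it is correlated with c_i
y = (rng.uniform(size=(N, 2)) < logistic(beta * x + c[:, None])).astype(float)

switch = y.sum(axis=1) == 1                       # keep observations with n_i = 1
w = y[switch, 1]                                  # w_i = 1 if (y_i1, y_i2) = (0, 1)
dx = (x[switch, 1] - x[switch, 0])[:, None]       # x_i2 - x_i1
beta_hat = logit_fit(dx, w)[0]
```

Although ci is correlated with xit and is never estimated, beta_hat should be close to the true value, because ci cancels from the conditional probability.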
♠For general T ,
\[ P(y_{i1}=y_1, \ldots, y_{iT}=y_T \mid x_i, c_i, n_i=n) = \frac{P(y_{i1}=y_1, \ldots, y_{iT}=y_T \mid x_i, c_i)}{P(n_i=n \mid x_i, c_i)} \]
• P (yi1 = y1 , ..., yiT = yT |xi , ci ) = P (yi1 = y1 |xi , ci ) · · · P (yiT = yT |xi , ci ) by the conditional independence assumption
• P (ni = n|xi , ci ) is the sum of the probabilities of all possible outcomes of yi such that ni = n
→ construct li (β)
Drawbacks:
1. we cannot estimate the partial effects on the response probabilities → they depend on ci
2. we cannot estimate APEs → E[Λ(xt β + ci )] requires specifying a distribution for ci
3. consistency requires the conditional independence assumption (5.67)
♠Ex.5.5 Panel Data Models for Women’s Labor Force Participation
The key explanatory variables are kidit (number of children under 18) and lhincit (husband’s
The findings reveal a consistent pattern:
allowing unobserved heterogeneity to be correlated with kidit and lhincit has important effects
on the estimated APE’s.
Table 5.4 contains a simple summary of the strengths and weaknesses of each approach. Each
method is evaluated by answers to five questions
(1) Is the response probability contained in the unit interval?
(2) Is the distribution of heterogeneity given the covariates, D(ci |xi ), restricted?
(3) Is serial dependence allowed after accounting for ci ?
(4) Are the partial effects at the mean heterogeneity identified?
(5) Are the APEs identified?
♠Ex5.5 We include the time-constant variables
(1) educ (years of schooling in the first period)
(2) black (a binary race indicator)
(3) age (age in the first period)
estimated coefficients: kidsit −0.039, lhincit −0.009 (marginally statistically significant)
→ Each child is estimated to reduce the labor force participation probability by about 0.039, while a 10% increase in a husband’s income lowers the probability by only about 0.0009
→ using probit, the APEs are much higher
→ As we discussed, the coefficient magnitudes are difficult to interpret.
5.8.4 Dynamic Unobserved Effects Models
P (yit = 1|yi,t−1 , . . . , yi0 , zi , ci ) = G(zit δ + ρyi,t−1 + ci )
• But economists are interested in whether there is state dependence (that is, ρ ≠ 0 in equation (5.74)) after controlling for the unobserved heterogeneity, ci .
• Our need to integrate c out of the distribution raises the issue of how we treat the initial observations, yi0 ; this is usually called the initial conditions problem.
♠Suppose we date our observations starting at t = 0, so that the yi0 is the first observation
on y.
The dynamic unobserved effects model
P (yit = 1|yit−1 , ..., yi0 , zi , ci ) = G(zit δ + ρyit−1 + ci )
G ← probit or logit
where zit is a vector of contemporaneous explanatory variables and zi = (zi1 , ..., ziT )
1. zit are assumed to satisfy a strict exogeneity assumption: zi appears in the conditioning set, but only zit appears on the right-hand side
2. the probability of success at time t is allowed to depend on I. the outcome in t − 1 as well as II. unobserved heterogeneity
Of particular interest is the hypothesis H0 : ρ = 0
Under this null, the response probability at time t does not depend on past outcomes once ci (and zi ) have been controlled for.
Economists are interested in whether there is state dependence after controlling for the unobserved heterogeneity ci
→ in addition to an unobserved effect, behavior may depend on past observed behavior
♠Write
\[ f(y_1, \ldots, y_T \mid y_0, z, c; \beta) = \prod_{t=1}^{T} f(y_t \mid y_{t-1}, \ldots, y_1, y_0, z_t, c; \beta) = \prod_{t=1}^{T} G(z_t\delta + \rho y_{t-1} + c)^{y_t}\,[1 - G(z_t\delta + \rho y_{t-1} + c)]^{1-y_t} \qquad (5.85) \]
→ with fixed-T asymptotics, treating the ci as parameters does not yield a consistent estimator of β
→ the incidental parameters problem is even more severe in dynamic models
⇒ What we should do is integrate out the unobserved effect c.
♠Our need to integrate c out of the distribution raises the issue of how we treat the initial observations, yi0 ; this is usually called the initial conditions problem
One possibility:
Treat each yi0 as a nonstochastic starting position for each i
→ equation (5.85) can be integrated against the density of c to obtain the density of (y1 , y2 , ..., yT ) given z ↔ this density also depends on y0 through f (y1 |y0 , c, z1 ; β)
♠It does "not" make sense to assume yi0 and ci are independent
ex: yi0 : an employment indicator in the initial post-graduation period
ci contains unobserved attributes that affect yit in periods t ≥ 1
→ It is almost certain that an individual’s initial employment status, yi0 , is related to ci
Another possibility:
I. specify a density for yi0 given (zi , ci ) → difficult to find
II. multiply this density by equation (5.85) to obtain f (y0 , y1 , y2 , ..., yT |z, c; β, γ)
III. specify a density for ci given zi , h(c|z; α)
IV. f (y0 , y1 , y2 , ..., yT |z, c; β, γ) is integrated against the density h(c|z; α)
Heckman (1981) suggests 1. approximating the conditional density of yi0 given (zi , ci ) and
then 2. specifying a density for ci given zi
For example: 1. → a probit model with success probability Φ(η + zi Π + γci ) for yi0
2. → specify the density of ci given zi as normal
→ Then, multiplied by eq. (5.85), and c can be integrated out to approximate the density of
(yi0 , yi1 , ..., yiT ) given zi
If we can find the density of (yi1 , yi2 , ..., yiT ) given (yi0 , zi ), in terms of β and other parameters, then we can use standard CMLE methods.
→ To obtain f (y1 , y2 , ..., yT |yi0 , zi ), we need to propose a density for ci given (yi0 , zi )
Given a density h(c|y0 , z; γ),
\[ f(y_1, \ldots, y_T \mid y_0, z; \theta) = \int_{-\infty}^{\infty} f(y_1, \ldots, y_T \mid y_0, z, c; \beta)\, h(c \mid y_0, z; \gamma)\, dc \]
← The integral can be replaced with a weighted average if the distribution of c is discrete.
♠When G = Φ in (5.84), a convenient choice for h(c|y0 , z; γ) is Normal(ψ + ξ0 yi0 + zi ξ, σa²)
→ ci = ψ + ξ0 yi0 + zi ξ + ai , where ai ∼ Normal(0, σa²) is independent of (yi0 , zi )
yit = 1[ψ + zit δ + ρyi,t−1 + ξ0 yi0 + zi ξ + ai + eit > 0]
→ two assumptions
I. yit given (yi,t−1 , ..., yi0 , zi , ai ) follows a probit model
II. ai given (yi0 , zi ) is distributed as Normal(0, σa²)
⇒ the density of (yi1 , ..., yiT ) given (yi0 , zi ) has exactly the form of equation (5.70), where xit = (1, zit , yi,t−1 , yi0 , zi ), with a and σa replacing c and σc , respectively:
\[ f(y_{i1}, \ldots, y_{iT} \mid x_i; \theta) = \int_{-\infty}^{\infty} \Big[ \prod_{t=1}^{T} f(y_t \mid x_{it}, a; \beta) \Big] \frac{1}{\sigma_a}\, \phi(a/\sigma_a)\, da \]
→ Conveniently, this finding means that we can use standard RE probit software to estimate
ψ, δ, ρ, ξ0 , ξ and σa2
We simply expand the list of explanatory variables to include yi0 and zi in each time period.
♠Estimating APEs: we average out 1. the initial condition along with 2. the leads and lags of all strictly exogenous variables
Let zt and yt−1 be given values of the explanatory variables. The ASF is
\[ E[\Phi(z_t\delta + \rho y_{t-1} + c_i)] = E[\Phi(\psi_a + z_t\delta_a + \rho_a y_{t-1} + \xi_{a0} y_{i0} + z_i\xi_a)], \]
which is consistently estimated by
\[ \widehat{ASF}(z_t, y_{t-1}) = N^{-1}\sum_{i=1}^{N} \Phi(\hat\psi_a + z_t\hat\delta_a + \hat\rho_a y_{t-1} + \hat\xi_{a0} y_{i0} + z_i\hat\xi_a), \]
where the subscript a denotes multiplication by (1 + σ̂a²)^{−1/2} and ψ̂, δ̂, ρ̂, ξ̂0 , ξ̂, σ̂a² are the CMLEs
We can take 1. derivatives of this expression with respect to continuous elements of zt , or 2. differences with respect to discrete elements.
→ Of particular interest is to alternatively set yt−1 = 1 and yt−1 = 0 and obtain the change in the probability that yit = 1 when yt−1 goes from zero to one
♠Example 5.6 Dynamic Women’s LFP Equation
P (lf pit = 1|kidsit , lhincit , lf pit−1 , ci )
1. One lag of labor force participation is assumed to suffice for the dynamics
2. {(kidsit , lhincit ) : t = 1, ..., T } is assumed to be strictly exogenous conditional on ci .
RE probit gives ρ̂ = 1.541, even after controlling for unobserved heterogeneity
But to obtain the size of the effect, we compute the APE for lf pt−1
→ The calculation involves averaging Φ(ψ̂a + zit δ̂a + ρ̂a + ξ̂a0 yi0 + zi ξ̂a ) − Φ(ψ̂a + zit δ̂a + ξ̂a0 yi0 + zi ξ̂a ) across all t and i
→ The APE = 0.260 (the panel bootstrap standard error is 0.026 with 500 replications). In other words, averaged across all women and all time periods, the probability of being in the labor force at time t is about 0.26 higher if the woman was in the labor force at time t − 1 than if she was not.
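The averaging step for the state dependence APE can be sketched as follows. All inputs are hypothetical placeholders: a scalar zit , simulated data, and made-up values standing in for the scaled CMLEs. This is not a reproduction of the example's estimates.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def ape_lagged_y(z, y0, psi_a, delta_a, rho_a, xi_a0, xi_a):
    """Average over i and t of Phi(index with y_{t-1} = 1) - Phi(index with y_{t-1} = 0).
    z is N x T (scalar z_it), y0 has length N, xi_a has length T."""
    het = psi_a + xi_a0 * y0 + z @ xi_a        # terms that stand in for c_i
    base = delta_a * z + het[:, None]          # N x T index with y_{t-1} = 0
    return float(np.mean(norm_cdf(base + rho_a) - norm_cdf(base)))

rng = np.random.default_rng(0)
N, T = 200, 4
z = rng.normal(size=(N, T))                    # simulated z_it
y0 = rng.integers(0, 2, size=N).astype(float)  # simulated initial conditions

# Hypothetical scaled estimates.
psi_a, delta_a, rho_a, xi_a0 = -0.3, 0.4, 0.9, 0.6
xi_a = np.full(T, 0.05)

ape = ape_lagged_y(z, y0, psi_a, delta_a, rho_a, xi_a0, xi_a)
ape_no_state_dep = ape_lagged_y(z, y0, psi_a, delta_a, 0.0, xi_a0, xi_a)
```

Standard errors for such an APE are typically obtained by a panel bootstrap that resamples cross-sectional units, as in the example above.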
♠It is instructive to compare the APE with the estimate of a dynamic probit model that
ignores ci
→ ρ̂ = 2.876 much higher
APE for state dependence ≈ 0.837
→ Therefore, in this example, much of the persistence in labor participation of married women
is accounted for by the unobserved heterogeneity. There is some state dependence, but its value
is much smaller than a simple dynamic probit indicates
5.8.5 Semiparametric Approaches
♠The previous methods assumed
1. strict exogeneity
2. sequential exogeneity
→ Use probit models to account for 1. unobserved heterogeneity and 2. contemporaneous endogeneity
Two cases:
(1) the endogenous explanatory variable has a conditional normal dist
(2) the endogenous explanatory variable follows a reduced form probit
yit1 = 1[zit1 δ1 + α1 yit2 + ci1 + uit1 ≥ 0]...(5.86)
uit1 |zi , ci1 ∼ N ormal(0, 1)
yit1 : the binary response
yit2 : the endogenous explanatory variable
where {uit1 : t = 1, ..., T } is the sequence of idiosyncratic errors and zi = (zi1 , ..., ziT ) is the sequence of strictly exogenous variables conditional on ci1
Notice that uit1 is assumed to be independent of (zi , ci1 )
Our focus will be on estimating APEs
ASF (zt1 , yt2 ) = Eci1 [Φ(zt1 δ1 + α1 yt2 + ci1 )] ← the expected value with respect to the distribution of ci1
ci1 = ψ1 + z̄i ξ1 + ai1 , ai1 |zi ∼ Normal(0, σa1²)
← z̄i contains the time averages of all strictly exogenous variables (except any aggregate time effects, such as period dummies)
← (z̄i1 , z̄i2 ) the time average of zit2 (all elements of z̄i )
Plugging into (5.86)
yit1 = 1[zit1 δ1 + α1 yit2 + ψ1 + z̄i ξ1 + ai1 + uit1 ≥ 0]
≡ 1[zit1 δa1 + αa1 yit2 + ψa1 + z̄i ξa1 + eit1 ≥ 0]......(5.89)
where eit1 ≡ (ai1 + uit1 )/(1 + σa1²)^{1/2} has a standard normal distribution conditional on zi , and the subscript a denotes division by (1 + σa1²)^{1/2}
♠The average structural function
E(z̄i ,eit1 ) {1[zt1 δa1 + αa1 yt2 + ψa1 + z̄i ξa1 + eit1 ≥ 0]} = Ez̄i [Φ(zt1 δa1 + αa1 yt2 + ψa1 + z̄i ξa1 )]
I. yit2 continuous: a reduced form for yit2 is
yit2 = zit δ2 + ψ2 + z̄i ξ2 + vit2 ...(5.90), vit2 |zi ∼ Normal(0, τ2²)
← zit = (zit1 , zit2 ), where zit2 contains the instruments for yit2 omitted from (5.89)
→ use pooled MLE directly (pooled IV probit) on equations (5.89) and (5.90)
II. yit2 binary:
yit2 = 1[zit δ2 + ψ2 + z̄i ξ2 + vit2 ≥ 0]
→ apply pooled bivariate probit
♠Pooled estimation is very convenient because any routine that allows estimation of the
models for cross section data can be used for panel data, provided robust standard errors and
test statistics are computed to account for the neglected time dependence.
MLE jointly across time would require specification of the joint distribution of {(eit1 , eit2 ) :
t = 1, ..., T } , and not just the bivariate distribution of (eit1 , eit2 ) for each t.
→ Assuming independence across t would be unrealistic, and allowing for realistic correlation
over time would be computationally expensive.
♠When yit2 is continuous, a control function approach is also available
yit1 = 1[zit1 δg1 + αg1 yit2 + θg1 vit2 + ψg1 + z̄i ξg1 + γit1 ≥ 0]
→ 1. pooled OLS of yit2 on 1, zit , z̄i ; obtain the residuals v̂it2 , which replace vit2
2. pooled probit of yit1 on zit1 , yit2 , v̂it2 , 1, z̄i
3. endogeneity test: a t test of H0 : θg1 = 0
5.8.6 Cluster Samples
♠Under strict exogeneity of the explanatory variables, it is possible to consistently estimate β up to scale under very weak assumptions.
Manski (1987) derives an objective function ("the maximum score estimator") that identifies β up to scale in the T = 2 case when ei1 and ei2 in
y∗it = xit β + ci + eit , yit = 1[y∗it > 0]
are identically distributed conditional on (xi1 , xi2 , ci ) and xit is strictly exogenous
Honoré and Kyriazidou (2000)
→ unobserved effects logit model with a lagged dependent variable and strictly exogenous variables, without making distributional assumptions about the unobserved effect.
Drawbacks: 1. the estimator does not converge at the √N rate (CAN)
2. time dummies are ruled out
3. it is not possible to estimate the APEs
♠A middle ground between completely unrestricted and parametric assumptions
Altonji and Matzkin (2005) ”exchangeability” assumption
D(ci |xi ) = D(ci |x̄i )
5.9 Multinomial Response Models
5.9.1 Multinomial Logit
• The logit model for binary outcomes extends to the case where the unordered response
has more than two outcomes.
• The multinomial logit (MNL) model has response probabilities
\[ P(y = j \mid x) = \exp(x\beta_j) \Big/ \Big[ 1 + \sum_{h=1}^{J} \exp(x\beta_h) \Big], \quad j = 1, \ldots, J, \]
where βj is K × 1, j = 1, . . . , J.
• Example 5.4 (School and Employment Decisions for Young Men):
5.9.2 Probabilistic Choice Models
y∗ij = xij β + aij , j = 0, . . . , J
• As shown by McFadden (1974), if the aij , j = 0, . . . , J, are independently distributed with cdf F (a) = exp[− exp(−a)] (the type I extreme value distribution), then
\[ P(y_i = j \mid x_i) = \exp(x_{ij}\beta) \Big/ \Big[ \sum_{h=0}^{J} \exp(x_{ih}\beta) \Big], \quad j = 0, \ldots, J \]
• The response probabilities in equation (5.80) constitute what is usually called the conditional logit model.
• This is called the independence from irrelevant alternatives (IIA) assumption because it implies that adding another alternative or changing the characteristics of a third alternative does not affect the relative odds between alternatives j and h.
• A different approach to relaxing IIA is to specify a hierarchical model.
• The most popular of these is called the nested logit model.
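The IIA property of the conditional logit probabilities can be verified directly. In the sketch below the coefficients and alternative-specific characteristics are hypothetical; the point is only that the odds between alternatives 0 and 1 depend on xi1 and xi0 alone.

```python
import numpy as np

def conditional_logit_probs(X, beta):
    """P(y = j | x_i) = exp(x_ij beta) / sum_h exp(x_ih beta).
    X has one row of characteristics per alternative."""
    v = X @ beta
    ev = np.exp(v - v.max())        # subtract the max for numerical stability
    return ev / ev.sum()

beta = np.array([1.0, -0.5])        # hypothetical coefficients
X = np.array([[0.2, 1.0],           # alternative 0
              [0.8, 0.5],           # alternative 1
              [0.5, 2.0]])          # alternative 2

p = conditional_logit_probs(X, beta)
odds_before = p[1] / p[0]

# Change the characteristics of the third alternative.
X_changed = X.copy()
X_changed[2] = [3.0, 0.1]
p_changed = conditional_logit_probs(X_changed, beta)
odds_changed = p_changed[1] / p_changed[0]

# Add a fourth alternative.
X_added = np.vstack([X, [1.5, 1.5]])
p_added = conditional_logit_probs(X_added, beta)
odds_added = p_added[1] / p_added[0]
```

All three odds equal exp[(xi1 − xi0 )β]. Hierarchical models such as nested logit relax exactly this restriction.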
5.10 Ordered Response Models
5.10.1 Ordered Logit and Ordered Probit
• Another kind of multinomial response is an ordered response.
• The ordered probit model for y (conditional on explanatory variables x) can be derived
from a latent variable model.
• Let α1 < α2 < · · · < αJ be unknown cut points (or threshold parameters), and define
y = 0 if y ∗ ≤ α1
y = 1 if α1 < y ∗ ≤ α2
y = J if y ∗ > αJ
• Other distribution functions can be used in place of Φ. Replacing Φ with the logit
function, Λ, gives the ordered logit model.
• Example 5.5 (Asset Allocation in Pension Plans):
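The response probabilities implied by the cut-point definition can be written out in a short sketch. It assumes the usual latent variable model y∗ = xβ + e with e|x ∼ Normal(0, 1), so that P (y = 0|x) = Φ(α1 − xβ), P (y = j|x) = Φ(αj+1 − xβ) − Φ(αj − xβ) for 0 < j < J, and P (y = J|x) = 1 − Φ(αJ − xβ). The cut points and index value are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """Return P(y = j | x) for j = 0, ..., J given the index xb = x beta
    and increasing cut points cuts = (alpha_1, ..., alpha_J)."""
    cdf = norm_cdf(np.asarray(cuts, dtype=float) - xb)   # Phi(alpha_j - x beta)
    upper = np.append(cdf, 1.0)
    lower = np.insert(cdf, 0, 0.0)
    return upper - lower

cuts = [-1.0, 0.0, 1.5]     # hypothetical alpha_1 < alpha_2 < alpha_3
xb = 0.4                    # hypothetical value of x beta
probs = ordered_probit_probs(xb, cuts)
```

Replacing norm_cdf with the logistic cdf would give the ordered logit probabilities.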
5.10.2 Applying Ordered Probit to Interval-Coded Data
• When the quantitative outcome we would like to explain is grouped into intervals, we
say that we have interval-coded data.
• However, because we only observe whether, say, wealth falls into one of several cells, we
have a data-coding problem.
• Fortunately, some econometrics packages have a feature for interval regression, which
is exactly ordered probit with the cut points fixed and with β and σ 2 estimated by
maximum likelihood.