Models with Limited Dependent Variables
Judge et al. (1988), Ch. 19: Qualitative and Limited Dependent Variable Models
Sometimes dependent variables can be observed only over a limited range. The case was made famous by Tobin's (1958) model of the demand for consumer durables. In this model the utility-maximizing amount of expenditure on a durable good, $y_i^*$, is taken to be explained by the model

(4.1)   $y_i^* = x_i\beta + e_i, \qquad i = 1, \ldots, T$

where $x_i$ is a $(1 \times K)$ vector of explanatory variables and $e_i \sim N(0, \sigma^2)$ is independent of the other errors.
The problem for a household is that there is some minimum level of expenditure required to purchase a durable good, say $c$. For any household, the amount of expenditure actually observed, $y_i$, is

(4.2)   $y_i = \begin{cases} y_i^* & \text{if } y_i^* > c \\ 0 \text{ (or } c\text{)} & \text{if } y_i^* \le c \end{cases}$

Thus the observed amount of expenditure may be the desired level, or it may be $0$ or $c$, which simply reflects that $y_i^* \le c$. Assuming $c$ is known and the same for all households, it is customary to rewrite the model by subtracting $c$ from both sides of the model, resulting in the standard censored regression model:

(4.3)   $y_i = \begin{cases} x_i\beta + e_i & \text{if } x_i\beta + e_i > 0 \\ 0 & \text{otherwise} \end{cases}$
An important characteristic of a censored regression model is that we actually know the values of the explanatory variables $x_i$ for the households or individuals who do not make a purchase. If no observations are available on individuals who do not make a purchase, the sample is said to be truncated.
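As a concrete illustration, the following Stata sketch simulates a censored sample of the form (4.3); the data-generating values (intercept -2, slope 1, error standard deviation 2) are hypothetical and chosen only for the example:

clear
set obs 200
set seed 12345
gen x = 10*runiform()
gen ystar = -2 + 1*x + rnormal(0,2)   // latent desired expenditure y*
gen y = max(ystar, 0)                 // observed y is censored from below at zero

Observations with ystar at or below zero are recorded as zero, but their x values remain in the data set, so the sample is censored rather than truncated.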
Properties of the Least Squares Estimator in the Tobit (Censored) Regression Model
Suppose we have $T$ total sample observations, where $T_0$ is the number of observations for which $y_i = 0$ and $T_1$ is the number for which $y_i > 0$. To see the statistical problem caused by censoring, suppose we ignore the $T_0$ observations. For the remaining $T_1$ observations we have complete observations and can use the least squares estimator to estimate $\beta$. Unfortunately, the least squares estimator is biased and inconsistent in this context.
To see this, consider the expectation of the observed values of $y_i$ conditional on the fact that $y_i > 0$:

(4.4)   $E[y_i \mid y_i > 0] = x_i\beta + E[e_i \mid y_i > 0]$

If the conditional expectation of the error term were zero, there would be no problem, because a least squares regression on the $T_1$ observations would provide an unbiased estimator of $\beta$. But this is not the case, because

(4.5)   $E[e_i \mid y_i > 0] = E[e_i \mid x_i\beta + e_i > 0] = E[e_i \mid e_i > -x_i\beta] > 0$

i.e., the conditional expectation of the error term is positive. If $e_i \sim N(0, \sigma^2)$, its p.d.f. $f(e_i)$ has mean zero and is symmetric about zero. However, if the values of $e_i$ are restricted to be greater than $-x_i\beta$, then the p.d.f. is no longer centered on zero, nor symmetric, and its mathematical expectation is positive.
The p.d.f. of the resulting truncated normal random variable is:

(4.6)   $f[e_i \mid e_i > -x_i\beta] = \dfrac{f(e_i)}{\int_{-x_i\beta}^{\infty} f(t)\,dt} \qquad \text{for } e_i > -x_i\beta$

The denominator of this expression is the area under $f(e_i)$ for $e_i > -x_i\beta$, which guarantees that the area under the truncated p.d.f. equals 1.
To illustrate the effect of the positive conditional expectation of the error term, consider a simple case in which $E[y_i^*] = x_i\beta = \beta_1 + \beta_2 x_{i2}$, as illustrated in Figure 19.3 of Judge et al. (1988). The nonzero values of the dependent variable will be scattered around $E[y_i \mid y_i > 0]$, not around the population regression function $E[y_i^*]$. Consequently, the least squares estimators of $\beta_1$ and $\beta_2$ will be biased and inconsistent.
In fact, it can be shown that the conditional expectation in (4.5) is

(4.7)   $E[e_i \mid e_i > -x_i\beta] = \sigma\,\dfrac{f(x_i\beta/\sigma)}{F(x_i\beta/\sigma)}$

where $f(x_i\beta/\sigma)$ and $F(x_i\beta/\sigma)$ are the standard normal p.d.f. and c.d.f. evaluated at $x_i\beta/\sigma$. Consequently, in the regression model, if $y_i > 0$,

(4.8)   $y_i = x_i\beta + e_i = x_i\beta + \sigma\,\dfrac{f(x_i\beta/\sigma)}{F(x_i\beta/\sigma)} + u_i$

and applying the usual LS procedure omits the term $\sigma f(x_i\beta/\sigma)/F(x_i\beta/\sigma)$, which is not independent of $x_i$, leading to inconsistency and bias of the LS estimators.

It can also be shown that applying OLS to all $T$ observations is an unsatisfactory procedure, since the unconditional expectation of $y_i$ is

(4.9)   $E[y_i] = F(x_i\beta/\sigma)\,(x_i\beta) + \sigma\,f(x_i\beta/\sigma)$

i.e., applying least squares to all the observations would not lead to a consistent estimator of the population regression function $E[y_i^*] = x_i\beta$.
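Continuing the simulated data sketched after (4.3), the bias is easy to see numerically: OLS on the positive observations and OLS on the full censored sample both tend to understate the true slope, while Tobit, which uses the censoring information, recovers it (an illustrative sketch, not a proof):

reg y x if y > 0        // OLS on the positive (truncated) observations: slope biased toward zero
reg y x                 // OLS on all observations, zeros included: also biased
tobit y x, ll(0)        // maximum likelihood using the censoring information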
Maximum Likelihood Estimation of the Tobit Model
To estimate the parameters $\beta$ and $\sigma^2$ consistently, we can apply maximum likelihood procedures. The likelihood of the sample has a component for the observations that are positive and a component for those that are 0. Define $f_i$ and $F_i$ to be the p.d.f. and c.d.f. of a standard normal random variable evaluated at $z_i = x_i\beta/\sigma$.

For the observations with $y_i = 0$ all we know is that $x_i\beta + e_i \le 0$, or $e_i \le -x_i\beta$, so

$\Pr[y_i = 0] = \Pr[e_i \le -x_i\beta] = \Pr\left[\dfrac{e_i}{\sigma} \le -\dfrac{x_i\beta}{\sigma}\right] = 1 - F_i$

since the normal distribution is symmetric. For $y_i > 0$, $\Pr[y_i > 0] = \Pr[x_i\beta + e_i > 0] = \Pr\left[\dfrac{e_i}{\sigma} > -\dfrac{x_i\beta}{\sigma}\right] = F_i$, and the "usual" density component of the likelihood function is present. Hence, denoting the product over the zero observations by $\prod_0$ and the product over the positive observations by $\prod_1$, the likelihood function is given by

$\ell = \prod_0 (1 - F_i) \prod_1 (2\pi\sigma^2)^{-1/2} \exp\{-(y_i - x_i\beta)^2 / 2\sigma^2\}$

The log-likelihood function is

(4.10)   $L = \ln \ell = \sum_0 \ln(1 - F_i) - (T_1/2)\ln 2\pi - (T_1/2)\ln \sigma^2 - \sum_1 (y_i - x_i\beta)^2 / 2\sigma^2$

Maximizing this log-likelihood function can be carried out using the Newton-Raphson method or the method of scoring (interested students can refer to Judge et al. (1988) for more detail on these methods).
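In practice the built-in tobit command maximizes a likelihood of exactly this form. Purely for illustration, (4.10) can also be coded by hand with Stata's ml machinery; the sketch below uses the variable names y and x2 from the judgetobit.dta example at the end of these notes and parameterizes $\ln\sigma$ for numerical stability:

* a hand-coded Tobit likelihood evaluator (illustrative sketch)
program define mytobit_lf
    args lnf xb lnsigma
    * censored observations contribute Pr[y = 0] = 1 - F(xb/sigma)
    quietly replace `lnf' = ln(normal(-`xb'/exp(`lnsigma'))) if $ML_y1 == 0
    * positive observations contribute the normal density of (y - xb)/sigma
    quietly replace `lnf' = ln(normalden(($ML_y1 - `xb')/exp(`lnsigma'))) - `lnsigma' if $ML_y1 > 0
end
ml model lf mytobit_lf (xb: y = x2) (lnsigma:)
ml maximize        // estimates should match: tobit y x2, ll(0)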
For interpretation and use of the Tobit model, one must take care when using the model to predict. Please be reminded that in this context:

$E[y_i^*] = x_i\beta$

$E[y_i \mid y_i > 0] = x_i\beta + \sigma\,\dfrac{f_i}{F_i}$

$E[y_i] = F_i \cdot E[y_i \mid y_i > 0]$

If we want to know the effect of a unit change of an independent variable $x_j$ on the expected value of the dependent variable, this could be

$\dfrac{\partial E[y_i^*]}{\partial x_{ij}} = \beta_j$

$\dfrac{\partial E[y_i \mid y_i > 0]}{\partial x_{ij}} = \beta_j\left[1 - (x_i\beta/\sigma)\,\dfrac{f_i}{F_i} - \left(\dfrac{f_i}{F_i}\right)^2\right]$

$\dfrac{\partial E[y_i]}{\partial x_{ij}} = F_i\,\beta_j$
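After tobit, Stata's margins command reports the marginal effects corresponding to each of these three quantities; for example, with the judgetobit.dta variables (a sketch):

tobit y x2, ll(0)
margins, dydx(x2)                      // effect on the latent index E[y*]: beta_j
margins, dydx(x2) predict(e(0,.))      // effect on E[y | y > 0]
margins, dydx(x2) predict(ystar(0,.))  // effect on the censored (unconditional) mean E[y]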
Sample Selection Corrections
Wooldridge (2009) Ch. 17.5
The problem of incidental truncation
 We do not observe $y$ because of the outcome of another variable.
 Example: analysis of the wage offer function in labor economics. The purpose of such an analysis is usually to see how various factors, such as education, affect the wage an individual could earn in the labor force.
 Note that we observe the wage offer only for those who are in the workforce (for whom the wage offer equals the current wage), while for those who are not in the workforce we do not observe the wage offer.
 This creates a potential problem: working status may be systematically correlated with unobservables that affect the wage offer, creating biased estimators of the parameters in the wage offer equation.
 This is a problem of nonrandom sampling (a violation of the random sampling assumption of OLS).
 Certain types of nonrandom sampling do not cause bias or inconsistency in OLS. This is called exogenous sample selection: the sample is chosen on the basis of the independent variables. Suppose we want to estimate a saving function
$saving = \beta_0 + \beta_1 income + \beta_2 age + \beta_3 size + u$
using a survey of people over 35 years of age. While this is not ideal, this nonrandom sample can still yield unbiased and consistent estimators of the parameters in the population model, provided there is enough variation in the independent variables in the subpopulation. This is because the regression function $E(saving \mid income, age, size)$ is the same for any subset of the population described by income, age, or size.
 However, if selection of the sample is based on the dependent variable, the situation is much different. This is the case of endogenous sample selection, and it creates bias in OLS estimation of the population model. Suppose we want to estimate the relationship between individual wealth and several other factors in the population of all adults:
$wealth = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 age + u$
Suppose that only people with wealth below $250,000 are included in the sample. Using this sample, OLS estimates of the parameters in the model will be biased and inconsistent. This is because $E(wealth \mid educ, exper, age)$ is not the same as the expected value conditional on wealth being less than $250,000.
 Cases between these extremes are less clear, so we need to be careful about sample selection bias; a simulated contrast of the two cases is sketched below.
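The following hypothetical Stata simulation (made-up coefficients, wealth measured in thousands) contrasts selection on a regressor with selection on the dependent variable:

clear
set obs 1000
set seed 2468
gen age    = floor(20 + 41*runiform())              // ages 20-60
gen educ   = floor(8 + 13*runiform())               // 8-20 years of schooling
gen wealth = -50 + 10*educ + 2*age + rnormal(0,40)  // true slopes: 10 and 2
reg wealth educ age                    // random sample: consistent
reg wealth educ age if age > 35        // exogenous selection (on a regressor): still roughly consistent
reg wealth educ age if wealth < 250    // endogenous selection (on the dependent variable): biased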
Assume the population model
$y_i = x_i\beta + u_i$
Let $n$ be the size of a random sample from the population. If we could observe $y_i$ and each $x_i$ for all $i$, then OLS could be used. Assume that for some reason either $y_i$ or some of the independent variables are not observed for certain $i$, while for at least some observations we observe the full set of variables.
Define a selection indicator $s_i$ for each $i$ by $s_i = 1$ if we observe all of $(y_i, x_i)$ and $s_i = 0$ otherwise. Thus when $s_i = 1$ we use the observation in the analysis; when $s_i = 0$ the observation is not used.
We are interested in the statistical properties of the OLS estimators using the selected sample, i.e. using only the observations for which $s_i = 1$. Thus we use fewer than $n$ observations, say $n_1$.
Estimation Method
Consider the model

(1.7)   $h_i = \beta_0 + \beta_1 \ln(W_i^f) + \beta_2 Y_i + \beta_3 Z_i + \varepsilon_{1i}, \qquad i = 1, \ldots, N$

Estimation of equation (1.7) is subject to the problem that one only observes positive hours of work and the wage rate for those women who work; for those who do not work, the observed hours of work are zero. Hence the value of the dependent variable in (1.7) is censored from below at zero. Further, if (1.7) is to be fitted to the entire sample, one needs a consistent estimate of the wage rate for the non-working sample. Suppose one knows, or can consistently estimate, what the market wage would be for those who are not working. Then equation (1.7) can be written as

(1.8)   $h_i^* = x_{1i}\beta + \varepsilon_{1i}$

(1.9)   $d_i = 1$ if $h_i^* > 0$;  $d_i = 0$ if $h_i^* \le 0$, $\qquad i = 1, \ldots, N$

(1.10)   $h_i = d_i \cdot h_i^*$

where we use $h_i$ to denote the observed hours of work and $h_i^*$ the desired hours of work (the latent variable). Assume that the disturbance term in model (1.8) is normally distributed and homoskedastic, $\varepsilon_{1i} \sim N(0, \sigma_1^2)$. Then equations (1.8)-(1.10) become the Tobit model proposed by Tobin (1958). It has been well documented, for example in Gourieroux (2000), that in this setting ordinary least squares (OLS) applied to the entire sample, or only to the sub-sample with positive hours of work (the complete sub-sample), yields biased estimates.[1]
In a fully parametric approach, the Tobit model can be consistently estimated by maximum likelihood (MLE). Since the $h_i$ are independent, the Tobit MLE estimates $\hat{b} = (\hat{\beta}, \hat{\sigma}_1^2)$ are obtained by maximizing the log-likelihood function

(1.11)   $\ln L_N(b) = \sum_{i=1}^{N} \left\{ d_i \left[ -\dfrac{1}{2}\ln 2\pi - \dfrac{1}{2}\ln \sigma_1^2 - \dfrac{1}{2\sigma_1^2}(h_i - x_{1i}\beta)^2 \right] + (1 - d_i)\,\ln\!\left[ 1 - \Phi\!\left(\dfrac{x_{1i}\beta}{\sigma_1}\right) \right] \right\}$
It is also known, however, that the simple Tobit model is restrictive, due to the assumption of a normally distributed and homoskedastic error, $\varepsilon_{1i} \sim N(0, \sigma_1^2)$, so that the latent variable $h_i^*$ is distributed as normal, $h_i^* \sim N(x_{1i}\beta, \sigma_1^2)$. When the density is correctly specified, $\hat{b}$ is consistent. But when the error $\varepsilon_{1i}$ is either heteroskedastic or nonnormal, the Tobit MLE is inconsistent.[2] More flexible models relaxing these strong distributional assumptions have been proposed to take nonnormal errors in the Tobit model into account. These methods, however, are beyond the scope of this study.

Since the market wage is only observed for the working sub-sample, consistent estimates of wage rates for those who are not working should be obtained before estimating equation (1.8) with Tobit. This can be done by estimating the parameters of the wage equation using the working sub-sample and imputing the estimated market wage for the non-working sub-sample. However, the working sub-sample may be subject to self-selectivity, and estimates derived from self-selected samples may be biased due to correlation between the independent variables and the stochastic disturbance induced by the sample selection rule.

[1] See, for example, Gourieroux (2000, section 7.3), which shows the biases from applying OLS to the entire sample or to the complete observations of a simple Tobit model. Others, such as Greene (2003, p. 768), suggest that the OLS estimates tend to be smaller in absolute value than the maximum likelihood estimates.
[2] See, for example, Cameron and Trivedi (2005, section 16, p. 538).
Consider a model consisting of a market wage equation and a selection equation fitted to the entire population, $i = 1, \ldots, N$:

(1.12)   $\ln(w_i^f) = x_{2i}\alpha + \varepsilon_{2i}$

(1.13)   $d_i^* = w_i\gamma + u_i$

(1.14)   $d_i = 1$ if $d_i^* > 0$;  $d_i = 0$ if $d_i^* \le 0$

(1.15)   the observed wage is $d_i \cdot \ln(w_i^f)$, i.e. $\ln(w_i^f)$ is observed only when $d_i = 1$

Equation (1.12) is a simple form of equation (1.4), which says that the market wage is determined by schooling years and experience. Equations (1.13) and (1.14) reflect the participation condition (1.5) that participation occurs when the market wage exceeds the reservation wage at zero hours of work. The right-hand side of (1.13) consists of the variables determining the market wage, i.e. schooling years and experience, and variables affecting the reservation wage, such as husband's income, other household non-labor income, and the number of children. Since we allow the regressors of the participation equation, $w_i$, to contain some variables that do not appear among the regressors of the wage equation, $x_{2i}$, the model is identified by an exclusion restriction. Equations (1.14) and (1.15) capture the sample selection: the latent endogenous variable $\ln(w_i^f)$ is observed when $d_i^* > 0$. $d_i^*$ is itself a latent variable; we only observe its sign, i.e. whether $d_i^* > 0$ (the observation is working) or $d_i^* \le 0$ (the observation is not working). The indicator variable $d_i$ equals 1 if $d_i^* > 0$ and zero otherwise. The disturbance terms in equations (1.12) and (1.13) are assumed normally distributed with means zero, variances $\sigma_2^2$ and $\sigma_u^2$, and covariance $\sigma_{2u}$.
Suppose we want to estimate model (1.12) using the sub-sample of working observations with observed market wages. The conditional expectation of $\ln(w_i^f)$ given the selected sample is

(1.16)   $E[\ln(w_i^f) \mid d_i = 1] = E[\ln(w_i^f) \mid d_i^* > 0] = E[\ln(w_i^f) \mid w_i\gamma + u_i > 0] = E[\ln(w_i^f) \mid u_i > -w_i\gamma] = x_{2i}\alpha + E(\varepsilon_{2i} \mid u_i > -w_i\gamma)$

Compared to (1.12), the conditional expectation in (1.16) has the additional term $E(\varepsilon_{2i} \mid u_i > -w_i\gamma)$. If this term is zero, equation (1.16) can be consistently estimated using OLS. In estimation using the selected sample, however, this term is in general not zero. Assuming a joint normal distribution of the disturbances $\varepsilon_{2i}$ and $u_i$ in equations (1.12) and (1.13), such that

$\begin{pmatrix} \varepsilon_{2i} \\ u_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} \sigma_2^2 & \sigma_{2u} \\ \sigma_{2u} & \sigma_u^2 \end{pmatrix} \right)$

with $\sigma_u^2$ normalized to 1, the regression function (1.16) can be written as[3]

(1.17)   $E[\ln(w_i^f) \mid d_i = 1] = x_{2i}\alpha + \rho\sigma_2\,\dfrac{\phi(w_i\gamma)}{\Phi(w_i\gamma)} = x_{2i}\alpha + \beta_\lambda \lambda_i$,

since $E(\varepsilon_{2i} \mid u_i > -w_i\gamma) = \rho\sigma_2\,E(u_i \mid u_i > -w_i\gamma)$ and $E(u_i \mid u_i > -w_i\gamma) = \phi(w_i\gamma)/\Phi(w_i\gamma)$. Here $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density and cumulative distribution functions of the standard normal distribution, $\rho$ is the correlation coefficient, $\lambda_i = \phi(w_i\gamma)/\Phi(w_i\gamma)$ is the inverse Mills ratio, and $\beta_\lambda = \rho\sigma_2$ is its coefficient. In this case $\sigma_u^2$ cannot be estimated, since we only observe the sign of $d_i^*$, so it is set equal to 1.

[3] The least squares correction for selectivity bias is outlined in Appendix A.1.2. The conditional expectation (1.17) is derived by this approach.
Estimation of equation (1.17) can proceed by first estimating (1.13) by maximum likelihood probit, applied to the entire sample of $N$ observations, to obtain the parameters of the participation function, $\hat\gamma$. Using these estimates, $\hat\lambda_i = \phi(w_i\hat\gamma)/\Phi(w_i\hat\gamma)$ is computed for each observation; the resulting estimate of $\lambda_i$ is consistent. Then consistent estimates of the wage parameters, $\hat\alpha$ and $\hat\beta_\lambda$, are obtained by regressing $\ln(w_i^f)$ on $x_{2i}$ and $\hat\lambda_i$ with least squares, using the working sub-sample of $N_1\,(< N)$ observations. This is the standard two-step procedure for sample selection proposed by Heckman (1979).
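With the mroz.dta data used in the examples at the end of these notes, the two steps can be carried out by hand; the following sketch (selection regressors taken from those examples) should closely reproduce the coefficients from heckman ..., twostep:

* step 1: participation probit on the full sample (after loading mroz.dta)
probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6
predict zg, xb                            // fitted index w_i * gamma_hat
gen lambda = normalden(zg)/normal(zg)     // estimated inverse Mills ratio lambda_i
* step 2: OLS wage equation on the working sub-sample, adding lambda as a regressor
reg lwage educ exper expersq lambda if inlf == 1

Note that the conventional OLS standard errors in the second step are not valid, because $\hat\lambda_i$ is itself estimated; the heckman command with the twostep option applies the appropriate correction.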
Least Squares Correction for Selectivity Bias[4]
Let us assume the regression model of interest is

(1)   $y_i = x_i\beta + u_i$

and the selection equation

(2)   $z_i^* = w_i\gamma + v_i$

and that $y_i$ is observed if and only if $z_i = 1$, where

(3)   $z_i = 1$ if $z_i^* > 0$ and $z_i = 0$ if $z_i^* \le 0$.

Assume

$E(u_i) = 0$, $\quad E(v_i) = \mu_v$,
$E(u_i u_j) = \sigma_u^2$ if $i = j$, and $= 0$ otherwise,
$E[(v_i - \mu_v)(v_j - \mu_v)] = \sigma_v^2$ if $i = j$, and $= 0$ otherwise,
$E(u_i v_j) = \rho\sigma_u\sigma_v$ if $i = j$, and $= 0$ otherwise,
$E(u_i \mid v_i) = \rho\sigma_u (v_i - \mu_v)/\sigma_v$.

[4] This exposition follows Olsen, Randall J. (1980). "A Least Squares Correction for Selectivity Bias." Econometrica, 48(7): 1815-1820.
The last line comes from the assumed joint distribution of $u_i$ and $v_i$, for which

$E(u_i \mid v_i) = \mu_u + \rho\sigma_u (v_i - \mu_v)/\sigma_v$
$\operatorname{Var}(u_i \mid v_i) = \sigma_u^2(1 - \rho^2)$.

By assuming $E(u_i \mid v_i)$ is linear in $v_i$, we can use the decomposition

(4)   $u_i = \rho\sigma_u (v_i - \mu_v)/\sigma_v + \varepsilon_i$

where $\operatorname{Var}(\varepsilon_i) = \sigma_u^2(1 - \rho^2)$ and $\varepsilon_i$ is uncorrelated with $v_i$. When we estimate regression (1) on the sub-sample with observed $y_i$, we seek to calculate the conditional mean $E(y_i \mid x_i, z_i = 1)$, or $E(y_i \mid x_i, v_i > -w_i\gamma)$. Since

(5)   $y_i = x_i\beta + \rho\sigma_u (v_i - \mu_v)/\sigma_v + \varepsilon_i$

it follows that

(6)   $E(y_i \mid x_i, v_i > -w_i\gamma) = x_i\beta + \rho\sigma_u E(v_i \mid v_i > -w_i\gamma)/\sigma_v - \rho\sigma_u \mu_v/\sigma_v$.

The corresponding conditional variance is

(7)   $\operatorname{Var}(y_i \mid x_i, v_i > -w_i\gamma) = \rho^2\sigma_u^2 \operatorname{Var}(v_i \mid v_i > -w_i\gamma)/\sigma_v^2 + \sigma_u^2(1 - \rho^2)$.

If we also assume that $v_i$ is standard normal, then

(8)   $E(y_i \mid x_i, v_i > -w_i\gamma) = x_i\beta + \rho\sigma_u E(v_i \mid v_i > -w_i\gamma) = x_i\beta + \rho\sigma_u \dfrac{\int_{-w_i\gamma}^{\infty} v_i\,\phi(v_i)\,dv_i}{1 - \Phi(-w_i\gamma)} = x_i\beta + \rho\sigma_u \dfrac{\phi(-w_i\gamma)}{1 - \Phi(-w_i\gamma)} = x_i\beta + \rho\sigma_u \dfrac{\phi(w_i\gamma)}{\Phi(w_i\gamma)} = x_i\beta + \beta_\lambda \lambda_i$

where $\lambda_i = \phi(w_i\gamma)/\Phi(w_i\gamma) = E(v_i \mid v_i > -w_i\gamma)$, $\beta_\lambda = \rho\sigma_u$, and $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal p.d.f. and c.d.f., respectively.
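A quick simulated sanity check of the fact used in (8), $E(v \mid v > -w\gamma) = \phi(w\gamma)/\Phi(w\gamma)$, with $w\gamma$ arbitrarily set to 0.5 (illustrative values only):

clear
set obs 100000
set seed 13579
gen v = rnormal()                     // standard normal selection error
summarize v if v > -0.5               // simulated truncated mean E(v | v > -0.5)
display normalden(0.5)/normal(0.5)    // inverse Mills ratio phi(0.5)/Phi(0.5), approx. 0.509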
An example
 LPM, Logit, Probit Estimates of Labor Force Participation (Data: mroz.dta)
Try these commands in STATA to replicate Table 17.1 in Wooldridge (2009) p.585:
reg inlf nwifeinc educ exper expersq age kidslt6 kidsge6
probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6
logit inlf nwifeinc educ exper expersq age kidslt6 kidsge6
Question:
Using the probit estimates and the calculus approximation, what is the approximate change in
the response probability when exper increases from 10 to 11?
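One way to obtain this calculus approximation in Stata after the probit command above (a sketch; it applies the chain rule through both exper and expersq and averages over the women with exper equal to 10):

predict xbp, xb                                                    // estimated probit index for each woman
gen dPdexper = normalden(xbp)*(_b[exper] + 2*exper*_b[expersq])    // marginal effect of exper via the chain rule
summarize dPdexper if exper == 10                                  // approximate change in the probability from exper = 10 to 11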
 Tobit (Data: mroz.dta)
Try these commands in STATA to replicate Table 17.2 in Wooldridge (2009) p.593:
reg hours nwifeinc educ exper expersq age kidslt6 kidsge6
tobit hours nwifeinc educ exper expersq age kidslt6 kidsge6,ll(0)
 Tobit (Data: judgetobit.dta)
Try these commands in STATA to replicate Table 19.3 in Judge et al. (1988) p.801:
reg y x2
reg y x2 if y>0
tobit y x2, ll(0)
 Sample selection corrections (Data: mroz.dta)
Try these commands in STATA to replicate Table 17.5 in Wooldridge (2009) p.611:
reg lwage educ exper expersq
heckman lwage educ exper expersq, select(educ exper expersq nwifeinc age kidslt6 kidsge6) twostep