Single and Multiple Spell Discrete Time Hazards Models with

Longitudinal and Multilevel
Methods for Multinomial Logit
David K. Guilkey
Focus of this talk:
Unordered categorical dependent variables
Models will be logit based
Empirical example uses data from the Indonesian Family Life
Survey:
Two outcomes:
Binary indicator for whether the respondent uses contraception
Unordered categorical variable for method choice
Data Set Overview
Four waves of data: 1993, 1997, 2000, and 2007
Individual level information on fertility, education, migration
Community and facility level data on health and family planning
providers
Data from 321 enumeration areas – we will consider these
communities
IFLS Longitudinal Sample Size

Initial Participation              Survey Year
Cohort                   1993     1997     2000     2007     total
Wave 1 Cohort            3520     2873     2684     1498     10575
Wave 2 Cohort                     2207     1742     1152      5101
Wave 3 Cohort                              1466      933      2399
Wave 4 Cohort                                       2287      2287
total observations                                           20362
IFLS Summary Statistics

                                    mean       s.d.
Dependent Variables
  Contraceptive Use                 .588       .492
  Method Choice
    no method                       .412
    temporary modern                .397
    long lasting modern             .168
    traditional                     .023
Independent Variables
  highest ed grade school          0.669      0.470
  highest ed high school           0.169      0.375
  highest ed college               0.049      0.217
  age                             34.099      8.842
  muslim                           0.891      0.312
  number of posyandus              7.507      6.251
Observations                      20,000
Basic Model for Longitudinal Logit:
\[ \ln\left[ \frac{P(Y_{ti}=1 \mid \mu_i)}{P(Y_{ti}=0 \mid \mu_i)} \right] = X_{ti}\beta + \alpha P_{ti} + Z_i\delta + \rho\mu_i \]
Where:
Yti: observed binary variable (respondent i from time period t)
Xti: time varying explanatory variables (age and education level)
Pti: time varying program variable (posyandus)
Zi: time invariant regressors (Muslim)
i=1,2,…N (individuals)
t=1,2,…Ti (observations per individual -- unbalanced panel)
Assumptions:
\[ \mu_i \sim N(0,1) \]
for the parametric logit in Stata (xtlogit, melogit, and one variant of
GLLAMM)
and:
\[ E(X_{ti}\mu_i) = E(P_{ti}\mu_i) = E(Z_i\mu_i) = 0 \]
Note that observations for the same individual will be correlated
because of the time invariant error – sometimes referred to as
unobserved heterogeneity
Given the assumptions, estimation options are:
1. Simple logit yields consistent point estimates but incorrect SE’s
2. Simple logit with cluster option corrects SE’s
3. Parametric or semi-parametric maximum likelihood, which allows
estimation of the random effects parameters and is typically more efficient
The likelihood function for this model is derived as follows:
\[ P(Y_{ti}=1 \mid \mu_i) = \frac{e^{X_{ti}\beta + \alpha P_{ti} + Z_i\delta + \rho\mu_i}}{1 + e^{X_{ti}\beta + \alpha P_{ti} + Z_i\delta + \rho\mu_i}} \]
This is the probability that individual i at time t is using
contraception conditional on time invariant heterogeneity.
For individual i, we observe Ti binary responses that we can write as:
Yi = (1,0,0,1) for a woman who is observed for 4 time periods and
used contraception in time periods 1 and 4.
Let Yi be the set of observed outcomes for individual i, then:
\[ P(Y_i) = \int_{-\infty}^{\infty} \prod_{t=1}^{T_i} P(Y_{ti}=1 \mid \mu_i)^{Y_{ti}} \left[ 1 - P(Y_{ti}=1 \mid \mu_i) \right]^{1-Y_{ti}} f(\mu_i)\, d\mu_i \]
Joint probability must be approximated -- approximating
the area under a curve. With the assumption of normality
the approximation method is Gaussian Quadrature or Hermite
integration
Points:
1. More accurate with more Hermite points – but execution
time is longer. Adaptive quadrature is an option in GLLAMM
2. You need more points as Ti gets larger.
Hermite integration replaces the integral with a sum:
\[ P(Y_i) \approx \sum_{m=1}^{M} w_m \prod_{t=1}^{T_i} P(Y_{ti}=1 \mid \mu_m)^{Y_{ti}} \left[ 1 - P(Y_{ti}=1 \mid \mu_m) \right]^{1-Y_{ti}} \]
where the weights (wm ’s) and the masspoints (μm’s) are known
because of the assumption of normality
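A minimal Python sketch of the Hermite sum for one individual, assuming a three-point rule for a standard normal µ (mass points 0 and ±√3, weights 2/3 and 1/6). The index values in xb_i and the value of rho are hypothetical, and three points is far fewer than one would use in practice:

```python
import math

# Three-point Gauss-Hermite rule for a standard normal mu_i:
# mass points 0 and +/- sqrt(3) with weights 2/3 and 1/6.
MASS_POINTS = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
WEIGHTS = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def conditional_prob(y, xb, rho, mu):
    """P(Y_i | mu): product over t of P^Y * (1 - P)^(1 - Y)."""
    prob = 1.0
    for y_t, xb_t in zip(y, xb):
        p_t = logistic(xb_t + rho * mu)
        prob *= p_t if y_t == 1 else 1.0 - p_t
    return prob


def hermite_prob(y, xb, rho):
    """Weighted sum over the mass points that replaces the integral."""
    return sum(w * conditional_prob(y, xb, rho, m)
               for w, m in zip(WEIGHTS, MASS_POINTS))


y_i = [1, 0, 0, 1]              # the four-period example from the text
xb_i = [0.3, 0.1, -0.2, 0.4]    # hypothetical X*beta + alpha*P + Z*delta
print(hermite_prob(y_i, xb_i, rho=1.5))
```

With rho set to zero the heterogeneity has no effect, so the sum collapses to the simple logit probability of the sequence.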
Alternative:
The discrete factor approximation searches over weights and mass
points along with the other parameters of the model.
Must impose a normalization:
1. Weights sum to one
2. Either set one mass point to zero (Fortran program) or set the mean
of the distribution to zero (GLLAMM)
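The two normalizations can be sketched in Python as follows. This is only an illustration of one way to impose them, not GLLAMM's internal parameterization: the softmax mapping for the weights, the function names, and all numeric values are assumptions. The weights and mass points would be searched over by the optimizer together with the other parameters.

```python
import math


def softmax(theta):
    """Map unrestricted parameters to weights that sum to one."""
    top = max(theta)
    expd = [math.exp(t - top) for t in theta]
    total = sum(expd)
    return [e / total for e in expd]


def center(points, weights):
    """Shift the mass points so that the weighted mean is zero."""
    mean = sum(w * p for w, p in zip(weights, points))
    return [p - mean for p in points]


def discrete_factor_prob(y, xb, raw_points, theta):
    """P(Y_i) with free weights and mass points instead of Hermite values."""
    weights = softmax(theta)
    points = center(raw_points, weights)
    total = 0.0
    for w, mu in zip(weights, points):
        prob = 1.0
        for y_t, xb_t in zip(y, xb):
            p_t = 1.0 / (1.0 + math.exp(-(xb_t + mu)))
            prob *= p_t if y_t == 1 else 1.0 - p_t
        total += w * prob
    return total


theta = [0.0, 0.5, -0.3]          # hypothetical unrestricted weight parameters
raw_points = [-2.0, 0.0, 2.5]     # hypothetical mass points before centering
weights = softmax(theta)
points = center(raw_points, weights)
print(weights, points)
print(discrete_factor_prob([1, 0, 0, 1], [0.3, 0.1, -0.2, 0.4], raw_points, theta))
```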
Multilevel Panel Models
Basic Form of the model:
\[ \ln\left[ \frac{P(Y_{tij}=1 \mid \mu_{ij}, \lambda_j)}{P(Y_{tij}=0 \mid \mu_{ij}, \lambda_j)} \right] = X_{tij}\beta + \alpha P_{tij} + Z_j\delta + \rho_1\mu_{ij} + \rho_2\lambda_j \]
where
j=1,2,…,J (communities)
i=1,2,…,Nj (individuals from community j)
t=1,2,…,Tij (observations for person i for community j)
Xtij: individual level variables (some could be fixed through time)
Ptij: time varying program variable
Zj: time invariant community level variables
μij: time invariant individual level unobserved heterogeneity
λj: time invariant community level unobserved heterogeneity
This model allows observations on the same individual to be
correlated and observations from the same community to be
correlated.
Assumptions:
\[ E(X_{tij}\lambda_j) = E(P_{tij}\lambda_j) = E(Z_j\lambda_j) = 0 \]
\[ E(X_{tij}\mu_{ij}) = E(P_{tij}\mu_{ij}) = E(Z_j\mu_{ij}) = 0 \]
Given the assumptions, estimation options are:
1. Simple logit yields consistent point estimates but incorrect SE’s
2. Simple logit with cluster option corrects SE’s (at community level)
3. Parametric or semi-parametric maximum likelihood -- estimates
random effects and more efficient
The multilevel maximum likelihood estimator is a straightforward
extension of the basic longitudinal data model:
\[ P(Y_{tij}=1 \mid \mu_{ij}, \lambda_j) = \frac{e^{X_{tij}\beta + \alpha P_{tij} + Z_j\delta + \rho_1\mu_{ij} + \rho_2\lambda_j}}{1 + e^{X_{tij}\beta + \alpha P_{tij} + Z_j\delta + \rho_1\mu_{ij} + \rho_2\lambda_j}} \]
You need the unconditional joint probability of the observed
set of outcomes for the set of individuals in each community:
Conditional on the unobservables at the community level, the
probability of the set of observed outcomes for person i from
community j is:
\[ P(Y_{ij} \mid \lambda_j) = \int_{-\infty}^{\infty} \prod_{t=1}^{T_{ij}} P(Y_{tij}=1 \mid \mu_{ij}, \lambda_j)^{Y_{tij}} \left[ 1 - P(Y_{tij}=1 \mid \mu_{ij}, \lambda_j) \right]^{1-Y_{tij}} f(\mu_{ij})\, d\mu_{ij} \]
The unconditional joint probability of the set of observed outcomes for
all individuals in community j is then:
\[ P(Y_j) = \int_{-\infty}^{\infty} \prod_{i=1}^{N_j} P(Y_{ij} \mid \lambda_j) f(\lambda_j)\, d\lambda_j \]
We then either use Hermite integration or the discrete factor method to
approximate the integral.
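The nesting of the two integrals can be sketched in Python with Hermite sums at both levels. This is an illustration under stated assumptions: a three-point rule for standard normal µ and λ, and hypothetical data, index values, rho1, and rho2.

```python
import math

# Three-point Gauss-Hermite rule for a standard normal variable.
MASS_POINTS = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
WEIGHTS = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]


def person_prob(y, xb, rho1, rho2, lam):
    """P(Y_ij | lambda_j): inner sum over the individual mass points."""
    total = 0.0
    for w, mu in zip(WEIGHTS, MASS_POINTS):
        prob = 1.0
        for y_t, xb_t in zip(y, xb):
            p_t = 1.0 / (1.0 + math.exp(-(xb_t + rho1 * mu + rho2 * lam)))
            prob *= p_t if y_t == 1 else 1.0 - p_t
        total += w * prob
    return total


def community_prob(people, rho1, rho2):
    """P(Y_j): outer sum over community mass points of the product over people."""
    total = 0.0
    for w, lam in zip(WEIGHTS, MASS_POINTS):
        prod = 1.0
        for y, xb in people:
            prod *= person_prob(y, xb, rho1, rho2, lam)
        total += w * prod
    return total


# Hypothetical community with three people and unbalanced panels.
community = [([1, 0, 0, 1], [0.3, 0.1, -0.2, 0.4]),
             ([0, 0], [-0.5, -0.4]),
             ([1, 1, 1], [0.8, 0.9, 1.0])]
print(community_prob(community, rho1=1.5, rho2=0.6))
```

The inner sum is evaluated once per person for every community mass point, which is why the cost grows quickly with the number of points at each level.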
Parametric Maximum Likelihood
gllamm cont_use posyandus age grade_school high_school college muslim, ///
    i(ind_id com_id) family(binomial) link(logit) nip(20) ip(g) trace dot

number of level 1 units = 20000
number of level 2 units = 9394
number of level 3 units = 313

gllamm model

log likelihood = -12548.522

------------------------------------------------------------------------------
    cont_use |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   posyandus |   .0228368   .0069701     3.28   0.001     .0091757     .036498
         age |   -.037996   .0026758   -14.20   0.000    -.0432405   -.0327516
grade_school |   .6367873   .0786581     8.10   0.000     .4826202    .7909543
 high_school |   .4122478   .0975244     4.23   0.000     .2211036    .6033921
     college |   .4165882   .1299495     3.21   0.001     .1618919    .6712844
      muslim |   .0376821   .1052797     0.36   0.720    -.1686623    .2440266
       _cons |    1.00658   .1701569     5.92   0.000     .6730791    1.340082
------------------------------------------------------------------------------
Variances and covariances of random effects
------------------------------------------------------------------------------
***level 2 (ind_id)
var(1): 2.2860509 (.13570515)
***level 3 (com_id)
var(1): .34625941 (.04611334)
------------------------------------------------------------------------------
Non-parametric Maximum Likelihood
gllamm cont_use posyandus age grade_school high_school college muslim, ///
    i(ind_id com_id) family(binomial) link(logit) nip(3) ip(f) trace dot

number of level 1 units = 20000
number of level 2 units = 9394
number of level 3 units = 313

gllamm model

log likelihood = -12546.725

------------------------------------------------------------------------------
    cont_use |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   posyandus |   .0208781   .0067149     3.11   0.002     .0077171    .0340391
         age |   -.037949   .0027013   -14.05   0.000    -.0432435   -.0326546
grade_school |    .637163   .0795007     8.01   0.000     .4813445    .7929815
 high_school |   .4185102    .097197     4.31   0.000     .2280075    .6090129
     college |   .4177381   .1293205     3.23   0.001     .1642745    .6712016
      muslim |  -.0883427   .1015703    -0.87   0.384    -.2874169    .1107315
       _cons |   1.164577   .1836346     6.34   0.000     .8046593    1.524494
------------------------------------------------------------------------------
Probabilities and locations of random effects
------------------------------------------------------------------------------
***level 2 (ind_id)
loc1: -1.9001, 2.6348, .23361
var(1): 2.2386873
prob: 0.295, 0.1648, 0.5402
***level 3 (com_id)
loc1: -1.4082, .65457, -.1872
var(1): .33048135
prob: 0.0826, 0.3421, 0.5753
------------------------------------------------------------------------------
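As a check on how to read this output, the reported var(1) at each level is the variance of the discrete distribution defined by the locations and probabilities. A short Python sketch using the rounded values printed above (the function name is hypothetical); the results should come out close to the reported variances, with means near zero as the GLLAMM normalization requires:

```python
def discrete_moments(locations, probs):
    """Mean and variance of a discrete distribution on the mass points."""
    mean = sum(p * x for p, x in zip(probs, locations))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, locations))
    return mean, var


# Level 2 (ind_id) and level 3 (com_id) values copied from the output.
ind_mean, ind_var = discrete_moments([-1.9001, 2.6348, 0.23361],
                                     [0.295, 0.1648, 0.5402])
com_mean, com_var = discrete_moments([-1.4082, 0.65457, -0.1872],
                                     [0.0826, 0.3421, 0.5753])
print(ind_mean, ind_var)
print(com_mean, com_var)
```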
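Basic Multinomial Logit with 4 Choices:
\[
\begin{aligned}
U_{1ti} &= X_{ti}\beta_1 + \alpha_1 P_{ti} + Z_i\delta_1 + \varepsilon_{1ti} \\
U_{2ti} &= X_{ti}\beta_2 + \alpha_2 P_{ti} + Z_i\delta_2 + \varepsilon_{2ti} \\
U_{3ti} &= X_{ti}\beta_3 + \alpha_3 P_{ti} + Z_i\delta_3 + \varepsilon_{3ti} \\
U_{4ti} &= X_{ti}\beta_4 + \alpha_4 P_{ti} + Z_i\delta_4 + \varepsilon_{4ti}
\end{aligned}
\]
Individual i at time t makes choice 3 (for example) if U3ti exceeds the
utility of every other choice, which happens with probability:
\[ P(U_{3ti} > U_{1ti} \text{ and } U_{3ti} > U_{2ti} \text{ and } U_{3ti} > U_{4ti}) \]
If we assume that the ε’s follow independent extreme value
distributions and impose the restriction that:
\[ \beta_1 = \alpha_1 = \delta_1 = 0 \]
so that the probabilities sum to one, then:
\[ \ln\left[ \frac{P(Y_{ti}=k)}{P(Y_{ti}=1)} \right] = X_{ti}\beta_k + \alpha_k P_{ti} + Z_i\delta_k \]
for k=2,3,4.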
Let:
\[ A_{ti} = 1 + \sum_{m=2}^{4} e^{X_{ti}\beta_m + \alpha_m P_{ti} + Z_i\delta_m} \]
Then:
\[ P(Y_{ti}=1) = \frac{1}{A_{ti}} \]
And:
\[ P(Y_{ti}=k) = \frac{e^{X_{ti}\beta_k + \alpha_k P_{ti} + Z_i\delta_k}}{A_{ti}} \]
Which allows us to write the likelihood function
\[ L = \prod_{i=1}^{N} \prod_{t=1}^{T_i} \prod_{m=1}^{4} P(Y_{ti}=m)^{B_{tim}} \]
where
\[ B_{tim} = 1 \text{ if } Y_{ti} = m \text{ and } B_{tim} = 0 \text{ otherwise} \]
Maximize with respect to parameters to find MLE
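The probabilities and the log of the likelihood above can be sketched in Python as follows. The function names and the index values are hypothetical; each observation supplies the three indices for choices 2, 3, and 4, with choice 1 as the base:

```python
import math


def mnl_probs(v):
    """Probabilities for choices 1..4; v holds the indices for choices 2..4."""
    a = 1.0 + sum(math.exp(v_k) for v_k in v)           # A_ti in the text
    return [1.0 / a] + [math.exp(v_k) / a for v_k in v]


def log_likelihood(choices, indices):
    """Sum of log P(Y_ti = observed choice); B_tim selects the observed term."""
    ll = 0.0
    for y, v in zip(choices, indices):
        ll += math.log(mnl_probs(v)[y - 1])
    return ll


# Hypothetical person-period observations: observed choice and the three
# indices X*beta_k + alpha_k*P + Z*delta_k for k = 2, 3, 4.
choices = [1, 2, 2, 4]
indices = [[0.2, -1.0, -2.5], [0.6, -0.8, -2.0],
           [0.9, -0.5, -2.2], [0.1, -0.7, -1.0]]
print(mnl_probs(indices[0]))
print(log_likelihood(choices, indices))
```

An optimizer would then search over the coefficients that generate the indices to maximize this sum.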
Independence of Irrelevant Alternatives Problem (IIA):
\[ \frac{P(Y_{ti}=2)}{P(Y_{ti}=3)} = \frac{e^{X_{ti}\beta_2 + \alpha_2 P_{ti} + Z_i\delta_2}}{e^{X_{ti}\beta_3 + \alpha_3 P_{ti} + Z_i\delta_3}} \]
Red Bus/Blue Bus Problem
Start with red bus and walking and calculate the odds
Now add a blue bus that is exactly like the red bus and irrelevant to walking
Because of IIA, the probability of walking falls even though the
blue bus alternative is irrelevant. The blue bus is perfectly correlated
with the red bus, which the multinomial logit model does not allow
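A small numeric illustration in Python, assuming (hypothetically) that all alternatives have the same systematic utility of zero. The odds of walking versus the red bus stay at one, so adding the blue bus pushes the probability of walking from 1/2 to 1/3, when intuition says it should stay at 1/2 with the two buses splitting the other half:

```python
import math


def mnl(utilities):
    """Multinomial logit probabilities from a dict of systematic utilities."""
    denom = sum(math.exp(v) for v in utilities.values())
    return {name: math.exp(v) / denom for name, v in utilities.items()}


before = mnl({"walk": 0.0, "red bus": 0.0})
after = mnl({"walk": 0.0, "red bus": 0.0, "blue bus": 0.0})

print(before["walk"], after["walk"])
print(before["walk"] / before["red bus"], after["walk"] / after["red bus"])
```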
Extension of longitudinal logit to longitudinal multinomial logit:
\[ \ln\left[ \frac{P(Y_{ti}=k \mid \mu_i)}{P(Y_{ti}=1 \mid \mu_i)} \right] = X_{ti}\beta_k + \alpha_k P_{ti} + Z_i\delta_k + \rho_k\mu_i \]
for k=2,3,4.
The discrete factor model allows a more general pattern of
correlation:
\[ \ln\left[ \frac{P(Y_{ti}=k \mid \mu_{km})}{P(Y_{ti}=1 \mid \mu_{km})} \right] = X_{ti}\beta_k + \alpha_k P_{ti} + Z_i\delta_k + \mu_{km} \]
for m=1,2,…,M and a common set of weights w_m, which
allows for correlation in the μ’s
Unfortunately, the base version of GLLAMM estimates a needlessly
restrictive version of the model:
Parametric:
\[ \rho_2 = \rho_3 \]
If there are more than 3 choices, all ρ’s are restricted to be equal
Non-parametric:
\[ \mu_{2m} = \mu_{3m} \quad \text{for all } m \]
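This restriction means that IIA is not solved:
\[ \frac{P(Y_{ti}=2 \mid \mu_i)}{P(Y_{ti}=3 \mid \mu_i)} = \frac{e^{X_{ti}\beta_2 + \alpha_2 P_{ti} + Z_i\delta_2 + \rho\mu_i}}{e^{X_{ti}\beta_3 + \alpha_3 P_{ti} + Z_i\delta_3 + \rho\mu_i}} = \frac{e^{X_{ti}\beta_2 + \alpha_2 P_{ti} + Z_i\delta_2}}{e^{X_{ti}\beta_3 + \alpha_3 P_{ti} + Z_i\delta_3}} \]
since the unobserved heterogeneity drops out.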
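The cancellation is easy to verify numerically. In this Python sketch (index values and loadings are hypothetical), the odds of choice 2 versus choice 3 do not move with µ when the loadings are equal, but they do once ρ2 and ρ3 are allowed to differ:

```python
import math


def odds_2_vs_3(v2, v3, rho2, rho3, mu):
    """Conditional odds of choice 2 versus choice 3 given heterogeneity mu."""
    return math.exp(v2 + rho2 * mu) / math.exp(v3 + rho3 * mu)


v2, v3 = 0.4, -0.1   # hypothetical X*beta_k + alpha_k*P + Z*delta_k

# Restricted model: common loading, so mu cancels from the odds.
restricted = [odds_2_vs_3(v2, v3, 1.0, 1.0, mu) for mu in (-2.0, 0.0, 2.0)]

# Less restrictive model: choice-specific loadings, so the odds vary with mu.
flexible = [odds_2_vs_3(v2, v3, 1.0, -0.5, mu) for mu in (-2.0, 0.0, 2.0)]

print(restricted)
print(flexible)
```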
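Extension to Multilevel Panel Model:
Parametric:
\[ \ln\left[ \frac{P(Y_{tij}=k \mid \mu_{ij}, \lambda_j)}{P(Y_{tij}=1 \mid \mu_{ij}, \lambda_j)} \right] = X_{tij}\beta_k + \alpha_k P_{tij} + Z_j\delta_k + \rho_{1k}\mu_{ij} + \rho_{2k}\lambda_j \]
Semi-parametric:
\[ \ln\left[ \frac{P(Y_{tij}=k \mid \mu_{km}, \lambda_{kn})}{P(Y_{tij}=1 \mid \mu_{km}, \lambda_{kn})} \right] = X_{tij}\beta_k + \alpha_k P_{tij} + Z_j\delta_k + \mu_{km} + \lambda_{kn} \]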
The empirical example estimates a model with four choices:
1= Non use
2=Temporary Methods (pill, condom, injection)
3=Long Lasting Methods (IUD, sterilization)
4=Traditional Methods
We show the GLLAMM code to estimate various versions of the model
and some selected results:
Multinomial logit with SE’s corrected at community (outermost) level:
mlogit new_method posyandus age grade_school high_school college
muslim, cluster(com_id) base(1)
Restricted model for longitudinal data and random effect at individual level:
Normal distribution:
gllamm new_method posyandus … muslim, i(ind_id) family(binomial)
link(mlogit) b(1) nip(16) ip(g) trace dot
Discrete Factor:
gllamm new_method posyandus … muslim, i(ind_id) family(binomial)
link(mlogit) b(1) nip(4) ip(f) trace dot
Restricted model for multilevel data and random effects at individual level
and community level:
Normal distribution:
gllamm new_method posyandus … muslim, i(ind_id com_id) family(binomial)
link(mlogit) b(1) nip(16) ip(g) trace dot
Discrete Factor:
gllamm new_method posyandus … muslim, i(ind_id com_id)
family(binomial) link(mlogit) b(1) nip(4) ip(f) trace dot
To estimate the less restrictive version of the model, you must expand
the data set (as is done in conditional logit)-- with 4 choices each original
observation expands to 4 observations:
sort ind_id year
gen tid = _n
expand 4
by tid, sort: gen alt = _n
gen choice = (new_method == alt)
tab alt, gen(a)
eq a2: a2
eq a3: a3
eq a4: a4
eq a5: a2
eq a6: a3
eq a7: a4
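For readers less familiar with the Stata idiom, the expansion step can be mirrored in Python. This sketch is not a replacement for the commands above; the function name and the example records are hypothetical, and it creates only the dummies a2 to a4 that the eq statements use:

```python
ALTERNATIVES = [1, 2, 3, 4]


def expand_choices(rows):
    """Turn one row per person-period into one row per person-period-alternative."""
    ordered = sorted(rows, key=lambda r: (r["ind_id"], r["year"]))
    long_rows = []
    for tid, row in enumerate(ordered, start=1):       # gen tid = _n
        for alt in ALTERNATIVES:                       # expand 4
            long_rows.append({
                "tid": tid,
                "ind_id": row["ind_id"],
                "year": row["year"],
                "alt": alt,
                "choice": int(row["new_method"] == alt),
                "a2": int(alt == 2),
                "a3": int(alt == 3),
                "a4": int(alt == 4),
            })
    return long_rows


# Hypothetical records: one woman seen twice, another seen once.
records = [{"ind_id": 7, "year": 1997, "new_method": 2},
           {"ind_id": 7, "year": 1993, "new_method": 1},
           {"ind_id": 9, "year": 2000, "new_method": 4}]
long_data = expand_choices(records)
for r in long_data[:4]:
    print(r)
```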
Less restrictive models for longitudinal and multilevel models (just
for normal distribution):
Longitudinal:
gllamm alt posyandus … college muslim, i(ind_id) family(binomial) nrf(3) eq(a2
a3 a4) link(mlogit) expand(tid choice m) b(1) nip(8) ip(g) trace dot
Multilevel (use 4 points as starting values for 8 points):
gllamm alt posyandus .. muslim, i(ind_id com_id) family(binomial) nrf(3,3) eq(a2
a3 a4 a5 a6 a7) link(mlogit) expand(tid choice m) b(1) nip(4) ip(g) trace dot
matrix a=e(b)
gllamm alt posyandus … muslim, i(ind_id com_id) family(binomial) nrf(3,3)
eq(a2 a3 a4 a5 a6 a7) link(mlogit) expand(tid choice m) b(1) nip(8) ip(g) trace
dot from(a)
Complete results for Longitudinal Multinomial Logit with Normal
Unobserved Heterogeneity:
------------------------------------------------------------------------------
         alt |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
c2           |
   posyandus |   .0133383   .0043642     3.06   0.002     .0047847    .0218919
         age |  -.0654348   .0029524   -22.16   0.000    -.0712214   -.0596483
grade_school |    .665288   .0853316     7.80   0.000     .4980411    .8325349
 high_school |   .3435901   .1026004     3.35   0.001     .1424971    .5446831
     college |   .1838509   .1410159     1.30   0.192    -.0925352     .460237
      muslim |   .8185812   .0950213     8.61   0.000      .632343    1.004819
       _cons |   .6515547   .1603845     4.06   0.000     .3372068    .9659026
-------------+----------------------------------------------------------------
c3           |
   posyandus |   .0563373   .0074169     7.60   0.000     .0418005    .0708742
         age |   .0170139   .0049613     3.43   0.001     .0072898    .0267379
grade_school |   .3463888   .1475298     2.35   0.019     .0572358    .6355418
 high_school |   .3181615   .1815739     1.75   0.080    -.0377168    .6740398
     college |   .8239973   .2348017     3.51   0.000     .3637943      1.2842
      muslim |  -1.546618   .1443679   -10.71   0.000    -1.829574   -1.263662
       _cons |  -2.761829   .2676854   -10.32   0.000    -3.286482   -2.237175
-------------+----------------------------------------------------------------
c4           |
   posyandus |   .0433708   .0077071     5.63   0.000     .0282653    .0584764
         age |   .0296803     .00637     4.66   0.000     .0171954    .0421652
grade_school |   1.102582     .23605     4.67   0.000     .6399329    1.565232
 high_school |   1.698679   .2555573     6.65   0.000     1.197796    2.199562
     college |   1.852888   .2955359     6.27   0.000     1.273649    2.432128
      muslim |  -.8180473   .1512352    -5.41   0.000    -1.114463   -.5216316
       _cons |  -5.665755   .4052886   -13.98   0.000    -6.460106   -4.871404
------------------------------------------------------------------------------
Variances and covariances of random effects
------------------------------------------------------------------------------
***level 2 (ind_id)
var(1): 2.5705676 (.15381473)
cov(2,1): 1.3274867 (.22905531) cor(2,1): .27220862
var(2): 9.251828 (.48973169)
cov(3,1): -.33313875 (.21242057) cor(3,1): -.15334027
cov(3,2): .0428903 (.40229692) cor(3,2): .01040617
var(3): 1.8361529 (.34750064)
------------------------------------------------------------------------------
Comparison of Normal Distribution Results with Discrete Factor Results – Less
Restrictive Multilevel – Discrete Factor with 4 points of support

                         Normal Distribution       Discrete Factor
Variable                 Coefficient       SE      Coefficient       SE

Temporary Modern vs none
  Posyandus                   0.0051   0.0041           0.0113   0.0086
  Age                        -0.0596   0.0025          -0.0650   0.0036
  Heterogeneity
    Com_1                    -0.3364   0.0504          -1.0669   0.3353
    Com_2                                              -1.1237   0.3063
    Com_3                                              -0.2065   0.3053
    Ind_1                    -1.0171   0.0476           1.5390   0.2828
    Ind_2                                               1.0394   0.5882
    Ind_3                                              -1.2628   0.2664

Long Lasting vs none
  Posyandus                   0.0492   0.0102           0.0379   0.0254
  Age                         0.0239   0.0044           0.0204   0.0074
  Heterogeneity
    Com_1                    -1.4887   0.1191          -4.7315   0.4556
    Com_2                                              -1.8984   0.5143
    Com_3                                              -2.7538   0.4319
    Ind_1                     2.4651   0.0947          -6.8959   0.9684
    Ind_2                                              -2.3478   0.4319
    Ind_3

Traditional vs none
  Posyandus                   0.0391   0.0064           0.0501   0.0134
  Age                         0.0282   0.0057           0.0274   0.0052
  Heterogeneity
    Com_1                     0.0753   0.0916          -0.0720   0.4741
    Com_2                                               0.4133   0.3854
    Com_3                                               0.1864   0.4403
    Ind_1                     0.1900   0.1160          -0.9735   0.4741
    Ind_2                                             -28.9052   0.0000
    Ind_3                                              -0.3194   0.3822

Probability Weights for Discrete Factor Model
  Com_1     0.2352          Ind_1     0.2194
  Com_2     0.2681          Ind_2     0.1649
  Com_3     0.4391          Ind_3     0.4453
  Com_4     0.5777          Ind_4     0.1704
Summary
IIA is not much of an issue, at least in this example: the basic multinomial logit
gave almost the same point estimates as the least restrictive random effects model
More complex models that do not suffer from IIA make a lot of difference in
the conditional logit model (see mixlogit application)
Estimation of multilevel models using Gaussian Quadrature is extremely
computer intensive when you use the preferred less restrictive models
Adaptive Quadrature sounds good in theory but frequently fails to converge
The Discrete Factor Model in GLLAMM is easy to use, does not require the normality
assumption for the unobserved heterogeneity, and is much faster computationally
Under some circumstances, GLLAMM ran into problems taking
derivatives, with either Gaussian Quadrature or the Discrete Factor model
This was never an issue in Fortran, which was also light years faster
Competing Risk Model
Computer software for the discrete time competing risk model with
multiple levels is the same as the multilevel multinomial logit
Can be more complicated if a multiple spell model
Example:
“Trajectories of Unintended Fertility” (Rajan, Morgan, Harris, Guilkey,
Hayford, Guzzo)
Competing risks were the time until unintended versus intended
birth (no birth is the base outcome for the 3 unordered
categories)
Multiple spell (through three births) but we focus on single spell here
Review of discrete time hazard:
The variable of interest is: P(t ≤ T < t+n | T ≥ t)
This is the conditional probability that an individual experiences the event
between t and t+n given that she has not experienced the event before
time t.
Example:
The dependent variable is the timing of a first birth. Suppose the discrete time
interval is a year and we observe each woman from the beginning of her child
bearing years:
0…..1…..2…..3
Consider three cases:
Person 1: Has a birth in year 1 (time 0 may be age 12)
Person 2: Has a birth in year 2
Person 3: Still has not had a birth at the end of the observation period
Some important notes:
1. Since we are following the woman from the beginning of her child bearing
years, we have eliminated the possibility of left censoring (the event occurs
before the observation period).
2. Left censoring combined with unobserved heterogeneity introduces bias
into the estimation results. The correction requires the estimation of an
“initial conditions” equation similar to a Heckman selection equation, and such
equations are well known to yield unstable parameter estimates.
3. The third person is right censored. However, right censoring is easily
handled as part of the estimation process.
4. The dependent variable in a discrete time hazard model is dichotomous.
A logit is typically used, and it can be extended to a multinomial logit for
competing risks.
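The three cases translate into person-period records, one row for each year a woman is at risk, with a 0/1 event indicator as the dependent variable. A minimal Python sketch, assuming a three-year observation window and a hypothetical constant hazard of 0.2 just to show how right censoring enters the likelihood:

```python
def person_period(person_id, event_year, last_year):
    """One row per year at risk; event_year=None means right censored."""
    stop = event_year if event_year is not None else last_year
    return [{"id": person_id, "year": t, "event": int(t == event_year)}
            for t in range(1, stop + 1)]


def spell_prob(rows, hazard):
    """Likelihood contribution: (1 - h) for each survived year, h for the event."""
    prob = 1.0
    for r in rows:
        prob *= hazard if r["event"] == 1 else 1.0 - hazard
    return prob


person1 = person_period(1, event_year=1, last_year=3)      # birth in year 1
person2 = person_period(2, event_year=2, last_year=3)      # birth in year 2
person3 = person_period(3, event_year=None, last_year=3)   # right censored

for rows in (person1, person2, person3):
    print(rows, spell_prob(rows, hazard=0.2))
```

The censored woman simply contributes three "no event" rows, which is why right censoring needs no special treatment in estimation.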
Duration dependence
This is a concept similar to state dependence in a standard panel data
model.
Duration dependence occurs when the value of the hazard at any point in
time depends on the amount of time that has already elapsed.
Examples:
Mortality – hazard increases with time regardless of the values of the
other covariates
Unemployment duration – hazard of finding employment may decrease
as the length of the unemployment spell increases
Important Point:
It is difficult to distinguish duration dependence from time invariant
unobserved heterogeneity, so you must allow for both and then test
Illustrate methods with simulated data so it is easy to compare estimators:
Competing risks are outcomes 1 and 3; a person stays in the sample as long as
the person keeps drawing outcome 2
Number of communities: 200
Individuals per community: 40
Maximum time until right censoring: 20
Generate the unobserved heterogeneity from both a normal distribution and a bimodal
distribution, with mean zero and variance one in both cases. Note that there is no
duration dependence in the true model
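A rough Python sketch of a design along these lines. The mixture used for the bimodal distribution, the covariates, and every coefficient are assumptions made for illustration, not the values behind the results reported here; only the 200 communities, 40 individuals, and 20 periods come from the description above.

```python
import math
import random

A = 0.9                       # hypothetical distance of each mode from zero
S = math.sqrt(1.0 - A * A)    # component s.d. chosen so the mixture variance is one


def draw_bimodal(rng):
    """Equal mixture of N(-A, S^2) and N(A, S^2): mean zero, variance one."""
    center = A if rng.random() < 0.5 else -A
    return rng.gauss(center, S)


def simulate_spell(rng, heterogeneity, max_t=20):
    """Return (duration, outcome); outcome 2 means right censored at max_t."""
    for t in range(1, max_t + 1):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        # Hypothetical indices relative to base outcome 2; time does not enter,
        # so there is no duration dependence in the generated data.
        e1 = math.exp(-3.0 + 2.0 * x1 - 0.75 * x2 + heterogeneity)
        e3 = math.exp(-3.0 + 0.5 * x1 + 0.25 * x2 + heterogeneity)
        denom = 1.0 + e1 + e3
        u = rng.random()
        if u < e1 / denom:
            return t, 1
        if u < (e1 + e3) / denom:
            return t, 3
    return max_t, 2


rng = random.Random(2024)
spells = []
for j in range(200):                  # communities
    lam = draw_bimodal(rng)           # community heterogeneity
    for i in range(40):               # individuals per community
        mu = draw_bimodal(rng)        # individual heterogeneity
        spells.append(simulate_spell(rng, mu + lam))
print(len(spells))
```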
[Figure: Bimodal Distribution for community heterogeneity. Vertical axis: Density; horizontal axis: com.]
Comparison of Normal Distribution Results with Discrete Factor Results for Normal
and Bimodal Unobserved Heterogeneity – Just 2 versus 1
(true parameter values are shown in parentheses after each variable name)

Data Generated using Normal Unobserved Heterogeneity

               Multinomial Logit             Multinomial Logit – Assume Normality
               Corrected SE                  and use Restricted Model
Variable       Coefficient       SE          Coefficient       SE
X1 (2)              1.8505   0.0262               1.9325   0.0367
X2 (-0.75)         -0.7176   0.0200              -0.7495   0.0236
Time (0)            0.0866   0.0048               0.0543   0.0080

               Multilevel Multinomial Logit –
               Normality with Less           Discrete Factor – 4 points
               Restrictive Model
Variable       Coefficient       SE          Coefficient       SE
X1 (2)              1.9687   0.0323               1.9822   0.0316
X2 (-0.75)         -0.7319   0.0224              -0.7331   0.0216
Time (0)            0.0142   0.0082               0.0125   0.0082

Data Generated using Bimodal Unobserved Heterogeneity

               Multinomial Logit             Multinomial Logit – Assume Normality
               Corrected SE                  and use Restricted Model
Variable       Coefficient       SE          Coefficient       SE
X1 (2)              1.8382   0.0284               1.9747   0.0387
X2 (-0.75)         -0.7394   0.0223              -0.7945   0.0249
Time (0)           0.08554   0.0056               0.0376   0.0083

               Multilevel Multinomial Logit –
               Normality with Less           Discrete Factor – 4 points
               Restrictive Model
Variable       Coefficient       SE          Coefficient       SE
X1 (2)              2.0067   0.0342               2.0157   0.0354
X2 (-0.75)         -0.7730   0.0235              -0.7740   0.0234
Time (0)           -0.0068   0.0089              -0.0095   0.0098
Restrictive specifications of the unobserved heterogeneity show duration dependence
when none exists
GLLAMM with normal errors is extremely slow but faster than GSEM (results not shown);
it is not clear how practical it is for large models with many observations
A very surprising result is that GLLAMM assuming normality was robust to a very
non-normal error generating process
GLLAMM with the discrete factor approximation was fast and gave results almost
identical to a Fortran program, although it was still much slower than Fortran
GLLAMM with the discrete factor approximation was also robust to the true error
distribution; this is not surprising given the Monte Carlo evidence that
is available