Econometrica, Vol. 60, No. 5 (September, 1992), 1187-1214
AN EFFICIENT METHOD OF MOMENTS ESTIMATOR
FOR DISCRETE CHOICE MODELS
WITH CHOICE-BASED SAMPLING
BY GUIDO W. IMBENS1
In this paper a new estimator is proposed for discrete choice models with choice-based sampling. The estimator is efficient and can incorporate information on the marginal choice probabilities in a straightforward manner, and for that case leads to a procedure that is computationally and intuitively more appealing than the estimators that have been proposed before. The idea is to start with a flexible parameterization of the distribution of the explanatory variables and then rewrite the estimator to remove dependence on these parametric assumptions.

KEYWORDS: Discrete choice models, choice-based sampling, case-control sampling, generalized method of moments estimation, semi-parametric efficiency bounds.
1. INTRODUCTION
IN THIS PAPER A NEW ESTIMATOR is proposed for discrete choice models with choice-based sampling. Discrete choice models, or qualitative response models as they are also called, are characterized by the feature that the dependent variable is discrete instead of continuous. Examples are modes of transport, choices of school types, or participation decisions.
Sometimes some of the alternatives are very rare while still important to the researcher. Incidence of rare diseases, or the choice of a particular school type, are examples. In that case the researcher might want to oversample that particular response to increase the accuracy of his analysis (be it the estimation of parameters or the prediction of behavior). Especially in dynamic models it often happens that responses, in this case life histories, that contain relatively much information occur relatively infrequently. See for a discussion of choice-based sampling in a dynamic context Ridder (1987) and Lancaster and Imbens (1990). Another area where this is relevant is that of the evaluation of treatment effects and training schemes, discussed in, among others, Hsieh, Manski, and McFadden (1985), Breslow and Day (1980), and Heckman and Robb (1984). If the conventional econometric practice of specifying the conditional distribution of the dependent variable rather than the joint distribution of the dependent and the independent or explanatory variables is maintained, standard maximum likelihood techniques do not apply. It is this case that is the subject of the choice-based, response-based, case-control, or endogenous sampling literature.
1 This paper is based on the first chapter of my Ph.D. dissertation at Brown University. It was largely written while I was at Tilburg University. I wish to thank Tony Lancaster for many stimulating discussions during the preparation of this paper and gratefully acknowledge helpful comments by Gary Chamberlain, Bertrand Melenberg, Robin Lumsdaine, Wilbert van der Klaauw, a co-editor, two anonymous referees, and participants in seminars at Brown University, Tilburg University, Harvard University, and CREST/ENSAE. All responsibility for errors is mine.
In this paper an estimator is proposed that improves on those that have been suggested previously. Some of these earlier estimators, such as those by Manski and Lerman (1977), and Manski and McFadden (1981), are inefficient, while the ones that are efficient, notably those proposed by Cosslett (1981a, 1981b), are very hard to compute. The new estimator has the same efficiency as those by Cosslett but reduces the computational burden. The estimators that have been suggested in the literature can be divided into two groups, firstly those that assume that the population probabilities of the choices are known and secondly those that assume that they are not. The new estimator incorporates these two extremes as special cases and can cope with partial knowledge of the probabilities. If these probabilities are known they give rise to stochastic restrictions on the other parameters that can be treated as moment equations. If they are not known, they will be treated as additional parameters and estimated using the same equations that are used as stochastic restrictions in the other case.
Both the procedure followed to obtain the estimator and the form that is eventually derived are insightful. This procedure is similar to that used by Chamberlain (1987) to prove efficiency of method of moments estimators. Here we use this procedure to find an estimator, rather than to prove efficiency of an estimator motivated on other grounds. It is assumed at first that the explanatory variables have a discrete distribution with known points of support. In that case one can estimate the parameters of interest by maximum likelihood techniques. The next, crucial, step is to change the estimator thus obtained into one that does not require a discrete distribution for the explanatory variables. The functions that can be interpreted as score functions in the maximum likelihood framework will be interpreted as moment functions in the generalized method of moments framework. In this approach one interprets the problem as a semi-parametric one with the distribution function of the explanatory variables viewed as a nuisance function.
The result is a simpler estimator for the case where the population proportions are known, in the sense that optimization takes place over a space of lower dimension. This is important because the computational difficulties with Cosslett's estimators are severe, as noted by Cosslett (1981b, 1991), Manski and McFadden (1981), and Gourieroux and Monfort (1989). The estimator also provides some intuition about the way in which information about the marginal distribution of the dependent variable can be used efficiently. This line of research is further pursued in Imbens and Lancaster (1991b).
The plan of the paper is as follows: in Section 2 the issues in choice-based sampling are formally stated and previous solutions from the literature are discussed. In Section 3 the new estimator is developed and its asymptotic properties analyzed. A small Monte Carlo experiment is conducted in Section 4 to analyze the small sample properties, followed by the conclusion in Section 5.
2. NOTATION AND PREVIOUS ESTIMATORS
We follow as much as possible the notation of Cosslett (1981b). In a population the joint density of a discrete random variable i and a continuous or
discrete random vector x is

(1)    f(i, x) = P(i \mid x, \theta) \cdot r(x), \qquad i \in C = \{1, 2, \ldots, M\}, \quad x \in X \subset \mathbb{R}^P, \quad \theta \in \Theta \subset \mathbb{R}^K.

P(·|·,·) is a known function, r(·) is an unknown function, and θ is an unknown parameter vector.
The distribution function of x will be denoted by R(x). We are interested in the parameter θ of the conditional probabilities. One might also be interested in Q(i), the marginal probability or population share of choice i. Even if one is not interested in Q(i) itself, it is useful to define it explicitly. This will make it easier to incorporate prior information about it, and such prior information (namely that one of the choices is very rare) is often a motivation for sampling in a choice-based manner. In fact, early studies on choice-based sampling such as Manski and Lerman (1977) focused exclusively on the case where these probabilities are known exactly. The true value of θ is θ* and the corresponding notation for Q(i) is Q*(i):

Q^*(i) = \int P(i \mid x, \theta^*) \, dR(x).
Observations are not drawn randomly from this population. With probability H_s an observation is drawn randomly from that part of the population for which i ∈ 𝒥(s) ⊂ C, with 𝒥(s) ≠ ∅ for all s = 1, 2, ..., S. The H_s satisfy Σ_{s=1}^S H_s = 1, H_s > 0. At times these probabilities of sampling from the different subpopulations or strata will be assumed not to be known to the investigator. In that case H_s* will denote true values. The S-1 dimensional vector (H_1, H_2, ..., H_{S-1}) will be denoted by H and the M-1 dimensional vector (Q(1), Q(2), ..., Q(M-1)) by Q. H_S and Q(M) will be used as shorthand for 1 - Σ_{t=1}^{S-1} H_t and 1 - Σ_{j=1}^{M-1} Q(j) respectively.
The joint density of (s, i, x) is the product of the marginal probability of s, H_s, and the conditional density of i and x given the stratum s. The latter is

(2)    g(i, x \mid s) = \frac{f(i, x)}{\sum_{i' \in \mathcal{J}(s)} \int f(i', z) \, dz} = \frac{P(i \mid x, \theta) \cdot r(x)}{\sum_{i' \in \mathcal{J}(s)} \int P(i' \mid z, \theta) \, dR(z)},

and the product can be written as

(3)    g(s, i, x) = H_s \, \frac{P(i \mid x, \theta) \cdot r(x)}{\sum_{i' \in \mathcal{J}(s)} \int P(i' \mid z, \theta) \, dR(z)},

for i ∈ 𝒥(s), s ∈ {1, 2, ..., S}, and x ∈ X. This is the density function induced by the sampling scheme, as opposed to the density function in the population (1). As a rule f(·) will denote population density and probability functions, and g(·) density and probability functions induced by the sampling scheme. The latter will sometimes loosely be referred to as sampling densities.
As a simple example consider a model with two choices, i = 1, 2, and two strata, s = 1, 2. With probability H_1 = h an observation is drawn from 𝒥(1) = {1}, and with probability H_2 = 1 - h it is drawn from 𝒥(2) = {2}. The population probability of choice 1 is Q(1) = q, and that of choice 2 is Q(2) = 1 - q. The joint
density of (s, i, x) is

(4)    g(s, i, x) = \left[ \frac{h}{q} P(1 \mid x, \theta) \right]^{I[i=1]} \left[ \frac{1-h}{1-q} \left( 1 - P(1 \mid x, \theta) \right) \right]^{I[i=2]} r(x),

compared to P(1|x, θ)^{I[i=1]} · (1 - P(1|x, θ))^{I[i=2]} · r(x) when the sampling is random. This is the sampling scheme that will be used in the Monte-Carlo experiment in Section 4.
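To make the scheme concrete, here is a minimal simulation sketch (our illustration, not part of the original paper; the probit form for P(1|x, θ) and all names are assumptions) that draws observations (s, i, x) under this pure choice-based design by first drawing the stratum and then sampling from the corresponding subpopulation by rejection:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def draw_choice_based(h, theta, n):
    """Draw n observations (s, i, x) under pure choice-based sampling:
    stratum 1 contains only choice 1, stratum 2 only choice 2."""
    sample = []
    while len(sample) < n:
        s = 1 if rng.uniform() < h else 2    # stratum drawn with probability H_s
        while True:                           # rejection: draw from the population
            x = rng.normal()                  # population density r(x)
            p1 = norm.cdf(theta[0] + theta[1] * x)   # P(1 | x, theta), probit
            i = 1 if rng.uniform() < p1 else 2
            if i == s:                        # keep only draws from the sampled stratum
                sample.append((s, i, x))
                break
    return sample
```

Conditional on acceptance, (i, x) has exactly the density g(i, x | s) of equation (2).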
Cosslett (1978, 1981a, 1981b) analyzed a sampling scheme that is slightly different from the general one defined above. Instead of fixing the probabilities H_s* with which observations are drawn from the various strata, he assumed that the relative number of observations from each subsample, H̄_t = Σ_{n=1}^N I[s_n = t]/N, is fixed. Under both sampling schemes the conditional density of i and x given s is equal to (2). Since s is ancillary under both sampling schemes, knowledge of its marginal distribution is immaterial for inference on θ. The fact that the estimator that will be proposed here is efficient ensures that it is, at least asymptotically, conditional on the ancillary statistic s. Imbens and Lancaster (1991a) discussed the differences between these two stratified sampling schemes, as well as others where the stratum indicators are not necessarily ancillary, in more depth.
The complications in estimation of choice-based sampling models arise because maximization of the log likelihood function corresponding to this density is not possible without parameterizing the marginal density of x in the population, r(x). If the sampling were random, and consequently the density of the data were (1), maximization of the logarithm of the likelihood function would be straightforward. As long as the density r(x) does not depend on θ, it would disappear after taking derivatives with respect to θ. This can be extended to the case where the sampling depends on the regressors x. The density induced by the sampling would then be

(5)    g(i, x) = P(i \mid x, \theta) \cdot q(x),

with q(x) ≠ r(x). In this exogenous sampling case there is still no problem in maximizing the logarithm of the likelihood function because the density of x still factors out.
To stress the reciprocal relation between H and Q we also define H(i) and Q_s:

(6)    Q_s = \sum_{i \in \mathcal{J}(s)} Q(i),

(7)    H(i) = Q(i) \sum_{s : i \in \mathcal{J}(s)} \frac{H_s}{Q_s}.

If there is no s such that i ∈ 𝒥(s), then H(i) = 0. H(i) is the marginal
probability of choice i induced by the choice-based sampling, or again somewhat loosely, the sample probability of choice i. It is not to be confused with the sample frequency of choice i, H̄(i) = Σ_{n=1}^N I[i_n = i]/N. In the population the marginal probability of choice i is Q(i), but the sampling scheme multiplies this by the sum of the bias factors H_s/Q_s. The essence of choice-based sampling is that for some i the distortion factor Σ_{s: i ∈ 𝒥(s)} H_s/Q_s differs from unity, or, equivalently, for some i, the population probability Q(i) is not equal to the sample probability H(i). The marginal probability that an observation randomly drawn from the population is in 𝒥(s) is Q_s. Note that while the H(i), the H_s, and the Q(i) add up to one, the sum of the Q_s does not have to equal one.
In the following it will be assumed that the investigator has a sample of N independent observations. N_s will denote the number of observations from stratum 𝒥(s) and N(i) the number of observations with choice i. In the remainder of this paper the following assumptions will be maintained throughout. Other assumptions will be introduced when necessary.
ASSUMPTION 2.1: x ∈ X, X a subset of ℝ^P; i ∈ C, C a finite set with M elements; and θ* ∈ int Θ, Θ a compact subset of ℝ^K.

ASSUMPTION 2.2: P(i|x, θ) is a twice continuously differentiable function of θ, and P and its first two derivatives with respect to θ are continuous in x for all θ ∈ Θ. P(i|x, θ) > 0 for all i ∈ C, x ∈ X, and θ in an open neighborhood of θ*.
Several procedures have been proposed for estimating θ in this setting. We will briefly mention two of them because their form will aid the interpretation of the new estimator.

The conditional probability of i given x in the sample is

(8)    g(i \mid x) = \frac{P(i \mid x, \theta^*) H^*(i)/Q^*(i)}{\sum_{j=1}^{M} P(j \mid x, \theta^*) H^*(j)/Q^*(j)}.
Manski and McFadden (1981) proposed maximizing the corresponding conditional likelihood function as a function of θ given knowledge of H* and Q*:

(9)    L(\theta) = \prod_{n=1}^{N} \frac{P(i_n \mid x_n, \theta) H^*(i_n)/Q^*(i_n)}{\sum_{j=1}^{M} P(j \mid x_n, \theta) H^*(j)/Q^*(j)}.
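For concreteness, the conditional likelihood (9) is easy to code directly; the following sketch (our illustration; P, H_star, and Q_star are hypothetical names for the user-supplied inputs) returns the conditional log likelihood:

```python
import numpy as np

def cml_loglik(theta, data, H_star, Q_star, P):
    """Log of (9). data holds (i_n, x_n) pairs; P(j, x, theta) returns
    P(j | x, theta); H_star[j] and Q_star[j] are the sample and population
    shares of choice j."""
    ll = 0.0
    for i, x in data:
        num = P(i, x, theta) * H_star[i] / Q_star[i]
        den = sum(P(j, x, theta) * H_star[j] / Q_star[j] for j in Q_star)
        ll += np.log(num / den)
    return ll
```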
The conditional maximum likelihood (CML for short) estimator can be interpreted as a method of moments estimator. That is, the estimator can be defined as the solution to the set of equations

\sum_{n=1}^{N} \psi_{CML}(\theta, i_n, x_n) = 0,
where the moment vector is the score of the conditional likelihood:

(10)    \psi_{CML}(\theta, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{j=1}^{M} \frac{H^*(j)}{Q^*(j)} \frac{\partial P}{\partial \theta}(j \mid x, \theta) \right] \Big/ \left[ \sum_{j=1}^{M} \frac{H^*(j)}{Q^*(j)} P(j \mid x, \theta) \right]

             = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{t=1}^{S} H_t^* \frac{\sum_{i' \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q^*(i')} \right] \Big/ \left[ \sum_{t=1}^{S} H_t^* \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q^*(i')} \right].
The method of moments interpretation will later be useful in comparing the CML estimator to the new estimator. Cosslett (1981a) showed that a more efficient estimator can be obtained by replacing H_t* in this procedure by the sample frequency H̄_t. Lancaster (1990) has an extensive discussion of this in terms of ancillarity of the stratum indicators.

Part of the potential loss of efficiency stems from conditioning on x while the parameter of interest, θ, enters the conditional distribution of i given x as well as the marginal distribution of x. In fact the marginal distribution of x is

(11)    g(x) = \sum_{t=1}^{S} \frac{H_t}{Q_t} \sum_{j \in \mathcal{J}(t)} P(j \mid x, \theta) \cdot r(x),

and that clearly depends on θ. Nevertheless, one can still base inference on the conditional likelihood function.
Cosslett (1978, 1981a, 1981b) proposed the pseudo maximum likelihood (PML) estimator. Consider the likelihood function based on the density (3). It cannot directly be maximized over the parameter space and the space of densities r(x). However, if one replaces the density r(x_n) by a set of discrete weights r_n, such that Σ_{n=1}^N r_n = 1 and r_n ≥ 0, maximization is possible. One would obtain the following program:

(12)    \max_{\theta, r_1, r_2, \ldots, r_N} \; \sum_{n=1}^{N} \ln \left[ H_{s_n} P(i_n \mid x_n, \theta) \, r_n \Big/ \sum_{j \in \mathcal{J}(s_n)} \sum_{n'=1}^{N} P(j \mid x_{n'}, \theta) \, r_{n'} \right]

subject to

\sum_{n=1}^{N} r_n = 1, \qquad r_n \geq 0.
The solution of the maximization over r and θ turns out to be equivalent to the solution of the problem Σ_{n=1}^N ψ_{C1}(λ, θ, s_n, i_n, x_n) = 0, with ψ_{C1} = (ψ_{C11}', ψ_{C12}')' and

(13)    \psi_{C11}(\lambda, \theta, s, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{t=1}^{S} \lambda(t) \sum_{j \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(j \mid x, \theta) \right] \Big/ \left[ \sum_{t=1}^{S} \lambda(t) \sum_{j \in \mathcal{J}(t)} P(j \mid x, \theta) \right],

(14)    \psi_{C12,t}(\lambda, \theta, s, i, x) = I[s = t] - \lambda(t) \sum_{j \in \mathcal{J}(t)} P(j \mid x, \theta) \Big/ \left[ \sum_{t'=1}^{S} \lambda(t') \sum_{j \in \mathcal{J}(t')} P(j \mid x, \theta) \right],

for t = 1, 2, ..., S-1, and λ(S) = 1. Cosslett proved that the estimator for θ* is efficient in the class of asymptotically unbiased estimators.
For the case with Q* known, Cosslett proposed maximization of the same function, (12), under the additional restriction that for all j ∈ C, we have Q*(j) = Σ_{n=1}^N r_n · P(j|x_n, θ). This system is equivalent to solving Σ_{n=1}^N ψ_{C2}(λ, θ, i_n, x_n) = 0, with

(15)    \psi_{C21}(\lambda, \theta, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{j=1}^{M} \lambda(j) \frac{\partial P}{\partial \theta}(j \mid x, \theta) \right] \Big/ \left[ \sum_{j=1}^{M} \lambda(j) P(j \mid x, \theta) \right],

(16)    \psi_{C22,j}(\lambda, \theta, i, x) = \left[ P(j \mid x, \theta) - P(M \mid x, \theta) Q^*(j)/Q^*(M) \right] \Big/ \left[ \sum_{j'=1}^{M} \lambda(j') P(j' \mid x, \theta) \right],

for j = 1, 2, ..., M-1, and λ(M) = (1 - Σ_{j=1}^{M-1} λ(j)Q*(j))/Q*(M). If H* were known, the probability limit of λ(j), H*(j)/Q*(j), also would be known. Cosslett proved however that his estimator of θ* is efficient, independent of the information on H* available. This is only possible if asymptotically λ̂ and θ̂ are uncorrelated, which in fact is the case. The computational difficulties stem from the very different nature of the parameters of the optimization program, λ and θ. Most optimization algorithms treat all parameters in the same way and in this case that does not work very well. In addition there need not exist any solution for θ that satisfies the restriction Q*(j) = Σ_{n=1}^N r_n P(j|x_n, θ), as pointed out by Cosslett (1991).
3. AN EFFICIENT GMM ESTIMATOR
In this section the new estimator will be discussed. The strategy is as follows. Initially it will be assumed that the regressors x have a discrete distribution with known points of support. This is of course restrictive but it enables one to use standard maximum likelihood theory. In particular the Cramér-Rao bound can be calculated and used as an efficiency bound. Potential restrictions in the form of knowledge of the marginal probabilities can easily be incorporated in this case.

The maximum likelihood estimator for the discrete regressor case can be written in such a way that knowledge of the points of support is not used explicitly. It turns out that the estimator remains valid even if the distribution of x is continuous. Efficiency will be proven for this estimator in the general case. The theory behind the Cramér-Rao bound is no longer applicable and therefore semi-parametric efficiency bounds will be used.
To give some intuition for the way in which assuming a discrete distribution can lead one to estimators that are valid and efficient even if the distribution is continuous, consider the following example. It is similar to one in Chamberlain (1987). Suppose one is interested in the probability that a random variable z is positive, α = Pr(z > 0). If z is known to have a discrete distribution with points of support {z^1, z^2, ..., z^L}, and with unknown probabilities {π_1, π_2, ..., π_L}, one could efficiently estimate α on the basis of N independent observations {z_1, z_2, ..., z_N} by maximum likelihood techniques as

\hat{\alpha} = \sum_{l : z^l > 0} \hat{\pi}_l = \sum_{l : z^l > 0} \frac{1}{N} \sum_{n=1}^{N} I[z_n = z^l] = \frac{1}{N} \sum_{n=1}^{N} I[z_n > 0].

In the last representation of the estimator it does not depend explicitly on the points of support, only on the realized observations. It also can be used as an estimator for α if z does not have a discrete distribution. In fact, whatever the distribution of z, α̂ is a very good estimator, and efficient in a sense to be defined later.
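In code the point is almost immediate: the maximum likelihood estimator built from the estimated support probabilities collapses to an indicator average that never references the support (a trivial sketch, ours):

```python
import numpy as np

z = np.random.default_rng(1).normal(size=1000)   # any distribution, discrete or not

# With discrete support, alpha_hat = sum of pi_hat(l) over support points z^l > 0;
# each pi_hat(l) is a sample frequency, so the sum telescopes to a single average.
alpha_hat = np.mean(z > 0)
```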
The difference with Cosslett's PML estimator is that we go further in exploiting the discrete regressor case. This will enable us to use the maximum likelihood theory that applies to that case further, simplify the estimator for the continuous regressor case, and finally provide more intuition for the general estimation problem.
3.1. The Case with Discrete Exogenous Variables
The subject of this section is the case where x has a discrete distribution. This will allow one to use standard maximum likelihood theory. Few formal proofs of consistency and asymptotic properties of estimators will be given in this section. The main point here, as indicated earlier in the introduction to Section 3, is to use maximum likelihood theory to guide one to an estimator that will be used outside the maximum likelihood framework.
ASSUMPTION 3.1: x is a discrete random variable with probability π_l > 0 at x^l for l = 1, 2, ..., L, and the mass points x^l, elements of ℝ^P, are known. The number of points of support, L, is larger than the number of choices, M.
An observation (s, i, x) can now be written as (s, i, l), where l indicates the x type of the observation. The log likelihood function for the observations (s_n, i_n, l_n)_{n=1}^N is

(17)    L(H, \pi, \theta) = \sum_{n=1}^{N} \left[ \ln H_{s_n} + \ln P(i_n \mid x^{l_n}, \theta) + \ln \pi_{l_n} - \ln \left( \sum_{j \in \mathcal{J}(s_n)} \sum_{l'=1}^{L} \pi_{l'} P(j \mid x^{l'}, \theta) \right) \right].

Maximizing this over H, π, and θ subject to the restriction Σ_{l=1}^L π_l = 1 leads to the following first order conditions or likelihood equations:

(18)    0 = \frac{\partial L}{\partial H_t}(\hat{H}, \hat{\pi}, \hat{\theta}) = \sum_{n=1}^{N} \left[ \frac{I[s_n = t]}{\hat{H}_t} - \frac{I[s_n = S]}{\hat{H}_S} \right] \qquad (t = 1, 2, \ldots, S-1),

(19)    0 = \frac{\partial L}{\partial \pi_m}(\hat{H}, \hat{\pi}, \hat{\theta}) = \sum_{n=1}^{N} \left[ \frac{I[l_n = m]}{\hat{\pi}_m} - \frac{\sum_{j \in \mathcal{J}(s_n)} P(j \mid x^m, \hat{\theta})}{\sum_{j \in \mathcal{J}(s_n)} \sum_{l'=1}^{L} \hat{\pi}_{l'} P(j \mid x^{l'}, \hat{\theta})} \right] - \mu \qquad (m = 1, 2, \ldots, L),

(20)    0 = \frac{\partial L}{\partial \theta}(\hat{H}, \hat{\pi}, \hat{\theta}) = \sum_{n=1}^{N} \left[ \frac{1}{P(i_n \mid x^{l_n}, \hat{\theta})} \frac{\partial P}{\partial \theta}(i_n \mid x^{l_n}, \hat{\theta}) - \frac{\sum_{j \in \mathcal{J}(s_n)} \sum_{l'=1}^{L} \hat{\pi}_{l'} \frac{\partial P}{\partial \theta}(j \mid x^{l'}, \hat{\theta})}{\sum_{j \in \mathcal{J}(s_n)} \sum_{l'=1}^{L} \hat{\pi}_{l'} P(j \mid x^{l'}, \hat{\theta})} \right],

where μ is the Lagrangian multiplier for the adding up restriction. Assume that the solution (Ĥ, π̂, θ̂) to this system of equations is unique. In this discrete regressor framework the maximum likelihood estimator for Q(j) is

(21)    \hat{Q}(j) = \sum_{l=1}^{L} \hat{\pi}_l P(j \mid x^l, \hat{\theta}).
Note that the derivatives ∂L/∂θ and ∂L/∂π do not depend on H. This implies that the asymptotic covariance matrix has a block diagonal structure. Asymptotically Ĥ and θ̂ are uncorrelated and knowledge of H does not enhance our ability to estimate θ, π, or functions thereof. This is of course a direct consequence of the ancillarity of s.
The next step is to transform the parameter vector into one that includes Q. This serves two purposes. Firstly, it will provide an easier framework for analyzing estimation with restrictions on Q. In the transformed model it will be a conventional maximum likelihood estimation problem with linear restrictions on the parameters. Secondly, and most importantly, the estimators for θ* and Q* can, after the transformation, be written in a form that does not require knowledge of the points of support. They will be written in such a way that their consistency can be proven directly, without relying on the maximum likelihood interpretation.

Define the (M-1) × L dimensional matrix V to be the matrix with typical element

v_{il} = P(i \mid x^l, \theta^*).

Partition V into (V_0 V_1) with V_0 a square matrix. The condition that will allow us to do the desired transformation is that V_0 is nonsingular, possibly after reordering the points of support. Assume that this condition is satisfied.2 Partition π into (π_1, π_2) with dim(π_1) = M-1 and dim(π_2) = L-M. The Jacobian of the transformation from the vector (H, θ, π_1, π_2) to (H, Q, θ, π_2) is nonzero as a consequence of the above condition.
The step of rewriting the equations characterizing the maximum likelihood estimate of θ is essential to the whole approach. It will therefore be given in some detail. First note that the Lagrangian multiplier μ is equal to zero. This can be seen by multiplying (19) by π̂_m and summing up over m = 1 to L. Alternatively one can arrive at this result by checking that the likelihood function is homogeneous of degree zero in π. This enables one to obtain a closed form solution for π̂_m as a function of θ̂, Ĥ, and Q̂. In fact it is a simple sample average:

(23)    \hat{\pi}_m = \sum_{n=1}^{N} I[x_n = x^m] \Big/ \sum_{n=1}^{N} \frac{\sum_{j \in \mathcal{J}(s_n)} P(j \mid x^m, \hat{\theta})}{\hat{Q}_{s_n}} = \frac{1}{N} \sum_{n=1}^{N} I[x_n = x^m] \Big/ \left[ \sum_{s=1}^{S} \hat{H}_s \frac{\sum_{j \in \mathcal{J}(s)} P(j \mid x^m, \hat{\theta})}{\hat{Q}_s} \right].

In the last representation of π̂_m the estimated sample design parameter Ĥ

2 This condition is not trivial. In the next section assumptions will be made that guarantee that it is satisfied. A case where it is not satisfied is if the conditional probabilities do not depend on x and P(j|x; θ*) = Q*(j) for all x and j.
enters the equation. This is why it is convenient to treat H as a normal parameter rather than as a number fixed by the investigator.

To rewrite the crucial equation that characterizes θ̂, (20), one has to substitute for π̂_m in the second term. The denominator of this term is equal to Σ_{j ∈ 𝒥(s_n)} Q̂(j) = Q̂_{s_n}. The key is rearranging the numerator after substitution of (23) for π̂_m in such a way that the whole expression can be written as a sample average. In doing this we bring in Ĥ in the same way it was introduced in the step from the first to the second line in (23):

\sum_{n=1}^{N} \frac{\sum_{j \in \mathcal{J}(s_n)} \sum_{m=1}^{L} \hat{\pi}_m \frac{\partial P}{\partial \theta}(j \mid x^m, \hat{\theta})}{\sum_{j \in \mathcal{J}(s_n)} \hat{Q}(j)}
    = \sum_{n=1}^{N} \frac{1}{\hat{Q}_{s_n}} \cdot \frac{1}{N} \sum_{n'=1}^{N} \left[ \sum_{j \in \mathcal{J}(s_n)} \frac{\partial P}{\partial \theta}(j \mid x_{n'}, \hat{\theta}) \right] \Big/ \left[ \sum_{t=1}^{S} \hat{H}_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x_{n'}, \hat{\theta})}{\hat{Q}_t} \right]
    = \sum_{n'=1}^{N} \left[ \sum_{t=1}^{S} \hat{H}_t \frac{\sum_{j \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(j \mid x_{n'}, \hat{\theta})}{\hat{Q}_t} \right] \Big/ \left[ \sum_{t=1}^{S} \hat{H}_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x_{n'}, \hat{\theta})}{\hat{Q}_t} \right],

where the last step uses Σ_{n=1}^N I[s_n = t] = NĤ_t. Now one can characterize the maximum likelihood estimates for θ̂, Ĥ, Q̂, and π̂_2 by

(24)    \sum_{n=1}^{N} \psi(\hat{H}, \hat{\theta}, \hat{\pi}_2, \hat{Q}, s_n, i_n, l_n) = 0,
where ψ = (ψ_1', ψ_2', ψ_3', ψ_4')', with ψ_1 an S-1 vector, ψ_2 an M-1 vector, ψ_3 a K vector, and ψ_4 an L-M vector with typical elements:

(25)    \psi_{1t}(H, \theta, \pi_2, Q, s, i, l) = H_t - I[s = t],

(26)    \psi_{2j}(H, \theta, \pi_2, Q, s, i, l) = Q(j) - P(j \mid x^l, \theta) \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x^l, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right],

(27)    \psi_{3}(H, \theta, \pi_2, Q, s, i, l) = \frac{1}{P(i \mid x^l, \theta)} \frac{\partial P}{\partial \theta}(i \mid x^l, \theta) - \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(i' \mid x^l, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x^l, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right],

(28)    \psi_{4m}(H, \theta, \pi_2, Q, s, i, l) = \pi_{2m} - I[x^l = x^m] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x^l, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right].
The first three parts of the ψ vector do not depend on π_2. They can therefore be solved separately as a function of H, Q, and θ. Since the solution for H is trivial, the system that has to be solved to obtain θ̂ is reduced to a K+M-1 dimensional one. Note that the only way in which the moments (25)-(27) depend on the mass points is via the observed x values. This is very similar to the example in the introduction to Section 3. It implies that the maximum likelihood estimators for Q, H, and θ can be calculated without knowing a priori what the mass points of the random variable x are. It will be seen in Section 3.2 that one does not even need the assumption that x has finite support.
The three moments have clear interpretations. When evaluated at Q = Q* and H = H*, the third moment ψ_3 is equal to the score for the conditional likelihood of i and s given x. Compare (27) with (10). If the sampling scheme were random (say S = 1 and 𝒥(1) = C) the second moment would compare the marginal probability with the average of the conditional probabilities. The choice-based sampling scheme implies that before the comparison can be made the conditional probabilities have to be weighted to correct for the sampling-induced bias. The first moment, ψ_1, is easy to interpret but it is difficult to
explain why it has to be in the moment vector even if H* is known.3 The importance is clear from Cosslett's (1981a) result that using sample frequencies H̄ instead of the true H* in the CML estimator increases efficiency.

The other advantage of the transformation referred to earlier is the ease with which information about Q can be incorporated. Before the transformation this would have amounted to a maximization in a K+L+S-2 dimensional space with M-1 restrictions. Now it will turn out to involve a maximization in a K+S-2 dimensional space. The following lemma gives an efficient way of using restrictions on some parameters if one has the recursive structure we have derived above. Note that the structure is very similar to that analyzed by Newey (1984) in his discussion of sequential estimators.
LEMMA 3.1: Suppose the maximum likelihood estimator of a vector β with β = (β_1', β_2', β_3')', given N independent and identically distributed observations {z_1, z_2, ..., z_N}, can be characterized by

\sum_{n=1}^{N} h_1(\beta_1, \beta_2, z_n) = 0

and

\sum_{n=1}^{N} h_2(\beta_1, \beta_2, \beta_3, z_n) = 0,

with dim(h_1) = dim(β_1) + dim(β_2) and dim(h_2) = dim(β_3). Then, the optimal constrained method of moments estimator for β_1 given β_2 = β_2*, based on minimization of

\left[ \sum_{n=1}^{N} h_1(\beta_1, \beta_2^*, z_n) \right]' A_N \left[ \sum_{n=1}^{N} h_1(\beta_1, \beta_2^*, z_n) \right]

over β_1, where A_N →a.s. [E h_1 h_1']^{-1}, has the same asymptotic covariance matrix as the constrained maximum likelihood estimator. In other words, it achieves the Cramér-Rao lower bound.

PROOF: See Appendix.
The relevance for the problem analyzed in this section is clear. If one is interested in estimating θ given knowledge of Q, one does not have to go back to the discrete likelihood function in (17). Lemma 3.1 applies with β_1 = (θ, H), β_2 = Q, and β_3 = π_2. It is sufficient to use the moments (25)-(27) in a generalized method of moments framework with the true value for Q substituted in, and the moments weighted optimally.

3 An example from SUR (seemingly unrelated regression) might provide some intuition for this. Consider the problem of estimating one parameter (α) on the basis of observations (y_n, e_n)_{n=1}^N with the following structure:

E(y - \alpha) = 0, \qquad E(e) = 0, \qquad \mathrm{Var}\begin{pmatrix} y - \alpha \\ e \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.

The variance of √N(α̂ - α) based on the single moment equation E(y - α) = 0 is 1, which can be reduced to 1 - ρ² if both moment equations are used, despite the fact that the second moment equation does not contain any unknown parameters.
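The variance reduction in footnote 3 can be checked numerically. A sketch (our construction; the closed form ᾱ = ȳ - ρē is the optimally weighted GMM combination of the two moments under the stated covariance):

```python
import numpy as np

rng = np.random.default_rng(2)
rho, N, reps = 0.8, 500, 2000
single, both = [], []
for _ in range(reps):
    u1 = rng.normal(size=N)
    e = rho * u1 + np.sqrt(1 - rho**2) * rng.normal(size=N)  # corr(y, e) = rho
    y = 0.0 + u1                                # true alpha = 0
    single.append(y.mean())                     # uses E[y - alpha] = 0 only
    both.append(y.mean() - rho * e.mean())      # optimal GMM adds E[e] = 0
print(N * np.var(single), N * np.var(both))     # approximately 1 and 1 - rho**2
```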
3.2. The General Case
In the previous section it was assumed that x had a discrete distribution with known, finite support. In that case the maximum likelihood estimators for θ*, H*, Q*, and π* were derived. It turned out that the estimators for the parameters of interest, θ* and Q*, could be calculated by solving a smaller set of equations that did not involve π. In this section it will be shown that these equations can be used to give an efficient estimator even if x is not a discrete random variable. Assumption 3.1 will be replaced by the following:

ASSUMPTION 3.2: x is a random vector with distribution function R(x) and bounded support X ⊂ ℝ^P.

The bounded support assumption is made for convenience in the subsequent efficiency argument, and can be relaxed at the expense of more technical conditions on the tail behavior of the distribution of x.
The typical observation is now the triple (s, i, x) ∈ {1, 2, ..., S} × C × X. The first step is to rewrite the moments (25)-(27) slightly. Define ψ = (ψ_1', ψ_2', ψ_3')', with ψ_1 an S-1 vector, ψ_2 an M-1 vector, and ψ_3 a K vector with typical elements:

(29)    \psi_{1t}(H, \theta, Q, s, i, x) = H_t - I[s = t],

(30)    \psi_{2j}(H, \theta, Q, s, i, x) = Q(j) - P(j \mid x, \theta) \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right],

(31)    \psi_{3}(H, \theta, Q, s, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right].
In Section 3.1 these moments were derived from likelihood equations. Therefore it was immediate that they had expectation zero. Here their validity as moments suitable for usage in a method of moments procedure has to be established directly. For all three of them it is easy to check that the expectation over the distribution induced by the sampling scheme (for good order, g(s, i, x) in (3)) is zero.

With these moment equations and a possibly stochastic, positive definite weight matrix C_N the objective function R_N(θ, Q, H) can be defined as

(32)    R_N(\theta, Q, H) = \left[ \frac{1}{N} \sum_{n=1}^{N} \psi(H, \theta, Q, s_n, i_n, x_n) \right]' C_N \left[ \frac{1}{N} \sum_{n=1}^{N} \psi(H, \theta, Q, s_n, i_n, x_n) \right].
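The structure of (29)-(32) translates directly into code. The following Python sketch is ours, purely illustrative: P and dP are user-supplied functions for the conditional probability and its θ-gradient, strata maps each stratum t to the set 𝒥(t), and H and Q are dicts containing all S and M components, with the redundant last entries filled in by adding up.

```python
import numpy as np

def psi(H, theta, Q, s, i, x, P, dP, strata):
    """Stacked moment vector (psi_1', psi_2', psi_3')' of (29)-(31) for one
    observation.  P(j, x, theta) -> P(j | x, theta); dP(j, x, theta) -> its
    K-vector of derivatives with respect to theta."""
    Qt = {t: sum(Q[j] for j in strata[t]) for t in strata}            # Q_t
    w = sum(H[t] * sum(P(j, x, theta) for j in strata[t]) / Qt[t]
            for t in strata)                                          # bias factor
    dw = sum(H[t] * sum(dP(j, x, theta) for j in strata[t]) / Qt[t]
             for t in strata)
    psi1 = [H[t] - float(s == t) for t in list(strata)[:-1]]          # (29)
    psi2 = [Q[j] - P(j, x, theta) / w for j in list(Q)[:-1]]          # (30)
    psi3 = dP(i, x, theta) / P(i, x, theta) - dw / w                  # (31)
    return np.concatenate([psi1, psi2, psi3])

def R_N(H, theta, Q, data, C, **kw):
    """Objective (32): quadratic form in the averaged moments."""
    g = np.mean([psi(H, theta, Q, s, i, x, **kw) for (s, i, x) in data], axis=0)
    return g @ C @ g
```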
We will use the following shorthand: γ = (H', θ', Q')' and γ* accordingly. Define

\Delta_0 = E\left[ \psi(H^*, \theta^*, Q^*, s, i, x) \cdot \psi(H^*, \theta^*, Q^*, s, i, x)' \right]

and

\Gamma_0 = E\left[ \frac{\partial \psi}{\partial (H', \theta', Q')}(H^*, \theta^*, Q^*, s, i, x) \right].

ASSUMPTION 3.3: (i) (H*, θ*, Q*) is the unique solution to Eψ(H, θ, Q, s, i, x) = 0. (ii) Δ_0 is nonsingular. (iii) Γ_0 has full rank (= K+M+S-2). (iv) C_N → C_0, C_0 a positive definite matrix.
Assumption 3.3(i) is difficult to check in practice. In principle identification can come from restrictions on the functional form of the conditional probability, or from restrictions on the sampling scheme.4 If Q* is known, this assumption can be weakened to: (H*, θ*) is the unique solution to Eψ(H, θ, Q*, s, i, x) = 0. We will later discuss some conditions that imply that this condition holds. A consequence of Assumption 3.3(ii) is that there is no nonzero vector a such that Σ_{i=1}^M a(i)·P(i|x; θ*) = 0 for all x. It therefore excludes cases where the conditional probabilities do not depend on x. This case is also excluded implicitly by Cosslett (1981a) and Manski and McFadden (1981), and analyzed in detail by

4 One of the sampling strata might for instance be equal to the population as a whole, in which case identification would be guaranteed by identification of the random sampling model. A case in which Assumption 3.3(i) does not hold is the binary logit model if the conditional probabilities can be written as P(i = 1|x; θ) = 1/[1 + exp(θ_0 + θ'x)] and the strata correspond exactly to the two choices. In that case Q and θ_0 are not identified.
Lancaster and Imbens (1991). If Assumption 3.3(iii) does not hold then asymptotic normality will be a problem. This is unlikely to happen in practice. If Q* is known, the following weaker form of this assumption is sufficient: the matrix E(∂ψ/∂(H', θ')(H*, θ*, Q*, s, i, x)) has rank K+S-1.

The estimator γ̂ of γ* is defined as the minimand of R_N(γ) over the Cartesian product of the sets {H ∈ ℝ^{S-1} : δ ≤ H_s ≤ 1-δ, δ ≤ Σ_{s=1}^{S-1} H_s ≤ 1-δ}, {Q ∈ ℝ^{M-1} : δ ≤ Q(j) ≤ 1-δ, δ ≤ Σ_{j=1}^{M-1} Q(j) ≤ 1-δ} (for some δ > 0 such that γ* is in the interior of the set over which R_N(γ) is minimized), and Θ. The following theorem gives its properties.
THEOREM 3.2: Suppose that Assumptions 2.1-2.2 and 3.2-3.3 hold. Then the estimator γ̂ for γ* converges almost surely to γ* and satisfies

\sqrt{N}(\hat{\gamma} - \gamma^*) \rightarrow_d N\!\left( 0, \; (\Gamma_0' C_0 \Gamma_0)^{-1} \Gamma_0' C_0 \Delta_0 C_0 \Gamma_0 (\Gamma_0' C_0 \Gamma_0)^{-1} \right).

If we partition γ into γ = (γ_1', γ_2')' and Γ_0 = (Γ_{01} Γ_{02}), then we can estimate γ_1* in the case γ_2* is known with the minimand γ̂_1 of R_N(γ_1, γ_2*). γ̂_1 converges almost surely to γ_1* and it satisfies

\sqrt{N}(\hat{\gamma}_1 - \gamma_1^*) \rightarrow_d N\!\left( 0, \; (\Gamma_{01}' C_0 \Gamma_{01})^{-1} \Gamma_{01}' C_0 \Delta_0 C_0 \Gamma_{01} (\Gamma_{01}' C_0 \Gamma_{01})^{-1} \right).

PROOF: See Appendix.
The optimal method of moments estimator is the one with C_0, the limit of the weight matrix, equal to Δ_0^{-1}. In that case the covariance matrix reduces to (Γ_{01}'Δ_0^{-1}Γ_{01})^{-1} for the restricted case. It is this estimator that will be analyzed as a candidate for efficiency.

In the previous section the estimator had a maximum likelihood interpretation and it therefore achieved the Cramér-Rao bound. Here, we are not in a maximum likelihood framework so we cannot use this bound directly. Instead, we could use an efficiency concept from Hajek (1972), extended by Chamberlain (1987) to prove efficiency for generalized method of moments estimators. The idea behind this local asymptotic minimax concept is that we look at the expected loss for a particular estimator while letting the true value of the parameter vary over a small neighborhood. An estimator is efficient in this sense if there is no estimator that does better everywhere in this neighborhood. In this particular case we should, following Chamberlain, also let the distribution of x vary over neighborhoods of the true distribution. Then it can be shown that no estimator does better than the one defined in Theorem 3.2 in the neighborhood of the true distribution of x and the true parameter values of θ
and Q. For an extensive discussion of this efficiency concept, see Chamberlain (1987).

However, in order to stress the connection with the recent statistical literature on semi-parametric efficiency bounds we will employ the efficiency concept discussed by Begun et al. (1983) and Newey (1990). In this framework we look at the least favorable direction from which to approach the distribution. Because we already have a candidate for the bound, namely the asymptotic variance given in Theorem 3.2, we only have to show that no regular estimator can do better. We will do so by constructing a sequence of (fully) parametric models that will always include the semiparametric model (3). We will then show that the sequence of Cramér-Rao bounds associated with this sequence of parametric models converges to the asymptotic covariance matrix for the estimator proposed here. This implies that there are fully parametric models with efficiency bounds arbitrarily close to the asymptotic covariance matrix for the estimator of Theorem 3.2. Therefore that estimator is efficient.

THEOREM 3.3: The asymptotic covariance matrix V for any regular estimator for θ, H, and Q satisfies:

V - \Gamma_0^{-1} \Delta_0 (\Gamma_0')^{-1}

is a positive semi-definite matrix. In other words, no regular estimator is more efficient than the estimator in Theorem 3.2.

PROOF: See Appendix.

The key to the proof is the sequence of parametric submodels. It is constructed by partitioning the set X into mutually exclusive subsets X_l for l = 1, 2, ..., L, with as unknown parameters the probabilities δ_l = P(x ∈ X_l). If the partitioning is fine enough the model closely resembles the discrete model of Section 3.1 and the covariance matrices will converge.
3.3. The Connection with Cosslett's Estimators
The connection between the estimator proposed in the previous section and those proposed by Cosslett can best be seen by comparing the relevant moment vectors. In this section we will do so, and as a by-product we will show that Cosslett's estimator does not do better than the new one and that consistency of Cosslett's estimator implies consistency of the new estimator. From the efficiency results here and in Cosslett (1981a, 1981b) it follows of course directly that the two estimators have identical asymptotic covariance matrices. First consider the case with known Q. The moment vector for Cosslett's estimator is given in (15) and (16). It was argued there that λ(j) could be replaced by its probability limit H*(j)/Q*(j) without changing the asymptotic covariance matrix of θ̂. The moment vector would then be ψ = (ψ_1', ψ_2')':

(33)    \psi_{1}(\theta, H^*, Q^*, s, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{j=1}^{M} \frac{\partial P}{\partial \theta}(j \mid x, \theta) \frac{H^*(j)}{Q^*(j)} \right] \Big/ \left[ \sum_{j=1}^{M} P(j \mid x, \theta) \frac{H^*(j)}{Q^*(j)} \right],

(34)    \psi_{2j}(\theta, H^*, Q^*, s, i, x) = \left[ P(j \mid x, \theta) - P(M \mid x, \theta) Q^*(j)/Q^*(M) \right] \Big/ \left[ \sum_{j'=1}^{M} P(j' \mid x, \theta) \frac{H^*(j')}{Q^*(j')} \right].
First note that

(35)    \sum_{j=1}^{M} P(j \mid x, \theta) \frac{H^*(j)}{Q^*(j)} = \sum_{t=1}^{S} H_t^* \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q^*(i')},

and a similar relation with ∂P/∂θ(i|x, θ) substituted for P(i|x, θ). After substituting (35) in (33), the latter is, when evaluated at Q = Q* and H = H*, equal to (31). Similarly, if we substitute (35) in (34), ψ_2 is equal to Λψ̃_2, with ψ̃_2 as in (30), and with Λ the nonsingular (M-1) × (M-1) matrix with elements

\Lambda_{ij} = \begin{cases} \displaystyle -1 - \frac{H^*(i)}{H^*(M)} & \text{for } i = j, \\[1ex] \displaystyle -\frac{Q^*(i) H^*(j)}{H^*(M) Q^*(j)} & \text{for } i \neq j. \end{cases}
This shows that the moments used in Cosslett's estimator are a linear combination of those used in the new estimator. Therefore, the covariance matrix of the latter cannot be larger than the covariance matrix of the former. The new estimator is easier to compute than Cosslett's estimator in this case. The optimization in the known Q and H case is only over the parameter θ, and that numerical optimization problem is much better behaved than the one where λ has to be estimated as well.

To compare the estimators for the unknown Q case consider first the moments (13)-(14). They are more difficult to compare to (29)-(31) than in the previous case since they involve not only different parameters but also parameters of different dimension. λ is of dimension S-1, Q of dimension M-1. We show that Cosslett's estimator cannot be better than the new estimator by changing Cosslett's estimator in several steps, none of which increases the asymptotic variance, until we get the new estimator.
Consider the method of moments estimator for θ, H, λ, and Q based on the moments ψ_{C1} given in (13)-(14) and (29) and (30), with the normalization Σ_{s=1}^S H_s/λ(s) = 1 instead of λ(S) = 1. This does not change the covariance matrix of θ̂ compared to the method of moments estimator for θ and λ based on just the moments (13)-(14). The only difference is that M+S-2 parameters have been added with M+S-2 additional moment equations. Now we add the S-1 restrictions H_s/λ(s) = Σ_{i ∈ 𝒥(s)} Q(i). This can only reduce the asymptotic covariance matrix of θ̂. If we also make the substitution based on (35) we get the following moment equations, in combination with (29) and (30) that do not change:

(36)    \tilde{\psi}_1(\theta, Q, H, s, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \frac{\partial P}{\partial \theta}(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta)}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right],

(37)    \tilde{\psi}_{2t}(\theta, Q, H, s, i, x) = H_t - H_t \frac{\sum_{i' \in \mathcal{J}(t)} P(i' \mid x, \theta) \Big/ \sum_{i' \in \mathcal{J}(t)} Q(i')}{\sum_{t'=1}^{S} H_{t'} \left[ \sum_{i' \in \mathcal{J}(t')} P(i' \mid x, \theta) \Big/ \sum_{i' \in \mathcal{J}(t')} Q(i') \right]}.

Equation (36) is equal to (31), and (37) is a linear combination of elements of (30). Therefore the estimator based on moments (36), (37), (29), and (30), which is not worse than Cosslett's estimator, does not do better than the new estimator. That gives us the desired result that the PML estimator never does better than the new estimator.

This derivation implies in addition that if Cosslett's (1981a) identification conditions are satisfied, that is, if the solution (θ*, λ*) to the equation Eψ_{C1}(θ, λ, s, i, x) = 0 is unique, then the solution to the equation Eψ(H, Q, θ, s, i, x) = 0, namely (θ*, H*, Q*), is also unique. Therefore Cosslett's (1981a) identification conditions imply that Assumption 3.3(i) holds.
4. A MONTE-CARLO INVESTIGATION
In this section we will consider a particular example and perform a small Monte-Carlo study to investigate the small sample properties of the estimator. We use the sampling scheme from the example in Section 2 with probability density function given in (4). M, the number of choices, and S, the number of strata, are both equal to 2. Also, 𝒥(1) = {1} and 𝒥(2) = {2}. This sampling where each stratum corresponds to exactly one choice is known as pure choice-based sampling. To simplify notation we again write q for Q(1), with Q(2) = 1 - q, and h for H_1, with H_2 = 1 - h. Assume that P(1|x, θ) can be written as P(x'θ) = P(θ_0 + x_1 θ_1), with derivatives P_θ(x'θ) = ∂P(x'θ)/∂θ and P_θθ'(x'θ) = ∂²P(x'θ)/∂θ∂θ'. We can write the moments (29)-(31) used in the efficient procedure as

\psi_1(h, q, \theta, s, y, x) = h - I[s = 1],

\psi_2(h, q, \theta, s, y, x) = q - P(x'\theta) \Big/ \left[ \frac{h}{q} P(x'\theta) + \frac{1-h}{1-q} \left( 1 - P(x'\theta) \right) \right],

\psi_3(h, q, \theta, s, y, x) = P_\theta(x'\theta) \left[ \frac{I[s=1]}{P(x'\theta)} - \frac{I[s=2]}{1 - P(x'\theta)} \right] - P_\theta(x'\theta) \left[ \frac{h}{q} - \frac{1-h}{1-q} \right] \Big/ \left[ \frac{h}{q} P(x'\theta) + \frac{1-h}{1-q} \left( 1 - P(x'\theta) \right) \right].
Define

R_{N,C_N}(h, q, \theta) = \left[ \frac{1}{N} \sum_{n=1}^{N} \psi(h, q, \theta, s_n, y_n, x_n) \right]' C_N \left[ \frac{1}{N} \sum_{n=1}^{N} \psi(h, q, \theta, s_n, y_n, x_n) \right].

Let θ̂_GMM be the estimator based on minimizing R_{N,Ĉ_N}(h*, q*, θ). Ĉ_N is estimated as

\hat{C}_N^{-1} = \frac{1}{N} \sum_{n=1}^{N} \psi(h^*, q^*, \bar{\theta}, s_n, y_n, x_n) \cdot \psi(h^*, q^*, \bar{\theta}, s_n, y_n, x_n)',

where θ̄ is the minimand of R_N(h*, q*, θ) in a first step. We will compare this estimator with the following alternatives:
with the followingalternatives:
6WESML: the estimatorproposedby Manskiand Lerman(1977), here defined
as the maximandof
N
n(
L 0 ) = Fa I [Sn = 1] * I* n P(x 0n)
n=1
1 q*
+,[Sn =2]*-h*
1 -hn
ln(1-P(xI
0)).
6CML: the conditionalmaximumlikelihoodestimatorproposedby Manskiand
McFadden(1981), here defined as the solution for 6 to
N
i=13(h*
q* OSn
Yn Xn) =0o
with ψ_3 defined above.5 θ̂_CML can also be defined as the minimand of R_{N,C̄}(h*, q*, θ), where C̄ is the diagonal matrix with 1 on its last K diagonal elements, and 0 on the others. We did not include Cosslett's efficient estimator in this Monte-Carlo investigation because of the computational difficulties associated with it.
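The two step procedure described above is straightforward with a generic optimizer. A sketch (ours, using scipy; psi_b stands for a user-supplied stacked binary-choice moment vector (ψ_1, ψ_2, ψ_3) as defined earlier in this section):

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(psi_b, data, h_star, q_star, theta0):
    """First step: identity weight matrix.  Second step: weight matrix equal to
    the inverse of the average outer product of the first-step moments."""
    def objective(theta, C):
        g = np.mean([psi_b(h_star, q_star, theta, s, y, x)
                     for (s, y, x) in data], axis=0)
        return g @ C @ g
    dim = len(psi_b(h_star, q_star, theta0, *data[0]))
    step1 = minimize(objective, theta0, args=(np.eye(dim),), method="Nelder-Mead")
    psis = np.array([psi_b(h_star, q_star, step1.x, s, y, x)
                     for (s, y, x) in data])
    C_opt = np.linalg.inv(psis.T @ psis / len(data))   # estimate of Delta_0^{-1}
    return minimize(objective, step1.x, args=(C_opt,), method="Nelder-Mead").x
```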
We compare these three estimators using two different sampling schemes. The first is random sampling (R). In that case the CML and WESML estimators both reduce to the standard random sampling maximum likelihood estimator (denoted by RSML).6 Formally, the sampling is characterized by h* = q*. The second sampling scheme is equal shares sampling (ES). Then h* = 1/S = 1/2. This sampling scheme is motivated partly by simulation results in Cosslett (1981a) and by theoretical results in Lancaster and Imbens (1991). These indicate that an equal shares sampling design, although only optimal in a limited number of cases, is close to optimal in a large number of cases. Finally we compare the above estimators with the random sampling maximum likelihood estimator under equal shares sampling. This estimator is inconsistent and will give some indication of the biases resulting from ignoring the sampling scheme.
The distribution of x is a mixture of a standard normal distribution and an exponential distribution with unit variance, shifted one unit to the left to give it zero mean. This mixture is chosen to guard against special results that might occur if a normal distribution is used. The first four moments of this distribution are 0, 1, 1, and 6, and the distribution function is (1/2)Φ(x) for x ≤ -1 and (1/2)Φ(x) + (1/2)[1 - exp(-x - 1)] for x > -1.
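A sketch of the regressor design (ours): draws below follow the stated mixture, half N(0,1) and half a unit exponential shifted one unit left, which reproduces the moments 0, 1, 1, 6:

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_x(n):
    """Mixture regressor used in the Monte Carlo design."""
    normal = rng.standard_normal(n)
    shifted_exp = rng.exponential(size=n) - 1.0   # mean 0, variance 1
    return np.where(rng.uniform(size=n) < 0.5, normal, shifted_exp)
```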
Two choices for the conditional probability function P(·) are employed: P(x'θ) = Φ(x'θ) (probit) and P(x'θ) = 1/[1 + exp(-x'θ)] (logit). In this logit model with an intercept, the first part of the third moment is equal to a linear combination of the other moments, ψ_{31} = ψ_2·h/q - ψ_1, and Assumption 3.3(i) is violated. We therefore drop ψ_{31} from the moment vector ψ, without loss of efficiency, for the logit estimations. In the probit model there is no problem and Δ_0 is of full rank.

Given that we have fixed the distribution of x, the choice of θ implies a value for q. For the logit specification we used three combinations for (θ_0, θ_1, Q): (1.31, 1.00, 0.75), (1.16, 0.50, 0.75), and (2.51, 1.00, 0.90). For the probit specification we use the combinations (1.35, 1.73, 0.75), (0.90, 0.87, 0.75), and (2.35, 1.87, 0.90). These values were chosen to achieve maximum comparability of the logit and probit results.
The results are given in Table I for probit and Table II for logit. They are all based on data sets with 200 observations and 200 replications. In these tables sse stands for sample standard error and is the standard error of the 200 replications.
5 Cosslett (1981b) showed that one can improve on both the WESML and CML estimators by replacing h* by h̄ in the characterizations given above. Because the actual difference in asymptotic distribution is small, and because it facilitates comparison with random sampling, we will use the original version with h*.
6 This would not be true if we used the improved version of the CML or WESML estimator with h̄ instead of h*.
TABLE I
PROBIT

θ_0 = 1.35, θ_1 = 1.73, Q = 0.75

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       1.38  (0.20) (0.20)  1.36 (0.13)   1.78  (0.29) (0.28)  1.74 (0.19)
GMM        R       1.38  (0.19) (0.18)  1.35 (0.13)   1.79  (0.29) (0.28)  1.75 (0.19)
WESML      ES      1.37  (0.16) (0.16)  1.37 (0.09)   1.77  (0.26) (0.25)  1.75 (0.17)
CML        ES      1.37  (0.15) (0.15)  1.37 (0.09)   1.76  (0.25) (0.24)  1.73 (0.16)
GMM        ES      1.36  (0.14) (0.13)  1.36 (0.09)   1.77  (0.25) (0.24)  1.74 (0.16)
RSML       ES      0.77  (0.16) (0.16)  0.76 (0.10)   1.83  (0.25) (0.24)  1.80 (0.16)
θ_0 = 0.90, θ_1 = 0.87, Q = 0.75

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       0.92  (0.14) (0.12)  0.92 (0.09)   0.89  (0.15) (0.16)  0.87 (0.10)
GMM        R       0.91  (0.09) (0.08)  0.90 (0.05)   0.89  (0.15) (0.16)  0.88 (0.09)
WESML      ES      0.90  (0.10) (0.10)  0.90 (0.07)   0.88  (0.14) (0.14)  0.86 (0.09)
CML        ES      0.90  (0.10) (0.10)  0.90 (0.07)   0.88  (0.13) (0.14)  0.87 (0.08)
GMM        ES      0.90  (0.07) (0.06)  0.89 (0.05)   0.88  (0.13) (0.14)  0.88 (0.08)
RSML       ES      0.27  (0.11) (0.11)  0.26 (0.08)   0.93  (0.14) (0.14)  0.91 (0.08)
θ_0 = 2.35, θ_1 = 1.87, Q = 0.90

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       2.42  (0.37) (0.34)  2.38 (0.21)   1.84  (0.39) (0.36)  1.81 (0.22)
GMM        R       2.42  (0.36) (0.32)  2.39 (0.21)   1.86  (0.41) (0.37)  1.80 (0.23)
WESML      ES      2.43  (0.30) (0.23)  2.40 (0.17)   1.89  (0.42) (0.30)  1.82 (0.27)
CML        ES      2.37  (0.21) (0.20)  2.34 (0.12)   1.82  (0.29) (0.27)  1.77 (0.16)
GMM        ES      2.38  (0.20) (0.19)  2.34 (0.11)   1.82  (0.29) (0.27)  1.77 (0.15)
RSML       ES      1.30  (0.25) (0.22)  1.29 (0.13)   2.06  (0.35) (0.29)  2.05 (0.17)
Similarly ase is the asymptotic standard error. It is calculated as the average over the 200 replications of the asymptotic standard error for each replication. If they are far from equal then the asymptotic approximation is not a very good one given the particular values of the parameters chosen, for 200 observations. The median of the 200 replications is denoted by med, and mad stands for the median of the absolute deviation from the median. If the distribution were exactly equal to a normal distribution with standard deviation σ, the median absolute deviation should be equal to 0.68·σ. If the median absolute deviation is smaller than 0.68·σ, the distribution has thicker tails than the normal distribution.
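For reference, these summary statistics can be computed as follows (a small sketch, ours; estimates and ases collect the point estimates and their asymptotic standard errors over the replications):

```python
import numpy as np

def summarize(estimates, ases):
    """Summary statistics reported in Tables I and II for one parameter."""
    est = np.asarray(estimates)
    med = np.median(est)
    return {
        "mean": est.mean(),
        "sse": est.std(ddof=1),          # standard deviation across replications
        "ase": np.mean(ases),            # average asymptotic standard error
        "med": med,
        "mad": np.median(np.abs(est - med)),  # compare with 0.68 * sse
    }
```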
There were no problems with convergence for any of the estimators. All computations were done in FORTRAN on a 386 PC. One run of 200 replications for the random sampling maximum likelihood estimator with random sampling would take about 70 minutes, depending on the parameters of the optimization routine (the precision required, and the manner in which direction and stepsize were calculated). The WESML and CML estimators would take slightly less time with equal shares sampling, presumably due to the fact that the objective
TABLE II
LOGIT

θ_0 = 1.31, θ_1 = 1.00, Q = 0.75

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       1.34  (0.21) (0.20)  1.33 (0.14)   1.02  (0.26) (0.24)  0.99 (0.16)
GMM        R       1.33  (0.12) (0.11)  1.31 (0.08)   1.03  (0.26) (0.24)  0.99 (0.17)
WESML      ES      1.31  (0.18) (0.16)  1.31 (0.13)   1.02  (0.22) (0.21)  1.00 (0.14)
CML        ES      1.31  (0.18) (0.16)  1.31 (0.13)   1.03  (0.22) (0.21)  1.01 (0.13)
GMM        ES      1.32  (0.08) (0.08)  1.31 (0.05)   1.03  (0.22) (0.21)  1.01 (0.13)
RSML       ES      0.21  (0.18) (0.16)  0.21 (0.13)   1.03  (0.22) (0.21)  1.01 (0.13)
θ_0 = 1.16, θ_1 = 0.50, Q = 0.75

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       1.19  (0.18) (0.17)  1.18 (0.11)   0.50  (0.20) (0.20)  0.49 (0.12)
GMM        R       1.17  (0.06) (0.05)  1.15 (0.03)   0.50  (0.20) (0.19)  0.49 (0.12)
WESML      ES      1.17  (0.15) (0.15)  1.17 (0.09)   0.52  (0.16) (0.16)  0.49 (0.10)
CML        ES      1.17  (0.15) (0.15)  1.17 (0.09)   0.52  (0.16) (0.16)  0.50 (0.10)
GMM        ES      1.16  (0.04) (0.04)  1.16 (0.02)   0.52  (0.16) (0.16)  0.50 (0.09)
RSML       ES      0.07  (0.15) (0.15)  0.07 (0.09)   0.52  (0.16) (0.16)  0.50 (0.10)
θ_0 = 2.51, θ_1 = 1.00, Q = 0.90

                        θ_0                                θ_1
estimator  sampl   mean   sse    ase    med   mad     mean   sse    ase    med   mad
RSML       R       2.56  (0.33) (0.31)  2.52 (0.23)   0.99  (0.33) (0.33)  0.97 (0.22)
GMM        R       2.53  (0.19) (0.19)  2.48 (0.11)   0.99  (0.33) (0.32)  0.97 (0.23)
WESML      ES      2.52  (0.18) (0.18)  2.52 (0.12)   1.04  (0.25) (0.22)  1.03 (0.17)
CML        ES      2.51  (0.17) (0.17)  2.51 (0.11)   1.02  (0.21) (0.21)  1.01 (0.13)
GMM        ES      2.52  (0.09) (0.09)  2.51 (0.06)   1.02  (0.21) (0.21)  1.02 (0.13)
RSML       ES      0.32  (0.17) (0.17)  0.31 (0.11)   1.02  (0.21) (0.21)  1.01 (0.13)
function was less flat. The GMM estimator would take between 1.5 and 2 times the time required for the RSML estimator. The increase and the variation were due to the second stage of the optimization process, and depended on the parameters of the optimization routine.

In the first combination (θ_0, θ_1, q) there is little accuracy gained from using the efficient estimator under random sampling for the probit model. Using equal shares sampling does lead to a significant improvement in the variance of the estimates, and it reduces the small sample bias of the estimators. Again there is little improvement from using the optimal GMM estimator given the sampling scheme, compared with the CML and WESML estimators.

If θ_1 is smaller, with q the same, this changes somewhat. The choice of estimator, GMM versus RSML under random sampling, and GMM versus CML and WESML under equal shares sampling, does matter for the variance of the estimator of θ_0, though not for that of θ_1. That the efficiency gain for the estimator is larger when the slope coefficient is smaller in absolute value is not surprising given the result in Lancaster and Imbens (1991) that if θ_1 = 0, θ̂_0 converges faster than the √N rate.
In the third set of simulations θ_0 is chosen to give a value of q closer to 1. The difference between random sampling and equal shares sampling becomes more pronounced. Given the equal shares sampling scheme, the WESML estimator performs markedly worse than the CML and GMM estimators, both in terms of variance and in terms of the difference between the asymptotic variance and the small sample variance.

The differences between the results for the logit model and the probit model are mostly small. For the logit case it has been shown in Manski and Lerman (1977) that if the sampling scheme is ignored, the random sampling maximum likelihood estimator is still consistent for θ_1. This shows up in the table in the results for the RSML estimator with ES sampling. The estimates for θ_0 are severely biased, but those for θ_1 are not. For the probit model the bias in θ_1 is not zero, but relatively small.

In general the conclusion is that it matters a lot for both θ_0 and θ_1 which sampling scheme is chosen, with equal shares sampling being significantly better than random sampling most of the time. For estimating θ_1, the CML and GMM estimators do about equally well. For estimating θ_0, especially if θ_1 is small, GMM does a lot better. This is important if the aim is not so much estimation of parameters but estimation of the conditional probabilities, which depend on both θ_1 and θ_0.
5. CONCLUSION
In this paper an alternative estimation procedure is proposed for choice-based samples. In choice-based samples the sampling is conditional on the dependent variable. Therefore standard maximum likelihood techniques do not apply if only the conditional distribution of the dependent variable given the explanatory variables is parameterized. Various estimators have been proposed to deal with this problem. Some of them, the WESML and the CML estimators, are not efficient. Cosslett's PML estimators are efficient but computationally demanding.

In the new estimation procedure some of the problems with the previously proposed estimators are solved. The new estimator is efficient while the computational burden is reduced compared to Cosslett's estimator. The case where the marginal probabilities of the choices are known and that where they are not known are both special cases of the general estimator. Efficiency is proven using recently developed concepts from semiparametric estimation.

A small Monte-Carlo study suggests that for moderate values of the marginal probabilities the optimal estimator leads to significantly more accurate estimates for the intercepts, though there is no real gain for the slope coefficients.
Dept. of Economics, Harvard University, Cambridge, MA 02138, U.S.A.

Manuscript received May, 1990; final revision received May, 1992.
APPENDIX
PROOF OF LEMMA 3.1: Suppose the logarithm of the likelihood function with N observations z_1, z_2, ..., z_N is L(β) = Σ_{n=1}^N ln f(z_n, β). The asymptotic covariance matrix V of √N(β̂ - β*) is

V = I(\beta)^{-1} = \left[ E \, \frac{\partial \ln f}{\partial \beta}(z, \beta^*) \, \frac{\partial \ln f}{\partial \beta'}(z, \beta^*) \right]^{-1}.

Partition V and its inverse V^{-1} according to β_1, β_2, and β_3:

V = \begin{pmatrix} V_{11} & V_{12} & V_{13} \\ V_{21} & V_{22} & V_{23} \\ V_{31} & V_{32} & V_{33} \end{pmatrix}, \qquad V^{-1} = \begin{pmatrix} V^{11} & V^{12} & V^{13} \\ V^{21} & V^{22} & V^{23} \\ V^{31} & V^{32} & V^{33} \end{pmatrix}.

The variance of the constrained estimator of β_1 and β_3 given β_2 = β_2* is

\begin{pmatrix} V^{11} & V^{13} \\ V^{31} & V^{33} \end{pmatrix}^{-1},

of which the β_1 block is (V^{11} - V^{13}(V^{33})^{-1}V^{31})^{-1}. Since we could characterize the maximum likelihood estimates of β_1 and β_2 by

\sum_{n=1}^{N} h_1(\beta_1, \beta_2, z_n) = 0,

the asymptotic covariance matrix for β_1 and β_2 must satisfy

\begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} = \left[ E \frac{\partial h_1}{\partial (\beta_1' \beta_2')} \right]^{-1} \left[ E \, h_1 h_1' \right] \left[ E \frac{\partial h_1}{\partial (\beta_1' \beta_2')} \right]'^{\,-1}.

The estimator for β_1 given β_2 = β_2* based on minimization of

\left[ \frac{1}{N} \sum_{n=1}^{N} h_1(\beta_1, \beta_2^*, z_n) \right]' C_N \left[ \frac{1}{N} \sum_{n=1}^{N} h_1(\beta_1, \beta_2^*, z_n) \right]

over β_1 with C_N → (E h_1 h_1')^{-1} has asymptotic covariance matrix

\left[ \left[ E \frac{\partial h_1}{\partial \beta_1} \right]' \left[ E \, h_1 h_1' \right]^{-1} \left[ E \frac{\partial h_1}{\partial \beta_1} \right] \right]^{-1} = V_{11} - V_{12} V_{22}^{-1} V_{21}.

One can show that this is equal to [V^{11} - V^{13}(V^{33})^{-1}V^{31}]^{-1} by using the following relations that follow from the partitioning of V and V^{-1}:

I = V_{11}V^{11} + V_{12}V^{21} + V_{13}V^{31},
0 = V_{21}V^{11} + V_{22}V^{21} + V_{23}V^{31},
0 = V_{11}V^{13} + V_{12}V^{23} + V_{13}V^{33},
0 = V_{21}V^{13} + V_{22}V^{23} + V_{23}V^{33}.    Q.E.D.
PROOF OF THEOREM 3.2: The assumptions made, (2.1)-(2.2) and (3.2)-(3.3), guarantee the conditions needed for standard theorems on generalized method of moments estimation to hold. See, for an extensive discussion and references, Hansen (1982) and Manski (1988). Q.E.D.
PROOF OF THEOREM 3.3: For ease of notation we will assume that x has density r(x) on X.7 For any ε > 0 partition X into L_ε subsets X_l in such a way that if l ≠ m, X_l ∩ X_m = ∅, and if x, z ∈ X_l, then ||x - z|| < ε. Define 1_{lx} to be equal to 1 if x ∈ X_l and 0 otherwise, and

r_\varepsilon(x) = r(x) \Big/ \left[ \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} r(z) \, dz \right].

The sequence of parameterizations we will employ is indexed by ε:

g_\varepsilon(s, i, x) = H_s \left[ \sum_{l=1}^{L_\varepsilon} \delta_l 1_{lx} \right] P(i \mid x, \theta) \, r_\varepsilon(x) \Big/ \left[ \sum_{j \in \mathcal{J}(s)} \sum_{l=1}^{L_\varepsilon} \delta_l \int_{X_l} P(j \mid z, \theta) \, r_\varepsilon(z) \, dz \right],

with r_ε(·) a known function and H, θ, and δ the unknown parameters. The dimension of the parameter vector (H, θ, δ) is S-1+K+L_ε-1. If we are not interested in the estimator for δ we can eliminate it following exactly the same procedure used in Section 3.1 to eliminate π. Let θ̂, δ̂, and Ĥ be the maximum likelihood estimators for θ, δ, and H. Defining the maximum likelihood estimator of Q as

\hat{Q}(j) = \sum_{l=1}^{L_\varepsilon} \hat{\delta}_l \int_{X_l} P(j \mid z, \hat{\theta}) \, r_\varepsilon(z) \, dz,

we can characterize the maximum likelihood estimators for (H, θ, Q) as GMM estimators with moments

\psi_{\varepsilon 1 t}(H, \theta, Q, s, i, x) = H_t - I[s = t],

\psi_{\varepsilon 2 j}(H, \theta, Q, s, i, x) = Q(j) - \left[ \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} P(j \mid z, \theta) r_\varepsilon(z) \, dz \right] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} P(i' \mid z, \theta) r_\varepsilon(z) \, dz}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right],

\psi_{\varepsilon 3}(H, \theta, Q, s, i, x) = \frac{1}{P(i \mid x, \theta)} \frac{\partial P}{\partial \theta}(i \mid x, \theta) - \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} \frac{\partial P}{\partial \theta}(i' \mid z, \theta) r_\varepsilon(z) \, dz}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right] \Big/ \left[ \sum_{t=1}^{S} H_t \frac{\sum_{i' \in \mathcal{J}(t)} \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} P(i' \mid z, \theta) r_\varepsilon(z) \, dz}{\sum_{i' \in \mathcal{J}(t)} Q(i')} \right].

In order to study the difference between the asymptotic covariance matrix V_ε for this estimator and that for the estimator in Theorem 3.2 (V = Γ_0^{-1}Δ_0(Γ_0')^{-1}) it is convenient to define

d_\varepsilon P(i \mid x, \theta) = \sum_{l=1}^{L_\varepsilon} 1_{lx} \int_{X_l} P(i \mid z, \theta) \, r_\varepsilon(z) \, dz,

and d_ε ∂P/∂θ(i|x, θ) and d_ε ∂²P/∂θ∂θ'(i|x, θ) accordingly. The difference between the moments ψ_ε and ψ in (29)-(31) is that the latter depend on P(i|x, θ), ∂P(i|x, θ)/∂θ, and ∂²P(i|x, θ)/∂θ∂θ', while the former depend on d_εP(i|x, θ), d_ε∂P/∂θ(i|x, θ), and d_ε∂²P/∂θ∂θ'(i|x, θ), with the functional dependence being the same. Define now

\Delta_\varepsilon = E\left[ \psi_\varepsilon(H, Q, \theta, s, i, x) \cdot \psi_\varepsilon(H, Q, \theta, s, i, x)' \right]

and

\Gamma_\varepsilon = E\left[ \frac{\partial \psi_\varepsilon}{\partial (H' Q' \theta')} \right].

Uniform convergence (in x and i) of d_εP, d_ε∂P/∂θ, and d_ε∂²P/∂θ∂θ' to P, ∂P/∂θ, and ∂²P/∂θ∂θ' then ensures that the limits of Δ_ε and Γ_ε equal Δ_0 and Γ_0 respectively. This in turn implies that V_ε = Γ_ε^{-1}Δ_ε(Γ_ε')^{-1} converges to V. Since no regular estimator can have an asymptotic variance lower than the Cramér-Rao bound, it cannot improve on the limit of this sequence and therefore it cannot improve on the asymptotic variance of the estimator in Theorem 3.2.

The form of the proof suggests that it might be possible to extend the result to unbounded x. Then one would need conditions on P(i|x, θ) that ensure that it is still possible to construct a partitioning of X that leads to uniform convergence of Δ_ε and Γ_ε to Δ_0 and Γ_0 respectively.    Q.E.D.

7 As it has been shown in Section 3.1 that the estimator is exactly maximum likelihood if the regressors have a discrete distribution, it is clear that we only have to look at the continuous case. The mixed case can be dealt with at the expense of additional notation.
REFERENCES

BEGUN, J. M., W. J. HALL, W.-M. HUANG, AND J. A. WELLNER (1983): "Information and Asymptotic Efficiency in Parametric-Nonparametric Models," Annals of Statistics, 11, 432-452.
BRESLOW, N., AND N. DAY (1980): The Analysis of Case-Control Studies, 1: Statistical Methods in Cancer Research. Lyon: IARC.
CHAMBERLAIN, G. (1987): "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions," Journal of Econometrics, 34, 305-334.
COSSLETT, S. R. (1978): "An Efficient Estimator of Discrete-Choice Models for Choice-Based Samples," Working Paper, Department of Economics, University of California, Berkeley.
COSSLETT, S. R. (1981a): "Efficient Estimation of Discrete Choice Models," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. McFadden. Cambridge, MA: MIT Press, 51-111.
COSSLETT, S. R. (1981b): "Maximum Likelihood Estimation for Choice-Based Samples," Econometrica, 49, 1289-1316.
COSSLETT, S. R. (1991): "Efficient Estimation for Endogenously Stratified Samples with Prior Information on Marginal Probabilities," Working Paper, Department of Economics, Ohio State University.
GOURIEROUX, C., AND A. MONFORT (1989): "Econometrics Based on Endogenous Samples," Working Paper, CREST/ENSAE.
HAJEK, J. (1972): "Local Asymptotic Minimax and Admissibility in Estimation," Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press.
HANSEN, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029-1054.
HECKMAN, J. J., AND R. ROBB (1984): "Alternative Methods for Evaluating the Impact of Interventions," in Longitudinal Analysis of Labor Market Data, ed. by J. J. Heckman and B. Singer. Cambridge: Cambridge University Press.
HSIEH, D. A., C. F. MANSKI, AND D. MCFADDEN (1985): "Estimation of Response Probabilities from Augmented Retrospective Observations," Journal of the American Statistical Association, 80, 651-662.
IMBENS, G. W., AND T. LANCASTER (1991a): "Efficient Estimation and Stratified Sampling," Harvard Institute of Economic Research Discussion Paper 1545.
IMBENS, G. W., AND T. LANCASTER (1991b): "Combining Micro and Macro Data in Microeconometric Models," Harvard Institute of Economic Research Discussion Paper 1578.
LANCASTER, T. (1990): "A Paradox in Choice-Based Sampling," Brown University Working Paper.
LANCASTER, T., AND G. W. IMBENS (1990): "Choice-Based Sampling of Dynamic Populations," in Panel Data and Labor Market Studies, ed. by J. Hartog, G. Ridder, and J. Theeuwes. Amsterdam: North-Holland.
LANCASTER, T., AND G. W. IMBENS (1991): "Choice-Based Sampling: Inference and Optimality," Brown University Working Paper.
MANSKI, C. F. (1988): Analog Estimation Methods in Econometrics. New York, NY: Chapman and Hall.
MANSKI, C. F., AND S. R. LERMAN (1977): "The Estimation of Choice Probabilities from Choice-Based Samples," Econometrica, 45, 1977-1988.
MANSKI, C. F., AND D. MCFADDEN (1981): "Alternative Estimators and Sample Designs for Discrete Choice Analysis," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. McFadden. Cambridge, MA: MIT Press, 51-111.
NEWEY, W. (1984): "A Method of Moments Interpretation of Sequential Estimators," Economics Letters, 14, 201-206.
NEWEY, W. (1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics, 5, 99-135.
RIDDER, G. (1987): Life Cycle Patterns in Labor Market Experience, Ph.D. Dissertation, University of Amsterdam.