Econometrica, Vol. 61, No. 2 (March, 1993), 387-421
AN EFFICIENT SEMIPARAMETRIC ESTIMATOR
FOR BINARY RESPONSE MODELS
BY ROGER W. KLEIN AND RICHARD H. SPADY¹
This paper proposes an estimator for discrete choice models that makes no assumption
concerning the functional form of the choice probability function, where this function can
be characterized by an index. The estimator is shown to be consistent, asymptotically
normally distributed, and to achieve the semiparametric efficiency bound. Monte-Carlo
evidence indicates that there may be only modest efficiency losses relative to maximum
likelihood estimation when the distribution of the disturbances is known, and that the
small-sample behavior of the estimator in other cases is good.
KEYWORDS: Binary choice, semiparametric estimation, adaptive kernel density estimation.
1. INTRODUCTION
IN THIS PAPER WE FORMULATE a semiparametric estimator for the discrete
choice model that satisfies the classical desiderata for such estimators: consistency, root-N normality, and efficiency. The estimator is semiparametric in that
it makes no parametric assumption on the form of the distribution generating
the disturbances. We do, however, assume that the choice probability function
depends on a parametrically specified index function. A method for computing
the efficiency bound in semiparametric contexts was given by Begun, Hall,
Huang, and Wellner (1983). General discussions of these bounds and their
calculation can be found in Newey (1989, 1990). The actual efficiency bound for
the binary choice model was given by Chamberlain (1986) and Cosslett (1987).
To simplify our exposition we formally consider throughout most of this paper
only the binary choice model given by
(1)  $y = \begin{cases} 1 & \text{if } v(x;\theta_0) > u_0, \\ 0 & \text{otherwise,} \end{cases}$

where $v(\cdot\,;\cdot)$ is a known function, $x$ is a vector of exogenous variables, $\theta_0$ an unknown parameter vector, and $u_0$ a random disturbance. We assume that observations $\{x_i, y_i\}$ are i.i.d. and that the model satisfies an index restriction: $E(y \mid x) = E[y \mid v(x;\theta_0)]$, where $E$ is the indicated conditional expectation and $v(x;\theta_0)$ is an aggregator or index.
The index restriction permits multiplicative heteroscedasticity of a general but known form, and heteroscedasticity of an unknown form if it depends only on the index. We characterize such heteroscedasticity by $u = s(x;\beta_0)\varepsilon$, where $\varepsilon$ is independent of $x$. When the (positive) scaling function $s$ is a known and general function of $x$, the index restriction is satisfied in the transformed model with index $v^* \equiv v(x;\theta_0)/s(x;\beta_0)$. If $s$ is unknown, but depends on $x$ only through the index $v(x;\theta_0)$, then the index restriction will also be satisfied. In general, the index restriction will hold if the model can be written in the form shown in (1) with $u = u[v(x;\theta_0),\varepsilon]$, where $\varepsilon$ is independent of $x$. Notice that we distinguish $u = u[v(x;\theta_0),\varepsilon]$ from $u = u[v(x;\theta),\varepsilon]$. We need not make this distinction when $u$ is independent of $x$, in which case $u = u_0 = \varepsilon$.

¹ Earlier versions of this paper were distributed in a session of the Econometric Society Summer Meetings, June, 1987, and presented at the European Econometric Society Meetings, September, 1988. We thank Bob Sherman for helpful discussions and Whitney Newey for numerous comments and technical assistance. In particular, Whitney Newey provided the basic proof showing the equivalence between the asymptotic variance-covariance matrix derived for the estimator here and its lower bound. We would also like to thank L. P. Hansen and anonymous referees for their comments and suggestions. Any errors are the sole responsibility of the authors.
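To fix ideas, here is a minimal simulation sketch (ours, not the authors'; the linear index, the particular scale function, and all variable names are illustrative assumptions) showing that when $u = s(v)\varepsilon$ with $\varepsilon$ independent of $x$, the choice probability depends on $x$ only through the index:

```python
# A minimal sketch (not from the paper): with u = s(v)*eps and eps
# independent of x, Pr(y=1|x) is a function of the index v alone.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=(n, 2))
theta0 = np.array([1.0, 1.0])
v = x @ theta0                      # linear index v(x; theta_0) (assumed)
s = 0.5 * (1.0 + v**2)              # scale depending on x only through v
u = s * rng.normal(size=n)          # heteroscedastic disturbance
y = (v > u).astype(float)

# Pr(y=1|x) = Pr(eps < v/s(v)) depends only on v: binning on the scalar
# index (rather than on the full x vector) recovers the choice probability.
bins = np.quantile(v, np.linspace(0, 1, 21))
idx = np.digitize(v, bins[1:-1])
for k in range(20):
    m = idx == k
    print(f"bin {k:2d}: mean(v)={v[m].mean():+.2f}  Pr(y=1|v)={y[m].mean():.3f}")
```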
Throughout, for notational simplicity we will define probability functions through their arguments. In this manner, for example, $\Pr[u_0 < a(x) \mid b(x)]$ denotes the probability of the event $u_0 < a(x)$ conditioned on $b(x)$. Formally, with $F_{u_0|x}$ as the distribution function for $u_0$ conditioned on $x$ and $E$ denoting a conditional expectation taken with respect to $x$ conditioned on $b(x)$:

(2)  $\Pr[u_0 < a(x) \mid b(x)] \equiv E\{F_{u_0|x}[a(x)] \mid b(x)\}.$

To simplify notation, we will also for the most part not distinguish random variables from their realizations. Typically, the appropriate interpretation will be clear from the context.
The most common way to estimate $\theta_0$ in (1) is to apply the method of maximum likelihood. When the disturbances are independently distributed according to a known conditional parametric distribution, $F_{u|x}$, the log likelihood is

(3)  $L = \sum_{i=1}^{N}\left[y_i \ln(P_i^*) + (1-y_i)\ln(1-P_i^*)\right],$

$P^*[v(x;\theta);\theta] \equiv \Pr[u < v(x;\theta) \mid v(x;\theta)] = F_{u|x}[v(x;\theta)], \qquad P_i^*(\theta) \equiv P^*[v(x_i;\theta);\theta].$

Typically, it is further assumed that $u$ and $x$ are independent, in which case $F_{u|x}$ may be replaced by the unconditional distribution $F_u$. When $F_u$ is unknown, one approach is to choose $F_u$ itself jointly with $\theta$ to maximize the likelihood. This approach is taken in Cosslett (1983), who shows that the resulting estimator of $\theta_0$ is consistent.
In the above approach, once the distribution function is replaced with its maximum likelihood estimate, the resulting concentrated likelihood is not a smooth function of $\theta$. Consequently, it is difficult to establish the asymptotic distribution for the estimator of $\theta$. To circumvent this problem, we propose to select $\theta$ so as to maximize a semiparametric likelihood that is a smooth function of $\theta$ and that locally (for $\theta$ in a neighborhood of $\theta_0$) approximates the corresponding parametric likelihood.

To formulate this semiparametric likelihood, which might more appropriately be termed a quasi-likelihood, we begin by replacing the probability function $P_i^*(\theta)$ in (3) with a tractable function $P_i(\theta)$ that locally approximates it. To
construct $P_i(\theta)$, note that with $C$ as the event $[u < v(x;\theta)]$, $P^*$ in (3) is equivalent to $\Pr[C \mid v(x;\theta)]$, the probability of the event $C$ conditioned on $v(x;\theta)$. We may characterize this probability function by

(4)  $P^*[v(x;\theta);\theta] \equiv \Pr[C \mid v(x;\theta)] = \Pr[C]\, g_{v|C}(v;\theta)/g_v(v;\theta),$

where $\Pr(C)$ is the unconditional probability of $C$, $g_{v|C}$ is the density for $v = v(x;\theta)$ conditioned on $C$, and $g_v$ is the unconditional density for $v$. Let $C_0$ be the event $C$ at $\theta_0$: $[u_0 < v(x;\theta_0)]$, which is observable and is equivalent to the event $y = 1$. If we replace $C$ in (4) with $C_0$, we obtain the probability function $P(v;\theta)$. In other words, $P(v;\theta)$ is the probability of the event $C_0$ conditioned on $v = v(x;\theta)$:

(5)  $P[v(x;\theta);\theta] \equiv \Pr[C_0 \mid v(x;\theta)] = \Pr[C_0]\, g_{v|C_0}(v;\theta)/g_v(v;\theta) = \Pr[y=1]\, g_{v|y=1}(v;\theta)/g_v(v;\theta), \qquad P_i(\theta) \equiv P(v_i;\theta),$

where $g_{v|y=1}$ is the density for $v$ conditioned on $y = 1$.
This function has several desirable features. First, at $\theta = \theta_0$ and under the index restriction, it is equivalent to the true choice probability. Namely, with $v_0 \equiv v(x;\theta_0)$,

(6)  $P(v_0;\theta_0) = \Pr[C_0 \mid v(x;\theta_0)] = \Pr[y=1 \mid v(x;\theta_0)] = \Pr[y=1 \mid x].$

Subject to a continuity condition, the function $P$ in (6) may therefore be viewed as a local approximation to the corresponding parametric probability function, $P^*$ in (3). Second, although the probability function in (5) is unknown, it can be estimated directly as a smooth function of $\theta$. We can estimate $\Pr[y=1]$ by the sample proportion of individuals making the choice in question, while both the conditional and unconditional densities can be estimated nonparametrically as smooth functions of the parameter vector $\theta$. Notice that if these densities are estimated by kernel methods (with the same window being employed for conditional and unconditional densities), then the estimate of (5) will be the usual kernel estimate of the expected value of $y$ conditioned on the index. Finally, since $v$ aggregates the information in the possibly high dimensional vector $x$ into a scalar, we need only be concerned with univariate density estimation.
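As a concrete illustration of the last observation, the following sketch (ours; the Gaussian kernel and fixed window are illustrative choices, not the paper's conditions) estimates (5) with a common window for the conditional and unconditional densities, which collapses to the kernel regression of $y$ on the index:

```python
import numpy as np

def p_hat(v_eval, v, y, h):
    """Kernel estimate of (5): Pr[y=1] * g_{v|y=1}(t) / g_v(t).
    With a common window h this equals the Nadaraya-Watson
    estimate of E[y | v = t]."""
    z = (v_eval[:, None] - v[None, :]) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    # Pr[y=1] * g_{v|y=1} is estimated by the mean of y-weighted kernels
    # and g_v by the mean of all kernels, so the (N h) factors cancel.
    return (k * y[None, :]).sum(axis=1) / k.sum(axis=1)
```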
Using $\hat{P}_i(\theta)$ to denote $P_i(\theta)$ as defined in (5) with its three components replaced by estimates, we propose to estimate $\theta_0$ by choosing $\theta$ to maximize an adjusted version of the estimated quasi-likelihood:

(7)  $\hat{Q}(\hat{P}(\theta)) = \sum_{i=1}^{N} (\hat{\tau}_i/2)\left\{ y_i \ln\left[\hat{P}_i(\theta)^2\right] + (1-y_i)\ln\left[(1-\hat{P}_i(\theta))^2\right] \right\}/N,$

where $\hat{\tau}_i$ is a trimming sequence that is introduced for technical reasons to control the rate at which the estimated densities comprising $\hat{P}_i$ tend to zero. Notice also that for both $y_i$ and $(1-y_i)$ terms, the argument of the ln function is squared. As will become apparent below, there are two methods for estimating the densities upon which estimated probability functions depend: locally-
smoothed kernels and bias reducing kernels. For the former, estimated probability functions lie between zero and one, in which case (7) reduces to the usual
form for a binomial likelihood. For the latter, estimated densities and hence
estimated probabilities can be negative. This problem, which can be ignored
asymptotically, may occur in any given finite sample. The estimated quasi-likelihood function in (7) remains well-defined in this case. We refer to this function
as "estimated" to distinguish (7) from the quasi-likelihood proper which is (7)
with Pi(0) replacedby Pi(0).
Our strategy is essentially to show that the estimator $\hat\theta(\hat P)$, which is obtained by maximizing (7), behaves asymptotically like the estimator $\hat\theta(P)$ defined for a known probability function $P$ as

(8)  $\hat\theta[P] \equiv \arg\sup_\theta \hat Q(P) \equiv \arg\sup_\theta \sum_{i=1}^{N}\left[y_i \ln(P_i) + (1-y_i)\ln(1-P_i)\right]/N.$

The estimator $\hat\theta(P)$ can be analyzed by conventional methods to establish consistency and asymptotic normality. Perhaps not surprisingly, as a likelihood-based estimator, it is also efficient (under the appropriate regularity conditions).
The recent econometric literature includes a variety of different approaches
to semiparametric estimation in this context, including the papers of Cosslett
(1983), Han (1984, 1988), Horowitz (1992), Ichimura (1986), Manski (1975, 1985,
1988), Powell, Stock, and Stoker (1989), Ruud (1983, 1986), Sherman (1993),
and Stoker (1986). Our estimator is similar to Ichimura's, in that both can be
interpreted as minimum distance estimators. It also resembles Cosslett's in that
it replaces P* in the MLE objective function (3) with a nonparametrically
estimated function.
However, the spirit of our approach is perhaps best illustrated by considering
it in light of the nonparametric discrimination problem first posed in the classic
paper of Fix and Hodges (1951).² The univariate version of that problem, in our notation, can be posed as: given a scalar random variable $v$ with samples $v[y=0]$ and $v[y=1]$ of realizations of $v$ from two populations (say, respectively, the $y=0$ and $y=1$ populations), how should a new realization $v$ of unknown origin be classified? Fix and Hodges' solution was to estimate $g_{v|y}[v]$ nonparametrically and to assign $v$ to the $y=0$ group if $g_{v|y=0}[v]/g_{v|y=1}[v] > c$, where $c$ is predetermined. It is a short step from this setup to the realization that the densities of $v$ among the subpopulations plus a knowledge of the overall subpopulation frequencies allow, via Bayes' rule, the calculation of the probability of group membership conditional upon $v$. Optimal classification is, with such probabilities known or estimated, a matter of one's loss function. If we select $\theta$ so as to minimize the KLIC (Kullback-Leibler information criterion)
² Reprinted recently with a commentary by Silverman and Jones (1989). This paper not only clearly formulated the nonparametric discrimination problem, but also originated two methods of nonparametric density estimation (kernel density estimation and nearest neighbors), as well as recognizing optimal smoothing as a bias-variance tradeoff.
discrepancy of $P_i(\theta)$ from $P_i^*(\theta)$ over the sample $y_i$ realizations, then the resulting estimator will be equivalent to the one that maximizes the quasi-likelihood function in (8).
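Pulling the pieces of this section together, a compact sketch of the estimator (ours; it uses a fixed Gaussian kernel, clips probabilities in place of the paper's trimming, and assumes a linear index with the first coefficient normalized to one — a simple alternative to the scale normalizations discussed in Section 2.2):

```python
import numpy as np
from scipy.optimize import minimize

def quasi_loglik(theta, x, y, h):
    """Quasi-likelihood (8) with P_i replaced by leave-one-out
    kernel estimates of E[y | v_i]."""
    v = x @ theta                                   # linear index (assumed)
    z = (v[:, None] - v[None, :]) / h
    k = np.exp(-0.5 * z**2)
    np.fill_diagonal(k, 0.0)                        # leave-one-out sums
    p = (k * y[None, :]).sum(1) / k.sum(1)          # \hat P_i(theta)
    p = np.clip(p, 1e-6, 1 - 1e-6)                  # crude guard in place of trimming
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def klein_spady(x, y, h):
    # Normalize theta_1 = 1 and optimize over the remaining coefficients.
    k = x.shape[1]
    obj = lambda b: -quasi_loglik(np.r_[1.0, b], x, y, h)
    res = minimize(obj, np.zeros(k - 1), method="Nelder-Mead")
    return np.r_[1.0, res.x]
```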
In what follows, we begin in Section 2 by first providing an overview of the main assumptions and then discussing conditions required for identification. In Section 3 we prove consistency, and in Section 4 we establish asymptotic normality (at rate $N^{1/2}$) and efficiency. In each of these sections, for expositional purposes we outline the proofs; complete proofs are relegated to an Appendix. To illustrate the performance of the estimator in practice, in Section 5 we undertake several Monte-Carlo experiments in which the semiparametric estimator is compared to probit under several model specifications. We find that the semiparametric estimator performs quite well relative to probit and can, in models sufficiently perturbed from the usual probit specification, substantially dominate the probit estimator. In Section 6 we conclude by indicating several extensions of the proposed estimator and directions for future research.
2. ASSUMPTIONS
2.1. An Overview
We will require conditions that serve to define the model, insure that various estimated functions are sufficiently well-behaved, and establish those functions of the parameters that are identified. In stating these assumptions, we need to introduce some notation. Let $F_{u_0|x}$ be the distribution function for $u_0$ in (1) conditioned on $x$. Then, with $E$ denoting a conditional expectation taken with respect to $x$ conditioned on $v(x;\theta)$, define the probability function $P[\cdot]$ as:

(9)  $P[v(x;\theta);\theta] \equiv E_{x|v(x;\theta)}\left\{F_{u_0|x}[v(x;\theta_0)]\right\}.$

In the above definition, $P[\cdot]$ inherits its dependence on $\theta$ from the manner in which the distribution of $x$ conditioned on $v(x;\theta)$ depends on $\theta$. In general, this distribution will depend on $v(x;\theta)$ and separately on $\theta$.³ Finally, let $D^r f(z)$ denote the $r$th order partial of $f(z)$ with respect to $z$ and $D^0 f(z) = f(z)$. Then, with $c$ denoting a positive generic constant, we assume the following conditions hold:
(C.1) The data consist of a random sample $(y_i, x_i)$, $i = 1,\dots,N$. The random variable $y$ is binomial with realizations 1 and 0.

(C.2) The parameter vector $\theta$ lies in a compact parameter space, $\Theta$. The true value of $\theta$, $\theta_0$, is in the interior of $\Theta$.
³ For example, suppose that $x$ is normal with a nonzero mean vector and that $v$ is linear in the components of $x$. In this case, the distribution of $x$ conditioned on $v(x;\theta) = x\theta$ will be normal with a mean that depends on $x\theta$ and on $\theta$.
(C.3a) There exists a scalar aggregator, $a(x;\theta)$, such that for any $x$:

$\Pr[y=1 \mid x] \equiv E(y \mid x) = E[y \mid a(x;\theta_0)] \equiv \Pr[y=1 \mid a(x;\theta_0)].$

(C.3b) With $x_1$ a continuous element of $x$, there exists a monotonic transformation, $T$, such that under $T$:

$v(x;\theta_0) \equiv T[a(x;\theta_0)] = v_1(x_1) + v_2(x_2;\theta_0), \qquad |dv_1/dx_1| > c.$
Conditions (C.1)-(C.3) describe the manner in which the data are generated. Assumption (C.3a) is especially important because it allows us to reduce the dimension of the conditioning variables. As will become apparent below, assumption (C.3b) is useful in relating properties of the distribution of the index to those of the exogenous variables. This assumption will also be convenient in obtaining conditions under which identification holds. For the linear index case, as is standard in the literature, condition (C.3b) will hold under a location-scale transformation provided that the continuous variable has a nonzero coefficient. For a nonlinear example, consider the aggregator

(10)  $a = [x_1]^{\gamma_{10}}\left[[x_2]^{\gamma_{20}} + [x_3]^{\gamma_{30}}\right], \qquad a > 0.$

Then the index $v \equiv (1/\gamma_{10})\ln(a)$ has the required form for $\gamma_{10} \neq 0$.
(C.4a) There exist $\underline{P}$, $\bar{P}$ that do not depend on $x$ such that (employing (C.3)):

$0 < \underline{P} \le \Pr[y=1 \mid v(x;\theta_0)] = \Pr(y=1 \mid x) \le \bar{P} < 1.$

This condition serves to bound the probability function $P[v(x;\theta);\theta]$ away from zero and one. In the notation introduced above,

(11)  $\Pr[y=1 \mid v(x;\theta_0)] = F_{u_0|x}[v(x;\theta_0)] \quad \text{and} \quad P[v(x;\theta);\theta] = E\left(F_{u_0|x}[v(x;\theta_0)]\right).$

This restriction, which requires that $v(x;\theta_0)$ be a bounded random variable, can be relaxed as we will be downweighting observations for which the densities comprising $P[v(x;\theta);\theta]$ are small.
(C.4b) With $H[v(x;\theta_0)] \equiv \Pr[u_0 < v(x;\theta_0) \mid v(x;\theta_0)]$, assume that $H(t)$ is continuously differentiable in $t$ for all $t$ and that $|dH(t)/dt| < c$.

It should be noted that this condition could be replaced by a somewhat weaker and (notationally) more complicated dominance condition. To interpret this condition, note that when $x$ and $u_0$ are independent, $H[v(x;\theta_0)] = F_{u_0}[v(x;\theta_0)]$, where $F_{u_0}$ is the distribution function for $u_0$. Consequently, the above condition holds when $u_0$ is independent of $x$ and has a bounded and continuous density. We require that it also hold when $u_0$ is dependent on $x$ under the index restriction in (C.3a)-(C.3b).
(C.5) The index $v$ is smooth in that for $\theta$ in a neighborhood of $\theta_0$ and all $t$:

$\left(|D^r_\theta v(t;\theta)|,\; |d(D^r_\theta v(t;\theta))/dx_1|\right) < c \qquad (r = 0, 1, \dots, 4).$

(C.6) With $v(x;\theta_0) = v_1(x_1) + v_2(x_2;\theta_0)$ from (C.3b), let $f_{x_1|\cdot}(x_1)$ be the density for the continuous variable $x_1$ conditioned on the remaining exogenous variables, $x_2$, and on $y$, where $x_2$ is a vector of either discrete and/or continuous random variables. This conditional density is supported on $[a,b]$ and is smooth in that for all $x_2$:

$|D^r_t f_{x_1|\cdot}(t)| < c \qquad (r = 1, \dots, 4;\; t \in [a,b]).$
Conditions (C.5)-(C.6) insure that the densities that underlie $P(v;\theta)$ are sufficiently smooth. Let $g_{x_1|\cdot}$ and $g_{v_1|\cdot}$ be the densities for $x_1$ and $v_1 \equiv v_1(x_1)$ respectively, conditioned on $x_2$ and $y$. Given the smoothness assumptions on the index, if $g_{x_1|\cdot}$ has $r$ derivatives with respect to $x_1$, $g_{v_1|\cdot}$ must also have $r$ derivatives with respect to $v_1$. It now follows that since

(12)  $g_{v|\cdot}(s;\theta) = g_{v_1|\cdot}[s - v_2(x_2;\theta)],$

and $v_2$ is smooth in $\theta$ (C.5), $g_{v|\cdot}$ must have $r$ derivatives in $\theta$. Finally, since

(13)  $g_{v|y}(s;\theta) = E[g_{v|\cdot}(s;\theta) \mid y],$

$g_{v|y}(s;\theta)$ must also have $r$ derivatives with respect to $\theta$. For example, from (12)-(13) with $r = 1$:

(14)  $|dg_{v|y}(s;\theta)/d\theta| \le \sup_t |dv_2(t;\theta)/d\theta| \cdot \sup_s |dg_{v_1|y}(s;\theta)/ds|.$

The first term above is bounded by assumption, and the second (from the above discussion) inherits its bound from the upper bound on $|dg_{x_1|\cdot}(w)/dw|$ and the lower bound on $|dv_1/dx_1|$.
To illustrate the smoothness condition in (C.6), let the density for $x_1$ conditioned on $x_2$ and $y$ be given by

(15)  $g_{x_1|\cdot}(t) \propto \left[a_1^2 - (t - a_2)^2\right]^r,$

which, with $a_2 > a_1 > 0$, has compact support on $[a_2 - a_1, a_2 + a_1]$ and $r$ derivatives on the entire real line. As stated earlier, this smoothness assumption is technically convenient in insuring that the index has a smooth density. As there are important cases in which this assumption does not hold, in the concluding section we will indicate how the argument can be modified. In what follows, the only problematic features of the densities for the index will be that they are permitted to approach zero at an arbitrary rate and at unknown points (that may be either on the support boundaries or in the interior of the support).
(C.7) With $h_N \to 0$, the trimming function employed to downweight observations has the form

$\tau(t;\varepsilon) = \frac{1}{1 + \exp\left[\left(h_N^{\varepsilon} - t\right)/h_N^{\varepsilon + 1/4}\right]},$

where $\varepsilon > 0$ and $t$ is to be interpreted as a density estimate.

With $t$ above replaced by an estimated density, this function will be employed to downweight observations for which the corresponding densities are small. As $h_N \to 0$, this trimming function behaves as a smoothed indicator in that it tends exponentially to one for $t$ sufficiently far from zero and to zero for $t$ sufficiently small.
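In code, the trimming function of (C.7) is a smooth logistic switch; a one-line sketch (ours), taking the reconstructed exponents above at face value:

```python
import numpy as np

def tau(t, h_N, eps):
    """Smoothed indicator of (C.7): ~1 when the density estimate t is
    large relative to h_N**eps, ~0 when t is near zero; the width of
    the transition region is h_N**(eps + 1/4)."""
    return 1.0 / (1.0 + np.exp((h_N**eps - t) / h_N**(eps + 0.25)))
```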
(C.8) Letting $g_{yv}(v_i;\theta) \equiv \Pr(y)\, g_{v|y}(v_i;\theta)$, define the estimator for $g_{yv}$ as

$\hat g_{yv}(v_i;\theta,\lambda_y;h_N) = \frac{1}{N-1}\sum_{j \neq i} \frac{1\{y_j = y\}}{h_N\lambda_{yj}}\, K\!\left[\frac{v_i - v_j}{h_N\lambda_{yj}}\right] \qquad (y = 0, 1).$

The kernel function, $K(z)$, is a symmetric function that integrates to one, has bounded second moment, and

$\left\{|D^r_z K(z)|,\; \int |D^r_z K(z)|\, dz\right\} < c \qquad (r = 0, 1, 2, 3, 4).$

The parameter $h_N$ is a nonstochastic window, $N^{-1/6} < h_N < N^{-1/8}$. The kernel is also either:
(C.8a) Bias Reducing: With $\lambda_{yj} \equiv 1$, $h_N\lambda_{yj} = h_N$, $j = 1,\dots,N$, and

$\int z^2 K(z)\, dz = 0;$

or
(C.8b) Adaptive with local smoothing: Windows have the form

$h_N\lambda_{yj} \equiv [h_N]\,[\hat\sigma_y(\theta)]\,[\hat\lambda_{yj}].$

The component $\hat\sigma_y$ is the sample standard deviation of $v(x;\theta)$ conditioned on $y$. For the final component, which reflects local smoothing, let $\hat g_{yv_j} \equiv \hat g_{yv}(v_j;\theta,1;h_{Np})$ be a preliminary density estimate of the form shown in (C.8) with window (without local smoothing) $h_{Np}[\hat\sigma_y(\theta)]$. For $0 < \varepsilon' < 1$, from (C.7) define:

$\hat\tau_{pj} \equiv \tau(\hat g_{yv_j};\varepsilon'),$

computed with window $h_{Np}$. The local smoothing parameter can now be defined by:

$\hat\lambda_{yj} \equiv \left[\hat\tau_{pj}\,\hat g_{yv_j}/\hat m + (1 - \hat\tau_{pj})\right]^{-1/2},$

where $\hat m$ is the geometric mean of the $\hat g_{yv_j}$. For notational simplicity, we will write $\hat g_{yv}(t;\theta) \equiv \hat g_{yv}(t;\theta,\lambda_y;h_N)$ to refer to either of the estimates in (C.8a) or (C.8b).
Assumptions (C.8a) and (C.8b) specify two alternative density estimators whose similar bias properties are required in the normality proof. The estimator in (C.8a) is a bias reducing kernel that is permitted to be negative in the tails. With density derivatives uniformly bounded up to order 4, such a kernel has a uniform bias of order $h_N^4$. Assumption (C.8b) specifies an adaptive kernel estimator with a variable and data dependent window size. This estimator differs from that in (C.8a) first by adjusting the global window size by the scale of the distribution. We have followed Silverman (1986) in this regard. Such scaling could also be introduced into the bias reducing kernel estimator in (C.8a). Note that the density estimator in (C.8b) is also restricted to be positive. Finally, these estimators differ in that the window in (C.8a) is constant at each sample size. In contrast, the estimator in (C.8b) specifies variable windows. To select such windows, Abramson (1982) showed that the $[g_{yv}(v_j)/m]^{-1/2}$, $j = 1,\dots,N$, are optimal invariant local smoothing parameters in a pointwise mean-squared error sense when densities are bounded away from zero. Under known local smoothing and for densities bounded away from zero with uniformly bounded derivatives up to order 4, Silverman obtains a uniform bias of order $h_N^4$ for the density estimated with local smoothing. By downweighting observations for which densities are small and appropriately modifying the local smoothing parameters (to control the rate at which they can tend to zero), estimated local smoothing parameters can be taken as given. As a result, we will be able to show that under local smoothing the density estimate has a bias that is sufficiently small for our purposes. It should be remarked at this point that both Abramson (1982) and Silverman (1986) report that this two-stage procedure (the first stage being required to obtain estimated local smoothing parameters) performs well in Monte-Carlo studies. Moreover, the second-stage density estimate is reportedly not too sensitive to the first-stage estimate.
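The two-stage construction in (C.8b) can be sketched as follows (our rendering of the Abramson/Silverman scheme with a Gaussian kernel; the trimming modification of the pilot densities and the scale factor $\hat\sigma_y$ are omitted for brevity):

```python
import numpy as np

def gauss_kde(t, v, h):
    """Fixed-window kernel density estimate at points t."""
    z = (t[:, None] - v[None, :]) / h
    return np.exp(-0.5 * z**2).sum(1) / (len(v) * h * np.sqrt(2 * np.pi))

def adaptive_kde(t, v, h):
    """Two-stage Abramson-type estimator: pilot densities set local
    windows h*lam_j with lam_j = (g_pilot(v_j)/m)**(-1/2), where m is
    the geometric mean of the pilot densities."""
    g_pilot = gauss_kde(v, v, h)
    m = np.exp(np.mean(np.log(g_pilot)))            # geometric mean
    lam = (g_pilot / m) ** -0.5                     # local smoothing parameters
    z = (t[:, None] - v[None, :]) / (h * lam[None, :])
    w = np.exp(-0.5 * z**2) / (h * lam[None, :])
    return w.sum(1) / (len(v) * np.sqrt(2 * np.pi))
```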
(C.9) The model is identified in that for any $x \in \mathcal{X}$, a set of positive probability,

$P[y=1 \mid v(x;\theta_0)] = P[y=1 \mid v(x;\theta^*)] \;\Rightarrow\; \theta_0 = \theta^*.$
The final assumption, which is made for purposes of identification, can be
interpreted as either a statement about what parameter restrictions are necessary for identification or what functions of the parameters are identified. To
avoid separate notation for "true" parameter values and identifiable functions
of them, we will proceed under the first interpretation. In the next subsection,
we provide sufficient conditions for identification and discuss several examples.
2.2. Identification
Any value of $\theta^*$ that maximizes the limiting quasi-likelihood must satisfy the condition on probabilities in (C.9), which clearly holds if $\theta^* = \theta_0$. The identifi-
cation assumption requires that this solution be unique. There are two classes of models and corresponding restrictions for which such identification holds. When the probability function, $P[y=1 \mid v(x;\theta_0)]$, is monotonic in the index, we have the following result.

THEOREM 1 (Monotonic Models): Assume that $P[y=1 \mid v_0]$ is strictly monotonic in $v_0 \equiv v(x;\theta_0)$ and that conditions (C.3a)-(C.3b) and (C.5)-(C.6) hold. Let $\mathcal{X}$ be a set of positive probability for $x$ on which $dP[y=1 \mid v_0]/dv_0 \neq 0$. For $x \in \mathcal{X}$, assume that if $v_2(x_2;\theta_0) = v_2(x_2;\theta^*)$, then $\theta_0 = \theta^*$. It now follows that identification holds as specified in (C.9).
Cosslett (1983) obtained this result for the linear index model, with the theorem above requiring an appropriate extension to the argument. From monotonicity of the probability function, there exists a function $H$ for which

$P\{y=1 \mid v(x;\theta_0)\} = P\{y=1 \mid v(x;\theta^*)\},$

$v_1(x_1) + v_2(x_2;\theta_0) = H[v_1(x_1) + v_2(x_2;\theta^*)].$

Differentiating both sides of this equality with respect to $x_1$ (for $x \in \mathcal{X}$), it follows that $H$ must be the identity, in which case $v_2(x_2;\theta_0) = v_2(x_2;\theta^*)$. Identification then holds if this condition implies that $\theta_0 = \theta^*$, as claimed.
To provide an example, it is convenient to write $\theta \equiv \theta(\gamma)$ and $\theta_0 \equiv \theta(\gamma_0)$, where $\theta_0$ is an identifiable parameter vector that depends on some possibly higher dimensional vector, $\gamma_0$. Then, employing this notation, suppose that the (untransformed) index in (C.3a) is given by

(16)  $a(x;\theta_0) = (x_1)^{\gamma_{10}}(x_2)^{\gamma_{20}}(x_3)^{\gamma_{30}}, \qquad a > 0.$

From (C.3b), with $T$ as the log transform and $x_1$ continuous, $v(x;\theta_0) = (1/\gamma_{10})\ln[a(x;\theta_0)]$ for $\gamma_{10} \neq 0$. Under Theorem 1, the parameter vector $\theta_0 \equiv \theta(\gamma_0) \equiv (\gamma_{20}/\gamma_{10},\, \gamma_{30}/\gamma_{10})$ is identified (the matrix of log observations is assumed to have full rank). This result is as expected since in log form we have the usual linear index model with identification obtained via a scale restriction (there is no intercept).
A somewhat more interesting (nonlinear) example is given by

(17)  $a(x;\theta_0) = (x_1)^{\gamma_{10}}\left((x_2)^{\gamma_{20}} + (x_3)^{\gamma_{30}}\right)^{\gamma_{40}}, \qquad a > 0.$

As in the first example, with $\gamma_{10} \neq 0$, $v(x;\theta_0) \equiv (1/\gamma_{10})\ln(a)$. From Theorem 1, we can now identify $\theta_0 \equiv (\gamma_{40}/\gamma_{10},\, \gamma_{20},\, \gamma_{30})$ provided that

(18)  $\theta_{10}\ln\left((x_2)^{\theta_{20}} + (x_3)^{\theta_{30}}\right) = \theta_1^*\ln\left((x_2)^{\theta_2^*} + (x_3)^{\theta_3^*}\right)$

only holds when $\theta_0 = \theta^*$. Even if $x_2$ and $x_3$ are discrete in this example, in the monotonic index case it is still possible to identify the (appropriately scaled) parameters of the model. For instance, (18) will be uniquely solved at the true parameter values when $(x_2, x_3)$ take on the discrete values $(1,1)$, $(0,2)$, and $(2,0)$ with positive probability.
As an alternative model for which other identifying conditions are required, let the data be generated by:

(19)  $y = \begin{cases} 1 & \text{if } v(x;\theta_0) > s[v(x;\theta_0)]\,\varepsilon, \\ 0 & \text{otherwise,} \end{cases}$

where $s$ is an unknown function of the index and $\varepsilon$ is independent of $x$. If the function $s$ were a known function of $x$, then with $s$ bounded away from zero, we could redefine the index as $v/s$ and apply Theorem 1. When $s$ is an unknown function of the index, the probability function need not be monotonic in $v(x;\theta_0)$ and Theorem 1 is no longer applicable. We could generalize this model further by replacing the condition under which $y$ is one above with $f[v(x;\theta_0) + u_0] > 0$, where the function $f$ need not be monotonic. For the linear index model, this case is discussed by Manski (1988). Extending this result to nonlinear indices, we have the following theorem.
THEOREM 2 (Nonmonotonic Models): Assume that $x \equiv (x_1, x_2)$ is a vector whose components are all continuous on a set $\mathcal{X}$ of positive probability. With the index given from (C.3b) by $v(x;\theta_0) = v_1(x_1) + v_2(x_2;\theta_0)$, assume that for $x \in \mathcal{X}$, the functions $v_1$ and $v_2$ are differentiable in $x_1$ and $x_2$ and $\partial v_1/\partial x_1 \neq 0$. Let $\theta^*$ be a value of $\theta$ such that on $\mathcal{X}$,

$\partial v_2(x_2;\theta^*)/\partial x_2 = \partial v_2(x_2;\theta_0)/\partial x_2.$

Then, identification holds if $\theta^* = \theta_0$ uniquely solves this equation.

To prove Theorem 2, let $v \equiv v_1(x_1) + v_2(x_2;\theta_0)$ and $v^* \equiv v_1(x_1) + v_2(x_2;\theta^*)$. Then

(20)  $G(v) \equiv P[y=1 \mid v_1(x_1) + v_2(x_2;\theta_0)] = P[y=1 \mid v_1(x_1) + v_2(x_2;\theta^*)] \equiv H(v^*).$

Differentiating both sides with respect to $x_1$, it follows that for $x \in \mathcal{X}$: $dG(v)/dv = dH(v^*)/dv^*$. Employing this result, the claim now follows by differentiating both sides of the equality $G(v) = H(v^*)$ with respect to $x_2$ (where $x \in \mathcal{X}$).
3. THE ESTIMATOR AND CONSISTENCY
3.1. The Estimator
Parameter estimates are obtained by maximizing an estimated quasi-likelihood function, which will be defined and discussed in this section. We begin by defining the estimated probability functions, as they are the fundamental components of this estimation method. Recall that the "true" probability function is given as

(21)  $P_i(\theta) \equiv \Pr[y=1]\, g_{v|y=1}(v_i;\theta)/g_v(v_i;\theta) = g_{1v}(v_i;\theta)/g_v(v_i;\theta).$

It would seem natural to define an estimated probability function by replacing all of these components with the corresponding kernel estimates given in (C.8). However, in so doing we would encounter technical difficulties with a uniform convergence argument when estimated densities become too small.
To guard against small estimated densities, let

(22)  $\hat\delta_{yN} \equiv h_N^a\left[\frac{e^z}{1+e^z}\right], \qquad z \equiv \frac{h_N^b - \hat g_{yv}(v;\theta)}{h_N^c},$

be adjustment factors, where $(a, b, c)$ are parameters that must satisfy certain restrictions to be explained below. Then, with $\hat g_v \equiv \hat g_{1v} + \hat g_{0v}$ and $\hat\delta_N \equiv \hat\delta_{0N} + \hat\delta_{1N}$, we define the estimated probability function as

(23)  $\hat P(v;\theta) = \left[\hat g_{1v}(v;\theta,h_N) + \hat\delta_{1N}(v;\theta)\right]\Big/\left[\hat g_v(v;\theta) + \hat\delta_N(v;\theta)\right].$

The purpose of the adjustment factors in (23) is to control the rate at which numerator and denominator of estimated probability functions tend to zero. For this purpose, we first require that $0 < b < c$ in (22). Under this restriction, for small estimated densities (of smaller order than $h_N^b$) the adjustment factors behave exponentially like $h_N^a$. For sufficiently large estimated densities, the adjustment factors tend exponentially to zero, as no adjustment is required in this case. In this manner, numerator and denominator of the estimated probability function behave asymptotically like $h_N^a$ for small estimated densities and like the unadjusted form $\hat g_{1v}/\hat g_v$ otherwise. As a second restriction on the adjustment parameters, we need to select $a$, which essentially controls the adjustment rate for the estimated probability functions. In this regard, it must be noted that we will require two derivatives of the probability function with respect to $\theta$. As might be expected, it is technically convenient if we can ignore derivatives of the adjustment factors. In this regard, we require that $a$ be sufficiently large, and the restriction $a > 2c + 2b > 0$ will suffice. For expositional purposes, we parameterize the adjustment factors in a manner that insures that all restrictions on trimming parameters are satisfied. Namely, with $\varepsilon' > 0$, in (22) we select $a = \varepsilon'$, $b = \varepsilon'/5$, and $c = \varepsilon'/4$. Throughout we will refer to these probability adjustments as "probability trimming."
With $P_i \equiv P(v_i;\theta)$, it can be shown that $|\hat P_i(\theta) - P_i(\theta)|$ converges in probability, uniformly in $i$, $\theta$, to zero except for $v_i$ in a region with vanishing probability. Then, one can show that without any further trimming, the quasi-likelihood (i.e. (7) with $\hat\tau_i = 1$ for all $i$) converges to a function that is uniquely maximized at $\theta_0$. The estimator that maximizes this objective function would then be consistent. For the normality argument, we would need to show that the gradient of this objective function is asymptotically normally distributed at $\theta = \theta_0$. Unfortunately, such an argument requires (at $\theta = \theta_0$) not only that $\hat P_i$ converge to $P_i$, but also that such convergence be at a sufficiently fast rate. Since we cannot establish such a rate in the presence of observations for which densities are "too" small, we will introduce a trimming sequence that (exponentially) downweights such observations.
Let $\hat\theta_p$ be a preliminary consistent estimate of $\theta_0$ for which $|\hat\theta_p - \theta_0|$ is $O_p(N^{-1/3})$.⁴ Then, employing (C.7), define the following "likelihood trimming" function:

(24)  $\hat\tau_i \equiv \hat\tau_{i0}\hat\tau_{i1}, \qquad \hat\tau_{iy} \equiv \tau\left(\hat g_{yv}[v(x_i;\hat\theta_p);\hat\theta_p];\, \varepsilon''\right) \quad (y = 0, 1),$

where the first argument of $\tau$ is the kernel estimate in (C.8) for $\theta = \hat\theta_p$. For the second argument, or trimming rate, $0 < \varepsilon'' < \varepsilon'$, where $\varepsilon'$ is given as in (22). The estimated quasi-likelihood is now defined as

(25)  $\hat Q_N(\theta;\hat\tau) \equiv \sum_i (\hat\tau_i/2)\left\{y_i \ln\left[\hat P_i(\theta)^2\right] + (1-y_i)\ln\left[\left(1-\hat P_i(\theta)\right)^2\right]\right\}/N.$
There are several features of this expression that should be noted. First, as discussed earlier, this function remains well-defined even in the presence of (asymptotically negligible) estimated probability functions that are negative. Second, as to trimming, it should be noted that probability trimming is less severe than likelihood trimming ($\varepsilon'' < \varepsilon'$). As a consequence, we will be able to show that the gradient of $\hat Q_N$ in (25) at $\theta_0$ behaves (asymptotically) as if there were no probability trimming. This result will prove useful in the normality argument below. Third, this expression incorporates both likelihood and probability trimming. For reasons discussed earlier, probability trimming (as in (23)) is not in itself sufficient for our purposes. It might appear that likelihood trimming evaluated at $\theta$ (instead of at the preliminary estimate, $\hat\theta_p$) would suffice as the sole form of trimming. However, in this case the gradient would depend on derivatives of the likelihood trimming function with respect to $\theta$. Since the trimming function approximates an indicator, there must be regions in which such derivatives become arbitrarily large. Such regions would present problems as they are not sufficiently small to ignore in analyzing the gradient (multiplied by $N^{1/2}$).
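Schematically, (24)-(25) modify the untrimmed objective only through weights $\hat\tau_i$ computed once at a preliminary estimate and then held fixed; a sketch (ours, in the style of the earlier sketches, with the same illustrative kernel choices):

```python
import numpy as np

def trimmed_quasi_loglik(theta, x, y, h, tau_hat):
    """Sketch of (25): tau_hat is computed once at a preliminary
    estimate theta_p and held fixed while this objective is maximized
    over theta."""
    v = x @ theta
    z = (v[:, None] - v[None, :]) / h
    k = np.exp(-0.5 * z**2)
    np.fill_diagonal(k, 0.0)                 # leave-one-out estimates
    p = (k * y[None, :]).sum(1) / k.sum(1)
    # ln of squared arguments as in (7)/(25): well-defined even if a
    # bias-reducing kernel makes some estimated probabilities negative.
    return np.mean((tau_hat / 2) * (y * np.log(p**2) +
                                    (1 - y) * np.log((1 - p)**2)))
```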
3.2. Consistency
To prove consistency for the estimator maximizing the estimated quasi-likelihood function in (25), we need to establish the uniform (in $\theta$) limit of this function. In this section, we will sketch the argument and provide details of the proof in the Appendix, which contains all required lemmas. Based on convergence rates for kernel estimates and their derivatives (Lemma 2), Lemma 4 of the Appendix establishes a uniform (in $i$, $\theta$) convergence rate for $\hat P_i(\theta)$ to $P_i(\theta)$. As a result, we are able to show that for the estimated quasi-likelihood, $\hat Q_N(\theta;\hat\tau)$ in (25), estimated probability functions can be taken as known:

(26)  $\sup_\theta\left|\hat Q_N(\theta;\hat\tau) - Q_N(\theta;\hat\tau)\right| = o_p(1),$

$Q_N(\theta;\hat\tau) \equiv \sum_i \hat\tau_i\left[y_i \ln[P_i(\theta)] + (1-y_i)\ln[1-P_i(\theta)]\right]/N.$
⁴ At the cost of complicating the argument, it is possible to permit a slower convergence rate. For the given convergence rate, the estimators given by Han (1988) (see also Sherman (1992)), Horowitz (1992), and Manski (1975) would suffice.
From Lemma 3, we may take the trimming function as given in that with $v_i \equiv v(x_i;\theta_0)$, $\tau_i \equiv \tau_{i0}\tau_{i1}$, and $\tau_{iy} \equiv \tau[g_{yv}(v(x_i;\theta_0);\theta_0);\varepsilon'']$:

(27)  $\sup_\theta\left|Q_N(\theta;\hat\tau) - Q_N(\theta;\tau)\right| = o_p(1),$

$Q_N(\theta;\tau) \equiv \sum_i \tau_i\left[y_i \ln[P_i(\theta)] + (1-y_i)\ln[1-P_i(\theta)]\right]/N.$

Finally, we show that the trimming function in (27), $\tau_i$, may be treated as if it were one (its limiting value) in that with

(28)  $Q_N(\theta) \equiv Q_N(\theta;1): \quad \sup_\theta\left|Q_N(\theta;\tau) - Q_N(\theta)\right| = o_p(1).$

This limiting function, $Q_N(\theta)$, converges in probability, uniformly in $\theta$, to its expectation with maximizing value, $\theta_L^*$, characterized by

(29)  $P[y=1 \mid x] = P[y=1 \mid v(x;\theta_0)] = P[y=1 \mid v(x;\theta_L^*)].$

The first equality follows from the aggregator condition, (C.3a), and the second equality is a necessary condition for $\theta_L^*$ to maximize the limiting quasi-likelihood. From the identification assumption, (C.9), $\theta_L^* = \theta_0$ is the unique maximizing value. Consistency now follows as summarized in Theorem 3 below.

THEOREM 3 (Consistency): Under conditions (C.1)-(C.9):

$\hat\theta \equiv \arg\sup_\theta \hat Q_N(\theta;\hat\tau) \xrightarrow{p} \theta_0.$

Having established consistency, in the next section we show that $\hat\theta$ is distributed asymptotically as (root-$N$) normal with mean zero and minimum variance-covariance matrix (under the appropriate regularity conditions).
4. THE ASYMPTOTIC DISTRIBUTION OF THE ESTIMATOR
4.1. Normality
We begin with a Taylor series expansion of the gradient of the quasi-likelihood function:

(30)  $N^{1/2}(\hat\theta - \theta_0) = -\left[\partial^2 \hat Q(\theta^+;\hat\tau)/\partial\theta\,\partial\theta'\right]^{-1} N^{1/2}\,\partial\hat Q(\theta_0;\hat\tau)/\partial\theta,$

where $\theta^+ \in [\hat\theta, \theta_0]$. Lemma 5 establishes convergence in probability for the Hessian term in (30):

(31)  $-\left[\partial^2 \hat Q(\theta^+;\hat\tau)/\partial\theta\,\partial\theta'\right] \xrightarrow{p} E\left(\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\Big/\left[P(1-P)\right]\right)_{\theta=\theta_0}.$

Accordingly, it remains to show that the gradient term in (30) is asymptotically normally distributed with mean zero. As in earlier sections, we will sketch the proof and leave the details to the Appendix.
To analyze the gradient, write it as a weighted sum of residuals:

(32)  $\hat G(\theta_0) \equiv \partial\hat Q[\theta_0;\hat\tau]/\partial\theta = \sum_i \hat\tau_i\,\hat r_i\,\hat w_i/N,$

where

$\hat r_i \equiv \frac{y_i - \hat P_i}{\hat a_i}, \qquad \hat a_i \equiv \left[\hat g_v(v_i;\theta_0) + \hat\delta_N(v_i;\theta_0)\right]\hat P_i\left[1 - \hat P_i\right], \qquad \hat w_i \equiv \left[\hat g_v(v_i;\theta_0) + \hat\delta_N(v_i;\theta_0)\right]\left[\partial\hat P_i(\theta)/\partial\theta\right]_{\theta=\theta_0}.$

The decomposition is helpful in that the residuals and weights each have desirable properties that we will exploit below. The residual in (32) estimates

(33)  $r_{iN} \equiv r_N(x_i;\theta_0) \equiv \frac{y_i - P_i(\theta_0)}{a_i}, \qquad a_i \equiv \left[g_v(v_i;\theta_0) + \delta_N(v_i;\theta_0)\right]P_i(\theta_0)\left[1 - P_i(\theta_0)\right].$

With $E$ denoting the indicated conditional expectation,

(34)  $E[r_N(x;\theta_0) \mid v(x;\theta_0)] = E[r_N(x;\theta_0) \mid x] = 0,$

a useful property in establishing asymptotic normality.
With $P_i \equiv \Pr[y=1 \mid v(x;\theta) = v(x_i;\theta)]$, the weight function in (32) estimates $w(x_i;\theta_0) \equiv g_v(v_i;\theta_0)\left[\partial P_i/\partial\theta\right]_{\theta=\theta_0}$, which has the useful property that

(35)  $E[w(x;\theta_0) \mid v(x;\theta_0)] = 0.$

This property will hold if the conditional expectation of the derivative of the semiparametric probability function, $P$, is zero. To examine this derivative, we need to be careful to account for the manner in which $P$ depends on $\theta$. Let $F(\cdot \mid x) \equiv F_{u_0|x}$ be the distribution function for $u_0$ conditioned on $x$ and $G(\cdot \mid s,\theta)$ the distribution function for $x$ conditioned on $v(x;\theta) = s$. Then, we define

(36)  $P(s;\theta) \equiv \int F[v(x;\theta_0) \mid x]\, dG(x \mid s,\theta) = \int H[v(x;\theta_0)]\, dG(x \mid s,\theta),$

$F[v(x;\theta_0) \mid x] \equiv \Pr[u_0 < v(x;\theta_0) \mid x], \qquad H[v(x;\theta_0)] \equiv \Pr[u_0 < v(x;\theta_0) \mid v(x;\theta_0)].$

In (36) the function $H$, which we introduce for notational convenience, depends only on $v(x;\theta_0)$ under the index restriction.

In what follows, we want to differentiate $P$ when $s \equiv v(t;\theta)$ for fixed $t$. To this end, let $\alpha(x;\theta) \equiv v(x;\theta_0) - v(x;\theta)$ and, with $v(t;\theta)$ substituted for $s$ in
(36), rewrite $P$ as

(37)  $P[v(t;\theta);\theta] = \int H[\alpha(x;\theta) + v(x;\theta)]\, dG[x \mid v(t;\theta),\theta].$

Then, from the chain rule with $P[v(t;\theta);\theta]$ given as in (37),⁵

(38)  $\partial P/\partial\theta\big|_{\theta=\theta_0} = D_1(t,\theta_0) + D_2(t,\theta_0),$

$D_1(t,\theta_0) \equiv \frac{\partial}{\partial\theta}\left[\int H[v(x;\theta) + \alpha(x;\theta_0)]\, dG[x \mid v(t;\theta),\theta]\right]_{\theta=\theta_0},$

$D_2(t,\theta_0) \equiv \frac{\partial}{\partial\theta}\left[\int H[v(x;\theta_0) + \alpha(x;\theta)]\, dG[x \mid v(t;\theta_0),\theta_0]\right]_{\theta=\theta_0}.$
To explain the two terms in (38), for simplicity consider $f[r(\theta), s(\theta)]$ as any function of $\theta$. The derivative of this function at $\theta = \theta_0$ is

(39)  $df/d\theta\big|_{\theta_0} = \left\{[\partial f/\partial r][dr/d\theta] + [\partial f/\partial s][ds/d\theta]\right\}_{\theta=\theta_0} = \left\{df[r(\theta), s(\theta_0)]/d\theta + df[r(\theta_0), s(\theta)]/d\theta\right\}_{\theta=\theta_0}.$

Notice that in the first term, in which $r$ varies and $s$ is held fixed, we may evaluate $s$ at $\theta_0$ before we differentiate. An analogous result holds for the term in which $s$ varies and $r$ is held fixed. Employing this argument in (38), in $D_1$ we consider variations in $\theta$ with $\alpha$ held fixed. As this term will be evaluated at $\theta = \theta_0$, in $\alpha$ we are entitled to replace $\theta$ with $\theta_0$ before we differentiate. In an analogous manner, we obtain $D_2$ by letting $\alpha$ vary and holding all other functions of $\theta$ fixed.
With $D_i(t,\theta_0)$ given in (38), $i = 1, 2$, let $D(x,\theta_0) \equiv D_1(x,\theta_0) + D_2(x,\theta_0)$. Then, (35) will follow if $E[D(x;\theta_0) \mid v(x;\theta_0)] = 0$. To establish this result, we will proceed by simplifying each of the terms in (38). Under (C.4b)-(C.5), $H$ in (38) is differentiable in a neighborhood of $\theta_0$. Then, since $\alpha(x;\theta_0) = 0$,

(40)  $D_1(t;\theta_0) = \frac{d}{d\theta}\left[\int H[v(x;\theta)]\, dG[x \mid v(t;\theta),\theta]\right]_{\theta=\theta_0} = \frac{d}{d\theta}\left[H[v(t;\theta)]\int dG[x \mid v(t;\theta),\theta]\right]_{\theta=\theta_0} = \frac{d}{d\theta}\left[H[v(t;\theta)]\right]_{\theta=\theta_0}.$

For $D_2$, since $\alpha(x;\theta_0) = 0$ and $d\alpha/d\theta = -dv(x;\theta)/d\theta$, from (C.4b)-(C.5) we may differentiate within the integral to obtain

(41)  $D_2(t;\theta_0) = \left[\int \frac{d}{d\theta}\left(H[v(x;\theta_0) + \alpha(x;\theta)]\right)\, dG(x \mid v(t;\theta_0),\theta_0)\right]_{\theta=\theta_0} = -E[D_1(x;\theta_0) \mid v(x;\theta_0) = v(t;\theta_0)],$
⁵ We are grateful to Whitney Newey for suggesting this "chain-rule" approach for the case in which $x$ and $u$ are independent. A similar argument holds for models satisfying an index restriction.
where the above expectation is taken with respect to the distribution of $x$ conditioned on $v(x;\theta_0) = v(t;\theta_0)$. From (38)-(41),

(42)  $E[D(x;\theta_0) \mid v(x;\theta_0)] = E[D_1(x;\theta_0) \mid v(x;\theta_0)] - E[D_1(x;\theta_0) \mid v(x;\theta_0)] = 0.$

Proceeding with the normality argument, the strategy is to show that all estimated components of the gradient can be replaced by the corresponding true components. If this substitution is permissible, then asymptotic normality will readily follow, as the new gradient will conform to a standard central limit theorem.
Let $G_N(\theta_0)$ be the "true" gradient obtained by replacing $\hat\tau_i$, $\hat r_i$, and $\hat w_i$ in $\hat G(\theta_0)$ with $\tau_i$, $r_{iN}$, and $w_i$. In Lemma 6 of the Appendix, we prove $N^{1/2}[\hat G(\theta_0) - G_N(\theta_0)] = o_p(1)$. To briefly sketch out the argument here, from (32)-(35) decompose this difference as follows:

(43)  $N^{1/2}[\hat G(\theta_0) - G_N(\theta_0)] = N^{-1/2}\sum \hat\tau_i\hat r_i\hat w_i - N^{-1/2}\sum \tau_i r_{iN} w_i$

(43a)  $\qquad = A: \; N^{-1/2}\sum \tau_i\left(\hat r_i\hat w_i - r_{iN}w_i\right)$

(43b)  $\qquad + B: \; N^{-1/2}\sum (\hat\tau_i - \tau_i)\left(\hat r_i\hat w_i - r_{iN}w_i\right)$

(43c)  $\qquad + C: \; N^{-1/2}\sum (\hat\tau_i - \tau_i)\, r_{iN} w_i.$

In the Appendix, we show that each of the above components converges to zero in probability. As the arguments for terms B and C are similar to, and simpler than, those for A, here we will sketch out the proof for A and leave other details to the Appendix. To analyze A in (43a), write it as

(44)  $A = A_1 + A_2 + A_3,$

$A_1 \equiv N^{-1/2}\sum \tau_i(\hat r_i - r_{iN})w_i, \qquad A_2 \equiv N^{-1/2}\sum \tau_i(\hat r_i - r_{iN})(\hat w_i - w_i), \qquad A_3 \equiv N^{-1/2}\sum \tau_i r_{iN}(\hat w_i - w_i).$
For $A_1$, note from (32) that the estimated residual is a ratio of estimated functions: $\hat r_i = (y_i - \hat P_i)/\hat a_i$. As estimated denominators will pose a technical problem for the (mean-square convergence) argument that we seek to employ, expand $\hat r_i$ as a function of $1/\hat a_i$ about $1/a_i$:

(45)  $\hat r_i = \hat r_i^* + o_p(N^{-1/2}), \qquad \hat r_i^* \equiv \left[\frac{y_i - \hat P_i}{a_i}\right]\left[1 - \frac{\hat a_i - a_i}{a_i}\right],$

where the remainder term above is uniformly (in $i$) of order $o_p(N^{-1/2})$. Then,

(46)  $A_1^* \equiv N^{-1/2}\sum \tau_i(\hat r_i^* - r_{iN})w_i = A_1 + o_p(1).$
Proceeding with $A_1^*$,

(47)  $E(A_1^*)^2 = E\left(\sum \tau_i^2\left(\hat r_i^* - r_{iN}\right)^2 w_i^2/N\right).$

Notice that in obtaining the above expectation, we have ignored all cross product terms. By taking an iterated expectation, conditioning first on $x_i$, $i = 1,\dots,N$, and then on $v_i$, $i = 1,\dots,N$, such terms vanish from the property of the weight function in (35) (see Lemma 6 of Appendix A). Each remaining term within the expectations operator in (47) converges to zero in probability and can be shown to satisfy an integrability condition. Therefore, the expectation in (47) does converge to zero as claimed.
Turning to $A_2$, it converges in probability to zero because from Lemma 4 one can show that

(48)  $\sup_i\left|\hat r_i - r_{iN}\right| \cdot \sup_i\left|\hat w_i - w_i\right| = o_p(N^{-1/2}).$

For the final "A-term", $A_3$, the argument is somewhat similar to that for $A_1$. Namely, by making use of the fact that the "true" residual has zero conditional expectation (see (34)), this term converges to zero in mean-square and is therefore $o_p(1)$.
From the above analysis of the gradient and (30)-(33):

(49)  $N^{1/2}(\hat\theta - \theta_0) = \left[E\left(\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\Big/\left[P(1-P)\right]\right)\right]^{-1}_{\theta=\theta_0}\left[N^{-1/2}\sum \tau_i r_{iN} w_i\right] + o_p(1).$

Since this expression consists of a sum of i.i.d. terms, each with zero mean and bounded $p$th absolute moment, $p > 2$, from Serfling (1980, Corollary, p. 32) we have the following result.
THEOREM 4 (Asymptotic Normality): Under conditions (C.1)-(C.9), the asymptotic distribution of $N^{1/2}(\hat\theta - \theta_0)$ is $N(0,\Sigma)$, where

$\Sigma = \left[E\left(\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\Big/\left[P(1-P)\right]\right)\right]^{-1}_{\theta=\theta_0}.$
It should be remarked that the covariance matrix above can be estimated in the usual way when employing this distributional result for inference purposes. The informational equality holds here between the negative of the Hessian and the expected outer product gradient matrices. Consequently, one may employ the standard estimated Hessian, outer product gradient, or White's (1982) estimator to estimate the covariance matrix. As a result, with the quasi-likelihood treated as if it were the true likelihood, standard errors given by a conventional likelihood routine will be correct provided that the functional form for the value function is correct. Indeed, for purposes of inference, the estimated quasi-likelihood may be treated as if it were a likelihood function. In this manner, for example, one can conduct the usual likelihood ratio tests.
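As a sketch of this remark (ours; `p_fun` is a hypothetical helper returning the estimated probabilities $\hat P_i(\theta)$, and the numerical-derivative step size is an arbitrary choice), the outer-product-of-gradients estimate of the covariance matrix in Theorem 4 is:

```python
import numpy as np

def opg_covariance(theta_hat, p_fun, step=1e-5):
    """Outer-product-of-gradients estimate of Sigma in Theorem 4:
    invert the sample average of dP_i dP_i' / [P_i(1-P_i)].
    p_fun(theta) must return the vector of estimated P_i(theta)."""
    p0 = p_fun(theta_hat)
    k = len(theta_hat)
    dp = np.empty((len(p0), k))
    for j in range(k):                       # two-sided numerical derivative
        e = np.zeros(k)
        e[j] = step
        dp[:, j] = (p_fun(theta_hat + e) - p_fun(theta_hat - e)) / (2 * step)
    w = 1.0 / (p0 * (1 - p0))
    info = (dp * w[:, None]).T @ dp / len(p0)
    return np.linalg.inv(info) / len(p0)     # Var(theta_hat) ~ Sigma / N
```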
4.2. Asymptotic Efficiency
Having established asymptotic normality, we now turn to the issue of efficiency. In the next theorem, we show that under an independence assumption
the estimator attains the efficiency bound as given in Chamberlain (1986) and
Cosslett (1987).
THEOREM 5 (Asymptotic Efficiency): For the index model given in equation (1), assume that $x$ and $u$ are independent and denote $\Sigma$ as the covariance matrix given in Theorem 4. Then $\Sigma$ attains the efficiency bound as specified in Chamberlain (1986) and Cosslett (1987).

To prove Theorem 5, note that the inverse efficiency bound for this problem as given in Cosslett (1987, p. 587) is

(50)  $E\left\{\frac{\operatorname{Var}\left[dv(x;\theta_0)/d\theta \mid v(x;\theta_0)\right]\, f_{u_0}^2[v(x;\theta_0)]}{P(y=1 \mid x)\left[1 - P(y=1 \mid x)\right]}\right\},$

where $f_{u_0}$ is the density for $u_0$. From Theorem 4,

(51)  $\Sigma^{-1} = E\left(\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\Big/\left[P(1-P)\right]\right)_{\theta=\theta_0}.$

When $u_0$ is independent of $x$, the theorem readily follows from the expression for $\partial P(\theta)/\partial\theta|_{\theta_0}$ in (38)-(42).
5. MONTE-CARLO RESULTS
The usefulness of the proposed estimator depends at least partly on its performance at moderate sample sizes, where perhaps the leading issues are efficiency loss relative to a correct fully parametric specification and effectiveness in situations where the correct parametric specification is unusual. To give an indication of the performance of the estimator in both these regards, we have performed a series of simulations, of which we report three here.

For all three simulations, the true model is given by $y_i^* = \beta_1 x_{i1} + \beta_2 x_{i2} + u_i$ for $i = 1,\dots,100$ and $y_i = 1$ if $y_i^* > 0$ and $y_i = 0$ otherwise. The $x$'s are independently and identically distributed. For the compact support case, in the first two simulations, $x_1$ is a chi-squared variate with 3 degrees of freedom truncated at 6 and standardized to have zero mean and unit variance; $x_2$ is a standard normal variate truncated at $\pm 2$ and similarly standardized. Notice that in these first two simulations we are comparing estimators without assuming that one of the $x$'s has a density that is differentiable everywhere. For technical convenience, we formulated the theory under the assumption that there was one $x$ with a density that was everywhere differentiable. It is possible to relax this assumption, and for purposes of evaluating the semiparametric estimator, we wanted to present some results that did not impose such smoothness. The third simulation has the $x$'s constructed in the same manner except that they are not truncated.⁶ In Design 1, the $u_i$'s are standard normal; in Designs 2 and 3 they are normal with mean zero and variance $.25(1 + v_i^2)^2$, where $v_i \equiv \beta_1 x_{i1} + \beta_2 x_{i2}$. In all designs $\beta_1 = \beta_2 = 1$, and the $u_i$'s are independently distributed.

⁶ In the first two designs, $x_1$ is standardized by subtracting 2.348 and dividing by 1.511; $x_2$ is divided by .8796. Thus the range of $x_1$ is $(-1.551, 2.420)$ and that of $x_2$ is $(-2.274, 2.274)$. The third design is included to show a case where probit is markedly biased.
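The designs are straightforward to reproduce; a sketch of the data-generating process (ours; for simplicity it reuses the footnote-6 standardization constants in all three designs, which is an assumption on our part):

```python
import numpy as np

def draw_design(n, design, rng):
    """Designs 1-3 of Section 5: y = 1{x1 + x2 > u}."""
    # x1: chi-squared(3); x2: standard normal.
    x1 = rng.chisquare(3, size=n)
    x2 = rng.normal(size=n)
    if design < 3:                       # truncated, compact-support case
        keep = (x1 < 6) & (np.abs(x2) < 2)
        x1, x2 = x1[keep], x2[keep]      # rejection sampling shrinks n
    x1 = (x1 - 2.348) / 1.511            # standardization (footnote 6)
    x2 = x2 / 0.8796
    v = x1 + x2                          # beta1 = beta2 = 1
    if design == 1:
        u = rng.normal(size=len(v))
    else:                                # heteroscedastic designs 2 and 3:
        u = 0.5 * (1 + v**2) * rng.normal(size=len(v))  # var = .25(1+v^2)^2
    return np.column_stack([x1, x2]), (v > u).astype(float)
```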
TABLE I
CUMULANTS OF SIMULATED STANDARDIZED ESTIMATES OF $\hat\beta_1$
(Jackknife Estimates of Standard Errors in Parentheses.)

                               $\kappa_1$    $\kappa_2$    $\kappa_3/\kappa_2^{3/2}$   $\kappa_4/\kappa_2^2$

Design 1
  Untrimmed Semiparametric     0.99958       0.01532        0.01400        0.82470
                              (0.00392)     (0.00082)      (0.13875)      (0.38237)
  Probit MLE                   0.99987       0.01196       -0.07293        0.21619
                              (0.00346)     (0.00056)      (0.09369)      (0.21359)

Design 2
  Untrimmed Semiparametric     0.99758       0.01773       -0.08483        3.99493
                              (0.00421)     (0.00138)      (0.31315)      (1.04304)
  Probit MLE                   1.00508       0.04767        0.11877        0.82421
                              (0.00691)     (0.00254)      (0.12233)      (0.30838)

Design 3
  Untrimmed Semiparametric     0.99195       0.02195       -0.23636        2.81254
                              (0.00469)     (0.00153)      (0.25983)      (0.87196)
  Probit MLE                   0.91525       0.06146       -0.15995        0.53852
                              (0.00784)     (0.00310)      (0.10444)      (0.23182)
To facilitate comparison between probit and the semiparametric estimator, we adopt the normalization $|\hat\beta_1| + |\hat\beta_2| = 2$. In our simulations, we compute probit (with an intercept) in the standard manner and impose the normalization on the result; for the semiparametric estimator the normalization is imposed in the estimation process. Thus we need to report results concerning only one number, $\hat\beta_1$.

In what follows, results are presented for the untrimmed semiparametric estimator and the probit MLE. There is a wide range of trimming specifications that have almost no effect on the estimates. Moreover, the estimate obtained without any trimming performed quite similarly to that under the trimming that we employed. Accordingly, we report results for the semiparametric estimator obtained without probability or likelihood trimming (see the discussion in Section 3.1). Throughout, we estimated densities with adaptive local smoothing as defined in (C.8b), with local smoothing parameters defined without any trimming. The nonstochastic window component, $h_N$, we set at $N^{-1/6.02}$ to satisfy the restrictions placed on it.⁷

Table I presents estimates of the first and second cumulants and the third and fourth standardized cumulants of $\hat\beta_1$ for the three designs, derived from a simulation of size 1,000. The second row of the results for each estimator is the square root of the jackknife estimate of the variance of each cumulant, and thus provides a rough estimate of the standard error of the cumulant estimates.
⁷ The corresponding pilot window component, $h_{Np}$, was set equal to $h_N$. All theoretical results can be shown to hold for this case. It should be noted, however, that after completing this study, we discovered that the results under local smoothing are much easier to establish when the pilot window component is set wider than $h_N$.
In examining Table I, the following facts are notable. First, the efficiency loss in Design 1 from using the semiparametric estimator is quite tolerable: the relative efficiency is 78%, which is comparable to the loss from using least absolute deviations regression rather than ordinary least squares when the disturbance is i.i.d. normal. Second, there is quite a considerable efficiency gain in using the semiparametric estimator in Design 2; a similar efficiency gain is found in Design 3, where probit is quite biased. Notice that by not truncating the distributions of the $x$'s, the distribution of $x\beta_0$ is much more asymmetric in Design 3 than in Design 2. Third, the semiparametric estimator shows considerably more kurtosis than the probit estimator.
Despite the fact that the simulation results concerning the probit estimator of the normalized coefficients in Design 2 are consistent with that estimator being unbiased, the probit model cannot give an accurate picture of the behavior of the choice probability, $p(y=1 \mid x)$. Figure 1.A plots $p(y=1 \mid v)$ for the two designs. In Design 2, $p(y=1 \mid v)$ is not monotonic in $v$, with reversals coming at $v = \pm 1$. Probit of course assumes that it is possible to construct a projection of $x$ into $v$ such that $p(y=1 \mid v)$ is monotonic in $v$. In the current example this is not possible. In Figures 1.B and 1.C we have plotted true probabilities on the horizontal axis and predicted probabilities on the vertical axis for a typical run with 1,000 observations of Design 2.⁸ The semiparametric probabilities track the 45° line reasonably well, and experiments with other sample sizes indicate a convergence to the 45° line as required by theory. In contrast, the probit probabilities are inaccurate and Figure 1.C is an accurate representation of their asymptotic behavior.
A further illustration of the behavior of the two estimators in this example is given by the perspective plots of Figure 2. The height of the surface is $\operatorname{prob}(y=1 \mid x)$; the $x$ and $y$ planes correspond to a rescaling of $x_1$ and $x_2$ (respectively) to the interval $(-1,1)$; the viewpoint is $(-6,-8,5)$. The semiparametric perspective plot of Figure 2.B reflects the features of the true probability perspective except in the extreme foreground (in which the surface curls slightly up due to sampling error at the extreme corner of the sample range) and in the far background (where the estimate is somewhat flatter than the truth). In contrast, the probit perspective plot is obviously too flat.

Clearly the results presented here are only indicative of the performance of the estimator in a few cases. But apparently there are some cases where the estimator is vastly superior to probit (even when probit is virtually unbiased), and there is some indication that the efficiency loss relative to a correctly specified parametric model can be relatively small. In addition, these exercises suggest that semiparametric estimation can serve as a specification test of parametric models, a topic we intend to pursue in future research.
⁸ For our estimator the fitted probabilities converge to the true probabilities at a rate proportional to $N^{1/2}h_N \approx N^{1/3}$. Thus 1,000 observations are required to get about the same precision in fitted probabilities as is provided by 100 observations for parameter estimation. At 100 observations, there is not much difference between plots analogous to Figures 1.B and 1.C.
[Figure 1.A: $P(y=1 \mid v)$ for Designs 1 and 2.]

[Figure 1.B: Semiparametric vs. True $P(y=1 \mid v)$, Design 2, $n = 1000$.]

[Figure 1.C: Probit vs. True $P(y=1 \mid v)$, Design 2, $n = 1000$.]
6. CONCLUSIONS
As discussed above, we have formulated a "likelihood based" estimator for the binary discrete choice model under minimal distributional assumptions. We have shown that in large samples this estimator is consistent, $N^{1/2}$ normally distributed, and that it attains an asymptotic efficiency bound.

We also evaluated the estimator in a Monte-Carlo study by comparing it with the probit estimator for several model specifications. In terms of estimated parameters, we found that the semiparametric estimator had a small efficiency loss relative to probit when the probit model was correct. For a "substantial" perturbation of the probit model, the semiparametric estimator strongly dominated the probit model.
[Figure 2.A: True Probability Perspective.]

[Figure 2.B: Estimated Semiparametric Probability Perspective, $n = 1000$.]

[Figure 2.C: Estimated Probit Probability Perspective, $n = 1000$.]
Although this paper has focused on parameter estimates, it is interesting to note that in many situations choice probabilities will be
inconsistently estimated by probit, but consistently estimated by a semiparametric estimator. In several heteroscedastic designs reported here, we found that
the parametric probit model provided substantially worse probability estimates
than the semiparametric procedure.
Restricting attention to the binary response model, it is possible to extend the
above results in several respects. First, there can be a finite number of points at
which the distribution of the index is not "smooth." For example, if the index is
a linear combination of independent uniform random variables, then the density
for the index will have kink points at which derivatives do not exist. These
points can be detected by comparing left and right numerical derivatives.
Employing the same type of trimming function ($\hat\tau$) used to detect small densities, we can then downweight observations for which numerical right and left derivatives are "sufficiently" far apart. All asymptotic results then hold.⁹
As a second extension, it should be noted that the results above were
obtained under a compact support assumption for the density of the index. To
control for problematic small densities, in the above proof we downweighted
those observations at which the index densities became small. In so doing, the
trimming functions did not depend on whether or not the index had a compact
support. Accordingly, under such density trimming, one would expect that the
above arguments would, with suitable modifications, hold when the index has an
infinite support.
In obtaining the results outlined above, the estimator was examined for the binary discrete choice model. However, it extends to the multinomial case. For example, consider the trinary choice problem. Let alternative $A_k$, $k = 1, 2, 3$, have indirect utility $U_k$ to a given individual. Let $\bar U_k$ be the corresponding measure of average utility, and define the average utility differences $v_{12} \equiv \bar U_1 - \bar U_2$ and $v_{13} \equiv \bar U_1 - \bar U_3$. Then with these utility differences serving as aggregators, we may write choice probabilities as

(52)  $P(A_1 \mid x) = P(U_1 > U_2,\, U_1 > U_3 \mid x) = P(U_1 > U_2,\, U_1 > U_3 \mid v_{[1]}) = P(A_1)\, g_1(v_{[1]} \mid A_1)/g_1(v_{[1]}), \qquad v_{[1]} \equiv (v_{12}, v_{13}).$

It is now possible to construct a multinomial quasi-likelihood as above in terms of estimated probability functions. Notice that in this example we would require bivariate density estimation. In general, as the number of alternatives increases, the estimation problem becomes more difficult, and the theoretical argument becomes more delicate (see, for example, Lee (1989)).
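A sketch of the bivariate analogue of the probability estimate in (52) (ours; a product Gaussian kernel with a single fixed window is an illustrative choice):

```python
import numpy as np

def p_alt1(v12, v13, d1, h):
    """Estimate P(A_1 | v12, v13) = P(A_1) g(v | A_1) / g(v) with a
    product kernel on the bivariate index v = (v12, v13); d1 indicates
    choices of alternative 1. The P(A_1) and density normalizing
    constants cancel in the ratio."""
    z1 = (v12[:, None] - v12[None, :]) / h
    z2 = (v13[:, None] - v13[None, :]) / h
    k = np.exp(-0.5 * (z1**2 + z2**2))
    np.fill_diagonal(k, 0.0)            # leave-one-out estimates
    return (k * d1[None, :]).sum(1) / k.sum(1)
```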
The estimator obtained here can also be extended to ordered single index models. Here, unlike the multiple alternatives case (which involves multiple indices), the extension is direct. For each interval of the dependent variable, we may estimate probabilities as above under a single index. One can then form a quasi-likelihood for the ordered case that is analogous to that for the ordered parametric model. Given the results obtained here, it is relatively straightforward to establish consistency and asymptotic normality for the quasi-maximum likelihood estimator.
Bellcore, 445 South Street, P.O. Box 1910, Morristown, NJ 07962, U.S.A.
and
Nuffield College, Oxford, OX1 1NF, U.K.

Manuscript received March, 1988; final version received September, 1992.
9 In general, suppose that there are a finite number of "problematic" points at which the
maintained assumptions do not hold. Then, provided that one can detect such points, the theory will
hold if these points are appropriately downweighted.
APPENDIX A: BIAS REDUCING KERNELS
In Appendix A, we provide the asymptotic results for bias reducing kernels. In Appendix B, we
outline those additional arguments required to obtain the same asymptotic results under locally-smoothed kernels. Throughout, we let g_{v|y}(v; θ) be the conditional density for v ≡ v(x; θ), conditioned on the discrete variable, y, and evaluated at v_i ≡ v(x_i; θ). It is convenient to state convergence results in terms of the estimator for Pr(y) g_{v|y}(v_i; θ) ≡ g_{yv}(v_i; θ). From (C.8) define this
estimator as

(A1)    ĝ_{yv}(v_i; θ) = Σ_{j≠i} 1{y_j = y} K[(v_i − v_j)/h_N] / [h_N(N − 1)],    y = 0, 1.
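A minimal Python sketch of (A1) follows; the Gaussian kernel is used only for illustration (the appendix assumes a bias-reducing kernel), and the function names are ours:

    import numpy as np

    def loo_kernel_density(v, y, which_y, h, kernel):
        # Leave-one-out estimate of Pr(y) g_{v|y}(v_i): for each i, average
        # K[(v_i - v_j)/h]/h over the j != i with y_j equal to which_y.
        n = len(v)
        u = (v[:, None] - v[None, :]) / h
        w = kernel(u) * (y[None, :] == which_y)
        np.fill_diagonal(w, 0.0)                  # enforce j != i
        return w.sum(axis=1) / (h * (n - 1))

    gauss = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)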
We will require higher order derivatives of these estimates throughout the proofs below.
Notationally, define the rth order derivative of any function g with respect to z by

(A2)    D_z^r[g] ≡ d^r g/(dz)^r,    r = 1, 2, ...;    D_z^0[g] ≡ g.

When z is a vector, interpret the derivative as the matrix of all rth order partials and cross-partials.
We will also denote the above derivative by g^{(r)} when the argument of g with respect to which we
take the derivative is clear from the context.
Throughout, we assume the basic conditions (C.1)-(C.9) given in Section 2 of the text and denote
c as a positive generic constant. Given the convergence rates established in Lemmas 1-2 and known
trimming (Lemma 3), consistency and asymptotic normality will follow for the proposed estimator.
We begin in Lemma 1, the proof of which is based on Bhattacharya's (1967) method, by determining
a uniform convergence rate of an estimate to its expectation.
LEMMA 1 (Convergence of Estimates to Expectations): Let w be a K dimensional vector and
assume that m̂(w) is a sample average of terms m(w; z_i), i = 1, ..., N, where {z_i} are i.i.d. Assume
that with h_N → 0 we have, uniformly over N,

    h^{r+1}|m(w; z_i)| < c,    r + 1 > 0,    and    h^s|dm(w; z_i)/dw| < c,    s > 0.

Let E[m̂(w)] be the expectation of m̂(w) taken over the distribution of z_i, i = 1, ..., N. Then with
w ∈ W, a compact set, and for any a > 0,

    N^{(1−a)/2} h^{r+1} sup_w |m̂(w) − E[m̂(w)]| → 0    a.s.
PROOF OF LEMMA 1: With w_1 and w_2 any two values of w and w⁺ ∈ [w_1, w_2],

    m̂(w_1) − m̂(w_2) = [dm̂(w⁺)/dw](w_1 − w_2).

Since h^s|dm̂(w⁺)/dw| ≤ c, with w_{1k} and w_{2k} as elements of w_1 and w_2,

    |m̂(w_1) − m̂(w_2)| ≤ c Σ_{k=1}^K |w_{1k} − w_{2k}|/h^s.

Therefore, with ε_N > 0 and δ_N ≡ ε_N h_N^s/(3cK),

    |w_{1k} − w_{2k}| < δ_N  implies  |m̂(w_1) − m̂(w_2)| < ε_N/3.

Since w ∈ W, a compact space, without loss of generality let W be contained in a hypercube
whose sides each have "length" one. Then, following Bhattacharya (1967), cover W with sets W_{jN},
j = 1, ..., b_N, where (w_1, w_2) ∈ W_{jN} implies |w_{1k} − w_{2k}| < δ_N, k = 1, ..., K. Here, with int(t) as the least
integer upper bound on t, b_N = int[1/(δ_N)^K]. Then, for w_j ∈ W_{jN},

    sup_{w ∈ W_{jN}} |m̂(w) − E[m̂(w)]| ≤ sup_{w ∈ W_{jN}} [|m̂(w) − m̂(w_j)| + |m̂(w_j) − E[m̂(w_j)]| + |E[m̂(w_j) − m̂(w)]|]
        ≤ (2/3)ε_N + max_j |m̂(w_j) − E[m̂(w_j)]|.

Therefore,

    P[sup_w |m̂(w) − E[m̂(w)]| > ε_N] ≤ P[max_j |m̂(w_j) − E[m̂(w_j)]| > ε_N/3]
        ≤ Σ_{j=1}^{b_N} P[|m*(w_j) − E[m*(w_j)]| > h_N^{r+1} ε_N/3],    m*(w_j) ≡ h_N^{r+1} m̂(w_j).

As noted by Bhattacharya, from Hoeffding (1963; Theorem 1, p. 15 and its extension, p. 25) it
follows that since m* is an average of i.i.d. bounded random variables,

    P[|m*(w_j) − E[m*(w_j)]| > h_N^{r+1} ε_N/3] ≤ 2 exp[−c′N h_N^{2(r+1)} ε_N²],

where c′ is a positive constant. Therefore,

    P[sup_w |m̂(w) − E[m̂(w)]| > ε_N] ≤ 2b_N exp[−c′N h_N^{2(r+1)} ε_N²].

With ε_N replaced by [N^{(1−a)/2} h^{r+1}]^{−1} ε, where ε is a finite constant that does not depend on N,
the sum of the last term over N = 1, ..., ∞ is finite; the result now follows.    Q.E.D.
In proving that an estimate converges uniformly to the truth, the strategy is to use Lemma 1 to
show that the estimate converges uniformly to its expectation and then that its expectation
converges uniformly to the truth. In this manner, Lemma 2 obtains rates for estimated densities and
derivatives.
LEMMA 2 (Uniform Convergence): Uniformly in v(t; θ) and θ,

(2.1)    |D_θ^r[ĝ_{yv}[v(t; θ); θ]] − E{D_θ^r[ĝ_{yv}[v(t; θ); θ]]}| = O_p(N^{−1/2} h^{−(r+1)}),    r = 0, 1, 2;

(2.2)    |E{D_θ^r[ĝ_{yv}[v(t; θ); θ]]} − D_θ^r[g_{yv}[v(t; θ); θ]]| = O_p(h²),    r = 1, 2;

(2.3)    |E[ĝ_{yv}[v(t; θ); θ]] − g_{yv}[v(t; θ); θ]| = O_p(h⁴).
PROOF OF LEMMA 2: The proof of (2.1) immediately follows from Lemma 1. To establish (2.2),
with z ≡ [v(x; θ) − v(t; θ)]/h_N for fixed t,

    E{D_θ^r[ĝ_{yv}[v(t; θ); θ]]} = D_θ^r E{ĝ_{yv}[v(t; θ); θ]}^{10}
        = Pr(y) D_θ^r ∫_{−∞}^{∞} K(z) g_{v|y}[v(t; θ) + h_N z; θ] dz
        = Pr(y) ∫_{−∞}^{∞} K(z) D_θ^r g_{v|y}[v(t; θ) + h_N z; θ] dz.

10 Note that in this representation we have moved the derivative operator in and out of the
expectations operator. In each case, the function being integrated is differentiable in a neighborhood of θ_0 for almost all x and satisfies an appropriate dominance condition.
Next, expand the integrand above as a function of h_N and terminate the expansion at the h_N² term.
From the symmetry of the kernel, the h_N term vanishes. The result now follows from the regularity
conditions on the kernel and density derivatives.

The proof of (2.3) is also based on a Taylor series expansion in h_N, where the expansion is taken
up to the h_N⁴ term. From symmetry of the kernel, the h_N and h_N³ terms vanish. Since ∫z²K(z) dz = 0
(the bias-reducing kernel assumption), the h_N² term also vanishes. The result now follows from the
assumed regularity conditions on the kernel and density derivatives.    Q.E.D.
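For concreteness, one kernel satisfying these conditions (a standard fourth-order construction from the Gaussian density φ; this particular choice is our illustration, as the appendix does not prescribe one) is

    K(z) = (1/2)(3 − z²)φ(z),

for which ∫K(z)dz = 1, ∫zK(z)dz = ∫z³K(z)dz = 0 by symmetry, ∫z²K(z)dz = (3/2) − (3/2) = 0, and ∫z⁴K(z)dz = (9 − 15)/2 = −3 ≠ 0.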
Apart from the trimming function, Lemma 2 provides the asymptotic behavior of the components
of the (estimated) quasi-likelihood. Lemma 3 below characterizes the large-sample behavior of the
trimming function.
LEMMA 3 (Estimated vs. Known Trimming): Let ĝ_{yi}(θ_p) ≡ ĝ_{yv}[v(x_i; θ_p); θ_p], where θ_p is a
preliminary or pilot estimator for which (θ_p − θ_0) = O_p(N^{−1/3}).^{11} Define a smooth trimming function
as

    τ(t) = 1/[1 + e^{z(t)}],    z(t) ≡ (h_N^b − t)/h_N^c,

where 0 < b < c < 1/3 and N^{−1/6} < h_N < N^{−1/8}. Let τ̂_{iy} ≡ τ[ĝ_{yi}(θ_p); ε] and τ_{iy} ≡ τ[g_{yi}; ε],
g_{yi} ≡ g_{yv}(v_i; θ_0). Then, τ̂_{iy} has the following representation:

(a)    τ̂_{iy} = τ_{iy} + h_N^{−c}(ĝ_{yi}(θ_0) − g_{yi})τ_{iy}(1 − τ_{iy}) + h_N^{−2c}(ĝ_{yi}(θ_0) − g_{yi})²C_{iy} + o_p(N^{−1/2}),

where C_{iy} = O(1), uniformly in i, y, and where the remainder term is o_p(N^{−1/2}) uniformly in i. With
ĝ_{yi}(θ) ≡ ĝ_{yv}(v(x_i; θ); θ) and g_{yi}(θ) ≡ g_{yv}(v(x_i; θ); θ),

(b)    D_θ^r[τ̂_{iy}(θ)] = D_θ^r[τ_{iy}(θ)] + h_N^{−c} D_θ^r[(ĝ_{yi}(θ) − g_{yi}(θ))τ_{iy}(θ)(1 − τ_{iy}(θ))] + o_p(1),

where the order of the remainder term is uniform in i, y, and θ.

PROOF OF LEMMA 3: For (a), the proof follows by first expanding τ̂_{iy} as a function of ĝ_{yi}
about g_{yi}, and then expanding ĝ_{yi} as a function of θ about θ_0. For (b), the proof follows an
analogous argument.    Q.E.D.
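In computations, the smooth trimming function of Lemma 3 can be evaluated directly; a minimal sketch follows (using the numerically stable logistic from scipy; the parameter names are ours):

    import numpy as np
    from scipy.special import expit

    def smooth_trim(g_hat, h, b, c):
        # tau(t) = 1/[1 + exp(z(t))], z(t) = (h**b - t)/h**c, 0 < b < c < 1/3:
        # the weight is near one where the density estimate exceeds h**b and
        # decays smoothly to zero below that level.
        z = (h**b - np.asarray(g_hat)) / h**c
        return expit(-z)                          # equals 1/(1 + e**z)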
Employing Lemmas 1-3, we can now determine convergence rates for estimated probabilities
and their derivatives. In obtaining these results, it is convenient to do so in terms of adjusted or
trimmed probability functions. With 0 < ε′ < 1, v ≡ v(x; θ), and ĝ_{yv} ≡ ĝ_{yv}(v; θ), let adjustment
factors, which satisfy the restrictions in (22) of subsection 3.1, be given as

(A3)    δ̂_{yN} ≡ h_N^{ε′}[e^z/(1 + e^z)],    z ≡ (h_N^{ε′/5} − ĝ_{yv})/h_N^{ε′/4},    and    δ̂_N ≡ δ̂_{0N} + δ̂_{1N}.

With g_{yv}(v; θ) replacing ĝ_{yv}(v; θ), define δ_{yN}(v; θ) and δ_N(v; θ) analogously. Then, with
ĝ_v(v; θ) ≡ ĝ_{0v}(v; θ) + ĝ_{1v}(v; θ) as the estimated marginal density for v, define the following probability
functions:

(A4)    Estimate:    P̂(v; θ) ≡ [ĝ_{1v}(v; θ) + δ̂_{1N}(v; θ)]/[ĝ_v(v; θ) + δ̂_N(v; θ)];

(A5)    Trimmed:    P_N(v; θ) ≡ [g_{1v}(v; θ) + δ_{1N}(v; θ)]/[g_v(v; θ) + δ_N(v; θ)];

(A6)    True:       P(v; θ) ≡ g_{1v}(v; θ)/g_v(v; θ).

Lemma 4 now obtains results in terms of these three functions.
11 This assumption can be relaxed at the expositional expense of incorporating higher order terms
into the series representation of Lemma 4. Manski's (1975) maximum score estimator, Horowitz's
(1992) smooth version of this estimator, and Han's rank estimator (see Sherman (1992)) all converge
to the truth at least as fast as O_p(N^{−1/3}).
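A sketch of the adjusted estimate (A4), with the adjustment factors built as in (A3) (again using scipy's logistic; eps plays the role of ε′, and all names are ours):

    import numpy as np
    from scipy.special import expit

    def adjusted_prob(g1_hat, g0_hat, h, eps):
        # delta_y = h**eps * e**z/(1 + e**z), z = (h**(eps/5) - g_y)/h**(eps/4):
        # negligible where the density estimate is large, of order h**eps where small.
        def delta(g_hat):
            z = (h**(eps / 5.0) - np.asarray(g_hat)) / h**(eps / 4.0)
            return h**eps * expit(z)
        d1, d0 = delta(g1_hat), delta(g0_hat)
        # (A4): the marginal density estimate is g0 + g1, so the denominator
        # stays bounded away from zero even where both densities are small.
        return (g1_hat + d1) / (g0_hat + g1_hat + d0 + d1)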
LEMMA 4 (Convergence of Estimated Probabilities and their Derivatives): Let ε′, 0 < ε′ < 1, be
the adjustment parameter in (A3) employed to control probability functions in (A4)-(A5). Then, for
N^{−1/(6+2ε′)} < h_N < N^{−1/8}, and under the conditions of Lemmas 1-2, uniformly in v ∈ V and θ ∈ Θ:

(4.1a)    h_N^{ε′}|D_θ^r{P̂[v(t; θ); θ]} − D_θ^r{P_N[v(t; θ); θ]}| = O_p(N^{−1/2} h^{−(r+1)}),    r = 0, 1, 2;

(4.1b)    h_N^{ε′}|P̂[v(t; θ); θ]/P_N[v(t; θ); θ] − 1| = O_p(N^{−1/2} h^{−1}).

Let V_N ≡ {v: g_{0v}(v; θ) > h_N^ε, g_{1v}(v; θ) > h_N^ε}, where, with ε′ as given in (A3), 0 < ε < (ε′/5). Note
that under this condition, the probability adjustment factors, δ_{0N} and δ_{1N}, vanish exponentially. Then,
under the above conditions, uniformly on V_N and θ ∈ Θ:

(4.2a)    h_N^ε|D_θ^r{P̂[v(t; θ); θ]} − D_θ^r{P[v(t; θ); θ]}| = O_p(N^{−1/2} h^{−(r+1)}),    r = 0, 1, 2;

(4.2b)    h_N^ε|P̂[v(t; θ); θ]/P[v(t; θ); θ] − 1| = O_p(N^{−1/2} h^{−1}).
PROOF OF LEMMA 4: Convergence rates for estimated probability functions and their derivatives
are similar to those for estimated densities and their derivatives. The only difference is that for the
former, rates must be decreased by the rate at which the relevant denominators tend to zero. We
illustrate the argument for r = 2 in (4.1a); other results are proved analogously.^{12} A typical term
from D_θ²{P̂[v(t; θ); θ]} − D_θ²{P_N[v(t; θ); θ]} is given by^{13}

    T ≡ [D_θ²{ĝ_{1v} + δ̂_{1N}}]/[ĝ_v + δ̂_N] − [D_θ²{g_{1v} + δ_{1N}}]/[g_v + δ_N].

With Δ_N ≡ [(ĝ_v + δ̂_N) − (g_v + δ_N)]/[g_v + δ_N], from the geometric expansion of an inverse,

    1/(ĝ_v + δ̂_N) = [1/(g_v + δ_N)][1 + Δ_N]^{−1} = [1/(g_v + δ_N)][1 − Δ_N + Δ_N²(1 + Δ_N)^{−1}].

From Lemmas 2-3 and the fact that (g_v + δ_N) is bounded below by a term of order h_N^{ε′},
substituting this expansion into T yields

    T = [D_θ²{(ĝ_{1v} + δ̂_{1N}) − (g_{1v} + δ_{1N})}]/[g_v + δ_N] plus a remainder of smaller order in probability.

The result now follows from the convergence rates for the derivatives in Lemma 2 and the bound on
g_v + δ_N.    Q.E.D.
12 In the argument for probability functions, it is convenient to first write

    P̂ − P_N = [[(ĝ_{1v} + δ̂_{1N}) − (g_{1v} + δ_{1N})] + P_N[(g_v + δ_N) − (ĝ_v + δ̂_N)]]/[ĝ_v + δ̂_N].

For reciprocals, with Δ_N ≡ (P̂ − P_N)/P_N, write

    1/P̂ = (1/P_N)[1 + Δ_N]^{−1} = (1/P_N)[1 − Δ_N + Δ_N²(1 + Δ_N)^{−1}].

If P is bounded from below by a positive constant, then P_N is also bounded from below and the convergence
rate of 1/P̂ to 1/P_N will be the same as that for P̂ to P_N. If P is not bounded away from zero, the
convergence rate will be slowed by an additional factor of order h_N^{ε′}.

13 There will be other terms with squared densities in the denominator. For example, one such
typical term is given by

    [D_θ¹(ĝ_{1v} + δ̂_{1N})]²/[ĝ_v + δ̂_N]² − [D_θ¹(g_{1v} + δ_{1N})]²/[g_v + δ_N]².

The convergence rate increases when first rather than second derivatives appear in the numerator,
and this increase more than compensates for the additional density term in the denominator.

From Lemmas 3-4, consistency now follows by showing that the quasi-likelihood converges
uniformly in θ to a function uniquely maximized at θ_0. In obtaining this result, we make all
assumptions in Lemmas 3-4. In particular, with ε′, 0 < ε′ < 1, as the probability trimming parameter, the window, h_N, is selected such that N^{−1/(6+2ε′)} < h_N < N^{−1/8}.

PROOF OF THEOREM 3 (Consistency): Let the estimated probabilities, P̂_i, be defined in Lemma 4
and the trimming functions, τ̂_{iy} and τ_{iy}, as in Lemma 3. Let τ̂_i ≡ τ̂_{i0}τ̂_{i1} and τ_i ≡ τ_{i0}τ_{i1}. Then, the
quasi-likelihood is given as

    Q̂(θ) ≡ Σ_i (τ̂_i/2){y_i ln[P̂_i(θ)²] + (1 − y_i) ln[(1 − P̂_i(θ))²]}/N ≡ Q̂_1(θ) + Q̂_0(θ),

where Q̂_1(θ) collects the y_i-terms and Q̂_0(θ) the (1 − y_i)-terms.
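Before turning to the argument, a sketch assembling this quasi-likelihood from the pieces above (the linear index, Gaussian kernel, and all tuning constants are illustrative; since P̂ > 0, (1/2)ln[P̂²] = ln[P̂]):

    import numpy as np

    def quasi_loglik(theta, x, y, h, eps, b, c):
        # Reuses loo_kernel_density, gauss, adjusted_prob, and smooth_trim
        # from the sketches above.
        v = x @ theta
        g1 = loo_kernel_density(v, y, 1, h, gauss)
        g0 = loo_kernel_density(v, y, 0, h, gauss)
        p = adjusted_prob(g1, g0, h, eps)
        tau = smooth_trim(g0, h, b, c) * smooth_trim(g1, h, b, c)  # tau_i = tau_i0 tau_i1
        return np.mean(tau * (y * np.log(p) + (1 - y) * np.log(1 - p)))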
Viewing Q̂_1(θ) as a function of P̂_i(θ), from a Taylor series expansion about P_N(v_i; θ) as given in
(A5) and from Lemmas 2-4,

    sup_θ |Q̂_1(θ) − Q_{1N}(θ)| →p 0,    Q_{1N}(θ) ≡ Σ τ_i y_i ln[P_N(v_i; θ)]/N.

In obtaining this result, note that since P_N > 0, (1/2)ln[P_N²] = ln[P_N]. With
q_{1N}(v_i; θ) ≡ y_i ln[P_N(v_i; θ)] and b_i(θ) ≡ (1 − τ[g_{1v}(v_i; θ); ε″]) for ε″ < ε′ and τ as defined in (C.7), write

    Q_{1N}(θ) = Σ b_i(θ)τ_i q_{1N}(v_i; θ)/N + Σ [1 − b_i(θ)]τ_i q_{1N}(v_i; θ)/N.

For the b_i-terms, note that uniformly in (i, θ), sup|q_{1N}| = O(1) if P and 1 − P are uniformly
bounded away from zero.^{14} Thus,

    sup_θ |Σ b_i(θ)τ_i q_{1N}(v_i; θ)/N| ≤ sup|q_{1N}| sup_θ Σ b_i(θ)/N →p 0.

For the (1 − b_i)-terms, we may take g_v > g_{1v} > h_N^ε, for ε < ε′/5, as any terms for which this condition
does not hold will vanish exponentially. Then, from Lemma 4,

    Σ (1 − b_i)τ_i q_{1N}(v_i; θ)/N − Σ (1 − b_i)τ_i q_{1i}(θ)/N →p 0    uniformly in θ,

where q_{1i} is obtained from q_{1N} by replacing P_N(v_i; θ) with P(v_i; θ), the "true" probability function
in (A6). Then, since

    Σ (1 − b_i)τ_i q_{1i}(θ)/N − E[q_{1i}(θ)] →p 0    uniformly in θ,

    sup_θ |Q̂_1(θ) − Q_1(θ)| →p 0,    Q_1(θ) ≡ E[q_{1i}(θ)].

By an analogous argument, with Q̂_0(θ) containing the y_i = 0 terms of the quasi-likelihood and
Q_0(θ) as the corresponding terms evaluated at the true probability functions, Q̂_0(θ) converges in
probability, uniformly in θ, to Q_0(θ). By standard arguments, Q̂_1 + Q̂_0 converges in probability,
uniformly in θ, to its expectation, with maximum, θ*, satisfying for almost all x:

    P(y = 1|x) = P(y = 1|v(x; θ_0)) ≡ P(θ_0) = P(θ*),
14 For P and 1 − P uniformly bounded away from zero, we must show that P_N and 1 − P_N are
bounded away from zero. Recalling that g_v = g_{1v}/P, write P_N as

    P_N ≡ [g_{1v} + δ_{1N}]/[g_v + δ_N] = P[g_{1v} + h_N^{ε′}(1 − τ_1)]/[g_{1v} + P(δ_{1N} + δ_{0N})] ≡ P_N*.

As the above expression is an increasing function of g_{1v}, for g_{1v} > .5h_N^{ε′/5},

    P_N ≥ P[.5h_N^{ε′/5}]/[.5h_N^{ε′/5} + h_N^{ε′}(1 − τ_0) + h_N^{ε′}(1 − τ_1)] ≥ P[1/(1 + 4h_N^{4ε′/5})] ≥ P/5.

For g_{1v} < .5h_N^{ε′/5}, as P_N* is increasing in g_{1v},

    P_N ≥ P(1 − τ_1)/[(1 − τ_0) + (1 − τ_1)] ≥ P(1 − τ_1)/2.

As (1 − τ_1) converges exponentially to one in this region, the result follows. The argument for
(1 − P_N) is analogous. It should be remarked that the consistency will still hold under a more
delicate argument if P is permitted to approach zero and/or one. In this case, for the b-terms, one
must show that ln(P_N) behaves like O(ln h_N), which is dominated by the rate at which the average of
the b_i-terms tends to zero. For the (1 − b)-terms, a tail assumption on P is required.
where the first equality holds because v(x_i; θ_0) aggregates information and the second equality
characterizes a maximum. From the identification assumption, θ* = θ_0; consistency then follows.
Q.E.D.
To prove asymptotic normality, let Ĥ(θ) and Ĝ(θ) denote the Hessian and gradient, respectively,
for the quasi-likelihood of Theorem 3. From a standard Taylor series expansion for the gradient:

(A7)    N^{1/2}(θ̂ − θ_0) = −[Ĥ(θ⁺)]^{−1} N^{1/2} Ĝ(θ_0),

where θ⁺ ∈ [θ̂, θ_0]. Asymptotic normality will follow from a convergence result for the Hessian
(Lemma 5) and normality for the normalized gradient (Lemma 6).
LEMMA 5 (Convergence of the Hessian): With ε′ as the probability adjustment parameter in (A3),
assume N^{−1/(6+2ε′)} < h_N < N^{−1/8}. Then, under the conditions of Lemmas 3-4 and with θ⁺ as any
value for θ such that θ⁺ ∈ [θ̂, θ_0],

    Ĥ(θ⁺) →p H(θ_0) ≡ −E{τ[∂P/∂θ][∂P/∂θ]′/[P(1 − P)]}|_{θ=θ_0}.
PROOF OF LEMMA 5: As in the consistency argument, from Lemmas 2-4 we may replace (in
probability, uniformly in θ) the estimated probability function and its derivatives with P_N(v_i; θ) in
(A5) and its derivatives. Following the consistency argument, we may also replace the likelihood
trimming function, τ̂_i, with the corresponding true function, τ_i. Denote H_N(P_N, θ) as the resulting
Hessian matrix. We will show that H_N(P_N, θ) converges uniformly in θ to its limiting expectation
(Lemma 1), which we will show is finite. With H(θ) denoting this limiting expectation, it will then
follow that

    H_N(P_N, θ⁺) →p H(θ_0).

At θ_0, τ_i keeps densities sufficiently far from zero to insure that H(θ_0) has the form required by the
lemma.

Proceeding, with v_i ≡ v(x_i; θ) and W_{iN} ≡ {P_N(v_i; θ)[1 − P_N(v_i; θ)]}^{−1},

    H_N(P_N, θ) = A + B,
    A ≡ −Σ τ_i W_{iN}[D_θ P_N(v_i; θ)][D_θ P_N(v_i; θ)]′/N,
    B ≡ Σ τ_i[y_i − P_N(v_i; θ)]D_θ{W_{iN} D_θ P_N(v_i; θ)}/N.

For the A-term, since [D_θ P_N(v_i; θ)][D_θ P_N(v_i; θ)]′ is O(h_N^{−2}), one can show from Lemma 1 that
A(θ) − E[A(θ)] → 0 uniformly in θ, where E[A(θ)] is finite.^{15} Therefore, since θ̂ is consistent, the
15 To demonstrate the argument, it suffices to consider one of the terms comprising this
expectation. In so doing, to simplify the notation, we will consider the scalar case. Proceeding with
one of these terms,

    E[([∂g_{1v}(v)/∂θ + ∂δ_{1N}(v; θ)/∂θ]/[g_v(v; θ) + δ_N(v; θ)])²]
        ≤ 2E[([∂g_{1v}(v)/∂θ]² + [∂δ_{1N}(v; θ)/∂θ]²)/(g_v(v; θ) + δ_N(v; θ))²].

For the second term, ∂δ_{1N}/∂θ = −h_N^{(ε′−ε′/4)}τ(1 − τ)[∂g_{1v}/∂θ], which vanishes exponentially unless g_{1v} =
O(h_N^{ε′/5}), in which case [∂δ_{1N}(v; θ)/∂θ]²/[g_v(v; θ) + δ_N(v; θ)]² is o(1). An upper bound for the
first term above is

    ∫ E([∂g_{1v}(v; θ)/∂θ]² | v)/g_v(v; θ) dv.

As shown in (14) of Section 2, [∂g_{1v}(v; θ)/∂θ]² = O([∂g_{1v}(v; θ)/∂v]²). Since g_{1v} has four bounded
derivatives with respect to v, when g_{1v} tends to zero, then so must these derivatives. One can show
that the squared first derivative converges to zero faster than does the density itself (see e.g. the
example in (15), Section 2). The claim now follows.
uniform convergence arguments above insure that for θ⁺ ∈ [θ̂, θ_0],

    A(θ⁺) →p lim E[A(θ_0)] = −E{τ[∂P/∂θ][∂P/∂θ]′/[P(1 − P)]}|_{θ=θ_0},

where the latter claim follows at θ = θ_0 because τ_i converges exponentially to zero unless both
conditional densities are appropriately bounded away from zero.

Employing a similar argument, the B-term also converges uniformly in θ to its expectation, which
at θ_0 is zero.    Q.E.D.
To analyze the gradient, write it as a weighted sum of residuals:

(A8)    Ĝ(θ_0) ≡ Σ τ̂_i r̂_i ŵ_i/N,    r̂_i ≡ [y_i − P̂_i]/â_i,    â_i ≡ ĝ_v(v_i; θ_0)[P̂_i(1 − P̂_i)],
        ŵ_i ≡ ĝ_v(v_i; θ_0)[∂P̂_i/∂θ]|_{θ=θ_0}.

Under this decomposition, with a_{iN} ≡ [g_v(v_i; θ_0) + δ_N(v_i; θ_0)][P_i(1 − P_i)], r_{iN} ≡ [y_i − P_i]/a_{iN}, and
w_i denoting the true functions to which â_i, r̂_i, and ŵ_i "tend" (e.g. |â_i − a_{iN}| →p 0), and
with v_{i0} ≡ v(x_i; θ_0):

(A9)    E(r_{iN}|x_i) = E(r_{iN}|v_{i0}) = E(w_i|v_{i0}) = 0.
Employing (A7)-(A8), Lemma 6 shows that all estimated functions defining the gradient in (A7)
may be replaced by their limits:

LEMMA 6: With ε″ as the likelihood trimming parameter and ε′ as the probability trimming
parameter, assume that 0 < ε″ < ε′ < 1. Select the window, h_N, such that N^{−1/(6+2ε′)} < h_N <
N^{−1/8}. Then, under the assumptions of Lemmas 3-4,
    N^{−1/2} Σ τ̂_i r̂_i ŵ_i − N^{−1/2} Σ τ_i r_{iN} w_i = o_p(1).

PROOF OF LEMMA 6: With μ̂_{iN} ≡ r̂_i ŵ_i and μ_{iN} ≡ r_{iN} w_i, write N^{1/2}Ĝ(θ_0) − N^{1/2}G_N(θ_0) as

(A10)    N^{1/2}Ĝ − N^{1/2}G_N = N^{−1/2} Σ τ_i(μ̂_{iN} − μ_{iN})    [A]
             + N^{−1/2} Σ (τ̂_i − τ_i)μ_{iN}    [B]
             + N^{−1/2} Σ (τ̂_i − τ_i)(μ̂_{iN} − μ_{iN})    [C].

In analyzing these terms, it is important to note that τ̂_i essentially restricts all densities to be no
smaller than O(h_N^{ε″/5}), 0 < ε″ < ε′ < 1. As defined in Lemma 4, we denote V_N as the set of v(x_i; θ_0)
whose densities are at least this large. Proceeding with A,

    A = N^{−1/2} Σ τ_i(r̂_i − r_{iN})w_i    [A1]
      + N^{−1/2} Σ τ_i(r̂_i − r_{iN})(ŵ_i − w_i)    [A2]
      + N^{−1/2} Σ τ_i r_{iN}(ŵ_i − w_i)    [A3].

For A1, expand r̂_i in a Taylor series about a_{iN}:

    τ_i r̂_i = τ_i r_i* + o_p(N^{−1/2}),    r_i* ≡ [(y_i − P̂_i)/a_{iN}][2 − â_i/a_{iN}],

where the remainder term above uniformly has the stated order. Letting A_1* be obtained from A_1
by replacing r̂_i with r_i*, it suffices to show that A_1* converges to zero in mean-square:

    E([A_1*]²) = E[Σ τ_i²w_i²(r_i* − r_{iN})²/N] + E[Σ_{i≠j} (r_i* − r_{iN})(r_j* − r_{jN})τ_iτ_jw_iw_j]/N.

From Lemma 4 and a uniform integrability condition,^{16} the first component above converges to
zero. For the second component, letting X ≡ {x_k, k = 1, ..., N} and V ≡ {v(x_k; θ_0), k = 1, ..., N},

    E[(r_i* − r_{iN})(r_j* − r_{jN})τ_iτ_jw_iw_j] = E(E[(r_i* − r_{iN})(r_j* − r_{jN})τ_iτ_j | X]w_iw_j)
        = E(E[(r_i* − r_{iN})(r_j* − r_{jN})τ_iτ_j | V]E[w_iw_j | V]) = 0.^{17}

For A2, since τ̂_i essentially restricts v_i to V_N and N^{−1/(6+2ε′)} < h_N < N^{−1/8}, 0 < ε′ < 1, from
Lemma 4,

    |A2| ≤ N^{1/2} sup_{V_N} |τ_i(r̂_i − r_{iN})| sup_{V_N} |ŵ_i − w_i| = o_p(1).
Turning finally to A3, the argument parallels that for A1:

    E[(A3)²] = E[Σ τ_i²r_{iN}²(ŵ_i − w_i)²]/N + E[Σ_{i≠j} r_{iN}r_{jN}τ_iτ_j(ŵ_i − w_i)(ŵ_j − w_j)]/N.

From the uniform integrability condition employed to analyze A1, as the first term is o_p(1), it also
converges to zero in expectation. For the second term, recall that ŵ_k does not depend on y_k and
that w_k depends only on X ≡ {x_i, i = 1, ..., N}. Therefore, taking an iterated expectation in which
we first condition on X, the second term above simplifies to

    E[Σ_{i≠j} r_{iN}r_{jN}τ_iτ_jŵ_iŵ_j]/N = E[Σ_{i≠j} r_{iN}r_{jN}τ_iτ_j[ŵ_{i[j]} + w̃_i][ŵ_{j[i]} + w̃_j]]/N,

where ŵ_{i[j]} is the component of ŵ_i containing all terms that depend neither on y_i nor on y_j, and
ŵ_{j[i]} is defined analogously. Then, since neither of these two components contributes to the above
expectation and since w̃_i and w̃_j are each O[(Nh_N²)^{−1}], the expectation of the cross-product terms
converges to zero as required.

Turning to the term B in (A10) above, under Lemmas 2-3 and the same mean-square
convergence argument used to analyze A3, B = o_p(1). For the last term in (A10), C, replace τ̂_i with
the representation shown in Lemma 3. Since each term in this representation contains τ_i, in every
term v_i will be restricted to V_N. Every term can then be written as a weighted sum of trimmed
residuals. Accordingly, we may employ the same argument used to show that A = o_p(1) to show that
C = o_p(1).    Q.E.D.
From Lemma 6, the gradient is equivalent to a random variable to which a standard central limit
theorem applies. Therefore, we have the following theorem.
THEOREM 4 (Asymptotic Normality): Under Lemmas 5-6 above, N^{1/2}(θ̂ − θ_0) is asymptotically distributed as N(0, Σ), where

    Σ ≡ (E{[∂P/∂θ][∂P/∂θ]′/[P(1 − P)]}|_{θ=θ_0})^{−1}.
16 From Chung (1974, Thm. 4.5.2), if [Z_N − Z]² →p 0 and sup_N E[|Z_N − Z|²] = O(1), then
lim E[(Z_N − Z)²] = 0.

17 For example, with independence over observations and E(r_{iN}|X) = 0,

    E[r_{iN}r_{jN}τ_iτ_jw_iw_j] = E[E(r_{iN}|X)E(r_{jN}|X)τ_iτ_jw_iw_j] = 0.
PROOF OF THEOREM 4: From Lemmas 5-6,

    N^{1/2}(θ̂ − θ_0) = −[H(θ_0)]^{−1}N^{−1/2} Σ τ_i r_{iN}w_i + o_p(1).

From the definitions of r_{iN} and w_i and Serfling (1980, Corollary, p. 32),

    N^{−1/2} Σ τ_i r_{iN}w_i − N^{−1/2} Σ r_{iN}w_i →p 0.

The theorem now follows because N^{−1/2} Σ r_{iN}w_i is asymptotically distributed as

    N(0, E{[∂P/∂θ][∂P/∂θ]′/[P(1 − P)]}|_{θ=θ_0}).    Q.E.D.
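In applications, the covariance matrix of Theorem 4 can be estimated by its sample analogue; a sketch follows (our construction, not one given in the text, with dp_dtheta an N x K array of estimated derivatives of P̂, e.g. obtained by numerical differences of (A4)):

    import numpy as np

    def asymptotic_cov(p_hat, dp_dtheta, tau):
        # Sigma_hat = [ (1/N) sum_i tau_i (dP_i)(dP_i)' / (P_i (1 - P_i)) ]^{-1}.
        w = tau / (p_hat * (1.0 - p_hat))
        info = (dp_dtheta * w[:, None]).T @ dp_dtheta / len(p_hat)
        return np.linalg.inv(info)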
APPENDIX B: LOCAL SMOOTHING
As the arguments under local smoothing are quite similar to those of Appendix A, here we will
briefly outline the argument. From (C8.b), recall that the kernel estimate of g_{yv}(v_i; θ) under local
smoothing is

(B1)    ĝ_{yv}(v_i; θ, λ̂_y; h_N) ≡ Σ_{j≠i} 1{y_j = y} K[(v_i − v_j)/(h_N λ̂_{yj})]/[h_N λ̂_{yj}(N − 1)].
Abstracting from scaling considerations (see (C8.b)),

(B2)    λ̂_{yj} = Λ_N[ĝ_{yv}(v_j; θ, h_{Np})],

where Λ_N is a function that depends on ĝ_{yv}, a pilot or first-stage kernel density estimate of g_{yv}
with pilot window h_{Np}. The function Λ_N also depends on N through a trimming parameter that
controls the rate at which λ̂_{yj} can approach zero (see (C8.b)).
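A sketch of (B1)-(B2), taking Λ_N to follow Abramson's (1982) square-root law (one admissible choice, not the only one) with a floor that keeps λ̂_{yj} away from zero:

    import numpy as np

    def adaptive_kernel_density(v, y, which_y, h, h_pilot, lam_floor):
        # First stage: fixed-window pilot estimate (reusing the sketch of (A1)).
        pilot = np.maximum(loo_kernel_density(v, y, which_y, h_pilot, gauss), 1e-12)
        gbar = np.exp(np.mean(np.log(pilot)))               # geometric mean of the pilot
        lam = np.maximum(np.sqrt(gbar / pilot), lam_floor)  # square-root law, trimmed below
        # Second stage: observation-specific windows h * lam_j as in (B1).
        bw = h * lam
        u = (v[:, None] - v[None, :]) / bw[None, :]
        w = gauss(u) / bw[None, :] * (y[None, :] == which_y)
        np.fill_diagonal(w, 0.0)
        return w.sum(axis=1) / (len(v) - 1)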
Before proceeding, it should be noted that bias calculations will depend on estimated pilot
density derivatives. To deal with this bias, it is helpful to set wider pilot windows (hNp) than second
stage windows (hN). In so doing, the proof simplifies because it becomes possible to take the
estimated local smoothing parameters as given (see e.g. Klein (1991)).
In Appendix A, we obtained convergence results in Lemma 2 for estimated densities and
derivatives under bias reducing kernels. Once we have obtained analogous results under locally-smoothed kernels, the proofs of consistency and normality will be very similar to those in Appendix
A. We begin by providing these results for pilot densities (the local smoothing parameters).
LEMMA 7 (Pilot Estimates): In the notation of (B1)-(B2), for the pilot density estimates,

(a)    sup_{v,θ} |D_θ^r[ĝ_{yv}[v(t; θ); θ, h_{Np}]] − E{D_θ^r[ĝ_{yv}[v(t; θ); θ, h_{Np}]]}| = O_p(N^{−1/2}h_{Np}^{−(r+1)});

(b)    sup_{v,θ} |E{D_θ^r[ĝ_{yv}[v(t; θ); θ, h_{Np}]]} − D_θ^r[g_{yv}[v(t; θ); θ]]| = O_p(h_{Np}²).

PROOF OF LEMMA 7: From Lemma 1 of Appendix A, (a) holds. The proof of (b) follows along the
same lines as the proof of Lemma 2 in Appendix A.    Q.E.D.
Employing Lemma 7, we can now obtain results analogous to Lemma 2 of Appendix A. To this
end, from (B2) it is convenient to define an "average" local smoothing parameter:

(B3)    λ̄_y ≡ Λ_N[E(ĝ_{yv}(v_j; θ, h_{Np}))].

From (B1), denote ĝ_{yv}(v; θ, λ̄_y; h_N) as the kernel estimate obtained by replacing λ̂_{yj} with λ̄_{yj}.
From a standard Taylor series argument and Lemma 7, one can establish convergence rates
(uniform in v and θ) to zero for

(B4)    |D_θ^r(ĝ_{yv}(v; θ, λ̂_y; h_N)) − D_θ^r(ĝ_{yv}(v; θ, λ̄_y; h_N))|.
From Lemma 1, we can now obtain a uniform convergence rate to zero for the latter term in (B4) to
its expectation that is analogous to that in Lemma 2 of Appendix A. Proceeding as in Lemma 2,
from a Taylor series expansion in h_N about zero, we obtain a convergence rate for the expectation
of this term to the truth. When the local smoothing parameters are known and bounded away from
zero, one can show that the expected density estimate converges to the truth at a rate (see
Silverman (1986)) of h_N⁴. When local smoothing parameters are not bounded away from zero,
uniform convergence rates are decreased by the proximity of the density to zero (the bias depends
on the reciprocal of the density raised to a power). Under the trimming in (C.8b), one can show that
the convergence rate exceeds N^{−1/2}h_N^{−1} with estimated local smoothing parameters. This rate
suffices for our purposes (see Lemma 6, the analysis of term A).
One can now prove convergence results for estimated probabilities and derivatives that are
analogous to Lemma 4 of Appendix A. Employing Lemma 3 of Appendix A, which characterizes the
large-sample behavior of the trimming functions, τ̂_i, we can then establish consistency and asymptotic normality in essentially the same manner as in Appendix A.
REFERENCES
ABRAMSON, I. S. (1982): "On Bandwidth Variation in Kernel Estimates-A Square Root Law," The Annals of Statistics, 10, 1217-1223.
BEGUN, J., W. HALL, W. HUANG, AND J. WELLNER (1983): "Information and Asymptotic Efficiency in Parametric-Nonparametric Models," The Annals of Statistics, 11, 432-452.
BHATTACHARYA, P. K. (1967): "Estimation of a Probability Density Function and its Derivatives," The Indian Journal of Statistics: Series A, 373-383.
CHAMBERLAIN, G. (1986): "Asymptotic Efficiency in Semiparametric Models with Censoring," Journal of Econometrics, 32, 189-218.
CHUNG, K. L. (1974): A Course in Probability Theory. New York: Academic Press.
COSSLETT, S. R. (1983): "Distribution-Free Maximum Likelihood Estimation of the Binary Choice Model," Econometrica, 51, 765-782.
COSSLETT, S. R. (1987): "Efficiency Bounds for Distribution-Free Estimators of the Binary Choice and Censored Regression Models," Econometrica, 55, 559-586.
FIX, E., AND J. L. HODGES (1951): "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," Technical Report #4, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
HAN, A. K. (1984): "The Maximum Rank Correlation Estimator in Classical, Censored, and Discrete Regression Models," Technical Report No. 412, Institute for Mathematical Studies in the Social Sciences, Stanford University, Stanford, CA.
HAN, A. K. (1988): "Large Sample Properties of the Maximum Rank Correlation Estimator in Generalized Regression Models," Working Paper, Harvard University.
HOEFFDING, W. (1963): "Probability Inequalities for Sums of Bounded Random Variables," Journal of the American Statistical Association, 58, 13-30.
HOROWITZ, J. L. (1992): "A Smoothed Maximum Score Estimator for the Binary Response Model," Econometrica, 60, 505-531.
ICHIMURA, H. (1986): "Estimation of Single Index Models," Unpublished manuscript.
KLEIN, R. W. (1991): "Density Estimation with Estimated Local Smoothing Parameters," Technical Memo #TM-ARH-07103, Bellcore.
LEE, LUNG-FEI (1989): "Semiparametric Maximum Profile Likelihood Estimation of Polytomous and Sequential Choice Models," University of Minnesota, Discussion Paper No. 253.
MANSKI, C. F. (1975): "The Maximum Score Estimator of the Stochastic Utility Model of Choice," Journal of Econometrics, 3, 205-228.
MANSKI, C. F. (1985): "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator," Journal of Econometrics, 27, 313-333.
MANSKI, C. F. (1988): "Identification of Binary Response Models," Journal of the American Statistical Association, 83, 729-738.
NEWEY, W. K. (1989): "The Asymptotic Variance of Semiparametric Estimators," Princeton University, Econometric Research Program Memo. No. 346.
NEWEY, W. K. (1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics, 5, 99-135.
POWELL, J., J. STOCK, AND T. M. STOKER (1989): "Semiparametric Estimation of Weighted Average Derivatives," Econometrica, 57, 1403-1430.
RUUD, P. A. (1983): "Sufficient Conditions for the Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution in Multinomial Discrete Choice Models," Econometrica, 51, 225-228.
RUUD, P. A. (1986): "Consistent Estimation of Limited Dependent Variable Models Despite Misspecification of Distribution," Journal of Econometrics, 32, 157-187.
SERFLING, R. S. (1980): Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.
SHERMAN, R. P. (1993): "The Limiting Distribution of the Maximum Rank Correlation Estimator," Econometrica, 61, 123-137.
SILVERMAN, B. W. (1986): Density Estimation. New York: Chapman and Hall.
SILVERMAN, B. W., AND M. C. JONES (1989): "E. Fix and J. L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation," International Statistical Review, 57, 233-247.
STOKER, T. M. (1986): "Consistent Estimation of Scaled Coefficients," Econometrica, 54, 1461-1481.
WHITE, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1-25.