An Efficient Semiparametric Estimator for Binary Response Models
Author(s): Roger W. Klein and Richard H. Spady
Source: Econometrica, Vol. 61, No. 2 (March, 1993), pp. 387-421
Published by: The Econometric Society
Stable URL: http://www.jstor.org/stable/2951556

AN EFFICIENT SEMIPARAMETRIC ESTIMATOR FOR BINARY RESPONSE MODELS

By Roger W. Klein and Richard H. Spady¹

This paper proposes an estimator for discrete choice models that makes no assumption concerning the functional form of the choice probability function, where this function can be characterized by an index. The estimator is shown to be consistent, asymptotically normally distributed, and to achieve the semiparametric efficiency bound. Monte-Carlo evidence indicates that there may be only modest efficiency losses relative to maximum likelihood estimation when the distribution of the disturbances is known, and that the small-sample behavior of the estimator in other cases is good.

KEYWORDS: Binary choice, semiparametric estimation, adaptive kernel density estimation.

1. INTRODUCTION

IN THIS PAPER WE FORMULATE a semiparametric estimator for the discrete choice model that satisfies the classical desiderata for such estimators: consistency, root-N normality, and efficiency. The estimator is semiparametric in that it makes no parametric assumption on the form of the distribution generating the disturbances. We do, however, assume that the choice probability function depends on a parametrically specified index function.

A method for computing the efficiency bound in semiparametric contexts was given by Begun, Hall, Huang, and Wellner (1983). General discussions of these bounds and their calculation can be found in Newey (1989, 1990). The actual efficiency bound for the binary choice model was given by Chamberlain (1986) and Cosslett (1987).

To simplify our exposition we formally consider throughout most of this paper only the binary choice model given by

(1) $y = 1$ if $v(x;\theta_0) > u_0$; $y = 0$ otherwise,

where $v(\cdot\,;\cdot)$ is a known function, $x$ is a vector of exogenous variables, $\theta_0$ an unknown parameter vector, and $u_0$ a random disturbance. We assume that observations $\{x_i, y_i\}$ are i.i.d.
and that the model satisfies an index restriction: $E(y\,|\,x) = E[y\,|\,v(x;\theta_0)]$, where $E$ is the indicated conditional expectation and $v(x;\theta_0)$ is an aggregator or index.

¹ Earlier versions of this paper were distributed in a session of the Econometric Society Summer Meetings, June, 1987, and presented at the European Econometric Society Meetings, September, 1988. We thank Bob Sherman for helpful discussions and Whitney Newey for numerous comments and technical assistance. In particular, Whitney Newey provided the basic proof showing the equivalence between the asymptotic variance-covariance matrix derived for the estimator here and its lower bound. We would also like to thank L. P. Hansen and anonymous referees for their comments and suggestions. Any errors are the sole responsibility of the authors.

The index restriction permits multiplicative heteroscedasticity of a general but known form and heteroscedasticity of an unknown form if it depends only on the index. We characterize such heteroscedasticity by $u = s(x;\beta_0)\varepsilon$, where $\varepsilon$ is independent of $x$. When the (positive) scaling function $s$ is a known and general function of $x$, the index restriction is satisfied in the transformed model with index $v^* \equiv v(x;\theta_0)/s(x;\beta_0)$. If $s$ is unknown, but depends on $x$ only through the index $v(x;\theta_0)$, then the index restriction will also be satisfied. In general, the index restriction will hold if the model can be written in the form shown in (1) with $u = u[v(x;\theta_0),\varepsilon]$, where $\varepsilon$ is independent of $x$. Notice that we distinguish $u = u[v(x;\theta_0),\varepsilon]$ from $u = u[v(x;\theta),\varepsilon]$. We need not make this distinction when $u$ is independent of $x$, in which case $u = u_0 = \varepsilon$.

Throughout, for notational simplicity we will define probability functions through their arguments. In this manner, for example, $\Pr[u_0 < a(x)\,|\,b(x)]$ denotes the probability of the event $u_0 < a(x)$ conditioned on $b(x)$. Formally, with $F_{u_0|x}$ as the distribution function for $u_0$ conditioned on $x$ and $E$ denoting a conditional expectation taken with respect to $x$ conditioned on $b(x)$:

(2) $\Pr[u_0 < a(x)\,|\,b(x)] \equiv E\{F_{u_0|x}[a(x)]\,|\,b(x)\}$.

To simplify notation, we will also for the most part not distinguish random variables from their realizations. Typically, the appropriate interpretation will be clear from the context.

The most common way to estimate $\theta_0$ in (1) is to apply the method of maximum likelihood. When the disturbances are independently distributed according to a known conditional parametric distribution, $F_{u|x}$, the log likelihood is

(3) $L = \sum_{i=1}^{N}\left[y_i \ln(P_i^*) + (1-y_i)\ln(1-P_i^*)\right]$,

where $P_i^*(\theta) \equiv P^*[v(x_i;\theta);\theta]$ and $P^*[v(x;\theta);\theta] \equiv \Pr[u < v(x;\theta)\,|\,v(x;\theta)] = E\{F_{u|x}[v(x;\theta)]\,|\,v(x;\theta)\}$. Typically, it is further assumed that $u$ and $x$ are independent, in which case $F_{u|x}$ may be replaced by the unconditional distribution $F_u$. When $F_u$ is unknown, one approach is to choose $F_u$ itself jointly with $\theta$ to maximize the likelihood. This approach is taken in Cosslett (1983), who shows that the resulting estimator of $\theta_0$ is consistent. In this approach, once the distribution function is replaced with its maximum likelihood estimate, the resulting concentrated likelihood is not a smooth function of $\theta$. Consequently, it is difficult to establish the asymptotic distribution of the estimator of $\theta$. To circumvent this problem, we propose to select $\theta$ so as to maximize a semiparametric likelihood that is a smooth function of $\theta$ and that locally (for $\theta$ in a neighborhood of $\theta_0$) approximates the corresponding parametric likelihood.
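To make the parametric benchmark concrete, the following minimal sketch (ours, not the authors'; it assumes a probit specification $F_u = \Phi$ and a linear index $v(x;\theta) = x\theta$, both illustrative choices) computes and maximizes the log likelihood in (3).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_loglik(theta, X, y):
    """Binomial log likelihood (3) with F_u = standard normal CDF and a
    linear index v(x; theta) = X @ theta (an illustrative assumption)."""
    p = norm.cdf(X @ theta)
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard the logs in finite samples
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def probit_mle(X, y):
    """Maximize (3) over theta by quasi-Newton search."""
    k = X.shape[1]
    res = minimize(lambda t: -probit_loglik(t, X, y), np.zeros(k), method="BFGS")
    return res.x
```

The semiparametric estimator developed below retains this likelihood shape but replaces the known $F_u$ with kernel-based estimates.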
To formulate this semiparametric likelihood, which might more appropriately be termed a quasi-likelihood, we begin by replacing the probability function $P_i^*(\theta)$ in (3) with a tractable function $P_i(\theta)$ that locally approximates it. To construct $P_i(\theta)$, note that with $C$ as the event $[u < v(x;\theta)]$, $P^*$ in (3) is equivalent to $\Pr[C\,|\,v(x;\theta)]$, the probability of the event $C$ conditioned on $v(x;\theta)$. We may characterize this probability function by

(4) $P^*[v(x;\theta);\theta] \equiv \Pr[C\,|\,v(x;\theta)] = \Pr[C]\,g_{v|C}(v;\theta)/g_v(v;\theta)$,

where $\Pr(C)$ is the unconditional probability of $C$, $g_{v|C}$ is the density for $v = v(x;\theta)$ conditioned on $C$, and $g_v$ is the unconditional density for $v$.

Let $C_0$ be the event $C$ at $\theta_0$: $[u_0 < v(x;\theta_0)]$, which is observable and is equivalent to the event $y = 1$. If we replace $C$ in (4) with $C_0$, we obtain the probability function $P(v;\theta)$. In other words, $P(v;\theta)$ is the probability of the event $C_0$ conditioned on $v = v(x;\theta)$:

(5) $P[v(x;\theta);\theta] \equiv \Pr[C_0\,|\,v(x;\theta)] = \Pr[C_0]\,g_{v|C_0}(v;\theta)/g_v(v;\theta)$,
    $P_i(\theta) \equiv P(v_i;\theta) = \Pr[y=1]\,g_{v|y=1}(v;\theta)/g_v(v;\theta)$,

where $g_{v|y=1}$ is the density for $v$ conditioned on $y = 1$.

This function has several desirable features. First, at $\theta = \theta_0$ and under the index restriction, it is equivalent to the true choice probability. Namely, with $v_0 \equiv v(x;\theta_0)$,

(6) $P(v_0;\theta_0) = \Pr[C_0\,|\,v(x;\theta_0)] = \Pr[y=1\,|\,v(x;\theta_0)] = \Pr[y=1\,|\,x]$.

Subject to a continuity condition, the function $P$ in (6) may therefore be viewed as a local approximation to the corresponding parametric probability function, $P^*$ in (3). Second, although the probability function in (5) is unknown, it can be estimated directly as a smooth function of $\theta$. We can estimate $\Pr[y=1]$ by the sample proportion of individuals making the choice in question, while both the conditional and unconditional densities can be estimated nonparametrically as smooth functions of the parameter vector $\theta$. Notice that if these densities are estimated by kernel methods (with the same window being employed for conditional and unconditional densities), then the estimate of (5) will be the usual kernel estimate of the expected value of $y$ conditioned on the index. Finally, since $v$ aggregates the information in the possibly high dimensional vector $x$ into a scalar, we need only be concerned with univariate density estimation.

Using $\hat P_i(\theta)$ to denote $P_i(\theta)$ as defined in (5) with its three components replaced by estimates, we propose to estimate $\theta_0$ by choosing $\theta$ to maximize an adjusted version of the estimated quasi-likelihood:

(7) $\hat Q(\hat P(\theta)) = \sum_{i=1}^{N}(\hat\tau_i/2)\left\{y_i \ln[\hat P_i(\theta)]^2 + (1-y_i)\ln[1-\hat P_i(\theta)]^2\right\}/N$,

where $\hat\tau_i$ is a trimming sequence that is introduced for technical reasons to control the rate at which the estimated densities comprising $\hat P_i$ tend to zero. Notice also that for both the $y_i$ and $(1-y_i)$ terms, the argument of the $\ln$ function is squared. As will become apparent below, there are two methods for estimating the densities upon which estimated probability functions depend: locally-smoothed kernels and bias-reducing kernels. For the former, estimated probability functions lie between zero and one, in which case (7) reduces to the usual form for a binomial likelihood. For the latter, estimated densities and hence estimated probabilities can be negative. This problem, which can be ignored asymptotically, may occur in any given finite sample. The estimated quasi-likelihood function in (7) remains well-defined in this case.
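The estimated probability function and quasi-likelihood can be sketched as follows (our illustration: a Gaussian kernel, a fixed window h, no trimming, and a user-supplied index function `index(x, theta)` are all assumptions; the squaring inside the logarithm mirrors (7)).

```python
import numpy as np

def gauss_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def phat(v, y, h):
    """Leave-one-out kernel estimate of P(v_i) = Pr[y=1] g_{v|y=1}(v_i)/g_v(v_i).
    With a common window h this is the Nadaraya-Watson estimate of E[y | v]."""
    v = np.asarray(v, float)
    Kmat = gauss_kernel((v[:, None] - v[None, :]) / h) / h
    np.fill_diagonal(Kmat, 0.0)             # leave own observation out
    num = Kmat @ y / (len(v) - 1)           # Pr[y=1] * ghat_{v|y=1}(v_i)
    den = Kmat.sum(axis=1) / (len(v) - 1)   # ghat_v(v_i)
    return num / den

def quasi_loglik(theta, x, y, h, index):
    """Estimated quasi-likelihood (7) with tau_i = 1 for all i; squaring
    inside the log keeps it defined even if an estimate strays below zero."""
    p = phat(index(x, theta), y, h)
    return np.mean(0.5 * (y * np.log(p**2) + (1 - y) * np.log((1 - p)**2)))
```

With a common window for the conditional and unconditional densities, `phat` reproduces the kernel regression of $y$ on the index noted in the text.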
We refer to this function as "estimated" to distinguish (7) from the quasi-likelihood proper, which is (7) with $\hat P_i(\theta)$ replaced by $P_i(\theta)$. Our strategy is essentially to show that the estimator $\hat\theta(\hat P)$, which is obtained by maximizing (7), behaves asymptotically like the estimator $\hat\theta(P)$ defined for a known probability function $P$ as

(8) $\hat\theta(P) \equiv \arg\sup_\theta Q(P) \equiv \arg\sup_\theta \sum_{i=1}^{N}\left[y_i\ln(P_i) + (1-y_i)\ln(1-P_i)\right]/N$.

The estimator $\hat\theta(P)$ can be analyzed by conventional methods to establish consistency and asymptotic normality. Perhaps not surprisingly, as a likelihood-based estimator, it is also efficient (under the appropriate regularity conditions).

The recent econometric literature includes a variety of different approaches to semiparametric estimation in this context, including the papers of Cosslett (1983), Han (1984, 1988), Horowitz (1992), Ichimura (1986), Manski (1975, 1985, 1988), Powell, Stock, and Stoker (1989), Ruud (1983, 1986), Sherman (1993), and Stoker (1986). Our estimator is similar to Ichimura's, in that both can be interpreted as minimum distance estimators. It also resembles Cosslett's in that it replaces $P^*$ in the MLE objective function (3) with a nonparametrically estimated function. However, the spirit of our approach is perhaps best illustrated by considering it in light of the nonparametric discrimination problem first posed in the classic paper of Fix and Hodges (1951).² The univariate version of that problem, in our notation, can be posed as: given a scalar random variable $v$ with samples $v[y=0]$ and $v[y=1]$ of realizations of $v$ from two populations (say, respectively, the $y=0$ and $y=1$ populations), how should a new realization $v$ of unknown origin be classified? Fix and Hodges' solution was to estimate $g_{v|y}$ nonparametrically and to assign $v$ to the $y=0$ group if $g_{v|y=0}[v]/g_{v|y=1}[v] > c$, where $c$ is predetermined. It is a short step from this setup to the realization that the densities of $v$ among the subpopulations plus a knowledge of the overall subpopulation frequencies allow, via Bayes' rule, the calculation of the probability of group membership conditional upon $v$ (see the sketch at the end of this section). Optimal classification is, with such probabilities known or estimated, a matter of one's loss function. If we select $\theta$ so as to minimize the KLIC (Kullback-Leibler information criterion) discrepancy of $P_i(\theta)$ from $P_i^*(\theta)$ over the sample $y_i$ realizations, then the resulting estimator will be equivalent to that which maximizes the quasi-likelihood function in (8).

² Reprinted recently with a commentary by Silverman and Jones (1989). This paper not only clearly formulated the nonparametric discrimination problem, but also originated two methods of nonparametric density estimation (kernel density estimation and nearest neighbors), as well as recognizing optimal smoothing as a bias-variance tradeoff.

In what follows, we begin in Section 2 by first providing an overview of the main assumptions and then discussing conditions required for identification. In Section 3 we prove consistency, and in Section 4 we establish asymptotic normality (at rate $N^{1/2}$) and efficiency. In each of these sections, for expositional purposes we outline the proofs; complete proofs are relegated to an Appendix. To illustrate the performance of the estimator in practice, in Section 5 we undertake several Monte-Carlo experiments in which the semiparametric estimator is compared to probit under several model specifications. We find that the semiparametric estimator performs quite well relative to probit and can, in models sufficiently perturbed from the usual probit specification, substantially dominate the probit estimator. In Section 6 we conclude by indicating several extensions of the proposed estimator and directions for future research.
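As a concrete rendering of the Fix-Hodges view referenced above (a sketch under our own conventions, not code from the paper), class-conditional density estimates and class frequencies combine via Bayes' rule into an estimated membership probability.

```python
import numpy as np

def gauss_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def kde(v_new, v_sample, h):
    """Plain kernel density estimate evaluated at the points v_new."""
    z = (np.asarray(v_new)[:, None] - np.asarray(v_sample)[None, :]) / h
    return gauss_kernel(z).mean(axis=1) / h

def membership_prob(v_new, v0, v1, h):
    """Bayes-rule probability that v_new comes from the y = 1 population:
    Pr[y=1|v] = Pr[y=1] g_{v|y=1}(v) /
                (Pr[y=0] g_{v|y=0}(v) + Pr[y=1] g_{v|y=1}(v))."""
    p1 = len(v1) / (len(v0) + len(v1))
    num = p1 * kde(v_new, v1, h)
    den = (1 - p1) * kde(v_new, v0, h) + num
    return num / den
```

Evaluated at $v = v(x;\theta)$, this is exactly the ratio-of-densities form of $P(v;\theta)$ in (5).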
2. ASSUMPTIONS

2.1. An Overview

We will require conditions that serve to define the model, insure that various estimated functions are sufficiently well-behaved, and establish those functions of the parameters that are identified. In stating these assumptions, we need to introduce some notation. Let $F_{u_0|x}$ be the distribution function for $u_0$ in (1) conditioned on $x$. Then, with $E$ denoting a conditional expectation taken with respect to $x$ conditioned on $v(x;\theta)$, define the probability function $P[\cdot\,;\cdot]$ as:

(9) $P[v(x;\theta);\theta] \equiv E_{x|v(x;\theta)}\{F_{u_0|x}[v(x;\theta_0)]\}$.

In the above definition, $P[\cdot\,;\cdot]$ inherits its dependence on $\theta$ from the manner in which the distribution of $x$ conditioned on $v(x;\theta)$ depends on $\theta$. In general, this distribution will depend on $v(x;\theta)$ and separately on $\theta$.³ Finally, let $D^r_z f(z)$ denote the $r$th order partial of $f(z)$ with respect to $z$ and $D^0_z f(z) = f(z)$. Then, with $c$ denoting a positive generic constant, we assume the following conditions hold:

(C.1) The data consist of a random sample $(y_i, x_i)$, $i = 1,\dots,N$. The random variable $y$ is binomial with realizations 1 and 0.

(C.2) The parameter vector $\theta$ lies in a compact parameter space, $\Theta$. The true value of $\theta$, $\theta_0$, is in the interior of $\Theta$.

³ For example, suppose that $x$ is normal with a nonzero mean vector and that $v$ is linear in the components of $x$. In this case, the distribution of $x$ conditioned on $v(x;\theta) = x\theta$ will be normal with a mean that depends on $x\theta$ and on $\theta$.

(C.3a) There exists a scalar aggregator, $a(x;\theta_0)$, such that for any $x$:
$\Pr[y=1\,|\,x] \equiv E(y\,|\,x) = E[y\,|\,a(x;\theta_0)] \equiv \Pr[y=1\,|\,a(x;\theta_0)]$.

(C.3b) With $x_1$ a continuous element of $x$, there exists a monotonic transformation, $T$, such that under $T$:
$v(x;\theta_0) \equiv T[a(x;\theta_0)] = v_1(x_1) + v_2(x_2;\theta_0)$, $|dv_1/dx_1| > c$.

Conditions (C.1)-(C.3) describe the manner in which the data are generated. Assumption (C.3a) is especially important because it allows us to reduce the dimension of the conditioning variables. As will become apparent below, assumption (C.3b) is useful in relating properties of the distribution of the index to those of the exogenous variables. This assumption will also be convenient in obtaining conditions under which identification holds. For the linear index case, as is standard in the literature, condition (C.3b) will hold under a location-scale transformation provided that the continuous variable has a nonzero coefficient. For a nonlinear example, consider the aggregator

(10) $a = [x_1^{\gamma_{10}}][x_2^{\gamma_{20}} + x_3^{\gamma_{30}}]$, $a > 0$.

Then the index $v \equiv (1/\gamma_{10})\ln(a)$ has the required form for $\gamma_{10} \neq 0$.

(C.4a) There exist $\underline P$, $\bar P$ that do not depend on $x$ such that (employing (C.3)):
$0 < \underline P \le \Pr[y=1\,|\,v(x;\theta_0)] = \Pr(y=1\,|\,x) \le \bar P < 1$.

This condition serves to bound the probability function $P[v(x;\theta);\theta]$ away from zero and one. In the notation introduced above,

(11) $\Pr[y=1\,|\,v(x;\theta_0)] = E\{F_{u_0|x}[v(x;\theta_0)]\,|\,v(x;\theta_0)\}$ and $P[v(x;\theta);\theta] = E\{F_{u_0|x}[v(x;\theta_0)]\,|\,v(x;\theta)\}$.

This restriction, which requires that $v(x;\theta_0)$ be a bounded random variable, can be relaxed, as we will be downweighting observations for which the densities comprising $P[v(x;\theta);\theta]$ are small.

(C.4b) With $H[v(x;\theta_0)] \equiv \Pr[u_0 < v(x;\theta_0)\,|\,v(x;\theta_0)]$, assume that $H(t)$ is continuously differentiable in $t$ for all $t$ and that $|dH(t)/dt| < c$.
It should be noted that this condition could be replaced by a somewhat weaker and (notationally) more complicated dominance condition. To interpret this condition, note that when $x$ and $u_0$ are independent, $H[v(x;\theta_0)] = F_{u_0}[v(x;\theta_0)]$, where $F_{u_0}$ is the distribution function for $u_0$. Consequently, the above condition holds when $u_0$ is independent of $x$ and has a bounded and continuous density. We require that it also hold when $u_0$ is dependent on $x$ under the index restriction in (C.3a)-(C.3b).

(C.5) The index $v$ is smooth in that for $\theta$ in a neighborhood of $\theta_0$ and all $t$:
$\left(|D^r_\theta v(t;\theta)|,\ |d(D^r_\theta v(t;\theta))/dx_1|\right) < c$  $(r = 0,\dots,4)$.

(C.6) With $v(x;\theta_0) = v_1(x_1) + v_2(x_2;\theta_0)$ from (C.3b), let $f_{x_1|\cdot}(x_1)$ be the density for the continuous variable $x_1$ conditioned on the remaining exogenous variables, $x_2$, and on $y$, where $x_2$ is a vector of either discrete and/or continuous random variables. This conditional density is supported on $[a,b]$ and is smooth in that for all $x$:
$|D^r_t f_{x_1|\cdot}(t)| < c$  $(r = 1,\dots,4;\ t \in [a,b])$.

Conditions (C.5)-(C.6) insure that the densities that underlie $P(v;\theta)$ are sufficiently smooth. Let $g_{x_1|\cdot}$ and $g_{v_1|\cdot}$ be the densities for $x_1$ and $v_1 \equiv v_1(x_1)$, respectively, conditioned on $x_2$ and $y$. Given the smoothness assumptions on the index, if $g_{x_1|\cdot}$ has $r$ derivatives with respect to $x_1$, $g_{v_1|\cdot}$ must also have $r$ derivatives with respect to $v_1$. It now follows that since

(12) $g_{v|\cdot}(s;\theta) = g_{v_1|\cdot}[s - v_2(x_2;\theta)]$,

and $v_2$ is smooth in $\theta$ (C.5), $g_{v|\cdot}$ must have $r$ derivatives in $\theta$. Finally, since

(13) $g_{v|y}(s;\theta) = E[g_{v|\cdot}(s;\theta)\,|\,y]$,

$g_{v|y}(s;\theta)$ must also have $r$ derivatives with respect to $\theta$. For example, from (12)-(13) with $r = 1$:

(14) $|dg_{v|y}(s;\theta)/d\theta| \le \sup_t |dv_2(t;\theta)/d\theta|\cdot\sup_s |dg_{v_1|y}(s;\theta)/ds|$.

The first term above is bounded by assumption, and the second (from the above discussion) inherits its bound from the upper bound on $|dg_{x_1|\cdot}(w)/dw|$ and the lower bound on $|dv_1/dx_1|$.

To illustrate the smoothness condition in (C.6), let the density for $x_1$ conditioned on $x_2$ and $y$ be given by

(15) $g_{x_1|\cdot}(t) \propto [a_1^2 - (t-a_2)^2]^r$,

which, with $a_2 > a_1 > 0$, has compact support on $[a_2 - a_1,\ a_2 + a_1]$ and $r$ derivatives on the entire real line. As stated earlier, this smoothness assumption is technically convenient in insuring that the index has a smooth density. As there are important cases in which this assumption does not hold, in the concluding section we will indicate how the argument can be modified. In what follows, the only problematic features of the densities for the index will be that they are permitted to approach zero at an arbitrary rate and at unknown points (that may be either on the support boundaries or in the interior of the support).

(C.7) With $h_N \to 0$, the trimming function employed to downweight observations has the form

$\tau(t;\varepsilon) = 1/\{1 + \exp[(h_N^\varepsilon - t)/h_N^{\varepsilon+1/4}]\}$,

where $\varepsilon > 0$ and $t$ is to be interpreted as a density estimate.

With $t$ above replaced by an estimated density, this function will be employed to downweight observations for which the corresponding densities are small. As $h_N \to 0$, this trimming function behaves as a smoothed indicator in that it tends exponentially to one for $t$ sufficiently far from zero and to zero for $t$ sufficiently small.
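The smooth trimming device in (C.7) is easy to visualize; the sketch below (our illustration of the functional form displayed in (C.7), with arbitrary example values of $h_N$ and $\varepsilon$) shows how it interpolates between 0 and 1 around the threshold $h_N^\varepsilon$.

```python
import numpy as np

def tau(t, h, eps):
    """Smooth trimming function of (C.7): near 1 when the density estimate t
    is well above h**eps, near 0 when it is well below; the transition width
    h**(eps + 1/4) shrinks with the window h."""
    return 1.0 / (1.0 + np.exp((h**eps - t) / h**(eps + 0.25)))

# Example: with a small window the transition around h**eps is sharp.
h, eps = 1e-3, 0.1
thr = h**eps                     # about 0.50 here
for t in [0.0, 0.5 * thr, thr, 2 * thr, 1.0]:
    print(f"t = {t:.3f}  tau = {tau(t, h, eps):.4f}")
# tau rises from ~0 below the threshold h**eps to ~1 above it.
```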
(C.8) Letting $g_{yv}(v_i;\theta) \equiv \Pr(y)\,g_{v|y}(v_i;\theta)$, define the estimator for $g_{yv}$ as

$\hat g_{yv}(v_i;\theta,\lambda_y;h_N) \equiv \frac{1}{N-1}\sum_{j\ne i} 1\{y_j = y\}\,\frac{1}{h_N\lambda_{yj}}\,K\!\left[\frac{v_i - v_j}{h_N\lambda_{yj}}\right]$  $(y = 0,1)$.

The kernel function, $K(z)$, is a symmetric function that integrates to one, has bounded second moment, and

$\left\{|D^r_z K(z)|,\ \int |D^r_z K(z)|\,dz\right\} < c$  $(r = 0,1,2,3,4)$.

The parameter $h_N$ is a nonstochastic window with $N^{-1/6} \le h_N \le N^{-1/8}$. Then also either:

(C.8a) Bias Reducing: With $\lambda_{yj} \equiv 1$, $h_N\lambda_{yj} = h_N$, $j = 1,\dots,N$. The kernel also satisfies $\int z^2 K(z)\,dz = 0$;

or

(C.8b) Adaptive with local smoothing: Windows have the form $h_N\lambda_{yj}$, $\lambda_{yj} \equiv [\hat\sigma_y(\theta)][\hat\lambda_{yj}]$. The component $\hat\sigma_y$ is the sample standard deviation of $v(x;\theta)$ conditioned on $y$. For the final component, which reflects local smoothing, let $\hat g_{yj} \equiv \hat g_{yv}(v_j;\theta,1;h_{Np})$ be a preliminary density estimate of the form shown in (C.8) with window (without local smoothing) $h_{Np}[\hat\sigma_y(\theta)]$. For $0 < \varepsilon' < 1$, from (C.7) define $\hat\tau_{pj} \equiv \tau(\hat g_{yj};\varepsilon')$. The local smoothing parameter can now be defined by

$\hat\lambda_{yj} \equiv \left[\hat g_{yj}\hat\tau_{pj}/\hat m + (1-\hat\tau_{pj})\right]^{-1/2}$,

where $\hat m$ is the geometric mean of the $\hat g_{yj}$. For notational simplicity, we will write $\hat g_{yv}(t;\theta) \equiv \hat g_{yv}(t;\theta,\lambda_y;h_N)$ to refer to either of the estimates in (C.8a) or (C.8b).

Assumptions (C.8a) and (C.8b) specify two alternative density estimators whose similar bias properties are required in the normality proof. The estimator in (C.8a) is a bias-reducing kernel that is permitted to be negative in the tails. With density derivatives uniformly bounded up to order 4, such a kernel has a uniform bias of order $h_N^4$. Assumption (C.8b) specifies an adaptive kernel estimator with a variable and data-dependent window size. This estimator differs from that in (C.8a) first by adjusting the global window size by the scale of the distribution. We have followed Silverman (1986) in this regard. Such scaling could also be introduced into the bias-reducing kernel estimator in (C.8a). Note that the density estimator in (C.8b) is also restricted to be positive. Finally, these estimators differ in that the window in (C.8a) is constant at each sample size. In contrast, the estimator in (C.8b) specifies variable windows. To select such windows, Abramson (1982) showed that the $[g_{yv}(v_j)/m]^{-1/2}$, $j = 1,\dots,N$, are optimal invariant local smoothing parameters in a pointwise mean-squared error sense when densities are bounded away from zero. Under known local smoothing and for densities bounded away from zero with uniformly bounded derivatives up to order 4, Silverman obtains a uniform bias of order $h_N^4$ for the density estimated with local smoothing. By downweighting observations for which densities are small and appropriately modifying the local smoothing parameters (to control the rate at which they can tend to zero), estimated local smoothing parameters can be taken as given. As a result, we will be able to show that under local smoothing the density estimate has a bias that is sufficiently small for our purposes. It should be remarked at this point that both Abramson (1982) and Silverman (1986) report that this two-stage procedure (the first stage being required to obtain estimated local smoothing parameters) performs well in Monte-Carlo studies. Moreover, the second-stage density estimate is reportedly not too sensitive to the first-stage estimate.

(C.9) The model is identified in that for any $x \in \mathcal{X}$, a set of positive probability,

$P[y=1\,|\,v(x;\theta_0)] = P[y=1\,|\,v(x;\theta^*)] \Rightarrow \theta_0 = \theta^*$.

The final assumption, which is made for purposes of identification, can be interpreted as either a statement about what parameter restrictions are necessary for identification or what functions of the parameters are identified. To avoid separate notation for "true" parameter values and identifiable functions of them, we will proceed under the first interpretation. In the next subsection, we provide sufficient conditions for identification and discuss several examples.
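A minimal sketch of the two-stage adaptive estimator described in (C.8b) (our illustration; the trimming of the pilot estimate and the scale adjustment are simplified relative to the paper's construction):

```python
import numpy as np

def gauss_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def kde(points, sample, h, lam=None):
    """Kernel density estimate with per-observation windows h * lam[j]."""
    lam = np.ones(len(sample)) if lam is None else lam
    z = (np.asarray(points)[:, None] - sample[None, :]) / (h * lam)
    return (gauss_kernel(z) / (h * lam)).mean(axis=1)

def adaptive_kde(points, sample, h, h_pilot):
    """Two-stage (Abramson 1982) estimator: a fixed-window pilot estimate,
    then local smoothing parameters lam_j = (ghat(v_j)/m)**(-1/2), with m the
    geometric mean of the pilot values; windows widen where data are sparse."""
    pilot = kde(sample, sample, h_pilot)
    m = np.exp(np.mean(np.log(pilot)))     # geometric mean
    lam = (pilot / m) ** (-0.5)
    return kde(points, sample, h, lam)

# Example: estimate a skewed density from 500 draws.
rng = np.random.default_rng(0)
v = rng.chisquare(3, size=500)
grid = np.linspace(0, 12, 5)
print(adaptive_kde(grid, v, h=0.4, h_pilot=0.6))
```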
2.2. Identification

Any value of $\theta$ that maximizes the limiting quasi-likelihood must satisfy the condition on probabilities in (C.9), which clearly holds if $\theta^* = \theta_0$. The identification assumption requires that this solution be unique. There are two classes of models and corresponding restrictions for which such identification holds. When the probability function, $P[y=1\,|\,v(x;\theta_0)]$, is monotonic in the index, we have the following result.

THEOREM 1 (Monotonic Models): Assume that $P[y=1\,|\,v_0]$ is strictly monotonic in $v_0 \equiv v(x;\theta_0)$ and that conditions (C.3a)-(C.3b) and (C.5)-(C.6) hold. Let $\mathcal{X}$ be a set of positive probability for $x$ on which $dP[y=1\,|\,v_0]/dv_0 \ne 0$. For $x \in \mathcal{X}$, assume that if $v_2(x_2;\theta_0) = v_2(x_2;\theta^*)$, then $\theta_0 = \theta^*$. It then follows that identification holds as specified in (C.9).

Cosslett (1983) obtained this result for the linear index model, with the theorem above requiring an appropriate extension to the argument. From monotonicity of the probability function, there exists a function $H$ for which

$P\{y=1\,|\,v(x;\theta_0)\} = P\{y=1\,|\,v(x;\theta^*)\}$,
$v_1(x_1) + v_2(x_2;\theta_0) = H[v_1(x_1) + v_2(x_2;\theta^*)]$.

Differentiating both sides of this equality with respect to $x_1$ (for $x \in \mathcal{X}$), it follows that $H$ must be the identity, in which case $v_2(x_2;\theta_0) = v_2(x_2;\theta^*)$. Identification then holds if this condition implies that $\theta_0 = \theta^*$, as claimed.

To provide an example, it is convenient to write $\theta \equiv \theta(\gamma)$ and $\theta_0 \equiv \theta(\gamma_0)$, where $\theta_0$ is an identifiable parameter vector that depends on some possibly higher dimensional vector, $\gamma_0$. Then, employing this notation, suppose that the (untransformed) index in (C.3a) is given by

(16) $a(x;\theta_0) = (x_1)^{\gamma_{10}}(x_2)^{\gamma_{20}}(x_3)^{\gamma_{30}}$, $a > 0$.

From (C.3b), with $T$ as the log transform and $x_1$ continuous, $v(x;\theta) = (1/\gamma_1)\ln[a(x;\theta)]$ for $\gamma_1 \ne 0$. Under Theorem 1, the parameter vector $\theta_0 \equiv \theta(\gamma_0) = (\gamma_{20}/\gamma_{10},\ \gamma_{30}/\gamma_{10})$ is identified (the matrix of log observations is assumed to have full rank). This result is as expected since in log form we have the usual linear index model with identification obtained via a scale restriction (there is no intercept).

A somewhat more interesting (nonlinear) example is given by

(17) $a(x;\theta_0) = (x_1)^{\gamma_{10}}\left[(x_2)^{\gamma_{20}} + (x_3)^{\gamma_{30}}\right]^{\gamma_{40}}$, $a > 0$.

As in the first example, with $\gamma_{10} \ne 0$, $v(x;\theta_0) = (1/\gamma_{10})\ln(a)$. From Theorem 1, we can now identify $\theta_0 \equiv (\gamma_{40}/\gamma_{10},\ \gamma_{20},\ \gamma_{30})$ provided that

(18) $\theta_{10}\ln\left[(x_2)^{\theta_{20}} + (x_3)^{\theta_{30}}\right] = \theta_1^*\ln\left[(x_2)^{\theta_2^*} + (x_3)^{\theta_3^*}\right]$

only holds when $\theta_0 = \theta^*$. Even if $x_2$ and $x_3$ are discrete in this example, in the monotonic index case it is still possible to identify the (appropriately scaled) parameters of the model. For instance, (18) will be uniquely solved at the true parameter values when $(x_2, x_3)$ take on the discrete values $(1,1)$, $(0,2)$, and $(2,0)$ with positive probability (see the numerical sketch below).

As an alternative model for which other identifying conditions are required, let the data be generated by:

(19) $y = 1$ if $v(x;\theta_0) > s[v(x;\theta_0)]\varepsilon$; $y = 0$ otherwise,

where $s$ is an unknown function of the index and $\varepsilon$ is independent of $x$. If the function $s$ were a known function of $x$, then with $s$ bounded away from zero, we could redefine the index as $v/s$ and apply Theorem 1. When $s$ is an unknown function of the index, the probability function need not be monotonic in $v(x;\theta_0)$ and Theorem 1 is no longer applicable. We could generalize this model further by replacing the condition under which $y$ is one above with $f[v(x;\theta_0) + u_0] > 0$, where the function $f$ need not be monotonic. For the linear index model, this case is discussed by Manski (1988).
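To see why the three support points $(1,1)$, $(0,2)$, $(2,0)$ pin down the parameters in (18), note that each point isolates one parameter: at $(1,1)$ the left-hand side equals $\theta_1\ln 2$, at $(0,2)$ it equals $\theta_1\theta_3\ln 2$, and at $(2,0)$ it equals $\theta_1\theta_2\ln 2$ (taking $\theta_2, \theta_3 > 0$ so that $0^{\theta} = 0$). The short check below (our illustration, with hypothetical parameter values) recovers $(\theta_1,\theta_2,\theta_3)$ from those three values.

```python
import numpy as np

def lhs(theta, x2, x3):
    """Left-hand side of (18): theta1 * ln(x2**theta2 + x3**theta3)."""
    t1, t2, t3 = theta
    return t1 * np.log(x2**t2 + x3**t3)

theta0 = (1.5, 0.7, 2.0)      # hypothetical "true" values, theta2, theta3 > 0
F11 = lhs(theta0, 1.0, 1.0)   # = theta1 * ln 2           -> pins down theta1
F02 = lhs(theta0, 0.0, 2.0)   # = theta1 * theta3 * ln 2  -> pins down theta3
F20 = lhs(theta0, 2.0, 0.0)   # = theta1 * theta2 * ln 2  -> pins down theta2

t1 = F11 / np.log(2)
t3 = F02 / (t1 * np.log(2))
t2 = F20 / (t1 * np.log(2))
print((t1, t2, t3))           # recovers (1.5, 0.7, 2.0)
```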
Extending this result for nonlinear indices, we have the following theorem.

THEOREM 2 (Nonmonotonic Models): Assume that $x \equiv (x_1, x_2)$ is a vector whose components are all continuous on a set $\mathcal{X}$ of positive probability. With the index given from (C.3b) by $v(x;\theta_0) = v_1(x_1) + v_2(x_2;\theta_0)$, assume that for $x \in \mathcal{X}$, the functions $v_1$ and $v_2$ are differentiable in $x_1$ and $x_2$ and $\partial v_1/\partial x_1 \ne 0$. Let $\theta^*$ be a value of $\theta$ such that on $\mathcal{X}$, $\partial v_2(x_2;\theta^*)/\partial x_2 = \partial v_2(x_2;\theta_0)/\partial x_2$. Then, identification holds if $\theta^* = \theta_0$ uniquely solves this equation.

To prove Theorem 2, let $v = v_1(x_1) + v_2(x_2;\theta_0)$ and $v^* \equiv v_1(x_1) + v_2(x_2;\theta^*)$. Then

(20) $G(v) \equiv P[y=1\,|\,v_1(x_1) + v_2(x_2;\theta_0)] = P[y=1\,|\,v_1(x_1) + v_2(x_2;\theta^*)] \equiv H(v^*)$.

Differentiating both sides with respect to $x_1$, it follows that for $x \in \mathcal{X}$: $dG(v)/dv = dH(v^*)/dv^*$. Employing this result, the claim now follows by differentiating both sides of the equality $G(v) = H(v^*)$ with respect to $x_2$ (where $x \in \mathcal{X}$).

3. THE ESTIMATOR AND CONSISTENCY

3.1. The Estimator

Parameter estimates are obtained by maximizing an estimated quasi-likelihood function, which will be defined and discussed in this section. We begin by defining the estimated probability functions, as they are the fundamental components of this estimation method. Recall that the "true" probability function is given as

(21) $P(v_i;\theta) = g_{1v}(v_i;\theta)/g_v(v_i;\theta)$,

where $g_{1v}(v;\theta) \equiv \Pr(y=1)g_{v|y=1}(v;\theta)$ and $g_v \equiv g_{0v} + g_{1v}$. It would seem natural to define an estimated probability function by replacing all of these components with the corresponding kernel estimates given in (C.8). However, in so doing we would encounter technical difficulties with a uniform convergence argument when estimated densities become too small. To guard against small estimated densities, let

(22) $\hat\delta_{yN} \equiv h_N^a\left[e^z/(1+e^z)\right]$, $z \equiv \left[h_N^b - \hat g_{yv}(v;\theta)\right]/h_N^c$,

be adjustment factors, where $(a,b,c)$ are parameters that must satisfy certain restrictions to be explained below. Then, with $\hat g_v \equiv \hat g_{1v} + \hat g_{0v}$ and $\hat\delta_N \equiv \hat\delta_{0N} + \hat\delta_{1N}$, we define the estimated probability function as

(23) $\hat P(v;\theta) = \left[\hat g_{1v}(v;\theta) + \hat\delta_{1N}(v;\theta)\right]/\left[\hat g_v(v;\theta) + \hat\delta_N(v;\theta)\right]$.

The purpose of the adjustment factors in (23) is to control the rate at which the numerator and denominator of estimated probability functions tend to zero. For this purpose, we first require that $0 < b < c$ in (22). Under this restriction, for small estimated densities (of smaller order than $h_N^b$) the adjustment factors behave exponentially like $h_N^a$. For sufficiently large estimated densities, the adjustment factors tend exponentially to zero, as no adjustment is required in this case. In this manner, the numerator and denominator of the estimated probability function behave asymptotically like $h_N^a$ for small estimated densities and like the unadjusted form $\hat g_{1v}/\hat g_v$ otherwise.

As a second restriction on the adjustment parameters, we need to select $a$, which essentially controls the adjustment rate for the estimated probability functions. In this regard, it must be noted that we will require two derivatives of the probability function with respect to $\theta$. As might be expected, it is technically convenient if we can ignore derivatives of the adjustment factors. In this regard, we require that $a$ be sufficiently large, and the restriction $a > 2c + 2b > 0$ will suffice. For expositional purposes, we parameterize the adjustment factors in a manner that insures that all restrictions on trimming parameters are satisfied. Namely, with $\varepsilon' > 0$, in (22) we select $a = \varepsilon'$, $b = \varepsilon'/5$, and $c = \varepsilon'/4$. Throughout we will refer to these probability adjustments as "probability trimming."
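A sketch of the adjusted estimate (23) (our illustration, again with a Gaussian kernel; the constants follow the parameterization $a = \varepsilon'$, $b = \varepsilon'/5$, $c = \varepsilon'/4$ given above):

```python
import numpy as np

def gauss_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def adjusted_phat(v, y, h, eps_prime=0.5):
    """Estimated probability function (23): kernel estimates of g_{1v}, g_{0v}
    plus the adjustment factors (22) with a = eps', b = eps'/5, c = eps'/4."""
    a, b, c = eps_prime, eps_prime / 5, eps_prime / 4
    n = len(v)
    Kmat = gauss_kernel((v[:, None] - v[None, :]) / h) / h
    np.fill_diagonal(Kmat, 0.0)
    g1 = Kmat @ y / (n - 1)            # ghat_{1v}: Pr(y=1) * ghat_{v|y=1}
    g0 = Kmat @ (1 - y) / (n - 1)      # ghat_{0v}

    def delta(g):                      # adjustment factor (22)
        z = (h**b - g) / h**c
        return h**a * np.exp(z) / (1.0 + np.exp(z))

    num = g1 + delta(g1)
    den = g0 + g1 + delta(g0) + delta(g1)
    return num / den
```

Where both density estimates are comfortably large the deltas are negligible and this reduces to the unadjusted ratio; where either is tiny, numerator and denominator are floored at order $h^a$.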
With $P_i \equiv P(v_i;\theta)$, it can be shown that $|\hat P_i(\theta) - P_i(\theta)|$ converges in probability, uniformly in $i,\theta$, to zero except for $v_i$ in a region with vanishing probability. Then, one can show that without any further trimming, the quasi-likelihood (i.e. (7) with $\hat\tau_i = 1$ for all $i$) converges to a function that is uniquely maximized at $\theta_0$. The estimator that maximizes this objective function would then be consistent. For the normality argument, we would need to show that the gradient of this objective function is asymptotically distributed as normal at $\theta = \theta_0$. Unfortunately, such an argument requires (at $\theta = \theta_0$) not only that $\hat P_i$ converge to $P_i$, but also that such convergence be at a sufficiently fast rate. Since we cannot establish such a rate in the presence of observations for which densities are "too" small, we will introduce a trimming sequence that (exponentially) downweights such observations.

Let $\hat\theta_p$ be a preliminary consistent estimate for $\theta_0$ for which $|\hat\theta_p - \theta_0|$ is $O_p(N^{-1/3})$.⁴ Then, employing (C.7), define the following "likelihood trimming" function:

(24) $\hat\tau_i \equiv \hat\tau_{i0}\hat\tau_{i1}$, $\hat\tau_{iy} \equiv \tau\left[\hat g_{yv}(v(x_i;\hat\theta_p);\hat\theta_p);\varepsilon''\right]$ for $y = 0,1$,

where the first argument of $\tau$ is the kernel estimate in (C.8) at $\theta = \hat\theta_p$. For the second argument or trimming rate, $0 < \varepsilon'' < \varepsilon'$, where $\varepsilon'$ is given as in (22). The estimated quasi-likelihood is now defined as

(25) $\hat Q_N(\theta;\hat\tau) \equiv \sum_i (\hat\tau_i/2)\left\{y_i\ln[\hat P_i(\theta)]^2 + (1-y_i)\ln[1-\hat P_i(\theta)]^2\right\}/N$.

There are several features of this expression that should be noted. First, as discussed earlier, this function remains well-defined even in the presence of (asymptotically negligible) estimated probability functions that are negative. Second, as to trimming, it should be noted that probability trimming is less severe than likelihood trimming ($\varepsilon'' < \varepsilon'$). As a consequence, we will be able to show that the gradient of $\hat Q_N$ in (25) at $\theta_0$ behaves (asymptotically) as if there were no probability trimming. This result will prove useful in the normality argument below. Third, this expression incorporates both likelihood and probability trimming. For reasons discussed earlier, probability trimming (as in (23)) is not in itself sufficient for our purposes. It might appear that likelihood trimming evaluated at $\theta$ (instead of at the preliminary estimate, $\hat\theta_p$) would suffice as the sole form of trimming. However, in this case the gradient would depend on derivatives of the likelihood trimming function with respect to $\theta$. Since the trimming function approximates an indicator, there must be regions in which such derivatives become arbitrarily large. Such regions would present problems as they are not sufficiently small to ignore in analyzing the gradient (multiplied by $N^{1/2}$).

3.2. Consistency

To prove consistency for the estimator maximizing the estimated quasi-likelihood function in (25), we need to establish the uniform (in $\theta$) limit of this function. In this section, we will sketch the argument; details of the proof are given in the Appendix, which contains all required lemmas.

Based on convergence rates for kernel estimates and their derivatives (Lemma 2), Lemma 4 of the Appendix establishes a uniform (in $i,\theta$) convergence rate for $\hat P_i(\theta)$ to $P_i(\theta)$. As a result, we are able to show that for the estimated quasi-likelihood, $\hat Q_N(\theta;\hat\tau)$ in (25), estimated probability functions can be taken as known:

(26) $\sup_\theta |\hat Q_N(\theta;\hat\tau) - Q_N(\theta;\hat\tau)| = o_p(1)$,
$Q_N(\theta;\hat\tau) \equiv \sum_i \hat\tau_i\left[y_i\ln[P_i(\theta)] + (1-y_i)\ln[1-P_i(\theta)]\right]/N$.

⁴ At the cost of complicating the argument, it is possible to permit a slower convergence rate. For the given convergence rate, the estimators given by Han (1988) (see also Sherman (1992)), Horowitz (1992), and Manski (1975) would suffice.
From Lemma 3, we may take the trimming function as given in that with $v_i \equiv v(x_i;\theta_0)$, $\tau_i \equiv \tau_{i0}\tau_{i1}$, and $\tau_{iy} \equiv \tau[g_{yv}(v(x_i;\theta_0);\theta_0);\varepsilon'']$:

(27) $\sup_\theta |Q_N(\theta;\hat\tau) - Q_N(\theta;\tau)| = o_p(1)$,
$Q_N(\theta;\tau) \equiv \sum_i \tau_i\left[y_i\ln[P_i(\theta)] + (1-y_i)\ln[1-P_i(\theta)]\right]/N$.

Finally, we show that the trimming function in (27), $\tau_i$, may be treated as if it were one (its limiting value) in that with $Q_N(\theta) \equiv Q_N(\theta;1)$:

(28) $\sup_\theta |Q_N(\theta;\tau) - Q_N(\theta)| = o_p(1)$.

This limiting function, $Q_N(\theta)$, converges in probability, uniformly in $\theta$, to its expectation with maximizing value, $\theta_L^*$, characterized by

(29) $P[y=1\,|\,x] = P[y=1\,|\,v(x;\theta_0)] = P[y=1\,|\,v(x;\theta_L^*)]$.

The first equality follows from the aggregator condition, (C.3a), and the second equality is a necessary condition for $\theta_L^*$ to maximize the limiting quasi-likelihood. From the identification assumption, (C.9), $\theta_L^* = \theta_0$ is the unique maximizing value. Consistency now follows as summarized in Theorem 3 below.

THEOREM 3 (Consistency): Under conditions (C.1)-(C.9):

$\hat\theta \equiv \arg\sup_\theta \hat Q_N(\theta;\hat\tau) \xrightarrow{p} \theta_0$.

Having established consistency, in the next section we show that $\hat\theta$ is distributed asymptotically as (root-$N$) normal with mean zero and minimum variance-covariance matrix (under the appropriate regularity conditions).

4. THE ASYMPTOTIC DISTRIBUTION OF THE ESTIMATOR

4.1. Normality

We begin with a Taylor series expansion for the gradient of the quasi-likelihood function,

(30) $N^{1/2}(\hat\theta - \theta_0) = -\left[\partial^2\hat Q[\theta^+;\hat\tau]/\partial\theta\,\partial\theta'\right]^{-1} N^{1/2}\,\partial\hat Q[\theta_0;\hat\tau]/\partial\theta$,

where $\theta^+ \in [\hat\theta,\theta_0]$. Lemma 5 establishes convergence in probability for the Hessian term in (30):

(31) $-\left[\partial^2\hat Q[\theta^+;\hat\tau]/\partial\theta\,\partial\theta'\right] \xrightarrow{p} E\left\{\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\frac{1}{P(1-P)}\right\}_{\theta=\theta_0}$.

Accordingly, it remains to show that the gradient term in (30) is asymptotically distributed as normal with mean zero. As in earlier sections, we will sketch the proof and leave the details to the Appendix.

To analyze the gradient, write it as a weighted sum of residuals:

(32) $\hat G(\theta_0) \equiv \partial\hat Q[\theta_0;\hat\tau]/\partial\theta = \frac{1}{N}\sum_i \hat\tau_i\,\hat r_i\,\hat w_i$,
$\hat r_i \equiv \frac{y_i - \hat P_i}{[\hat g_v(v_i;\theta_0) + \hat\delta_N(v_i;\theta_0)]\,\hat P_i[1-\hat P_i]}$, $\hat w_i \equiv [\hat g_v(v_i;\theta_0) + \hat\delta_N(v_i;\theta_0)]\left[\partial\hat P_i(\theta)/\partial\theta\right]_{\theta=\theta_0}$.

The decomposition is helpful in that the residuals and weights each have desirable properties that we will exploit below. The residual in (32) estimates

(33) $r_{iN} \equiv r_N(x_i;\theta_0) = \frac{y_i - P_i(\theta_0)}{[g_v(v_i;\theta_0) + \delta_N(v_i;\theta_0)]\,P_i(\theta_0)[1-P_i(\theta_0)]}$.

With $E$ denoting the indicated conditional expectation,

(34) $E[r_N(x;\theta_0)\,|\,v(x;\theta_0)] = E[r_N(x;\theta_0)\,|\,x] = 0$,

a useful property in establishing asymptotic normality. With $P_i \equiv \Pr[y=1\,|\,v(x;\theta) = v(x_i;\theta)]$, the weight function in (32) estimates $w(x_i;\theta_0) \equiv g_v(v_i;\theta_0)[\partial P_i/\partial\theta]_{\theta=\theta_0}$, which has the useful property that

(35) $E[w(x;\theta_0)\,|\,v(x;\theta_0)] = 0$.

This property will hold if the conditional expectation of the derivative of the semiparametric probability function, $P$, is zero. To examine this derivative, we need to be careful to account for the manner in which $P$ depends on $\theta$. Let $F(\cdot\,|\,x) \equiv F_{u_0|x}$ be the distribution function for $u_0$ conditioned on $x$ and $G(\cdot\,|\,s,\theta)$ the distribution function for $x$ conditioned on $v(x;\theta) = s$. Then, we define

(36) $P(s;\theta) \equiv \int F[v(x;\theta_0)\,|\,x]\,dG(x\,|\,s,\theta) = \int H[v(x;\theta_0)]\,dG(x\,|\,s,\theta)$,

where $F[v(x;\theta_0)\,|\,x] = \Pr[u_0 < v(x;\theta_0)\,|\,x]$ and $H[v(x;\theta_0)] \equiv \Pr[u_0 < v(x;\theta_0)\,|\,v(x;\theta_0)]$. In (36) the function $H$, which we introduce for notational convenience, depends only on $v(x;\theta_0)$ under the index restriction. In what follows, we want to differentiate $P$ when $s = v(t;\theta)$ for fixed $t$. To this end, let $\alpha(x;\theta) \equiv v(x;\theta_0) - v(x;\theta)$.
With $v(t;\theta)$ substituted for $s$ in (36), rewrite $P$ as

(37) $P[v(t;\theta);\theta] = \int H[\alpha(x;\theta) + v(x;\theta)]\,dG[x\,|\,v(t;\theta),\theta]$.

Then, from the chain rule with $P[v(t;\theta);\theta]$ given as in (37),⁵

(38) $\partial P/\partial\theta\,|_{\theta=\theta_0} = D_1(t,\theta_0) + D_2(t,\theta_0)$, where
$D_1(t,\theta_0) \equiv \dfrac{\partial}{\partial\theta}\left[\int H[v(x;\theta) + \alpha(x;\theta_0)]\,dG[x\,|\,v(t;\theta),\theta]\right]_{\theta=\theta_0}$,
$D_2(t,\theta_0) \equiv \dfrac{\partial}{\partial\theta}\left[\int H[v(x;\theta_0) + \alpha(x;\theta)]\,dG[x\,|\,v(t;\theta_0),\theta_0]\right]_{\theta=\theta_0}$.

To explain the two terms in (38), for simplicity consider $f[r(\theta), s(\theta)]$ as any function of $\theta$. The derivative of this function at $\theta = \theta_0$ is

(39) $df/d\theta\,|_{\theta_0} = \left\{[\partial f/\partial r][dr/d\theta] + [\partial f/\partial s][ds/d\theta]\right\}_{\theta=\theta_0} = \left\{df[r(\theta), s(\theta_0)]/d\theta + df[r(\theta_0), s(\theta)]/d\theta\right\}_{\theta=\theta_0}$.

Notice that in the first term, in which $r$ varies and $s$ is held fixed, we may evaluate $s$ at $\theta_0$ before we differentiate. An analogous result holds for the term in which $s$ varies and $r$ is held fixed. Employing this argument in (38), in $D_1$ we consider variations in $\theta$ with $\alpha$ held fixed. As this term will be evaluated at $\theta = \theta_0$, in $\alpha$ we are entitled to replace $\theta$ with $\theta_0$ before we differentiate. In an analogous manner, we obtain $D_2$ by letting $\alpha$ vary and holding all other functions of $\theta$ fixed.

With $D_i(t,\theta_0)$ given in (38), $i = 1,2$, let $D(x,\theta_0) \equiv D_1(x,\theta_0) + D_2(x,\theta_0)$. Then, (35) will follow if $E[D(x;\theta_0)\,|\,v(x;\theta_0)] = 0$. To establish this result, we will proceed by simplifying each of the terms in (38). Under (C.4b)-(C.5), $H$ in (38) is differentiable in a neighborhood of $\theta_0$. Then, since $\alpha(x;\theta_0) = 0$,

(40) $D_1(t;\theta_0) = \dfrac{d}{d\theta}\left[\int H[v(x;\theta)]\,dG[x\,|\,v(t;\theta),\theta]\right]_{\theta=\theta_0} = \dfrac{d}{d\theta}\left[H[v(t;\theta)]\int dG[x\,|\,v(t;\theta),\theta]\right]_{\theta=\theta_0} = \dfrac{d}{d\theta}\left[H[v(t;\theta)]\right]_{\theta=\theta_0}$.

For $D_2$, since $\alpha(x;\theta_0) = 0$ and $d\alpha/d\theta = -\partial v(x;\theta)/\partial\theta$, from (C.4b)-(C.5) we may differentiate within the integral to obtain

(41) $D_2(t;\theta_0) = \left[\int \dfrac{d}{d\theta}\left(H[v(x;\theta_0) + \alpha(x;\theta)]\right)dG(x\,|\,v(t;\theta_0),\theta_0)\right]_{\theta=\theta_0} = -E[D_1(x;\theta_0)\,|\,v(x;\theta_0) = v(t;\theta_0)]$,

where the above expectation is taken with respect to the distribution of $x$ conditioned on $v(x;\theta_0) = v(t;\theta_0)$. From (38)-(41),

(42) $E[D(x;\theta_0)\,|\,v(x;\theta_0)] = E[D_1(x;\theta_0)\,|\,v(x;\theta_0)] - E[D_1(x;\theta_0)\,|\,v(x;\theta_0)] = 0$.

⁵ We are grateful to Whitney Newey for suggesting this "chain-rule" approach for the case in which $x$ and $u$ are independent. A similar argument holds for models satisfying an index restriction.

Proceeding with the normality argument, the strategy is to show that all estimated components of the gradient can be replaced by the corresponding true components. If this substitution is permissible, then asymptotic normality will readily follow, as the new gradient will conform to a standard central limit theorem. Let $G_N(\theta_0)$ be the "true" gradient obtained by replacing $\hat\tau_i$, $\hat r_i$, and $\hat w_i$ in $\hat G(\theta_0)$ with $\tau_i$, $r_{iN}$, and $w_i$. In Lemma 6 of the Appendix, we prove $N^{1/2}[\hat G(\theta_0) - G_N(\theta_0)] = o_p(1)$. To briefly sketch the argument here, from (32)-(35) decompose this difference as follows:

(43) $N^{1/2}[\hat G(\theta_0) - G_N(\theta_0)]$
(43a) $\quad = A:\ N^{-1/2}\sum_i \tau_i(\hat r_i\hat w_i - r_{iN}w_i)$
(43b) $\quad + B:\ N^{-1/2}\sum_i (\hat\tau_i - \tau_i)(\hat r_i\hat w_i - r_{iN}w_i)$
(43c) $\quad + C:\ N^{-1/2}\sum_i (\hat\tau_i - \tau_i)\,r_{iN}w_i$.

In the Appendix, we show that each of the above components converges to zero in probability. As the arguments for terms B and C are similar to and simpler than those for A, here we will sketch the proof for A and leave other details to the Appendix.

To analyze A in (43a), write it as

(44) $A = \underbrace{N^{-1/2}\sum_i \tau_i(\hat r_i - r_{iN})w_i}_{A_1} + \underbrace{N^{-1/2}\sum_i \tau_i(\hat r_i - r_{iN})(\hat w_i - w_i)}_{A_2} + \underbrace{N^{-1/2}\sum_i \tau_i r_{iN}(\hat w_i - w_i)}_{A_3}$.

For $A_1$, note from (32) that the estimated residual is a ratio of estimated functions: $\hat r_i = (y_i - \hat P_i)/\hat a_i$, where $\hat a_i \equiv [\hat g_v(v_i;\theta_0)+\hat\delta_N(v_i;\theta_0)]\hat P_i(1-\hat P_i)$. As estimated denominators pose a technical problem for the (mean-square convergence) argument that we seek to employ, expand $\hat r_i$ as a function of $1/\hat a_i$ about $1/a_i$:

(45) $\tau_i\hat r_i = \tau_i\hat r_i^* + o_p(N^{-1/2})$, $\hat r_i^* \equiv \left[\frac{y_i - \hat P_i}{a_i}\right]\left[1 - \frac{\hat a_i - a_i}{a_i}\right]$,

where the remainder term above is uniformly (in $i$) of order $o_p(N^{-1/2})$. Then,

(46) $A_1^* \equiv N^{-1/2}\sum_i \tau_i(\hat r_i^* - r_{iN})w_i = A_1 + o_p(1)$.
Proceeding with $A_1^*$,

(47) $E(A_1^*)^2 = E\left(\sum_i \tau_i^2(\hat r_i^* - r_{iN})^2 w_i^2/N\right)$.

Notice that in obtaining the above expectation, we have ignored all cross-product terms. By taking an iterated expectation, conditioning first on $x_i$, $i = 1,\dots,N$, and then on $v_i$, $i = 1,\dots,N$, such terms vanish from the property of the weight function in (35) (see Lemma 6 of Appendix A). Each remaining term within the expectations operator in (47) converges to zero in probability and can be shown to satisfy an integrability condition. Therefore, the expectation in (47) converges to zero as claimed.

Turning to $A_2$, it converges in probability to zero because from Lemma 4 one can show that

(48) $\sup_i |\tau_i^{1/2}(\hat r_i - r_{iN})|\cdot\sup_i |\tau_i^{1/2}(\hat w_i - w_i)| = o_p(N^{-1/2})$.

For the final "A-term," $A_3$, the argument is somewhat similar to that for $A_1$. Namely, by making use of the fact that the "true" residual has zero conditional expectation (see (34)), this term converges to zero in mean-square and is therefore $o_p(1)$. From the above analysis of the gradient and (30)-(33):

(49) $N^{1/2}(\hat\theta - \theta_0) = \left[E\left\{\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\frac{1}{P(1-P)}\right\}\right]^{-1}_{\theta=\theta_0} N^{-1/2}\sum_i \tau_i r_{iN} w_i + o_p(1)$.

Since this expression consists of a sum of i.i.d. terms, each with zero mean and bounded $p$th absolute moment, $p > 2$, from Serfling (1980, Corollary, p. 32) we have the following result.

THEOREM 4 (Asymptotic Normality): Under conditions (C.1)-(C.9), the asymptotic distribution of $N^{1/2}(\hat\theta - \theta_0)$ is $N(0,\Sigma)$, where

$\Sigma \equiv \left[E\left\{\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\frac{1}{P(1-P)}\right\}\right]^{-1}_{\theta=\theta_0}$.

It should be remarked that the covariance matrix above can be estimated in the usual way when employing this distributional result for inference purposes. The informational equality holds here between the negative of the Hessian and the expected outer product gradient matrices. Consequently, one may employ the standard estimated Hessian, outer product gradient, or White's (1982) estimator to estimate the covariance matrix. As a result, with the quasi-likelihood treated as if it were the true likelihood, standard errors given by a conventional likelihood routine will be correct provided that the functional form for the value function is correct. Indeed, for purposes of inference, the estimated quasi-likelihood may be treated as if it were a likelihood function. In this manner, for example, one can conduct the usual likelihood ratio tests.

4.2. Asymptotic Efficiency

Having established asymptotic normality, we now turn to the issue of efficiency. In the next theorem, we show that under an independence assumption the estimator attains the efficiency bound as given in Chamberlain (1986) and Cosslett (1987).

THEOREM 5 (Asymptotic Efficiency): For the index model given in equation (1), assume that $x$ and $u$ are independent and denote $\Sigma$ as the covariance matrix given in Theorem 4. Then $\Sigma$ attains the efficiency bound as specified in Chamberlain (1986) and Cosslett (1987).

To prove Theorem 5, note that the inverse efficiency bound for this problem as given in Cosslett (1987, p. 587) is

(50) $E\left\{\frac{\mathrm{Var}[\partial v(x;\theta_0)/\partial\theta\,|\,v(x;\theta_0)]\,f_{u_0}^2[v(x;\theta_0)]}{P(y=1\,|\,x)[1-P(y=1\,|\,x)]}\right\}$,

where $f_{u_0}$ is the density for $u_0$. From Theorem 4,

(51) $\Sigma^{-1} = E\left\{\left[\frac{\partial P}{\partial\theta}\right]\left[\frac{\partial P}{\partial\theta}\right]'\frac{1}{P(1-P)}\right\}_{\theta=\theta_0}$.

When $u_0$ is independent of $x$, the theorem readily follows from the expression for $\partial P(\theta)/\partial\theta\,|_{\theta_0}$ in (38)-(42).
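Operationally, the remark following Theorem 4 reduces to the standard quasi-MLE recipe. A minimal sketch (ours, under the assumption that the user supplies the negative mean quasi-likelihood, e.g. the earlier `quasi_loglik` sketch with its sign flipped):

```python
import numpy as np
from scipy.optimize import minimize

def fit_and_cov(neg_mean_loglik, theta_start, n):
    """Quasi-MLE with standard errors via the informational equality: the
    Hessian of the mean objective estimates E[(dP/dtheta)(dP/dtheta)'/(P(1-P))],
    so cov(theta_hat) is approximated by (Hessian)^{-1} / n. BFGS's hess_inv
    is only an approximation to that inverse Hessian."""
    res = minimize(neg_mean_loglik, theta_start, method="BFGS")
    cov = res.hess_inv / n
    return res.x, np.sqrt(np.diag(cov))
```

With the earlier sketch the call would be `fit_and_cov(lambda t: -quasi_loglik(t, x, y, h, index), theta_start, len(y))`.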
5. MONTE-CARLO RESULTS

The usefulness of the proposed estimator depends at least partly on its performance at moderate sample sizes, where perhaps the leading issues are the efficiency loss relative to a correct fully parametric specification and effectiveness in situations where the correct parametric specification is unusual. To give an indication of the performance of the estimator in both these regards, we have performed a series of simulations, of which we report three here.

For all three simulations, the true model is given by $y_i^* = \beta_1 x_{i1} + \beta_2 x_{i2} + u_i$ for $i = 1,\dots,100$ and $y_i = 1$ if $y_i^* > 0$ and $y_i = 0$ otherwise. The $x$'s are independently and identically distributed. For the compact support case, in the first two simulations, $x_1$ is a chi-squared variate with 3 degrees of freedom truncated at 6 and standardized to have zero mean and unit variance; $x_2$ is a standard normal variate truncated at $\pm 2$ and similarly standardized. Notice that in these first two simulations we are comparing estimators without assuming that one of the $x$'s has a density that is differentiable everywhere. For technical convenience, we formulated the theory under the assumption that there was one $x$ with a density that was everywhere differentiable. It is possible to relax this assumption, and for purposes of evaluating the semiparametric estimator, we wanted to present some results that did not impose such smoothness. The third simulation has the $x$'s constructed in the same manner except that they are not truncated.⁶ In Design 1, the $u_i$'s are standard normal; in Designs 2 and 3 they are normal with mean zero and variance $.25(1+v_i^2)^2$, where $v_i$ is $\beta_1 x_{i1} + \beta_2 x_{i2}$. In all designs $\beta_1 = \beta_2 = 1$, and the $u_i$'s are independently distributed.

⁶ In the first two designs, $x_1$ is standardized by subtracting 2.348 and dividing by 1.511; $x_2$ is divided by .8796. Thus the range of $x_1$ is $(-1.551, 2.420)$ and $x_2$ is $(-2.274, 2.274)$. The third design is included to show a case where probit is markedly biased.

TABLE I

STANDARDIZED CUMULANTS OF SIMULATED ESTIMATES OF $\beta_1$
(Jackknife Estimates of Standard Errors in Parentheses.)

                             $\kappa_1$          $\kappa_2$          $\kappa_3/\kappa_2^{3/2}$  $\kappa_4/\kappa_2^2$
Design 1
  Untrimmed Semiparametric   0.99958 (0.00392)   0.01532 (0.00082)    0.01400 (0.13875)         0.82470 (0.38237)
  Probit MLE                 0.99987 (0.00346)   0.01196 (0.00056)   -0.07293 (0.09369)         0.21619 (0.21359)
Design 2
  Untrimmed Semiparametric   0.99758 (0.00421)   0.01773 (0.00138)   -0.08483 (0.31315)         3.99493 (1.04304)
  Probit MLE                 1.00508 (0.00691)   0.04767 (0.00254)    0.11877 (0.12233)         0.82421 (0.30838)
Design 3
  Untrimmed Semiparametric   0.99195 (0.00469)   0.02195 (0.00153)   -0.23636 (0.25983)         2.81254 (0.87196)
  Probit MLE                 0.91525 (0.00784)   0.06146 (0.00310)   -0.15995 (0.10444)         0.53852 (0.23182)

To facilitate comparison between probit and the semiparametric estimator, we adopt the normalization $|\beta_1| + |\beta_2| = 2$. In our simulations, we compute probit (with an intercept) in the standard manner and impose the normalization on the result; for the semiparametric estimator the normalization is imposed in the estimation process. Thus we need to report results concerning only one number, $\hat\beta_1$.

In what follows, results are presented for the untrimmed semiparametric estimator and the probit MLE. There is a wide range of trimming specifications that have almost no effect on the estimates. Moreover, the estimate obtained without any trimming performed quite similarly to that under the trimming that we employed. Accordingly, we report results for the semiparametric estimator obtained without probability or likelihood trimming (see the discussion in Section 3.1).
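A sketch of the data-generating process for Design 2 (our reconstruction from the description above; the truncation-and-standardization constants are those reported in footnote 6):

```python
import numpy as np

def truncated_draws(rng, sampler, keep, n):
    """Rejection sampling: draw until n realizations satisfy the truncation rule."""
    out = np.empty(0)
    while out.size < n:
        d = sampler(rng, n)
        out = np.concatenate([out, d[keep(d)]])
    return out[:n]

def design2(rng, n=100, b1=1.0, b2=1.0):
    """Design 2 of Section 5: truncated, standardized regressors and a
    heteroscedastic normal disturbance with variance .25(1 + v^2)^2."""
    x1 = truncated_draws(rng, lambda r, m: r.chisquare(3, m), lambda d: d <= 6, n)
    x1 = (x1 - 2.348) / 1.511                 # constants from footnote 6
    x2 = truncated_draws(rng, lambda r, m: r.standard_normal(m),
                         lambda d: np.abs(d) <= 2, n)
    x2 = x2 / 0.8796
    v = b1 * x1 + b2 * x2
    u = 0.5 * (1 + v**2) * rng.standard_normal(n)   # sd = .5(1 + v^2)
    y = (v + u > 0).astype(float)
    return np.column_stack([x1, x2]), y

X, y = design2(np.random.default_rng(42))
```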
Throughout, we estimated densities with adaptive local smoothing as defined in (C.8b), with local smoothing parameters defined without any trimming. The nonstochastic window component, $h_N$, we set at $N^{-1/6.02}$ to satisfy the restrictions placed on it.⁷

Table I presents estimates of the first and second cumulants and the third and fourth standardized cumulants of $\hat\beta_1$ for the three designs, derived from a simulation of size 1,000. The second row of the results for each estimator is the square root of the jackknife estimate of the variance of each cumulant, and thus provides a rough estimate of the standard error of the cumulant estimates.

⁷ The corresponding pilot window component, $h_{Np}$, was set equal to $h_N$. All theoretical results can be shown to hold for this case. It should be noted, however, that after completing this study, we discovered that the results under local smoothing are much easier to establish when the pilot window component is set wider than $h_N$.
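The jackknife standard errors in Table I can be computed along the following lines (a generic sketch, not the authors' code): delete one simulation replication at a time, recompute the cumulant, and scale the variability of the leave-one-out values.

```python
import numpy as np

def jackknife_se(stat, draws):
    """Jackknife standard error of stat(draws) over replications:
    se^2 = (n-1)/n * sum_i (stat_(i) - mean of stat_(i))^2,
    where stat_(i) omits replication i."""
    n = len(draws)
    loo = np.array([stat(np.delete(draws, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean())**2))

# Example: SE of the standardized fourth cumulant kappa4/kappa2^2.
def std_kurtosis(b):
    m = b - b.mean()
    k2 = np.mean(m**2)
    k4 = np.mean(m**4) - 3 * k2**2
    return k4 / k2**2

rng = np.random.default_rng(1)
beta_hats = 1 + 0.12 * rng.standard_normal(1000)  # stand-in for simulated estimates
print(std_kurtosis(beta_hats), jackknife_se(std_kurtosis, beta_hats))
```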
Clearly the results presented here are only indicative of the performance of the estimator in a few cases. But apparently there are some cases where the estimator is vastly superior to probit (even when probit is virtually unbiased) and some indication that the efficiency loss relative to a correctly specified parametric model can be relatively small. In addition, these exercises suggest that semiparametric estimation can serve as a specification test of parametric models, a topic we intend to pursue in future research. 8 For our estimator the fitted probabilities converge to the true probabilities at a rate proportional to N1/2hN -N1/3. Thus 1000 observations are required to get about the same precision in fitted probabilities as is provided by 100 observations for parameter estimation. At 100 observations, there is not much difference between plots analogous to Figures 1.B and 1.C. 408 R. W. KLEIN AND R. H. SPADY P(y=l Iv)for Designs I and 2 j - |--- 1 Design Design 2 c0J /11 ~ 0 2 o -4 -2 4 FIGURE L.A v Probitvs. TrueP(y=lIv),Design2, n=1000 vs. TrueP(y=1Iv),Design2, n=1000 Semiparametric V~~~~~ o , 0.0 0.0 rrUe p-3WIR 0.2 0.4 0.6 0.8 TINep OebIyM 1.0 FIGURE 1.C FIGURE L.B 6. CONCLUSIONS As discussed above, we have formulateda "likelihoodbased" estimatorfor the binarydiscrete choice modejunder minimaldistributionalassumptions.We have shown that in large samples this estimator is consistent, N'72 normally distributed,and that it attains an asymptoticefficiencybound. We also evaluatedthe estimatorin a Monte-Carlostudyby comparingit with the probit estimator for several model specifications.In terms of estimated parameters,we found that the semiparametricestimatorhad a small efficiency loss relative to probit when the probit model was correct. For a "substantial" perturbationof the probit model, the semiparametricestimatorstronglydominated the probit model. Although this paper has focused on parameteresti- BINARY RESPONSE MODELS 409 True ProbabilityPerspective FIGURE 2.A Estimated Semiparametric Probability Perspective, n=1000 FIGURE 2.B Estimated Probit Probability Perspective, n=1000 FIGURE 2.C mates, it is interesting to note that in many situations choice probabilities will be inconsistently estimated by probit, but consistently estimated by a semiparametric estimator. In several heteroscedastic designs reported here, we found that the parametric probit model provided substantially worse probability estimates than the semiparametric procedure. Restricting attention to the binary response model, it is possible to extend the above results in several respects. First, there can be a finite number of points at which the distribution of the index is not "smooth." For example, if the index is a linear combination of independent uniform random variables, then the density for the index will have kink points at which derivatives do not exist. These points can be detected by comparing left and right numerical derivatives. 410 R. W. KLEIN AND R. H. SPADY Employing the same type of trimming function (i-) used to detect small densities, we can then downweight observations for which numerical right and left derivatives are "sufficiently" far apart. All asymptotic results then hold.9 As a second extension, it should be noted that the results above were obtained under a compact support assumption for the density of the index. To control for problematic small densities, in the above proof we downweighted those observations at which the index densities became small. 
In so doing, the trimming functions did not depend on whether or not the index had a compact support. Accordingly, under such density trimming, one would expect that the above arguments would, with suitable modifications, hold when the index has an infinite support. In obtaining the results outlined above, the parametric estimator was examined for the binary discrete choice model. However, it extends to the multinomial case. For example, consider the trinary choice problem. Let alternative Ak0 k = 1,2,3, have indirect utility Uk to a given individual. Let Uk be the corresponding measure of average utility, and define the average utility differences: and v3 Ul - U3. Then with these utility differences serving as 12 =-U- u2 aggregators, we may write choice probabilities as (52) P(AlIx) =P(Ul > U2, Ul > U3Ix) =P(Ul =P(Al)gl(U[lIAl)/g1(v[1]), > U2, Ul > U31v[l]) V[1] =(12, v13). It is now possible to construct a multinomial quasilikelihood as above in terms of estimated probability functions. Notice that in this example we would require bivariate density estimation. In general, as the number of alternatives increases, the estimation problem becomes more difficult, and the theoretical argument becomes more delicate (see, for example, Lee (1989)). The estimator obtained here can also be extended to ordered single index models. Here, unlike the mlutiple alternatives case (which involves multiple indices), the extension is direct. For each interval of the dependent variable, we may estimate probabilities as above under a single index. Then form a quasilikelihood for the ordered case that is analogous to that for the ordered parametric model. Given the results obtained here, it is relatively straightforward to establish consistency and asymptotic normality for the quasi-maximum likelihood estimator. Bellcore, 445 South Street, P.O. Box 1910, Morristown,NJ 07962, U.S.A. and Nuffield College, Oxford, OX] INF, U.K Manuscript received March, 1988; final version received September,1992. 9In general, suppose that there are a finite number of "problematic" points at which the maintained assumptions do not hold. Then, provided that one can detect such points, the theory will hold if these points are appropriately downweighted. BINARY RESPONSE APPENDIX A: 411 MODELS BIAs REDUCING KERNELS In Appendix A, we provide the asymptotic results for bias reducing kernels. In Appendix B, we outline those additional arguments required to obtain the same asymptotic results under locallysmoothed kernels. Throughout, we let g,,lY(v; 0) be the conditional density for v v(x; 0) conditioned on the discrete variable, y, and evaluated at v1 v(xi; 0). It is convenient to state convergence results in terms of the estimator for Pr(y)gvly(v,; 0) gyv(v,; 0). From (C.8) define this estimator as N F-V.-v.1 1y= ,(v1;0)= E (Al) K K h j (N-1), y=O,l. We will require higher order derivatives of these estimates throughout the proofs below. Notationally, define the rth order derivative of any function g with respect to z by (A2) (A) 1r Dz =0, r-=1, 2,. * )r DzL[g]-\drg/(dz When z is a vector, interpret the derivative as the matrix of all rth order partials and cross-partials. We will also denote the above derivative by g r when the argument of g with respect to which we take the derivative is clear from the context. Throughout, we assume the basic conditions (C.1)-(C.9) given in Section 2 of the text and denote c as a positive generic constant. 
Given the convergence rates established in Lemmas 1-2 and known trimming (Lemma 3), consistency and asymptotic normality will follow for the proposed estimator. We begin in Lemma 1, the proof of which is based on Bhattacharya's (1967) method, by determining a uniform convergence rate of an estimate to its expectation.

LEMMA 1 (Convergence of Estimates to Expectations): Let $w$ be a $K$-dimensional vector and assume that $m(w)$ is a sample average of terms $m(w; z_i)$, $i = 1,\ldots,N$, where $\{z_i\}$ are i.i.d. Assume that with $h_N \to 0$ we have uniformly over $N$: $h_N^{r+1}|m(w; z_i)| < c$, $r + 1 > 0$, and $h_N^s|dm(w; z_i)/dw| < c$, $s > 0$. Let $E[m(w)]$ be the expectation of $m(w)$ taken over the distribution of $z_i$, $i = 1,\ldots,N$. Then with $w \in \mathcal{W}$, a compact set, and for any $a > 0$,

$$N^{(1-a)/2}h_N^{r+1}\,\sup_w\big|m(w) - E[m(w)]\big| \to 0 \quad \text{a.s.}$$

PROOF OF LEMMA 1: With $w_1$ and $w_2$ any two values of $w$ and $w^+ \in [w_1, w_2]$,

$$m(w_1) - m(w_2) = \big[dm(w^+)/dw\big]'(w_1 - w_2).$$

Since $h_N^s\big|dm(w^+)/dw\big| \le c$, with $w_{1k}$ and $w_{2k}$ as elements of $w_1$ and $w_2$,

$$\big|m(w_1) - m(w_2)\big| \le c\,h_N^{-s}\sum_{k=1}^K |w_{1k} - w_{2k}|.$$

Therefore, with $\varepsilon_N > 0$ and $\delta_N \equiv \varepsilon_N h_N^s/(3cK)$,

$$|w_{1k} - w_{2k}| < \delta_N \;\Longrightarrow\; \big|m(w_1) - m(w_2)\big| < \varepsilon_N/3.$$

Since $w \in \mathcal{W}$, a compact space, without loss of generality let $\mathcal{W}$ be contained in a hypercube whose sides each have "length" one. Then, following Bhattacharya (1967), cover $\mathcal{W}$ with sets $\mathcal{W}_{jN}$, $j = 1,\ldots,b_N$, where

$$(w_1, w_2) \in \mathcal{W}_{jN} \;\Longrightarrow\; |w_{1k} - w_{2k}| < \delta_N, \quad k = 1,\ldots,K.$$

Here, with $\mathrm{int}(t)$ as the least integer upper bound on $t$, $b_N = \mathrm{int}[1/(\delta_N)^K]$. Then, for $w_j \in \mathcal{W}_{jN}$,

$$\sup_w\big|m(w) - E[m(w)]\big| \le (2/3)\varepsilon_N + \max_j\big|m(w_j) - E[m(w_j)]\big|.$$

Therefore,

$$P\Big[\sup_w\big|m(w) - E[m(w)]\big| > \varepsilon_N\Big] \le \sum_{j=1}^{b_N} P\Big[\big|m(w_j) - E[m(w_j)]\big| > \varepsilon_N/3\Big] = \sum_{j=1}^{b_N} P\Big[\big|m^*(w_j) - E[m^*(w_j)]\big| > h_N^{r+1}\varepsilon_N/3\Big],$$

where $m^*(w_j) \equiv h_N^{r+1}m(w_j)$. As noted by Bhattacharya, from Hoeffding (1963; Theorem 1, p. 15 and its extension, p. 25) it follows that since $m^*$ is an average of i.i.d. bounded random variables,

$$P\Big[\big|m^*(w_j) - E[m^*(w_j)]\big| > h_N^{r+1}\varepsilon_N/3\Big] \le 2\exp\big[-c'Nh_N^{2(r+1)}\varepsilon_N^2\big],$$

where $c'$ is a positive constant. Therefore,

$$P\Big[\sup_w\big|m(w) - E[m(w)]\big| > \varepsilon_N\Big] \le 2b_N\exp\big[-c'Nh_N^{2(r+1)}\varepsilon_N^2\big].$$

With $\varepsilon_N$ replaced by $[N^{(1-a)/2}h_N^{r+1}]^{-1}\varepsilon$, where $\varepsilon$ is a finite constant that does not depend on $N$, the sum of the last term over $N = 1,2,\ldots$ is finite; the result now follows. Q.E.D.
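To see how Lemma 1 produces the density rates used below, consider the summand of (A1) as a worked special case (with the bounding constants left generic):

$$m(w; z_j) = 1\{y_j = y\}\,K\big[(w - v_j)/h_N\big]/h_N.$$

Boundedness of $K$ gives $h_N|m(w; z_j)| \le \sup|K| \le c$, so that $r + 1 = 1$; boundedness of $K'$ gives $h_N^2|dm(w; z_j)/dw| \le \sup|K'| \le c$, so that $s = 2$. Lemma 1 then yields, for any $a > 0$,

$$N^{(1-a)/2}h_N\,\sup_w\big|\hat g_{yv}(w) - E[\hat g_{yv}(w)]\big| \to 0 \quad \text{a.s.},$$

which is the $r = 0$ case of (2.1) in Lemma 2 below, up to the $N^{a/2}$ slack.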
In proving that an estimate converges uniformly to the truth, the strategy is to use Lemma 1 to show that the estimate converges uniformly to its expectation, and then that its expectation converges uniformly to the truth. In this manner, Lemma 2 obtains rates for estimated densities and derivatives.

LEMMA 2 (Uniform Convergence): Uniformly on $v(t;\theta)$ and $\theta$,

(2.1) $\big|D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta]\big] - E\big\{D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta]\big]\big\}\big| = O_p\big(N^{-1/2}h_N^{-(r+1)}\big), \quad r = 0,1,2;$

(2.2) $\big|E\big\{D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta]\big]\big\} - D_\theta^r\big[g_{yv}[v(t;\theta);\theta]\big]\big| = O_p\big(h_N^2\big), \quad r = 1,2;$

(2.3) $\big|E\big[\hat g_{yv}[v(t;\theta);\theta]\big] - g_{yv}[v(t;\theta);\theta]\big| = O_p\big(h_N^4\big).$

PROOF OF LEMMA 2: The proof of (2.1) immediately follows from Lemma 1. To establish (2.2), with $z \equiv [v(x;\theta) - v(t;\theta)]/h_N$ for fixed $t$,

$$E\big\{D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta]\big]\big\} = D_\theta^r E\big\{\hat g_{yv}[v(t;\theta);\theta]\big\}{}^{10} = \Pr(y)\,D_\theta^r\int_{-\infty}^{\infty}K(z)\,g_{v|y}\big[v(t;\theta) + h_Nz;\theta\big]\,dz = \Pr(y)\int_{-\infty}^{\infty}K(z)\,D_\theta^r\big\{g_{v|y}\big[v(t;\theta) + h_Nz;\theta\big]\big\}\,dz.$$

Next, expand the integrand above as a function of $h_N$ and terminate the expansion at the $h_N^2$ term. From the symmetry of the kernel, the $h_N$ term vanishes. The result now follows from the regularity conditions on the kernel and density derivatives. The proof of (2.3) is also based on a Taylor series expansion in $h_N$, where the expansion is taken up to the $h_N^4$ term. From symmetry of the kernel, the $h_N$ and $h_N^3$ terms vanish. Since $\int z^2K(z)\,dz = 0$ (the bias-reducing kernel assumption), the $h_N^2$ term also vanishes. The result now follows from the assumed regularity conditions on the kernel and density derivatives. Q.E.D.

10 Note that in this representation we have moved the derivative operator in and out of the expectations operator. In each case, the function being integrated is differentiable in a neighborhood of $\theta_0$ for almost all $x$ and satisfies an appropriate dominance condition.

Apart from the trimming function, Lemma 2 provides the asymptotic behavior of the components of the (estimated) quasi-likelihood. Lemma 3 below characterizes the large-sample behavior of the trimming function.

LEMMA 3 (Estimated vs. Known Trimming): Let $\hat g_{yi}(\theta_p) \equiv \hat g_{yv}[v(x_i;\theta_p);\theta_p]$, where $\theta_p$ is a preliminary or pilot estimator for which $(\theta_p - \theta_0) = O_p(N^{-1/3})$.11 Define a smooth trimming function as

$$\tau(t) \equiv 1/\big[1 + e^{z(t)}\big], \qquad z(t) \equiv \big(h_N^b - t\big)/h_N^c,$$

where $0 < b < c < 1/3$ and $N^{-1/6} < h_N < N^{-1/8}$. Let $\hat\tau_{iy} \equiv \tau[\hat g_{yi}(\theta_p)]$ and $\tau_{iy} \equiv \tau[g_{yi}]$, $g_{yi} \equiv g_{yv}(v_i;\theta_0)$. Then $\hat\tau_{iy}$ has the following representation:

(a) $\hat\tau_{iy} = \tau_{iy} + h_N^{-c}\big(\hat g_{yi}(\theta_0) - g_{yi}\big)(1 - \tau_{iy})\tau_{iy} + h_N^{-2c}\big(\hat g_{yi}(\theta_0) - g_{yi}\big)^2C_{iy}(1 - \tau_{iy})\tau_{iy} + o_p(N^{-1/2})$, where $C_{iy} = O(1)$ uniformly in $i$, $y$, and where the remainder term is $o_p(N^{-1/2})$ uniformly in $i$;

(b) an analogous representation, with remainder of order $o_p(h_N)$, holds for $\hat\tau_{iy}(\theta) \equiv \tau[\hat g_{yv}(v(x_i;\theta);\theta)]$ and its derivatives $D_\theta^r[\hat\tau_{iy}(\theta)]$, where the order of the remainder term is uniform in $i$, $y$, and $\theta$.

PROOF OF LEMMA 3: For (a), the proof follows by first expanding $\hat\tau_{iy}$ as a function of $\hat g_{yi}$ about $g_{yi}$, and then expanding as a function of $\theta$ about $\theta_0$. For (b), the proof follows an analogous argument. Q.E.D.

Employing Lemmas 1-3, we can now determine convergence rates for estimated probabilities and their derivatives. In obtaining these results, it is convenient to do so in terms of adjusted or trimmed probability functions. With $0 < \varepsilon' < 1$, $v \equiv v(x;\theta)$, and $\hat g_{yv}(v;\theta)$ as above, let adjustment factors, which satisfy the restrictions in (22) of subsection 3.1, be given as

(A3) $\hat\delta_{yN} \equiv h_N^{\varepsilon'}\big[e^z/(1 + e^z)\big], \qquad z \equiv \big(h_N^{\varepsilon'/5} - \hat g_{yv}\big)/h_N^{\varepsilon'/4},$ and $\hat\delta_N \equiv \hat\delta_{0N} + \hat\delta_{1N}$.

With $g_{yv}(v;\theta)$ replacing $\hat g_{yv}(v;\theta)$, define $\delta_{yN}(v;\theta)$ and $\delta_N(v;\theta)$ analogously. Then, with $\hat g_v(v;\theta) \equiv \hat g_{0v}(v;\theta) + \hat g_{1v}(v;\theta)$ as the estimated marginal density for $v$, define the following probability functions:

(A4) Estimate: $\hat P(v;\theta) \equiv \big[\hat g_{1v}(v;\theta) + \hat\delta_{1N}(v;\theta)\big]\big/\big[\hat g_v(v;\theta) + \hat\delta_N(v;\theta)\big];$

(A5) Trimmed: $P_N(v;\theta) \equiv \big[g_{1v}(v;\theta) + \delta_{1N}(v;\theta)\big]\big/\big[g_v(v;\theta) + \delta_N(v;\theta)\big];$

(A6) True: $P(v;\theta) \equiv g_{1v}(v;\theta)\big/g_v(v;\theta).$

Lemma 4 now obtains results in terms of these three functions.

11 This assumption can be relaxed at the expositional expense of incorporating higher order terms into the series representation of Lemma 4. Manski's (1975) maximum score estimator, Horowitz's (1992) smooth version of this estimator, and Han's rank estimator (see Sherman (1993)) all converge to the truth at least as fast as $O_p(N^{-1/3})$.
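In code, the adjustment and the resulting probability estimate of (A3)-(A4) look as follows (a sketch; `eps` stands for $\varepsilon'$ and its default value is an illustrative choice):

```python
import numpy as np

def delta(g, h, eps):
    # Adjustment factor (A3): h^eps * e^z/(1+e^z), z = (h^(eps/5) - g)/h^(eps/4).
    # It is close to h^eps when the density estimate g is small, and vanishes
    # exponentially once g is well above h^(eps/5).
    z = (h ** (eps / 5.0) - g) / h ** (eps / 4.0)
    return h ** eps / (1.0 + np.exp(-z))        # stable form of e^z/(1+e^z)

def p_hat(g1, g0, h, eps=0.5):
    # Adjusted probability estimate (A4): keeps the ratio well-defined even
    # where both conditional density estimates are tiny.
    d1, d0 = delta(g1, h, eps), delta(g0, h, eps)
    return (g1 + d1) / (g1 + g0 + d1 + d0)
```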
LEMMA 4 (Convergence of Estimated Probabilities and their Derivatives): Let $\varepsilon'$, $0 < \varepsilon' < 1$, be the adjustment parameter in (A3) employed to control probability functions in (A4)-(A5). Then, for $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$, and under the conditions of Lemmas 1-2, uniformly in $v \in \mathcal{V}$ and $\theta \in \Theta$:

(4.1a) $h_N^{\varepsilon'}\big|D_\theta^r\big\{\hat P[v(t;\theta);\theta]\big\} - D_\theta^r\big\{P_N[v(t;\theta);\theta]\big\}\big| = O_p\big(N^{-1/2}h_N^{-(r+1)}\big), \quad r = 0,1,2;$

(4.1b) $h_N^{\varepsilon'}\big|\hat P[v(t;\theta);\theta] - P_N[v(t;\theta);\theta]\big| = O_p\big(N^{-1/2}h_N^{-1}\big).$

Let $\mathcal{V}_N \equiv \{v: g_{0v}(v;\theta) > h_N^{\varepsilon},\; g_{1v}(v;\theta) > h_N^{\varepsilon}\}$, where, with $\varepsilon'$ as given in (A3), $0 < \varepsilon < \varepsilon'/5$. Note that under this condition, the probability adjustment factors $\delta_{0N}$ and $\delta_{1N}$ vanish exponentially. Then, under the above conditions, uniformly on $\mathcal{V}_N$ and $\theta \in \Theta$:

(4.2a) $h_N^{\varepsilon}\big|D_\theta^r\big\{\hat P[v(t;\theta);\theta]\big\} - D_\theta^r\big\{P[v(t;\theta);\theta]\big\}\big| = O_p\big(N^{-1/2}h_N^{-(r+1)}\big), \quad r = 0,1,2;$

(4.2b) $h_N^{\varepsilon}\big|\hat P[v(t;\theta);\theta] - P[v(t;\theta);\theta]\big| = O_p\big(N^{-1/2}h_N^{-1}\big).$

PROOF OF LEMMA 4: Convergence rates for estimated probability functions and their derivatives are similar to those for estimated densities and their derivatives. The only difference is that for the former the rates must be decreased by the rate at which the relevant denominators tend to zero. We illustrate the argument for $r = 2$ in (4.1a); other results are proved analogously.12 A typical term from $D_\theta^2\{\hat P[v(t;\theta);\theta]\} - D_\theta^2\{P_N[v(t;\theta);\theta]\}$ is given by13

$$T \equiv \big[D_\theta^2\{\hat g_{1v} + \hat\delta_{1N}\}\big]\big/\big[\hat g_v + \hat\delta_N\big] - \big[D_\theta^2\{g_{1v} + \delta_{1N}\}\big]\big/\big[g_v + \delta_N\big].$$

With $\Delta_N \equiv \big[(\hat g_v + \hat\delta_N) - (g_v + \delta_N)\big]\big/\big[g_v + \delta_N\big]$, from the geometric expansion of an inverse,

$$1\big/\big[\hat g_v + \hat\delta_N\big] = \big[1/(g_v + \delta_N)\big]\big[1 - \Delta_N + \Delta_N^2(1 - \Delta_N)^{-1}\big].$$

From Lemmas 2-3 and the fact that $(g_v + \delta_N)$ is bounded below by a multiple of $h_N^{\varepsilon'}$, substituting this expansion into $T$ yields $T = O_p\big(N^{-1/2}h_N^{-3}\big)\big/h_N^{\varepsilon'} + o_p\big(N^{-1/2}h_N^{-3-\varepsilon'}\big)$. The result now follows from the convergence rates for the derivatives in Lemma 2 and the bound on $g_v + \delta_N$. Q.E.D.

12 In the argument for probability functions, it is convenient to first write $\hat P - P_N = \big[(\hat g_{1v} + \hat\delta_{1N}) - (g_{1v} + \delta_{1N})\big]/\big[\hat g_v + \hat\delta_N\big] + P_N\big[(g_v + \delta_N) - (\hat g_v + \hat\delta_N)\big]/\big[\hat g_v + \hat\delta_N\big]$. For reciprocals, with $a_N \equiv (P_N - \hat P)/P_N$, write $1/\hat P = (1/P_N)\big[1 - a_N\big]^{-1} = (1/P_N)\big[1 + a_N + a_N^2(1 - a_N)^{-1}\big]$. If $P$ is bounded from below by $\underline{P} > 0$, then $P_N$ is also bounded from below and the convergence rate of $1/\hat P$ to $1/P_N$ will be the same as that for $\hat P$ to $P_N$. If $P$ is not bounded away from zero, the convergence rate will be slowed by an additional factor of order $h_N^{\varepsilon'}$.

13 There will be other terms with squared densities in the denominator. For example, one such typical term is given by $\big[D_\theta\{\hat g_{1v} + \hat\delta_{1N}\}\big]^2/\big[\hat g_v + \hat\delta_N\big]^2 - \big[D_\theta\{g_{1v} + \delta_{1N}\}\big]^2/\big[g_v + \delta_N\big]^2$. The convergence rate increases when first rather than second derivatives appear in the numerator, and this increase more than compensates for the additional density term in the denominator.

From Lemmas 3-4, consistency now follows by showing that the quasi-likelihood converges uniformly in $\theta$ to a function uniquely maximized at $\theta_0$. In obtaining this result, we make all assumptions in Lemmas 3-4. In particular, with $\varepsilon'$, $0 < \varepsilon' < 1$, as the probability trimming parameter, the window $h_N$ is selected such that $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$.
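Before turning to the proof, here is a sketch of the resulting objective, reusing the hypothetical helpers `g_yv_hat` and `p_hat` from the earlier sketches; the trimming exponents `b` and `c` are illustrative values satisfying $0 < b < c < 1/3$:

```python
import numpy as np

def tau(g, h, b=0.1, c=0.2):
    # Smooth trimming function of Lemma 3: tau(t) = 1/(1 + e^z), z = (h^b - t)/h^c.
    return 1.0 / (1.0 + np.exp((h ** b - g) / h ** c))

def quasi_loglik(theta, x, y, h):
    # Trimmed quasi-likelihood for a linear index v = x @ theta (a sketch,
    # not the authors' exact implementation).
    v = x @ theta
    total = 0.0
    for i in range(len(y)):
        g1 = g_yv_hat(i, 1, v, y, h)        # estimates Pr(y=1) g_{v|1}(v_i)
        g0 = g_yv_hat(i, 0, v, y, h)        # estimates Pr(y=0) g_{v|0}(v_i)
        p = p_hat(g1, g0, h)                # adjusted probability (A4)
        t = tau(g0, h) * tau(g1, h)         # downweight small densities
        total += t * (y[i] * np.log(p) + (1 - y[i]) * np.log(1.0 - p))
    return total / len(y)
```

In practice one would maximize this criterion over $\theta$ subject to a scale normalization on the index (the model identifies $\theta_0$ only up to location and scale), for example by fixing one coefficient at unity and using a derivative-free optimizer.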
PROOF OF THEOREM 3 (Consistency): Let the estimated probabilities, $\hat P_i$, be defined as in Lemma 4 and the trimming functions, $\hat\tau_{iy}$ and $\tau_{iy}$, as in Lemma 3. Let $\hat\tau_i \equiv \hat\tau_{i0}\hat\tau_{i1}$ and $\tau_i \equiv \tau_{i0}\tau_{i1}$. Then the quasi-likelihood is given as

$$\hat Q(\theta) \equiv \sum_i (\hat\tau_i/2)\Big[y_i\ln\big[\hat P_i(\theta)^2\big] + (1 - y_i)\ln\big[(1 - \hat P_i(\theta))^2\big]\Big]\Big/N \equiv \hat Q_1(\theta) + \hat Q_0(\theta).$$

Viewing $\hat Q_1(\theta)$ as a function of $\hat P_i(\theta)$, from a Taylor series expansion about $P_N(v_i;\theta)$ as given in (A5) and from Lemmas 2-4,

$$\sup_\theta\big|\hat Q_1(\theta) - Q_{1N}(\theta)\big| \overset{p}{\to} 0, \qquad Q_{1N}(\theta) \equiv \sum_i \hat\tau_i\big[y_i\ln[P_N(v_i;\theta)]\big]\big/N.$$

In obtaining this result, note that since $P_N > 0$, $(1/2)\ln[P_N^2] = \ln[P_N]$. With $q_{1N}(v_i;\theta) \equiv y_i\ln[P_N(v_i;\theta)]$ and $b_i(\theta) \equiv 1 - \tau[g_{1v}(v_i;\theta);\varepsilon'']$ for $\varepsilon'' < \varepsilon'$ and $\tau$ as defined in (C.7), write

$$Q_{1N}(\theta) = \sum b_i(\theta)\hat\tau_i q_{1N}(v_i;\theta)/N + \sum\big[1 - b_i(\theta)\big]\hat\tau_i q_{1N}(v_i;\theta)/N.$$

For the $b_i$-terms, note that uniformly in $(i,\theta)$, $\sup|q_{1N}| = O(1)$ if $P$ and $1 - P$ are uniformly bounded away from zero.14 Thus,

$$\sup_\theta\Big|\sum b_i\hat\tau_i q_{1N}(v_i;\theta)/N\Big| \le \sup|q_{1N}|\,\sup_\theta\sum b_i/N \overset{p}{\to} 0.$$

For the $(1 - b_i)$-terms, we may take $g_v > g_{1v} > h_N^{\varepsilon}$, for $\varepsilon < \varepsilon'/5$, as any terms for which this condition does not hold will vanish exponentially. Then, from Lemma 4,

$$\sum(1 - b_i)\hat\tau_i q_{1N}(v_i;\theta)/N - \sum(1 - b_i)\tau_i q_{1i}(\theta)/N \overset{p}{\to} 0 \quad \text{uniformly in } \theta,$$

where $q_{1i}$ is obtained from $q_{1N}$ by replacing $P_N(v_i;\theta)$ with $P(v_i;\theta)$, the "true" probability function in (A6). Then, since $\sum(1 - b_i)\tau_i q_{1i}(\theta)/N - \sum\tau_i q_{1i}(\theta)/N \overset{p}{\to} 0$ uniformly in $\theta$,

$$\sup_\theta\big|\hat Q_1(\theta) - Q_1(\theta)\big| \overset{p}{\to} 0, \qquad Q_1 \equiv \sum\tau_i q_{1i}/N.$$

By an analogous argument, with $\hat Q_0(\theta)$ containing the $y_i = 0$ terms of the quasi-likelihood and $Q_0(\theta)$ as the corresponding terms evaluated at the true probability functions, $\hat Q_0(\theta)$ converges in probability, uniformly in $\theta$, to $Q_0(\theta)$. By standard arguments, $Q_1 + Q_0$ converges in probability, uniformly in $\theta$, to its expectation, with maximum $\theta^*$ satisfying for almost all $x$:

$$P(y = 1\mid x) = P\big(y = 1\mid v(x;\theta_0)\big) \quad \text{and} \quad P(\theta_0) = P(\theta^*),$$

where the first equality holds because $v(x;\theta_0)$ aggregates information and the second equality characterizes a maximum. From the identification assumption, $\theta^* = \theta_0$; consistency then follows. Q.E.D.

14 For $P$ and $1 - P$ uniformly bounded away from zero, we must show that $P_N$ and $1 - P_N$ are bounded away from zero. Recalling that $g_v = g_{1v}/P$, write $P_N$ as

$$P_N \ge P\big[g_{1v} + h_N^{\varepsilon'}(1 - \tau_1)\big]\big/\big[g_{1v} + h_N^{\varepsilon'}(1 - \tau_0) + h_N^{\varepsilon'}(1 - \tau_1)\big] \equiv P_N^*,$$

where $1 - \tau_y$ denotes the logistic factor $e^z/(1 + e^z)$ in (A3) evaluated at $g_{yv}$. As the above expression is an increasing function of $g_{1v}$, for $g_{1v} > .5h_N^{\varepsilon'/5}$,

$$P_N \ge P\big[.5h_N^{\varepsilon'/5}\big]\big/\big[.5h_N^{\varepsilon'/5} + 2h_N^{\varepsilon'}\big] \ge P\big/\big[1 + 4h_N^{4\varepsilon'/5}\big] \ge P/5.$$

For $g_{1v} \le .5h_N^{\varepsilon'/5}$, as $P_N^*$ is increasing in $g_{1v}$,

$$P_N \ge P(1 - \tau_1)\big/\big[(1 - \tau_0) + (1 - \tau_1)\big] \ge P(1 - \tau_1)/2.$$

As $(1 - \tau_1)$ converges exponentially to one in this region, the result follows. The argument for $1 - P_N$ is analogous. It should be remarked that consistency will still hold, under a more delicate argument, if $P$ is permitted to approach zero and/or one. In this case, for the $b$-terms, one must show that $\ln(P_N)$ grows slowly enough to be dominated by the rate at which the average of the $b_i$-terms tends to zero. For the $(1 - b)$-terms, a tail assumption on $P$ is required.

To prove asymptotic normality, let $\hat H(\theta)$ and $\hat G(\theta)$ denote the Hessian and gradient, respectively, for the quasi-likelihood of Theorem 3. From a standard Taylor series expansion for the gradient:

(A7) $N^{1/2}(\hat\theta - \theta_0) = -\big[\hat H(\theta^+)\big]^{-1}N^{1/2}\hat G(\theta_0), \qquad \theta^+ \in [\hat\theta, \theta_0].$

Asymptotic normality will follow from a convergence result for the Hessian (Lemma 5) and normality for the normalized gradient (Lemma 6).
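For reference, differentiating the quasi-likelihood term by term gives the gradient in the residual-weighted form used below (a direct calculation; the dependence of the trimming factor on $\theta$ is suppressed here, which the formal argument does not do):

$$\hat G(\theta) = \frac{1}{N}\sum_i \hat\tau_i\,\frac{y_i - \hat P_i(\theta)}{\hat P_i(\theta)\big[1 - \hat P_i(\theta)\big]}\,D_\theta\hat P_i(\theta),$$

and one further differentiation yields an outer-product term plus a term weighted by the residual $y_i - \hat P_i(\theta)$; these are, respectively, the $A$ and $B$ terms analyzed in the proof of Lemma 5 below.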
LEMMA 5 (Convergence of the Hessian): With $\varepsilon'$ as the probability adjustment parameter in (A3), assume $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$. Then, under the conditions of Lemmas 3-4 and with $\theta^+$ as any value for $\theta$ such that $\theta^+ \in [\hat\theta, \theta_0]$,

$$\hat H(\theta^+) \overset{p}{\to} H(\theta_0) \equiv -E\Big[\tau\,\frac{[D_\theta P][D_\theta P]'}{P(1 - P)}\Big].$$

PROOF OF LEMMA 5: As in the consistency argument, from Lemmas 2-4 we may replace (in probability, uniformly in $\theta$) the estimated probability function and its derivatives with $P_N(v_i;\theta)$ in (A5) and its derivatives. Following the consistency argument, we may also replace the likelihood trimming function, $\hat\tau_i$, with the corresponding true function, $\tau_i$. Denote $H_N(P_N,\theta)$ as the resulting Hessian matrix. We will show that $H_N(P_N,\theta)$ converges uniformly in $\theta$ to its limiting expectation (Lemma 1), which we will show is finite. With $H(\theta)$ denoting this limiting expectation, it will then follow that $H_N(P_N,\theta^+) \overset{p}{\to} H(\theta_0)$. At $\theta_0$, $\tau_i$ keeps densities sufficiently far from zero to insure that $H(\theta_0)$ has the form required by the lemma.

Proceeding, with $v_i \equiv v(x_i;\theta)$ and $W_{iN} \equiv \{P_N(v_i;\theta)[1 - P_N(v_i;\theta)]\}^{-1}$,

$$H_N(P_N,\theta) = A + B,$$
$$A \equiv -\sum\tau_i W_{iN}\big[D_\theta P_N(v_i;\theta)\big]\big[D_\theta P_N(v_i;\theta)\big]'/N,$$
$$B \equiv \sum\tau_i\big[y_i - P_N(v_i;\theta)\big]\,D_\theta\big\{W_{iN}\,D_\theta P_N(v_i;\theta)\big\}/N.$$

For the $A$-term, since $[D_\theta P_N(v_i;\theta)][D_\theta P_N(v_i;\theta)]'$ is $O(h_N^{-2\varepsilon'})$, one can show from Lemma 1 that $A(\theta) - E[A(\theta)] \to 0$ uniformly in $\theta$, where $E[A(\theta)]$ is finite.15 Therefore, since $\hat\theta$ is consistent, the uniform convergence arguments above insure that for $\theta^+ \in [\hat\theta, \theta_0]$,

$$A(\theta^+) \overset{p}{\to} \lim\big[E[A(\theta_0)]\big] = -E\Big[\tau\,\frac{[D_\theta P][D_\theta P]'}{P(1 - P)}\Big],$$

where the latter claim follows at $\theta = \theta_0$ because $\tau_i$ converges exponentially to zero unless both conditional densities are appropriately bounded away from zero. Employing a similar argument, the $B$-term also converges uniformly in $\theta$ to its expectation, which at $\theta_0$ is zero. Q.E.D.

15 To demonstrate the argument, it suffices to consider one of the terms comprising this expectation. In so doing, to simplify the notation, we will consider the scalar case. Proceeding with one of these terms,

$$E\Big[\big(\partial\big[g_{1v}(v;\theta) + \delta_{1N}(v;\theta)\big]/\partial\theta\big)^2\big/\big[g_v(v;\theta) + \delta_N(v;\theta)\big]^2\Big] \le 2E\Big[\big(\big[\partial g_{1v}(v;\theta)/\partial\theta\big]^2 + \big[\partial\delta_{1N}(v;\theta)/\partial\theta\big]^2\big)\big/\big[g_v(v;\theta) + \delta_N(v;\theta)\big]^2\Big].$$

For the second term, $\partial\delta_{1N}/\partial\theta = h_N^{\varepsilon'-\varepsilon'/4}\tau(1 - \tau)\,O(\partial g_{1v}/\partial\theta)$, which vanishes exponentially unless $g_{1v}$ is $O(h_N^{\varepsilon'/5})$, in which case $[\partial\delta_{1N}(v;\theta)/\partial\theta]^2/[g_v(v;\theta) + \delta_N(v;\theta)]^2$ is $o(1)$. An upper bound for the first term above is $\int[\partial g_{1v}(v;\theta)/\partial\theta]^2/[g_v(v;\theta)]\,dv$. As shown in (14) of Section 2, $[\partial g_{1v}(v;\theta)/\partial\theta]^2 = O([\partial g_{1v}(v;\theta)/\partial v]^2)$. Since $g_{1v}$ has four bounded derivatives with respect to $v$, when $g_{1v}$ tends to zero, then so must these derivatives. One can show that the squared first derivative converges to zero faster than does the density itself (see e.g. the example in (15), Section 2). The claim now follows.

To analyze the gradient, write it as a weighted sum of residuals:

(A8) $\hat G(\theta_0) \equiv \sum\hat\tau_i\hat r_i\hat w_i/N, \qquad \hat r_i \equiv [y_i - \hat P_i]/\hat a_i, \qquad \hat a_i \equiv \hat g_v(v_i;\theta_0)\big[\hat P_i(1 - \hat P_i)\big], \qquad \hat w_i \equiv \hat g_v(v_i;\theta_0)\big[\partial\hat P_i(v_i;\theta)/\partial\theta\big]\big|_{\theta=\theta_0}.$

Under this decomposition, with $a_{iN} \equiv [g_v(v_i;\theta_0) + \delta_N(v_i;\theta_0)][P_i(1 - P_i)]$, $r_{iN} \equiv [y_i - P_i]/a_{iN}$, and $w_i \equiv g_v(v_i;\theta_0)[\partial P_i/\partial\theta]\big|_{\theta=\theta_0}$ denoting the true functions to which $\hat a_i$, $\hat r_i$, and $\hat w_i$ "tend" (e.g. $|\hat a_i - a_{iN}| \overset{p}{\to} 0$), and with $v_{i0} \equiv v(x_i;\theta_0)$:

(A9) $E(r_{iN}\mid x_i) = E(r_{iN}\mid v_{i0}) = E(w_i\mid v_{i0}) = 0.$

Employing (A7)-(A8), Lemma 6 shows that all estimated functions defining the gradient in (A8) may be replaced by their limits:

LEMMA 6: With $\varepsilon''$ as the likelihood trimming parameter and $\varepsilon'$ as the probability trimming parameter, assume that $0 < \varepsilon'' < \varepsilon' < 1$. Select the window, $h_N$, such that $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$. Then, under the assumptions of Lemmas 3-4,

$$N^{-1/2}\sum\hat\tau_i\hat r_i\hat w_i - N^{-1/2}\sum\tau_i r_{iN}w_i = o_p(1).$$
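The proof organizes the left-hand side of this display through an elementary decomposition, recorded here explicitly (it can be checked by expanding the right-hand side): for each $i$,

$$\hat\tau_i\hat r_i\hat w_i - \tau_i r_{iN}w_i = \tau_i\big(\hat r_i\hat w_i - r_{iN}w_i\big) + \big(\hat\tau_i - \tau_i\big)\big(\hat r_i\hat w_i - r_{iN}w_i\big) + \big(\hat\tau_i - \tau_i\big)r_{iN}w_i.$$

Summing over $i$ and scaling by $N^{-1/2}$ gives the terms $A$, $B$, and $C$ of (A10) in the proof below.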
PROOF OF LEMMA 6: With $G_N(\theta_0) \equiv \sum\tau_i r_{iN}w_i/N$, write $N^{1/2}\hat G(\theta_0) - N^{1/2}G_N(\theta_0)$ as

(A10) $N^{1/2}\hat G(\theta_0) - N^{1/2}G_N(\theta_0) = \underbrace{N^{-1/2}\sum\tau_i\big(\hat r_i\hat w_i - r_{iN}w_i\big)}_{A} + \underbrace{N^{-1/2}\sum\big(\hat\tau_i - \tau_i\big)\big(\hat r_i\hat w_i - r_{iN}w_i\big)}_{B} + \underbrace{N^{-1/2}\sum\big(\hat\tau_i - \tau_i\big)r_{iN}w_i}_{C}.$

In analyzing these terms, it is important to note that $\hat\tau_i$ essentially restricts all densities to be no smaller than $O(h_N^{\varepsilon''/5})$, $0 < \varepsilon'' < \varepsilon' < 1$. As defined in Lemma 4, we denote $\mathcal{V}_N$ as the set of $v(x_i;\theta_0)$ whose densities are at least this large. Proceeding with $A$,

$$A = \underbrace{N^{-1/2}\sum\tau_i\big(\hat r_i - r_{iN}\big)w_i}_{A_1} + \underbrace{N^{-1/2}\sum\tau_i\big(\hat r_i - r_{iN}\big)\big(\hat w_i - w_i\big)}_{A_2} + \underbrace{N^{-1/2}\sum\tau_i r_{iN}\big(\hat w_i - w_i\big)}_{A_3}.$$

For $A_1$, expand $\hat r_i$ in a Taylor series about $a_{iN}$:

$$\hat\tau_i\hat r_i = \hat\tau_i r_i^* + o_p(N^{-1/2}), \qquad r_i^* \equiv \big[y_i - \hat P_i\big]\big[1 - (\hat a_i - a_{iN})/a_{iN}\big]\big/a_{iN},$$

where the remainder term above uniformly has the stated order. Letting $A_1^*$ be obtained from $A_1$ by replacing $\hat r_i$ with $r_i^*$, it suffices to show that $A_1^*$ converges to zero in mean-square:

$$E\big([A_1^*]^2\big) = E\Big[\sum\tau_i^2w_i^2\big(r_i^* - r_{iN}\big)^2/N\Big] + E\Big[\sum_{i\neq j}\tau_i\tau_j\big(r_i^* - r_{iN}\big)\big(r_j^* - r_{jN}\big)w_iw_j\Big]\Big/N.$$

From Lemma 4 and a uniform integrability condition,16 the first component above converges to zero. For the second component, letting $X \equiv \{x_k,\ k = 1,\ldots,N\}$ and $V \equiv \{v(x_k;\theta_0),\ k = 1,\ldots,N\}$,

$$E\Big(\sum_{i\neq j}\big(r_i^* - r_{iN}\big)\big(r_j^* - r_{jN}\big)\tau_i\tau_jw_iw_j\Big) = E\Big(E\big[\big(r_i^* - r_{iN}\big)\big(r_j^* - r_{jN}\big)\tau_i\tau_j\mid X\big]\,E\big[w_iw_j\mid V\big]\Big) = 0.$$17

For $A_2$, since $\hat\tau_i$ essentially restricts $v_i$ to $\mathcal{V}_N$, and $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$, from Lemma 4

$$|A_2| \le N^{1/2}\sup_{\mathcal{V}_N}\big|\hat w_i - w_i\big|\,\sup_{\mathcal{V}_N}\big|\hat\tau_i\big(\hat r_i - r_{iN}\big)\big| = o_p(1).$$

Turning finally to $A_3$, the argument parallels that for $A_1$:

$$E\big[(A_3)^2\big] = E\Big[\sum\tau_i^2r_{iN}^2\big(\hat w_i - w_i\big)^2\Big]\Big/N + E\Big[\sum_{i\neq j}\tau_i\tau_jr_{iN}r_{jN}\big(\hat w_i - w_i\big)\big(\hat w_j - w_j\big)\Big]\Big/N.$$

From the uniform integrability condition employed to analyze $A_1$, as the first term is $o_p(1)$, it also converges to zero in expectation. For the second term, recall that $\hat w_k$ does not depend on $y_k$ and that $w_k$ depends only on $X \equiv \{x_i,\ i = 1,\ldots,N\}$. Therefore, taking an iterated expectation in which we first condition on $X$, the second term above simplifies to

$$E\Big[\sum_{i\neq j}r_{iN}r_{jN}\tau_i\tau_j\hat w_i\hat w_j\Big]\Big/N = E\Big[\sum_{i\neq j}r_{iN}r_{jN}\tau_i\tau_j\big[\hat w_{i[j]} + \bar w_i\big]\big[\hat w_{j[i]} + \bar w_j\big]\Big]\Big/N,$$

where $\hat w_{i[j]}$ is the component of $\hat w_i$ containing all terms that depend neither on $y_i$ nor on $y_j$, and $\hat w_{j[i]}$ is defined analogously. Then, since neither of these two components contributes to the above expectation and since $\bar w_i$ and $\bar w_j$ are each $O[(Nh_N^2)^{-1}]$, the expectation of the cross-product terms converges to zero as required.

Turning to the term $B$ in (A10) above, under Lemmas 2-3 and the same mean-square convergence argument used to analyze $A_3$, $B = o_p(1)$. For the last term in (A10), $C$, replace $\hat\tau_i$ with the representation shown in Lemma 3. Since each term in this representation contains $\tau_i$, in every term $v_i$ will be restricted to $\mathcal{V}_N$. Every term can then be written as a weighted sum of trimmed residuals. Accordingly, we may employ the same argument used to show that $A = o_p(1)$ to show that $C = o_p(1)$. Q.E.D.

From Lemma 6, the gradient is equivalent to a random variable to which a standard central limit theorem applies. Therefore, we have the following theorem.

THEOREM 4 (Asymptotic Normality): Under Lemmas 5-6 above, $N^{1/2}(\hat\theta - \theta_0)$ is asymptotically distributed as $N(0,\Sigma)$, where

$$\Sigma = \Big(E\Big[\tau\,\frac{[\partial P/\partial\theta][\partial P/\partial\theta]'}{P(1 - P)}\Big]\Big)^{-1}.$$

PROOF OF THEOREM 4: From Lemmas 5-6,

$$N^{1/2}(\hat\theta - \theta_0) = -\big[H(\theta_0)\big]^{-1}N^{-1/2}\sum\tau_ir_{iN}w_i + o_p(1).$$

From the definitions of $r_{iN}$ and $w_i$ and Serfling (1980, Corollary, p. 32),

$$N^{-1/2}\sum\tau_ir_{iN}w_i - N^{-1/2}\sum\tau_ir_iw_i \overset{p}{\to} 0.$$

The theorem now follows because $N^{-1/2}\sum\tau_ir_iw_i$ is asymptotically distributed as

$$N\Big(0,\ E\Big[\tau\,\frac{[\partial P/\partial\theta][\partial P/\partial\theta]'}{P(1 - P)}\Big]\Big). \qquad Q.E.D.$$

16 From Chung (1974, Thm. 4.5.2), if $[Z_N - Z]^2 \overset{p}{\to} 0$ and $\sup_N E\big[|Z_N - Z|^2\big] = O(1)$, then $\lim\big[E[Z_N - Z]^2\big] = 0$.

17 For example, with independence over observations and $E(r_{iN}) = 0$, $E\big[r_{iN}r_{jN}\tau_i\tau_jw_iw_j\big] = E\big[E(r_{jN}\mid X)\,E(r_{iN}\tau_i\tau_jw_iw_j\mid X)\big] = 0$.

APPENDIX B: LOCAL SMOOTHING

As the arguments under local smoothing are quite similar to those of Appendix A, here we will briefly outline the argument. From (C8.b), recall that the kernel estimate of $g_{yv}(v_i;\theta)$ under local smoothing is

(B1) $\hat g_{yv}(v_i;\theta,\hat\lambda_y;h_N) \equiv \sum_{j\neq i}1\{y_j = y\}\,\dfrac{1}{h_N\hat\lambda_{yj}}K\Big(\dfrac{v_i - v_j}{h_N\hat\lambda_{yj}}\Big)\Big/(N - 1).$

Abstracting from scaling considerations (see (C8.b)),

(B2) $\hat\lambda_{yj} = \lambda_N\big[\hat g_{yv}(v_j;\theta,h_{Np})\big],$

where $\lambda_N$ is a function that depends on $\hat g_{yv}(\cdot;\theta,h_{Np})$, a pilot or first-stage kernel density estimate of $g_{yv}$ with pilot window $h_{Np}$. The function $\lambda_N$ also depends on $N$ through a trimming parameter that controls the rate at which $\hat\lambda_{yj}$ can approach zero (see (C8.b)).

Before proceeding, it should be noted that bias calculations will depend on estimated pilot density derivatives. To deal with this bias, it is helpful to set wider pilot windows ($h_{Np}$) than second stage windows ($h_N$). In so doing, the proof simplifies because it becomes possible to take the estimated local smoothing parameters as given (see e.g. Klein (1991)).
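A minimal sketch of (B1)-(B2), assuming an Abramson (1982)-type square-root law for the local smoothing parameters; the geometric-mean normalization and the floor `lam_min` (standing in for the trimming in (C8.b)) are illustrative choices:

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def pilot_density(v, h_pilot):
    # First-stage (pilot) kernel density estimate at every sample point.
    z = (v[:, None] - v[None, :]) / h_pilot
    return gauss(z).mean(axis=1) / h_pilot

def smoothing_params(v, h_pilot, lam_min=1e-3):
    # Abramson square-root law: lambda_j proportional to pilot density^(-1/2),
    # normalized by its geometric mean, and kept away from zero by lam_min.
    g = pilot_density(v, h_pilot)
    lam = (g / np.exp(np.mean(np.log(g)))) ** (-0.5)
    return np.maximum(lam, lam_min)

def g_yv_local(i, y_val, v, y, h, lam):
    # Locally smoothed leave-one-out estimate (B1) of g_{yv}(v_i).
    n = len(v)
    keep = (np.arange(n) != i) & (y == y_val)
    z = (v[i] - v[keep]) / (h * lam[keep])
    return np.sum(gauss(z) / (h * lam[keep])) / (n - 1)
```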
In Appendix A, we obtained convergence results in Lemma 2 for estimated densities and derivatives under bias reducing kernels. Once we have obtained analogous results under locally smoothed kernels, the proofs of consistency and normality will be very similar to those in Appendix A. We begin by providing these results for pilot densities (the local smoothing parameters).

LEMMA 7 (Pilot Estimates): In the notation of (B1)-(B2), for the pilot density estimates:

(a) $\sup_{v,\theta}\big|D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta,h_{Np}]\big] - E\big\{D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta,h_{Np}]\big]\big\}\big| = O_p\big(N^{-1/2}h_{Np}^{-(r+1)}\big);$

(b) $\sup_{v,\theta}\big|E\big\{D_\theta^r\big[\hat g_{yv}[v(t;\theta);\theta,h_{Np}]\big]\big\} - D_\theta^r\big[g_{yv}[v(t;\theta);\theta]\big]\big| = O_p\big(h_{Np}^2\big).$

PROOF OF LEMMA 7: From Lemma 1 of Appendix A, (a) holds. The proof of (b) follows along the same lines as the proof of Lemma 2 in Appendix A. Q.E.D.

Employing Lemma 7, we can now obtain results analogous to Lemma 2 of Appendix A. To this end, from (B2) it is convenient to define an "average" local smoothing parameter:

(B3) $\bar\lambda_{yj} \equiv \lambda_N\big[E\big(\hat g_{yv}(v_j;\theta,h_{Np})\big)\big].$

From (B1), denote $\hat g_{yv}(v;\theta,\bar\lambda_y;h_N)$ as the kernel estimate obtained by replacing $\hat\lambda_{yj}$ with $\bar\lambda_{yj}$. From a standard Taylor series argument and Lemma 7, one can establish convergence rates (uniform in $v$, $\theta$) to zero for

(B4) $\big|D_\theta^r\big(\hat g_{yv}(v;\theta,\hat\lambda_y;h_N)\big) - D_\theta^r\big(\hat g_{yv}(v;\theta,\bar\lambda_y;h_N)\big)\big|.$

From Lemma 1, we can now obtain a uniform convergence rate to zero for the latter term in (B4) to its expectation that is analogous to that in Lemma 2 of Appendix A. Proceeding as in Lemma 2, from a Taylor series expansion in $h_N$ about zero, we obtain a convergence rate for the expectation of this term to the truth. When the local smoothing parameters are known and bounded away from zero, one can show that the expected density estimate converges to the truth at a rate (see Silverman (1986)) of $h_N^4$. When local smoothing parameters are not bounded away from zero, uniform convergence rates are decreased by the proximity of the density to zero (the bias depends on the reciprocal of the density raised to a power). Under the trimming in (C.8b), one can show that the convergence rate exceeds $N^{-1/2}h_N^{-1}$ with estimated local smoothing parameters. This rate suffices for our purposes (see Lemma 6, the analysis of term $A$).

One can now prove convergence results for estimated probabilities and derivatives that are analogous to Lemma 4 of Appendix A. Employing Lemma 3 of Appendix A, which characterizes the large-sample behavior of the trimming functions, $\hat\tau_i$, we can then establish consistency and asymptotic normality in essentially the same manner as in Appendix A.
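As a quick consistency check on the window conditions used throughout both appendices (a back-of-envelope calculation under the stated rates, not an additional result): with $h_N = N^{-\gamma}$, the bracket $N^{-1/(6+2\varepsilon')} < h_N < N^{-1/8}$ means $1/8 < \gamma < 1/(6+2\varepsilon')$. Then

$$N^{1/2}h_N^4 = N^{1/2 - 4\gamma} \to 0 \quad \text{(the bias-reduced bias of Lemma 2, (2.3))},$$
$$N^{-1/2}h_N^{-(3+\varepsilon')} = N^{-1/2 + \gamma(3+\varepsilon')} \to 0 \quad \text{(the trimmed } r = 2 \text{ sampling error in (4.1a))},$$

the first requiring $\gamma > 1/8$ and the second $\gamma < 1/(6+2\varepsilon')$: the two endpoints of the bracket are exactly these two requirements.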
REFERENCES

ABRAMSON, I. S. (1982): "On Bandwidth Variation in Kernel Estimates-A Square Root Law," The Annals of Statistics, 10, 1217-1223.
BEGUN, J., W. HALL, W. HUANG, AND J. WELLNER (1983): "Information and Asymptotic Efficiency in Parametric-Nonparametric Models," The Annals of Statistics, 11, 432-452.
BHATTACHARYA, P. K. (1967): "Estimation of a Probability Density Function and its Derivatives," The Indian Journal of Statistics: Series A, 373-383.
CHAMBERLAIN, G. (1986): "Asymptotic Efficiency in Semiparametric Models with Censoring," Journal of Econometrics, 32, 189-218.
CHUNG, K. L. (1974): A Course in Probability Theory. New York: Academic Press.
COSSLETT, S. R. (1983): "Distribution-Free Maximum Likelihood Estimation of the Binary Choice Model," Econometrica, 51, 765-782.
(1987): "Efficiency Bounds for Distribution-Free Estimators of the Binary Choice and Censored Regression Models," Econometrica, 55, 559-586.
FIX, E., AND J. L. HODGES (1951): "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," Technical Report No. 4, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
HAN, A. K. (1984): "The Maximum Rank Correlation Estimator in Classical, Censored, and Discrete Regression Models," Technical Report No. 412, Institute for Mathematical Studies in the Social Sciences, Stanford University, Stanford, CA.
(1988): "Large Sample Properties of the Maximum Rank Correlation Estimator in Generalized Regression Models," Working Paper, Harvard University.
HOEFFDING, W. (1963): "Probability Inequalities for Sums of Bounded Random Variables," Journal of the American Statistical Association, 58, 13-30.
HOROWITZ, J. L. (1992): "A Smoothed Maximum Score Estimator for the Binary Response Model," Econometrica, 60, 505-531.
ICHIMURA, H. (1986): "Estimation of Single Index Models," Unpublished manuscript.
KLEIN, R. W. (1991): "Density Estimation with Estimated Local Smoothing Parameters," Technical Memo #TM-ARH-07103, Bellcore.
LEE, LUNG-FEI (1989): "Semiparametric Maximum Profile Likelihood Estimation of Polytomous and Sequential Choice Models," University of Minnesota, Discussion Paper No. 253.
MANSKI, C. F. (1975): "The Maximum Score Estimator of the Stochastic Utility Model of Choice," Journal of Econometrics, 3, 205-228.
(1985): "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator," Journal of Econometrics, 27, 313-333.
(1988): "Identification of Binary Response Models," Journal of the American Statistical Association, 83, 729-738.
NEWEY, W. K. (1989): "The Asymptotic Variance of Semiparametric Estimators," Princeton University, Econometric Research Program Memo. No. 346.
(1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics, 5, 99-135.
POWELL, J., J. STOCK, AND T. M. STOKER (1989): "Semiparametric Estimation of Index Coefficients," Econometrica, 57, 1403-1430.
RUUD, P. A. (1983): "Sufficient Conditions for the Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution in Multinomial Discrete Choice Models," Econometrica, 51, 225-228.
(1986): "Consistent Estimation of Limited Dependent Variable Models Despite Misspecification of Distribution," Journal of Econometrics, 32, 157-187.
SERFLING, R. J. (1980): Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.
SHERMAN, R. P. (1993): "The Limiting Distribution of the Maximum Rank Correlation Estimator," Econometrica, 61, 123-137.
SILVERMAN, B. W. (1986): Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.
SILVERMAN, B. W., AND M. C. JONES (1989): "E. Fix and J. L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation," International Statistical Review, 57, 233-247.
STOKER, T. M. (1986): "Consistent Estimation of Scaled Coefficients," Econometrica, 54, 1461-1481.
WHITE, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1-25.