On the Existence of $\sqrt{N}$-consistent Estimators of Linear Functionals in Non-Parametric IV Models∗

Laurent Davezies†

September 9, 2016

Abstract

This paper considers the issue of the existence of $\sqrt{N}$-consistent estimators of linear functionals of the form $\theta_0 = E(\psi(X) m^{(j)}(X))$ in a nonparametric IV model where $E(Y - m(X)|Z) = 0$. We derive a necessary condition for the existence of $\sqrt{N}$-consistent estimators and we derive the semi-parametric efficiency bound under this condition. This condition ensures that $\theta_0$ can be expressed as a moment of the form $E(q(Z)Y)$ and is not nested with the completeness condition. We then explore the implications of this necessary condition for the first stage, i.e. the distribution of $X|Z$. When the distribution of $X$ conditional on $Z$ belongs to an exponential family, or when the first stage has an additive structure, the condition that we exhibit turns out to be quite restrictive for estimating averages under counterfactual distributions. In that case, some parameters are never (respectively always) $\sqrt{N}$-estimable. But more surprisingly, in some cases, some parameters are $\sqrt{N}$-estimable if and only if some central cross-moments of $X$ and $Z$ are above a critical value. On the other hand, an additive first stage is a condition that enables the econometrician to identify average marginal effects, and ensures the finiteness of the semiparametric efficiency bound for a large class of DGP, even if the completeness condition does not hold.

Keywords: semi-parametric model, efficiency bound, instrumental variables, counterfactual, average marginal effect.

JEL classification numbers: C01, C02 and C14.

∗ I thank my colleagues Xavier d’Haultfœuille and Anna Simoni for many helpful discussions. I also thank Xiaohong Chen and Arthur Lewbel for their helpful comments and encouragement.
This work has also benefited from comments of participants of the 2015 Econometric Society World Congress, of the 2015 Annual Conference of the International Association for Applied Econometrics and of the CREST seminars. Last, I am indebted to Benjamin Walter for his careful reading. The usual disclaimer applies and all errors remain ours.

† CREST, [email protected].

1 Introduction

Models with endogeneity are central in econometrics. In a nonparametric setting with additively separable errors, the completeness condition (Newey & Powell, 2003) is necessary and sufficient to identify the structural function. But this condition is quite restrictive and cannot be tested without supplementary restrictions (Canay et al., 2013). However, the parameter of interest is often an expectation depending on the structural function (or on its derivatives) and not the full structural function itself. For instance, the parameter of interest could be an average marginal effect, an average effect under an exogenous change in the distribution of regressors, a linear functional considered by Severini & Tripathi (2012b) and Santos (2011), or a density weighted average derivative (Powell et al., 1989, Cattaneo et al., 2014). Moreover, we can hope that regular root-N consistent estimators exist for such parameters.

The motivation to focus on the existence of regular root-N estimators is related to some optimality results. Indeed, in sufficiently regular models where regular estimators exist, a convolution theorem and a local asymptotic minimax theorem hold for any subconvex loss function (cf. van der Vaart, 2000, Theorems 25.20 and 25.21). This means that the corresponding asymptotic risk of any regular estimator is bounded below. For the asymptotic mean square error, this lower bound is usually called the semiparametric efficiency bound (see for instance Newey, 1990b). The efficient estimator that achieves this bound leads to efficient tests (Choi et al., 1996).
This notion generalizes the Cramér-Rao bound originally introduced for parametric models. Not surprisingly, this notion is closely related to the Fisher information. In an iid data generating process, it measures the minimal number of observations needed to discriminate the true value of the parameter of interest from an alternative one, whatever the value of the nuisance parameters. Both statisticians (Bickel et al., 1993) and econometricians (Chamberlain, 1987, Newey, 1988, Newey, 1990a, Severini & Tripathi, 2012a) have derived those bounds for various models.

However, the existence of such regular estimators is not always ensured (van der Vaart, 2000, Theorem 25.32 and van der Vaart, 1991). To the best of our knowledge, for the case of linear functionals in a non-parametric instrumental variables model, the first paper to bring this issue to light is Ai & Chen (2007). Non-parametric instrumental variables have only been studied by econometricians (not by statisticians) and a large strand of the literature focuses on the non-parametric estimation of the regression function $m$ (for instance, Ai & Chen, 2003, Florens, 2003, Newey & Powell, 2003, Hall & Horowitz, 2005, Chen, 2007, Florens et al., 2011, Florens & Simoni, 2012, Florens et al., 2012, Darolles et al., 2011, Chen & Pouzo, 2012). But linear functionals of the regression function, which are often the parameter of interest for economists, have received attention more recently (Ai & Chen, 2007, Santos, 2011, Severini & Tripathi, 2012b, Ai & Chen, 2012, Carrasco et al., 2014, Chen & Pouzo, 2015, Breunig & Johannes, 2016). Ai & Chen (2007), Carrasco et al. (2014), Chen & Pouzo (2015) and Breunig & Johannes (2016) propose plug-in estimators based on a non-parametric first step under the completeness condition, but the root-N consistency of such estimators is far from ensured.
For instance, for the plug-in penalized sieve minimum distance estimator, the detailed discussion in Section 3 of Chen & Pouzo (2015) illustrates this negative result (see also the discussion of Assumption 4.1 in Ai & Chen (2007)). Here, we do not establish the properties of a specific estimator, but examine whether a regular root-N consistent estimator could exist.

The next section is brief and only introduces notation and the model. If $Y = m(X) + \varepsilon$ with $E(\varepsilon|Z) = 0$ and $F_{Y,X,Z}$ identified by the data, we show in Section 3 that parameters of the form $\theta_0 = E(\psi(X) m^{(j)}(X))$ are identifiable under weaker conditions than the completeness condition. We generalize a result of Severini & Tripathi (2012b) to a larger class of parameters including for instance the average marginal effect.

In the fourth section of this paper we give a necessary and sufficient condition for the existence of a finite semi-parametric efficiency bound for such parameters. This condition is necessary for the existence of a regular root-N estimator. This condition states that there exists $q(Z)$ such that $\theta_0 = E(q(Z)Y)$. Moreover, when such a condition holds, the expression of the semi-parametric efficiency bound suggests an efficient estimator based on an inverse problem because we also need to have $E(q(Z)|X) f_X(X) = (-1)^{|j|} [\psi(X) f_X(X)]^{(j)}$. A straightforward implication of our main result concerns the exogenous case: if there is no root-N estimator of $\theta_0$ based only on the outcome and on the covariates, supplementary instruments will not help to reach root-N consistency. Again, results obtained in this section generalize the previous results of Severini & Tripathi (2012b). Moreover, our expression of the semi-parametric efficiency bound is more explicit and shows that the estimator of Santos (2011) is not efficient. Our framework also generalizes those of Powell et al. (1989) or Newey & Stoker (1993).
Ai & Chen (2012) propose an alternative expression of the semi-parametric efficiency bound for an arbitrary sequence of identifying conditional moment restrictions, but assume the existence of a solution to a functional minimization¹. The existence of a finite semi-parametric efficiency bound (and hence of regular root-N estimators) is related to the existence of a Riesz representation of the Hadamard derivative of $\theta$ with respect to the square root of the measure of any parametric submodel (cf. Sections 25.3 to 25.5 of van der Vaart (2000)). We emphasize the necessary condition under which such a linear form exists, and we derive a simple expression for the efficiency bound in which this condition appears explicitly.

¹ cf. Equations (5) and (6) in Ai & Chen (2012).

A first part of our condition leads to a necessary condition of regularity at the boundary of the support of the regressors $X$. In the exogenous case (i.e. when $X = Z$), regularity at the boundary of the support has already been shown to be a sufficient condition in many papers such as Powell et al. (1989), and we show that we cannot relax it. A second part of our condition is more restrictive, especially in the endogenous case (i.e. when $X \neq Z$). The sufficient conditions² given in Ai & Chen (2007), Carrasco et al. (2007), Carrasco et al. (2014), Breunig & Johannes (2016) and Ai & Chen (2012) for the existence of a root-N consistent estimator and/or for the existence of a finite semi-parametric efficiency bound are compatible with the necessary ones that we establish in this paper. In Section 4.3, we also show that our expression of the semiparametric efficiency bound extends the formula given in Ai & Chen (2012) to a larger class of DGP (reflecting the discrepancy between their sufficient conditions and our necessary ones).
However, the second part of our necessary condition is also quite abstract and in the fifth section, we discuss it in two special cases: the exponential first stage and the additive first stage. We also give some examples to show that naive intuition could be misleading. To sum up, the main implications of our results are the following:

1. only "very regular" (in a sense detailed in this paper) linear functionals are root-N estimable, and in this case $\theta_0$ can be reformulated as a moment of $(Y, Z)$, but the completeness condition is not necessary,
2. the regularity conditions that we get are stronger in the endogenous case than in the exogenous case, and supplementary instruments will not help reach the root-N rate when the regressors are exogenous,
3. even for some simple data generating processes usually considered in textbooks, the mean of $Y$ under a simple counterfactual distribution of $X$ may or may not be root-N estimable, depending on the counterfactual distribution chosen,
4. under an additive first stage, the average marginal effect admits a finite semiparametric efficiency bound under easily interpretable and reasonable conditions.

² cf. the source conditions in Carrasco et al. (2007), Assumption 4.1 in Ai & Chen (2007), Assumptions 3.3 and 3.4 in Carrasco et al. (2014), Assumptions 1 and 2 and Remark 2.4 in Breunig & Johannes (2016), and the existence of a solution of Equation (5) in Ai & Chen (2012).

2 DGP and parameter

Let us consider the following instrumental regression:

$Y = m(X) + \varepsilon$ and $E(\varepsilon|Z) = 0$.

We assume there exists a dominating measure of $(Y, X, Z)$ and $\mathcal{X} := \mathrm{Supp}(X) \subset \mathbb{R}^K$, $\mathcal{Z} := \mathrm{Supp}(Z) \subset \mathbb{R}^l$, $\mathcal{Y} := \mathrm{Supp}(Y) \subset \mathbb{R}$. The function $m$ is a function from $\mathbb{R}^K$ to $\mathbb{R}$. For a $K$-dimensional multi-index $(j) = (j_1, ..., j_K)$, $f^{(j)}$ denotes the partial derivative $\partial^{|j|} f / \partial x_1^{j_1} \cdots \partial x_K^{j_K}$, with $|j| = j_1 + \dots + j_K$ (which is assumed to exist and not to depend on the order in which the differentiation is performed).
Let $D^{(j)}(X)$ denote the set of functions $f$ from $\mathbb{R}^K$ to $\mathbb{R}$ such that, with probability 1, $f^{(j)}(X)$ exists. We assume that $m \in D^{(j)}(X)$ and we are interested in the estimation of the linear functional $\theta_0 = E(\psi(X) m^{(j)}(X))$ with $\psi$ a function from $\mathbb{R}^K$ to $\mathbb{R}$ of the form $\psi(x) = h(f_X(x), x)$, with $h$ a known function differentiable with respect to the first variable and $f_X$ the identifiable density of $X$ with respect to a dominating measure. The average effect under a counterfactual distribution $\tilde f_X$ of regressors corresponds to $\theta_0 = E\left(\frac{\tilde f_X(X)}{f_X(X)} m(X)\right)$. The average marginal effect of $X_k$ corresponds to $E(\partial_{x_k} m(X))$.

To avoid functional dependence between the components of $X$, we assume that the product of the dominating measures for the distributions of the components of $X$ is a dominating measure. Moreover, because differentiating with respect to the component $x_k$ only makes sense for continuous variables, for $k$ such that $j_k > 0$, the dominating measure of $X_k$ is assumed to be the Lebesgue measure. We can now state a first minimal assumption, under which the model and the parameter considered are meaningful. In the rest of the paper, for any random variable $U$, $L^1(U)$ denotes the set of real valued and measurable functions $f$ on $\mathrm{Supp}(U)$ such that $E(|f(U)|) < \infty$.

Assumption 1 (Exclusion restriction and well-defined parameter) $Y = m(X) + \varepsilon$ and $\theta_0 = E(\psi(X) m^{(j)}(X))$ with
(i) $Y \in L^1(Y)$, $m(X) \in L^1(X) \cap D^{(j)}(X)$ and $E(\varepsilon|Z) = 0$,
(ii) $m \in \mathcal{M} \subset \{l \in L^1(X) \cap D^{(j)}(X) \text{ and } \psi l^{(j)} \in L^1(X)\}$,
(iii) $F_{Y,X,Z}$ is identified.

The set of functions $\mathcal{M}$ in the second condition of the previous assumption reflects the restrictions imposed by the model. For instance, Chen (2007) considers Hölder, Sobolev or Besov spaces, and Ai & Chen (2003), Blundell et al. (2007), Florens et al. (2012) impose functional restrictions like some form of additive separability or partial linearity on $m$. In many contexts, economic theory also imposes shape restrictions such as monotonicity and/or concavity of $m$.
Such restrictions can be embedded in the definition of $\mathcal{M}$. Note that $\mathcal{M}$ is not necessarily a linear space, especially in the case of shape restrictions.

Our framework generalizes those of Severini & Tripathi (2012b) and Santos (2011), who consider the case where $(j) = (0)$. Such a class of parameters includes the average of $Y$ under a counterfactual distribution of $X$. Indeed, consider $\psi(x) = \tilde f_X(x)/f_X(x)$ with $\tilde f_X$ a density function supported by $\mathcal{X}$. In such a case $\theta_0$ represents the average of $Y$ under the assumption that the distribution of the unobserved $\varepsilon$ is unchanged whereas the distribution of $X$ is changed into $\tilde f_X$. Our framework is also a generalization of the weighted average derivative studied among others by Powell et al. (1989) and Newey & Stoker (1993). Such cases correspond to $X = Z$, $(j) = (0, ..., 1, ..., 0)$. Powell et al. (1989) consider the case where $\psi(x) = f_X(x)$, whereas Newey & Stoker (1993) consider any differentiable $\psi$ and include the case of the average marginal effect $\theta_0 = E(\partial_{x_k} m(X))$. Whereas Severini & Tripathi (2012b) emphasize the difficulty resulting from the endogeneity of $X$, they do not allow $\theta_0$ to be a linear functional of the derivative of $m$. On the other hand, Powell et al. (1989) and Newey & Stoker (1993) focus on the average of the derivative of $m$ but under an exogeneity assumption on $X$. We now turn to the identification of $\theta_0$ under Assumption 1.

3 Identification

Before we present our result on identification, let us introduce some notation that will be used in the rest of the paper. For any nonnegative number $p$ and for any random variable $U$, $L^p(U)$ denotes the set of measurable functions $f$ defined on $\mathrm{Supp}(U)$ such that $E(|f(U)|^p) < \infty$. If $p = 0$, $L^0(U)$ is simply the set of measurable functions defined on the support of $U$. $L^\infty(U)$ is the set of bounded and measurable functions of $U$. For $p \geq 1$, $L^p_0(U)$ denotes the space $\{f \in L^p(U) \text{ s.t. } E(f(U)) = 0\}$ and $L^p_0(U|V)$ the space $\{f \in L^p(U, V) \text{ s.t. } E(f(U, V)|V) = 0\}$.
For $A \subset L^0(X)$, $A^\perp$ denotes the set:

$A^\perp = \{h \in L^0(X) : \forall g \in A,\ hg \in L^1_0(X)\}$.

Because the model relies on the nullity of the conditional expectation of $\varepsilon$, the condition of identification depends on the following operator:

$T : L^1(X) \to L^1(Z)$, $\ (x \mapsto g(x)) \mapsto (z \mapsto E(g(X)|Z = z))$.

$\mathrm{Ker}(T)$ denotes the set of functions $l \in L^1(X)$ such that $T(l) = 0$.

Theorem 3.1 (Identification)
(i) The identification region of $\theta_0$ is $\{\theta_0 + E(\psi l^{(j)}) : l \in (\mathcal{M} - \{m\}) \cap \mathrm{Ker}(T)\}$. So if $\mathcal{M}$ is a linear space, the identification region of $\theta_0$ is either $\{\theta_0\}$ (in which case $\theta_0$ is identified) or $\mathbb{R}$ (in which case the data are uninformative about $\theta_0$).
(ii) $\theta_0$ is identified if and only if $\psi \in \{l^{(j)} \text{ s.t. } l \in (\mathcal{M} - \{m\}) \cap \mathrm{Ker}(T)\}^\perp$.

When the completeness condition holds (i.e. when $\mathrm{Ker}(T) = \{0\}$), $m$ is identified and so is $\theta_0$ (Newey & Powell (2003)). Moreover, Canay et al. (2013) have shown that completeness is not testable without supplementary restrictions. So achieving identification of the parameter of interest under weaker conditions is attractive. The previous result generalizes those of Severini & Tripathi (2012b): if $j = 0$ and $\psi \in L^0(X)$, $\theta_0$ is identified as soon as for any $l \in \mathrm{Ker}(T)$ such that $E(|\psi(X) l(X)|) < \infty$, one has $E(\psi(X) l(X)) = 0$. The following examples illustrate the fact that without completeness some parameters of interest could be identified.

Example: A simple scale model on the first stage. Let $X = Z \cdot \eta$, with $Z \perp\!\!\!\perp \eta$, $\eta \sim U_{[-1;1]}$, $\mathcal{Z}$ an interval of $\mathbb{R}^+$ and $E(|Z|) < \infty$. In that case, $l \in \mathrm{Ker}(T)$ if and only if for any $z \in \mathcal{Z}$, $\int_{-1}^{1} l(zu)\,du = \int_{-z}^{z} l(u)\,du / z = 0$. Differentiating with respect to $z$ gives $(l(z) + l(-z))/z - \int_{-z}^{z} l(u)\,du / z^2 = 0$, which implies that $l$ is an odd function. Conversely, if $l$ is an odd function, then it belongs to $\mathrm{Ker}(T)$. From Theorem 3.1.(ii), it follows that the average effect under the counterfactual distribution $\tilde f_X$ (i.e. $\theta_0 = E\left(\frac{\tilde f_X(X)}{f_X(X)} m(X)\right)$) is identifiable if and only if $\tilde f_X$ is even.
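The kernel characterization in the scale-model example can be checked numerically. The following is a minimal sketch, not part of the original argument: it approximates $(Tl)(z) = E(l(z\eta)) = \frac{1}{2}\int_{-1}^{1} l(zu)\,du$ with a trapezoidal rule and compares an odd and an even function. The helper name `T_scale`, the grid size, the values of $z$ and the two test functions are illustrative assumptions.

```python
import numpy as np

def T_scale(l, z, n_grid=20001):
    """Approximate (T l)(z) = E[l(z * eta)] for eta ~ U[-1, 1].

    Hypothetical helper (not from the paper): trapezoidal rule on a
    symmetric grid of [-1, 1], multiplied by the uniform density 1/2.
    """
    u = np.linspace(-1.0, 1.0, n_grid)
    vals = l(z * u)
    du = u[1] - u[0]
    integral = du * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
    return 0.5 * integral

# Illustrative points of the support of Z (assumed, any positive values work).
z_values = [0.5, 1.0, 2.0, 3.5]

odd_l = lambda x: x**3 - 2.0 * np.sin(x)   # odd function: expected to lie in Ker(T)
even_l = lambda x: x**2                    # even function: E[(z * eta)^2] = z^2 / 3

odd_images = [T_scale(odd_l, z) for z in z_values]
even_images = [T_scale(even_l, z) for z in z_values]
```

In this sketch `odd_images` is numerically zero at every $z$, while `even_images` matches $z^2/3$, in line with the claim that $\mathrm{Ker}(T)$ consists of odd functions in this example.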
On the other hand, the average marginal effect is not identifiable because $\psi(x) = 1$ is even whereas $\{l' \text{ s.t. } l \in \mathrm{Ker}(T)\}^\perp$ is the set of odd functions on $[-1; 1]$.

Example: Average marginal effect under an additive first stage. We will see in the last section of this paper that an additive first stage implies a regularity of the DGP that ensures the finiteness of the semi-parametric efficiency bound for the average marginal effect under mild conditions. Here, we discuss the identification of such a parameter.

Proposition 3.2 (Identification of the average marginal effect) Let $Y = m(X) + \varepsilon$ with $Y$, $\varepsilon$ and $X$ real random variables such that Assumption 1 holds, and assume that $X = p(Z) + \eta$ with $Z \perp\!\!\!\perp \eta$. Then $\theta_0 = E(m'(X))$ is identified if:

$\mathcal{M} \subset \{l \in L^1(X),\ l \in D^1(X),\ \exists \mathcal{P} \text{ s.t. } P(p(Z) \in \mathcal{P}) = 1 \text{ and } \forall p \in \mathcal{P},\ \partial_p E(l(p + \eta)) = E(l'(p + \eta))\}$.

The identification argument relies on the fact that differentiation with respect to $p \in \mathrm{Supp}(p(Z))$ and integration with respect to $F_\eta$ commute. This holds if, for instance, $\mathcal{M} = \{l \in L^1(X),\ l \in D^1(X) \text{ and } |l'(x)| \leq \sum_{s=0}^{S} a_s x^s\}$ and $E(|\eta|^S) < \infty$. The derivative of $m$ is assumed to be bounded by a polynomial of fixed degree $S$ and, for this given $S$, the corresponding moments of $\eta$ are well defined. If the derivative of $m$ is assumed to be bounded by a constant, this condition always holds. It is not difficult to find some DGP where such a condition holds but where the completeness condition is not fulfilled. Consider for instance that $P(p(Z) \in \mathrm{int}\,\mathrm{Supp}(p(Z))) = 1$ and $\eta = \nu + u$ with $\nu$ and $u$ two independent variables, $u$ being uniform on an interval $I$ not reduced to a single point. Then $E(m'(X))$ is identified whereas neither the completeness nor the bounded completeness condition holds (d’Haultfœuille, 2011). The result holds even if $\nu = 0$ almost surely, and whatever the distribution of $Z$. The proof of Proposition 3.2 is given in the Appendix, but we present here an informal sketch of it.
Under Condition (i) or (ii) of Proposition 3.2, for any $p \in \mathrm{int}(\mathrm{Supp}(p(Z)))$ we have the following equality:

$\frac{\partial}{\partial p} E(l(X)|p(Z) = p) = E(l'(X)|p(Z) = p)$.

For $l \in \mathrm{Ker}(T)$ the left hand side is null and so is the right hand side. Theorem 3.1.(ii) then ensures the identification of $\theta_0$. Moreover, the identification strategy suggests the estimator $\frac{1}{n}\sum_{i=1}^{n} \hat g(Z_i)$, with $\hat g(z)$ an estimator of $\frac{\partial}{\partial p} E(Y|p(Z) = p)$ at $p = p(z)$, because $\frac{\partial}{\partial p} E(Y|p(Z) = p) = E(m'(X)|p(Z) = p)$. When $p(\cdot)$ admits a non-vanishing derivative with respect to the component $z_l$, the identification argument can be restated as a kind of Wald ratio:

$E(m'(X)) = E(E(m'(X)|Z = z)) = E\left[ E\left( \frac{\partial_{z_l} E(Y|Z = z)}{\partial_{z_l} p(z)} \,\Big|\, Z = z \right) \right]$.

Moreover, when more than one instrument $z_l$ satisfies $\partial_{z_l} p(z) \neq 0$, the average marginal effect is overidentified (cf. Carrasco et al. (2007) for a definition of overidentification in non-parametric or semi-parametric linear inverse problems).

In both previous examples, we see that the completeness condition is a sufficient but not a necessary condition for the identification of $\theta_0$. Under the conditions of identification given in Theorem 3.1, we will now examine the issue of the existence of a root-N consistent estimator of $\theta_0$.

4 Existence of root-N regular estimator

In our context of non-parametric regression and under our identification condition for $\theta_0$, it is natural to ask whether the parameter of interest admits a finite semi-parametric efficiency bound. Heuristically, we know that the optimal rate of convergence of estimators of $m^{(j)}$ is always lower than $\sqrt{N}$ and in some cases far lower. But because $\theta$ is an "average" of $m^{(j)}$, the optimal rate of convergence of estimators of $\theta$ could be higher and sometimes reaches the $\sqrt{N}$-rate. In that case, there is room to build an efficient estimator with respect to any subconvex loss function (van der Vaart (2000), Theorem 25.21), uniformly most powerful tests and asymptotically uniformly most accurate confidence sets (see Choi et al.
(1996) for a discussion about asymptotically uniformly most powerful tests in semiparametric models).

The case $j = 0$, $\mathcal{M} = L^2(X)$ and $\psi \in L^2(X)$ has been studied by Severini & Tripathi (2012b). If there exists $q \in L^1(Z)$ such that $\psi(X) = E(q(Z)|X)$, then for any $l \in L^2(X) \cap \mathrm{Ker}(T)$, $E(\psi(X) l(X)) = E(q(Z) l(X)) = 0$ and the condition for identification of $\theta_0$ holds. Moreover, in this case, the parameter can be rewritten as $\theta_0 = E(Y q(Z))$ and estimated by the average of the $Y_i \hat q(Z_i)$, with $\hat q$ a consistent estimator of $q$. Intuitively, if $\hat q$ is sufficiently "regular" the estimator will behave as a sample mean, i.e. the corresponding estimator of $\theta_0$ will be regular and root-N consistent. Severini & Tripathi (2012b) have shown that in such a case, the existence of such a $q$ is a necessary condition for the existence of a regular root-N estimator of $\theta_0$.

In this section we derive the same kind of necessary condition in a more general framework to ensure the existence of regular $\sqrt{N}$-consistent estimators of $\theta_0$, and we derive an expression of the semi-parametric efficiency bound of $\theta_0$. For the sake of simplicity, we now assume that $Y$, $m(X)$, $m^{(j)}(X)$ and $\psi(X)$ all admit finite variances. Consequently we now work in the Hilbert space $L^2(Y, X, Z)$. $T$ is now considered as an operator from $L^2(X)$ into $L^2(Z)$ and $T^*$ is its adjoint operator:

$T^* : L^2(Z) \to L^2(X)$, $\ (z \mapsto g(z)) \mapsto (x \mapsto E(g(Z)|X = x))$.

4.1 Regularity conditions

We will assume some supplementary regularity conditions. Let $X^+$ (respectively $X^0$) denote the subset of the variables $X_k$ such that $j_k > 0$ (respectively $j_k = 0$) and $K^+$ (respectively $K^0$) the number of such variables. We assume without loss of generality that the variables $X$ are ordered such that $X = (X^+, X^0)$. Remember that $\psi(x) = h(f_X(x), x)$. Because $f_X$ is unknown but identifiable in the data, the existence of a root-N consistent estimator is related to regularity conditions on $u \mapsto h(u, x)$.
Intuitively, the larger $\mathcal{M}$ is, the stronger the conditions needed to ensure the existence of a root-N consistent estimator. If $\mathcal{M}$ is a finite dimensional space (for instance, if $m$ is assumed to be linear) and if the identification condition for $\theta_0$ holds, GMM estimators are root-N consistent (as soon as $Y$ and $m(X)$ admit finite variances). To avoid such degenerate cases, we have to impose that $\mathcal{M}$ is sufficiently large, and a priori the larger $\mathcal{M}$ is, the stronger the necessary conditions for finiteness of the semiparametric efficiency bound. However, we show that the necessary and sufficient condition holds as soon as $\mathcal{M}$ contains $m + t\varphi$ for any $\varphi \in \Phi$ and $t$ sufficiently close to zero, where $\Phi$ is the set of Schwartz test functions (Schwartz, 1997):

$\Phi = \left\{ \varphi(x) = \prod_{k=1}^{K} \varphi_k(x_k) \text{ with } \varphi_k \text{ infinitely differentiable with compact support} \right\}$.

The statistical model is parameterized by $(m, f_{\varepsilon,X,Z})$. Consider a subset of parameters $(m_t, f_{\varepsilon,X,Z,t})$ indexed by $t \in \mathcal{T}$, a neighborhood of 0, such that $(m, f_{\varepsilon,X,Z}) = (m_0, f_{\varepsilon,X,Z,0})$. This subset of parameters generates a submodel indexed by $t$: $f_{Y,X,Z,t}(y, x, z) = f_{\varepsilon,X,Z,t}(y - m_t(x), x, z)$. This submodel is differentiable in quadratic mean if there exists a score $s \in L^2(Y, X, Z)$ such that:

$\lim_{t \to 0} \int \left[ \frac{1}{t}\left( f_{Y,X,Z,t}^{1/2} - f_{Y,X,Z}^{1/2} \right) - \frac{1}{2} s f_{Y,X,Z}^{1/2} \right]^2 d\nu(y, x, z) = 0$.

Differentiability in quadratic mean is a regularity condition that ensures the efficiency of maximum likelihood in a parametric framework (van der Vaart, 2000, Chapters 7 and 8). This regularity condition also allows the notion of optimality to be extended to a semiparametric framework (see for instance Bickel et al., 1993, van der Vaart, 2000, Chapter 25, or Severini & Tripathi, 2012a, Section 2.3). A set of score functions $s$ is a tangent set. The semi-parametric efficiency bound of $\theta_0$ is always defined relative to a given tangent set.
And this bound is only well defined for a parameter and a tangent set such that $t \mapsto \theta(t) = t^{-1}\left( \int \psi(x) m_t^{(j)}(x) f_{\varepsilon,X,Z,t}(e, x, z)\,d\nu - \theta_0 \right)$ tends to a continuous linear form on the given tangent set when $t \to 0$. Without any regularity condition, the largest tangent set with respect to which $\theta(t)$ is differentiable is not easily characterizable. But under the following regularity conditions, $\theta(t)$ is differentiable with respect to the maximal tangent set, i.e. the set of all square integrable functions with respect to $f_{Y,X,Z}$.

Assumption 2 (Regularity Assumption)
(i) $Y \in L^2(Y)$, $\psi(X) \in L^2(X)$ and $E(\varepsilon|Z) = 0$,
(ii) $m \in \mathcal{M} \subset \{l \in L^2(X) \cap D^{(j)}(X),\ l^{(j)} \in L^2(X) \text{ and } \psi l^{(j)} \in L^2(X)\}$,
(iii) The statistical model is dominated by $\nu(y, x, z) = \lambda(y) \otimes \mu(x, z)$ with $\lambda$ the Lebesgue measure,
(iv) $h$ admits a derivative with respect to the first component $X$-almost-surely and $\partial_{f_X}\psi(X) := \left.\frac{\partial h(u, x)}{\partial u}\right|_{u = f_X(X)}$ is in $L^2(X)$,
(v) There exist non-negative $\underline{v}$ and $\overline{v}$ bounding the skedastic function: $E((Y - m(X))^2|Z) \in [\underline{v}; \overline{v}]$,
(vi) When $X^+ \neq \emptyset$, $X^+|X^0$ admits a density $f_{X^+|X^0}$ with respect to the Lebesgue measure on $\mathbb{R}^{K^+}$, $X^0$-almost-surely,
(vii) For $t$ sufficiently close to 0, $m - t \in \mathcal{M}$ and $e \mapsto f_{\varepsilon,Z,X}^{1/2}(e, Z, X)$ is absolutely continuous $(Z, X)$-almost-surely, $\frac{\partial_\varepsilon f_{\varepsilon,X,Z}(\varepsilon, X, Z)}{f_{\varepsilon,X,Z}(\varepsilon, X, Z)} \in L^2(\varepsilon, X, Z)$ (i.e. the Fisher information for location of $\varepsilon$ is finite) and $\varepsilon \frac{\partial_\varepsilon f_{\varepsilon,X,Z}(\varepsilon, X, Z)}{f_{\varepsilon,X,Z}(\varepsilon, X, Z)} \in L^2(\varepsilon, X, Z)$,
(viii) $\forall \varphi \in \Phi$, $m + t\varphi \in \mathcal{M}$ for $t$ sufficiently close to 0,
(ix) For $t$ sufficiently close to 0, $(1 + t)m \in \mathcal{M}$ and $m(X) \frac{\partial_\varepsilon f_{\varepsilon,X,Z}(\varepsilon, X, Z)}{f_{\varepsilon,X,Z}(\varepsilon, X, Z)} \in L^2(\varepsilon, X, Z)$.

Conditions (i) and (ii) in Assumption 2 reinforce the conditions of Assumption 1, assuming that all considered variables admit finite variances. In particular, $\psi(X) m^{(j)}(X) \in L^2(X)$ ensures that $\theta(t)$ admits a derivative at $t = 0$ which is a continuous linear form on the maximal tangent set of scores. Assumption 2.(iii) is very common and natural in such a context.
Assumption 2.(iv) is made to get a tractable formula for the semiparametric efficiency bound of $\theta_0$ when $\psi(X)$ depends on $f_X(X)$. We are not aware of interesting parameters of the form $E(h(f_X(X), X) m^{(j)}(X))$ with $h$ not differentiable with respect to the first component. For instance, $h(u, x) = u$ for the density-weighted average and $h(u, x) = \tilde f(x)/u$ for the average under the counterfactual distribution $\tilde f$ are both differentiable with respect to $u$. Assumption 2.(v) ensures the equivalence between $q(Z) \in L^2(Z)$ and $q(Z)\varepsilon \in L^2(Z, \varepsilon)$, which considerably simplifies the proof. Such an assumption also ensures that every score (corresponding to a parametric sub-model) can be approximated by a bounded score (of another parametric sub-model). Indeed, in one step of our proof we establish some results for parametric sub-models with bounded scores, and in another step we extend these results to parametric sub-models with unbounded scores. Even if such a condition may not be minimal to establish our result, we follow Severini & Tripathi (2012b), who also assume this quite intuitive and simple condition.

Concerning Assumption 2.(vi), remember that we have previously assumed that each component of $X^+$ is dominated by the Lebesgue measure on $\mathbb{R}$. Assumption 2.(vi) ensures moreover that no component of $X^+$ is functionally determined by other components of $X$. This is a very usual assumption in econometrics, ensuring that the data allow one to identify the effect of varying one component of $X$ while keeping the other components constant. In Assumption 2.(vii), absolute continuity of $e \mapsto f_{\varepsilon,Z,X}^{1/2}(e, Z, X)$ ensures that $\partial_\varepsilon f_{\varepsilon,X,Z}$ is well defined (De La Vallée Poussin (1915), Section 7.40.2) and hence that $E\left[\left(\frac{\partial_\varepsilon f_{\varepsilon,X,Z}(\varepsilon, X, Z)}{f_{\varepsilon,X,Z}(\varepsilon, X, Z)}\right)^2\right]$ is also well defined. Assumptions 2.(vii) and 2.(iii) ensure that the parametric sub-model of location $\{(m(\cdot) - t, f_{\varepsilon,X,Z}),\ t \in \mathcal{T}\}$ is differentiable in quadratic mean.
These two assumptions also ensure that any score (corresponding to any sub-model differentiable in quadratic mean) can be decomposed into two components: one related to the nuisance parameter $m$ and the other one related to $f_{\varepsilon,X,Z}$. Assumption 2.(viii) enables us to consider a tangent set sufficiently rich to provide non-trivial restrictions on the DGP to get a finite semi-parametric efficiency bound. In particular, this assumption rules out the case where the support of $X$ is not finite and where $m$ is known up to a finite parameter, corresponding to the usual two-step GMM. Last, if Assumption 2.(ix) also holds then the parameter $\theta_0$ needs to be a moment of $(Y, Z)$ to ensure that the semi-parametric efficiency bound is finite.

Assumptions 2.(vii) to 2.(ix) impose that $\mathcal{M}$ is not degenerate. But they are compatible with many regularity restrictions used to define $\mathcal{M}$, like those considered by Chen (2007) (Hölder, Sobolev or Besov spaces). Concerning shape restrictions such as monotonicity and/or concavity, Assumption 2 imposes that $m$ is strictly monotonic and/or strictly concave, meaning that $m$ belongs to the interior of the constraints that define $\mathcal{M}$. When $m$ is at the boundary of $\mathcal{M}$, the model becomes irregular and this could help to get a higher rate of convergence³, such as in Chetverikov & Wilhelm (2015).

4.2 Main result

The key necessary condition for the existence of a root-N regular estimator of $\theta_0$ is given by the following assumption.

Assumption 3
(i) With probability 1, the weak derivative of $x^+ \mapsto \psi(x^+, X^0) f_X(x^+, X^0)$ is a function on $\mathbb{R}^{K^+}$. This function is denoted $(\psi f_X)^{(j)}$.
(ii) $\mathcal{Q} := \{q \in L^2(Z) \text{ s.t. } T^*(q) f_X = (-1)^{|j|} (\psi f_X)^{(j)}\} \neq \emptyset$.

Assumption 3.(i) is related to the distribution theory developed by Schwartz (1997). For $k = 1, ..., K^+$, let $\varphi_k$ be an infinitely differentiable function with compact support from $\mathbb{R}$ to $\mathbb{R}$.
As $x^+ \mapsto \psi(x^+, X^0) f_X(x^+, X^0)$ is locally integrable, the quantity $(-1)^{|j|} \int \psi(x^+, X^0) f_X(x^+, X^0) \prod_{k=1}^{K^+} \varphi_k^{(j_k)}(x_k^+)\,dx_1^+ \dots dx_{K^+}^+$ is well defined for any $(\varphi_1, ..., \varphi_{K^+})$. This mapping defines the weak derivative of order $(j)$ of $\psi(x^+, X^0) f_X(x^+, X^0)$. This weak derivative is a function (up to an isomorphism) denoted $(\psi f_X)^{(j)}$ if for any $(\varphi_1, ..., \varphi_{K^+})$:

$\int (\psi f_X)^{(j)}(x^+, X^0) \prod_{k=1}^{K^+} \varphi_k(x_k^+)\,dx_1^+ \dots dx_{K^+}^+ = (-1)^{|j|} \int \psi(x^+, X^0) f_X(x^+, X^0) \prod_{k=1}^{K^+} \varphi_k^{(j_k)}(x_k^+)\,dx_1^+ \dots dx_{K^+}^+$.

When $x^+ \mapsto \psi(x^+, X^0) f_X(x^+, X^0)$ is continuously differentiable up to order $(j)$, successive integrations by parts ensure that $(\psi f_X)^{(j)}$ coincides with the usual derivative of order $(j)$ of $\psi f_X$. Such a restriction will be discussed briefly in the next section.

³ but the rate of convergence becomes specific to the loss function and risk considered.

The second condition in the previous assumption is the strongest and the most abstract one. To get the intuition about the role of this condition, we can consider the particular case where $(j) = (0)$. In that case Assumption 3.(i) automatically holds and Assumption 3.(ii) is equivalent to the existence of $q \in L^2(Z)$ such that $E(q(Z)|X) = \psi(X)$. Under the completeness condition (i.e. $\mathrm{Ker}(T) = \{0\}$), $m$ is identified and so is $E(\psi m)$ for every $\psi$. The completeness condition is equivalent to $\mathrm{cl}(R(T^*)) = L^2(X)$. In the favorable case $\psi$ is in $R(T^*)$ and so there exists $q$ such that $\theta_0 = E(\psi(X) m(X)) = E(q(Z) m(X)) = E(q(Z)Y)$. But in other cases $\psi$ is only in $\mathrm{cl}(R(T^*)) \setminus R(T^*)$. In this latter case, $\theta_0$ is only poorly identified and cannot be expressed as a known moment of the form $E(q(Z)Y)$ but only approximated by such moments. Such a situation is not specific to non-parametric instrumental variables and has been formalized in a wider framework by van der Vaart (see van der Vaart, 1991, for a detailed discussion, or van der Vaart, 2000, Theorem 25.32, for a more synthetic one).
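The range condition in the case $(j) = (0)$ can be made concrete with a toy example. The sketch below assumes finitely supported $X$ (three values) and $Z$ (two values), so that $T^*$ reduces to a matrix of conditional probabilities and completeness fails because $X$ has more support points than $Z$. The joint probabilities, the function $m$, the helper name `solve_q` and both choices of $\psi$ are hypothetical; the finite-support setting is chosen only for illustration and is not the setting of the formal results.

```python
import numpy as np

# Hypothetical joint pmf of (X, Z): rows index 3 values of X, columns 2 values of Z.
p_xz = np.array([[0.20, 0.10],
                 [0.15, 0.15],
                 [0.10, 0.30]])
p_x = p_xz.sum(axis=1)

# Matrix of T*: (T* q)(x) = E[q(Z) | X = x] = sum_z q(z) P(Z = z | X = x).
T_star = p_xz / p_x[:, None]

def solve_q(psi):
    """Least-squares solution of T* q = psi, with the norm of the residual."""
    q, *_ = np.linalg.lstsq(T_star, psi, rcond=None)
    residual = np.linalg.norm(T_star @ q - psi)
    return q, residual

m = np.array([1.0, -2.0, 0.5])  # arbitrary structural function m(x), assumed

# Case 1: psi constructed to lie in the range of T*.
q_true = np.array([2.0, -1.0])
psi_in = T_star @ q_true
q_hat, res_in = solve_q(psi_in)
theta_from_psi = np.sum(p_x * psi_in * m)                   # E[psi(X) m(X)]
theta_from_q = np.sum(p_xz * m[:, None] * q_hat[None, :])   # E[q(Z) m(X)]

# Case 2: a psi outside the range of T*, so no exact q exists.
psi_out = np.array([1.0, 0.0, 0.0])
_, res_out = solve_q(psi_out)
```

In the first case the residual is numerically zero and the two expressions of $\theta_0$ coincide, which mirrors the rewriting $E(\psi(X) m(X)) = E(q(Z) m(X))$. In the second case the residual is strictly positive, so no $q$ represents $\psi$ even approximately, since in finite dimension the range of $T^*$ is closed.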
In the general case (i.e. $(j) \neq (0)$), if Assumption 2 holds then Assumption 3 is necessary to ensure the existence of a root-N regular estimator, and in that case there exists $q \in L^2(Z)$ such that θ0 = E(q(Z)Y). This is the main theoretical result of this paper, established in the following Theorem.

Theorem 4.1 (Semi-parametric efficiency bounds) If Assumption 1 and Assumptions 2.(i) to 2.(viii) hold and if θ0 is identified, a regular root-N estimator of θ0 exists only if Assumption 3 holds, and in that case there exists $q^\star \in Q$ such that the semiparametric efficiency bound $V^\star$ satisfies:
$$V^\star = V[\varepsilon q^\star(Z) + R(X)] = \min_{q \in Q} V[\varepsilon q(Z) + R(X)],$$
with
$$R(X) = \big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X) - E\big[\big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X)\big].$$
If moreover Assumption 2.(ix) holds, a regular root-N estimator of θ0 exists only if $\theta_0 = E(q^\star(Z)Y)$.

Theorem 4.1 states that Assumption 3 is necessary to ensure the existence of a finite semiparametric efficiency bound. It also states that the bound is the minimum over Q of a variance and is achieved for a given $q^\star \in Q$. The existence of a regular root-N consistent estimator depends on two conditions. Assumption 3.(i) concerns the regularity of $\psi f_X$ and will be briefly discussed in the next section. Assumption 3.(ii) is more abstract and restrictive and is crucial to ensure finiteness of the semiparametric efficiency bound. Moreover in that case, θ0 is a moment of (Y, Z) only (unless $(1 + t)m \notin M$). So, Theorem 4.1 states that a root-N regular consistent estimator exists only if we are able to "rule out" the endogenous variable X in the definition of θ0. Intuition suggests that such a situation is very specific, so we will discuss Assumption 3.(ii) more extensively. In particular, we will focus on two special cases widely considered by theoretical and applied econometricians: the exponential first stage (i.e. X|Z belongs to an exponential family of distributions) and the additive first stage (i.e.
X is the sum of a function of Z and a term η independent of Z but potentially correlated with ε).

Another interesting remark about Theorem 4.1 concerns the existence of supplementary instruments in the exogenous case⁴. In the exogenous case, we have Z = X and then the semi-parametric efficiency bound of θ0 is not finite if and only if $E\left[\left(\frac{[\psi f_X]^{(j)}}{f_X}\right)^2\right] = +\infty$ (cf. Assumption 3). In that case supplementary instruments $Z^*$ (i.e. the situation where $Z = (X, Z^*)$) will not help achieve a parametric rate of convergence (because if Assumption 3.(ii) holds then $E\left[\left(\frac{[\psi f_X]^{(j)}}{f_X}\right)^2\right] < +\infty$). On the other hand, in the endogenous case (i.e. when $X \not\subset Z$), supplementary instruments could ensure the validity of Assumption 3.(ii). Last, in all cases (endogenous or exogenous), if Assumption 3 holds for a given set of instruments it will still hold if supplementary instruments are used.

⁴ We are grateful to Arthur Lewbel for suggesting to look at this question.

4.3 Alternative formulation of the semi-parametric efficiency bound

Assumption 3.(ii) and the expression of the semiparametric efficiency bound have alternative formulations that make more transparent the link with the two-stage least squares estimators used in a parametric framework. Let us define the two adjoint operators:
$$K : L^2(X) \to L^2(Z), \quad f \mapsto \frac{1}{\sqrt{E(\varepsilon^2|Z)}}\, T(f)$$
and
$$K^* : L^2(Z) \to L^2(X), \quad g \mapsto T^*\left(\frac{g}{\sqrt{E(\varepsilon^2|Z)}}\right).$$
If Assumption 2.2 holds, the skedastic function $z \mapsto E(\varepsilon^2|Z = z)$ is assumed to be bounded and bounded away from 0 and so $R(T^*) = R(K^*)$, $R(T) = R(K)$, $R(TT^*) = R(KK^*)$ and $R(T^*T) = R(K^*K)$. For any linear and bounded operator A from a Hilbert space $H_1$ to a Hilbert space $H_2$, we can define the generalized inverse $A^\dagger$ of A on R(A) by: $A^\dagger(y) = u$ if and only if $u = \arg\min_{v \in H_1 : A(v) = y} \|v\|^2_{H_1}$.

Proposition 4.2 If Assumptions 1 to 3 hold and if θ0 is identified, then the semi-parametric efficiency bound of θ0 is equal to:
(i)
$$V^* = E\left[\left(K^{*\dagger}\left((-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X} + K^*\frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)\right)^2\right] + E\left[\left(R(X) - \varepsilon\frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)}\right)^2\right],$$
(ii) For any $q \in Q$,
$$V^* = E\left[\left(K^{*\dagger}K^*\left(q(Z) + \frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)\right)^2\right] + E\left[\left(R(X) - \varepsilon\frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)}\right)^2\right].$$
If and only if $(-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X} + K^*\frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}} \in R(K^*K)$, these expressions reduce to:
(iii)
$$V^* = E\left[\left((-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X} + K^*\frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)(K^*K)^\dagger\left((-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X} + K^*\frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)\right] + E\left[\left(R(X) - \varepsilon\frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)}\right)^2\right],$$
(iv) For any $q \in Q$,
$$V^* = E\left[\left(q(Z) + \frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)K(K^*K)^\dagger K^*\left(q(Z) + \frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right)\right] + E\left[\left(R(X) - \varepsilon\frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)}\right)^2\right].$$

Formula (iii) has already been established by Ai & Chen (2012), but it is valid only if $(-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X} + K^*\frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}} \in R(K^*K)$, a stronger condition than Assumption 3.(ii). Under Assumption 3.(ii), this condition is equivalent to the fact that $K^{*\dagger}\left((-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X}\right) + \frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}} \in R(K) + \mathrm{Ker}(K^*)$, or equivalently $P_{\mathrm{cl}(R(K))}\left(K^{*\dagger}\left((-1)^{|j|}\frac{[\psi f_X]^{(j)}}{f_X}\right) + \frac{E(\varepsilon R(X)|Z)}{\sqrt{E(\varepsilon^2|Z)}}\right) \in R(K)$, where $P_{\mathrm{cl}(R(K))}$ is the orthogonal projection on the closure of R(K). In a non-parametric context, the assumption $R(K) = \mathrm{cl}(R(K))$ (or equivalently $R(T) = \mathrm{cl}(R(T))$) is a restrictive condition that will be discussed below in Proposition 5.2. Hence, in general, the orthogonal projection of an element on $\mathrm{cl}(R(K))$ does not belong to R(K).

5 When does Assumption 3 hold?

We will discuss Assumption 3.(i) and Assumption 3.(ii) separately. Indeed, the first assumption is only a regularity condition on the functions ψ and $f_X$. For a given theoretical DGP and function ψ, it is not difficult to determine whether it holds or not. We will only emphasize that the condition holds only if some regularity conditions at the boundary of the support of X hold, and discuss the role of this condition in the fact that θ0 is a moment of (Z, Y) under Assumption 2.(ix).
On the contrary, Assumption 3.(ii) involves the adjoint operator $T^*$ and is quite abstract at this stage, so it is not easy to know whether it holds for a given DGP and a given function ψ. So, the largest part of the following will concern Assumption 3.(ii).

5.1 Discussion of Assumption 3.(i)

Proposition 5.1 (Regularity condition on $\psi f_X$) If Assumption 3.(i) holds then the function $x \mapsto \psi(x) f_X(x)$ admits a derivative of order $(j') \leq (j)$ on a set $D_{(j')}$ such that $P(X \in D_{(j')}) = 1$. For any k = 1, ..., K such that $0 \leq j'_k < j_k$, with probability one, $x_k \mapsto (\psi f_X)^{(j')}(X_1, ..., X_{k-1}, x_k, X_{k+1}, ..., X_K)$ (is almost-everywhere equal to a function that) vanishes at the boundary of the support of $X_k | X_1, ..., X_{k-1}, X_{k+1}, ..., X_K$. If $x \mapsto \psi(x) f_X(x)$ is (j) times continuously differentiable on $\mathbb{R}^K$ then Assumption 3.(i) holds.

The previous proposition ensures that the weak derivative defined in Assumption 3.(i) is equal to a usual derivative (up to a negligible set). Moreover, this derivative needs to vanish at the boundary of the support of $\psi f_X$. When (j) > (0), this rules out the function $\psi = 1_C$ where C is a set on which $f_X$ is bounded away from 0. To understand why such regularity at the boundary of the support is necessary for the existence of a root-N regular estimator, we can consider the simple case where there is only one variable X, with a compact convex support [a; b], whose distribution is dominated by the Lebesgue measure. If j > 0 then $(\psi f_X)^{(k)}$ for k < j vanishes at a and b. If $f_X$ (or one of its j − 1 first derivatives) does not vanish at the boundary, then ψ needs to vanish sufficiently quickly at the boundary of the support of X. Then successive integrations by parts of $\varphi^{(j)} \psi f_X$ hold for any $\varphi \in \Phi$. Under Assumption 2.(viii), the existence of a root-N consistent and regular estimator implies that "the parts between brackets" in the successive integrations by parts of $\varphi^{(j)} \psi f_X$ vanish, i.e.:
$$\lim_{(\underline{x}, \overline{x}) \to (a, b)} \left[\sum_{l=0}^{j-1} (-1)^l (\psi(x) f_X(x))^{(l)} \varphi^{(j-1-l)}(x)\right]_{x=\underline{x}}^{x=\overline{x}} = 0.$$
Because this result needs to hold for any $\varphi \in \Phi$, the successive derivatives of $\psi f_X$ need to vanish at the boundaries a and b.

Such a regularity condition at the boundary of the support is commonly used as a sufficient condition to get an efficient estimator for average partial derivatives or weighted average derivatives of index models with exogenous regressors (Powell et al., 1989, Newey & Stoker, 1993, Cattaneo et al., 2014). In a wider class of models, relaxing the index structure and the exogeneity of regressors, we show that such an assumption is in fact necessary for the existence of regular and root-N consistent estimators. When X is multidimensional, the previous reasoning can be generalized; however, it involves a generalization of integration by parts to a multidimensional setting that complicates the presentation of the result without adding any essential benefit.

When $\psi f_X$ is (j) times continuously differentiable, successive integrations by parts can be performed and in this case the weak derivatives are the usual derivatives. Interestingly, such a regularity condition is often assumed to show root-N consistency of specific estimators (Cattaneo et al., 2014, for density weighted average derivatives with exogenous regressors, or Ai & Chen, 2007, for endogenous regressors), or to derive the semiparametric efficiency bound of a parameter (Ai & Chen, 2012). In these papers, the assumption of continuous differentiability is sufficient; in this paper we show that a slightly weaker regularity condition (namely weak differentiability) is in fact necessary.

Example: The average under a counterfactual distribution corresponds to the case where (j) = (0, ..., 0) and $\psi = \tilde{f}_X / f_X$. In that case Assumption 3.(i) does not restrict the choice of $\tilde{f}_X$. However, we will show in the sequel that averages under counterfactual distributions are not always root-N estimable because Assumption 3.(ii) restricts $\tilde{f}_X$.
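To make the counterfactual-average parameter concrete, the short sketch below checks numerically that weighting by $\psi = \tilde{f}_X / f_X$ turns an average under $f_X$ into an average under $\tilde{f}_X$. Everything specific here is an illustrative assumption rather than part of the paper's setting: X is taken to be N(0, 1), the counterfactual density is N(delta, 1), m is a known function, and `counterfactual_average` is a hypothetical helper name. The sketch only illustrates the definition of θ0; it says nothing about estimation under endogeneity.

```python
import numpy as np


def counterfactual_average(m, delta, half_width=12.0, n_points=240001):
    """Approximate E[psi(X) m(X)] for X ~ N(0, 1) by a Riemann sum.

    psi = f_tilde / f_X, where f_tilde is the N(delta, 1) density.
    Both densities are illustrative assumptions.
    """
    x = np.linspace(-half_width, half_width, n_points)
    dx = x[1] - x[0]
    f_x = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # density of X
    psi = np.exp(-0.5 * ((x - delta) ** 2 - x**2))     # ratio f_tilde / f_X
    return float(np.sum(psi * m(x) * f_x) * dx)


# With m(x) = x^2, the average under N(delta, 1) is 1 + delta^2.
theta0 = counterfactual_average(lambda x: x**2, delta=0.5)
print(theta0)
```

With this choice of m, the printed value should agree with the closed form 1 + delta² up to quadrature error.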
Example: The weighted average partial derivative corresponds to $X^+ = X_1$ and (j) = (1, 0, ..., 0), and the average marginal effect is the special case ψ = 1. $\theta_0 = E(\psi(X) \partial_{x_1} m(X))$ is root-N estimable as soon as $x_1 \mapsto \psi(x_1, X^0) f_{X_1|X^0}(x_1)$ is ($X^0$-almost surely equal to) an absolutely continuous function on R. For instance, if X ∼ N(0, 1), then θ0 is root-N estimable only if ψ admits a derivative whose integral is equal (up to an additive constant) to ψ. This rules out discontinuous ψ like 1{x ∈ [a; b]} or 1{x > 0} but allows for kinked functions such as x1{x > 0}. Similarly, if X is drawn from a B(α, β) distribution, then $\theta_0 = E(\psi m')$ is root-N estimable only if ψ is twice differentiable on ]0; 1[ and $\psi(x) = o(x^{1-\alpha})$ and $\psi(1 - x) = o(x^{1-\beta})$ as x → 0. Notice that if α ≤ 1 (respectively β ≤ 1) then ψ needs to vanish at 0 (respectively at 1) sufficiently quickly to ensure the existence of a root-N consistent estimator. In particular, root-N consistent estimators for the average marginal effect (ψ = 1) or for the density weighted average derivative ($\psi = f_X$) do not exist if α ≤ 1 or β ≤ 1. For the Beta distribution, note that Assumption 3.(ii) reinforces these conditions at the boundary of the support: the square integrability of q ensures that root-N consistent estimators for the average marginal effect (ψ = 1) do not exist if α ≤ 2 or β ≤ 2.

We now turn to Assumption 3.(ii), which is particularly abstract in the endogenous case (i.e. X ≠ Z). There are at least two frameworks in which the condition $Q \neq \emptyset$ has a more comprehensible interpretation. The first one corresponds to the case where the distribution of X|Z belongs to an exponential family (we will refer to that case as "the exponential first stage"). The second one corresponds to the case where the first stage is "additive", i.e. X = p(Z) + η with η ⊥⊥ Z.

5.2 When is $Q \neq \emptyset$?

5.2.1 On the closedness of the range of linear operators

Assumption 3.(ii) implies that $(-1)^{|j|} \frac{[\psi f_X]^{(j)}}{f_X} \in L^2(X)$.
Conversely, if $(-1)^{|j|} \frac{[\psi f_X]^{(j)}}{f_X} \in L^2(X)$, then Q is never empty if $R(T^*) = L^2(X)$. The condition $R(T^*) = L^2(X)$ holds if and only if Ker(T) = {0} (completeness condition) and $R(T^*)$ is closed. The completeness condition is known to be hard to test in the data (Canay et al., 2013 and Freyberger, 2015) and imposes abstract restrictions on the joint distribution of (X, Z) (see d'Haultfœuille (2011), Hu & Schennach (2008) or ? for a discussion of criteria ensuring completeness or non-completeness). The closedness of $R(T^*)$ is even more restrictive, as illustrated by the following proposition.

Proposition 5.2 (Closedness of R(T) and $R(T^*)$)
(i) The range of T is closed if and only if the range of $T^*$ is closed.
(ii) If T (or $T^*$) is a compact operator, then R(T) is closed if and only if R(T) and $R(T^*)$ are finite dimensional.
(iii) If the completeness condition holds (i.e. Ker(T) = {0}) and if X|Z admits a density with respect to the Lebesgue measure Z-almost surely, then neither R(T) nor $R(T^*)$ is closed.

In our instrumental variable framework, if the joint distribution of (X, Z) is dominated by the product of the marginal measures of X and Z, then T is a compact operator as soon as $E(f_{X,Z}(X, Z)) < \infty$ (cf. the discussion of Assumption A1 in Darolles et al., 2011). Proposition 5.2.(ii) shows that in that case, $R(T) = L^2(Z)$ and $R(T^*) = L^2(X)$ only if X and Z have finite supports (and then T and $T^*$ are characterized by full rank matrices). When the support of X or Z is not finite, then even in the favorable case of completeness, Proposition 5.2.(iii) shows that R(T) (respectively $R(T^*)$) is not closed and hence not equal to $L^2(Z)$ (respectively $L^2(X)$). So because R(T) is in general not closed when X ≠ Z, criteria on the DGP and the parameter θ0 ensuring that $Q \neq \emptyset$ are hard to formulate in the most general case.
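The finite-support case, where $T^*$ reduces to a full rank matrix, can be sketched numerically. The block below is an illustration under assumed inputs, not a construction taken from the paper: the joint probabilities, ψ and m are arbitrary numbers, (j) = (0), and the names `joint`, `t_star`, `theta_direct` and `theta_moment` are hypothetical. It solves $T^*(q) = \psi$ for q and checks that E(ψ(X)m(X)) coincides with E(q(Z)E(m(X)|Z)), which equals E(q(Z)Y) when E(Y − m(X)|Z) = 0.

```python
import numpy as np

# joint[z, x] = P(Z = z, X = x); arbitrary full-rank example with 3 support points each.
joint = np.array([[0.20, 0.05, 0.05],
                  [0.05, 0.25, 0.05],
                  [0.05, 0.05, 0.25]])
p_x = joint.sum(axis=0)  # marginal distribution of X
p_z = joint.sum(axis=1)  # marginal distribution of Z

# Adjoint operator as a matrix: (T* q)(x) = E[q(Z) | X = x] = sum_z P(Z = z | X = x) q(z).
t_star = (joint / p_x).T  # shape (n_x, n_z), each row sums to one

psi = np.array([1.0, -0.5, 2.0])  # weight function psi(x), arbitrary
m = np.array([0.3, 1.2, -0.7])    # structural function m(x), arbitrary

# Full rank and square: T*(q) = psi has a unique solution q.
q = np.linalg.solve(t_star, psi)

# theta_0 written with the endogenous regressor ...
theta_direct = float(np.sum(p_x * psi * m))
# ... and as a moment in Z only: E[q(Z) E(m(X) | Z)].
cond_mean_m = (joint @ m) / p_z  # (T m)(z) = E[m(X) | Z = z]
theta_moment = float(np.sum(p_z * q * cond_mean_m))

print(theta_direct, theta_moment)
```

When the matrix is rank deficient or the supports are infinite, the linear system above may have no solution in $L^2(Z)$, which is the situation the rest of this section investigates.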
In the following, we investigate the condition $Q \neq \emptyset$ in two specific cases: when the first stage belongs to an exponential family and when the first stage is additive.

5.2.2 Specialization to the exponential first stage

Exponential families of distributions have many nice features: they embed many usual parametric models and provide sufficient statistics. So, they have been widely studied by econometricians, such as in Gourieroux et al. (1984). Concerning non-parametric instrumental variables, Newey & Powell (2003) have established the completeness condition for some (but not all) models in which the true conditional distribution X|Z belongs to an exponential family. When the first stage is exponential, the condition $Q \neq \emptyset$ is equivalent to the condition that an identifiable function of X (depending on ψ and on the distribution of (Z, X)) is the Laplace transform of a measure. The range of Laplace transforms of measures is not easily characterizable, but some properties can be derived. On the other hand, the range of Laplace transforms of linear forms has been precisely characterized by L. Schwartz, and this implies interesting necessary conditions for the non-emptiness of Q.

Note that the following discussion differs from the one of Ai & Chen (2007) (pp 25-26), who show that a root-N consistent estimator exists for $\theta_0 = E(m^{(1)}(X))$ when T is Hilbert-Schmidt and the marginal density of X is proportional to exp(t(x)) with t a finite order polynomial.

Let W denote the set of exogenous regressors, i.e. the variables that are simultaneously in X and Z. Let $\tilde{X}$ denote the set of endogenous variables and $\tilde{Z}$ the set of excluded instruments.

Assumption 4 (Exponential First Stage)
(i) $(\tilde{X}, \tilde{Z}) | W$ is dominated by a product measure $\pi_{\tilde{X}|W} \otimes \pi_{\tilde{Z}|W}$.
(ii) $f_{\tilde{X}|\tilde{Z}=\tilde{z}, W=w}(\tilde{x})$ takes the form $h_w(\tilde{x}) g_w(\tau_w(\tilde{z})) \exp(-\mu_w(\tilde{x})' \tau_w(\tilde{z}))$, where $\mu_w(\tilde{x})$ and $\tau_w(\tilde{z})$ are elements of $\mathbb{R}^{p_w}$, $0 \in \mathrm{Supp}(\mu_w(\tilde{X})|W = w)$ and $\mathrm{span}(\mathrm{Supp}(\mu_w(\tilde{X})|W = w)) = \mathbb{R}^{p_w}$.
Assumption 4.(i) is mainly needed for the sake of simplicity, and means that the joint distribution of $(\tilde{X}, \tilde{Z}) | W$ is not degenerate, unlike in the exogenous case that has been previously studied. Assumption 4.(ii) is equivalent to $f_{\tilde{X}|\tilde{Z},W}(\tilde{x}) = h_W(\tilde{x}) \tilde{g}_W(\tilde{Z}) \exp(-\mu_W(\tilde{x})' \tau_W(\tilde{Z}))$, because an integration with respect to $\tilde{x}$ ensures that $\tilde{g}_w(z)$ is a function of (τ(z), w). The condition $0 \in \mathrm{Supp}(\mu(\tilde{X})|W)$ can be assumed without loss of generality and is just a normalization⁵. Similarly, the condition $\mathrm{span}(\mathrm{Supp}(\mu_w(\tilde{X})|W = w)) = \mathbb{R}^{p_w}$ can be assumed without loss of generality⁶, as soon as we adopt the convention $\mathbb{R}^0 = \{0\}$.

Newey & Powell (2003) have proved that the completeness condition holds for a subclass of models with an exponential first stage (for instance, for $X|Z \sim N(b(Z), \sigma^2)$ with R(b) containing an open set of $\mathbb{R}^K$). But some models with an exponential first stage do not verify the completeness condition, for at least two reasons. The first reason is the lack of variation in Z, for instance if Z is binary and X = Z + ε with ε|Z ∼ N(0, 1). A second reason for the failure of the completeness condition is that for some models the dimension of the linear span of the range of μ and τ is "too large". For instance, if $X|Z \sim N(b(Z), \sigma^2(Z))$, then the linear span of the range of $\mu(X) = (X^2/2, X)$ and $\tau(Z) = \left(-\frac{1}{\sigma^2(Z)}, \frac{b(Z)}{\sigma^2(Z)}\right)$ is $\mathbb{R}^2$ whereas X is only real-valued, and it follows that the completeness condition does not hold in general⁷. In both cases, Assumption 4 holds and, despite the failure of the completeness condition, $\theta_0 = E(\psi(X) m^{(j)}(X))$ will be root-N estimable for some functions ψ. On the other hand, we will also show that even if the completeness condition holds, some simple parameters will not be root-N estimable.

For any $\gamma \in \mathbb{R}^{p_w}$, let $\nu_w(\gamma) = \int g_w(t) \exp(-\gamma' t) \, dF_{\tau_w(\tilde{Z})|W=w}(t)$. The random set $\{\gamma \in \mathbb{R}^{p_W} : \nu_W(\gamma) < \infty\}$ is a convex set⁸ that contains $\mu_W(\tilde{X})$ almost surely (because $h_W(\tilde{X}) \nu_W(\mu_W(\tilde{X})) = f_{\tilde{X}|W}$ is finite almost surely and $P(h_W(\tilde{X}) = 0 | W) = 0$ almost surely). Then it contains with probability one $\Gamma_W = \mathrm{int}\, \mathrm{co}\, \mathrm{Supp}(\mu_W(\tilde{X})|W)$, the interior of the convex hull of $\mathrm{Supp}(\mu_W(\tilde{X})|W)$. The function $\nu_W$ can be naturally extended to $\Gamma_W + i\mathbb{R}^{p_W}$ (where $i^2 = -1$) by $\nu_W(u + iv) = \int g_W(t) \exp(-(u + iv)' t) \, dF_{\tau_W(\tilde{Z})|W}(t)$. This is the Laplace transform of the signed measure $M_W$ defined by $dM_W(t) = g_W(t) \, dF_{\tau_W(\tilde{Z})|W}(t)$. Such a Laplace transform is a function that verifies strong smoothness conditions. Interestingly, another function (denoted $\varrho_W$ in the following theorem) needs to verify similar smoothness conditions.

⁵ Indeed $g_w(\tau(z)) \exp(-\mu(x)' \tau(z)) = \tilde{g}_w(\tau(z)) \exp(-\tilde{\mu}(x)' \tau(z))$ with $\tilde{g}_w(\tau(z)) = g_w(\tau(z)) \exp(-\mu(x_0)' \tau(z))$ and $\tilde{\mu}(x) = \mu(x) - \mu(x_0)$ for a given $x_0 \in \mathrm{Supp}(X|W)$.
⁶ If $\mathrm{span}(\mathrm{Supp}(\mu_w(X)|W = w))$ is a proper subspace of dimension $r_w < d_w$, we can always reparameterize $\mu_w$ and $\tau_w$. Indeed, in such a case, up to a reordering of the components of $\mu_w$ and $\tau_w$, we can assume that $\mu_j = \sum_{i=1}^{r} \lambda_{ji} \mu_i$ for j > r, and then we have $\sum_{i=1}^{d} \mu_i \tau_i = \sum_{i=1}^{r} \mu_i \tilde{\tau}_i$ with $\tilde{\tau}_i = \tau_i + \sum_{j=r+1}^{d} \lambda_{ji} \tau_j$.
⁷ For instance, if b is bounded on Supp(Z), then there exist α and β such that $b^2(Z) + \alpha b(Z) + \beta < 0$ on Supp(Z), and in that case $E[X^2 + \alpha X + \beta | Z] = 0$ for $\sigma^2(Z) = -b^2(Z) - \alpha b(Z) - \beta$. To the best of our knowledge, in such a case of heteroscedastic first stage, necessary conditions on b(z) and $\sigma^2(z)$ to ensure completeness are not well understood. See d'Haultfœuille (2011) for a discussion of the completeness assumption.
⁸ $\exp(-(\gamma_1 + \gamma_2)' t / 2) \leq \exp(-\gamma_1' t) + \exp(-\gamma_2' t)$ because $ab \leq a^2 + b^2$, and then the convexity follows.

Let us introduce some notions of regularity of functions from $\mathbb{C}^p$ to $\mathbb{C}$. A function f from an open set $D \subset \mathbb{C}^p$ to $\mathbb{C}$ is holomorphic on D if for any $z_0 \in D$, $\lim_{z \to z_0, z \neq z_0} \frac{f(z) - f(z_0)}{z - z_0}$ exists. Roughly speaking, a holomorphic function is "very" regular and admits an analytic development $f(z) = \sum_{(n) \in \mathbb{N}^p} a_{(n)} z^{(n)}$ where $z^{(n)} = \prod_{j=1}^{p} z_j^{n_j}$. L. Schwartz has defined the Laplace transform for a wider class of objects than functions or measures by duality arguments (Schwartz, 1997, and see Appendix 7.6 for details). Laplace transforms are always defined on a set $C + i\mathbb{R}^p$, where C is a convex set. When C has a non-empty interior, the Laplace transform is a holomorphic function on $\mathrm{int}(C) + i\mathbb{R}^p$ and verifies some conditions of domination by polynomials. All the Laplace transforms considered in this paper are Laplace transforms of signed measures. In this case, the domination condition on the Laplace transform can be refined, and when the measure is dominated by the Lebesgue measure additional conditions apply.

Under Assumption 4, Q is not empty if and only if a function $\varrho_W(x)$ depending on ψ, $f_X$, μ and $\nu_W$ is the Laplace transform of a signed measure. The next Theorem uses this property to derive necessary conditions on ψ and the joint distribution of (X, Z) for the existence of a regular root-N estimator. These conditions are sufficient to ensure that $\varrho_W$ is the Laplace transform of a linear form, so informally speaking, the necessary conditions given below are almost sufficient to ensure the non-emptiness of Q and can be used to reject the existence of a root-N estimator in many settings.

Theorem 5.3 If Assumptions 1, 2 and 4 hold then Assumption 3.(ii) holds only if:
(i) $(-1)^{|j|} \frac{[\psi(X) f_X(X)]^{(j)}}{f_X(X)} \in L^1(\mu_W(\tilde{X}), W)$, i.e. there exists $r_W$ such that $r_W(\mu_W(\tilde{X})) = (-1)^{|j|} \frac{[\psi(X) f_X(X)]^{(j)}}{f_X(X)}$ and $r_W$ is integrable with respect to $F_{\mu_W(\tilde{X})|W}$, W-almost surely,
(ii) W-almost surely, there exists a convex set $C_W$ such that $P(\mu_W(\tilde{X}) \in C_W | W) = 1$ and such that $\varrho_W = r_W \nu_W$ is the Laplace transform of a signed measure with finite total variation dominated by $dF_{\tau(\tilde{Z})|W}$ on $C_W + i\mathbb{R}^{p_W}$.
(iii) W-almost surely, $\varrho_W$ is a holomorphic function on $\Gamma_W + i\mathbb{R}^{p_W}$,
(iv) W-almost surely, $\varrho_W$ is bounded on $C + i\mathbb{R}^{p_W}$ for any compact subset C of $\Gamma_W$,
(v) when $\tau_W(\tilde{Z})|W$ is dominated by the Lebesgue measure on $\mathbb{R}^{p_W}$, $\varrho_W(u + iv)$ tends to 0 when v tends to infinity for any $u \in \Gamma_W$.
Moreover, when a root-N regular consistent estimator of θ0 exists, the semiparametric efficiency bound of θ0 is:
$$V^* = V(\varepsilon q^*(Z) + R(X)),$$
with
$$q^*(Z) = \frac{1}{g_W(\tau_W(\tilde{Z})) E(\varepsilon^2|Z) E\left[1/E(\varepsilon^2|Z) \,|\, \tau_W(\tilde{Z}), W\right]} \frac{dL^{-1}[\varrho_W]}{dF_{\tau_W(\tilde{Z})|W}}(\tau_W(\tilde{Z})) - \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} + \frac{E\left[E(\varepsilon R(X)|Z)/E(\varepsilon^2|Z) \,|\, \tau_W(\tilde{Z}), W\right]}{E(\varepsilon^2|Z) E\left[1/E(\varepsilon^2|Z) \,|\, \tau_W(\tilde{Z}), W\right]}$$
and
$$R(X) = \big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X) - E\big[\big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X)\big].$$

Condition (i) of the Theorem means that the quantity $\frac{[\psi(X) f_X(X)]^{(j)}}{f_X(X)}$ is independent of X conditional on μ(X) and W. Intuitively, this means that (μ(X), W) is a sufficient statistic for $\frac{[\psi(X) f_X(X)]^{(j)}}{f_X(X)}$, i.e. there is enough variation in (μ(X), W) to recover the variations of $[\psi f_X]^{(j)} / f_X$. Conditions (i) and (ii) are necessary and sufficient to ensure the existence of $q \in L^1(Z)$ such that $T^*(q) f_X = [\psi f_X]^{(j)}$. And if such a function q is also in $L^2(Z)$, it will be in Q. Moreover, the other conditions of the Theorem imply strong regularity conditions on ψ and on $f_X$ to ensure the existence of a regular root-N consistent estimator of θ0. In condition (iii), holomorphy is a necessary condition to ensure that $\varrho_W$ is a Laplace transform (in the sense of L. Schwartz, see Proposition 6 in Chapter VIII of Schwartz (1997)). We prove in the appendix that Condition (iv) is necessary for the Laplace transform to be the Laplace transform of a signed measure. Together, (iii) and (iv) are sufficient to ensure that $\varrho$ is a Laplace transform in the sense of L. Schwartz (see Appendix 7.7.2 and Proposition 6 in Chapter VIII of Schwartz (1997) for more details). Last, Condition (v) is a refinement of (iv) when $\tau_W(\tilde{Z})|W$ is dominated by the Lebesgue measure.
In the expression of the semiparametric efficiency bound, $\frac{dL^{-1}[\varrho_W]}{dF_{\tau_W(\tilde{Z})|W}}$ denotes the density of the inverse Laplace transform of $\varrho_W$ with respect to the Lebesgue-Stieltjes measure of $\tau_W(\tilde{Z})|W$. Condition (ii) of the Theorem, the injectivity of the Laplace transform and the Radon-Nikodym Theorem ensure that the density is well defined.

Under an exponential first stage, Theorem 5.3 gives the solution of the minimization problem given in Theorem 4.1. Not surprisingly, the solution $q^*$ depends on the skedastic function $E(\varepsilon^2|Z)$ and also on the dependence between ε and X through the quantity $E(\varepsilon R(X)|Z)$. When $\tau_W$ is injective, the sigma-algebra generated by Z is equal to the sigma-algebra generated by $(\tau_W(\tilde{Z}), W)$, and in that case $q^*(Z)$ has a simpler expression: $q^*(Z) = \frac{1}{g_W(\tau_W(\tilde{Z}))} \frac{dL^{-1}[\varrho_W]}{dF_{\tau_W(\tilde{Z})|W}}(\tau_W(\tilde{Z}))$. More generally, this simpler expression holds as soon as $E(\varepsilon^2|Z) = E(\varepsilon^2|\tau_W(\tilde{Z}), W)$ and $E(\varepsilon R(X)|Z) = E(\varepsilon R(X)|\tau_W(\tilde{Z}), W)$.

Even when ψ and $f_X$ are strongly regular, θ0 is not always root-N estimable, as illustrated by the simple examples detailed below. Indeed, the behavior of the holomorphic function $\nu_W(u)$ when the imaginary part of u tends to infinity is related to the size of the set of parameters θ0 that are root-N estimable: the faster $\nu_W(x + iy)$ tends to 0 when |y| → ∞, the stronger is the link between X and Z, the larger is the set of functions $r_W$ such that $r_W \nu_W$ is a Laplace transform, and hence the larger is the range of $T^*$ and the set of ψ such that θ0 is root-N estimable. In particular, some functions ψ correspond to parameters which are root-N estimable if the strength of the instruments is sufficient and which are not otherwise.

Example: Bivariate normal distribution
Consider the bivariate normal case:
$$\begin{pmatrix} X \\ Z \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$
The first stage is $X = \rho Z + \sqrt{1 - \rho^2}\, \eta$ with η|Z ∼ N(0, 1). Consider a shift Δm in the mean of X keeping constant the unobserved variable ε.
In that case, the average of the outcome under such a counterfactual distribution corresponds to $\theta_0 = E\left[\exp\left(-\frac{1}{2}[(X - \Delta m)^2 - X^2]\right) m(X)\right]$ and this parameter is root-N estimable. On the other hand, the average of the outcome under a shift $\sigma^2 - 1$ in the variance of X corresponds to $\theta_0 = \frac{1}{\sqrt{\sigma^2}} E\left[\exp\left(-\frac{1}{2} X^2 (1/\sigma^2 - 1)\right) m(X)\right]$. This parameter is root-N estimable if and only if $\sigma^2 > 1 - \rho^2$. The intuition behind this result is that when σ tends to 0, θ0 tends to m(0), which is not a parameter that can be consistently estimated at the root-N rate even in the exogenous case. On the other hand, if the counterfactual distribution has a higher variance than the original one (σ > 1), the existence of a root-N consistent estimator is always ensured as soon as ρ ≠ 0. Alternatively, for a given choice of σ, the semi-parametric efficiency bound is finite if and only if $\rho^2 > 1 - \sigma^2$. When the counterfactual distribution has a lower variance than the original one, we are in an intermediary case in which the existence of a root-N consistent estimator depends on the strength of the instruments. Similarly, the average marginal effect E(m'(X)) is always root-N estimable, whereas the density weighted average derivative $E(f_X(X) m'(X))$ is root-N estimable if and only if $\rho^2 \geq 1/2$. The corresponding semi-parametric efficiency bounds are derived in the Appendix and are reported in Table 1.

Example: Average Marginal Effect with normal homoscedastic first stage
Consider the case where θ0 = E(m'(X)), with $X|Z \sim N(t(Z), \sigma^2)$ and where Assumptions 1 and 2 hold. In that case Assumption 4 is fulfilled with μ(X) = X, $\tau(Z) = -t(Z)/\sigma^2$, $h(X) = \exp(-X^2/(2\sigma^2))$ and $g(\tau(Z)) = (2\pi\sigma^2)^{-1/2} \exp(-\tau^2(Z)\sigma^2/2)$. In that case, there is no variable W, $\Gamma = \Gamma_W = \mathbb{R}$,
$$\nu_W(x) = \nu(x) = (2\pi\sigma^2)^{-1/2} \int \exp(-t^2\sigma^2/2) \exp(-xt) \, dF_{\tau(Z)}(t)$$
and
$$r_W(x) = r(x) = -\frac{f_X'(x)}{f_X(x)} = \frac{x}{\sigma^2} + \frac{(2\pi\sigma^2)^{-1/2} \int t \exp(-t^2\sigma^2/2) \exp(-xt) \, dF_{\tau(Z)}(t)}{\nu(x)}.$$
So, we conclude that
$$\varrho_W(x) = \varrho(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \int \exp(-t^2\sigma^2/2) \exp(-xt) \left(\frac{x}{\sigma^2} + t\right) dF_{\tau(Z)}(t).$$
Now, assume for the sake of simplicity that $E\left(|\tau(Z)| \exp(-\tau(Z)^2\sigma^2/2 - x\tau(Z))\right) < \infty$ for any x ∈ C. Condition (ii) of Theorem 5.3 holds only if τ(Z) admits a locally absolutely continuous density with respect to the Lebesgue measure (see Lemma 7.5 in the appendix). So the density of τ(Z) is continuous and almost everywhere differentiable on R and its derivative $f'_{\tau(Z)}$ is locally integrable. If τ(Z) admits a mass point or if τ(Z) has a density bounded away from 0 on its compact support (such as the uniform distribution), regular root-N estimators do not exist. A bit of algebra gives:
$$\frac{1}{g(\tau(Z))} \frac{dL^{-1}(\varrho)}{dF_{\tau(Z)}}(\tau(Z)) = \frac{1}{\sigma^2} \frac{f'_{\tau(Z)}(\tau(Z))}{f_{\tau(Z)}(\tau(Z))} = -\frac{f'_{t(Z)}(t(Z))}{f_{t(Z)}(t(Z))}.$$
Then Q is not empty only if $E\left[\frac{f'^2_{t(Z)}(t(Z))}{f^2_{t(Z)}(t(Z))}\right] < \infty$, which means that the Fisher information for location of t(Z) is finite. Last, if moreover t is one to one, the semiparametric efficiency bound of the average marginal effect is:
$$V^\star = V\left(m'(X) - (Y - m(X)) \frac{f'_{t(Z)}(t(Z))}{f_{t(Z)}(t(Z))}\right).$$

Example: Average Marginal Effect with normal heteroscedastic first stage
Consider the case where θ0 = E(m'(X)), with $X|Z \sim N(t(Z), \sigma^2(Z))$. For the sake of simplicity, we assume moreover that the support of Z is compact and that the ranges of t and $\sigma^2$ are respectively included in compact subsets of R and ]0; ∞[. To the best of our knowledge, conditions on $(t(Z), \sigma^2(Z))$ that ensure the completeness condition are not known. Here, we discuss the existence of regular and root-N consistent estimators. In that case, h(X) = 1, $\mu(X) = (\mu_1, \mu_2) = (X, X^2/2)$, $\tau(Z) = (\tau_1, \tau_2) = \left(-\frac{t(Z)}{\sigma^2(Z)}, \frac{1}{\sigma^2(Z)}\right)$ and $g(\tau) = \frac{\tau_2^{1/2}}{(2\pi)^{1/2}} \exp\left(-\frac{\tau_1^2}{2\tau_2}\right)$. Note that the restrictions assumed on t(Z) and $\sigma^2(Z)$ ensure that the ranges of $\tau_1$ and $\tau_2$ are respectively included in compact subsets of R and ]0; ∞[. It follows that any continuous function on R×]0; ∞[ of $(\tau_1(Z), \tau_2(Z))$ admits a finite moment.
The support of μ(X)|Z does not depend on Z and is a parabola in the plane, and the relative interior of its convex hull is $\Gamma = \{(a, b) \in \mathbb{R}^2 \text{ s.t. } b > a^2/2\}$. For $(\mu_1, \mu_2) \in \mathbb{C}^2$, $\nu(\mu_1, \mu_2)$ is well defined and equal to:
$$\nu(\mu_1, \mu_2) = \int \frac{t_2^{1/2}}{\sqrt{2\pi}} e^{-\frac{t_1^2}{2t_2} - \mu_1 t_1 - \mu_2 t_2} \, dF_{\tau(Z)}(t_1, t_2),$$
and $f_X(x) = \nu(x, x^2/2)$. It follows that:
$$\varrho(\mu_1, \mu_2) = \int_{\mathbb{R}^2} (t_1 + \mu_1 t_2) \frac{t_2^{1/2}}{\sqrt{2\pi}} e^{-\frac{t_1^2}{2t_2} - \mu_1 t_1 - \mu_2 t_2} \, dF_{\tau(Z)}(t_1, t_2).$$
Because $(\mu_1, \mu_2) \in \mathbb{C}^2 \mapsto \int t_1 \frac{|t_2|^{1/2}}{\sqrt{2\pi}} e^{-\frac{t_1^2}{2t_2} - \mu_1 t_1 - \mu_2 t_2} \, dF_{\tau(Z)}(t_1, t_2)$ is the Laplace transform of a signed measure dominated by the push-forward measure of τ(Z), it follows that $\varrho$ is a Laplace transform if and only if $(\mu_1, \mu_2) \mapsto \mu_1 \int \frac{t_2^{3/2}}{\sqrt{2\pi}} e^{-\frac{t_1^2}{2t_2} - \mu_1 t_1 - \mu_2 t_2} \, dF_{\tau(Z)}(t_1, t_2)$ is itself a Laplace transform on $C + i\mathbb{R}^2$, where C is a convex subset of $\mathbb{R}^2$ such that $P((X, X^2/2) \in C) = 1$. This means that there exists a function $g \in L^1(\tau_1(Z), \tau_2(Z))$ such that for any $(\mu_1, \mu_2) \in C + i\mathbb{R}^2$:
$$\mu_1 \int \frac{t_2^{3/2}}{\sqrt{2\pi}} e^{-\frac{t_1^2}{2t_2} - \mu_1 t_1 - \mu_2 t_2} \, dF_{\tau(Z)}(t_1, t_2) = \int g(t_1, t_2) e^{-\mu_1 t_1 - \mu_2 t_2} \, dF_{\tau_1(Z), \tau_2(Z)}(t_1, t_2).$$
Fubini's Theorem ensures that the previous equality is equivalent to:
$$\int \mu_1 L[M_{t_2}](\mu_1) \exp(-\mu_2 t_2) \, dF_{\tau_2(Z)}(t_2) = \int L[G_{t_2}](\mu_1) \exp(-\mu_2 t_2) \, dF_{\tau_2(Z)}(t_2),$$
with $M_{t_2}$ and $G_{t_2}$ the measures defined by $dM_{t_2}(u) = \frac{t_2^{3/2}}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2t_2}\right) dF_{\tau_1(Z)|\tau_2(Z)=t_2}(u)$ and $dG_{t_2}(u) = g(u, t_2) \, dF_{\tau_1(Z)|\tau_2(Z)=t_2}(u)$. Injectivity of the Laplace transform ensures that $\mu_1 L[M_{\tau_2(Z)}](\mu_1) = L[G_{\tau_2(Z)}](\mu_1)$. It follows from Lemma 7.5 in the Appendix that $\tau_1(Z)|\tau_2(Z)$ needs to admit a locally absolutely continuous density with respect to the Lebesgue measure, which is equivalent to the fact that $t(Z)|\sigma^2(Z)$ admits a locally absolutely continuous density ($\sigma^2(Z)$-almost surely). Informally, this requirement is very strong when the econometrician only has one instrument Z, but quite reasonable if the econometrician observes two continuous and different instruments $Z_1$ and $Z_2$. We will just discuss this point informally⁹.
Imagine that Z admits a density with respect to the Lebesgue measure on R. In that case, a finite semi-parametric efficiency bound exists only if $Z|\sigma^2(Z)$ is not countable: $|\{z \in \mathcal{Z} : \sigma^2(Z) = \sigma^2(z)\}| > |\mathbb{N}|$, $\sigma^2(Z)$-almost surely. Informally, $\sigma^2(Z)$ needs to be not too informative on t(Z). It follows that when $z \mapsto \sigma^2(z)$ is invertible, or more generally admits finite or countable fibers, regular root-N consistent estimators do not exist. And even when σ(Z) takes only a few values, the existence of a locally absolutely continuous density of $t(Z)|\sigma^2(Z)$ imposes non-trivial and unusual restrictions on the distribution of t(Z). To understand why, consider the case $X|Z \sim N(Z, \sigma_0^2 1_{\{Z \leq a\}} + \sigma_1^2 1_{\{Z > a\}})$: a root-N consistent estimator of the average marginal effect exists only if Z|Z ≤ a and Z|Z > a admit locally absolutely continuous densities. Then $f_Z(z)/F_Z(a) 1_{\{z \leq a\}}$ and $f_Z(z)/(1 - F_Z(a)) 1_{\{z > a\}}$ need to vanish at the boundary of the support of Z but also at z = a. Hence Z needs to admit a locally absolutely continuous density that vanishes at a and at the boundary of its support.

⁹ We do not discuss more precisely the meaning of having only one or two distinct instruments, because some difficulties come from the fact that for $(Z_1, Z_2)$ drawn from a uniform distribution on $[0; 1]^2$, there exists an invertible function g such that $Z = g(Z_1, Z_2)$ is drawn from a uniform distribution on [0; 1].

On the other hand, suppose that Z admits a density with respect to the Lebesgue measure on $\mathbb{R}^2$ (or $\mathbb{R}^p$ for p > 2). Then $t(Z)|\sigma^2(Z) = u$ admits a locally absolutely continuous density under reasonable conditions, because the existence of two distinct instruments enables the econometrician to observe continuous variation of t(Z) keeping $\sigma^2(Z)$ constant. If such reasonable conditions are fulfilled, then:
$$V^* = V(\varepsilon q^*(Z) + m'(X)),$$
with:
$$q^*(Z) = \frac{1}{E(\varepsilon^2|Z) E\left(1/E(\varepsilon^2|Z) \,|\, t(Z), \sigma^2(Z)\right)} \times \left[\mathrm{Cov}\left(E(\varepsilon m'(X)|Z), 1/E(\varepsilon^2|Z) \,|\, t(Z), \sigma^2(Z)\right) - \frac{f'_{t(Z)|\sigma^2(Z)}(t(Z))}{f_{t(Z)|\sigma^2(Z)}(t(Z))}\right].$$
Example: Average marginal effect for a simple parametric first stage. Consider the case where X|Z is dominated by the Lebesgue measure on R, the marginal distribution of Z is such that P(Z > 0) = 1, and f_{X|Z}(x) = g(Z) exp(−|x|^k Z) with k > 0. For k = 1 we get the two-sided exponential distribution for X|Z, and the case k = 2 corresponds to X|Z ∼ N(0, (2Z)^{−1}). It follows that

\[ g(z) = \Big[\int \exp(-|x|^k z)\,dx\Big]^{-1} = \frac{k\,z^{1/k}}{2\Gamma(1/k)}. \]

In such a case µ(X) is a scalar. If Z is such that E(Z^{1+1/k}) < ∞ (this is the case if Z has a compact support included in ]0; +∞[), then the dominated convergence Theorem ensures that f_X is differentiable and

\[ f_X'(x) = -k|x|^{k-1}\,\mathrm{sgn}(x)\int z\,g(z)\exp(-|x|^k z)\,dF_Z(z), \]

where sgn(x) = 1{x>0} − 1{x<0}. To satisfy Condition (i) of Theorem 5.3, f_X′(x)/f_X(x) needs to be a function of µ(x) = |x|^k. Because

\[ \frac{f_X'(x)}{f_X(x)} = -k\,\mu(x)^{1-1/k}\,\mathrm{sgn}(x)\,\frac{\int z\,g(z)\exp(-\mu(x)z)\,dF_Z(z)}{\int g(z)\exp(-\mu(x)z)\,dF_Z(z)}, \]

this condition holds if and only if sgn(x) is a function of |x| = µ(x)^{1/k}, which never holds.

5.2.3 Specialization to the additive first stage

We now consider the case where the first stage is additive and where both instruments and residuals of the first stage admit densities with respect to the Lebesgue measure. In the following, for any p ≥ 0, L^p(R^K) denotes the set of functions f which are measurable with respect to the Lebesgue measure on R^K and such that |f|^p is integrable.

Assumption 5 (Additive 1st stage) X = p(Z) + η where R(p) ⊂ R^K, p(Z) and η are random variables in R^K that admit densities f_{p(Z)} and f_η with respect to the Lebesgue measure on R^K, and Z ⊥⊥ η.

The previous Assumption rules out the case where there exist non-continuous variables in X′. This restriction is made for the sake of simplicity but can be relaxed by a reasoning conditional on X′. For every q integrable with respect to the Lebesgue measure on R^K, Fq denotes the Fourier transform of q: Fq(x) = ∫ q(u) exp(−2πi x′u) du, and F̄q(x) = Fq(−x).
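As a quick check on the Fourier convention just introduced (the factor 2π sits in the exponent, and F̄q(x) = Fq(−x)), the following sketch approximates Fq by a trapezoid rule in dimension one. The function names `fourier`, `fourier_bar`, `gauss` and `shifted_gauss` are hypothetical helpers introduced only for this illustration; the integration window and grid size are arbitrary choices that are adequate for rapidly decaying integrands.

```python
import cmath
import math


def fourier(q, x, lo=-10.0, hi=10.0, n=20000):
    """Trapezoid approximation of Fq(x) = integral of q(u) exp(-2 pi i x u) du.

    Assumes q is negligible outside [lo, hi] (illustrative assumption).
    """
    h = (hi - lo) / n
    total = 0j
    for i in range(n + 1):
        u = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * q(u) * cmath.exp(-2j * math.pi * x * u)
    return total * h


def fourier_bar(q, x, **kwargs):
    """The conjugate transform defined in the text: (F-bar q)(x) = Fq(-x)."""
    return fourier(q, -x, **kwargs)


def gauss(u):
    """exp(-pi u^2), which is its own Fourier transform under this convention."""
    return math.exp(-math.pi * u * u)


def shifted_gauss(u, shift=1.0):
    """A translated Gaussian; translation only multiplies Fq by a phase factor."""
    return math.exp(-math.pi * (u - shift) ** 2)


if __name__ == "__main__":
    x = 0.7
    print(abs(fourier(gauss, x) - gauss(x)))      # close to zero
    print(abs(fourier_bar(gauss, x) - gauss(x)))  # close to zero
```

Under this convention exp(−πu²) is a fixed point of both F and F̄, which is one simple instance of the identity F̄F = id used in the sequel; for a translated Gaussian the two transforms differ only by complex conjugation.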
Note that under the previous Assumption, X ⊥⊥ Z|p(Z) and then E(q(Z)|X) = E(E(q(Z)|p(Z))|X) = E(q̃(p(Z))|X). The additivity of the first stage enables us to rewrite T*(q)f_X as a convolution product because

\[ T^*(q)(x) f_X(x) = \int \tilde q(v) f_{p(Z)}(v) f_\eta(x-v)\,dv = (\tilde q f_{p(Z)}) \star f_\eta(x). \]

The convolution Theorem ensures that Q ≠ ∅ if and only if F[ψf_X]^{(j)}/Ff_η is the Fourier transform of an integrable function (more precisely, the equality holds for any v such that Ff_η(v) ≠ 0).

Let S denote the Schwartz space (i.e. the space of infinitely differentiable functions f such that x^α f^{(β)}(x) decreases to zero when |x| → +∞ for any multi-indices (α) and (β)). Because S ⊂ L¹(R^K), F is well defined on S. By duality arguments, F can be extended to S′, the topological dual space of S (see Schwartz (1997) for the topology considered), which contains L¹(R^K), S and L²(R^K) (up to a canonical isomorphism). On S′, we have F̄F = FF̄ = id, so F is a bijection from S′ into itself. It is also a bijection from S (respectively L²(R^K)) to itself. Thus, although F(S) (respectively F(L²(R^K)) and F(S′)) is easily characterized, there is no equally simple result for F(L¹(R^K)). But we know that F(L¹(R^K)) is included in C_0^0(R^K), the space of continuous functions from R^K to C vanishing at infinity. And because F̄F = id on S′ and then on L¹(R^K), we also have {f ∈ L¹(R^K) s.t. Ff ∈ L¹(R^K)} = {f ∈ L¹(R^K) s.t. F̄f ∈ L¹(R^K)} ⊂ F(L¹(R^K)) ∩ C_0^0(R^K).

Moreover, the Laplace transform defined by Lf(z) = ∫ f(u) exp(−z′u) du extends the Fourier transform¹⁰ to Γ_f + iR^K, where Γ_f = {x ∈ R^K : u ↦ f(u) exp(−u′x) integrable}. When the support of p(Z) is compact, we know that the Laplace transform of q̃f_{p(Z)} is defined on C^K and the Paley-Wiener Theorem gives us a sharp characterization of the range of L[q̃f_{p(Z)}].

10 More exactly, with the definition of F and L used here, Lf(x + iy) extends iy ↦ Ff(−iy/2π) from iR^K to Γ_f + iR^K.
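To make the convolution criterion concrete, here is a minimal sketch for the scalar case j = 0 with Gaussian ingredients, which are assumptions of this illustration and not requirements of the text: ψf_X is taken to be a N(µ, σ²) density and f_η a N(0, σ_η²) density. Under the convention Fq(t) = ∫ q(u) exp(−2πitu) du, the transform of a N(m, v) density is exp(−2πimt − 2π²vt²), so the ratio F[ψf_X]/Ff_η has modulus exp(−2π²(σ² − σ_η²)t²). The helper names and the grid used to probe the decay are hypothetical, and the grid test is only a heuristic for "vanishing at infinity".

```python
import cmath
import math


def ft_normal(t, mean, var):
    """Fourier transform (2 pi convention) of the N(mean, var) density at t."""
    return cmath.exp(-2j * math.pi * mean * t - 2.0 * math.pi ** 2 * var * t * t)


def ratio(t, mu, sigma2, sigma_eta2):
    """F[psi f_X](t) / F f_eta(t) in the Gaussian illustration (j = 0)."""
    return ft_normal(t, mu, sigma2) / ft_normal(t, 0.0, sigma_eta2)


def vanishes_at_infinity(mu, sigma2, sigma_eta2, grid=(1.0, 2.0, 3.0)):
    """Heuristic check: modulus strictly decreasing on the grid and tiny at its end."""
    mods = [abs(ratio(t, mu, sigma2, sigma_eta2)) for t in grid]
    decreasing = all(a > b for a, b in zip(mods, mods[1:]))
    return decreasing and mods[-1] < 1e-6


if __name__ == "__main__":
    # Target less concentrated than the noise: the ratio decays.
    print(vanishes_at_infinity(mu=0.5, sigma2=2.0, sigma_eta2=1.0))
    # Target more concentrated than the noise: the ratio blows up.
    print(vanishes_at_infinity(mu=0.5, sigma2=0.5, sigma_eta2=1.0))
```

In this Gaussian sketch the ratio can only be the transform of an integrable function when σ² > σ_η², since otherwise it does not vanish at infinity.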
So in the next Theorem, we consider a general case and a special case where Supp(p(Z)) is compact.

Theorem 5.4 If Assumptions 1, 2 and 5 hold and if Ff_η has isolated zeros then:

(i) Q ≠ ∅ if and only if there exists a function q ∈ L²(p(Z)) such that F[ψf_X]^{(j)}/Ff_η = (−1)^{|j|} F[qf_{p(Z)}].

(ii) Q ≠ ∅ only if F[ψf_X]^{(j)}/Ff_η ∈ C_0^0(R^K).

(iii) If F[ψf_X]^{(j)}/Ff_η ∈ C_0^0(R^K) ∩ L¹(R^K) then Q ≠ ∅ if and only if F̄[F[ψf_X]^{(j)}/Ff_η] ∈ L¹(R^K). And in that case,

\[ Q = \Big\{ q \in L^1(Z) : E(q(Z)|p(Z)=v)\,f_{p(Z)}(v) = (-1)^{|j|}\,\bar{\mathcal{F}}\big[\mathcal{F}[\psi f_X]^{(j)}/\mathcal{F}f_\eta\big](v) \Big\}. \]

Moreover, if the support of p(Z) is compact and ∫ f_η(x) e^{−t′x} dx < ∞ for every t ∈ R^K, then:

(iv) Q ≠ ∅ only if:
(a) L[ψf_X]^{(j)}/Lf_η is a holomorphic function on C^K, bounded on Γ + iR^K for every compact Γ included in R^K,
(b) [L[ψf_X]^{(j)}/Lf_η](x + iy) tends to zero when |y| tends to infinity for any x ∈ R^K,
(c) there exist C, N such that:

\[ \left| \frac{\mathcal{L}[\psi f_X]^{(j)}(x+iy)}{\mathcal{L}f_\eta(x+iy)} \right| \le C\,(1+\|x+iy\|)^N \exp\Big( \sup_{z\in \mathrm{Supp}(Z)} \|p(z)\| \cdot \|y\| \Big). \]

When a root-N regular consistent estimator of θ0 exists, the semiparametric efficiency bound of θ0 is V* = V(εq*(Z) + R(X)), with

\[ q^*(Z) = \frac{(-1)^{|j|}}{E(\varepsilon^2|Z)\,E[1/E(\varepsilon^2|Z)\,|\,p(Z)]}\, \frac{d\mathcal{F}^{-1}\big[\mathcal{F}(\psi f_X)^{(j)}/\mathcal{F}f_\eta\big]}{dF_{p(Z)}}(p(Z)) + \frac{E\big[E(\varepsilon R(X)|Z)/E(\varepsilon^2|Z)\,\big|\,p(Z)\big]}{E(\varepsilon^2|Z)\,E[1/E(\varepsilon^2|Z)\,|\,p(Z)]} - \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} \]

and

\[ R(X) = \big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X) - E\big[\big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X)\big]. \]

The first part of the previous Theorem enables us to distinguish three cases. First, if F[ψf_X]^{(j)}/Ff_η ∉ C_0^0(R^K), it ensures that Q = ∅. Second, if F[ψf_X]^{(j)}/Ff_η ∈ C_0^0(R^K) ∩ L¹(R^K) and F̄[F[ψf_X]^{(j)}/Ff_η] ∈ L¹(R^K), then Q ≠ ∅. Otherwise, if F[ψf_X]^{(j)}/Ff_η ∈ C_0^0(R^K) \ L¹(R^K) or F̄[F[ψf_X]^{(j)}/Ff_η] ∉ L¹(R^K), we remain agnostic about the emptiness of Q: we only know that there is a linear form S ∈ S′ with a Fourier transform¹¹ equal to F[ψf_X]^{(j)}/Ff_η (up to a canonical isomorphism), but this linear form does not necessarily correspond to a q ∈ L¹(Z) (in the sense that S ∈ {U ∈ S′ s.t.
∃q ∈ L¹(Z) s.t. U(ϕ) = ∫ (−1)^{|j|} E(q(Z)|p(Z) = v) f_{p(Z)}(v) ϕ(v) dv for ϕ ∈ S}). In this last case, it is always possible to consider F[ψf_X]^{(j)}/Ff_η as an element of S′,¹² and to verify that S = F̄[F[ψf_X]^{(j)}/Ff_η] is in L¹(R^K) (up to the usual isomorphism).

The second part of the previous Theorem considers the case where the support of p(Z) is bounded and where η has thin tails (such as the normal distribution or a distribution with bounded support). In particular, when X has a compact support then p(Z) and η also have compact supports, and then Proposition (iii) of the previous theorem gives a criterion on the distribution of η to ensure that Q ≠ ∅.

Example: Average under counterfactual distribution. Let Z and η be two independent centered normal variables with variances σ_Z² and σ_η². If X = Z + η, then X ∼ N(0, σ_Z² + σ_η²). If we are interested in the average of m(X) under the counterfactual distribution N(µ, σ²), then

\[ \psi(x) = \frac{\sqrt{\sigma_Z^2+\sigma_\eta^2}}{\sigma} \exp\Big( -\big[ (x-\mu)^2/\sigma^2 - x^2/(\sigma_Z^2+\sigma_\eta^2) \big]/2 \Big). \]

In such a case, with the definition of L given above, Lψf_X(z) = exp(−µz + σ²z²/2) and Lf_η(z) = exp(σ_η²z²/2) with Γ_{ψf_X} = Γ_{f_η} = R, so the ratio Lψf_X(z)/Lf_η(z) is equal to exp(−µz + σ²z²/2 − σ_η²z²/2) and verifies condition (i) of Theorem 5.4 if and only if σ² > σ_η². In such a case, Q contains the single element

\[ q(z) = \frac{\sigma_Z}{\sqrt{\sigma^2-\sigma_\eta^2}} \exp\left( \frac{z^2}{2\sigma_Z^2} - \frac{(z-\mu)^2}{2(\sigma^2-\sigma_\eta^2)} \right). \]

The condition σ² > σ_η² means that the counterfactual distribution cannot be more concentrated than the noise in the first stage. We recover a result already established with Theorem 5.3.

Interesting results can also be obtained for additive first stages that do not verify Assumption 4. For instance, suppose that Z, η and X follow Gamma distributions with parameters (k_Z, τ), (k_η, τ) and (k_Z + k_η, τ).
If the counterfactual distribution is a Gamma distribution with parameters (k′, τ′), then the ratio Lψf_X(z)/Lf_η(z) is equal to (1 + τz)^{k_η}(1 + τ′z)^{−k′}, and Condition (ii) of Theorem 5.4 holds if and only if k′ > k_η. Because the value of the scale parameter τ′ does not matter, it appears that in this last case the mean and the variance of the counterfactual distribution can be chosen arbitrarily small (because both depend on τ′), but its skewness and its kurtosis (which depend only on the shape parameter, and decrease with it) have to be smaller than those of η.

11 See Schwartz (1997) for a definition of the Fourier transform of a linear form.

12 To do so, verify that for every compact K, if ϕ_n is a sequence of infinitely differentiable functions with support included in K that tends uniformly to 0, then S(ϕ_n) tends to 0. This first step ensures that S is identifiable to a measure (cf. Theorem III, Chapter I of Schwartz (1997)). To be identifiable to an integrable function, such a measure also has to be dominated by the Lebesgue measure and of finite total variation, by the Radon-Nikodym Theorem. This leads to check also that sup_{ϕ∈S, ||ϕ||_∞<1} S(ϕ) < ∞ and sup_{A negligible Borel set} inf_{ϕ∈S, K compact s.t. ϕ ≥ 1_{A∩K}} S(ϕ) = 0.

The case of average derivatives leads to elegant results, as illustrated in the following proposition.

Proposition 5.5 (Average derivatives under additive first stage) If Assumptions 1, 2 and 5 hold, if Ff_η has isolated zeros and if θ0 = E(m^{(j)}(X)), then a regular and root-N consistent estimator of θ0 exists only if the weak derivative of f_{p(Z)} is a locally integrable function (denoted f^{(j)}_{p(Z)} in the sequel) and if

\[ E\left[ \left( \frac{f^{(j)}_{p(Z)}(p(Z))}{f_{p(Z)}(p(Z))} \right)^2 \right] < \infty. \]

In that case, the semi-parametric efficiency bound is V(εq*(Z) + m^{(j)}(X)), with

\[ q^*(Z) = \frac{(-1)^{|j|}}{E(\varepsilon^2|Z)\,E[1/E(\varepsilon^2|Z)\,|\,p(Z)]}\, \frac{f^{(j)}_{p(Z)}}{f_{p(Z)}}(p(Z)) + \frac{E\big[E(\varepsilon m^{(j)}(X)|Z)/E(\varepsilon^2|Z)\,\big|\,p(Z)\big]}{E(\varepsilon^2|Z)\,E[1/E(\varepsilon^2|Z)\,|\,p(Z)]} - \frac{E(\varepsilon m^{(j)}(X)|Z)}{E(\varepsilon^2|Z)}, \]

and when p is one to one this expression reduces to:

\[ q^*(Z) = (-1)^{|j|}\, \frac{f^{(j)}_{p(Z)}(p(Z))}{f_{p(Z)}(p(Z))}. \]
The proof relies on Condition (i) of Theorem 5.4 and is given in Appendix. We briefly illustrate the implication of the previous proposition for the average marginal effect.

Example: Average marginal effect. Let X = Z + η with η drawn from a uniform distribution on [0; 1], Z drawn from a uniform distribution on [0; α] and η ⊥⊥ Z. The correlation between X and Z is (α²/(α² + 1))^{1/2}, so α is a measure of the strength of the instrument. In such cases, we have shown that E(m′(X)) is always identified (cf. Proposition 3.2). But the previous proposition ensures that E(m′(X)) is not root-N estimable whatever the value of α. Indeed f_Z is not absolutely continuous on R and its weak derivative is the signed measure (δ_0 − δ_α)/α (where δ_a is the Dirac mass at a) and not a function. Such an example shows that if the densities of Z and η are not "sufficiently" regular, root-N estimation of the average marginal effect is lost even in case of identification and in presence of strong instruments.

A subsequent question concerns the respective minimal regularity of f_Z and f_η. As soon as Ff_X′(t) = (2iπt)Ff_X(t), E(m′(X)) is root-N estimable only if Z admits an absolutely continuous density f_Z on R such that ∫ f_Z′(z)²/f_Z(z) dz < ∞. Note that f_Z must at least vanish at the boundary of its support, as is often assumed, and this vanishing needs to be sufficiently smooth to ensure the integrability condition. If Z ∼ B(α, β), then α > 2 and β > 2 is a necessary condition for the existence of a consistent root-N estimator. Similarly, if Z ∼ Γ(k, τ) or if Z ∼ Weibull(k, λ), then k > 2 is a necessary condition for the existence of a consistent root-N estimator. With a Pareto distribution, root-N consistent estimators never exist. But normal or Student distributions for Z ensure that Q is not empty as soon as f_X is absolutely continuous. In that case, the semiparametric efficiency bound is V[m′(X) − (Y − m(X)) f_Z′(Z)/f_Z(Z)] and the parameter θ0 is also equal to −E(Y f_Z′(Z)/f_Z(Z)).
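The two quantitative statements of this example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: `corr_uniform_example` recomputes the correlation between X and Z from the variances of the two uniform variables, and `beta_location_information` evaluates ∫ f_Z′²/f_Z for Z ∼ B(a, b) from the identity f_Z′/f_Z = (a − 1)/z − (b − 1)/(1 − z) and standard Beta moments. Both function names are hypothetical, and returning infinity when a ≤ 2 or b ≤ 2 is a flagging convention of this sketch: in those cases either the integral diverges or (for a = 1 or b = 1) the density jumps at the boundary of its support.

```python
import math


def corr_uniform_example(alpha):
    """Correlation of X = Z + eta and Z, with Z ~ U[0, alpha], eta ~ U[0, 1] independent."""
    var_z = alpha ** 2 / 12.0
    var_eta = 1.0 / 12.0
    # Cov(X, Z) = Var(Z) by independence of Z and eta.
    return var_z / math.sqrt((var_z + var_eta) * var_z)


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_location_information(a, b):
    """Integral of f'(z)^2 / f(z) over [0, 1] for the Beta(a, b) density f.

    Returns math.inf when a <= 2 or b <= 2 (flagging convention, see lead-in).
    """
    if a <= 2.0 or b <= 2.0:
        return math.inf
    base = log_beta(a, b)
    inv_z2 = math.exp(log_beta(a - 2.0, b) - base)           # E[Z^-2]
    inv_1mz2 = math.exp(log_beta(a, b - 2.0) - base)         # E[(1-Z)^-2]
    inv_cross = math.exp(log_beta(a - 1.0, b - 1.0) - base)  # E[1/(Z(1-Z))]
    return ((a - 1.0) ** 2 * inv_z2
            - 2.0 * (a - 1.0) * (b - 1.0) * inv_cross
            + (b - 1.0) ** 2 * inv_1mz2)


if __name__ == "__main__":
    print(corr_uniform_example(1.0))            # equals sqrt(1/2)
    print(beta_location_information(3.0, 3.0))  # finite
    print(beta_location_information(2.0, 3.0))  # inf: condition fails
```

For a = b = 3 the density is 30z²(1 − z)² and the integrand reduces to 120(1 − 2z)², which gives a direct way to check the closed form by hand.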
More generally for multivariate X, if Assumptions 2 and 5 hold and if the Fourier transform of fη has isolated zeros, then a regular and root-N consistent estimator of E (∂xk m(X)) exists only if pk 7→ f absolutely continuous (for p(Z) (p1 , ..., pK ) is almost-every p1 , ..., pk−1 , pk+1 , ..., pK ) and if E (∂pk fp(Z) )2 (p(Z)) 2 fp(Z) < ∞. In that case, the semiparametric efficiency bound is V(εq ∗ (Z) + ∂xk m(X)) with q ∗ (Z) = 6 ∂pk fp(Z) −1 (p(Z)) E(ε2 |Z)E [1/E(ε2 |Z)|p(Z)] fp(Z) E [E(ε∂xk m(X)|Z)/E(ε2 |Z)|p(Z)] E(ε∂xk m(X)|Z) + − . E(ε2 |Z)E [1/E(ε2 |Z)|p(Z)] E(ε2 |Z) Conclusion In this paper we show that the parameter θ0 = E(ψ(X)m(j) (X)) with m unknown and E(Y − m(X)|Z) = 0 is often not root-N estimable. We show that root-N estimability of θ0 is closely related to the fact that [ψfX ](j) /fX belongs the range of T ∗ , with T ∗ (q)(X) = E(q(Z)|X). This condition is not nested with the completeness condition that ensures the identification of m. If the first stage is exponential or additive, our condition is quite restrictive when the parameter of interest is an average under a counterfactual distribution of X. It appears that for some DGPs and some counterfactual distributions, θ0 is never root-N estimable, and sometimes for others, θ0 is root-N estimable if and only if the strength of the instruments is above a critical value. When the parameter of interest is an average marginal effect, root-N estimation is achievable if the first stage is additive, even if the completeness condition does not hold, as soon as the density of the instrument admits a derivative that vanishes at the boundary of its support. 30 √ θ0 Shift in Mean ρ E exp − 12 [(x − m)2 − x2 ] m(X) ρ 6= 0 yes Shift in Variance √ 1/ σ 2 E exp − 12 [x2 (1/σ 2 − 1)] m(X) ρ2 ≤ 1 − σ 2 no ρ2 > 1 − σ 2 yes Average marginal effect ρ2 = 0 no 0 2 31 E(m (x)) ρ >0 Density-weighted average derivative, ... 
E (2π)−k/2 exp (−kX 2 /2) m0 (X) , k ∈ N ρ2 < 2 ρ = N -estimable efficiency bound V (Y − m(X)) exp mρ Z − ρ2 σ 2 +ρ2 −1 h 22 h ρ z V (Y − m(X)) exp 2(1−ρ 2) 1 − ii σ 2 ρ2 σ 2 +ρ2 −1 V ε Zρ + m0 (X) yes k or ρ2 = 0 k+1 k with k ≥ 1 k+1 yes (k + 1)2 V ε Zρ + k k+1 yes (k + 1)2 V ρ2 > m2 2ρ2 no Table 1: Some examples when (X, Z) ∼ N (0, Σ) with Σ11 = Σ22 = 1 and Σ12 = ρ 1 2 0 k exp (−kX /2) m (X) 2π ! 1−ρ2 εZ exp Z 2 (k+1)ρ 2 −k + √ 1 k exp (−kX 2 /2) m0 (X) 2π √ 7 7.1 Appendix A: Proofs of theorems Proof of Theorem 3.1 Without completeness assumption, T is not invertible, m is not point identified and then the set identification is {m + l, l ∈ Ker (T ) ∩ (M − {m})} = {m} + Ker (T ) ∩ (M − {m}). It follows that the set identification of m(j) is m(j) + l(j) , l ∈ Ker (T ) ∩ (M − {m}) . Because ψ is known (up to fX which is identified), the set identification of θ0 is {θ0 + E(ψl(j) ), l ∈ Ker (T ) ∩ (M − {m})}. When M is a vectorial space, M − {m} = M and because l 7→ E(ψl(j) ) is a linear form on vectorial space Ker (T ) ∩ M and its range is {0} or R and this proves (i). To prove (ii), remark that θ0 is identified if and only if E(ψl(j) ) = 0 for every l ∈ Ker (T ) ∩ (M − {m}), i.e. ⊥ ψ ∈ l(j) s.t. l ∈ Ker (T ) ∩ (M − {m}) . 7.2 Proof of Theorem 3.2 We first prove the result under the first condition. The region of identification of θ0 is {E(m0 (X) + l0 (X)), l ∈ M ∩ Ker (T )}. For all l ∈ M ∩ Ker (T ), we have both equalities: E(l(X)|p(Z) = u) = 0 ∂u E(l(X)|p(Z) = u) = E(l0 (X)|p(Z) = u) The second equality is a simple consequence of the dominated convergence Theorem using PS s 0 s=1 as x for the domination. All together, these equalities imply that E(l (X)) = 0, and next the region of identification is reduced to a single point. 
Under the second condition, we have: Z sup(I) Z E(l(X)|p(Z) = p) = Z sup(I)+p Z l(p+u+v)dFν (v)du = inf(I) l(u+v)dFν (v)du, inf(I)+p Supp(ν) 32 Supp(ν) R R derivation with respect to p gives: Supp(ν) l(sup(I) + p + v)dFν (v) − Supp(ν) l(inf(I) + R R sup(I) p + v)dFν (v) = Supp(ν) inf(I) l0 (p + u + v)dudFν = E(l0 (X)|p(Z) = p), And next, θ0 is identified. 7.3 Proof of Theorem 4.1 7.3.1 Some Lemmas Lemma 7.1 (Derivation in quadratic mean "with respect to m(X)") If Assumption 1 and 2 holds then r(e, x, z) = ∂e fε,X,Z (e, x, z)/fε,X,Z (e, x, z)1{fε,X,Z (e, x, z) > 0} is well defined and square integrable, and for any l ∈ L2 (X) such that rl ∈ L2 (ε, X, Z), we have: 2 Z l(x) 1 1/2 1/2 1/2 lim f (e − tl(x), x, z) − fε,X,Z (e, x, z) − r(e, x, z)fε,X,Z (e, x, z) dν(e, x, z) = 0 t→0 t ε,X,Z 2 Moreover, E(r|X, Z) = 0. 1/2 Proof: Let su (e, x, z) = fε,X,Z (e−u, x, z), because x 7→ x2 is locally Lipschitz continuous on R+ , u 7→ s2u (y, x, z) is absolutely continuous and the chain rule applies: i.e. ∂u s2u (e, x, z) = 2su (e, x, z)∂u su (e, x, z), and next r(e−u, x, z) = ∂u s2u (e, x, z)/s2u (e, x, z)1su (e,x,z)>0 = 2∂u su (e, x, z)/su (e, x is well defined. The square integrability of r is ensured by Assumption 2.(vii). 
Next, we have: stl(x) (e, x, z) − s0 (e, x, z) = = = R tl(x) ∂u su (e, x, z)dλ(u) 0 R tl(x) 1/2 1 r(e − u, x, z)fε,X,Z (e − u, x, z)dλ(u) 2 0 1/2 tl(x) R 1 r(e − utl(x), x, z)fε,X,Z (e − utl(x), x, z)dλ(u) 2 0 Next, the Cauchy-Schwartz inequality and the Fubini Theorem ensure that: R 2 1 1/2 2 stl(X) − s0 = l (X) 0 r(e − utl(x), x, z)fε,X,Z (e − utl(x), x, z)dλ(u) h R i 1 2 1 2 ≤ 4 E l (X) 0 r (ε − utl(X), X, Z)fε,X,Z (ε − utl(X), X, Z)dλ(u) R1R R = 14 0 Supp(X,Z) R r2 (e − utl(x), x, z)fε,X,Z (e − utl(x), x, z)dλ(e) l2 (X)dµ(x, z)dλ(u) R1R R = 41 0 Supp(X,Z) R r2 (e, x, z)fε,X,Z (e, x, z)dλ(e) l2 (X)dµ(x, z)dλ(u) 1 E t2 h 2 i 1 E 4 = 41 E (r2 (ε, X, Z)l2 (X)) < ∞ 1/2 X, Z-almost-surely, t 7→ fε,X,Z (e − t, X, Z) is absolutely continuous and then admits a 33 derivative at t = 0 for almost-every e ∈ R, then: 2 l(X) 1 stl(X) (ε, X, Z) − s0 (ε, X, Z) − r(ε, X, Z)s0 (ε, X, Z) → 0 a.s. t 2 The Proposition 2.29 of van der Vaart (2000) ensures the nullity of the limit given in the Lemma. Next, consider the case where l(X) = 1 and use the fact that ν(e, x, z) = λ(e) ⊗ ν(x, z) to deduce that X, Z-almost surely: Z lim t→0 2 1 1/2 1 1/2 1/2 (f (e − t) − fε|X,Z (e)) − r(e, X, Z)fε|X,Z (e) dλ(e) = 0. t ε|X,Z 2 Next, bound E(r|X, Z): R − r(e, X, Z)fε|X,Z (e)dλ(e) |E(r|X, Z)| = h1t fε|X,Z (e − t) − fε|X,Z (e) ih i R 1/2 1/2 1/2 1/2 1/2 f (e) f (e − t) + f (e) dλ(e) ≤ 1t fε|X,Z (e − t) − fε|X,Z (e) − r(e,X,Z) ε|X,Z ε|X,Z 2 R i ε|X,Z h 1/2 1/2 1/2 + 12 r(e, X, Z)fε|X,Z (e) fε|X,Z (e − t) − fε|X,Z (e) dλ(e) h 1/2 i2 R 1 1/2 1/2 1/2 1 ≤ 2 fε|X,Z (e − t) − fε|X,Z (e) − 2 r(e, X, Z)fε|X,Z (e) dλ(e) t 1/2 2 R 1/2 1/2 1 2 fε|X,Z (e − t) − fε|X,Z (e) dλ(e) + 2 E(r (ε, X, Z)|X, Z) Theorem 9.5 of Rudin (1987) ensures that limt→0 2 R 1/2 1/2 fε|X,Z (e − t) − fε|X,Z (e) dλ(e) = 0. 
Lemma 7.2 (DQM "with respect to the nuisance parameter fε,X,Z ") If Assumption 2 holds, and if ft (e, x, z) is a density with respect to ν and s ∈ L2 (ε, X, Z) such that: 1 lim 2 t→0 t Z ft (e, x, z)1{fε,X,Z (e,x,z)=0} dν(e, x, z) = 0, 2 Z 1 1 1/2 1/2 1/2 lim f (e, x, z) − fε,X,Z (e, x, z) − s(e, x, z)fε,X,Z (e, x, z) dν(e, x, z) = 0 t→0 t t 2 then for any ϕ ∈ Φ: 2 Z s − rϕ 1 1/2 1/2 1/2 lim f (e − tϕ(x), x, z) − fε,X,Z (e, x, z) − fε,X,Z (e, x, z) dν(e, x, z) = 0 t→0 t t 2 Let W = (X, Z) and w = (x, z). R Assumptions imply that E(s2 (ε, W )) = s2 (e, w)fε,W (e, w)dν(e, w) < ∞. 34 Moreover, we have: 2 R 1 1/2 1/2 1/2 1 f (e − tϕ(x), w) − f (e, w) − (s(e, w) + r(e, w)ϕ(x))f (e, w) dν(e, w) t ε,W ε,W t 2 2 R 1 1/2 1/2 1/2 1 f (e − tϕ(x), w) − f (e − tϕ(x), w) − s(e − tϕ(x), w)f (e − tϕ(x), w) ≤2 dν(e, w) t ε,W ε,W t 2 2 R 1 1/2 1/2 1/2 fε,W (e − tϕ(x), w) − fε,W (e, w) − 21 r(e, w)ϕ(x)fε,W (e, w) dν(e, w) +2 t 2 R 1/2 1/2 +2 s(e − tϕ(x), w)fε,W (e − tϕ(x), w) − s(e, w)fε,W (e, w) dν(e, x, z) The first and the third terms tend to zero because ν(e, x, z) = λ(e)⊗(x, z) and the Lebesgue measure λ is translation invariant. The second term tends to zero by assumption. Lemma 7.3 (Density of bounded scores) Let S = {s ∈ L2 (ε, X, Z) : E(s) = 0, E(εs|Z) = 0}, Under Assumption 2, Sb = S ∩ L∞ (ε, X, X) is dense in S for the norm of L2 (ε, X, Z). Consider s ∈ S and sn = s1{|s|<n} − E(s1{|s|<n} ). We have sn ∈ L20 (ε, X, Z) ∩ L∞ (ε, X, Z) and by dominated convergence Theorem, lim ||sn − s||22 = lim E (sn − s)2 = 0. n n Let introduce pn the orthogonal projection of sn on closure of {q(Z)ε+c : q ∈ L2 (Z), c ∈ R}, defined by: E [(sn − pn )(q(Z)ε + c)] = 0 ∀c, q ∈ R × L2 (Z). We can check that pn = E [sn ε|Z] ε. E[ε2 |Z] Note that by definition sn − pn is the projection of sn on S = S, so sn − pn − s ∈ S and pn ∈ S⊥ , we have by Pythagoras Theorem: ksn − sk22 = ksn − pn − sk22 + kpn k22 ≥ ksn − pn − sk22 . 
Let gk (Z) = E ε2 1{|ε|≥k} |Z , by dominated convergence Theorem: lim gk (Z) = 0, Z − a.s., k then gk (Z) converges in probability to 0. It follows that it exists kl a strictly increasing sequence in N, such that P gkl (Z) ≥ 1l ≤ 1/l. 35 Let rl (Y, X, Z) = ε1{|ε|<kl } 1{gkl (Z)<1/l} . rl converges in probability to ε because P(|rl − ε| > a) ≤ ≤ ≤ ≤ P(|rl − ε| = 6 0) P ({|ε| ≥ kl } ∩ {gkl (Z) < 1/l}) + P (gkl (Z) ≥ 1/l) P(|ε| ≥ kl ) + P gkl (Z) ≥ 1l P(|ε| ≥ kl ) + 1/l Moreover E(rl2 |Z) converges almost-surely (and then in probability) to E(ε2 |Z) because: i h |E (ε2 − rl2 |Z) | = E ε2 1{|ε|≥kl } 1{gkl (Z)<1/l} + 1{gkl (Z)≥1/l} 2 ≤ E gkl (Z)1{gkl (Z)<1/l} + E E [ε |Z] 1{gkl (Z)≥1/l} ≤ 1l + vP gkl (Z) ≥ 1l ≤ 1+v . l > 0 for l ≥ 1+v It follows that E(rl2 |Z) ≥ v − 1+v . l v Then for a given n, by continuous mapping Theorem, E(sn ε|Z)rl /E(rl2 |Z) converge in probability to pn when l tends to infinity. Moreover, 2 E(sn ε|Z)2 rl2 ≤ E(s2n |Z)E(ε2 |Z) ε1+v 2 E(rl2 |Z)2 v+ ( n ) ε2 2 2 ≤ 4n v 2 (v+ 1+v n ) By Theorem 17.4 in Jacod & Protter (2000), for every n, E(sn ε|Z)rl = 0. lim − p n l E(rl2 |Z) 2 Then for every n, let hn,l E(sn ε|Z)rl −E = E(rl2 |Z) E(sn ε|Z)rl E(rl2 |Z) We have: limkhn,l − pn k2 = 0, l and because rl2 = rl ε: E [(sn − hn,l )ε|Z] = 0 E(hn,l ) = E(sn ) = 0. It follows that sn − hn,l ∈ S ∩ L∞ (ε, X, Z) 36 . and ksn − hn,l − sk2 ≤ ksn − pn − sk2 + kpn − hn,l k2 ≤ ksn − sk2 + kpn − hn,l k2 . And then we deduce that: lim lim ksn − hn,l − sk2 = 0. n l Then s ∈ S ∩ L∞ (ε, X, Z). 7.3.2 The Proof The tangent set. For a parametric sub-model indexed by t in a neighborhood of 0 in R, let θ(t) the value of the parameter for the DGP corresponding to t. Z θ(t) = (j) ψt (x)mt (x)fY,X,Z,t (y, x, z)dλ(y, x, z) Z = (j) h (fX,t (x), x) mt (x)fY,X,Z,t (y, x, z)dλ(y, x, z) Consider a family of parametric submodel with score l ∈ S ⊂ L2 (Y, X, Z) such that θ(t) is a sufficiently smooth function of t to ensure that it exists f ∈ L2 (Y, X, Z) such that ∂t θ(0) = E(f l). 
The set of score l is called a tangent set, f is called an influence function, and such case we say that θ is differentiable with respect to the tangent set S (see van der Vaart (2000), Chapter 25). A necessary condition for existence of a regular root-N convergent estimator is that it exists such an influence function for the parameter θ = E(ψ(X)m(j) (X)). The semi-parametric efficiency bound (with respect to the tangent set S) of θ0 is simply the smallest variance among variance of the influence functions. In our case, we consider the tangent set of scores S = ∪ϕ∈Φ Sϕ = ∪ϕ∈Φ ∪c≥0 Sϕ,c , where Sϕ,c denotes the set of scores of parametric models indexed by t and defined by: Y = m(X) + tϕ(X) + ε, E(ε|Z) = 0, (ε, X, Z) has density f (t) with respect to ν such that the "true" density fε,X,Z corresponds to f0 and 1 lim 2 t→0 t Z ft (e, x, z)1{f0 (e,x,z)=0} dν(e, x, z) = 0, 2 Z 1 1 1/2 1/2 1/2 lim f (e, x, z) − f0 (e, x, z) − s(e, x, z)f0 (e, x, z) dν(e, x, z) = 0, t→0 t t 2 and Z max 2 2 2 e ft (e, x, z)dν(e, x, z), E ε s 37 < c. Lemma 7.2 implies that the sub-models considered are differentiable in quadratic means with a score of the form s − rϕ. It follows that the tangent set considered here is S0 − rΦ (with S0 corresponds to Sϕ with ϕ = 0 and r defined in Lemma 7.1). The restriction E(ε|Z) = 0 imposes that: Z qeft dν = 0 for any q ∈ L∞ (Z) and t close to 0, On the other side, −E(q(Z)εs(ε, X, Z)) = R = R qe 1t (f t − f0 ) − sf0 dν qe R 1 +2 1 t 1/2 1/2 1/2 −f − 12 sf0 0 1/2 1/2 qesf0 ft − f 1/2 dν ft 1/2 1/2 ft + f0 dν(e, x, z) The Cauchy-Schwartz inequality ensures that: |E(qεs)| ≤ R + 2 2 q e R 1/2 ft 2 2 2 + q e s f0 dν 1/2 f0 2 dν 1/2 R 1/2 R 1 t 1/2 ft − 1/2 f0 1/2 ft 2 1/2 ft − 1/2 f0 − 1/2 1 sf0 2 2 1/2 dν 1/2 dν 1/2 f0 2 1/2 p dν ≤ 2 max(c, ν)||q||∞ + q 2 e2 The triangle inequality ensures that: 1/2 R 2 2 2 √ q e s f0 dν(e, x, z) ≤ ||q||∞ c. 
So, the differentiability in quadratic and moreover mean ensures the right hand side of the previous inequality tends to 0 and next R E(qεs) = 0. Then for any s ∈ S0 we have E(εs|Z) = 0. Lemma 7.3 ensures that we can restrict without loss of generality the tangent set to Sb0 −rΦ (because the efficient influence function is the projection of any influence function on S0 −rΦ and then on Sb0 − rΦ by density of Sb0 in S0 ). R We also have q(z)(e − tϕ(x))f0 (e − tϕ(x), x, z)dν(e, x, z) = 0 for any q ∈ L∞ (Z), ϕ ∈ Φ, and t close to 0. Then, Z q(z)e Z 1 (f0 (e − tϕ(x), x, z) − f0 (e, x, z)) dν(e, x, z) = q(z)ϕ(x)f0 (e−tϕ(x), x, z)dν(e, x, z). t And translation invariance of the Lebesgue measure ensures that Z Z q(z)ϕ(x)f0 (e − tϕ(x), x, z)dν(e, x, z) = 38 q(z)ϕ(x)f0 (e, x, z)dν(e, x, z). It follows that: R |E(qϕ) − E(qεrϕ)| = qe 1t (f (e − tϕ(x), x, z) − f (e, x, z)) − ϕ(x)r(e, x, z)f (e, x, z) dν(e, x, z) R 1/2 2 ≤ q(z)2 e2 f 1/2 (e − tϕ(x), x, z) + f 1/2 (e, x, z) dν(e, x, z) 1/2 2 ϕ(x) R 1 1/2 1/2 1/2 f (e − tϕ(x), x, z) − f (e, x, z) − 2 r(e, x, z)f (e, x, z) dν(e, x, z) t 1/2 1/2 R 1/2 2 R + q(z)2 e2 s2 (e, x, z)f (e, x, z)dν(e, x, z) f (e − tϕ(x), x, z) − f 1/2 (e, x, z) dν(e, x, z) 1/2 Lemma 7.1 ensures the differentiability in quadratic mean of t 7→ f0 (e − tϕ(x), x, z), and we conclude that for any ϕ ∈ Φ, E(εrϕ|Z) = E(ϕ|Z). Differentiability of θ with respect to the tangent set For any s ∈ S0b and ϕ ∈ Φ, the score s − rϕ is generated for instance by the parametric sub-model ft (e, x, z) = f0 (e − tϕ(x), x, z) [1 + ts(e, x, z)] for |t| ≤ ||s||∞ . This sub-model R generates a parametric model for the marginal density of X: fX,t = ft (e, x, z)dλ(e)dρ(z), where ρ is the pushforward measure of µ under πZ (x, z) = z. The projected model fX,t is also differentiable in quadratic mean (cf. Proposition 5 of Appendix A.5 of Bickel et al. (1993)). 
A direct calculation ensures that fX,t (x) = fX (x)(1 + tb(x)) and b is the score of t 7→ fX,t defined by b(X) = E(s − rϕ|X) = E(s|X) (remember that E(r|X) = 0).Let ψt (x) = h(fX,t (x), x) with fX,t (x) = fX (x)(1+tb(x)), its derivative at t = 0 is ∂t ψt (x)|t=0 = ∂fX ψ(x)b(x)fX (x). Then for any b ∈ L∞ (X), we have: θ(t)−θ t = 1 t R (ψt (x) − ψ(x))m(j) (x)fX,t (x)dρ(x) R + 1t ψ(x)m(j) (x)(fX,t (x) − fX (x))dρ(x) R + ψt (x)ϕ(j) (x)fX,t (x)dρ(x) R R R The term ψt (x)ϕ(j) (x)fX,t (x)dρ(x) is ψt (x)ϕ(j) (x)fX (x)dρ(x)+t ψt (x)b(x)ϕ(j) (x)fX (x)dρ(x), and then dominated convergence Theorem ensures that Z lim t→0 (j) ψt (x)ϕ (x)fX (x)dρ(x) = Z ψ(x)ϕ(j) (x)fX (x)dρ(x) = E(ψϕ(j) ). The fact that ψm(j) ∈ L2 (X) ensures that 1 lim t→0 t Z ψ(x)m(j) (x)(fX,t (x) − fX (x))dρ(x) = E(ψm(j) b) = E(ψm(j) (s − rϕ)). And last, Assumption 2.(iv) ensures that 1 lim t→0 t Z (ψt (x) − ψ(x))m(j) (x)fX,t (x)dρ(x) = E(∂fX ψfX bm(j) ) = E(∂fX ψfX (s − rϕ)m(j) ). 39 So: θ(t) − θ = E [∂fX ψfX + ψ] m(j) (s − rϕ) + ψϕ(j) t→0 t √ If it exists a regular n-consistent estimator θb of θ, then Theorem 2.1 of van der Vaart (1991) ensures that: lim 1. θ(t) is differentiable with respect to the tangent set, i.e. it exists f ∈ L2 (ε, X, Z) such that limt θ(t)−θ = E(f (s − rϕ)) for any s ∈ S0 and ϕ ∈ Φ, t 2. and the asymptotic distribution of θb is the convolution of a distribution with a normal N (0, V ∗ ), with V ∗ = V(ΠS0 −rΦ (f )), where ΠS0 −rΦ is the orthogonal projection on S0 − rΦ. Note that in the previous claim, f is not necessarily unique, but ΠS0 −rΦ (f ) does not depend on the choice of f . Deriving the conditions and the form of semi-parametric efficiency bound At √ this stage, we have show that a necessary condition for existence of regular N -consistent estimator of θ is the existence of an influence function f . If we consider the parametric sub-models with scores Sb0 , the previous derivation of θ(t) and the projection on then f ∈ [∂fX ψfX + ψ] m(j) + S⊥ 0. 
Usual algebra about projections on convex sets in Hilbert spaces ensures that S_0^⊥ = {q(Z)ε + c : q ∈ L²(Z), c ∈ R}.

Now if we consider parametric sub-models with scores in S^b_ϕ = S^b_0 − rϕ, the existence of a √n-consistent and regular estimator of θ implies that there exist q ∈ L²(Z) and c ∈ R such that:

∀ϕ ∈ Φ : E(ψϕ^{(j)}) = E((qε + c)rϕ).

But we know that E(εrϕ|Z) = E(ϕ|Z) and E(r|X, Z) = 0, so we get:

∀ϕ ∈ Φ : E(ψϕ^{(j)}) = E(T*(q)ϕ).

Then, up to (−1)^{|j|}, T*(q)f_X is the weak derivative of order (j) of ψf_X (in the sense of the linear forms on Φ, cf. Schwartz (1997), Chapters II and III). It follows that Assumption 3 is necessary for the existence of a regular and root-N consistent estimator. So, the expression of the semi-parametric efficiency bound is:

\[ V^* = \min_{q\in Q} V\Big( \varepsilon q(Z) + \big(\psi(X) + \partial_{f_X}\psi(X) f_X(X)\big) m^{(j)}(X) \Big). \]

Because Q is a closed convex subset of L²(Z) and the objective function is a quadratic form of q that tends to +∞ when ||q||_{L²(Z)} → +∞, this minimisation admits a unique solution q*.

Representation of θ under Assumption 2.(ix). If we have rm ∈ L²(ε, X, Z), then the parametric sub-model defined by Y = (1 + t)m(X) + ε, with E(ε|Z) = 0, (ε, X, Z) ∼ f_t(e, x, z) and f_t^{1/2} differentiable in quadratic mean, is regular. Following the same reasoning as previously, we know that the existence of a regular estimator of θ implies that θ = E(ψ(X)m^{(j)}(X)) = E(q(Z)m(X)) for any q ∈ Q. So, the parameter can be expressed as the expectation of q*(Z)Y.

7.4 Proof of Proposition 4.2

In Theorem 4.1, the bound is the variance of the sum of a function R(X) and an element of εL²(Z). A projection of R(X) on εL²(Z) and the Pythagorean Theorem then ensure that:

\[
\begin{aligned}
V^* &= \min_{q\in Q} E\big[(\varepsilon q(Z) + R(X))^2\big] \\
&= \min_{q\in Q} E\left[ \varepsilon^2 \Big( q(Z) + \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} \Big)^2 \right] + E\left[ \Big( R(X) - \varepsilon \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} \Big)^2 \right] \\
&= \min_{q\in Q} E\left[ E(\varepsilon^2|Z) \Big( q(Z) + \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} \Big)^2 \right] + E\left[ \Big( R(X) - \varepsilon \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} \Big)^2 \right].
\end{aligned}
\]

The second term does not depend on q, so the minimization on Q only concerns the first term.
This first term could be expressed as a function of a generalized inverse of an operator K ∗. 2 varies in T defined by: When q varies in Q then s = E(ε2 |Z) q(Z) + E(εR(X)|Z) E(ε2 |Z) ( T = [ψfX ](j) s ∈ L2 (Z) : K ∗ (s) = (−1)|j| + K∗ fX E(εR(X)|Z) p E(ε2 |Z) !) And next: " 2 # 2 E(εR(X)|Z) min E E(ε2 |Z) q(Z) + = min E s (Z) . s∈T q∈Q E(ε2 |Z) So s∗ of minimum norm in T which is the preimage of we are lead back tofind the element (j) √ 2 (−1)|j| [ψffXX] + K ∗ E(εR(X)|Z) under K ∗ . By definition, s? (Z) = Arg mins∈T E(s2 (Z)), E(ε |Z) (j) E(εR(X)|Z) ∗† |j| [ψfX ] ∗ √ 2 is the generalized inverse K (−1) +K which is well defined if fX E(ε |Z) 41 (j) and only if (−1)|j| [ψffXX] ∈ R(K ∗ ) = R(T ∗ ). And then we get the formula (i) and (ii) of the Proposition 4.2. Moreover, Corollary 3.2.2 in Groetsch (1977) ensures that: K ∗† = limt↓0 (tI + KK ∗ )−1 K. Let h = (−1)|j| [ψffXX] (j) +K ∗ √ E(εR(X)|Z) E(ε2 |Z) −1 ∗ , under Assumption 3 it exists u such that K ∗ u = h and K † u = limt↓0 (tI + K ∗ K) K u is well-defined if and only if u ∈ R(K) + ker K ∗ (see Groetsch (1977), Chapter 3), then K ∗† h = limt↓0 (tI + KK ∗ )−1 K(tI + KK ∗ )−1 K(tI + K ∗ K)(tI + K ∗ K)−1 K ∗ u = limt↓0 K(tI + K ∗ K)−1 K ∗ u = KK † u if and only if u ∈ R(K) + ker K ∗ or equivalently h ∈ R(K ∗ K) Last because K † u is the mean square solution of the equation Kx = u, this is also the mean square solution of K ∗ Kx = K ∗ u and then K † u = (K ∗ K)† K ∗ u. The formula (iii) and (iv) of Proposition 4.2 follows. 7.5 Proof of Proposition 5.2 The first part of the proposition is given by the Theorem 1.2.4 of Groetsch (1977). The second part of the proposition is given by the Theorem 3.1.3 of Groetsch (1977) and the fact that T is normal if and only if T ∗ is normal. For the third part of the proposition, assume R that Ker ((T )) = {0} and that E(g(x)|Z) = fX|Z (x)g(x)dλ(x) where λ is the Lebesgue R measure on RK . For gn (x) = cos(n(x1 + x2 + ... 
+ x_K)), we have ∫ f_{X|Z}(x)g_n(x)dλ(x) → 0 when n tends to +∞ because the Fourier transform of Σ_{i=1}^K X_i | Z tends to zero at infinity. Moreover, the Cauchy-Schwarz inequality ensures that (∫ f_{X|Z}(x)g_n(x)dλ(x))² ≤ ∫ f_{X|Z}(x)dλ(x) ∫ f_{X|Z}(x)g_n²(x)dλ(x) ≤ 1. Then the dominated convergence Theorem ensures that ||T(g_n)||_{L²(Z)} tends to zero when n tends to ∞. On the other hand, ||g_n||²_{L²(Z)} = ∫ cos²(n(x_1 + ... + x_K)) f_{X|Z}(x)dλ(x) = 1/2 + (1/2) ∫ cos(2n(x_1 + ... + x_K)) f_{X|Z}(x)dλ(x), which tends to 1/2 when n tends to infinity. Then, Theorem 1.2.3 in Groetsch (1977) ensures that R(T) is not closed, and next, by the first part of the proposition, R(T*) is not closed.

7.6 Definition and properties of Fourier and Laplace transforms of linear forms.

As explained in Section 5.2.2, the conditions given in Theorem 5.3 are sufficient to ensure that ϱ is a Laplace transform of a linear form. We briefly present the definition of the Laplace transform of a linear form. Let S denote the Schwartz space of R^p: this is the set of complex-valued functions that are indefinitely differentiable and such that they and all their derivatives tend to 0 faster than any inverse power of the norm (see Schwartz (1997), Chapter VII):

\[ \lim_{|x|\to\infty} \prod_{i=1}^p x_i^{k_i}\, \varphi^{(j)}(x) = 0. \]

The usual Fourier transform that maps ϕ to Fϕ(t) = ∫ exp(−2πiv′t)ϕ(v)dv is a bijection on S. The topology of S is the topology of the following convergence:

\[ \varphi_j \to 0 \ \text{ if and only if } \ \forall ((k),(l)) \in (\mathbb{N}^p)^2, \quad \sup_{x\in\mathbb{R}^p} \Big| \prod_{i=1}^p x_i^{k_i}\, \varphi_j^{(l)}(x) \Big| \to 0. \]

For any continuous linear form T ∈ S′, ϕ ∈ S ↦ T(Fϕ) is well-defined because Fϕ ∈ S. This is a continuous linear form on S, denoted F[T] and called the Fourier transform of T. The Fourier transform of T ∈ S′ is related to the usual Fourier transform on S through the formula: F[T](ϕ) = T(Fϕ). For the definition of the Laplace transform of T, let F̌ϕ be defined by F̌ϕ(t) = Fϕ(t/(2π)); this also defines a bijection on S.
For any continuous linear form $T \in \mathcal{S}'$, $\varphi \in \mathcal{S} \mapsto T(\check{\mathcal{F}}_\varphi)$ is a well-defined linear form on $\mathcal{S}$, denoted $\check{\mathcal{F}}[T]$. Moreover, this linear form is continuous: $\check{\mathcal{F}}[T] \in \mathcal{S}'$. For any $T \in \mathcal{S}'$, let $\Gamma_T$ be the convex set containing 0 defined by:
$$u \in \Gamma_T \Leftrightarrow [\varphi \mapsto T(\exp(-u'\cdot)\varphi(\cdot))] \text{ is a well-defined element of } \mathcal{S}'.$$
For instance, consider the case where $T(\varphi) = \int f(x)\varphi(x)\,dx$. If $f$ is integrable with a compact support on $\mathbb{R}^p$ then $\Gamma_T = \mathbb{R}^p$, if $f(x) = \prod_{i=1}^p 1\{x_i > 0\}$ then $\Gamma_T = \mathbb{R}^p_+$, and if $f(x) = 1$ then $\Gamma_T = \{0\}$. For $u \in \Gamma_T$, $\exp(-u'\cdot)T$ denotes the element of $\mathcal{S}'$ defined by $[\exp(-u'\cdot)T](\varphi) = T(\tilde{\varphi})$ with $\tilde{\varphi}(x) = \exp(-u'x)\varphi(x)$. And $u \in \Gamma_T$ if and only if $\exp(-u'\cdot)T \in \mathcal{S}'$. The Laplace transform of $T$ is defined from $\Gamma_T$ to $\mathcal{S}'$ by:
$$L[T] : u \in \Gamma_T \mapsto \left[\varphi \mapsto \check{\mathcal{F}}[\exp(-u'\cdot)T](\varphi)\right].$$
For instance, consider the case where $T(\varphi) = \int f(x)\varphi(x)\,dx$ for $f$ a bounded function; then $L[T](u)(\varphi) = \int f(x)e^{-u'x}\left(\int e^{-iv'x}\varphi(v)\,dv\right)dx$. Then $L[T]$ is related to the usual Laplace transform $L_f : (u + iv) \mapsto \int f(x)e^{-(u+iv)'x}\,dx$ via the following canonical mapping:
$$L[T](u)(\varphi) = \int L_f(u + iv)\varphi(v)\,dv.$$
Similarly, if $T(\varphi) = \int \varphi(x)\,d\mu(x)$ for $\mu$ a signed measure, then $L[T](u)(\varphi) = \int L_\mu(u + iv)\varphi(v)\,dv$ with $L_\mu(u + iv) = \int e^{-(u+iv)'x}\,d\mu(x)$. More generally, for any $T \in \mathcal{S}'$ there exists a function $L_T$ defined on $\Gamma_T + i\mathbb{R}^p$ such that $L[T](u)(\varphi) = \int L_T(u + iv)\varphi(v)\,dv$ (Schwartz, 1997, Chapter VIII), and such a function is strongly regular: for instance, if $\Gamma_T$ is an open set, then $L_T$ is holomorphic on $\Gamma_T + i\mathbb{R}^p$. Because there exists a canonical isomorphism between $L[T] \in (\mathcal{S}')^{\Gamma_T}$ and $L_T \in \mathbb{C}^{\Gamma_T + i\mathbb{R}^p}$, it is not worthwhile to distinguish between these two objects.

7.7 Proof of Theorem 5.3

Before proving the Theorem, we prove two useful lemmas to establish (iii).

7.7.1 Lemma

Lemma 7.4 (Approximation of compact sets by polytopes) Let $S$ be a convex set in $\mathbb{R}^p$ such that $\mathrm{int}(S) \neq \emptyset$ and $K$ a compact set included in $\mathrm{int}(S)$. Then there exists a polytope $P$ such that $K \subset P \subset \mathrm{int}(S)$.
The case $p = 1$ is obvious because in that case $\mathrm{int}(S)$ is an open interval of $\mathbb{R}$, and we can choose $P = [\inf(K); \sup(K)]$. Because $K$ is bounded, there exists a closed ball of radius $r$ and center $x_0$ containing $K$; we can assume without loss of generality that $S$ is bounded (if $S$ is unbounded, replace $S$ by $S \cap B(x_0, r)$ in the following). Let $K' = \mathrm{co}(K)$ be the convex hull of $K$; $K'$ is compact because it is the range of the continuous mapping $(\lambda, x) \in \Lambda \times K \mapsto \lambda\cdot x$, where $\Lambda$ is the compact $\{\lambda \in [0;1]^p : \sum_{i=1}^p \lambda_i = 1\}$ (the Tychonoff Theorem ensures that $\Lambda \times K$ is compact in $\mathbb{R}^{2p}$). For $x, y \in \mathbb{R}^p$, $d_2(x, y) = \|x - y\|_2$ denotes the Euclidean distance between $x$ and $y$, and for non-empty $A \subset \mathbb{R}^p$, $d_2(A, y) = \inf_{x\in A} d_2(x, y)$. For any $A \subset \mathbb{R}^p$, $A^c$ denotes the complement of $A$ in $\mathbb{R}^p$. Let $d = \inf_{y\in S^c} d_2(y, K')$; compactness of $K'$ ensures that $d > 0$ (if $d = 0$ there exists a sequence $x_n \in K'$ such that $d_2(x_n, S^c) < 1/n$ and a subsequence that converges to $x_\infty \in K' \cap \mathrm{cl}(S^c) = K' \cap (\mathrm{int}(S))^c = \emptyset$, a contradiction). For any non-empty subsets $A$ and $B$ in $\mathbb{R}^p$, let $d_H(A, B) = \sup\{\sup_{x\in A} d_2(B, x), \sup_{y\in B} d_2(A, y)\}$ be the Hausdorff metric between $A$ and $B$. There exists a polytope $P$ such that $d_H(P, K') < d/2$ and $K' \subset P$ (Bronstein (2008)). For any non-empty convex $C \subset \mathbb{R}^p$ and any $x \in \mathbb{R}^p$, there exists a unique $\pi_C(x) \in \mathrm{cl}(C)$ such that $d_2(x, \pi_C(x)) = d_2(x, C)$ (Hiriart-Urruty & Lemaréchal (2001), Section A.3). For any $y \in S^c$:
$$\begin{aligned}
d \le d_2(y, \pi_{K'}(y)) &\le d_2(y, \pi_{K'}(\pi_P(y)))\\
&\le d_2(y, \pi_P(y)) + d_2(\pi_P(y), \pi_{K'}(\pi_P(y)))\\
&\le d_2(y, \pi_P(y)) + \inf_{u\in K'} d_2(\pi_P(y), u)\\
&\le d_2(y, \pi_P(y)) + \sup_{v\in P}\inf_{u\in K'} d_2(v, u)\\
&\le d_2(y, \pi_P(y)) + d_H(P, K')\\
&\le d_2(y, \pi_P(y)) + d/2.
\end{aligned}$$
It follows that $d_2(y, \pi_P(y)) \ge d/2$ for any $y \in S^c$, and then $\inf_{y\in S^c}\inf_{x\in P} d_2(x, y) > 0$. This ensures that $P \cap \mathrm{cl}(S^c) = \emptyset$ and then $P \subset \mathrm{int}(S)$.

7.7.2 Theorem 5.3

For any $q \in Q$, let $\tilde{q}(\tau_W(\tilde{Z}), W) = E(q(\tilde{Z}, W)|\tau_W(\tilde{Z}), W)$. Under Assumption 4, $\tilde{X} \perp\!\!\!\perp \tilde{Z}\,|\,\tau_W(\tilde{Z}), W$; then for any $q \in Q$, $T^*(q) = T^*(\tilde{q})$.
Moreover, under Assumption 4 the Bayes formula implies:
$$E(\tilde{q}(\tau_W(\tilde{Z}), W)|\tilde{X} = \tilde{x}, W = w) = \frac{\int \tilde{q}(t, w)g_w(t)\exp(-\mu_w(\tilde{x})'t)\,dF_{\tau_W(\tilde{Z})|W=w}(t)}{\nu_w(\mu_w(\tilde{x}))}.$$
Assumption 3 ensures that $\forall\varphi \in \Phi$:
$$\begin{aligned}
\int \psi(x)\varphi^{(j)}(x)\,dF_X(x) &= \int E(q(Z)|X = x)\varphi(x)\,dF_X(x)\\
&= \int \frac{\int \tilde{q}(t, w)g_w(t)\exp(-\mu_w(\tilde{x})'t)\,dF_{\tau(Z)|W=w}(t)}{\nu_w(\mu_w(\tilde{x}))}\varphi(\tilde{x}, w)\,dF_X(\tilde{x}, w)\\
&= \int r_w(\mu_w(\tilde{x}))\varphi(\tilde{x}, w)\,dF_X(\tilde{x}, w).
\end{aligned}$$
Then if $Q \neq \emptyset$:
$$r_W(\mu_W(\tilde{X})) = (-1)^{|j|}\frac{[\psi(X)f_X(X)]^{(j)}}{f_X(X)} = \frac{\int E(q(Z)|\tau(Z) = t, W)g_W(t)\exp(-\mu(X)'t)\,dF_{\tau(Z)|W}(t)}{\nu_W(\mu_W(\tilde{X}))}.$$
And (i) holds because $r_W(\mu_W(\tilde{X})) = E(q(Z)|X)$, which is integrable with respect to $F_{\tilde{X}|W}$.

Remark that for any $v \in \mathbb{R}^{p_W}$, $\varrho_W(\mu_W(\tilde{X}) + iv)$ is (almost surely) well defined and finite because
$$\int |\tilde{q}(t, W)g_W(t)\exp(-(\mu_W(\tilde{X}) + iv)'t)|\,dF_{\tau(Z)|W}(t) \le \nu_W(\mu_W(\tilde{X}))\,E(|q(Z)|\,|\,\mu_W(\tilde{X}), W).$$
By definition, $\varrho_W$ is the Laplace transform of the measure $M_W$ defined by $dM_W(t) = \tilde{q}(t, W)g_W(t)\,dF_{\tau_W(\tilde{Z})|W}(t)$. Domains of Laplace transforms are always of the form $\Gamma + i\mathbb{R}^p$ where $\Gamma$ is convex (see for instance Schwartz (1997), Chapter VIII). It follows that $\varrho_W$ is defined on $C_W + i\mathbb{R}^{p_W}$ where $C_W$ is a convex set such that $P(\mu_W(\tilde{X}) \in C_W|W) = 1$; then $\mathrm{Supp}(\mu_W(\tilde{X})|W) \subset \mathrm{cl}(C_W)$ and $\mathrm{int}\,\mathrm{co}\,\mathrm{Supp}(\mu_W(\tilde{X})|W) \subset \mathrm{int}\,\mathrm{cl}(C_W)$. It follows that $\Gamma_W \subset \mathrm{int}(C_W)$ (a convex set in $\mathbb{R}^d$ has the same interior as its closure, see for instance Hiriart-Urruty & Lemaréchal, 2001, pages 36 and 37). Moreover, Proposition 6 in Chapter VIII of Schwartz (1997) ensures that $\varrho_W$ is holomorphic on $\mathrm{int}\,C_W + i\mathbb{R}^{p_W}$ and then on $\Gamma_W + i\mathbb{R}^{p_W}$, and Condition (ii) holds.
For any compact $K_W$ included in $\Gamma_W$, Lemma 7.4 ensures that there exist $(u_1, \dots, u_l)$ (depending on $W$) such that $K_W \subset \mathrm{co}(u_1, \dots, u_l) \subset \Gamma_W$. Then for every $u + iv \in K_W + i\mathbb{R}^p$:
$$\begin{aligned}
|\varrho_W(u + iv)| &= \left|\int \tilde{q}(t, W)g_W(t)\exp(-(u + iv)'t)\,dF_{\tau(Z)|W}(t)\right|\\
&\le \int |\tilde{q}(t, W)|\,g_W(t)\exp(-u't)\,dF_{\tau(Z)|W}(t)\\
&\le \sum_{i=1}^l \lambda_i\int |\tilde{q}(t, W)|\,g_W(t)\exp(-u_i't)\,dF_{\tau(Z)|W}(t), \text{ with } \sum_{i=1}^l \lambda_i u_i = u\\
&\le \sum_{i=1}^l \int |\tilde{q}(t, W)|\,g_W(t)\exp(-u_i't)\,dF_{\tau(Z)|W}(t)\\
&\le \sum_{i=1}^l E(|q(Z)|\,|\,\mu(X) = u_i, W)\,\nu_W(u_i) < \infty.
\end{aligned}$$
Then (iii) holds.

Last, if $\tau(Z)|W$ is dominated by the Lebesgue measure, then there exists $f(t, w) \ge 0$ such that $E(h(\tau_W(\tilde{Z}))|W = w) = \int h(t)f(t, w)\,dt$. Let $\xi_{u,w}(t) = \tilde{q}(t, w)g_w(t)f(t, w)\exp(-u't)$. For any $v \in \mathbb{R}^{p_W}$, $\tilde{v}$ denotes the vector of $\mathbb{R}^p$ with components $\tilde{v}_k = 0$ if $|v_k| < \max_{k'=1,\dots,p_W}|v_{k'}|$ and $\tilde{v}_k = \pi/v_k$ if $|v_k| = \max_{k'=1,\dots,p_W}|v_{k'}|$. Then
$$\varrho_W(u + iv) = \int_{\mathbb{R}^p}\xi_{u,W}(t)\exp(-iv't)\,dt = -\int_{\mathbb{R}^p}\xi_{u,W}(t)\exp(-iv'(t - \tilde{v}))\,dt = -\int_{\mathbb{R}^p}\xi_{u,W}(t + \tilde{v})\exp(-iv't)\,dt.$$
It follows that:
$$2|\varrho_W(u + iv)| \le \int |\xi_{u,W}(t) - \xi_{u,W}(t + \tilde{v})|\,dt.$$
Continuity of translations in $L^1(\mathbb{R}^p)$ (cf. Rudin (1987), Theorem 9.5) ensures that $\varrho_W(u + iv)$ tends to 0 when $\max_{k'=1,\dots,p_W}|v_{k'}|$ tends to $\infty$, or equivalently when $v$ tends to $\infty$. And Condition (iv) holds.

To derive the form of the semi-parametric efficiency bound, note that the Laplace transform is injective. It follows that $\tilde{q}(t, W) = \frac{1}{g_W(t)}\frac{dL^{-1}[\varrho_W]}{dF_{\tau_W(\tilde{Z})|W}}(t)$ does not depend on the choice of $q \in Q$. Then if $Q$ is not empty, we have:
$$Q = \tilde{q}(\tau_W(\tilde{Z}), W) + \{\eta \in L^2(Z) \text{ such that } E(\eta(Z)|\tau_W(\tilde{Z}), W) = 0\}.$$
Let $q^*(Z) = \tilde{q}(\tau_W(\tilde{Z}), W) + \eta^*(Z)$ be the element of $Q$ that achieves the bound $V^*$. For any $\eta \in L^2(Z)$ such that $E(\eta(Z)|\tau_W(\tilde{Z}), W) = 0$ and any $\delta \in \mathbb{R}$:
$$V(\varepsilon q^*(Z) + R(X)) \le V(\varepsilon(q^*(Z) + \delta\eta(Z)) + R(X)),$$
or equivalently, after an expansion:
$$\delta^2 V(\varepsilon\eta(Z)) + 2\delta\,\mathrm{Cov}(\varepsilon\eta(Z), \varepsilon q^*(Z) + R(X)) \ge 0.$$
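The expansion step, and the reason the resulting quadratic inequality forces the covariance term to vanish, can be spelled out as follows. This is only a worked sketch: the shorthand $A$ and $B$ is introduced here for readability and does not appear elsewhere in the paper.

```latex
% Shorthand (introduced for this sketch only):
%   A = \varepsilon q^*(Z) + R(X), \qquad B = \varepsilon \eta(Z).
\begin{align*}
V(A + \delta B) &= V(A) + 2\delta\,\mathrm{Cov}(B, A) + \delta^2 V(B), \\
V(A) \le V(A + \delta B)\ \ \forall \delta \in \mathbb{R}
  &\iff \delta\bigl(\delta V(B) + 2\,\mathrm{Cov}(B, A)\bigr) \ge 0\ \ \forall \delta \in \mathbb{R}.
\end{align*}
% If Cov(B, A) were nonzero, taking \delta small with the sign opposite to
% Cov(B, A) would make the left-hand side strictly negative. Hence the
% inequality can hold for every real \delta only if Cov(B, A) = 0.
```

Minimality of $q^*$ within $Q$ therefore amounts to an orthogonality condition between $\varepsilon\eta(Z)$ and $\varepsilon q^*(Z) + R(X)$ for every admissible $\eta$.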
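It follows that for any $\eta \in L^2(Z)$ such that $E(\eta(Z)|\tau_W(\tilde{Z}), W) = 0$:
$$\mathrm{Cov}(\varepsilon\eta(Z), \varepsilon q^*(Z) + R(X)) = 0.$$
Because $E(\varepsilon\eta(Z)) = E(E(\varepsilon|Z)\eta(Z)) = 0$, for any $\eta \in L^2(Z)$ such that $E(\eta(Z)|\tau_W(\tilde{Z}), W) = 0$ we have:
$$E\left[\left(E(\varepsilon^2|Z)q^*(Z) + E(\varepsilon R(X)|Z)\right)\eta(Z)\right] = 0.$$
Consider $\eta(Z) = E(\varepsilon^2|Z)q^*(Z) + E(\varepsilon R(X)|Z) - E(E(\varepsilon^2|Z)q^*(Z)|\tau_W(\tilde{Z}), W) - E(\varepsilon R(X)|\tau_W(\tilde{Z}), W)$ to obtain that:
$$E\left[V\left(E(\varepsilon^2|Z)q^*(Z) + E(\varepsilon R(X)|Z)\,\middle|\,\tau_W(\tilde{Z}), W\right)\right] = 0.$$
It follows that:
$$E(\varepsilon^2|Z)q^*(Z) + E(\varepsilon R(X)|Z) = E\left(\varepsilon^2 q^*(Z)|\tau_W(\tilde{Z}), W\right) + E(\varepsilon R(X)|\tau_W(\tilde{Z}), W).$$
Because $q^*(Z) = \tilde{q}(\tau_W(\tilde{Z}), W) + \eta^*(Z)$, there exists a function $h$ such that:
$$\eta^*(Z) = -\tilde{q}(\tau_W(\tilde{Z}), W) - \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)} + \frac{h(\tau_W(\tilde{Z}), W)}{E(\varepsilon^2|Z)},$$
and then the condition $E(\eta^*(Z)|\tau_W(\tilde{Z}), W) = 0$ ensures:
$$h(\tau_W(\tilde{Z}), W) = \frac{\tilde{q}(\tau_W(\tilde{Z}), W) + E\left[E(\varepsilon R(X)|Z)/E(\varepsilon^2|Z)\,|\,\tau_W(\tilde{Z}), W\right]}{E\left[1/E(\varepsilon^2|Z)\,|\,\tau_W(\tilde{Z}), W\right]},$$
and next:
$$q^*(Z) = \frac{\tilde{q}(\tau_W(\tilde{Z}), W)}{E(\varepsilon^2|Z)\,E\left[1/E(\varepsilon^2|Z)\,|\,\tau_W(\tilde{Z}), W\right]} + \frac{E\left[E(\varepsilon R(X)|Z)/E(\varepsilon^2|Z)\,|\,\tau_W(\tilde{Z}), W\right]}{E(\varepsilon^2|Z)\,E\left[1/E(\varepsilon^2|Z)\,|\,\tau_W(\tilde{Z}), W\right]} - \frac{E(\varepsilon R(X)|Z)}{E(\varepsilon^2|Z)},$$
and the form of the semiparametric efficiency bound follows.

7.8 Proof of Theorem 5.4

If there exists a regular and root-N consistent estimator, then $\psi f_X$ admits a weak derivative of order $(j)$ and there exists $q \in L^2(Z)$ such that $(-1)^{|j|}E(q(Z)|X = x)f_X(x) = [\psi f_X]^{(j)}(x)$. Assumption 5 ensures that $Z \perp\!\!\!\perp X\,|\,p(Z)$; then $T^*(q)(x) = E(q(Z)|X = x) = E(E(q(Z)|p(Z))|X = x)$. Let $\tilde{q}(v) = E(q(Z)|p(Z) = v)$; we have $\tilde{q} \in L^2(p(Z))$ and:
$$\begin{aligned}
\int \psi(x)\varphi^{(j)}(x)\,dF_X(x) &= \int \varphi(x)T^*(q)(x)f_X(x)\,dx\\
&= \int \varphi(x)E(\tilde{q}(p(Z))|X = x)f_X(x)\,dx\\
&= \int \varphi(x)\left[(\tilde{q}f_{p(Z)}) \star f_\eta\right](x)\,dx.
\end{aligned}$$
Then if there exists a regular and root-N consistent estimator, we have:
$$[\psi f_X]^{(j)} = (-1)^{|j|}(\tilde{q}f_{p(Z)}) \star f_\eta.$$
The Fourier transforms are well defined and next:
$$\mathcal{F}_{[\psi f_X]^{(j)}} = (-1)^{|j|}\mathcal{F}_{\tilde{q}f_{p(Z)}}\mathcal{F}_{f_\eta}.$$
Because $\mathcal{F}_{f_\eta}$ has isolated zeros, we get the "only if" part of Proposition (i) of Theorem 5.4. The "if" part follows from the fact that if such a $\tilde{q}$ exists then $\tilde{q} \in Q$.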
Because $q \in L^1(Z)$, $\tilde{q}f_{p(Z)}$ is integrable with respect to the Lebesgue measure, and next $\mathcal{F}_{\tilde{q}f_{p(Z)}} \in C_0^0(\mathbb{R}^K)$ by the Riemann-Lebesgue lemma and (ii) holds. Moreover, when $\mathcal{F}_{\tilde{q}f_{p(Z)}} \in L^1(\mathbb{R}^K) \cap C_0^0(\mathbb{R}^K)$, the usual inversion formula applies, i.e. $\mathcal{F}^{-1} = \overline{\mathcal{F}}$. Then $Q$ is not empty if and only if $(-1)^{|j|}\overline{\mathcal{F}}\left[\mathcal{F}_{[\psi f_X]^{(j)}}/\mathcal{F}_{f_\eta}\right] \in L^1(\mathbb{R}^K)$, which is equivalent to $\overline{\mathcal{F}}\left[\mathcal{F}_{[\psi f_X]^{(j)}}/\mathcal{F}_{f_\eta}\right] \in L^1(\mathbb{R}^K)$. And next, the full characterization of $Q$ follows. Then (iii) holds.

If $p(Z)$ has a compact support then $\varphi \mapsto \int \tilde{q}(v)f_{p(Z)}(v)\varphi(v)\,dv$ is a compactly supported linear form, and then its Laplace transform satisfies conditions (iv).(a), (iv).(b) and (iv).(c) (cf. Schwartz (1997)). Moreover, the identity $L_{[\psi f_X]^{(j)}} = (-1)^{|j|}L_{\tilde{q}f_{p(Z)}}L_{f_\eta}$ holds on $\Gamma_{f_\eta} + i\mathbb{R}^K = \mathbb{C}^K$, and next $\frac{L_{[\psi f_X]^{(j)}}}{L_{f_\eta}} = (-1)^{|j|}L_{\tilde{q}f_{p(Z)}}$ on $\mathbb{C}^K$. Last, the expression of $q^*$ is obtained by the same reasoning as that used in the proof of Theorem 5.3.

7.9 Proof of Proposition 5.5

It follows from Condition (i) of Theorem 5.4 that a regular and root-N estimator exists only if the weak derivative of $f_X$ is a (locally integrable) function $f_X^{(j)}$, and we have (cf. Schwartz (1997), Equation VII.7.3, page 253): $\mathcal{F}_{f_X^{(j)}}(t) = (2\pi i)^{|j|}t^{(j)}\mathcal{F}_{f_X}(t)$ with $t^{(j)} = \prod_{k=1}^K t_k^{j_k}$. The additive first stage ensures that $\mathcal{F}_{f_X^{(j)}}(t) = (2\pi i)^{|j|}t^{(j)}\mathcal{F}_{f_{p(Z)}}(t)\mathcal{F}_{f_\eta}(t)$. The first condition of Theorem 5.4 ensures that there exists $q \in L^2(p(Z))$ such that:
$$(2\pi i)^{|j|}t^{(j)}\mathcal{F}_{f_{p(Z)}}(t) = (-1)^{|j|}\mathcal{F}_{qf_{p(Z)}}(t).$$
Because $(2\pi i)^{|j|}t^{(j)}$ is the Fourier transform of the weak derivative of order $(j)$ of the Dirac at 0, it follows (cf. Schwartz, 1997, Chapter VII, Theorem XV) that $(-1)^{|j|}qf_{p(Z)}$ is the weak derivative of order $(j)$ of $f_{p(Z)}$. Then the weak derivative of $f_{p(Z)}$ is a function of the form $(-1)^{|j|}qf_{p(Z)}$ with $q \in L^2(p(Z))$, and next
$$E\left[\left(\frac{f^{(j)}_{p(Z)}(p(Z))}{f_{p(Z)}(p(Z))}\right)^2\right] < \infty.$$
The expression of $q^*$ follows from Theorem 5.4.
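The moment condition of Proposition 5.5 can be illustrated in the simplest case $K = 1$, $j = 1$. The sketch below assumes, purely for illustration, that $p(Z) \sim N(0, s^2)$; the function names are hypothetical and the expectation is approximated by a trapezoid rule. Under this assumption the relation $f^{(1)}_{p(Z)} = -q f_{p(Z)}$ gives $q(v) = v/s^2$ and $E[q(p(Z))^2] = 1/s^2$, which is finite.

```python
import numpy as np

def normal_pdf(v, s):
    """Density of N(0, s^2), the assumed law of p(Z) in this illustration."""
    return np.exp(-v**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def normal_pdf_prime(v, s):
    """First derivative of the N(0, s^2) density."""
    return -v / s**2 * normal_pdf(v, s)

def q_from_density(v, s):
    """q = -f'/f, i.e. the j = 1 case of f^{(j)} = (-1)^{|j|} q f."""
    return -normal_pdf_prime(v, s) / normal_pdf(v, s)

def second_moment_of_q(s, lim=12.0, n=200001):
    """Approximate E[(f'(p(Z)) / f(p(Z)))^2] by the trapezoid rule."""
    v = np.linspace(-lim * s, lim * s, n)
    vals = q_from_density(v, s) ** 2 * normal_pdf(v, s)
    return np.sum((vals[:-1] + vals[1:]) / 2) * (v[1] - v[0])

s = 0.5
print(second_moment_of_q(s), 1 / s**2)  # the two numbers should agree closely
```

With $p(Z) = \rho Z$ and $Z \sim N(0, 1)$, so that $s = |\rho|$, this gives $q(\rho Z) = Z/\rho$, which agrees with the weight reported for the average marginal effect in the bivariate normal case of Appendix B.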
7.10 A Lemma on Fourier transform

Lemma 7.5 (Fourier transforms and derivatives) Let $P$ be a probability measure on $\mathbb{R}$, and $H$ and $G$ two signed measures dominated by $P$ such that: $P$ is dominated by $H$, $\frac{dP}{dH}$ is locally absolutely continuous, and for any $x \in \mathbb{R}$: $2\pi ix\mathcal{F}[H](x) = \mathcal{F}[G](x)$. Then $P$ admits a locally absolutely continuous density with respect to the Lebesgue measure.

Proof: As a signed measure, $H \in \mathcal{S}'$, and because the weak derivative of the Dirac at 0 admits a compact support, $2\pi ix\mathcal{F}[H] = \mathcal{F}[D\delta \star H] = \mathcal{F}[DH]$ where $DH$ is the weak derivative of $H$ and $D\delta$ is the weak derivative of the Dirac at 0. The injectivity of the Fourier transform on $\mathcal{S}'$ ensures that the weak derivative of the measure $H$ needs to be the measure $G$. Theorem II, Chapter II in Schwartz (1997) ensures that there exists a function $\tilde{h}$ (with bounded variation on every finite interval) such that $dH(t) = \tilde{h}(t)\,d\lambda(t)$ where $\lambda$ is the Lebesgue measure on $\mathbb{R}$. Then $dP(t) = \tilde{h}(t)\frac{dP}{dH}(t)\,d\lambda(t)$ and next, $dG(t) = \tilde{g}(t)\,d\lambda(t)$ with $\tilde{g} = \tilde{h}(t)\frac{dP}{dH}(t)\frac{dG}{dP}(t)$ and $\tilde{g} \in L^1(\lambda)$. So $\tilde{g}$ is the weak derivative of $\tilde{h}$. Theorem III in Chapter II of Schwartz ensures that $\tilde{h}$ is absolutely continuous. The product of locally absolutely continuous functions is a locally absolutely continuous function. And next $\tilde{h}(t)\frac{dP}{dH}(t)$ is locally absolutely continuous; this is the density of $P$ with respect to the Lebesgue measure.

8 Appendix B: Details on Table 1

8.1 Proof of the results about the normal bivariate case

We consider the case where:
$$(X, Z) \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right).$$
Following the notations of Assumption 4 and Theorem 4 (and dropping $W$, which does not appear here), we have:
$$\mu(x) = x, \quad \tau(z) = -\frac{\rho z}{1 - \rho^2}, \quad g(t) = \frac{1}{\sqrt{1 - \rho^2}}\exp\left(-\frac{t^2(1 - \rho^2)}{2}\right), \quad h(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2(1 - \rho^2)}\right).$$
Remember that
$$\nu(x) = E\left(g(\tau(Z))\exp(-x\tau(Z))\right) = \frac{1}{\sqrt{2\pi(1 - \rho^2)}}\int \exp\left(\frac{x\rho z}{1 - \rho^2} - \frac{z^2}{2(1 - \rho^2)}\right)dz,$$
and next a bit of algebra gives:
$$\nu(x) = \exp\left(\frac{x^2\rho^2}{2(1 - \rho^2)}\right).$$
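The closed form for $\nu$ can be checked numerically against its integral definition. The following is a minimal sketch rather than part of the proof: the function names are hypothetical, the integral is approximated by a trapezoid rule on a truncated grid, and the values of $x$ and $\rho$ are arbitrary.

```python
import numpy as np

def nu_numeric(x, rho, lim=40.0, n=400001):
    """nu(x) as the integral of exp(x*rho*z/(1-rho^2) - z^2/(2(1-rho^2)))
    divided by sqrt(2*pi*(1-rho^2)), approximated by the trapezoid rule."""
    s2 = 1.0 - rho**2
    z = np.linspace(-lim, lim, n)
    vals = np.exp(x * rho * z / s2 - z**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return np.sum((vals[:-1] + vals[1:]) / 2) * (z[1] - z[0])

def nu_closed_form(x, rho):
    """Closed form nu(x) = exp(x^2 rho^2 / (2 (1 - rho^2)))."""
    return np.exp(x**2 * rho**2 / (2 * (1 - rho**2)))

def h(x, rho):
    """h(x) = exp(-x^2 / (2 (1 - rho^2))) / sqrt(2 pi)."""
    return np.exp(-x**2 / (2 * (1 - rho**2))) / np.sqrt(2 * np.pi)

x, rho = 1.5, 0.6
print(nu_numeric(x, rho), nu_closed_form(x, rho))
# Sanity check: h(x) * nu(x) should be the N(0, 1) density of X at x.
print(h(x, rho) * nu_closed_form(x, rho), np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
```

The second comparison reflects the factorisation $f_X(x) = h(x)\nu(x)$, which recovers the standard normal marginal of $X$.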
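Then for a parameter $\theta_0 = E(\psi(X)m^{(j)}(X))$, we have to check that the following function $\varrho(x)$ is a Laplace transform:
$$\varrho(x) = (-1)^j\left[\psi(x)\exp\left(-\frac{x^2}{2}\right)\right]^{(j)}\exp\left(\frac{x^2}{2(1 - \rho^2)}\right).$$

1. Shift in mean: If $\theta_0 = E(\psi(X)m(X))$ with $\psi(x) = h\left(\frac{\exp(-x^2/2)}{\sqrt{2\pi}}, x\right)$ with $h(y, x) = \frac{1}{y}\frac{1}{\sqrt{2\pi}}e^{-(x-m)^2/2}$. Note that choosing $h$ like that means that we are interested in the average effect under a counterfactual distribution $N(m, 1)$ in a case where $f_X$ is unknown and needs to be estimated. In that case, $R(X) = (\psi(X) + \partial_{f_X}\psi(X)\cdot f_X(X))m(X) - E\left((\psi(X) + \partial_{f_X}\psi(X)\cdot f_X(X))m(X)\right) = 0$. A less realistic problem could also be considered, assuming that $f_X$ is known by the econometrician. In that case $h$ would take the form $h(y, x) = \exp\left(-\frac{(x-m)^2 - x^2}{2}\right)$, and even if the parameter is the same, the expression of the semi-parametric efficiency bound would be different because in that case $R(X) = \psi(X)m(X) - E(\psi(X)m(X))$. In that case,
$$\varrho(x) = \exp\left(-\frac{(x - m)^2}{2} + \frac{x^2}{2(1 - \rho^2)}\right) = \exp\left(-\frac{m^2}{2}\right)\exp\left(\frac{x^2\rho^2}{2(1 - \rho^2)} + xm\right).$$
For $\rho \neq 0$, the usual formula for the Laplace transform of a Gaussian function ensures that:
$$\varrho(x) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi}|\rho|}\exp\left(-\frac{m^2}{2}\right)L\left[\exp\left(-\frac{(z + m)^2(1 - \rho^2)}{2\rho^2}\right)\right](x),$$
and next, if $L^{-1}[\varrho]$ denotes the measure that is the inverse Laplace transform of $\varrho(x)$:
$$dL^{-1}[\varrho](t) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi}|\rho|}\exp\left(-\frac{m^2}{2}\right)\exp\left(-\frac{(t + m)^2(1 - \rho^2)}{2\rho^2}\right)dt.$$
On the other side:
$$g(t)\,dF_{\tau(Z)}(t) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi}|\rho|}\exp\left(-\frac{t^2(1 - \rho^2)}{2\rho^2}\right)dt,$$
so we obtain:
$$\frac{1}{g(t)}\frac{dL^{-1}[\varrho]}{dF_{\tau(Z)}}(t) = \exp\left(-\frac{m^2}{2\rho^2}\right)\exp\left(-\frac{tm(1 - \rho^2)}{\rho^2}\right).$$
Because $\tau(Z)$ is one to one, $E(\varepsilon^2|Z) = E(\varepsilon^2|\tau(Z))$ and $E(\varepsilon R(X)|Z) = E(\varepsilon R(X)|\tau(Z))$. Evaluating the previous expression at $t = \tau(Z) = -\rho Z/(1 - \rho^2)$, it follows that the semi-parametric efficiency bound of $\theta_0$ is:
$$V(\varepsilon q^*(Z)), \quad \text{with } q^*(Z) = \exp\left(\frac{mZ}{\rho} - \frac{m^2}{2\rho^2}\right).$$

2. Shift in variance: If $\theta_0 = E(\psi(X)m(X))$ with $\psi(x) = h\left(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}, x\right)$ with $h(y, x) = \frac{1}{y}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-x^2/(2\sigma^2)}$. Note that choosing $h$ like that means that we are interested in the average effect under a counterfactual distribution $N(0, \sigma^2)$ in a case where $f_X$ is unknown and needs to be estimated, and next $R(X) = 0$.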
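We have $\psi(x) = (\sigma^2)^{-1/2}\exp\left(-\frac{x^2}{2}\left(\frac{1}{\sigma^2} - 1\right)\right)$, then
$$\varrho(x) = (\sigma^2)^{-1/2}\exp\left(-\frac{x^2}{2}\left(\frac{1}{\sigma^2} - \frac{1}{1 - \rho^2}\right)\right).$$
Because $\Gamma = \mathbb{R}$, it follows that conditions (iii) and (iv) hold only if $\sigma^2 > 1 - \rho^2$. Reciprocally, if $\sigma^2 > 1 - \rho^2$, we have $\frac{1}{\sigma^2} - \frac{1}{1 - \rho^2} < 0$, i.e.
$$\varrho(x) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi(\sigma^2 + \rho^2 - 1)}}L\left[\exp\left(-\frac{z^2\sigma^2(1 - \rho^2)}{2(\sigma^2 + \rho^2 - 1)}\right)\right](x),$$
and in this case, if $L^{-1}[\varrho]$ denotes the measure that is the inverse Laplace transform of $\varrho(x)$:
$$dL^{-1}[\varrho](t) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi(\sigma^2 + \rho^2 - 1)}}\exp\left(-\frac{t^2\sigma^2(1 - \rho^2)}{2(\sigma^2 + \rho^2 - 1)}\right)dt.$$
On the other side:
$$g(t)\,dF_{\tau(Z)}(t) = \frac{\sqrt{1 - \rho^2}}{\sqrt{2\pi}|\rho|}\exp\left(-\frac{t^2(1 - \rho^2)}{2\rho^2}\right)dt,$$
then:
$$\frac{1}{g(t)}\frac{dL^{-1}[\varrho]}{dF_{\tau(Z)}}(t) = \frac{|\rho|}{\sqrt{\sigma^2 + \rho^2 - 1}}\exp\left(\frac{t^2(1 - \rho^2)^2(\sigma^2 - 1)}{2\rho^2(\sigma^2 + \rho^2 - 1)}\right).$$
It follows that the semi-parametric efficiency bound of $\theta_0$ is:
$$\frac{\rho^2}{\sigma^2 + \rho^2 - 1}V\left(\varepsilon\exp\left(\frac{Z^2(\sigma^2 - 1)}{2(\sigma^2 + \rho^2 - 1)}\right)\right).$$

3. Average marginal effect: If $\theta_0 = E(m'(X))$, then if $q \in Q$, it follows that $E[q(Z)|X] = X$ and then $q(Z) = \frac{Z}{\rho}$, and the semi-parametric efficiency bound is:
$$V\left((Y - m(X))\frac{Z}{\rho} + m'(X)\right).$$

4. Average derivative weighted by power of density: Let $\theta_0 = E(\psi(X)m'(X))$ with $\psi(x) = h((2\pi)^{-1/2}\exp(-x^2/2), x)$, with $h(y, x) = y^k$. For $k = 0$ we get the average marginal effect and for $k = 1$ we get the density-weighted average derivative. Here
$$R(X) = \frac{k + 1}{(2\pi)^{k/2}}\exp\left(-\frac{kX^2}{2}\right)m'(X) - E\left(\frac{k + 1}{(2\pi)^{k/2}}\exp\left(-\frac{kX^2}{2}\right)m'(X)\right).$$
The function $\varrho$ is:
$$\varrho(x) = (k + 1)x\exp\left(\frac{x^2}{2}\left(\frac{1}{1 - \rho^2} - (k + 1)\right)\right),$$
which is not the Laplace transform of a function if $\frac{1}{1 - \rho^2} - (k + 1) < 0$. Then the parameter $\theta_0$ is root-N estimable only if $\rho^2 \ge \frac{k}{k + 1}$. For $\rho^2 > \frac{k}{k + 1}$, a bit of algebra ensures that the semi-parametric efficiency bound is:
$$(k + 1)^2\,V\left(\varepsilon Z\exp\left(Z^2\,\frac{1 - \rho^2}{(k + 1)\rho^2 - k}\right) + \frac{1}{\sqrt{2\pi}^{\,k}}\exp\left(-kX^2/2\right)m'(X)\right).$$
And for $\rho^2 = \frac{k}{k + 1}$ and $k \ge 1$, we get:
$$(k + 1)^2\,V\left(\varepsilon\frac{Z}{\rho} + \frac{1}{\sqrt{2\pi}^{\,k}}\exp\left(-kX^2/2\right)m'(X)\right).$$
Last, if $k = 0$ and $\rho^2 = 0$, $Q$ is empty.

References

Ai, C. & Chen, X.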
(2003), ‘Efficient estimation of models with conditional moment restrictions containing unknown functions’, Econometrica 71, 1795–1843.
Ai, C. & Chen, X. (2007), ‘Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables’, Journal of Econometrics 141, 5–43.
Ai, C. & Chen, X. (2012), ‘The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions’, Journal of Econometrics 170, 442–457.
Bickel, P. J., Klaassen, C. A., Ritov, Y. & Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Springer-Verlag New York. ISBN 0-387-98473-9.
Blundell, R., Chen, X. & Kristensen, D. (2007), ‘Semi-nonparametric iv estimation of shape-invariant engel curves’, Econometrica 75, 1613–1669.
Breunig, C. & Johannes, J. (2016), ‘Adaptive estimation of functionals in nonparametric instrumental regression’, Econometric Theory 32.
Bronstein, E. M. (2008), ‘Approximation of convex sets by polytopes’, Journal of Mathematical Sciences 153(6), 727–762.
Canay, I. A., Santos, A. & Shaikh, A. M. (2013), ‘On the testability of identification in some nonparametric models with endogeneity’, Econometrica 81(6), 2535–2559.
Carrasco, M., Florens, J. & Renault, E. (2007), Linear inverse problems and structural econometrics: Estimation based on spectral decomposition and regularization, in J. Heckman & E. Leamer, eds, ‘Handbook of Econometrics’, Vol. 6B, North-Holland, pp. 5633–5751.
Carrasco, M., Florens, J. & Renault, E. (2014), Asymptotic normal inference in linear inverse problems, in J. Racine, L. Su & A. Ullah, eds, ‘The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics’, Oxford University Press, pp. 65–96.
Cattaneo, M. D., Crump, R. K. & Jansson, M. (2014), ‘Small bandwidth asymptotics for density-weighted average derivatives’, Econometric Theory 30, 176–200.
Chamberlain, G. (1987), ‘Asymptotic efficiency in estimation with conditional moment restrictions’, Journal of Econometrics 34, 305–334.
Chen, X. (2007), Large sample sieve estimation of semi-nonparametric models, in J. Heckman & E. Leamer, eds, ‘Handbook of Econometrics’, Vol. 6B, North-Holland, pp. 5549–5632.
Chen, X. & Pouzo, D. (2012), ‘Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals’, Econometrica 80(1), 277–321.
Chen, X. & Pouzo, D. (2015), ‘Sieve wald and qlr inferences on semi/nonparametric conditional moment models’, Econometrica 83(3), 1013–1079.
Chetverikov, D. & Wilhelm, D. (2015), ‘Nonparametric instrumental variable estimation under monotonicity’, CEMMAP working paper (39).
Choi, S., Hall, W. J. & Schick, A. (1996), ‘Asymptotically uniformly most powerful tests in parametric and semiparametric models’, The Annals of Statistics 24(2), 841–861.
Darolles, S., Fan, Y., Florens, J. & Renault, E. (2011), ‘Nonparametric instrumental regression’, Econometrica 79(5), 1541–1565.
De La Vallée Poussin, C.-J. (1915), ‘Sur l’intégrale de lebesgue’, Transactions of the American Mathematical Society 16(4), 435–501.
d’Haultfœuille, X. (2011), ‘On the completeness condition in nonparametric instrumental problems’, Econometric Theory 27(03), 460–471.
Florens, J. (2003), Inverse problems and structural econometrics: The example of instrumental variables, in R. Dewatripont, L. Hansen & S. Turnovsky, eds, ‘Advances in Economics and Econometrics: Theory and Applications’, Cambridge University Press, pp. 284–311.
Florens, J., Johannes, J. & Bellegem, S. V. (2011), ‘Identification and estimation by penalization in nonparametric instrumental regression’, Econometric Theory 27, 472–496.
Florens, J., Johannes, J. & Bellegem, S. V. (2012), ‘Instrumental regression in partially linear models’, Econometric Journal 15, 304–324.
Florens, J. & Simoni, A. (2012), ‘Nonparametric estimation of an instrumental variables regression: A quasi bayesian approach based on a regularized posterior’, Journal of Econometrics 170, 458–475.
Freyberger, J. (2015), On completeness and consistency in nonparametric instrumental variable models. webpage: http://www.ssc.wisc.edu/~jfreyberger/Completeness_Freyberger.
Gourieroux, C., Monfort, A. & Trognon, A. (1984), ‘Pseudo maximum likelihood methods: Theory’, Econometrica 52(03), 681–700.
Groetsch, C. W. (1977), Generalized Inverse of Linear Operators, Representation and Approximation, first edn, Marcel Dekker. ISBN 0-8247-6615-6.
Hall, P. & Horowitz, J. (2005), ‘Nonparametric methods for inference in the presence of instrumental variables’, The Annals of Statistics 33, 2904–2929.
Hiriart-Urruty, J.-B. & Lemaréchal, C. (2001), Fundamentals of Convex Analysis, first edn, Springer. ISBN 3-540-42205-6.
Hu, Y. & Schennach, S. M. (2008), ‘Instrumental variable treatment of nonclassical measurement error models’, Econometrica 76(1), 195–216.
Jacod, J. & Protter, P. (2000), Probability Essentials, Springer Berlin Heidelberg. ISBN 3-540-66419-X.
Newey, W. K. (1988), ‘Adaptive estimation of regression models via moment restrictions’, Journal of Econometrics 38, 301–339.
Newey, W. K. (1990a), ‘Efficient instrumental variables estimation of nonlinear models’, Econometrica 58(4), 809–837.
Newey, W. K. (1990b), ‘Semiparametric efficiency bound’, Journal of Applied Econometrics 5(2), 99–135.
Newey, W. K. & Powell, J. L. (2003), ‘Instrumental variable estimation of nonparametric models’, Econometrica 71(5), 1565–1578.
Newey, W. K. & Stoker, T. M. (1993), ‘Efficiency of weighted average derivative estimators and index models’, Econometrica 61(5), 1199–1223.
Powell, J. L., Stock, J. H. & Stoker, T. M. (1989), ‘Semiparametric estimation of index coefficients’, Econometrica 57(6), 1403–1430.
Rudin, W. (1987), Real and Complex Analysis, third edn, McGraw-Hill. ISBN 0-07-054234-1.
Santos, A. (2011), ‘Instrumental variables methods for recovering continuous linear functionals’, Journal of Econometrics 161, 129–146.
Schwartz, L. (1997), Théorie des distributions, third edn, Hermann. ISBN 2-7056-5551-4.
Severini, T. A. & Tripathi, G. (2012a), ‘Semiparametric efficiency bounds for microeconometric models: A survey’, Foundations and Trends in Econometrics 6(3-4), 163–397.
Severini, T. & Tripathi, G. (2012b), ‘Efficiency bounds for estimating linear functionals of nonparametric regression models with endogenous regressors’, Journal of Econometrics 170, 491–498.
van der Vaart, A. W. (1991), ‘On differentiable functionals’, Annals of Statistics 19(1), 178–204.
van der Vaart, A. W. (2000), Asymptotic Statistics, Cambridge University Press. ISBN 0-521-78450-6.