Asymptotic Theory of Least Absolute Error Regression Authors(s): Gilbert Bassett, Jr. and Roger Koenker Source: Journal of the American Statistical Association, Vol. 73, No. 363 (Sep., 1978), pp. 618-622 Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association Stable URL: http://www.jstor.org/stable/2286611 Accessed: 24-03-2016 15:31 UTC Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://about.jstor.org/terms JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. American Statistical Association, Taylor & Francis, Ltd. are collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association http://www.jstor.org This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms Asymptotic Theory of Least Absolute Error Regression GILBERT BASSETT, JR. and ROGER KOENKER* In the general linear model with independent and identically disY1, Y2, .. ., YT distributed according to tributed errors and distribution function F, the estimator which K minimizes the sum of absolute residuals is demonstrated to be con- Pr[Yt < y] = F(y - E Xktk), t =1,*,T, (1.1) sistent and asymptotically Gxaussian with covariance matrix co2Q-', k=1 where Q = lim T-lX'X and Co2 iS the asymptotic variance of the ordinary sample median from samples with distribution F. Thus the where Xkt denotes an element of a known T X K design least absolute error estimator has strictly smaller asymptotic con- fidence ellipsoids than the least squares estimator for linear models matrix XT. We assume that 5 is located so that the from any F for which the sample median is a more efficient estimator distribution function F has median zero; if the column of location than the sample mean. space of XT contains IT = (1, 1, . . . , 1), then this KEY WORDS: Least absolute error estimators; Linear models; involves no loss in generality. The LAE estimator T* Asymptotic distribution theory. is a solution to the problem IK 1. INTRODUCTION min [p(b) = |y - E xktbk|] . (1.2) The methods of minimizing the sum of absolute and bCGRK t-1 k=1 squared deviations from hypothesized linear models have We prove the following theorem in Section 3. vied for statistical favor for more than 250 years. While least squares enjoys certain well-known optimality prop- Theorem: Let { T* } denote a sequence of unique erties within strictly Gaussian parametric models, the solutions to the problem (1.2) for model (1.1), and assume least absolute error (LAE) estimator' is a widely recog- (i) F is continuous and has continuous and positive nized superior robust method especially well-suited to density f at the median, and longer-tailed error distributions. Increasingly, the LAE (ii) lim T-'XT'XT = Q, a positive definite matrix. estimator is recommended as a preliminary (consistent) Then V\T(3T* - 5) converges in distribution to a K-di- estimator for one-step and iteratively reweighted least mensional Gaussian random vector with mean 0 and squares procedures; cf. Andrews (1974), Bickel (1975), covariance matrix c2Q-1, where co2 is the asymptotic Harvey (1977), and Hill and Holland (1977). variance of the sample median from random samples This article resolves a long-standing, open question from distribution F; i.e., co = [2f (0)]-1. concerning the LAE estimator by establishing its asymp- The result implies that for any error distribution for totic normality under general conditions, thereby ex- which the median is superior to the mean as an estimator tending a result of Laplace (1818) (see Stigler 1973) to of location, the LAE estimator is preferable to the least the general linear model. The result confirms that the squares estimator in the general linear model, in the LAE estimator is a natural analog of the sample median sense of having strictly smaller asymptotic confidence for the general linear model. Recently we have extended ellipsoids.2 This condition holds for an enormous class of this analogy to the estimation of arbitrary quantile distributions which have peaked density at the median hyperplanes for linear models; see Koenker and Bassett and/or long tails. (1978). Laplace (1818) proved the preceding result for the We consider the familiar problem of estimating a special case of bivariate regression through the origin, K-dimensional vector of unknown parameters 5 from a 2 A number of recent studies have sought to investigate the sample of independent observations on random variables sampling distribution of the LAE estimator by Monte Carlo methods. Most notable among these is the unpublished work of Rosenberg and * Gilbert Bassett, Jr. is Assistant Professcr of Economics, UniCarlson (1971), who conclude that the Gaussian distribution given versity of Illinois at Chicago Circle, Chicago, IL 60680. Roger above provides an acceptable approximation to their sampling dis- Koenker is a member of the technical staff, Bell Laboratories, Murray tributions for modest sample sizes and well-conditioned designs. Hill, NJ 07974. The authors wish to express their appreciation to Their finding that the approximation was significantly worse for Lester D. Taylor for stimulating their interest in this topic. these sample sizes and ill-conditioned designs should come as no 1 This venerable method, which Laplace called the "method of surprise. situations," has had a bewildering variety of names. Recently it has ? Journal of the American Statistical Association accumulated a large array of acronyms: LAE, LAR, LAD, MAD, MSAE, and others. It is also frequently referred to as Li-regression, September 1978, Volume 73, Number 363 and less frequently as median regression. Theory and Methods Section 618 This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms Bassett and Koenker: Least Absolute Error Regression 619 which in turn includes the median as a special case. Since abundant evidence that L* outperforms least squares for * may be viewed as a limiting form of the well-known a large class of distributions satisfying Gauss-Markov Huber (1972, 1973) M-estimator defined by conditions. For examples we need go no further than the comparison of the sample median and sample mean as TK estimates of location. min E P(Yt - X Xtkbk), (1.3) bCRK t=l k=1 We now introduce some additional notation and state where two more lemmas. For v E (RK let llv|| = maxk=l,...,K I Vk I p(u) = 2 uul < c and -consider the K-dimensional closed hypercubes = ll22 lu , (1.4) centered at 5 of "radius" e, our result may be interpreted as an extension of well- C[6, E] = {C C_ C5K: ||c - all < e} . (2.1) known results on the large-sample theory of M-estimators The corresponding open cubes will be denoted by C(&, e). to an important nonregular special case. General methods Our characterization of the estimator * relies extensively of proof of asymptotic results on M-estimators for linear on the vector-valued function models break down when p (t) = t I because differentia- bility conditions are violated. However, note that the ,(h, v) E ^,( v) = E sgn*(yt - x,b(h); t Gh- t G h familiar asymptotic variance formula for M-estimators (see, e.g., Huber 1973, Relles 1968, Yohai 1974), -xtX(h)-1v)xtX(h)-1, (2.2) where sgn*(u; w) = sgnu if u 5 0 V(I' F) f_ | 2(t) f(t)dtj[f +'(t) f(t)dt] = sgnw otherwise equals [2f(0) ]-2 for A'(t) = sgn(t) with &'(t) interpreted Lemma 2: For any h C H, b(h) X(h)-ly(h) E B* if as twice the Dirac delta function. and only if ( (h, v) E C[0, 1] for all v 0 O, and b (h) = B* (is unique) if and only if (h, v) C (0, 1) for all v # 0. 2. NOTATION AND PRELIMINARY RESULTS Proof: Since p (.) is convex, it suffices to show that our Let T = {1, 2, ..., T} and let SC denote the set of existence and uniqueness conditions are equivalent to K-element subsets of r. Elements h E SC have relative nonnegativity and strict positivity of the directional complement h = - h, and serve to partition y and derivative function, X. Thus y (h) denotes the K-vector with elements T {yt: t E h}, while X(h) denotes a (T - K) X K matrix j&(b(h); w) =-E sgn*(yt - xtb(h); -x,w)xtw with rows {Ix: t C h}. Finally, let t=1 H = {h C Jelrank X(h) = KI in all directions w $ 0. Now, We may now state some fundamental properties of ,6(b(h); w) = E sgn(xtw)xtw elements @* of the solution set B* of the minimization tEh problem (1.2). - E sgn*(yt - xtb(h);xtw)xtw > 0 Lemma 1: If X has rank K, then the solution set B* for all w # 0 is equivalent to, setting v = X(h)w, to (1.2) has at least one element of the form K *= X(h)-ly(h) E Ivk I - ((h, v)v > ? (2.3) k=1 for some h C H. Moreover, B* is the convex hull of all solutions having this form. for all v # 0. But (2.3) is equivalent to (h, v) C CEO, 1]. Proof: This result follows immediately from the linear Uniqueness is argued in the same way, with strict programming formulation of problem (1.1). See, for inequalities replacing weak. example, the recent work by Abdelmalek (1974), Bassett Remarks: Note that if yt - xtb(h) # 0 for all t E h, (1973), and Taylor (1974) for details. then ( (h, v) is independent of v. This is a nondegeneracy Remark: One introductory econometrics text asserts or "no ties" condition for the general linear model. that the LAE estimator "ignores sample information" It is instructive to consider the sample median in the because it "passes through" K sample points. However, light of Lemmas 1 and 2. Suppose that xt = 1 for the entire sample serves to determine which K points the t = 1, ... T, so H = .Then*= y(h) is a sample * hyperplane passes through. Also, the fact that 5* is median if and only if a linear function of some of the sample observations has -1 < E sgn*(yt - y(h);w) < 1 apparently led some authors to state that g* must be teh inferior to least squares under Gauss-Markov conditions. However, the selection of h such that 0* = X (h)ly (h) makes ~* a manifestly nonlinear estimator; thus no Gauss-Markov claims apply to it. Indeed, there is for all w # 0. If F is continuous, so that y - y(h) = 0 for t E hi with probability zero, then a unique median obtains if and only if T is odd. With continuous F, and This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms 620 Journal of the American Statistical Association, September 1978 T even, there is, with probability one, an "interval of down in the scatter. The result states that as long as medians" between adjacent order statistics. Note that in these movements leave observations on the same side the absence of degeneracy, the condition for uniqueness of the original line, the solution is unaffected. This is purely a design condition, reducing in the location property is not shared by least squares, and although model to the requirement that T be odd. This suggests obvious in the case of the median, it seems to capture that for any sequence XT } of designs, one should be part of the intuitive flavor of 0*'s median-type robust- able to extract a subsequence, or at worst a "perturbed" ness and insensitivity to outlying observations. subsequence, whose elements have unique solutions. An 3. PROOF OF THE THEOREM alternative approach which is frequently employed in the location model is to adopt some arbitrary rule to We begin by establishing that the finite sample density choose a single element from sets of solutions to problem kT(6) of the random K-vector VT(L1* -) is given by (1.2) when they occur. Either approach suffices to obtain rT(&) = T-K12 E X(h)| H f(T-ixt6) the sequence of unique solutions considered in the next hEH teh section. *Pr[ZT(5, h) C C(O, 1)], (3.1) We now state a number of equivariance properties of the LAE estimator. where ZT(61 h) = z Zt(6, h) = E sgn(ut T-1xt6)xtX(h)-l . Lemma 3: If g*(y, X) E B*(y, X), then the following tih- t'ch are elements of the solution of the specified transformed problem: We argue as follows. Let * = - and d (h) = b(h) - = X(h)-u(h), and note that Yt - xtb(h) =ut (i) 0*(XY, X) = X (y, X), X E i; -xtd (h). Consider the events (ii) @*(y + Xy, X) = P*(y, X) + Y, y CR'; (iii) * (y, XA) = A-* (y, X), AK XK nonsingular; El(h, 5, e) = {u C iR Td(h) C C(, ()}, (iv) 0* (X5* + Du*, X) = 5* (y, X), E2(h) = {u 6d T (h, v) & C(O, 1), Vv 0} DTXT diagonal with nonnegative elements; u* - y-X0*. By Lemmas 1 and 2, Proof: Let Pr[6* C C(&, E)]- U [El(h, 6, e) n E2(h)] . (3.2) 1' heH 4V(b;y,X) = I lyt-xtbf t=1 Let and note that IIXTII = max |XktJ k=l,... SK (i) X 11,(b;y, X) = (X(bb;Xy, X); t=1,.., T (ii) s6(b;y, X) = VI(b + Y; y+ X'y, X); set MT = KIIXTII, and define the event (iii) 4t(b; y, X) = O'(A-lb; y, XA). E3(h,, E) = {uC RiT: I Ut - X, I> EM for all t h} For (iv), * E B*(y, X) implies (by Lemma 2); T Clearly, - E sgn* (yt - Xtg*; -XtW)XtW > 0 , w E K Pr(El r E2) = Pr(E1l r E2r) E3) t=1 + Pr(E1 r) E2 r>-E3) * (3.3) Note that Since El(h, 5, e) implies ut - XtB I < EM for all t e h, sgn*(xt5* + dt(yt - Xt5*) - xtg*; -x,w)xtw Pr[ U (E1 C E2 / E3)] < sgn*(y, - xtg*; -xtw)xtw h EH for dt > 0, and (iv) follows. E Pr(El E2 r E3) heH Remark: Estimators with properties (i) and (ii) are termed affine equivariant, and scale and shift equivariant, = E Pr(El) Pr(E2 | E1 I E3) Pr(E3 I E1) . (3.4) h'H respectively (see Bickel 1975). Estimators with property (iii) may be termed equivariant to reparameterization of Set E2'(h, 6) = {U R G IT Z(VTh, h) E C (O, 1)}. It is design. Properties (i)-(iii) are shared by the least squares readily shown that Pr (E2') = Pr (E2 EEl n E3). If we estimator, but typically robust alternatives to least denote the Lebesgue measure of C by X { C }, then since squares are not equivariant in one or more of the above lim6 .o Pr(E3(h, &, e)) = 1, we have: senses; see, e.g., Bickel (1973, 1975). The fourth property generalizes an invariance prop- lim PrE G C(8, E)] Pr[EE(h, 6, e)] Pr(E2') lim - - = lim -N) erty of the median to the linear model. It has the follow- ing geometric interpretation. Imagine a scatter of sample - I X(h)l IIH f(xt8) Pr(E2') . (3.5) observations in 612 with the LAE solution line slicing tEh through the scatter. Now consider the effect (on the position of the LAE line) of moving observations up or Normalizing bY T-? yields (3.1). This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms Bassett and Koenker: Least Absolute Error Regression 621 If GT is lattice in one or more directions, the expression We now demonstrate that gT(?) converges to a speci- (3.9) holds only up to some bounded constant of pro- fied Gaussian density, and Scheffe's Theorem (1947) on portionality involving counting measure on the lattice convergence of densities completes the proof. points in CT. Summing over H yields a uniformly bounded Note that density proportional to exp I- w2w'QB }; thus, by the Zt(B, h) Lebesgue dominated convergence theorem, the sequence = x,X(h)-l with probability 1 - F(T-Ixt6) of integrals of gT (a) converges, and our result follows as in the nonlattice case (cf. Sheffe 1947, p. 437). --x,X(h)-1 with probability F(T-lx,6) (3.6) Expanding F around zero, noting that T-1 E teh x,xt -> Q, 4. CONCLUSION and noting that IIXTII = o(V\T) (see 1\lalinvaud 1970, pp. 226-227), it is readily shown that the stabilized sum The least absolute error estimator has been demon- strated to be more efficient than the least squares esti- T-'ZT(&, h) = T-i E Zt(8&, h) (3.7) mator in the linear model for any error distribution for teh which the median is more efficient than the mean. This converges in law to a K-variate Gaussian random vector result considerably strengthens the existing rationale for with mean -2f(0)6'Q X (h)-1 and covariance matrix the use of the LAE estimator, making it particularly X'(h)-'QX(h)-1. Note that the condition IIXTII = o(V\T) attractive relative to least squares when the regression implies the standard multivariate Lindeberg condition process is thought to be potentially long-tailed or to (e.g., Cramer 1970, p. 114) immediately. possess peaked density at the median. Suppressing 8 and h, let GT denote the probability Our main theorem also provides, for the first time, the measure induced on 61K by ST = 2T-ZT, and let GT -> G. foundation for a large-sample hypothesis testing ap- Define the K-dimensional hypercubes centered at the paratus for the LAE estimator. Any of the conventional origin, schemes for calculating confidence intervals for the CT = C[O, 1/(2 /T)] = {cE (RK |lc|l median, applied to the LAE residuals, will provide a consistent estimator of the density at the median, and < 1/ (2 \T)} (3.8) therefore, normal theory tests with asymptotic justifi- with Lebesgue measure X CT} = T-K /2. If the design cation may be readily constructed. sequence {XT} makes GT nonlattice after some To, then arguing as in Shepp (1964) and Stone (1965), the Radon- [Received November 1976. Revised January 1978.] Nikodyn derivative, GT{CT} _ K/ REFERENCES lim - = lim TK/2 Pr[ZT C CT] = g(0), (3.9) T->oo X{CT} T-*oo Abdelmalek, N.N. (1974), "On the Discrete Linear IJ Approxima- tion and Li Solutions of Overdetermined Linear Equations," where g (0) is the density of G evaluated at the origin. Journal of Approximation Theory, 11, 38-53. Andrews, David F. (1974), "A Robust Method for Multiple Linear That is, Regression," Technometrics, 16, 523-531. Bassett, Gilbert W., Jr. (1973), Some Properties of the Least Absolute TK/2 Pr[ZT(&, h) C CT] Error Estimator, unpublished Ph.D. thesis, Department of Eco- = (2ir)K/2 14 X'(h)'Q X (h)' nomics, University of Michigan. Bickel, Peter J. (1973), "On Some Analogues to Linear Combinations * exp { -_f2(0)6'Q X'(h)-' of Order Statistics in the Linear Model," Annals of Statistics, 1, * [X'(h)-XQX(h)'7j'X(h)-1Q8} + o(1) 597-616. - (1975), "One-Step Huber Estimates in the Linear Model," = (27r) K/22KX(h)IIQI-'{exp - 2X8Q8} + o(1) Journal of the American Statistical Association, 70, 428-434. Cramdr, Harald (1970), Random Variables and Probability DistribuThe continuity of the density at the median implies tions, Cambridge: Cambridge University Press. Harvey, Andrew C. (1977), "A Comparison of Preliminary Estima- II f (T-IxtB) = fK(0) + o(1) . (3.10) tors for Robust Regression," Journal of the American Statistical tEh Association, 72, 910-913. Hill, Richard W., and Holland, Paul W. (1977), "Two Robust So substituting back into (3.1), we have Alternatives to Least-Squares Regression," Journal of the American Statistical Association, 72, 828-833. 4 T(b) = E ThK I X(h) 121 Q I-i(27r)-K2W-K Huber, Peter J. (1972), "Robust Statistics: A Review," Annals of hEH Mathematical Statistics, 43, 1041-1067. *exp {-Ico25'Q6J + o(l) E T-K I X(h) 12 (1973), "Robust Regression: Asymptotics, Conjectures and hEH Monte Carlo," Annals of Statistics, 1, 799-821. Koenker, Roger W., and Bassett, Gilbert W., Jr. (1978), "Regression But FheH I X(h) 12 l X'X I (see Rao 1973, p. 32). Thus, Quantiles," Econometrica, 46, 33-50. de Laplace, Pierre Simon (1818), Deuxxieme Supplement a la Theiorie simplifying, we have Analytique des Probabilites, Paris: Courcier. Malinvaud, E. (1970), Statistical Methods of Econometrics, 2nd rev. 'kT(8) -*(2r) K/2 | @ 2Q | exp { c2B'Q6 .8 (3.11) ed., Amsterdam: North-Holland Publishing Co., New York: American Elsevier Publishing Co. Finally, Scheffe's Theorem (1947) on the convergence of Rao, C. Radhakrishna (1973), Linear Statistical Inference and Its densities yields the desired conclusion. Applications, 2nd ed., New York: John Wiley & Sons. This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms 622 Journal of the American Statistical Association, September 1978 Relles, Daniel A. (1968), "Robust Regression by Modified Least Squares," unpublished Ph.D. dissertation, Yale University. Stigler, Stephen M. (1973), "Laplace, Fisher, and the Discovery of the Concept of Sufficiency," Biometrika, 60, 439-445. Rosenberg, Barr, and Carlson, D. (1971), "The Sampling DistribuStone, Charles (1965), "A Local Limit Theorem for Nonlattice tion of Least Absolute Residuals Regression Estimates," Working Multi-dimensional Distribution Functions," Annals of Mathe- Paper No. IP-164, Institute of Business and Economic Research, matical Statistics, 36, 546-551. University of California, Berkeley. Taylor, Lester D. (1974), "Estimation by Minimizing the Sum of Scheff6, Henry (1974), "A Useful Convergence Theorem for Prob- Absolute Errors," in Frontiers in Econometrics, ed. Paul Zarembka, ability Distributions," Annals of Mathematical Statistics, 18, New York: Academic Press. 434-438. Shepp, L.A., (1964), "A Local Limit Theorem," Annals of Mathe- matical Statistics, 35, 419-423. Yohai, Victor J. (1974), "Robust Estimation in the Linear Model," Annals of Statistics, 2, 562-567. This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC All use subject to http://about.jstor.org/terms
© Copyright 2026 Paperzz