Asymptotic Theory of Least Absolute Error Regression

Asymptotic Theory of Least Absolute Error Regression
Authors(s): Gilbert Bassett, Jr. and Roger Koenker
Source: Journal of the American Statistical Association, Vol. 73, No. 363 (Sep., 1978), pp.
618-622
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: http://www.jstor.org/stable/2286611
Accessed: 24-03-2016 15:31 UTC
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
http://about.jstor.org/terms
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted
digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about
JSTOR, please contact [email protected].
American Statistical Association, Taylor & Francis, Ltd. are collaborating with JSTOR to digitize,
preserve and extend access to Journal of the American Statistical Association
http://www.jstor.org
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms
Asymptotic Theory of Least Absolute Error
Regression
GILBERT BASSETT, JR. and ROGER KOENKER*
In the general linear model with independent and identically disY1, Y2, .. ., YT distributed according to
tributed errors and distribution function F, the estimator which
K
minimizes the sum of absolute residuals is demonstrated to be con-
Pr[Yt < y] = F(y - E Xktk), t =1,*,T, (1.1)
sistent and asymptotically Gxaussian with covariance matrix co2Q-',
k=1
where Q = lim T-lX'X and Co2 iS the asymptotic variance of the
ordinary sample median from samples with distribution F. Thus the
where Xkt denotes an element of a known T X K design
least absolute error estimator has strictly smaller asymptotic con-
fidence ellipsoids than the least squares estimator for linear models
matrix XT. We assume that 5 is located so that the
from any F for which the sample median is a more efficient estimator
distribution function F has median zero; if the column
of location than the sample mean.
space of XT contains IT = (1, 1, . . . , 1), then this
KEY WORDS: Least absolute error estimators; Linear models;
involves no loss in generality. The LAE estimator T*
Asymptotic distribution theory.
is a solution to the problem
IK
1. INTRODUCTION
min [p(b) = |y - E xktbk|] . (1.2)
The methods of minimizing the sum of absolute and
bCGRK t-1 k=1
squared deviations from hypothesized linear models have
We prove the following theorem in Section 3.
vied for statistical favor for more than 250 years. While
least squares enjoys certain well-known optimality prop-
Theorem: Let { T* } denote a sequence of unique
erties within strictly Gaussian parametric models, the
solutions to the problem (1.2) for model (1.1), and assume
least absolute error (LAE) estimator' is a widely recog-
(i) F is continuous and has continuous and positive
nized superior robust method especially well-suited to
density f at the median, and
longer-tailed error distributions. Increasingly, the LAE
(ii) lim T-'XT'XT = Q, a positive definite matrix.
estimator is recommended as a preliminary (consistent)
Then V\T(3T* - 5) converges in distribution to a K-di-
estimator for one-step and iteratively reweighted least
mensional Gaussian random vector with mean 0 and
squares procedures; cf. Andrews (1974), Bickel (1975),
covariance matrix c2Q-1, where co2 is the asymptotic
Harvey (1977), and Hill and Holland (1977).
variance of the sample median from random samples
This article resolves a long-standing, open question
from distribution F; i.e., co = [2f (0)]-1.
concerning the LAE estimator by establishing its asymp-
The result implies that for any error distribution for
totic normality under general conditions, thereby ex-
which the median is superior to the mean as an estimator
tending a result of Laplace (1818) (see Stigler 1973) to
of location, the LAE estimator is preferable to the least
the general linear model. The result confirms that the
squares estimator in the general linear model, in the
LAE estimator is a natural analog of the sample median
sense of having strictly smaller asymptotic confidence
for the general linear model. Recently we have extended
ellipsoids.2 This condition holds for an enormous class of
this analogy to the estimation of arbitrary quantile
distributions which have peaked density at the median
hyperplanes for linear models; see Koenker and Bassett
and/or long tails.
(1978).
Laplace (1818) proved the preceding result for the
We consider the familiar problem of estimating a
special case of bivariate regression through the origin,
K-dimensional vector of unknown parameters 5 from a
2 A number of recent studies have sought to investigate the
sample of independent observations on random variables
sampling distribution of the LAE estimator by Monte Carlo methods.
Most notable among these is the unpublished work of Rosenberg and
* Gilbert Bassett, Jr. is Assistant Professcr of Economics, UniCarlson (1971), who conclude that the Gaussian distribution given
versity of Illinois at Chicago Circle, Chicago, IL 60680. Roger
above provides an acceptable approximation to their sampling dis-
Koenker is a member of the technical staff, Bell Laboratories, Murray
tributions for modest sample sizes and well-conditioned designs.
Hill, NJ 07974. The authors wish to express their appreciation to
Their finding that the approximation was significantly worse for
Lester D. Taylor for stimulating their interest in this topic.
these sample sizes and ill-conditioned designs should come as no
1 This venerable method, which Laplace called the "method of
surprise.
situations," has had a bewildering variety of names. Recently it has
? Journal of the American Statistical Association
accumulated a large array of acronyms: LAE, LAR, LAD, MAD,
MSAE, and others. It is also frequently referred to as Li-regression,
September 1978, Volume 73, Number 363
and less frequently as median regression.
Theory and Methods Section
618
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms
Bassett and Koenker: Least Absolute Error Regression 619
which in turn includes the median as a special case. Since
abundant evidence that L* outperforms least squares for
* may be viewed as a limiting form of the well-known
a large class of distributions satisfying Gauss-Markov
Huber (1972, 1973) M-estimator defined by
conditions. For examples we need go no further than the
comparison of the sample median and sample mean as
TK
estimates of location.
min E P(Yt - X Xtkbk), (1.3)
bCRK t=l k=1
We now introduce some additional notation and state
where
two more lemmas. For v E (RK let llv|| = maxk=l,...,K I Vk I
p(u) = 2 uul < c
and -consider the K-dimensional closed hypercubes
= ll22 lu , (1.4)
centered at 5 of "radius" e,
our result may be interpreted as an extension of well-
C[6, E] = {C C_ C5K: ||c - all < e} . (2.1)
known results on the large-sample theory of M-estimators
The corresponding open cubes will be denoted by C(&, e).
to an important nonregular special case. General methods
Our characterization of the estimator * relies extensively
of proof of asymptotic results on M-estimators for linear
on the vector-valued function
models break down when p (t) = t I because differentia-
bility conditions are violated. However, note that the
,(h, v) E ^,( v) = E sgn*(yt - x,b(h);
t Gh- t G h
familiar asymptotic variance formula for M-estimators
(see, e.g., Huber 1973, Relles 1968, Yohai 1974),
-xtX(h)-1v)xtX(h)-1, (2.2)
where
sgn*(u; w) = sgnu if u 5 0
V(I' F) f_ | 2(t) f(t)dtj[f +'(t) f(t)dt]
= sgnw otherwise
equals [2f(0) ]-2 for A'(t) = sgn(t) with &'(t) interpreted
Lemma 2: For any h C H, b(h) X(h)-ly(h) E B* if
as twice the Dirac delta function.
and only if ( (h, v) E C[0, 1] for all v 0 O, and b (h) = B*
(is unique) if and only if (h, v) C (0, 1) for all v # 0.
2. NOTATION AND PRELIMINARY RESULTS
Proof: Since p (.) is convex, it suffices to show that our
Let T = {1, 2, ..., T} and let SC denote the set of
existence and uniqueness conditions are equivalent to
K-element subsets of r. Elements h E SC have relative
nonnegativity and strict positivity of the directional
complement h = - h, and serve to partition y and
derivative function,
X. Thus y (h) denotes the K-vector with elements
T
{yt: t E h}, while X(h) denotes a (T - K) X K matrix
j&(b(h); w) =-E sgn*(yt - xtb(h); -x,w)xtw
with rows {Ix: t C h}. Finally, let
t=1
H = {h C Jelrank X(h) = KI
in all directions w $ 0. Now,
We may now state some fundamental properties of
,6(b(h); w) = E sgn(xtw)xtw
elements @* of the solution set B* of the minimization
tEh
problem (1.2).
- E sgn*(yt - xtb(h);xtw)xtw > 0
Lemma 1: If X has rank K, then the solution set B*
for all w # 0 is equivalent to, setting v = X(h)w,
to (1.2) has at least one element of the form
K
*= X(h)-ly(h)
E Ivk I - ((h, v)v > ? (2.3)
k=1
for some h C H. Moreover, B* is the convex hull of all
solutions having this form.
for all v # 0. But (2.3) is equivalent to (h, v) C CEO, 1].
Proof: This result follows immediately from the linear
Uniqueness is argued in the same way, with strict
programming formulation of problem (1.1). See, for
inequalities replacing weak.
example, the recent work by Abdelmalek (1974), Bassett
Remarks: Note that if yt - xtb(h) # 0 for all t E h,
(1973), and Taylor (1974) for details.
then ( (h, v) is independent of v. This is a nondegeneracy
Remark: One introductory econometrics text asserts
or "no ties" condition for the general linear model.
that the LAE estimator "ignores sample information"
It is instructive to consider the sample median in the
because it "passes through" K sample points. However,
light of Lemmas 1 and 2. Suppose that xt = 1 for
the entire sample serves to determine which K points the
t = 1, ... T, so H = .Then*= y(h) is a sample
* hyperplane passes through. Also, the fact that 5* is
median if and only if
a linear function of some of the sample observations has
-1 < E sgn*(yt - y(h);w) < 1
apparently led some authors to state that g* must be
teh
inferior to least squares under Gauss-Markov conditions.
However, the selection of h such that 0* = X (h)ly (h)
makes ~* a manifestly nonlinear estimator; thus no
Gauss-Markov claims apply to it. Indeed, there is
for all w # 0. If F is continuous, so that y - y(h) = 0
for t E hi with probability zero, then a unique median
obtains if and only if T is odd. With continuous F, and
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms
620 Journal of the American Statistical Association, September 1978
T even, there is, with probability one, an "interval of
down in the scatter. The result states that as long as
medians" between adjacent order statistics. Note that in
these movements leave observations on the same side
the absence of degeneracy, the condition for uniqueness
of the original line, the solution is unaffected. This
is purely a design condition, reducing in the location
property is not shared by least squares, and although
model to the requirement that T be odd. This suggests
obvious in the case of the median, it seems to capture
that for any sequence XT } of designs, one should be
part of the intuitive flavor of 0*'s median-type robust-
able to extract a subsequence, or at worst a "perturbed"
ness and insensitivity to outlying observations.
subsequence, whose elements have unique solutions. An
3. PROOF OF THE THEOREM
alternative approach which is frequently employed in
the location model is to adopt some arbitrary rule to
We begin by establishing that the finite sample density
choose a single element from sets of solutions to problem
kT(6) of the random K-vector VT(L1* -) is given by
(1.2) when they occur. Either approach suffices to obtain
rT(&) = T-K12 E X(h)| H f(T-ixt6)
the sequence of unique solutions considered in the next
hEH teh
section.
*Pr[ZT(5, h) C C(O, 1)], (3.1)
We now state a number of equivariance properties of
the LAE estimator.
where
ZT(61 h) = z Zt(6, h) = E sgn(ut T-1xt6)xtX(h)-l .
Lemma 3: If g*(y, X) E B*(y, X), then the following
tih- t'ch
are elements of the solution of the specified transformed
problem:
We argue as follows. Let * = - and d (h) = b(h)
- = X(h)-u(h), and note that Yt - xtb(h) =ut
(i) 0*(XY, X) = X (y, X), X E i;
-xtd (h). Consider the events
(ii) @*(y + Xy, X) = P*(y, X) + Y, y CR';
(iii) * (y, XA) = A-* (y, X), AK XK nonsingular;
El(h, 5, e) = {u C iR Td(h) C C(, ()},
(iv) 0* (X5* + Du*, X) = 5* (y, X),
E2(h) = {u 6d T (h, v) & C(O, 1), Vv 0}
DTXT diagonal with nonnegative elements;
u* - y-X0*.
By Lemmas 1 and 2,
Proof: Let
Pr[6* C C(&, E)]- U [El(h, 6, e) n E2(h)] . (3.2)
1'
heH
4V(b;y,X) = I lyt-xtbf
t=1
Let
and note that
IIXTII = max |XktJ
k=l,... SK
(i) X 11,(b;y, X) = (X(bb;Xy, X);
t=1,.., T
(ii) s6(b;y, X) = VI(b + Y; y+ X'y, X);
set MT = KIIXTII, and define the event
(iii) 4t(b; y, X) = O'(A-lb; y, XA).
E3(h,, E) = {uC RiT: I Ut - X, I> EM for all t h}
For (iv), * E B*(y, X) implies (by Lemma 2);
T
Clearly,
- E sgn* (yt - Xtg*; -XtW)XtW > 0 , w E K
Pr(El r E2) = Pr(E1l r E2r) E3)
t=1
+ Pr(E1 r) E2 r>-E3) * (3.3)
Note that
Since El(h, 5, e) implies ut - XtB I < EM for all t e h,
sgn*(xt5* + dt(yt - Xt5*) - xtg*; -x,w)xtw
Pr[ U (E1 C E2 / E3)]
< sgn*(y, - xtg*; -xtw)xtw
h EH
for dt > 0, and (iv) follows.
E Pr(El E2 r E3)
heH
Remark: Estimators with properties (i) and (ii) are
termed affine equivariant, and scale and shift equivariant,
= E Pr(El) Pr(E2 | E1 I E3) Pr(E3 I E1) . (3.4)
h'H
respectively (see Bickel 1975). Estimators with property
(iii) may be termed equivariant to reparameterization of
Set E2'(h, 6) = {U R G IT Z(VTh, h) E C (O, 1)}. It is
design. Properties (i)-(iii) are shared by the least squares
readily shown that Pr (E2') = Pr (E2 EEl n E3). If we
estimator, but typically robust alternatives to least
denote the Lebesgue measure of C by X { C }, then since
squares are not equivariant in one or more of the above
lim6 .o Pr(E3(h, &, e)) = 1, we have:
senses; see, e.g., Bickel (1973, 1975).
The fourth property generalizes an invariance prop-
lim PrE G C(8, E)] Pr[EE(h, 6, e)] Pr(E2')
lim - - = lim -N)
erty of the median to the linear model. It has the follow-
ing geometric interpretation. Imagine a scatter of sample
- I X(h)l IIH f(xt8) Pr(E2') . (3.5)
observations in 612 with the LAE solution line slicing
tEh
through the scatter. Now consider the effect (on the
position of the LAE line) of moving observations up or
Normalizing bY T-? yields (3.1).
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms
Bassett and Koenker: Least Absolute Error Regression 621
If GT is lattice in one or more directions, the expression
We now demonstrate that gT(?) converges to a speci-
(3.9) holds only up to some bounded constant of pro-
fied Gaussian density, and Scheffe's Theorem (1947) on
portionality involving counting measure on the lattice
convergence of densities completes the proof.
points in CT. Summing over H yields a uniformly bounded
Note that
density proportional to exp I- w2w'QB }; thus, by the
Zt(B, h)
Lebesgue dominated convergence theorem, the sequence
= x,X(h)-l with probability 1 - F(T-Ixt6)
of integrals of gT (a) converges, and our result follows
as in the nonlattice case (cf. Sheffe 1947, p. 437).
--x,X(h)-1 with probability F(T-lx,6) (3.6)
Expanding F around zero, noting that T-1 E teh x,xt -> Q,
4. CONCLUSION
and noting that IIXTII = o(V\T) (see 1\lalinvaud 1970,
pp. 226-227), it is readily shown that the stabilized sum
The least absolute error estimator has been demon-
strated to be more efficient than the least squares esti-
T-'ZT(&, h) = T-i E Zt(8&, h) (3.7)
mator in the linear model for any error distribution for
teh
which the median is more efficient than the mean. This
converges in law to a K-variate Gaussian random vector
result considerably strengthens the existing rationale for
with mean -2f(0)6'Q X (h)-1 and covariance matrix
the use of the LAE estimator, making it particularly
X'(h)-'QX(h)-1. Note that the condition IIXTII = o(V\T)
attractive relative to least squares when the regression
implies the standard multivariate Lindeberg condition
process is thought to be potentially long-tailed or to
(e.g., Cramer 1970, p. 114) immediately.
possess peaked density at the median.
Suppressing 8 and h, let GT denote the probability
Our main theorem also provides, for the first time, the
measure induced on 61K by ST = 2T-ZT, and let GT -> G.
foundation for a large-sample hypothesis testing ap-
Define the K-dimensional hypercubes centered at the
paratus for the LAE estimator. Any of the conventional
origin,
schemes for calculating confidence intervals for the
CT = C[O, 1/(2 /T)] = {cE (RK |lc|l
median, applied to the LAE residuals, will provide a
consistent estimator of the density at the median, and
< 1/ (2 \T)} (3.8)
therefore, normal theory tests with asymptotic justifi-
with Lebesgue measure X CT} = T-K /2. If the design
cation may be readily constructed.
sequence {XT} makes GT nonlattice after some To, then
arguing as in Shepp (1964) and Stone (1965), the Radon-
[Received November 1976. Revised January 1978.]
Nikodyn derivative,
GT{CT} _ K/
REFERENCES
lim - = lim TK/2 Pr[ZT C CT] = g(0), (3.9)
T->oo X{CT} T-*oo
Abdelmalek, N.N. (1974), "On the Discrete Linear IJ Approxima-
tion and Li Solutions of Overdetermined Linear Equations,"
where g (0) is the density of G evaluated at the origin.
Journal of Approximation Theory, 11, 38-53.
Andrews, David F. (1974), "A Robust Method for Multiple Linear
That is,
Regression," Technometrics, 16, 523-531.
Bassett, Gilbert W., Jr. (1973), Some Properties of the Least Absolute
TK/2 Pr[ZT(&, h) C CT]
Error Estimator, unpublished Ph.D. thesis, Department of Eco-
= (2ir)K/2 14 X'(h)'Q X (h)'
nomics, University of Michigan.
Bickel, Peter J. (1973), "On Some Analogues to Linear Combinations
* exp { -_f2(0)6'Q X'(h)-'
of Order Statistics in the Linear Model," Annals of Statistics, 1,
* [X'(h)-XQX(h)'7j'X(h)-1Q8} + o(1)
597-616.
- (1975), "One-Step Huber Estimates in the Linear Model,"
= (27r) K/22KX(h)IIQI-'{exp - 2X8Q8} + o(1)
Journal of the American Statistical Association, 70, 428-434.
Cramdr, Harald (1970), Random Variables and Probability DistribuThe continuity of the density at the median implies
tions, Cambridge: Cambridge University Press.
Harvey, Andrew C. (1977), "A Comparison of Preliminary Estima-
II f (T-IxtB) = fK(0) + o(1) . (3.10)
tors for Robust Regression," Journal of the American Statistical
tEh
Association, 72, 910-913.
Hill, Richard W., and Holland, Paul W. (1977), "Two Robust
So substituting back into (3.1), we have
Alternatives to Least-Squares Regression," Journal of the American
Statistical Association, 72, 828-833.
4 T(b) = E ThK I X(h) 121 Q I-i(27r)-K2W-K
Huber, Peter J. (1972), "Robust Statistics: A Review," Annals of
hEH
Mathematical Statistics, 43, 1041-1067.
*exp {-Ico25'Q6J + o(l) E T-K I X(h) 12
(1973), "Robust Regression: Asymptotics, Conjectures and
hEH
Monte Carlo," Annals of Statistics, 1, 799-821.
Koenker, Roger W., and Bassett, Gilbert W., Jr. (1978), "Regression
But FheH I X(h) 12 l X'X I (see Rao 1973, p. 32). Thus,
Quantiles," Econometrica, 46, 33-50.
de Laplace, Pierre Simon (1818), Deuxxieme Supplement a la Theiorie
simplifying, we have
Analytique des Probabilites, Paris: Courcier.
Malinvaud, E. (1970), Statistical Methods of Econometrics, 2nd rev.
'kT(8) -*(2r) K/2 | @ 2Q | exp { c2B'Q6 .8 (3.11)
ed., Amsterdam: North-Holland Publishing Co., New York:
American Elsevier Publishing Co.
Finally, Scheffe's Theorem (1947) on the convergence of
Rao, C. Radhakrishna (1973), Linear Statistical Inference and Its
densities yields the desired conclusion.
Applications, 2nd ed., New York: John Wiley & Sons.
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms
622 Journal of the American Statistical Association, September 1978
Relles, Daniel A. (1968), "Robust Regression by Modified Least
Squares," unpublished Ph.D. dissertation, Yale University.
Stigler, Stephen M. (1973), "Laplace, Fisher, and the Discovery of
the Concept of Sufficiency," Biometrika, 60, 439-445.
Rosenberg, Barr, and Carlson, D. (1971), "The Sampling DistribuStone, Charles (1965), "A Local Limit Theorem for Nonlattice
tion of Least Absolute Residuals Regression Estimates," Working
Multi-dimensional Distribution Functions," Annals of Mathe-
Paper No. IP-164, Institute of Business and Economic Research,
matical Statistics, 36, 546-551.
University of California, Berkeley.
Taylor, Lester D. (1974), "Estimation by Minimizing the Sum of
Scheff6, Henry (1974), "A Useful Convergence Theorem for Prob-
Absolute Errors," in Frontiers in Econometrics, ed. Paul Zarembka,
ability Distributions," Annals of Mathematical Statistics, 18,
New York: Academic Press.
434-438.
Shepp, L.A., (1964), "A Local Limit Theorem," Annals of Mathe-
matical Statistics, 35, 419-423.
Yohai, Victor J. (1974), "Robust Estimation in the Linear Model,"
Annals of Statistics, 2, 562-567.
This content downloaded from 131.193.211.30 on Thu, 24 Mar 2016 15:31:20 UTC
All use subject to http://about.jstor.org/terms