Asymptotic Results for the Linear Regression Model

C. Flinn
November 29, 2000
1. Asymptotic Results under Classical Assumptions
The following results apply to the linear regression model
y = Xβ + ε,
where X is of dimension (n × k), ε is an (unknown) (n × 1) vector of disturbances,
and β is an (unknown) (k × 1) parameter vector. We assume that n ≫ k, and that
ρ(X) = k. This implies that ρ(X'X) = k as well.
Throughout we assume that the “classical” conditional moment assumptions
apply, namely
• E(εᵢ|X) = 0 for all i.
• V(εᵢ|X) = σ² for all i.
We first show that the probability limit of the OLS estimator is β, i.e., that it
is consistent. In particular, we know that

    β̂ = β + (X'X)⁻¹X'ε
    ⇒ E(β̂|X) = β + (X'X)⁻¹X'E(ε|X) = β.

In terms of the (conditional) variance of the estimator β̂,

    V(β̂|X) = σ²(X'X)⁻¹.
Now we will rely heavily on the following assumption:

    limₙ→∞ Xₙ'Xₙ/n = Q,
where Q is a finite, nonsingular k × k matrix. Then we can write the covariance
of β̂ₙ in a sample of size n explicitly as

    V(β̂ₙ|Xₙ) = (σ²/n)(Xₙ'Xₙ/n)⁻¹,

so that

    limₙ→∞ V(β̂ₙ|Xₙ) = limₙ→∞ (σ²/n) × limₙ→∞ (Xₙ'Xₙ/n)⁻¹ = 0 × Q⁻¹ = 0.
Since the asymptotic variance of the estimator is 0 and the distribution is centered
on β for all n, we have shown that β̂ is consistent.
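This shrinking-variance argument can be checked numerically. The sketch below uses a hypothetical two-regressor design with an arbitrarily chosen β and σ (all values are illustrative, not from the text), and shows the OLS estimation error collapsing as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])   # true parameter vector (illustrative choice)
sigma = 3.0                    # disturbance standard deviation (illustrative)

for n in [100, 10_000, 1_000_000]:
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # n x k design, k = 2
    eps = sigma * rng.standard_normal(n)                      # classical disturbances
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)              # OLS: (X'X)^{-1} X'y
    print(n, np.abs(beta_hat - beta).max())                   # error shrinks with n
```

Because V(β̂ₙ|Xₙ) is of order 1/n, the printed error falls roughly by a factor of 10 for each hundredfold increase in n.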
Alternatively, we can prove consistency as follows. We need the following
result.

Lemma 1.1.

    plim (X'ε/n) = 0.

Proof. First, note that E(X'ε/n) = 0 for any n. Then the variance of the
expression is given by

    V(X'ε/n) = E[(X'ε/n)(X'ε/n)']
             = n⁻² E(X'εε'X)
             = (σ²/n)(X'X/n),

so that limₙ→∞ V(X'ε/n) = 0 × Q = 0. Since the asymptotic mean of the random
variable is 0 and the asymptotic variance is 0, the probability limit of the
expression is 0. ∎
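As a quick numerical illustration of the lemma (with a hypothetical bounded regressor and standard normal disturbances, both chosen only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [100, 10_000, 1_000_000]:
    x = rng.uniform(-1, 1, n)          # a single bounded regressor
    eps = rng.standard_normal(n)       # disturbances with E(eps_i) = 0
    print(n, abs(x @ eps / n))         # X'eps/n collapses toward 0
```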
Now we can state a slightly more direct proof of consistency of the OLS
estimator, which is

    plim(β̂) = plim(β + (X'X)⁻¹X'ε)
            = β + limₙ→∞ (Xₙ'Xₙ/n)⁻¹ × plim (X'ε/n)
            = β + Q⁻¹ × 0 = β.
Next, consider whether or not s² is a consistent estimator of σ². Now

    s² = SSE/(n − k),

where SSE = (y − Xβ̂)'(y − Xβ̂). We showed that E(s²) = σ² for all n; that is,
s² is an unbiased estimator of σ² for all sample sizes. Since SSE = ε'Mε,
with M = I − X(X'X)⁻¹X', then

    plim s² = plim ε'Mε/(n − k)
            = plim ε'Mε/n
            = plim (ε'ε/n) − plim (ε'X/n)(X'X/n)⁻¹(X'ε/n)
            = plim (ε'ε/n) − 0 × Q⁻¹ × 0.
Now

    ε'ε/n = n⁻¹ Σᵢ₌₁ⁿ εᵢ²,

so that

    E(ε'ε/n) = n⁻¹ E(Σᵢ₌₁ⁿ εᵢ²) = n⁻¹ Σᵢ₌₁ⁿ E(εᵢ²) = n⁻¹(nσ²) = σ².
Similarly, under the assumption that εᵢ is i.i.d., the variance of the random
variable being considered is given by

    V(ε'ε/n) = n⁻² V(Σᵢ₌₁ⁿ εᵢ²)
             = n⁻² Σᵢ₌₁ⁿ V(εᵢ²)
             = n⁻²(n[E(εᵢ⁴) − V(εᵢ)²])
             = n⁻¹[E(εᵢ⁴) − V(εᵢ)²],

so that the limit of the variance of ε'ε/n is 0 as long as E(εᵢ⁴) is finite [we have
already assumed that the first two moments of the distribution of εᵢ exist]. Thus
the asymptotic distribution of ε'ε/n is centered at σ² and is degenerate, thus
proving consistency of s².
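Since the argument uses only i.i.d. disturbances with a finite fourth moment, a numerical check need not use normal errors at all. The sketch below (with an illustrative centered exponential error, which has mean 0, variance 1, and finite fourth moment) shows s² approaching σ²:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([0.5, 1.5])                    # hypothetical parameter vector
for n in [100, 10_000, 1_000_000]:
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
    eps = rng.exponential(1.0, n) - 1.0        # mean 0, variance 1, non-normal
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)      # OLS fit
    sse = float((y - X @ b) @ (y - X @ b))     # SSE = (y - Xb)'(y - Xb)
    s2 = sse / (n - 2)                         # s^2 = SSE/(n - k), k = 2
    print(n, s2)                               # approaches sigma^2 = 1
```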
2. Testing without Normally Distributed Disturbances
In this section we look at the distribution of test statistics associated with linear
restrictions on the β vector when εᵢ is no longer assumed to be distributed
N(0, σ²). Instead, we will proceed with the weaker condition that εᵢ is
independently and identically distributed with common cumulative distribution
function (c.d.f.) F. Furthermore, E(εᵢ) = 0 and V(εᵢ) = σ² for all i.
Since we retain the mean independence and homogeneity assumptions, and
since unbiasedness, consistency, and the Gauss-Markov theorem for that matter
all rely only on these first two conditional moment assumptions, all these results
continue to hold when we drop normality. However, the small sample distributions
of our test statistics will no longer be accurate, since these were all derived under
the assumption of normality. If we made other explicit assumptions regarding F,
it would be possible in principle to derive the small sample distributions of test
statistics, though these distributions are not simple to characterize analytically or
even to compute. Instead of making explicit assumptions regarding the form of F,
we can derive distributions of test statistics which are valid for large n no matter
what the exact form of F [except that it must be a member of the class of
distributions for which the asymptotic results are valid, of course].
We begin with the following useful lemma, which is associated with Lindeberg-Levy.
Lemma 2.1. If ε is i.i.d. with E(εᵢ) = 0 and E(εᵢ²) = σ² for all i; if the elements
of the matrix X are uniformly bounded so that |Xᵢⱼ| < U for all i and j and for
U finite; and if lim X'X/n = Q is finite and nonsingular, then

    (1/√n) X'ε → N(0, σ²Q).
Proof. Consider the case of only one regressor for simplicity. Then

    Zₙ ≡ (1/√n) Σᵢ₌₁ⁿ Xᵢεᵢ

is a scalar. Let Gᵢ be the c.d.f. of Xᵢεᵢ. Let

    Sₙ² ≡ Σᵢ₌₁ⁿ V(Xᵢεᵢ) = σ² Σᵢ₌₁ⁿ Xᵢ².

In this scalar case, Q = lim n⁻¹ Σᵢ Xᵢ². By the Lindeberg-Feller Theorem, the
necessary and sufficient condition for Zₙ → N(0, σ²Q) is

    lim (1/Sₙ²) Σᵢ₌₁ⁿ ∫_{|ω|>νSₙ} ω² dGᵢ(ω) = 0                    (2.1)

for all ν > 0. Now Gᵢ(ω) = F(ω/|Xᵢ|). Then rewrite [2.1] as

    lim (n/Sₙ²) Σᵢ₌₁ⁿ (Xᵢ²/n) ∫_{|ω/Xᵢ|>νSₙ/|Xᵢ|} (ω/Xᵢ)² dF(ω/|Xᵢ|) = 0.

Since Sₙ² = nσ² Σᵢ₌₁ⁿ Xᵢ²/n, we have lim n/Sₙ² = (σ²Q)⁻¹, which is a finite
and nonzero scalar. Then we need to show

    lim n⁻¹ Σᵢ₌₁ⁿ Xᵢ² δᵢₙ = 0,

where δᵢₙ ≡ ∫_{|ω/Xᵢ|>νSₙ/|Xᵢ|} (ω/Xᵢ)² dF(ω/|Xᵢ|). Now lim δᵢₙ = 0 for all i and
any fixed ν, since |Xᵢ| is bounded while lim Sₙ = ∞ [thus the measure of the set
|ω/Xᵢ| > νSₙ/|Xᵢ| goes to 0 asymptotically]. Since lim n⁻¹ Σ Xᵢ² is finite and
lim δᵢₙ = 0 for all i, lim n⁻¹ Σ Xᵢ² δᵢₙ = 0. ∎
For vector-valued Xᵢ, the result is identical of course, with Q being k × k
instead of a scalar. The proof is only slightly more involved.
Now we can prove the following important result.
Theorem 2.2. Under the conditions of the lemma,

    √n(β̂ − β) → N(0, σ²Q⁻¹).

Proof. √n(β̂ − β) = (X'X/n)⁻¹ (1/√n)X'ε. Since lim (X'X/n)⁻¹ = Q⁻¹ and
(1/√n)X'ε → N(0, σ²Q), then √n(β̂ − β) → N(0, σ²Q⁻¹QQ⁻¹) = N(0, σ²Q⁻¹). ∎
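A Monte Carlo sketch makes Theorem 2.2 concrete. Below, the design, β, and the (deliberately non-normal) uniform error distribution are all hypothetical choices; the simulated dispersion of √n(β̂ − β) for the slope is compared with the asymptotic value √(σ²(Q⁻¹)₁₁):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2_000, 5_000
x = rng.uniform(-1, 1, n)                  # fixed bounded regressor
X = np.column_stack([np.ones(n), x])
XtX = X.T @ X
Q = XtX / n
beta = np.array([0.0, 1.0])                # hypothetical true parameters
sigma = 1.0

slopes = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # non-normal, variance 1
    bhat = np.linalg.solve(XtX, X.T @ (X @ beta + eps))
    slopes[r] = np.sqrt(n) * (bhat[1] - beta[1])    # sqrt(n)(bhat - beta), slope

# Asymptotic sd of the slope component is sqrt(sigma^2 * (Q^{-1})_{11})
asy_sd = np.sqrt(sigma**2 * np.linalg.inv(Q)[1, 1])
print(np.std(slopes), asy_sd)              # the two should be close
```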
The results of this proof have the following practical implications. For small
n, the distribution of √n(β̂ − β) is not normal, though asymptotically the
distribution of this random variable converges to a normal. The variance of this
random variable converges to σ²Q⁻¹, which is arbitrarily well-approximated by
s²(Xₙ'Xₙ/n)⁻¹ = s²n(Xₙ'Xₙ)⁻¹. But the variance of (β̂ − β) is equal to the variance
of √n(β̂ − β) divided by n, so that in large samples the variance of the OLS
estimator is approximately equal to s²n(Xₙ'Xₙ)⁻¹/n = s²(Xₙ'Xₙ)⁻¹, even when F
is non-normal.
Usual t tests of one linear restriction on β are no longer exact. However,
an analogous large sample test is readily available.
Proposition 2.3. Let εᵢ be i.i.d. (0, σ²), σ² < ∞, and let Q be finite and
nonsingular. Consider the test

    H₀: Rβ = r,

where R is (1 × k) and r is a scalar, both known. Then

    (Rβ̂ − r)/√(s²R(X'X)⁻¹R') → N(0, 1).
Proof. Under the null, Rβ̂ − r = Rβ̂ − Rβ = R(β̂ − β), so that the test statistic
is

    √n R(β̂ − β)/√(s²R(X'X/n)⁻¹R').

Since

    √n(β̂ − β) → N(0, σ²Q⁻¹)
    ⇒ √n R(β̂ − β) → N(0, σ²RQ⁻¹R').

The denominator of the test statistic has a probability limit equal to
√(σ²RQ⁻¹R'), which is the standard deviation of the random variable in the
numerator. A mean zero normal random variable divided by its standard deviation
has the distribution N(0, 1). ∎
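Proposition 2.3 translates directly into an asymptotic z test. The sketch below (hypothetical design, parameters, and Laplace errors, all chosen for illustration) tests a single true restriction and obtains a statistic that behaves like a standard normal draw:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta = np.array([2.0, 0.5])                # hypothetical true parameters
eps = rng.laplace(0.0, 1.0, n)             # non-normal errors with mean 0
y = X @ beta + eps

bhat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
resid = y - X @ bhat
s2 = float(resid @ resid) / (n - 2)        # s^2 = SSE/(n - k)
XtX_inv = np.linalg.inv(X.T @ X)

R = np.array([0.0, 1.0])                   # H0: beta_1 = 0.5 (true here)
r = 0.5
z = (R @ bhat - r) / np.sqrt(s2 * (R @ XtX_inv @ R))
print(z)                                   # approximately N(0, 1) under H0
```

Compared against N(0, 1) critical values, this test is asymptotically valid even though F here is Laplace rather than normal.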
A similar result holds for the situation in which multiple (nonredundant) linear
restrictions on β are tested simultaneously.
Proposition 2.4. Let εᵢ be i.i.d. (0, σ²), σ² < ∞, and let Q be finite and
nonsingular. Consider the test

    H₀: Rβ = r,

where R is (m × k) and r is an (m × 1) vector, both known. Then

    [(r − Rβ̂)'[R(X'X)⁻¹R']⁻¹(r − Rβ̂)/m] / [SSE/(n − k)] → χ²ₘ/m.
Proof. The denominator is a consistent estimator of σ² [as would be SSE/n],
and has a degenerate limiting distribution. Under the null hypothesis, r − Rβ̂ =
−R(X'X)⁻¹X'ε, so that the numerator of the test statistic can be written ε'Dε/m,
where

    D ≡ X(X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹X'.

Now D is symmetric and idempotent with ρ(D) = m. Then write

    ε'Dε/(mσ²) = ε'PP'DPP'ε/(mσ²)
               = (1/m) V'[Iₘ 0; 0 0]V
               = (1/m) Σᵢ₌₁ᵐ Vᵢ²,

where P is the orthogonal matrix such that P'DP = [Iₘ 0; 0 0] and where
V = P'ε/σ. Thus the Vᵢ are i.i.d. with mean 0 and standard deviation 1. Because
V = P'ε/σ,

    Vᵢ = Σⱼ₌₁ⁿ Pⱼᵢεⱼ/σ,   i = 1, ..., m.

The terms in the summand are independent random variables with mean 0 and
variance σⱼ² = Pⱼᵢ². Since the εⱼ are i.i.d., the central limit theorem applies, so
that

    (Σⱼ₌₁ⁿ Pⱼᵢεⱼ/σ)/Wₙ → N(0, 1),

where Wₙ = √(Σⱼ₌₁ⁿ σⱼ²) = √(Σⱼ₌₁ⁿ Pⱼᵢ²) = 1 because P is orthogonal. Then since
each Vᵢ is standard normal, (1/m) Σᵢ₌₁ᵐ Vᵢ² → χ²ₘ/m. ∎
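The statistic of Proposition 2.4 is straightforward to compute. In the sketch below, the design, β, the heavy-tailed Student-t errors, and the two true restrictions are all hypothetical choices for illustration; under H₀ the statistic behaves like a χ²ₘ/m draw:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, m = 4_000, 4, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, (n, k - 1))])
beta = np.array([1.0, 0.0, 0.0, 2.0])      # hypothetical true parameters
eps = rng.standard_t(df=8, size=n)         # heavy-tailed, non-normal errors
y = X @ beta + eps

bhat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
resid = y - X @ bhat
s2 = float(resid @ resid) / (n - k)        # SSE/(n - k)

R = np.array([[0.0, 1.0, 0.0, 0.0],        # H0: beta_1 = 0 and beta_2 = 0
              [0.0, 0.0, 1.0, 0.0]])
r = np.zeros(m)
d = r - R @ bhat
mid = np.linalg.inv(R @ np.linalg.inv(X.T @ X) @ R.T)
stat = float(d @ mid @ d) / m / s2         # the statistic of Proposition 2.4
print(stat)                                # roughly chi2_m / m under H0
```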
The practical use of this theorem is as follows. For large samples,

    [(r − Rβ̂)'[R(X'X)⁻¹R']⁻¹(r − Rβ̂)/m] / [SSE/(n − k)] → χ²ₘ/m,        (2.2)

which means that for large enough n

    [(r − Rβ̂)'[R(X'X)⁻¹R']⁻¹(r − Rβ̂)] / [SSE/(n − k)] → χ²ₘ.            (2.3)

Now when the disturbances were normally distributed, in a sample of size n the
same test statistic given by the left-hand side of [2.2] was distributed as
F(m, n − k). Note that limₙ→∞ F(x; m, n − k) is χ²ₘ(mx). For example, say that
the test statistic associated with a null with m = 3 restrictions assumed the
value 4. In a sample of size n = 8000, we have (approximately)
1 − F(4; 3, 8000) = .00741. The asymptotic approximation given in [2.3] in this
example yields 1 − χ²₃(3 × 4) = .00738. In small samples, differences are much
greater of course. For example, for the same value of the test statistic, when
n = 20 we have 1 − F(4; 3, 20 − 3) = .02523, which is certainly different from
1 − χ²₃(3 × 4) = .00738.
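The tail probabilities quoted above can be checked directly. This sketch assumes scipy is available for the F and χ² c.d.f.s (any statistical table would serve the same purpose):

```python
from scipy.stats import f, chi2

stat, m = 4.0, 3                 # test statistic value, m = 3 restrictions

print(f.sf(stat, m, 8000))       # F(3, 8000) upper tail: about .0074
print(chi2.sf(m * stat, m))      # chi2_3 upper tail at m*stat = 12: about .0074
print(f.sf(stat, m, 17))         # n = 20, k = 3, so F(3, 17): about .025
```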
In summary, when the sample size is very large, the normality assumption is
pretty much inconsequential in the testing of linear restrictions on the parameter vector β. In small samples, some given assumption as to the form of F (ε) is
generally required to compute the distribution of the estimator β̂. Under normality, the small sample distributions of test statistics follow the t or F, depending
on the number of restrictions being tested. Testing in this environment depends
critically on the normality assumption, and if the disturbances are not normally
distributed, tests will be biased in general.