
MA 575: Linear Models
Cedric E. Ginestet, Boston University
Gauss-Markov Theorem, Weighted Least Squares
Week 6, Lecture 2
1 Gauss-Markov Theorem
1.1 Assumptions
We make three crucial assumptions on the joint moments of the error terms. These assumptions are required
for the Gauss-Markov theorem to hold. Note that this theorem also assumes that the fitted model is linear
in the parameters.
i. Firstly, we assume that the expectations of all the error terms are centered at zero, such that
    E[E_i | x_i] = 0,    i = 1, ..., n.
ii. Secondly, we also assume that the variances of the error terms are constant for every i = 1, ..., n; this assumption is referred to as homoscedasticity,
    Var[E_i | x_i] = σ^2,    i = 1, ..., n.
iii. Thirdly, we assume that the error terms are uncorrelated,
    Cov[E_i, E_j | x_i, x_j] = 0,    ∀ i ≠ j.
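As a brief illustration (a minimal sketch added here, not part of the original notes), the following Python/NumPy code simulates a regression model whose errors satisfy the three assumptions above, and checks by Monte Carlo that the OLS estimator is centered at the true β. The design matrix, the value of β, and the normal distribution of the errors are arbitrary choices made for the simulation; the Gauss-Markov assumptions themselves only constrain the first two conditional moments.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 covariates
    beta = np.array([1.0, 2.0, -0.5, 0.3])
    sigma = 1.5

    # Errors satisfying the Gauss-Markov assumptions:
    # mean zero, constant variance sigma^2, uncorrelated across observations.
    def simulate_beta_hat():
        e = rng.normal(0.0, sigma, size=n)
        y = X @ beta + e
        return np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate

    estimates = np.array([simulate_beta_hat() for _ in range(5000)])
    print("Monte Carlo mean of OLS estimates:", estimates.mean(axis=0))
    print("True beta:                        ", beta)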
1.2 BLUEs
Definition 1. Given a random sample, Y_1, ..., Y_n, drawn independently from f(X, β), an estimator β̂(Y_1, ..., Y_n) of the parameter β is said to be unbiased if
    E[β̂ | X] = β,    for every β ∈ R^{p*}.
Definition 2. An estimator β̂ of a parameter β is said to be a Best Linear Unbiased Estimator (BLUE) if it is a linear function of the observed values y, an unbiased estimator of β, and if, for any other linear unbiased estimator β̃, we have
    Var[β̂ | X] ≤ Var[β̃ | X],
where the inequality is understood in the matrix sense, i.e. Var[β̃ | X] − Var[β̂ | X] is positive semidefinite.
1.3 Proof of Theorem
Theorem 1. Under the Gauss-Markov assumptions, for a multiple regression model with mean and variance functions respectively defined as
    E[y | X] = Xβ,    and    Var[y | X] = σ^2 I,
the OLS estimator β̂ := (X^T X)^{-1} X^T y is BLUE for β.
Proof. We need to show that for any arbitrary linear unbiased estimator of β, denoted β̃, the following matrix is negative semidefinite,
    Var[β̂ | X] − Var[β̃ | X] ≤ 0.
(i) Firstly, since both β̂ and β̃ are linear functions of y, it follows that there exist two matrices C and D of order p* × n, with C := (X^T X)^{-1} X^T, such that
    β̂ = Cy,    and    β̃ = (C + D)y.
(ii) Secondly, as both β̂ and β̃ are also unbiased, we hence have
    E[β̃ | X] = E[(C + D)y | X]
              = E[Cy | X] + D E[y | X]
              = β + DXβ,
and therefore DXβ must be zero for β̃ to be unbiased. In fact, since unbiasedness holds for every value of β ∈ R^{p*}, it follows that DXβ = 0 for every β, which implies that
    DX = 0,    and    X^T D^T = 0.        (1)
(iii) Finally, it suffices to compute the variance of β̃,
    Var[β̃ | X] = Var[(C + D)y | X]
               = (C + D) Var[y | X] (C + D)^T
               = σ^2 (CC^T + CD^T + DC^T + DD^T).
Observe that by equation (1), we have
    DC^T = DX(X^T X)^{-1} = 0,    and    CD^T = (X^T X)^{-1} X^T D^T = 0,
and therefore
    Var[β̃ | X] = σ^2 (CC^T + DD^T)
               = σ^2 (X^T X)^{-1} + σ^2 DD^T
               = Var[β̂ | X] + σ^2 DD^T.
However, since DD^T is a Gram matrix of order p* × p*, it follows that it is at least positive semidefinite, such that DD^T ≥ 0. Therefore, we indeed obtain Var[β̃ | X] ≥ Var[β̂ | X], as required.
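The argument above can also be verified numerically. The following sketch (an added illustration; the design matrix and dimensions are arbitrary) constructs the OLS matrix C = (X^T X)^{-1} X^T, builds a matrix D satisfying DX = 0 by projecting random rows onto the orthogonal complement of the column space of X, and checks that the resulting variance difference σ^2 D D^T is positive semidefinite.

    import numpy as np

    rng = np.random.default_rng(1)
    n, pstar = 50, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, pstar - 1))])
    sigma2 = 2.0

    C = np.linalg.solve(X.T @ X, X.T)                    # OLS: beta_hat = C y
    H = X @ np.linalg.solve(X.T @ X, X.T)                # hat matrix
    D = rng.normal(size=(pstar, n)) @ (np.eye(n) - H)    # rows orthogonal to col(X), so DX = 0
    print("max |DX| =", np.abs(D @ X).max())             # ~ 0 up to rounding error

    var_hat   = sigma2 * np.linalg.inv(X.T @ X)          # Var[beta_hat | X]
    var_tilde = sigma2 * (C + D) @ (C + D).T             # Var[beta_tilde | X]

    diff = var_tilde - var_hat                           # equals sigma2 * D D^T
    print("eigenvalues of the difference:", np.linalg.eigvalsh(diff))  # all >= 0 (up to rounding)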
This theorem can be generalized to weighted least squares (WLS) estimators. A more geometric proof
of the Gauss-Markov theorem can be found in Christensen (2011), using the properties of the hat matrix.
However, this latter proof technique is less natural as it relies on comparing the variances of the fitted values
corresponding to two different estimators, as a proxy for the actual variances of these estimators. Finally,
yet another proof can be found in Casella and Berger (2002), on p. 544.
2 Weighted Least Squares (WLS)
The classical OLS setup can be extended by including a set of weights associated with each data point,
    E[Y | X = x_i] = x_i^T β,    and    Var[Y | X = x_i] = σ^2 / w_i,
where the w_i's are known positive numbers, such that w_i > 0, ∀ i = 1, ..., n.
These weights may naturally arise as the number of 'samples' associated with each data point. This is especially the case when every data point is a sample average of some quantity, such as the number of cancer cases in a particular geographical location: if y_i is the average of w_i independent measurements, each with variance σ^2, then Var[y_i] = σ^2 / w_i. This extension can be formulated using matrix notation as follows,
    y = Xβ + e,    and    Var[e | X] = σ^2 W^{-1},
where W is assumed to be a diagonal matrix with diagonal entries w_i. It then suffices to specify a statistical criterion to be minimized, namely the weighted residual sum of squares,
    RSS(β; W) := (y − Xβ)^T W (y − Xβ) = ∑_{i=1}^n w_i (y_i − x_i^T β)^2.        (2)
Alternatively, this may be re-expressed in terms of the error vector, e = y − Xβ, such that
    RSS(β; W) = e^T W e = ∑_{i=1}^n w_i e_i^2.
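As a quick numerical sanity check (added here, not part of the original notes, with simulated data and arbitrarily chosen weights), the sketch below evaluates the weighted residual sum of squares both in matrix form and as the weighted sum of squared residuals, confirming that the two expressions agree.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 20
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta = rng.normal(size=3)
    w = rng.uniform(0.5, 3.0, size=n)            # known positive weights
    y = X @ beta + rng.normal(size=n) / np.sqrt(w)

    e = y - X @ beta                              # error vector at this beta
    rss_matrix = e @ np.diag(w) @ e               # (y - X beta)^T W (y - X beta)
    rss_sum = np.sum(w * e**2)                    # sum_i w_i (y_i - x_i^T beta)^2
    print(np.isclose(rss_matrix, rss_sum))        # True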
2.1 Fitting WLS using the OLS Framework
It is useful to try to re-formulate this WLS optimization into the standard OLS framework that we have already encountered. Hence, consider the following matrix decompositions,
    W = W^{1/2} W^{1/2},    and    W^{1/2} W^{-1/2} = W^{-1/2} W^{1/2} = I,
where the diagonal entries in W^{1/2} and W^{-1/2} are respectively defined for every i = 1, ..., n as
    (W^{1/2})_{ii} := √w_i,    and    (W^{-1/2})_{ii} := 1/√w_i.
Once we have performed this decomposition, we can transform our original WLS model by pre-multiplying both sides as follows,
    W^{1/2} y = W^{1/2} Xβ + W^{1/2} e,
and define the following terms,
    z := W^{1/2} y,    M := W^{1/2} X,    and    d := W^{1/2} e.
Observe that the vector of parameters of interest, β, has not been affected by this change of notation. Using these definitions, we can now re-write our target OLS model as follows,
    z = Mβ + d.
This yields a new RSS, which can be shown to be equivalent to the one described in equation (2),
    RSS(β; W) = (z − Mβ)^T (z − Mβ)
              = [W^{1/2}(y − Xβ)]^T [W^{1/2}(y − Xβ)]
              = (y − Xβ)^T W (y − Xβ).
This can be directly minimized using the standard machinery that we have developed for minimizing the unweighted residual sum of squares. In addition, note that the variance function is given by
    Var[d | X] = Var[W^{1/2} e | X]
               = W^{1/2} Var[e | X] (W^{1/2})^T
               = W^{1/2} σ^2 W^{-1} (W^{1/2})^T
               = σ^2 W^{1/2} W^{-1/2} W^{-1/2} W^{1/2}
               = σ^2 I.
In summary, we therefore have recovered, after some transformations, a standard OLS model taking the form
    z = Mβ + d,    and    Var[d | X] = σ^2 I.
It simply remains to compute the actual form of β̂_WLS in terms of W, such that
    β̂_WLS = (M^T M)^{-1} M^T z
           = ((W^{1/2} X)^T W^{1/2} X)^{-1} (W^{1/2} X)^T z
           = (X^T W^{1/2} W^{1/2} X)^{-1} X^T W^{1/2} W^{1/2} y
           = (X^T W X)^{-1} X^T W y.
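The following sketch (an added illustration, with simulated data and arbitrarily chosen weights) computes the WLS estimate both through the closed-form expression (X^T W X)^{-1} X^T W y and by running OLS on the transformed quantities z = W^{1/2} y and M = W^{1/2} X; the two solutions coincide, as the derivation above indicates.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta = np.array([1.0, -2.0, 0.5])
    w = rng.uniform(0.5, 5.0, size=n)                        # known positive weights
    y = X @ beta + rng.normal(size=n) / np.sqrt(w)           # Var[e_i] = sigma^2 / w_i, with sigma = 1

    # Direct WLS formula: (X^T W X)^{-1} X^T W y, with W = diag(w)
    XtW = X.T * w                                            # same as X.T @ np.diag(w)
    beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

    # Equivalent OLS fit on the transformed model z = M beta + d
    sw = np.sqrt(w)
    z, M = sw * y, sw[:, None] * X
    beta_ols_transformed, *_ = np.linalg.lstsq(M, z, rcond=None)

    print(np.allclose(beta_wls, beta_ols_transformed))       # True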
The equivalence between the WLS and the OLS frameworks is best observed by considering the entries of M and z. The new vector of observations and weighted design matrix now have entries
    z_i = √w_i y_i,    and    M_i = (√w_i, √w_i x_i1, ..., √w_i x_ip),    i = 1, ..., n,
where M_i denotes the i-th row of M. The regression problem simply involves finding the WLS fitted values ẑ := Mβ̂.
2.2 Generalized Least Squares (GLS)
The WLS extension of OLS can be further generalized by considering any symmetric and positive definite matrix Σ, such that
    Var[e | X] := Σ^{-1},        (3)
where the generalized residual sum of squares becomes
    RSS(β; Σ) := (y − Xβ)^T Σ (y − Xβ),
which can also be re-written as
    RSS(β; Σ) = ∑_{i=1}^n ∑_{j=1}^n σ^2_ij (y_i − x_i^T β)(y_j − x_j^T β),
where σ^2_ij := Σ_ij. Noting that the inverse of a positive definite matrix is also positive definite, we require that
    Σ = Σ^T    and    Σ > 0,
which implies that for every non-zero v ∈ R^n, we have v^T Σ v > 0. Every symmetric positive definite matrix can be Cholesky decomposed, such that
    Σ = L L^T,
where L is a lower triangular matrix of dimension n × n. As a result, we can perform the same manipulations that we have conducted for WLS, such that if we pre-multiply both sides of our GLS equation, we obtain
    L^T y = L^T Xβ + L^T e,
and define the following terms,
    z := L^T y,    M := L^T X,    and    d := L^T e.
Then, we can again apply the standard OLS minimization machinery, after having verified that the variance of d is simply the identity: indeed, Var[d | X] = L^T Var[e | X] L = L^T (L L^T)^{-1} L = I. Moreover, it is straightforward to see that the Gauss-Markov theorem also holds under these more general assumptions, such that the GLS estimator
    β̂_GLS := (X^T Σ X)^{-1} X^T Σ y
is also BLUE amongst the class of linear unbiased estimators in a model whose variance function is Var[e | X] := Σ^{-1}.
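As a final illustration (an added sketch with a simulated precision matrix Σ, not part of the original notes), the code below draws correlated errors with Var[e | X] = Σ^{-1}, computes the GLS estimate through the closed-form expression above, and verifies that it matches OLS applied to the Cholesky-transformed data z = L^T y and M = L^T X.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta = np.array([0.5, 1.0, -1.5])

    # Build a symmetric positive definite precision matrix Sigma, so that Var[e | X] = Sigma^{-1}
    A = rng.normal(size=(n, n))
    Sigma = A @ A.T + n * np.eye(n)

    cov = np.linalg.inv(Sigma)                            # Var[e | X]
    e = rng.multivariate_normal(np.zeros(n), cov)
    y = X @ beta + e

    # Closed-form GLS estimator: (X^T Sigma X)^{-1} X^T Sigma y
    beta_gls = np.linalg.solve(X.T @ Sigma @ X, X.T @ Sigma @ y)

    # Equivalent OLS fit on the transformed data, using Sigma = L L^T
    L = np.linalg.cholesky(Sigma)
    z, M = L.T @ y, L.T @ X
    beta_ols_transformed, *_ = np.linalg.lstsq(M, z, rcond=None)

    print(np.allclose(beta_gls, beta_ols_transformed))    # True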
References
Casella, G. and Berger, R. (2002). Statistical Inference (2nd edition). Duxbury, New York.
Christensen, R. (2011). Plane Answers to Complex Questions: The Theory of Linear Models (4th edition). Springer,
New York.