MA 575 Linear Models: Cedric E. Ginestet, Boston University
Gauss-Markov Theorem, Weighted Least Squares
Week 6, Lecture 2

1 Gauss-Markov Theorem

1.1 Assumptions

We make three crucial assumptions on the joint moments of the error terms. These assumptions are required for the Gauss-Markov theorem to hold. Note that this theorem also assumes that the fitted model is linear in the parameters.

i. Firstly, we assume that the expectations of all the error terms are centered at zero, such that
\[ E[E_i \mid x_i] = 0, \qquad i = 1, \ldots, n. \]

ii. Secondly, we assume that the variances of the error terms are constant for every $i = 1, \ldots, n$. This assumption is referred to as homoscedasticity:
\[ \operatorname{Var}[E_i \mid x_i] = \sigma^2, \qquad i = 1, \ldots, n. \]

iii. Thirdly, we assume that the error terms are uncorrelated,
\[ \operatorname{Cov}[E_i, E_j \mid x_i, x_j] = 0, \qquad \forall\, i \neq j. \]

1.2 BLUEs

Definition 1. Given a random sample, $Y_1, \ldots, Y_n \overset{\text{ind}}{\sim} f(X, \beta)$, an estimator $\hat{\beta}(Y_1, \ldots, Y_n)$ of the parameter $\beta$ is said to be unbiased if
\[ E[\hat{\beta} \mid X] = \beta, \]
for every $\beta \in \mathbb{R}^{p^*}$.

Definition 2. An estimator $\hat{\beta}$ of a parameter $\beta$ is said to be the Best Linear Unbiased Estimator (BLUE) if it is a linear function of the observed values $y$, an unbiased estimator of $\beta$, and if for any other linear unbiased estimator $\tilde{\beta}$, we have
\[ \operatorname{Var}[\hat{\beta} \mid X] \leq \operatorname{Var}[\tilde{\beta} \mid X]. \]

1.3 Proof of Theorem

Theorem 1. Under the Gauss-Markov assumptions, in a multiple regression model with mean and variance functions respectively defined as $E[y \mid X] = X\beta$ and $\operatorname{Var}[y \mid X] = \sigma^2 I$, the OLS estimator $\hat{\beta} := (X^T X)^{-1} X^T y$ is BLUE for $\beta$.

Proof. We need to show that for any arbitrary linear unbiased estimator of $\beta$, denoted $\tilde{\beta}$, the following matrix is negative semidefinite,
\[ \operatorname{Var}[\hat{\beta} \mid X] - \operatorname{Var}[\tilde{\beta} \mid X] \leq 0. \]

(i) Firstly, since both $\hat{\beta}$ and $\tilde{\beta}$ are linear functions of $y$, there exist two matrices $C$ and $D$ of order $p^* \times n$, with $C := (X^T X)^{-1} X^T$, and such that
\[ \hat{\beta} = Cy, \qquad \text{and} \qquad \tilde{\beta} = (C + D)y. \]

(ii) Secondly, as both $\hat{\beta}$ and $\tilde{\beta}$ are also unbiased, we hence have, using the fact that $CX = I$,
\[ E[\tilde{\beta} \mid X] = E[(C + D)y \mid X] = E[Cy \mid X] + D\,E[y \mid X] = \beta + DX\beta, \]
and therefore $DX\beta$ must be zero for $\tilde{\beta}$ to be unbiased. In fact, since unbiasedness holds for every value of $\beta \in \mathbb{R}^{p^*}$, it follows that $DX\beta = 0$ for every $\beta$, which implies that
\[ DX = 0, \qquad \text{and} \qquad X^T D^T = 0. \tag{1} \]

(iii) Finally, it suffices to compute the variance of $\tilde{\beta}$,
\[ \operatorname{Var}[\tilde{\beta} \mid X] = \operatorname{Var}[(C + D)y \mid X] = (C + D)\operatorname{Var}[y \mid X](C + D)^T = \sigma^2 (CC^T + CD^T + DC^T + DD^T). \]
Observe that by equation (1), we have
\[ DC^T = DX(X^T X)^{-1} = 0, \qquad \text{and} \qquad CD^T = (X^T X)^{-1} X^T D^T = 0, \]
and therefore
\[ \operatorname{Var}[\tilde{\beta} \mid X] = \sigma^2 (CC^T + DD^T) = \sigma^2 (X^T X)^{-1} + \sigma^2 DD^T = \operatorname{Var}[\hat{\beta} \mid X] + \sigma^2 DD^T. \]
However, since $DD^T$ is a Gram matrix of order $p^* \times p^*$, it is at least positive semidefinite, such that $DD^T \geq 0$. Therefore, we indeed obtain $\operatorname{Var}[\tilde{\beta} \mid X] \geq \operatorname{Var}[\hat{\beta} \mid X]$, as required.

This theorem can be generalized to weighted least squares (WLS) estimators. A more geometric proof of the Gauss-Markov theorem can be found in Christensen (2011), using the properties of the hat matrix. However, this latter proof technique is less natural, as it relies on comparing the variances of the fitted values corresponding to two different estimators, as a proxy for the actual variances of these estimators. Finally, yet another proof can be found in Casella and Berger (2002), on p. 544. A small simulation illustrating the theorem is sketched below.
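To make the theorem concrete, the following Python sketch (not part of the original notes; all names, seeds, and constants are illustrative) draws repeated samples from a fixed design and compares the OLS estimator $\hat{\beta} = Cy$ with a competing linear unbiased estimator $\tilde{\beta} = (C + D)y$, where $D$ is an arbitrary matrix constructed so that $DX = 0$, exactly as in step (ii) of the proof.

```python
# A minimal Monte Carlo sketch of the Gauss-Markov theorem (illustrative only;
# the matrix D below is one arbitrary choice satisfying DX = 0).
import numpy as np

rng = np.random.default_rng(575)
n, p_star = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p_star - 1))])
beta, sigma = np.array([1.0, 2.0, -0.5]), 1.0

C = np.linalg.solve(X.T @ X, X.T)       # C = (X^T X)^{-1} X^T, so beta_hat = C y
H = X @ C                               # hat matrix
A = 0.1 * rng.normal(size=(p_star, n))  # arbitrary p* x n matrix
D = A @ (np.eye(n) - H)                 # DX = A(X - HX) = 0, so C + D stays unbiased

ols, alt = [], []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    ols.append(C @ y)                   # OLS estimator
    alt.append((C + D) @ y)             # competing linear unbiased estimator

ols, alt = np.array(ols), np.array(alt)
print("mean (OLS):", ols.mean(axis=0))  # both close to beta (unbiasedness)
print("mean (alt):", alt.mean(axis=0))
print("var  (OLS):", ols.var(axis=0))   # componentwise variances: OLS is no larger
print("var  (alt):", alt.var(axis=0))
```

Both estimators are unbiased, but the empirical variances of the competitor exceed those of OLS by (an estimate of) the diagonal of $\sigma^2 DD^T$, as the proof predicts.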
2 Weighted Least Squares (WLS)

The classical OLS setup can be extended by including a set of weights associated with each data point,
\[ E[Y \mid X = x_i] = x_i^T \beta, \qquad \text{and} \qquad \operatorname{Var}[Y \mid X = x_i] = \frac{\sigma^2}{w_i}, \]
where the $w_i$'s are known positive numbers, such that $w_i > 0$, $\forall\, i = 1, \ldots, n$. These weights may naturally come from the number of 'samples' associated with each data point. This is especially the case when every data point is a sample average of some quantity, such as the number of cancer cases in a particular geographical location. This extension can be formulated using matrix notation as follows,
\[ y = X\beta + e, \qquad \text{and} \qquad \operatorname{Var}[e \mid X] = \sigma^2 W^{-1}, \]
where $W$ is assumed to be a diagonal matrix. It then suffices to specify a statistical criterion, such that
\[ \operatorname{RSS}(\beta; W) := (y - X\beta)^T W (y - X\beta) = \sum_{i=1}^n w_i (y_i - x_i^T \beta)^2. \tag{2} \]
Alternatively, this may be re-expressed in terms of the error vector, $e = y - X\beta$, such that
\[ \operatorname{RSS}(\beta; W) = e^T W e = \sum_{i=1}^n w_i e_i^2. \]

2.1 Fitting WLS using the OLS Framework

It is useful to re-formulate this WLS optimization into the standard OLS framework that we have already encountered. Hence, consider the following matrix decompositions,
\[ W = W^{1/2} W^{1/2}, \qquad \text{and} \qquad W^{1/2} W^{-1/2} = W^{-1/2} W^{1/2} = I; \]
where the diagonal entries of $W^{1/2}$ and $W^{-1/2}$ are respectively defined for every $i = 1, \ldots, n$ as
\[ (W^{1/2})_{ii} := \sqrt{w_i}, \qquad \text{and} \qquad (W^{-1/2})_{ii} := \frac{1}{\sqrt{w_i}}. \]
Once we have performed this decomposition, we can transform our original WLS model by premultiplying both sides in this fashion,
\[ W^{1/2} y = W^{1/2} X \beta + W^{1/2} e, \]
and define the following terms,
\[ z := W^{1/2} y, \qquad M := W^{1/2} X, \qquad \text{and} \qquad d := W^{1/2} e. \]
Observe that the vector of parameters of interest, $\beta$, has not been affected by this change of notation. Using these definitions, we can now re-define our target OLS model as follows,
\[ z = M\beta + d. \]
This yields a new RSS, which can be shown to be equivalent to the one described in equation (2),
\[ \operatorname{RSS}(\beta; W) = (z - M\beta)^T (z - M\beta) = [W^{1/2}(y - X\beta)]^T [W^{1/2}(y - X\beta)] = (y - X\beta)^T W (y - X\beta). \]
This can be directly minimized using the standard machinery that we have developed for minimizing the unweighted residual sum of squares. In addition, note that the variance function is given by
\[ \operatorname{Var}[d \mid X] = \operatorname{Var}[W^{1/2} e \mid X] = W^{1/2} \operatorname{Var}[e \mid X] (W^{1/2})^T = W^{1/2} \sigma^2 W^{-1} (W^{1/2})^T = \sigma^2 W^{1/2} W^{-1/2} W^{-1/2} W^{1/2} = \sigma^2 I. \]
In summary, we have therefore recovered, after some transformations, a standard OLS model taking the form,
\[ z = M\beta + d, \qquad \text{and} \qquad \operatorname{Var}[d \mid X] = \sigma^2 I. \]
It simply remains to compute the actual form of $\hat{\beta}$ with respect to $W$, such that
\[ \hat{\beta}_{\mathrm{WLS}} = (M^T M)^{-1} M^T z = \big((W^{1/2} X)^T W^{1/2} X\big)^{-1} (W^{1/2} X)^T z = (X^T W^{1/2} W^{1/2} X)^{-1} X^T W^{1/2} W^{1/2} y = (X^T W X)^{-1} X^T W y. \]
The equivalence between the WLS and the OLS frameworks is best observed by considering the entries of $M$ and $z$. The new weighted design matrix and vector of observations are now
\[ M = \begin{pmatrix} \sqrt{w_1} & \sqrt{w_1}\, x_{11} & \ldots & \sqrt{w_1}\, x_{1p} \\ \sqrt{w_2} & \sqrt{w_2}\, x_{21} & \ldots & \sqrt{w_2}\, x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sqrt{w_n} & \sqrt{w_n}\, x_{n1} & \ldots & \sqrt{w_n}\, x_{np} \end{pmatrix}, \qquad \text{and} \qquad z = \begin{pmatrix} \sqrt{w_1}\, y_1 \\ \sqrt{w_2}\, y_2 \\ \vdots \\ \sqrt{w_n}\, y_n \end{pmatrix}. \]
The regression problem then simply involves finding the WLS fitted values $\hat{z} := M\hat{\beta}$. A short numerical check of this equivalence is given below.
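As a quick numerical illustration (hypothetical code with synthetic data, not part of the original notes), the following Python sketch verifies that the direct WLS solution $(X^T W X)^{-1} X^T W y$ coincides with OLS applied to the transformed variables $z = W^{1/2} y$ and $M = W^{1/2} X$.

```python
# A minimal numerical check of the WLS/OLS equivalence (illustrative values only).
import numpy as np

rng = np.random.default_rng(575)
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # design with intercept
w = rng.uniform(0.5, 4.0, size=n)                      # known positive weights
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(size=n) / np.sqrt(w)         # Var[e_i | x_i] = sigma^2 / w_i

# Direct WLS solution: (X^T W X)^{-1} X^T W y, with W = diag(w)
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# OLS on the transformed model z = M beta + d, with z = W^{1/2} y, M = W^{1/2} X
sqrt_w = np.sqrt(w)
z, M = sqrt_w * y, sqrt_w[:, None] * X
beta_ols = np.linalg.solve(M.T @ M, M.T @ z)

print(np.allclose(beta_wls, beta_ols))                 # True: the two solutions agree
```

In practice one would rarely form the diagonal matrix $W$ explicitly; rescaling the rows of $X$ and $y$ by $\sqrt{w_i}$, as above, is cheaper and numerically equivalent.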
2.2 Generalized Least Squares (GLS)

The WLS extension of OLS can be further generalized by considering any symmetric and positive definite matrix, such that
\[ \operatorname{Var}[e \mid X] := \Sigma^{-1}, \tag{3} \]
where the generalized residual sum of squares becomes
\[ \operatorname{RSS}(\beta; \Sigma) := (y - X\beta)^T \Sigma (y - X\beta), \]
which can also be re-written as
\[ \operatorname{RSS}(\beta; \Sigma) = \sum_{i=1}^n \sum_{j=1}^n \sigma^2_{ij} (y_i - x_i^T \beta)(y_j - x_j^T \beta), \]
where $\sigma^2_{ij} := \Sigma_{ij}$. Noting that the inverse of a positive definite matrix is also positive definite, we require that
\[ \Sigma = \Sigma^T, \qquad \text{and} \qquad \Sigma > 0, \]
which implies that for every non-zero $v \in \mathbb{R}^n$, we have $v^T \Sigma v > 0$. Every symmetric positive definite matrix can be Cholesky decomposed, such that
\[ \Sigma = LL^T, \]
where $L$ is a lower triangular matrix of dimension $n \times n$. As a result, we can perform the same manipulations that we have conducted for WLS, such that if we pre-multiply both sides of our GLS equation, we obtain
\[ L^T y = L^T X \beta + L^T e, \]
and define the following terms,
\[ z := L^T y, \qquad M := L^T X, \qquad \text{and} \qquad d := L^T e. \]
Then, we can again apply the standard OLS minimization machinery, after having verified that the variance of $d$ is simply $I$; indeed,
\[ \operatorname{Var}[d \mid X] = L^T \operatorname{Var}[e \mid X]\, L = L^T (LL^T)^{-1} L = L^T (L^T)^{-1} L^{-1} L = I. \]
Moreover, it is straightforward to see that the Gauss-Markov theorem also holds under these more general assumptions, such that the GLS estimator
\[ \hat{\beta}_{\mathrm{GLS}} := (X^T \Sigma X)^{-1} X^T \Sigma y, \]
is also BLUE amongst the class of linear unbiased estimators in a model whose variance function is $\operatorname{Var}[e \mid X] := \Sigma^{-1}$. A numerical sketch of this transformation is given below.
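The following Python sketch (synthetic $\Sigma$ and data, purely illustrative and not from the original notes) verifies that the Cholesky transformation reduces GLS to OLS: solving the transformed normal equations $M^T M \beta = M^T z$ reproduces $\hat{\beta}_{\mathrm{GLS}} = (X^T \Sigma X)^{-1} X^T \Sigma y$.

```python
# A minimal sketch of GLS via the Cholesky transform (synthetic Sigma; illustrative only).
import numpy as np

rng = np.random.default_rng(575)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)

# A symmetric positive definite Sigma, with Var[e | X] = Sigma^{-1} as in equation (3)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)

# Direct GLS solution: (X^T Sigma X)^{-1} X^T Sigma y
beta_gls = np.linalg.solve(X.T @ Sigma @ X, X.T @ Sigma @ y)

# OLS on the transformed model: Sigma = L L^T, z = L^T y, M = L^T X
L = np.linalg.cholesky(Sigma)             # lower triangular factor
z, M = L.T @ y, L.T @ X
beta_ols = np.linalg.solve(M.T @ M, M.T @ z)

print(np.allclose(beta_gls, beta_ols))    # True: the two solutions agree
```

The agreement is immediate, since $M^T M = X^T L L^T X = X^T \Sigma X$ and $M^T z = X^T \Sigma y$.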
References

Casella, G. and Berger, R. (2002). Statistical Inference (2nd edition). Duxbury, New York.

Christensen, R. (2011). Plane Answers to Complex Questions: The Theory of Linear Models (4th edition). Springer, New York.