Handout #8: Matrix Framework for Simple Linear Regression

Example 8.1: Consider again the Wendy's subset of the Nutrition dataset that was initially presented in Handout #7. Assume the following structure for the mean and variance functions.

$$E(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 \cdot \text{Calories}$$

$$Var(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \sigma^2$$

Simple Linear Regression Output

[Scatterplot: conditional distribution of SaturatedFat | Calories]

[JMP output: basic regression fit]

[JMP output: parameter estimates, with 95% confidence intervals]

[JMP output: 95% confidence interval and prediction interval]

Matrix Representation of the Data

The data structure can easily be represented with vectors and matrices. For example, the response column of the data will be represented by a vector, say y, and the predictor variable will be represented by a second vector, say x1. A theoretical representation and a representation for the observed data are presented here for comparison purposes.
Theoretical Representation

$$Y_1 = \beta_0 \cdot 1 + \beta_1 \cdot x_1 + \epsilon_1$$
$$Y_2 = \beta_0 \cdot 1 + \beta_1 \cdot x_2 + \epsilon_2$$
$$Y_3 = \beta_0 \cdot 1 + \beta_1 \cdot x_3 + \epsilon_3$$
$$\vdots$$
$$Y_{28} = \beta_0 \cdot 1 + \beta_1 \cdot x_{28} + \epsilon_{28}$$

In matrix form:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_{28} \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{28} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{28} \end{bmatrix},
\qquad \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Representation for Observed Data

$$14 = -5.8333 \cdot 1 + 0.03096 \cdot 580 + 1.88$$
$$21 = -5.8333 \cdot 1 + 0.03096 \cdot 800 + 2.06$$
$$30 = -5.8333 \cdot 1 + 0.03096 \cdot 1060 + 3.01$$
$$\vdots$$
$$12 = -5.8333 \cdot 1 + 0.03096 \cdot 580 + (-0.12)$$
$$2 = -5.8333 \cdot 1 + 0.03096 \cdot 320 + (-2.07)$$
$$2.5 = -5.8333 \cdot 1 + 0.03096 \cdot 210 + 1.83$$

In matrix form:

$$\begin{bmatrix} 14 \\ 21 \\ 30 \\ \vdots \\ 12 \\ 2 \\ 2.5 \end{bmatrix} =
\begin{bmatrix} 1 & 580 \\ 1 & 800 \\ 1 & 1060 \\ \vdots & \vdots \\ 1 & 580 \\ 1 & 320 \\ 1 & 210 \end{bmatrix}
\begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix} +
\begin{bmatrix} 1.88 \\ 2.06 \\ 3.01 \\ \vdots \\ -0.12 \\ -2.07 \\ 1.83 \end{bmatrix},
\qquad \mathbf{Y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\epsilon}}$$

The Theoretical Framework is Easier with Matrix Representation

Theoretical representation using standard notation:

- Representation: $Y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i$
- Distributional properties: $Y_i \sim \text{Normal}(\beta_0 + \beta_1 \cdot x_i,\ \sigma^2)$, so that $E(Y_i) = \beta_0 + \beta_1 \cdot x_i$ and $Var(Y_i) = Var(\epsilon_i) = \sigma^2$.

Theoretical representation using matrix notation:

- Representation: $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
- Distributional properties: $\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2 \cdot \mathbf{I})$, so that $E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$ and $Var(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \cdot \mathbf{I}$, where $\mathbf{I}$ is an n x n identity matrix, with n equal to the number of observations.

Some people emphasize the fact that all the variability in the response is represented in the error term and state the following result: in standard notation, $\epsilon_i \sim \text{Normal}(0, \sigma^2)$ for all i; in matrix notation, $\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \cdot \mathbf{I})$.

The quantity $\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \cdot \mathbf{I})$ has the following form when it is written out in its entirety.
$$\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}
\sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix},\
\sigma^2 \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \right)$$

Example 8.2: There are certainly situations in which this simple form may be inadequate. Consider the following data structure, in which glucose levels were measured on each subject at four time points: baseline, 30 minutes, 60 minutes, and 90 minutes. In this particular situation, the standard error assumptions are not appropriate.

[Data structure and snippet of the estimated mean functions of interest]

A better modeling approach for the error structure would be to allow the errors within a subject to be correlated with each other.

$$\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \vdots \\ \epsilon_{59,1} \\ \epsilon_{59,2} \\ \epsilon_{59,3} \\ \epsilon_{59,4} \end{bmatrix}
\sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\
\sigma^2 \begin{bmatrix}
1 & \rho_{12} & \rho_{13} & \rho_{14} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{12} & 1 & \rho_{23} & \rho_{24} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{13} & \rho_{23} & 1 & \rho_{34} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{14} & \rho_{24} & \rho_{34} & 1 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & \rho_{12} & \rho_{13} & \rho_{14} \\
0 & 0 & 0 & 0 & \cdots & \rho_{12} & 1 & \rho_{23} & \rho_{24} \\
0 & 0 & 0 & 0 & \cdots & \rho_{13} & \rho_{23} & 1 & \rho_{34} \\
0 & 0 & 0 & 0 & \cdots & \rho_{14} & \rho_{24} & \rho_{34} & 1
\end{bmatrix} \right)$$

Working with Matrix Representation in R Studio

To read a dataset into R Studio, select Import Dataset in the Workspace box (upper right corner), then select From Text File... The most common format for text files that I use is comma delimited, which simply means that values in the dataset are separated by commas. R Studio has the ability to automatically identify a comma delimited file type, and it produces the following window when reading in this type of file.

[Screenshot: R Studio Import Dataset window]

Data Structure in R Studio

In R (and R Studio), data is stored in a data.frame structure. This is not necessarily equivalent to a matrix, but for our purposes a data.frame can be thought of as a matrix.

Getting the dimensions of our Nutrition data.frame, i.e. the number of observations and number of variables:
> dim(Nutrition)
[1] 196  15

Getting the variable names of the Nutrition data.frame:

> names(Nutrition)
 [1] "RowID"        "Restaurant"   "Item"         "ServingSize"  "Type"         "Breakfast"    "Calories"
 [8] "TotalFat"     "SaturatedFat" "Cholesterol"  "Fiber"        "Sugar"        "Sodium"       "TotalCarbs"
[15] "Protein"

Getting the elements in the 1st row of the Nutrition data.frame:

> Nutrition[1,]

Getting the elements of the 2nd column, i.e. the restaurant for each observation:

> Nutrition[,2]

Simple Plotting and Model Fitting in R Studio

Creating a simple plot in R:

> plot(Nutrition$Calories, Nutrition$SaturatedFat)

A simple linear regression model fit can be done using the lm() function.

> slr.fit = lm(SaturatedFat ~ Calories, data=Nutrition)

To see the initial output, simply type slr.fit. A more detailed summary can be obtained using the summary() function; for example, summary(slr.fit) will produce additional summaries for this model.

> slr.fit

Call:
lm(formula = SaturatedFat ~ Calories, data = Nutrition)

Coefficients:
(Intercept)     Calories
   -2.86933      0.02276

Adding the estimated model to the plot:

> abline(slr.fit)

In R, there are often several quantities computed by a function that are retained but not easily identified or known. The names() function can be used to identify the names of these often hidden quantities. For example, slr.fit$residuals will produce a vector of all the residuals from the fit.

> names(slr.fit)

Using the residuals from the fit to easily obtain a plot of the estimated variance function:

> plot(Nutrition$Calories, abs(slr.fit$residuals))
> lines(lowess(Nutrition$Calories, abs(slr.fit$residuals)))

You can very easily get help on most functions in R through the use of the help() function. For example, if you'd like information regarding the use of the lowess() function, type

> help(lowess)

[Screenshot: R help page for lowess()]

Example 8.3: Working again with the Wendys subset of the Nutrition dataset. The first step is to obtain only the observations from Wendys.
This can be done as follows.

> Wendys = Nutrition[Nutrition$Restaurant=="Wendys",]

To obtain only the variables needed, we will ask for only certain columns. These columns will be reordered as well.

> Wendys = Nutrition[Nutrition$Restaurant=="Wendys", c(2,3,9,7)]
> Wendys

Next, we will construct the X matrix, i.e. the design matrix. Recall the matrix notation structure for our simple linear regression model.

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_{28} \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{28} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{28} \end{bmatrix}$$

Creating the X matrix

• Step #1: Creating the column of 1s.

> dim(Wendys)
[1] 28  4
> x0 = rep(1, 28)
> x0
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

• Step #2: Creating the 2nd column.

> x1 = Wendys[,4]
> x1
 [1]  580  800 1060  470  740  970  660  700  260  430  340  380  560  530  340  570  440  400  350  250  220  450
[23]  770  210  470  580  320  210

• Step #3: Putting the columns together in a matrix.

> x = cbind(x0, x1)
> View(x)

Creating the Y vector

> y = Wendys[,3]
> View(y)

Obtaining the estimated parameters, i.e. the $\hat{\boldsymbol{\beta}}$ vector

We know from the JMP output that the estimated y-intercept is about -5.8 and the slope estimate is about 0.03. Putting these quantities into vector format yields

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

This vector can be obtained using the following matrix formula.

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

Getting the first quantity, i.e. $(\mathbf{X}'\mathbf{X})^{-1}$, in R

First, getting the transpose of the matrix X:

> xprime = t(x)
> View(xprime)

Next, multiplying the transpose of X by X:

> xprimex = xprime %*% x
> View(xprimex)

Now, getting the inverse of $\mathbf{X}'\mathbf{X}$:

> xprimex.inv = solve(xprimex, diag(2))
> View(xprimex.inv)

Now, we can multiply the pieces together to get the estimated parameters, i.e.
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$:

> beta.hat = xprimex.inv %*% xprime %*% y
> View(beta.hat)

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

Predicted Values and Residuals

Predicted values:

> y.pred = x %*% beta.hat
> View(y.pred)

Residuals:

> resid = y - y.pred
> View(resid)

Some Other Commonly Used Quantities

[JMP output: Summary of Fit and ANOVA table]

Getting the Sum of Squares for C. Total, i.e. the total unexplained variation in the marginal distribution:

> C.Total = 27 * var(Wendys$SaturatedFat)
> C.Total
[1] 1467.7143

Getting the total unexplained variation in the conditional distribution can be done quite easily using the residual vector.

> Sum.Squared.Error = t(resid) %*% resid
> Sum.Squared.Error
       [,1]
[1,] 170.94

Dividing the quantity above by 26 yields our variance estimate under a constant variance assumption. That is, $\hat{\sigma}^2$ is given by

> Mean.Squared.Error = Sum.Squared.Error / 26
> Mean.Squared.Error
     [,1]
[1,] 6.57

Taking the square root yields the estimated standard deviation, i.e. $\hat{\sigma} = \widehat{\text{StdDev}}(\text{SaturatedFat} \mid \text{Calories})$.

> sqrt(Mean.Squared.Error)
      [,1]
[1,] 2.564

R² and a Visualization of R²

Getting the R² value via the reduction in unexplained variation:

> RSquared = (C.Total - Sum.Squared.Error)/C.Total
> RSquared
          [,1]
[1,] 0.8835354

Visualization of R²

There is a visual interpretation of R², which we have not yet discussed in this class. This visualization is given by plotting the y values against the predicted values. If the model provides a good fit, then the points on this plot should follow the y = x line, which has been included on the plot below.

> plot(y, y.pred, xlim=c(0,30), ylim=c(0,30))
> abline(0,1)

Questions

1. What would this plot look like if the R² value was very close to 1?
2. Consider the 1st observation in our dataset, the Daves Hot N Juicy ¼ lb Single. This item has a SaturatedFat value of 14, and the predicted SaturatedFat from the regression line was determined to be 12.12.
   a. Find this point on the graph above.
   b.
Identify the residual for this point on the graph.

The R² quantity calculated above can also be computed by squaring the correlation measurement from the plot above. Traditionally, ρ, the Greek r, is used to identify a correlation; thus, I'd guess that this is where the R² notation was derived from.

> cor(y, y.pred)^2
          [,1]
[1,] 0.8835354

Obtaining the Standard Errors for Estimated Parameters

The standard error quantities for the y-intercept and slope were discussed in a previous handout; however, the formulation of such quantities was not given. Standard error values for the y-intercept and slope are provided in standard regression output.

The standard error of the slope quantifies the degree to which the estimated slope of the regression line varies over repeated samples. From the above plot, we can see that the variation in the estimated slope certainly affects the variation in the estimated y-intercept. That is, these two quantities are said to co-vary, i.e. a covariation exists between them. The variation in the estimated parameter vector is given by the following variance/covariance matrix.

$$Var(\hat{\boldsymbol{\beta}}) = Var\left( \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) \\ Cov(\hat{\beta}_0, \hat{\beta}_1) & Var(\hat{\beta}_1) \end{bmatrix}$$

The estimated variance/covariance matrix is given by the following quantity.

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

Getting the variance/covariance matrix of the estimated parameter vector:

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}
= 6.57 \cdot \begin{bmatrix} 28 & 14060 \\ 14060 & 8412800 \end{bmatrix}^{-1}
= 6.57 \cdot \begin{bmatrix} 0.2221 & -0.00037 \\ -0.00037 & 0.0000007 \end{bmatrix}
= \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}$$

Thus, the standard error, i.e. standard deviation, of the estimated y-intercept is given by

$$\text{Standard Error of } \hat{\beta}_0 = \sqrt{\widehat{Var}(\hat{\beta}_0)} = \sqrt{1.459} = 1.208$$

and the standard error of the estimated slope is given by

$$\text{Standard Error of } \hat{\beta}_1 = \sqrt{\widehat{Var}(\hat{\beta}_1)} = \sqrt{0.0000049} = 0.0022$$

Comment: The co-variation that exists between the model parameters is ignored when the 95% confidence intervals are individually considered.
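The matrix formula $\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$ can be cross-checked against R's built-in vcov() function. The sketch below is self-contained: it uses only six of the 28 Wendys rows (the ones whose Calories and SaturatedFat values appear earlier in this handout) as a stand-in for the full subset, so the numbers differ from the full-data results; the point is that the hand computation and vcov() agree exactly.

```r
# Cross-check: sigma2.hat * (X'X)^{-1} matches R's built-in vcov().
# Six (Calories, SaturatedFat) pairs from the handout stand in for
# the full 28-row Wendys subset.
x1 <- c(580, 800, 1060, 580, 320, 210)
y  <- c(14, 21, 30, 12, 2, 2.5)
x  <- cbind(x0 = 1, x1)                  # design matrix: column of 1s, then Calories

fit <- lm(y ~ x1)
sigma2.hat <- sum(resid(fit)^2) / (length(y) - 2)   # SSE / (n - 2)

vcov.manual <- sigma2.hat * solve(t(x) %*% x)

# Same matrix as R's built-in estimate; the standard errors are the
# square roots of the diagonal entries.
all.equal(as.numeric(vcov.manual), as.numeric(vcov(fit)))
sqrt(diag(vcov.manual))
```

On the full Wendys data this reproduces the 1.208 and 0.0022 standard errors shown above.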
A 95% joint confidence region does not ignore such covariation. This confidence region is constructed using a multivariate normal distribution. (Take STAT 415: Multivariate Statistics for all the details!)

[JMP output: individual 95% confidence intervals for model parameters, and the 95% joint confidence region for model parameters]

Predictions and Standard Errors for Predictions

Goal: Obtain a prediction (and its associated standard errors for CIs and PIs) for the expected SaturatedFat level of a Wendy's menu item with 900 calories.

[JMP output: prediction and estimated standard errors]

Creating a row vector that contains the information for our new observation:

> xnew = cbind(1, 900)

Note: Column binding is needed to create a row vector, as [1] and [900] are being put together side by side. Thus, cbind(1,900) creates the row vector needed to make a prediction for a food item with 900 Calories.

To obtain the predicted SaturatedFat, simply multiply this row vector by $\hat{\boldsymbol{\beta}}$.

> y.pred.900 = xnew %*% beta.hat
> y.pred.900
         [,1]
[1,] 22.03295

Multiplication Properties for Variances

• Variance of a constant, say c, times $\hat{\boldsymbol{\beta}}$:

$$Var(c \cdot \hat{\boldsymbol{\beta}}) = c^2 \cdot Var(\hat{\boldsymbol{\beta}}) = c \cdot Var(\hat{\boldsymbol{\beta}}) \cdot c$$

• Variance of a row vector, say r, times $\hat{\boldsymbol{\beta}}$. This is commonly referred to as a linear combination of the estimated parameter vector.

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}'$$

Getting the variance for the linear combination of interest when making a prediction for a food item with 900 Calories:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1} \cdot \mathbf{r}'
= \begin{bmatrix} 1 & 900 \end{bmatrix}
\begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}
\begin{bmatrix} 1 \\ 900 \end{bmatrix} = 1.004$$

Taking the square root of this quantity yields the prediction standard error provided by JMP.

$$\text{Standard Error for CIs} = \sqrt{1.004} = 1.002$$

The standard error for an individual prediction (versus the average predicted value) requires the addition of the variability present in the conditional distribution.
That is, the variation for an individual prediction involves the variation in estimating the regression line plus the variation in the conditional distribution.

$$\text{Standard Error for PI} = \sqrt{\underbrace{\widehat{Var}\left(\hat{E}(\text{SaturatedFat} \mid \text{Calories}=900)\right)}_{\text{variability in mean function}} + \underbrace{\hat{\sigma}^2_{\text{SaturatedFat}\mid\text{Calories}}}_{\text{variability in conditional distribution}}} = \sqrt{1.004 + 6.57} = 2.75$$

A visualization of the 95% prediction interval and its corresponding standard error is given below.

[Plot: 95% prediction interval for Calories = 900]

Prediction intervals certainly vary over repeated samples. The standard error for an individual prediction measures such variation.
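Both standard errors above can be reproduced with the matrix formulas and checked against R's built-in predict() function. As before, this sketch uses six of the 28 Wendys rows (values taken from this handout) as a stand-in for the full subset, so the numbers differ from the full-data output; what matters is that the matrix formulas and predict() agree.

```r
# Standard errors at Calories = 900: for the estimated mean (CI) via
# r %*% Var(beta.hat) %*% r', and for an individual prediction (PI)
# by adding the conditional variance sigma2.hat. Six rows from the
# handout stand in for the full Wendys subset.
x1 <- c(580, 800, 1060, 580, 320, 210)
y  <- c(14, 21, 30, 12, 2, 2.5)
x  <- cbind(1, x1)

fit <- lm(y ~ x1)
sigma2.hat <- sum(resid(fit)^2) / (length(y) - 2)
V <- sigma2.hat * solve(t(x) %*% x)      # estimated Var(beta.hat)

r <- cbind(1, 900)                       # row vector for the new observation
se.ci <- sqrt(r %*% V %*% t(r))          # SE for the estimated mean
se.pi <- sqrt(se.ci^2 + sigma2.hat)      # adds the conditional variance

# predict() reports the same SE for the mean ...
pr <- predict(fit, newdata = data.frame(x1 = 900), se.fit = TRUE)
all.equal(as.numeric(se.ci), pr$se.fit)

# ... and its prediction-interval half-width equals t * se.pi,
# with t the 97.5th percentile of a t distribution on n - 2 df.
pi.900 <- predict(fit, newdata = data.frame(x1 = 900), interval = "prediction")
all.equal(as.numeric(pi.900[, "upr"] - pi.900[, "fit"]),
          qt(0.975, df = length(y) - 2) * as.numeric(se.pi))
```

On the full Wendys data the same computation gives the 1.002 and 2.75 values shown above.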