Handout #8: Matrix Framework for Simple Linear Regression

Example 8.1: Consider again the Wendy's subset of the Nutrition dataset that was initially presented in Handout #7. Assume the following structure for the mean and variance functions.

$$E(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 \cdot \text{Calories}$$

$$Var(\text{SaturatedFat} \mid \text{Calories}, \text{Restaurant} = \text{Wendy's}) = \sigma^2$$

Simple Linear Regression Output

[Scatterplot: conditional distribution of SaturatedFat | Calories]

[JMP output: basic regression fit]

[JMP output: parameter estimates, with 95% confidence intervals]

[JMP output: 95% confidence interval and prediction interval]

Matrix Representation of the Data

The data structure can easily be represented with vectors and matrices. For example, the response column of the data will be represented by a vector, say y, and the predictor variable will be represented by a second vector, say x1. A theoretical representation and a representation for the observed data are presented here for comparison purposes.
Theoretical Representation

$$Y_1 = \beta_0 \cdot 1 + \beta_1 \cdot x_1 + \epsilon_1$$
$$Y_2 = \beta_0 \cdot 1 + \beta_1 \cdot x_2 + \epsilon_2$$
$$Y_3 = \beta_0 \cdot 1 + \beta_1 \cdot x_3 + \epsilon_3$$
$$\vdots$$
$$Y_{28} = \beta_0 \cdot 1 + \beta_1 \cdot x_{28} + \epsilon_{28}$$

In matrix form:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_{28} \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{28} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{28} \end{bmatrix},
\qquad \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Representation for Observed Data

$$14 = -5.8333 \cdot 1 + 0.03096 \cdot 580 + 1.88$$
$$21 = -5.8333 \cdot 1 + 0.03096 \cdot 800 + 2.06$$
$$30 = -5.8333 \cdot 1 + 0.03096 \cdot 1060 + 3.01$$
$$\vdots$$
$$12 = -5.8333 \cdot 1 + 0.03096 \cdot 580 + (-0.12)$$
$$2 = -5.8333 \cdot 1 + 0.03096 \cdot 320 + (-2.07)$$
$$2.5 = -5.8333 \cdot 1 + 0.03096 \cdot 210 + 1.83$$

In matrix form:

$$\begin{bmatrix} 14 \\ 21 \\ 30 \\ \vdots \\ 12 \\ 2 \\ 2.5 \end{bmatrix} =
\begin{bmatrix} 1 & 580 \\ 1 & 800 \\ 1 & 1060 \\ \vdots & \vdots \\ 1 & 580 \\ 1 & 320 \\ 1 & 210 \end{bmatrix}
\begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix} +
\begin{bmatrix} 1.88 \\ 2.06 \\ 3.01 \\ \vdots \\ -0.12 \\ -2.07 \\ 1.83 \end{bmatrix},
\qquad \mathbf{Y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\epsilon}}$$

The Theoretical Framework is Easier with Matrix Representation

Theoretical representation using standard notation:

- Representation: $Y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i$
- Distributional properties: $Y_i \sim \text{Normal}(\beta_0 + \beta_1 \cdot x_i,\ \sigma^2)$, so that $E(Y_i) = \beta_0 + \beta_1 \cdot x_i$ and $Var(Y_i) = Var(\epsilon_i) = \sigma^2$.

Theoretical representation using matrix notation:

- Representation: $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
- Distributional properties: $\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2 \cdot \mathbf{I})$, so that $E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$ and $Var(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \cdot \mathbf{I}$, where $\mathbf{I}$ is an n x n identity matrix, with n equal to the number of observations.

Some people emphasize the fact that all the variability in the response is represented in the error term and state the following result: in standard notation, $\epsilon_i \sim \text{Normal}(0, \sigma^2)$ for all i; in matrix notation, $\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \cdot \mathbf{I})$.

The quantity $\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \cdot \mathbf{I})$ has the following form when it is written out in its entirety.
$$\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}
\sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix},\
\sigma^2 \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \right)$$

Example 8.2: There are certainly situations in which this simple form may be inadequate. Consider the following data structure, in which glucose levels were measured on each subject at four time points: baseline, 30 minutes, 60 minutes, and 90 minutes. In this particular situation, the standard error assumptions are not appropriate.

[Data structure and snippet of the estimated mean functions of interest]

A better modeling approach for the error structure would be to allow the errors within a subject to be correlated with each other.

$$\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \vdots \\ \epsilon_{59,1} \\ \epsilon_{59,2} \\ \epsilon_{59,3} \\ \epsilon_{59,4} \end{bmatrix}
\sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\
\sigma^2 \begin{bmatrix}
1 & \rho_{12} & \rho_{13} & \rho_{14} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{12} & 1 & \rho_{23} & \rho_{24} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{13} & \rho_{23} & 1 & \rho_{34} & \cdots & 0 & 0 & 0 & 0 \\
\rho_{14} & \rho_{24} & \rho_{34} & 1 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & \rho_{12} & \rho_{13} & \rho_{14} \\
0 & 0 & 0 & 0 & \cdots & \rho_{12} & 1 & \rho_{23} & \rho_{24} \\
0 & 0 & 0 & 0 & \cdots & \rho_{13} & \rho_{23} & 1 & \rho_{34} \\
0 & 0 & 0 & 0 & \cdots & \rho_{14} & \rho_{24} & \rho_{34} & 1
\end{bmatrix} \right)$$

Working with Matrix Representation in R Studio

To read a dataset into R Studio, select Import Dataset in the Workspace box (upper right corner), then select From Text File... The most common format for text files that I use is comma delimited, which simply means that values in the dataset are separated by commas. R Studio has the ability to automatically identify a comma delimited file type, and it produces the following window when reading in this type of file.

[Screenshot: R Studio Import Dataset window]

Data Structure in R Studio

In R (and R Studio), data is stored in a data.frame structure. This is not necessarily equivalent to a matrix, but for our purposes a data.frame can be thought of as a matrix.

Getting the dimensions of our Nutrition data.frame, i.e. the number of observations and number of variables:
> dim(Nutrition)
[1] 196  15

Getting the variable names of the Nutrition data.frame:

> names(Nutrition)
 [1] "RowID"        "Restaurant"   "Item"         "ServingSize"  "Type"         "Breakfast"    "Calories"
 [8] "TotalFat"     "SaturatedFat" "Cholesterol"  "Fiber"        "Sugar"        "Sodium"       "TotalCarbs"
[15] "Protein"

Getting the elements in the 1st row of the Nutrition data.frame:

> Nutrition[1,]

Getting the elements of the 2nd column, i.e. the restaurant for each observation:

> Nutrition[,2]

Simple Plotting and Model Fitting in R Studio

Creating a simple plot in R:

> plot(Nutrition$Calories, Nutrition$SaturatedFat)

A simple linear regression model fit can be done using the lm() function.

> slr.fit = lm(SaturatedFat ~ Calories, data=Nutrition)

To see the initial output, simply type slr.fit. A more detailed summary can be obtained using the summary() function; for example, summary(slr.fit) will produce additional summaries for this model.

> slr.fit

Call:
lm(formula = SaturatedFat ~ Calories, data = Nutrition)

Coefficients:
(Intercept)     Calories
   -2.86933      0.02276

Adding the estimated model to the plot:

> abline(slr.fit)

In R, there are often several quantities computed by a function that are retained but not easily identified or known. The names() function can be used to identify the names of these often hidden quantities. For example, slr.fit$residuals will produce a vector of all the residuals from the fit.

> names(slr.fit)

Using the residuals from the fit to easily obtain a plot of the estimated variance function:

> plot(Nutrition$Calories, abs(slr.fit$residuals))
> lines(lowess(Nutrition$Calories, abs(slr.fit$residuals)))

You can very easily get help on most functions in R through the use of the help() function. For example, if you'd like information regarding the use of the lowess() function, type

> help(lowess)

[Screenshot: R help page for lowess()]

Example 8.3: Working again with the Wendys subset of the Nutrition dataset. The first step is to obtain only the observations from Wendys.
This can be done as follows.

> Wendys = Nutrition[Nutrition$Restaurant=="Wendys",]

To obtain only the variables needed, we will ask for only certain columns. These columns will be reordered as well.

> Wendys = Nutrition[Nutrition$Restaurant=="Wendys", c(2,3,9,7)]
> Wendys

Next, we will construct the X matrix, i.e. the design matrix. Recall the matrix notation structure for our simple linear regression model.

$$\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_{28} \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{28} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{28} \end{bmatrix}$$

Creating the X matrix

• Step #1: Creating the column of 1s.

> dim(Wendys)
[1] 28  4
> x0 = rep(1, 28)
> x0
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

• Step #2: Creating the 2nd column.

> x1 = Wendys[,4]
> x1
 [1]  580  800 1060  470  740  970  660  700  260  430  340  380  560  530  340  570  440  400  350  250  220  450
[23]  770  210  470  580  320  210

• Step #3: Putting the columns together in a matrix.

> x = cbind(x0, x1)
> View(x)

Creating the Y vector

> y = Wendys[,3]
> View(y)

Obtaining the estimated parameters, i.e. the $\hat{\boldsymbol{\beta}}$ vector

We know from the JMP output that the estimated y-intercept is about -5.8 and the slope estimate is about 0.03. Putting these quantities into vector format yields

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

This vector can be obtained using the following matrix formula.

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

Getting the first quantity, i.e. $(\mathbf{X}'\mathbf{X})^{-1}$, in R

First, getting the transpose of the matrix X:

> xprime = t(x)
> View(xprime)

Next, multiplying the transpose of X by X:

> xprimex = xprime %*% x
> View(xprimex)

Now, getting the inverse of $\mathbf{X}'\mathbf{X}$:

> xprimex.inv = solve(xprimex, diag(2))
> View(xprimex.inv)

Now, we can multiply the pieces together to get the estimated parameters, i.e.
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$:

> beta.hat = xprimex.inv %*% xprime %*% y
> View(beta.hat)

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

Predicted Values and Residuals

Predicted values:

> y.pred = x %*% beta.hat
> View(y.pred)

Residuals:

> resid = y - y.pred
> View(resid)

Some Other Commonly Used Quantities

[JMP output: Summary of Fit and ANOVA table]

Getting the Sum of Squares for C. Total, i.e. the total unexplained variation in the marginal distribution:

> C.Total = 27 * var(Wendys$SaturatedFat)
> C.Total
[1] 1467.7143

Getting the total unexplained variation in the conditional distribution can be done quite easily using the residual vector.

> Sum.Squared.Error = t(resid) %*% resid
> Sum.Squared.Error
       [,1]
[1,] 170.94

Dividing the quantity above by 26 yields our variance estimate under a constant variance assumption. That is, $\hat{\sigma}^2$ is given by

> Mean.Squared.Error = Sum.Squared.Error / 26
> Mean.Squared.Error
     [,1]
[1,] 6.57

Taking the square root yields the estimated standard deviation, i.e. $\hat{\sigma} = \widehat{\text{StdDev}}(\text{SaturatedFat} \mid \text{Calories})$.

> sqrt(Mean.Squared.Error)
      [,1]
[1,] 2.564

R² and a Visualization of R²

Getting the R² value via the reduction in unexplained variation:

> RSquared = (C.Total - Sum.Squared.Error)/C.Total
> RSquared
          [,1]
[1,] 0.8835354

Visualization of R²

There is a visual interpretation of R², which we have not yet discussed in this class. This visualization is given by plotting the y values against the predicted values. If the model provides a good fit, then the points on this plot should follow the y = x line, which has been included on the plot below.

> plot(y, y.pred, xlim=c(0,30), ylim=c(0,30))
> abline(0,1)

Questions

1. What would this plot look like if the R² value was very close to 1?
2. Consider the 1st observation in our dataset, the Daves Hot N Juicy ¼ lb Single. This item has a SaturatedFat value of 14, and the predicted SaturatedFat from the regression line was determined to be 12.12.
   a. Find this point on the graph above.
   b.
Identify the residual for this point on the graph.

The R² quantity calculated above can also be computed by squaring the correlation measurement from the plot above. Traditionally, ρ, the Greek r, is used to identify a correlation; thus, I'd guess that this is where the R² notation was derived from.

> cor(y, y.pred)^2
          [,1]
[1,] 0.8835354

Obtaining the Standard Errors for Estimated Parameters

The standard error quantities for the y-intercept and slope were discussed in a previous handout; however, the formulation of such quantities was not given. Standard error values for the y-intercept and slope are provided in standard regression output.

The standard error of the slope quantifies the degree to which the estimated slope of the regression line varies over repeated samples. From the above plot, we can see that the variation in the estimated slope certainly affects the variation in the estimated y-intercept. That is, these two quantities are said to co-vary, i.e. a covariation exists between them. The variation in the estimated parameter vector is given by the following variance/covariance matrix.

$$Var(\hat{\boldsymbol{\beta}}) = Var\left( \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) \\ Cov(\hat{\beta}_0, \hat{\beta}_1) & Var(\hat{\beta}_1) \end{bmatrix}$$

The estimated variance/covariance matrix is given by the following quantity.

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

Getting the variance/covariance matrix of the estimated parameter vector:

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}
= 6.57 \cdot \begin{bmatrix} 28 & 14060 \\ 14060 & 8412800 \end{bmatrix}^{-1}
= 6.57 \cdot \begin{bmatrix} 0.2221 & -0.00037 \\ -0.00037 & 0.0000007 \end{bmatrix}
= \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}$$

Thus, the standard error, i.e. standard deviation, of the estimated y-intercept is given by

$$\text{Standard Error of } \hat{\beta}_0 = \sqrt{\widehat{Var}(\hat{\beta}_0)} = \sqrt{1.459} = 1.208$$

and the standard error of the estimated slope is given by

$$\text{Standard Error of } \hat{\beta}_1 = \sqrt{\widehat{Var}(\hat{\beta}_1)} = \sqrt{0.0000049} = 0.0022$$

Comment: The co-variation that exists between the model parameters is ignored when the 95% confidence intervals are individually considered.
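The matrix formula $\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$ can be cross-checked against R's built-in vcov() function. The sketch below is self-contained: it uses only six of the 28 Wendys rows (the ones whose Calories and SaturatedFat values appear earlier in this handout) as a stand-in for the full subset, so the numbers differ from the full-data results; the point is that the hand computation and vcov() agree exactly.

```r
# Cross-check: sigma2.hat * (X'X)^{-1} matches R's built-in vcov().
# Six (Calories, SaturatedFat) pairs from the handout stand in for
# the full 28-row Wendys subset.
x1 <- c(580, 800, 1060, 580, 320, 210)
y  <- c(14, 21, 30, 12, 2, 2.5)
x  <- cbind(x0 = 1, x1)                  # design matrix: column of 1s, then Calories

fit <- lm(y ~ x1)
sigma2.hat <- sum(resid(fit)^2) / (length(y) - 2)   # SSE / (n - 2)

vcov.manual <- sigma2.hat * solve(t(x) %*% x)

# Same matrix as R's built-in estimate; the standard errors are the
# square roots of the diagonal entries.
all.equal(as.numeric(vcov.manual), as.numeric(vcov(fit)))
sqrt(diag(vcov.manual))
```

On the full Wendys data this reproduces the 1.208 and 0.0022 standard errors shown above.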
A 95% joint confidence region does not ignore such covariation. This confidence region is constructed using a multivariate normal distribution. (Take STAT 415: Multivariate Statistics for all the details!)

[JMP output: individual 95% confidence intervals for model parameters, and the 95% joint confidence region for model parameters]

Predictions and Standard Errors for Predictions

Goal: Obtain a prediction (and its associated standard errors for CIs and PIs) for the expected SaturatedFat level of a Wendy's menu item with 900 calories.

[JMP output: prediction and estimated standard errors]

Creating a row vector that contains the information for our new observation:

> xnew = cbind(1, 900)

Note: Column binding is needed to create a row vector, as [1] and [900] are being put together side by side. Thus, cbind(1,900) creates the row vector needed to make a prediction for a food item with 900 Calories.

To obtain the predicted SaturatedFat, simply multiply this row vector by $\hat{\boldsymbol{\beta}}$.

> y.pred.900 = xnew %*% beta.hat
> y.pred.900
         [,1]
[1,] 22.03295

Multiplication Properties for Variances

• Variance of a constant, say c, times $\hat{\boldsymbol{\beta}}$:

$$Var(c \cdot \hat{\boldsymbol{\beta}}) = c^2 \cdot Var(\hat{\boldsymbol{\beta}}) = c \cdot Var(\hat{\boldsymbol{\beta}}) \cdot c$$

• Variance of a row vector, say r, times $\hat{\boldsymbol{\beta}}$. This is commonly referred to as a linear combination of the estimated parameter vector.

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}'$$

Getting the variance for the linear combination of interest when making a prediction for a food item with 900 Calories:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1} \cdot \mathbf{r}'
= \begin{bmatrix} 1 & 900 \end{bmatrix}
\begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}
\begin{bmatrix} 1 \\ 900 \end{bmatrix} = 1.004$$

Taking the square root of this quantity yields the prediction standard error provided by JMP.

$$\text{Standard Error for CIs} = \sqrt{1.004} = 1.002$$

The standard error for an individual prediction (versus the average predicted value) requires the addition of the variability present in the conditional distribution.
That is, the variation for an individual prediction involves the variation in estimating the regression line plus the variation in the conditional distribution.

$$\text{Standard Error for PI} = \sqrt{\underbrace{\widehat{Var}\left(\hat{E}(\text{SaturatedFat} \mid \text{Calories}=900)\right)}_{\text{variability in mean function}} + \underbrace{\hat{\sigma}^2_{\text{SaturatedFat}\mid\text{Calories}}}_{\text{variability in conditional distribution}}} = \sqrt{1.004 + 6.57} = 2.75$$

A visualization of the 95% prediction interval and its corresponding standard error is given below.

[Plot: 95% prediction interval for Calories = 900]

Prediction intervals certainly vary over repeated samples. The standard error for an individual prediction measures such variation.
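Both standard errors above can be reproduced with the matrix formulas and checked against R's built-in predict() function. As before, this sketch uses six of the 28 Wendys rows (values taken from this handout) as a stand-in for the full subset, so the numbers differ from the full-data output; what matters is that the matrix formulas and predict() agree.

```r
# Standard errors at Calories = 900: for the estimated mean (CI) via
# r %*% Var(beta.hat) %*% r', and for an individual prediction (PI)
# by adding the conditional variance sigma2.hat. Six rows from the
# handout stand in for the full Wendys subset.
x1 <- c(580, 800, 1060, 580, 320, 210)
y  <- c(14, 21, 30, 12, 2, 2.5)
x  <- cbind(1, x1)

fit <- lm(y ~ x1)
sigma2.hat <- sum(resid(fit)^2) / (length(y) - 2)
V <- sigma2.hat * solve(t(x) %*% x)      # estimated Var(beta.hat)

r <- cbind(1, 900)                       # row vector for the new observation
se.ci <- sqrt(r %*% V %*% t(r))          # SE for the estimated mean
se.pi <- sqrt(se.ci^2 + sigma2.hat)      # adds the conditional variance

# predict() reports the same SE for the mean ...
pr <- predict(fit, newdata = data.frame(x1 = 900), se.fit = TRUE)
all.equal(as.numeric(se.ci), pr$se.fit)

# ... and its prediction-interval half-width equals t * se.pi,
# with t the 97.5th percentile of a t distribution on n - 2 df.
pi.900 <- predict(fit, newdata = data.frame(x1 = 900), interval = "prediction")
all.equal(as.numeric(pi.900[, "upr"] - pi.900[, "fit"]),
          qt(0.975, df = length(y) - 2) * as.numeric(se.pi))
```

On the full Wendys data the same computation gives the 1.002 and 2.75 values shown above.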