Section 17

STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Regression is used to study relationships between variables. Linear regression is used
for a special class of relationships, namely, those that can be described by “straight”
lines, or by generalizations of “straight” lines to many dimensions.
In regression we seek to understand how the value of a response of variable (Y) is related
to a set of explanatory variables or predictors (X’s). This is done by specifying a model
that relates the mean or average value of the response to the predictors in mathematical
form. The mean of the response Y as function of X is denoted E(Y|X). In simple linear
regression, we have single predictor variable (X) and we use a “line” (of the form y = mx
+ b) to relate the response to the predictor.
Regression Model Notation:
E(Y|X) =
Var(Y|X) =
In regression we first specify a functional form or model for E(Y|X). We then fit that
model with using available data and then assess the adequacy of the model. We may
then decide to remove some of the explanatory variables that do not appear important,
add others to improve the model, change the function form of the model, etc.
Regression is an iterative process where we fit a preliminary model and then modify our
model based on the results.
Multiple Regression Examples:
a) E(Mercury Level in Walleyes |Weight, Length, Alkalinity, pH) =
 0  1Weight   2 Length   3 Alkalinity   4 pH
b) E(Systolic BP|
)=
249
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Example 16.1 - Length of Sunfish in Camp Lake
Data File: Camplake.JMP
Background:
These data were collected as part of study of bluegill sunfish in Camp Lake.
The variables measured in this study were:
 Age - age of the sunfish (determined by counting growth rings on scales, and
recorded to the nearest year)
 Scalerad - scale radius (mm/100)
 Length - length of sunfish (mm)  Response of interest (Y)
Goal: Investigate the relationship between age (X) and the length of a sunfish from
Camp Lake (Y). Also the relationship between scale radius and length is of potential
interest.
Assumptions for a Simple Linear Regression Model
1. The mean of the response variable (Y) can be modeled using X in the following
form: E (Y | X )   o   x * X
2. The variability in the response variable (Y) must be the same for each X, i.e.
Var (Y | X )   2 or equivalently SD(Y | X )   . In other words, the variation in the
response Y must be constant across the entire range of X.
3. The response measurements (Y’s) should be independent of each other. If a
simple random sample is taken from the population this typically the case. One
situation where this assumption is violated is when the data are collected over time.
4. The response measurements (Y), for a given value of X, should follow a normal
distribution.
We will discuss how to check the assumptions outlined above after fitting our initial
model.
250
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Pearson’s Product Moment Correlation (r):
This measure gives us some idea about whether or not the predictor(s) X and response Y
are linearly related. The correlations are obtained under Analyze > Multivariate.
Next, select the variables of interest and place them all in the Y box.
Correlation measures to what degree two numeric variables X and Y are related to
another in a linear fashion. If the correlation is 1.00 or -1.00, then the variables are
perfectly linearly related. A scale for correlations is given below.
Sample Correlation (r)  (Pearson’s)
n
r
 (x
i 1
n
 (x
i 1
i
i
 x )( y i  y )
 x)2
n
(y
i 1
i
 y) 2
What is the correlation between the variables here? In particular, what is the correlation
between age and scale radius with length?
NEVER CALCULATE CORRELATIONS WITHOUT PLOTTING YOUR DATA!
251
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
To test whether a correlation is significantly different from 0, we select Multivariate >
Pairwise Correlations. Here we see that all three pair-wise correlations are significantly
different from zero.
252
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
More on Correlation:
253
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Fitting the regression model relating Length (Y) to Age (X)
What is the population model? We want to model the mean value of Y using X, so the
model is given by:
E (Y | X )   o   x * X
or being specific for this situation, we have
E ( Length | Age)   o  1  Age
Again: E(Y|X) is the notation we use to denote the mean value of Y given X.
Taking a closer look at its pieces:
𝐸(𝐿𝑒𝑛𝑔𝑡ℎ|𝐴𝑔𝑒) = 𝛽0 + 𝛽1 𝐴𝑔𝑒
𝛽1 = 𝑠𝑙𝑜𝑝𝑒
For this data set, Y = Length (mm) and X = Age (years).
How is the line that best fits our data determined? Using the Method of Least Squares.
254
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
To fit the model:
E ( Length | Age)   o  1 * Age
we first select Analyze > Fit Y by X and place Length in the Y box and Age in the X box as
shown below. This will give us scatter plot of Y vs. X, from which we can fit the simple
linear regression model.
The resulting scatter plot is shown below.
With regression line added.
To perform the regression of Length on Age in JMP, select Fit Line from the Bivariate
Fit... pull-down menu.
255
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
The resulting output is shown below.
We begin by looking at whether or not this regression stuff is even helpful:
H o : Regression is NOT useful
H a : Regression is useful
This says that Age (X) explains a significant
amount of the variation in the response Length (Y).
Next we can assess the importance of the X variable, Age in this case:
H o : 1  0
H 1 : 1  0
Conclusions from both tests:
This is equivalent to saying the X is not
useful for explaining Y. Thus the results
of these two tests are identical in every
way!
95% CI for 𝛽1
256
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Determining how well the model is doing in terms of explaining the response is done
using the R-Square and Root Mean Square Error:
The proportion or percentage of variation explained
by the regression of Length (Y) on Age (X) is given
by the R-Square = .838 or 83.8%
More on the R-Square:
The amount of unexplained variability in Length (Y)
taking Age (X) into consideration is given by the
RMSE (root mean square error). This is an estimate
of the SD(Length|Age).
Describing the Relationship:
Eˆ ( Length | Age)  70.99  16.71 * Age
Interpret each of the parameter estimates and CI’s:
Estimating the Mean Y for a Given X and Making Individual Predictions
What do we estimate for the mean length of all sunfish in Camp Lake that are 4 years
old?
If picked a 4 year old fish from Camp Lake at random, what do we predict its length will
be?
257
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Checking the Assumptions:
To check the adequacy of the model in terms of the basic assumptions involves looking a
plots of the residuals from the fit. One plot that is particular helpful for checking a
number of the model assumptions is a plot of the residuals vs. the fitted values. When
we are performing simple linear regression we can alternatively plot the residuals vs. X,
which is what JMP gives by default when selecting the Linear Fit > Plot Residuals
option.
Ideal Residual Plot:
Violations to Assumption #1:
The trend need not be linear (BAD)
The trend need not be linear (BAD)
Violations to Assumption #2:
Variance of Y given X is NOT constant.
Megaphone violation of 1 & 2 (BAD)
258
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Violations to Assumption #3:
One point closely following another -positive autocorrelation, (BAD)
Extreme bouncing back and forth -- negative
autocorrelation (BAD)
Violations to Assumption #4:
To check this assumption, simple save the residuals out and make a histogram of the
residuals and/or look at a normal quantile plot. Recall, you can easily make a histogram of a
variable under Analyze > Distribution. We should generally assess normality using a
normal quantile plot as well.
Checking for outliers:
Determine the value of 2*RMSE. Any observations outside these bands are potential outliers
and should be investigated further to determine whether or not they adversely affect the
model.
259
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Checking for outliers in this example we find:
To obtain the residual plot shown below in JMP select Linear Fit > Plot Residuals
THE ASSUMPTION CHECKLIST:
Model Appropriate:
Constant Variance:
Independence:
Normality Assumption (see histogram above):
Identify Outliers:
260
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Confidence Interval for the Average Length:
(i.e. the mean for the entire population of sunfish of a specified age.)
Select Confid Curves Fit from the Linear Fit pull-down menu located below the scatter
plot. The narrow bands in plot below represent the CI for the Average Length. For
example, from the plot below we estimate that the mean length for 2 year old sunfish is
likely to be somewhere between 95 – 107 mm. We will examine a more precise way for
obtaining such intervals later in the tutorial.
Prediction Interval for the Length of Single Sunfish :
(i.e., for a single sunfish sampled from the population of all sunfish with a specified age.)
Select Confid Curves Indiv from the Linear Fit pull-down menu which is located
directly beneath the scatter plot. These are the wider bands in the plot above. For
example if we were to sampled a single 4 year old sunfish we estimate with 95%
confidence that its length will be somewhere between 115 – 155 mm. We will examine a
more precise way for obtaining exact intervals of this form later in the tutorial.
261
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Using the Analyze > Fit Model Option to Perform the Regression
An alternative to using Fit Y by X to perform simple linear regression, is to use the Fit
Model option from the Analyze menu. The advantages of this approach are two-fold:
1) You have access to more detailed results from your regression and have
enhanced features for estimation/prediction of Y.
2) Allows for the addition of more predictors (X’s) to your model. This is
called multiple regression and will be discussed in Section 18.
For the Camp Lake sunfish example we fit the model as follows:
Select Analyze > Fit Model and place Length in the Y box and Age in Model Effects box.
Y variable
X variables
go here.
If we had more predictors (X’s) that we wanted to add to our model we would
simply put them in the Model Effects box, e.g. we could scale radius as a
predictor of length as well. Remember when we have more than one predictor in
a linear regression we call it a multiple regression.
262
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
The output from the Analyze > Fit Model option is shown below:
The same numeric summaries, parameter estimates,
and test results are contained as part of the standard
output. The plot at the top is NOT a scatter plot of
Y vs. X. It is a plot of the actual Y values vs. the
predicted values from the regression model. The
stronger the trend exhibited the better the fit. It is
essentially a visualization of the R-square for the
fitted model.
In cases where we have multiple values of the response
for some of the X values we will be given the results of
the Lack Of Fit test. If the p-value here is small it can
indicate that our model does not adequately represent
the mean of the response variable Y. For example, if we
fit line to clearly nonlinear relationship we would expect
this p-value to be significant. In cases where there is
significant lack of fit the plot of the residuals vs. the
fitted values will generally show some non-linear trend.
Notice the bulk of the output is same as that obtained using the Analyze > Fit Y by X
approach.
263
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Interval Estimation for E(Y|X), the Mean Value of Y for a given X &
Prediction of Y for an Individual with a given X (when using Fit Model)
You can save 95% Confidence
Intervals for E(Y|X) to the data
spread sheet by selecting Mean
Confidence Interval.
You can save 95% Prediction
Intervals for Individual Y values
to the data spread sheet by
selecting Indiv Confidence
Interval.
The table on the following page is a portion data spread sheet showing both types of
intervals for these data.
Interpretation of the 95% Confidence Interval for E(Length|Age=5)
Consider estimating the average/mean length for all five year old sunfish in Camp Lake.
A 95% confidence interval for this mean is given by the interval 151.76 mm to 157.28
mm. We are 95% confident this interval covers the true mean length of 5 year old
sunfish in Camp Lake.
Interpretation of the 95% Prediction Interval for Length|Age=5
Suppose we are going to pick one five year old sunfish at random from the population of
all 5 year old sunfish in Camp Lake. What do estimate the length of this sunfish will be?
We estimate, with 95% confidence, that the actual length for this particular sunfish will
be somewhere between 135.2 mm and 173.8 mm. This range of scores has a 95% chance
of covering the actual length for a randomly selected 5 year old sunfish. Notice how
much wider this interval is when compared to interval for the mean length for all 5 year
old sunfish in Camp Lake. This should seem natural as it is much harder to predict the
length of a single randomly selected fish than the average score for all fish.
264
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Example 16.2 – Length (Y) vs. Scale Radius (X)
Linear Fit
Length = 42.346763 + 40.17735 Scalerad
Summary of Fit
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.731336
0.726858
12.28663
142.9355
62
Summary of R-square:
Analysis of Variance
Source
Model
Error
C. Total
DF
1
60
61
Sum of Squares
24656.064
9057.678
33713.742
Mean Square
24656.1
151.0
F Ratio
163.3271
Prob > F
<.0001
Parameter Estimates
Term
Intercept
Scalerad
Estimate
42.346763
40.17735
Std Error
8.02401
3.143781
t Ratio
5.28
12.78
Prob>|t|
<.0001
<.0001
Conclusions from tests results highlighted above:
265
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Residual Plot (residuals vs. scale radius (X))
40
30
Residual
20
10
0
-10
-20
-30
1.5
2
2.5
3
3.5
Scalerad
Discussion of Residual Plot:
Distribution of the Residuals
266
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Example 16.3 –Age, Length, and Weight of Paddlefish
Datafile: Paddlefish.JMP
Age – age of paddlefish in years
Length – fork length (cm)
Weight – weight of paddlefish (kg)
Below is a scatterplot with correlations for age, length, and weight of a random sample
of n = 185 paddlefish from Pools 11, 12, and 13 of the Mississippi River near Iowa.
The correlations of weight (Y) with length & age are all strong, but are they truly
measuring the strength of the actual relationship between weight and the other
variables?
267
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Let’s fit the simple linear regression of weight (Y) on length (X), i.e.
𝐸(𝑊𝑒𝑖𝑔ℎ𝑡|𝐿𝑒𝑛𝑔𝑡ℎ) = 𝛽0 + 𝛽1 𝐿𝑒𝑛𝑔𝑡ℎ
Discussion of these results:
Residual plot - residuals (𝑒̂ ) vs. fitted values (𝑦̂)
268
STAT 305: Chapter 17 – Correlation and Simple Linear Regression
Spring 2014
Consider fitted a cubic polynomial regression model,
𝐸(𝑊𝑒𝑖𝑔ℎ𝑡|𝐿𝑒𝑛𝑔𝑡ℎ) = 𝛽𝑜 + 𝛽1 𝐿𝑒𝑛𝑔𝑡ℎ + 𝛽2 𝐿𝑒𝑛𝑔𝑡ℎ2 + 𝛽3 𝐿𝑒𝑛𝑔𝑡ℎ3
Residual plot - residuals (𝑒̂ ) vs. fitted values (𝑦̂)
Now what?!?
269