Simple Linear Regression

Lean Six Sigma
Outline
• Introduction
• Simple Linear Regression
• Regression Vs. ANOVA
• Correlation and R2
• Residuals
• Key Points
Introduction
In this section, we introduce Simple Linear Regression
as a method for studying the relationship between a
continuous response and one continuous factor.
One-way ANOVA and simple linear regression are
conceptually very different.
ANOVA allows us to determine whether different
factor levels have a significant effect on the
response.
Regression allows us to model the relationship between
the continuous experimental factor and the response.
Introduction
ANOVA:      X - Nominal,    Y - Continuous
Regression: X - Continuous, Y - Continuous
Simple Linear Regression
Linear regression is used when experimental factors (also
called regressors or predictors) are continuous.
Linear regression is used to develop a model (a
relationship) between the response and the factors.
This model is used to predict the average response at
various factor settings.
In the case when there is only one factor of interest, a
simple linear regression model fits a linear relationship
between the factor and the response.
When modeling a relationship, we need to verify that the
proposed model fits the data.
This is done by first fitting the model, then checking to
see if any systematic relationships have been missed.
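As a minimal sketch of this fit-then-check workflow in Python (the original slides use a GUI stats package, apparently JMP; the data below are made up purely for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([11.2, 14.9, 18.3, 21.2, 24.8, 28.1])

# Step 1: fit the simple linear regression model
fit = stats.linregress(x, y)
print(f"fitted line: y = {fit.intercept:.3f} + {fit.slope:.3f} x")

# Step 2: check the fit; residuals should show no systematic pattern
residuals = y - (fit.intercept + fit.slope * x)
print(residuals)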
Simple Linear Regression
Example: Suppose we want to understand the
relationship between weights and heights for adults in
a fitness study. We could plot weight versus height:
[Scatterplot of Weight versus Height for the fitness study participants]
Does there seem to be a relationship between the two
variables?
Simple Linear Regression
A scatterplot provides some understanding of the relationship. The graph suggests that the two variables may be linearly related. This is indicated by the fact that, for given values of height, the average weights fall more or less on a straight line.
[Scatterplot of Weight versus Height; the average weights at each height fall roughly along a straight line]
Simple Linear Regression
A simple linear regression model describes the
mathematical relationship between two variables.
The theoretical model for a simple regression line is:

Y = β₀ + β₁X + ε

In this equation, Y is the response and X is the predictor. There are two coefficients, the intercept (β₀) and the slope (β₁). Coefficients define the mathematical relationship between the response and the predictor(s). Random variability is represented by ε.
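To make the roles of β₀, β₁, and ε concrete, here is a small simulation from this model (the "true" coefficient values below are assumptions chosen for the demo; the fitted estimates should land close to them):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

beta0, beta1, sigma = 7.0, 0.7, 2.5        # assumed "true" values
x = np.repeat([5.0, 10.0, 15.0, 20.0], 6)  # predictor settings
eps = rng.normal(0.0, sigma, size=x.size)  # random variability (epsilon)
y = beta0 + beta1 * x + eps                # Y = beta0 + beta1*X + epsilon

fit = stats.linregress(x, y)
print(f"estimated intercept = {fit.intercept:.2f} (true {beta0})")
print(f"estimated slope     = {fit.slope:.2f} (true {beta1})")
```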
Simple Linear Regression
For a regression analysis to be appropriate, the underlying
population must satisfy certain assumptions:
 For any given X, the values of Y are normally distributed,
 The means of the response Y depend linearly on X,
 The standard deviation of Y does not depend on X; that is, it is constant.
In terms of the data obtained for fitting a regression, the
values of Y must be independent of one another (a key
assumption).
For the most part, regression is fairly ‘robust’ to the
assumptions about the population. Only substantial
departures from any of the assumptions invalidate conclusions
from a regression analysis.
Simple Linear Regression
For each X, the response Y has a normal distribution with mean β₀ + β₁X and variance σ².

[Figure: normal curves for Y at each value of x, centered on the regression line μY|X = β₀ + β₁X]
Simple Linear Regression
Caution: Association, not causality!
Regression, with observational data, cannot be used to
establish a causal relationship between response
variables and predictors.
Regression analysis, along with ANOVA, can only be used to
quantify an association between two or more variables.
To establish causality requires experimental proof that a
change in X produces a change in Y.
Designed experiments are the most efficient method for
establishing experimental proof.
Correlation does not imply causality!!
Regression Vs. ANOVA
How do the approaches differ?
ANOVA: Do different factor levels have different effects on the response?

Regression: What is the nature of the relationship between the response and the predictors? How does the response change as we increase or decrease the predictor settings? Predict the response one would obtain at given settings of the predictors.

[Left: plot of Lumens by Phosphor level (1, 2, 3), the ANOVA view. Right: scatterplot of PSI versus %HrdWd with a linear fit, the regression view.]
Regression Vs. ANOVA
Example: The tensile strength of a paper product is of interest.
It is believed that percent hardwood, %HrdWd,
improves strength, but no real knowledge exists.
An experiment is designed with %HrdWd at four
levels: 5%, 10%, 15%, and 20%.
Six replications of each of the experimental settings
are obtained.
The treatment conditions are reset for every run so
that experimental variation is captured.
The 24 trials are run in random order.
The resulting data is summarized on the next slide.
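Randomizing the run order is easy to script; a minimal sketch (this layout is an assumption for illustration, not part of the original study):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Four %HrdWd levels, six replications each: 24 trials in total
levels = np.repeat([5, 10, 15, 20], 6)

# Randomize the run order so time trends don't bias any one level
run_order = rng.permutation(levels)
print(run_order)
```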
Regression Vs. ANOVA
Does there appear to be a relationship
between percent hardwood and PSI?
What is the apparent nature of the
relationship?
We will analyze the data using both
ANOVA and regression to illustrate
the differences between the two
methods.
Output for each of the methods is
given on the following slides.
%HrdWd   PSI (six replications per level)
5         7    8   15   11    9   10
10       12   17   13   18   19   15
15       14   18   19   17   16   18
20       19   25   22   18   20   23
Regression Vs. ANOVA
Based on the ANOVA analysis and the comparison of the
means, we would conclude that 10% and 15% hardwood
do not have different effects on PSI values.
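The one-way ANOVA F test behind this analysis can be reproduced outside JMP; a sketch with SciPy, using the data table above:

```python
from scipy import stats

# PSI readings grouped by %HrdWd level (from the data table)
psi_05 = [7, 8, 15, 11, 9, 10]
psi_10 = [12, 17, 13, 18, 19, 15]
psi_15 = [14, 18, 19, 17, 16, 18]
psi_20 = [19, 25, 22, 18, 20, 23]

f_stat, p_value = stats.f_oneway(psi_05, psi_10, psi_15, psi_20)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

Note that f_oneway only tests whether the four level means differ overall; a pairwise comparison of the means (Tukey's HSD, for example) is what supports the specific 10%-versus-15% conclusion.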
Regression Vs. ANOVA
Now, let’s look at the regression analysis of the data, this time using simple linear regression (SLR). To obtain this analysis, change %HrdWd to a continuous factor.
The response is PSI and the predictor is %HrdWd.
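A sketch of the same fit in Python with statsmodels (an alternative to the point-and-click steps above); the estimates should match the output on the next slide:

```python
import numpy as np
import statsmodels.api as sm

hrdwd = np.repeat([5.0, 10.0, 15.0, 20.0], 6)
psi = np.array([ 7,  8, 15, 11,  9, 10,
                12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18,
                19, 25, 22, 18, 20, 23], dtype=float)

# Fit PSI = b0 + b1 * %HrdWd by ordinary least squares
X = sm.add_constant(hrdwd)      # adds the intercept column
model = sm.OLS(psi, X).fit()

print(model.params)             # intercept ~ 7.25, slope ~ 0.6967
print(model.summary())          # ANOVA table, t tests, R-squared
```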
Regression Vs. ANOVA
Bivariate Fit of PSI By %HrdWd

[Scatterplot of PSI versus %HrdWd with the fitted regression line]

Fitting the model adds a regression line to the plot. The equation of this line allows us to predict PSI at different %HrdWd values. Of course, we should only make predictions within the range of the data. Additional options are available under the red arrow next to “Linear Fit”.

Linear Fit
PSI = 7.25 + 0.6966667 %HrdWd

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1        364.00833       364.008   53.7642     <.0001
Error      22        148.95000         6.770
C. Total   23        512.95833

Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   7.25        1.301005    5.57      <.0001
%HrdWd      0.6966667   0.095012    7.33      <.0001
Regression Vs. ANOVA
Under the red arrow next to “Linear Fit”, select Confid Curves Indiv for confidence intervals for individual values, and select Confid Curves Fit for confidence intervals for the predicted response (population mean) at different settings of %HrdWd.

[Scatterplot of PSI versus %HrdWd with the linear fit, a 95% CI band for the mean response, and a wider 95% CI band for individual observations]
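Both kinds of intervals can also be computed with statsmodels; a sketch (the model is refit here so the snippet stands alone):

```python
import numpy as np
import statsmodels.api as sm

hrdwd = np.repeat([5.0, 10.0, 15.0, 20.0], 6)
psi = np.array([ 7,  8, 15, 11,  9, 10, 12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18, 19, 25, 22, 18, 20, 23], dtype=float)
model = sm.OLS(psi, sm.add_constant(hrdwd)).fit()

# Intervals at a grid of %HrdWd settings within the data range
x_new = sm.add_constant(np.linspace(5.0, 20.0, 7))
pred = model.get_prediction(x_new)

# mean_ci_lower/upper: 95% CI for the mean response (Confid Curves Fit)
# obs_ci_lower/upper:  95% CI for individual values (Confid Curves Indiv)
print(pred.summary_frame(alpha=0.05))
```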
Regression Vs. ANOVA
Regression gives us a statistical model that allows us to predict
PSI at different percent hardwood values. ANOVA gives us
much less information.
[Side by side: ANOVA output and regression output]
Do regression and ANOVA lead to different conclusions about
the relationship between percent hardwood and PSI?
Regression Vs. ANOVA
In a designed experiment, a factor that is continuous is
only run at a small number of distinct settings.
In such a case, both ANOVA and Regression can be
used to analyze the resulting data.
When a factor is set at only two levels, the results of
linear regression analysis and ANOVA are identical.
From the hardwood example, we can see that we may
draw different conclusions based on the approach we
use to analyze our experimental results.
When more than two levels of truly continuous
factors are used, ANOVA and regression can lead to
very different conclusions.
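The two-level equivalence is easy to verify numerically: with X at only two settings, the ANOVA F statistic equals the square of the regression t statistic for the slope, and the p-values agree. A sketch with made-up two-level data:

```python
import numpy as np
from scipy import stats

# Hypothetical responses at two factor levels
low  = [12.0, 14.0, 13.0, 15.0, 12.5, 14.5]
high = [18.0, 17.0, 19.5, 18.5, 20.0, 17.5]

# One-way ANOVA on the two groups
f_stat, p_anova = stats.f_oneway(low, high)

# Simple linear regression with X coded as the two levels
x = np.repeat([0.0, 1.0], 6)
y = np.array(low + high)
fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr

print(f"ANOVA F = {f_stat:.4f}, regression t^2 = {t_stat**2:.4f}")
print(f"p-values: {p_anova:.6f} vs {fit.pvalue:.6f}")  # identical
```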
Simple Linear Regression
The prediction equation estimates the (assumed) linear relationship
between PSI and percent hardwood.
Linear Fit
PSI = 7.25 + 0.6966667 %HrdWd
This equation allows us to estimate PSI for different values of percent
hardwood. For example, if we want to predict PSI for 14% hardwood:
PSI = 7.25 + 0.6966667(14) ≈ 17.00
Note: The regression equation is only valid for values of X over the
range included in the analysis. Although extrapolation to other
values of X is possible, this should only be done with extreme caution.
Simple Linear Regression
If the p-value for the predictor, %HrdWd, is
<0.05, we conclude that the slope is likely not
equal to zero: there is a statistically significant
relationship between percent hardwood and PSI.
Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   7.25        1.301005    5.57      <.0001
%HrdWd      0.6966667   0.095012    7.33      <.0001
Correlation and R2
R Square (R2), the coefficient of determination, gives a
measure of the predictive ability of the regression model.
Summary of Fit

RSquare                      0.709626
RSquare Adj                  0.696427
Root Mean Square Error        2.60201
Mean of Response             15.95833
Observations (or Sum Wgts)         24
R2 is the proportion of the total variation in Y (SST) accounted for by the regression model (SS(Model)). R2 is calculated from the sums of squares:

R² = SS(Model) / SST = (SST − SSE) / SST
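A quick check of this formula against the output above, using the sums of squares from the ANOVA table:

```python
ss_model = 364.00833             # SS(Model), from the ANOVA table
ss_error = 148.95000             # SSE
ss_total = ss_model + ss_error   # SST = 512.95833

r_squared = ss_model / ss_total
print(f"R^2 = {r_squared:.6f}")  # ~0.709626, matching RSquare above
```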
Correlation and R2
R2 can take on values between 0.00 and 1.00.
Higher R2 is better.
An R2 of 1.0 indicates that no random error exists in the
response (the model accounts for all the variation in Y).
Unfortunately, a high R2 value does not necessarily
indicate a good regression model.
Correlation and R2
R2 has several shortcomings as a measure of fit for a
regression model:
 R2 does not have an associated probability
distribution that can be used to assess the
significance of the observed value. Thus, there is no
objective basis by which to determine what is a
“good” or “significant” value of R2.
 R2 is partially redundant to the overall F test. Since
the F test has an associated probability distribution,
the significance of the F Ratio can be objectively
evaluated while R2 cannot.
 R2 is systematically lowered by taking repeated
measurements (or nearly repeated measurements) for
the different values of X.
Correlation and R2
Shortcomings of R2 (continued):
 The value of R2 is sensitive to the range of the X
values.
By increasing the range of X, one can artificially
improve R2.
 R2 can be increased simply by adding regressors to
the model.
We’ll talk about R2 Adjusted, which is adjusted for
additional X’s, in the Multiple Regression section.
Correlation and R2
The correlation coefficient, denoted “r”, measures the
linear association between two sets of measurements.
The correlation coefficient can take on values between -1
and 1.
 The closer r is to 1 or -1, the more closely the points
fall on a line.
 If r is positive, the fitted line has positive slope; if r is negative, the fitted line has negative slope.
 If r is close to 0, there is little, if any, linear
association.
In the case of simple linear regression, the coefficient of
determination, R2, is the square of the correlation
coefficient, as the notation implies.
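A quick numerical check of the r-to-R2 connection (the data are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient r
fit = stats.linregress(x, y)  # SLR fit; fit.rvalue is the same r

print(f"r   = {r:.5f}")
print(f"r^2 = {r**2:.5f}")    # equals R^2 for simple linear regression
print(f"R^2 = {fit.rvalue**2:.5f}")
```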
The following slides show some common patterns.
Correlation and R2
[Two scatterplots (Y2 and Y4 versus X): one with r close to 1, strong positive association; one with r close to -1, strong negative association]
Correlation and R2
[Two scatterplots (Y1 and Y3 versus X): one with r close to 0, no linear association; one with r close to 0, no linear association, but there is a relationship between X and Y]
Residuals
To determine if the regression assumptions are
satisfied, we fit the regression model and then
analyze how the data values behave relative to the
fitted model to see if the fit is adequate.
This analysis is based on residuals.
A residual is the difference between the observed
value at X and the predicted value at X.
If the linear model is appropriate, residuals should be
normally distributed with mean 0 and constant
variance given X.
Residuals
Under Linear Fit, we can
save predicted values and
residuals to the data sheet.
The three points for which
residuals are calculated on
the previous slide are
highlighted here.
Residuals
To plot residuals,
select Plot Residuals
under the red arrow
next to Linear Fit.
Note that one can Save
Residuals and Save
Predicteds to the data
sheet as well.
Residuals
Residuals should be approximately normally spread
around zero, the centerline. Their variation should
be approximately constant across the values of X.
Are there any obvious trends or patterns?
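Outside JMP, a residual plot is a few lines of matplotlib; a sketch using the hardwood data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

hrdwd = np.repeat([5.0, 10.0, 15.0, 20.0], 6)
psi = np.array([ 7,  8, 15, 11,  9, 10, 12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18, 19, 25, 22, 18, 20, 23], dtype=float)

fit = stats.linregress(hrdwd, psi)
residuals = psi - (fit.intercept + fit.slope * hrdwd)

# Residuals versus X: look for curvature, funnels, or outliers
plt.scatter(hrdwd, residuals)
plt.axhline(0.0, linestyle="--")  # centerline at zero
plt.xlabel("%HrdWd")
plt.ylabel("Residual")
plt.show()
```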
Residuals
When evaluating the residual plots, look for the following
patterns:
 Curvature in the residuals: the model we fit is not adequate.
 Variation that increases as X increases: the homogeneity of variance assumption is violated.
 An outlier, with points not evenly dispersed on either side of the line: the unusual observation is influencing the model.
Key Points
 Regression cannot be used to establish causal relationships.
 Always plot the data.
 Neither R2 nor R2 Adjusted measures model adequacy.
At best, they measure how well the fitted model
predicts the observed values.
 Be very cautious when extrapolating outside the
range of X values used.
 A high p-value and a low R2 do not necessarily mean that there is no relationship between X and Y; the relationship may be real but nonlinear.
 Always plot residuals to check the adequacy of the
model.