Statistics – 20080
MINITAB - Lab 2
1. Simple Linear Regression
In simple linear regression we attempt to model a linear relationship between two variables
with a straight line and make statistical inferences concerning that linear model. Using the fire
damage dataset from last week, we are assuming here that the variable on the x-axis (the distance
from the fire station) will predict the amount of fire damage caused to the house. In this case
therefore, distance from the fire station is the predictor variable and the damage to the property
is the response variable.
2. Fitting the Line
Construct a scatter plot of the data to determine the nature of the relationship between the
two variables. Then calculate the correlation coefficient, which describes the relationship
numerically. The next step is to calculate the equation of the least squares regression line,
which is the data's line of best fit. This step is known as fitting the line. The fitted line
helps the researcher to see any trends and to predict the values of y from the values of x. In
other words, we want to predict the damage in euros using the distance from the fire station.
When fitting a straight-line model we fit what is called the least squares line. This is the straight
line that minimises the sum of the squared vertical distances between the data points and the line.
An equation for a straight-line model has two components, the intercept and the slope.
Therefore the equation of the least squares regression line takes the form,
Response = intercept + slope* (predictor variable) + ε (the error or residual term)
Or more generally:
y = a + bx + ε
Where
• a is the intercept
• b is the slope of the line
• ε is the vertical distance between the fitted line and the data point (i.e. the residual)
• x is the chosen value of the predictor variable
Summary from lecture notes
The formulae for the estimates of the slope and the intercept are:
Slope: $b = \dfrac{SS_{xy}}{SS_{xx}}$
Intercept: $a = \bar{y} - b\bar{x}$
Where
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$
$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$
$n$ = sample size
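The slope and intercept formulae above can be checked by hand on a small dataset. The following is a minimal Python sketch, not part of the Minitab workflow; the function name `least_squares` and the five data points are hypothetical and are not taken from the fire damage dataset.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) using the SS_xy / SS_xx formulae."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    # SS_xy = sum(x_i * y_i) - (sum x_i)(sum y_i) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # SS_xx = sum(x_i^2) - (sum x_i)^2 / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = ss_xy / ss_xx                # slope
    a = sum_y / n - b * sum_x / n    # intercept = y-bar - b * x-bar
    return a, b


# Hypothetical illustrative data (NOT the fire damage dataset)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

For these made-up points SS_xy = 6 and SS_xx = 10, so the slope is 0.6 and the intercept is 4 − 0.6 × 3 = 2.2.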
3. Fitting A Regression Model in MINITAB
Using the drop-down menus in Minitab, go to Stat > Regression > Fitted Line Plot. In the dialog box:
1. Select the response variable.
2. Select the predictor variable.
3. Ensure that the linear model is selected.
This command will give you a scatter plot of the response variable versus the predictor variable
with the least squares line shown in blue on the plot. The least squares regression equation will be
displayed over the plot. If you look at the session window you will also see the ANOVA table for
this model and the associated p-value, similar to the table below. We will cover what this ANOVA
table means in the next class.
Regression Analysis: Damage - $ versus Distance

Regression Line:
The regression equation is
Damage - $ = 10.28 + 4.919 Distance

S = 2.31635   R-Sq = 92.3%   R-Sq(adj) = 91.8%

ANOVA Table:
Analysis of Variance
Source       DF       SS       MS       F      P
Regression    1  841.766  841.766  156.89  0.000
Error        13   69.751    5.365
Total        14  911.517
What is the least squares regression equation? _________________________________
What is the slope? ________________________________________________________
What is the intercept? ______________________________________________________
What type of relationship is there between distance and damage?
Now that we know the least squares regression equation, we can use it to make predictions
for the response variable. If a building on fire were 10 miles from the nearest fire station,
how much damage would you predict? ________________
Hint: Substitute 10 for x in the regression line equation.
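Making a prediction is simply a matter of substituting a distance into the fitted equation. A minimal Python sketch of that substitution follows, using the coefficients from the Minitab output above; the function name `predict_damage` and the example distance of 3.5 miles are hypothetical, chosen only for illustration.

```python
# Coefficients read from the Minitab output: Damage - $ = 10.28 + 4.919 Distance
INTERCEPT = 10.28
SLOPE = 4.919


def predict_damage(distance):
    """Fitted damage (in the units of the Damage column) at a given distance."""
    return INTERCEPT + SLOPE * distance


# Hypothetical example distance, not one of the worksheet questions
example = predict_damage(3.5)
print(f"Predicted damage at 3.5 miles: {example:.2f}")
```

The same function can be called with any other distance, although predictions far outside the range of distances in the dataset should be treated with caution.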
(a) For each of the datasets used in last week’s lab, find the least squares regression
equation, the slope and the intercept.
1.
2.
3.
4.
Answer the following using your answers from (a).
If the three most recent volunteers in a blood donation clinic had blood pressures of 86, 91 and
101 respectively, what would your estimate of their platelet-calcium concentrations be?
___________________
Five people were randomly selected and their heights were measured to be 145, 150, 161, 165
and 177 cm. What are the estimated weights of these people? _______________________
____________________________________________________________________________
If you look more closely at the figures you have just calculated and compare them to the actual
values from the original dataset you will notice that they are not exactly the same. This is because
the calculated figures are fitted values taken from the regression line (indicated in blue on the
scatter plot). The differences between the actual values and the fitted values are the residuals. The
coefficient of correlation is one way of quantifying how large these discrepancies are.
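The idea of a residual (actual value minus fitted value) can be illustrated with a short Python sketch. The data points below are hypothetical and are not from any of the lab datasets; the intercept 2.2 and slope 0.6 are the least squares estimates for these made-up points.

```python
# Hypothetical data and its least squares line, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = 2.2, 0.6  # intercept and slope fitted to the toy data above

# Fitted value for each x, read off the regression line
fitted = [a + b * x for x in xs]

# Residual = actual value - fitted value
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]

# Sum of squared residuals: the quantity the least squares line minimises
sse = sum(e ** 2 for e in residuals)

for x, y, y_hat, e in zip(xs, ys, fitted, residuals):
    print(f"x={x:.0f}  actual={y:.1f}  fitted={y_hat:.1f}  residual={e:+.1f}")
print(f"Sum of squared residuals: {sse:.2f}")
```

Some residuals are positive and some negative; for a least squares line with an intercept they sum to zero (up to rounding).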
6. The Coefficient of Determination - R²
How much of the variation in y is explained by the linear relationship between x and y? The
answer to this is given by the Coefficient of Determination or R². The Coefficient of Determination
is the ratio of the variation 'explained' by the linear relationship between the predictor and
response variables to the total variation in the data.
Coefficient of Determination - R²
$R^2 = \dfrac{SS_{\text{Regression}}}{SS_{\text{Total}}}$
What is R² for the regression model fitted to the fire damage dataset? _____________________
What does this figure mean? ______________________________________________________
Note that in the case of a simple linear regression model the coefficient of determination is the
correlation coefficient squared. Calculate the square root of R² and compare it to the correlation
coefficient; the two should agree in magnitude, with the correlation coefficient taking the sign of
the slope.
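The ratio can be checked directly from the sums of squares in the ANOVA table shown earlier for the fire damage model. This is a minimal Python sketch of that arithmetic; the variable names are illustrative only.

```python
import math

# Sums of squares read from the Minitab ANOVA table for the fire damage model
ss_regression = 841.766
ss_total = 911.517

# Coefficient of determination: explained variation / total variation
r_squared = ss_regression / ss_total

# In simple linear regression |r| = sqrt(R^2); the sign of r follows the slope,
# which is positive here (4.919), so r is the positive root.
r = math.sqrt(r_squared)

print(f"R-squared = {r_squared:.3f} ({r_squared * 100:.1f}%)")
print(f"Square root of R-squared = {r:.3f}")
```

The percentage printed here should match the R-Sq value reported in the Minitab session window, and the square root can be compared with the correlation coefficient calculated in last week's lab.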
Calculate and interpret the coefficient of determination for each of the datasets from last week’s
lab.
1.
2.
3.
4.
Assignment: Due in two weeks' time.
From the Minitab class page download the file named TV. This contains the data for 15 students
on their final year mark and the number of hours they spend watching TV.
1. Construct a simple linear regression line for this data and show the graph.
2. What is the correlation coefficient for this data?
3. Is there a negative or a positive correlation between the number of hours spent watching
TV and the end of year grade?
4. What is the value of the intercept, i.e. the point where the regression line crosses the y-axis?
5. What is the slope of this regression line?
Assignments should be handed in at the beginning of class two weeks from today. Late
assignments will not be accepted.
REVISION SUMMARY
After this lab you should be able to:
- Calculate the correlation coefficient by hand and in Minitab
- Fit a simple linear regression line to data using Minitab
- Make predictions from the least squares regression line