Linear Regression

Simple Linear Regression Modeling Using Excel
BSAD 30, Spring 2017 (Thursday, 4/28)
You need to have the Data Analysis Toolpak plug-in installed before you can do this exercise!
There are detailed instructions for how to install the Data Analysis Toolpak provided on the
course schedule.
Simple Linear Regression Modeling
Here our model is a mathematical relationship between the factors involved in a problem. The
purpose of regression is prediction. This is different from a linear program, where the purpose
is to obtain an EXACT solution (the quantities of the different decision variables to produce) to a
particular problem, given certain constraints.
For example, in simple linear regression we may want to know something quantitative about
the relationships between: 1) the number of building permits and the interest rate, 2) the
number of winter jackets sold and the temperature, 3) expected fuel efficiency in miles per
gallon (MPG) and vehicle weight, etc.
First Order Regression Analysis (or simple linear regression) involves studying the relationship
between two variables.
We call one of the variables X = the independent or predictor variable.
We call the other variable Y = the dependent or response variable.
We use values of X to predict Y, so Y depends on X.
Following example 1) above, say that we want to predict the number of building permits issued
(Y) based on the interest rate (X). Following example 2) above, we want to predict the number
of winter jackets sold (Y) based on the temperature (X).
To begin a study of the relationship between two variables (X and Y), we need a random sample
of size n and will need to record the values of X and Y for each observation. This is called
bivariate data because each observation records two values. What are data that record more
than two values per observation called?
The general equation of a line: Y = y-intercept + slope(X)
y-intercept: where does the line cross the y-axis? What is Y when X is zero?
Deterministic Model
We take a bivariate sample from the population (a pair of observations associated with each
sampling point) and estimate the equation for the regression line using the sample data.
𝑦̂ = 𝛽̂0 + 𝛽̂1 𝑥
𝑥 = observed values of the independent variable
𝑦̂ = predicted value of the dependent variable using observed 𝑥
𝛽̂0 = estimated y-intercept
𝛽̂1 = estimated slope
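For reference, Excel's Regression tool estimates this line by ordinary least squares. The standard closed-form estimates are sketched below. The symbols x̄ and ȳ (the sample means of X and Y) and the index i are notation assumed here; the handout does not define them.

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The slope is the co-variation of X and Y scaled by the variation in X, and the intercept is chosen so the line passes through the point (x̄, ȳ).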
Using the Regression function in MS Excel
The examples are based on:
http://www.stat.wmich.edu/wang/216/notes/LinearRegression_handout.pdf
Assume we collect the following bivariate data from a restaurant describing the relationship
between X = the total dollar value of the bill and Y = the observed customer tip:
Customer (observation)   Bill (total $)   Tip ($)
 1                        98.84           15.00
 2                        33.46            3.77
 3                        63.60            9.81
 4                        50.68            9.00
 5                       107.34           17.33
 6                       200.54           35.00
 7                        57.32           12.56
 8                        25.10            4.20
 9                        76.99           10.55
10                        89.04           20.00
1) Identify the independent variable (x) and the dependent variable (y). Using our
observed data above, we want to predict the dollar value of a tip using the total dollar
value of the bill.
2) Enter the data above into Excel and report basic summary statistics using Descriptive
Statistics from the Data Analysis tool. Bill Total is the independent variable (X) and Tip
Total is the dependent variable (Y).
Statistic             Bill Total    Tip Total
Mean                  80.291        13.722
Standard Error        15.827274     2.87815751
Median                70.295        11.555
Mode                  #N/A          #N/A
Standard Deviation    50.0502349    9.1015332
Sample Variance       2505.02601    82.8379067
Kurtosis              3.48494231    2.83528868
Skewness              1.60851319    1.4520748
Range                 175.44        31.23
Minimum               25.1          3.77
Maximum               200.54        35
Sum                   802.91        137.22
Count                 10            10
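As a cross-check outside Excel, most of the summary statistics can be recomputed with a short Python sketch. This is only an illustration: the variable and function names (`bill`, `tip`, `describe`) are made up here, and Mode, Kurtosis, and Skewness are left out to keep the example short.

```python
import math
import statistics

# Restaurant data from the handout: bill totals and tips for 10 customers.
bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]


def describe(values):
    """Return a subset of the Descriptive Statistics output as a dict."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / math.sqrt(n),  # standard error of the mean
        "Median": statistics.median(values),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


bill_stats = describe(bill)
tip_stats = describe(tip)
for name in bill_stats:
    print(f"{name:<20}{bill_stats[name]:>14.4f}{tip_stats[name]:>14.4f}")
```

Note that the Standard Error row is the standard deviation divided by the square root of the count, which is why it shrinks as the sample grows.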
3) Create a scatter plot in Excel. Highlight the Bill Total ($) and Tip ($) data from the table
above → go to the Insert tab on the main menu bar → choose “Scatter”.
This chart allows us to visualize the two-dimensional relationship between the variables.
What do the data look like? Are there any outliers?
[Figure: scatter plot titled "Bill Total ($) versus Tip ($)", with Bill Total on the x-axis and Tip on the y-axis.]
4) Describe the direction and the strength of the correlation between the bill total and
the tip. Highlight the Bill Total ($) and Tip ($) data → go to the Data tab on the main
menu bar → Data Analysis → choose “Correlation”.
                 Bill Total ($)   Tip ($)
Bill Total ($)   1
Tip ($)          0.968035         1
The correlation (the interdependence between two variables) between Bill Total and Tip
is 0.968. The correlation coefficient is denoted r and lies in the range -1 ≤ r ≤ 1. A value
of r = 0 means no linear correlation, r = 1 means perfect positive correlation, and r = -1
means perfect negative correlation. A value of 0.968 means that these two variables are
highly correlated: there is a strong positive linear relationship between the Bill Total and
the Tip amount.
A perfect correlation: |r| = 1
A strong correlation (general rule of thumb): |r| ≥ 0.7
A moderate correlation (general rule of thumb): 0.5 ≤ |r| < 0.7
A weak correlation (general rule of thumb): 0.3 ≤ |r| < 0.5
No correlation: r = 0
A positive correlation means that higher valued X’s (in this case the Bill Total) tend to go
along with higher valued Y’s (in this case the Tip Total). A negative correlation would
mean that higher X’s would go along with lower valued Y’s.
NOTE OF CAUTION! Correlation does not imply causality. One of the most common
mistakes I observed while working as an analyst involved confusing correlation with
causality. A high correlation alone gives us no statistical or factual basis to say that X
causes Y. It is, however, a statistically valid statement to say that X and Y are highly
correlated or highly related.
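The correlation coefficient reported by Excel can be reproduced by hand. The sketch below computes Pearson's r from its definition and labels its strength using the rule-of-thumb cutoffs from this step. The function names `pearson_r` and `strength` are hypothetical, and the label returned for values of |r| below 0.3 is an assumption, since the handout does not name that band.

```python
import math

bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Label |r| with the handout's rule-of-thumb cutoffs (0.7, 0.5, 0.3)."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "weak"
    return "little or none"  # assumption: the handout gives no label below 0.3


r = pearson_r(bill, tip)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.6f} ({strength(r)}, {direction})")
```

The sign of r gives the direction and its absolute value gives the strength, which is why the cutoffs are applied to |r| rather than to r itself.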
5) Use the Regression tool in the Data Analysis Toolpak to generate output from a
regression model: choose Data Analysis from the Data tab → Regression → input the
data ranges for the X and Y variables, choose your output range (an open space in your
spreadsheet), and check the Residuals checkbox.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.968035115
R Square             0.937091984
Adjusted R Square    0.929228482
Standard Error       2.421273298
Observations         10

ANOVA
             df   SS            MS            F             Significance F
Regression   1    698.6406449   698.6406449   119.1698034   4.39454E-06
Residual     8    46.90051506   5.862564382
Total        9    745.54116

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept        -0.41204       1.50420          -0.27393   0.79108   -3.88073    3.05665     -3.88073      3.05665
Bill Total ($)   0.17604        0.01613          10.91649   0.00000   0.13885     0.21322     0.13885       0.21322
RESIDUAL OUTPUT
Observation   Predicted Tip ($)   Residuals
 1            16.98727716         -1.987277162
 2             5.478094951        -1.708094951
 3            10.78379626         -0.973796264
 4             8.509421315         0.490578685
 5            18.48357647         -1.15357647
 6            34.8900583           0.1099417
 7             9.678295127         2.881704873
 8             4.006440572         0.193559428
 9            13.14090776         -2.590907763
10            15.26213208          4.737867923
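To see where the main numbers in this output come from, here is a minimal Python sketch that fits the same line by ordinary least squares. It is not Excel's implementation; the function name `fit_simple_ols` and the dictionary keys are invented for illustration.

```python
import math

bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares and summarize the fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * a for a in x]
    residuals = [b - p for b, p in zip(y, predicted)]
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_residual = sum(e ** 2 for e in residuals)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - ss_residual / ss_total,
        # Residual df is n - 2: one slope and one intercept are estimated.
        "std_error": math.sqrt(ss_residual / (n - 2)),
        "ss_residual": ss_residual,
        "ss_total": ss_total,
    }


fit = fit_simple_ols(bill, tip)
for name, value in fit.items():
    print(f"{name:<12}{value:>14.6f}")
```

The `std_error` value corresponds to the Standard Error row under Regression Statistics: it is the square root of the Residual MS in the ANOVA table.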
6) Write out the regression equation following the basic equation form given earlier.
You should be able to clearly identify the 𝛽̂0 and 𝛽̂1 values in the regression
output:
7) Use the estimated regression equation from 6) above to predict the average tip for a
customer with a bill totaling $70.50.
8) The coefficient of determination (R^2) measures how much of the variation in the
dependent variable (the y-variable) is explained by the regression model. We can
express this as a percent. Interpret the coefficient of determination (R^2) for this
problem.
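Steps 6) through 8) can be checked with a short sketch that plugs the reported coefficients into the regression equation. The constant and function names below are hypothetical, and the coefficients are the rounded values shown in the regression output, so results may differ slightly from Excel's unrounded calculation.

```python
# Rounded coefficients as printed in the regression output.
B0 = -0.41204            # Intercept (estimated y-intercept)
B1 = 0.17604             # Bill Total ($) coefficient (estimated slope)
R_SQUARED = 0.937091984  # R Square from the Regression Statistics block


def predict_tip(bill_total):
    """Predicted tip in dollars for a given bill total, using y = B0 + B1 * x."""
    return B0 + B1 * bill_total


predicted = predict_tip(70.50)
print(f"Predicted tip for a $70.50 bill: ${predicted:.2f}")
print(f"Share of tip variation explained by the model: {R_SQUARED:.1%}")
```

Formatting R^2 with `:.1%` is simply one way to express the coefficient of determination as a percent, as step 8) suggests.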
9) What is the difference between linear extrapolation and prediction?
10) Is it appropriate to use our regression equation to estimate the tip associated with a bill
total outside of our observed data range (X < 25.10 or X > 200.54)?
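One cautious convention (an assumption here, not something the handout prescribes) is to flag any bill total that falls outside the observed range before using the fitted line. The sketch below only labels such cases; it does not decide whether the estimate should be trusted. All names are hypothetical and the coefficients are the rounded values from the regression output.

```python
X_MIN, X_MAX = 25.10, 200.54  # smallest and largest observed bill totals
B0, B1 = -0.41204, 0.17604    # rounded intercept and slope from the output


def is_extrapolation(bill_total):
    """True when the bill total lies outside the observed data range."""
    return not (X_MIN <= bill_total <= X_MAX)


def predict_tip_with_flag(bill_total):
    """Return (predicted tip, extrapolation flag) for a bill total."""
    return B0 + B1 * bill_total, is_extrapolation(bill_total)


for x in (10.00, 70.50, 250.00):
    estimate, flagged = predict_tip_with_flag(x)
    note = "outside observed range" if flagged else "within observed range"
    print(f"bill ${x:7.2f} -> tip ${estimate:6.2f} ({note})")
```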
11) The question of whether or not it is appropriate to use a linear regression model to
predict a particular outcome is different from the question of how “good” the model is
(the explanatory power of a particular linear regression model).
Whether a linear regression model is a good “fit” for the data addresses the question of
whether or not it is appropriate to use linear regression. We can use a scatter plot of the
residuals to test this. A residual plot with no pattern in the residuals (the error terms)
indicates a good fit. A residual plot with some type of identifiable pattern indicates a poor
fit (an inappropriate model). NOTE: This is a separate issue from whether the regression
model has good explanatory power!
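The residuals used in such a plot are just observed minus predicted values. The following sketch recomputes them and prints a crude text version of a residual plot, ordered by predicted tip. It is an illustration only: the layout and names are invented, and judging whether a pattern is present still requires looking at the output.

```python
bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]

# Least-squares slope and intercept for tip = intercept + slope * bill.
n = len(bill)
mean_x = sum(bill) / n
mean_y = sum(tip) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(bill, tip))
         / sum((x - mean_x) ** 2 for x in bill))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in bill]
residuals = [y - p for y, p in zip(tip, predicted)]  # observed minus predicted

# Text "plot": one row per observation, sorted by predicted tip.
print("predicted  residual")
for p, e in sorted(zip(predicted, residuals)):
    bar = "#" * round(abs(e) * 4)  # bar length proportional to |residual|
    sign = "+" if e >= 0 else "-"
    print(f"{p:9.2f}  {e:+8.3f}  {sign}{bar}")

print(f"sum of residuals: {sum(residuals):.10f}")
```

With an intercept in the model, least-squares residuals always sum to zero (up to rounding), so the plot is read for patterns such as curvature or a widening spread rather than for its average level.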
[Figure: scatter plot of the residuals, titled "Residuals".]
Does this linear regression model do a good job of explaining the variance in the tip based
on the total bill amount (the explanatory power of THIS particular regression model)? There
are more advanced diagnostic techniques, but for our purposes in this class, we can look at
the values of Adjusted R^2 and R^2 (the coefficient of determination).
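Both values can be traced back to the regression output. A worked sketch follows, using the standard definitions; the symbols n (number of observations) and k (number of predictors) are notation assumed here, with n = 10 and k = 1 for this problem.

```latex
R^2 = \frac{SS_{\text{Regression}}}{SS_{\text{Total}}}
    = \frac{698.6406449}{745.54116} \approx 0.937092

R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
                 = 1 - (1 - 0.937091984)\,\frac{9}{8} \approx 0.929228
```

Adjusted R^2 is always slightly lower than R^2 because it penalizes the model for the number of predictors relative to the sample size.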
Use the data below to predict a student’s final GPA upon graduation from college given the
student’s SAT score upon high school graduation. Enter the data in Excel and then use the
Descriptive Statistics and Linear Regression functions, as well as appropriate scatter plots to
answer the following questions:
Observation   SAT score   Final GPA
 1             980        3.0
 2            1450        4.0
 3            1120        3.2
 4            1340        3.8
 5            1500        3.4
 6            1070        3.5
 7            1150        2.7
 8             960        2.9
 9            1030        2.8
10            1090        2.5
11            1150        3.0
12            1300        3.9
13            1270        3.2
14            1210        2.5
15             950        2.6
1) What is the mean SAT score?
2) What is the mean Final GPA?
3) Write out the regression equation model.
4) What is your prediction for final GPA given an SAT score of 1100? Show your work!
5) What is the correlation coefficient (r)? Interpret the strength and direction.
6) Write out the coefficient of determination R^2 and explain what it means in this case
(interpret the value in the context of this problem).
7) Does this regression model “fit” these data? Justify your answer.
8) Does this regression model have good explanatory power?