Simple Linear Regression Modeling Using Excel
BSAD 30, Spring 2017 (Thursday, 4/28)

You need to have the Data Analysis Toolpak plug-in installed before you can do this exercise! Detailed instructions for installing the Data Analysis Toolpak are provided on the course schedule.

Simple Linear Regression Modeling

Here our model is a mathematical relationship between the factors involved in a problem. The purpose of regression is prediction. This is different from a linear program, where the purpose is to obtain an EXACT solution (the quantities of the different decision variables to produce) to a particular problem, given certain constraints.

For example, in simple linear regression we may want to know something quantitative about the relationship between:
1) the number of building permits and the interest rate,
2) the number of winter jackets sold and the temperature,
3) expected fuel efficiency in miles per gallon (MPG) and vehicle weight, etc.

First Order Regression Analysis (or simple linear regression) involves studying the relationship between two variables.
We call one of the variables X = the independent or predictor variable.
We call the other variable Y = the dependent or response variable.
We use values of X to predict Y, so Y depends on X.

Following example 1) above, say that we want to predict the number of building permits issued (Y) based on the interest rate (X).
Following example 2) above, we want to predict the number of winter jackets sold (Y) based on the temperature (X).

To begin a study of the relationship between two variables (X and Y), we need a random sample of size n, and we need to record the values of X and Y for each observation. These are called bivariate data because each observation consists of two values, one for X and one for Y. What are data with more than two variables per observation called?

The general equation of a line: Y = y-intercept + slope(X)
y-intercept: where does the line cross the y-axis? What is Y when X is zero?
Deterministic Model

We take a bivariate sample from the population (a pair of observations associated with each sampling point) and estimate the equation for the regression line using the sample data.

ŷ = β̂0 + β̂1 x

x = observed value of the independent variable
ŷ = predicted value of the dependent variable using the observed x
β̂0 = estimated y-intercept
β̂1 = estimated slope

Using the Regression function in MS Excel

The examples are based on: http://www.stat.wmich.edu/wang/216/notes/LinearRegression_handout.pdf

Assume we collect the following bivariate data from a restaurant describing the relationship between X = the total dollar value of the bill and Y = the observed customer tip:

Customer (observation)   Bill (total $)   Tip ($)
1                        98.84            15.00
2                        33.46            3.77
3                        63.60            9.81
4                        50.68            9.00
5                        107.34           17.33
6                        200.54           35.00
7                        57.32            12.56
8                        25.10            4.20
9                        76.99            10.55
10                       89.04            20.00

1) Identify the independent variable (x) and the dependent variable (y). Using our observed data above, we want to predict the dollar value of a tip using the total dollar value of the bill.

2) Enter the data above into Excel and report basic summary statistics using Descriptive Statistics from the Data Analysis tools. Bill Total is the independent variable (X) and Tip Total is the dependent variable (Y).

Statistic            Bill Total    Tip Total
Mean                 80.291        13.722
Standard Error       15.827274     2.87815751
Median               70.295        11.555
Mode                 #N/A          #N/A
Standard Deviation   50.0502349    9.1015332
Sample Variance      2505.02601    82.8379067
Kurtosis             3.48494231    2.83528868
Skewness             1.60851319    1.4520748
Range                175.44        31.23
Minimum              25.1          3.77
Maximum              200.54        35
Sum                  802.91        137.22
Count                10            10

3) Create a scatter plot in Excel. Highlight the Bill Total ($) and Tip ($) data from the table above, go to the Insert tab on the main menu bar, and choose "Scatter".
This chart allows us to visualize the two-dimensional relationship between the variables. What do the data look like? Are there any outliers?

[Figure: scatter plot "Bill Total ($) versus Tip ($)"; x-axis: Bill Total, y-axis: Tip]

4) Describe the direction and the strength of the correlation between the bill total and the tip. Highlight the Bill Total ($) and Tip ($) data, go to the Data tab on the main menu bar, open Data Analysis, and choose "Correlation".

                 Bill Total ($)   Tip ($)
Bill Total ($)   1
Tip ($)          0.968035         1

The correlation (the interdependence between two variables) between Bill Total and Tip is 0.968. The correlation coefficient is r, where -1 ≤ r ≤ 1. A value of r = 0 means no linear correlation / interdependence, a value of r = 1 means perfect positive correlation, and a value of r = -1 means perfect negative correlation. A value of 0.968 means that these two variables are highly correlated / interdependent! There is definitely a strong relationship between the Bill Total and the Tip amount.

General rules of thumb, where |r| is the absolute value of r:
A perfect correlation: |r| = 1
A strong correlation: |r| ≥ 0.7
A moderate correlation: 0.5 ≤ |r| < 0.7
A weak correlation: 0.3 ≤ |r| < 0.5
No correlation: r = 0

A positive correlation means that higher valued X's (in this case the Bill Total) tend to go along with higher valued Y's (in this case the Tip Total). A negative correlation would mean that higher valued X's tend to go along with lower valued Y's.

NOTE OF CAUTION!!!! Correlation does not imply causality. One of the most common mistakes I observed while working as an analyst involved confusing correlation with causality. A correlation by itself gives us no statistical or factual basis to say that X causes Y. It is, however, a statistically valid statement to say that X and Y are highly correlated or highly related.
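For anyone who wants to check the Excel correlation output by hand, here is a minimal sketch in Python. It is not part of the Excel workflow; the function name pearson_r is just an illustrative choice, and the data are the ten restaurant observations from the table earlier in this handout. The result should agree with the 0.968 reported by Excel.

```python
import math

# Restaurant data from the handout: bill total ($) and tip ($) for 10 customers.
bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]


def pearson_r(x, y):
    """Pearson correlation coefficient r for two equal-length lists.

    pearson_r is a hypothetical helper name used only for this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(bill, tip)
print("r =", round(r, 3))
```

The sign of r gives the direction of the relationship and its absolute value gives the strength, which is what the rules of thumb above are applied to.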
5) Use the Regression tool in the Data Analysis tool pack to generate output from a regression model: choose Data Analysis from the Data menu bar, choose Regression, input the data ranges for the X and Y variables, choose your output range (an open space in your spreadsheet), and check the Residuals checkbox.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.968035115
R Square            0.937091984
Adjusted R Square   0.929228482
Standard Error      2.421273298
Observations        10

ANOVA
             df   SS            MS            F             Significance F
Regression   1    698.6406449   698.6406449   119.1698034   4.39454E-06
Residual     8    46.90051506   5.862564382
Total        9    745.54116

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept        -0.41204       1.50420          -0.27393   0.79108   -3.88073    3.05665     -3.88073      3.05665
Bill Total ($)   0.17604        0.01613          10.91649   0.00000   0.13885     0.21322     0.13885       0.21322

RESIDUAL OUTPUT

Observation   Predicted Tip ($)   Residuals
1             16.98727716         -1.987277162
2             5.478094951         -1.708094951
3             10.78379626         -0.973796264
4             8.509421315         0.490578685
5             18.48357647         -1.15357647
6             34.8900583          0.1099417
7             9.678295127         2.881704873
8             4.006440572         0.193559428
9             13.14090776         -2.590907763
10            15.26213208         4.737867923

6) Write out the regression equation following the basic equation form from the first page. You should be able to clearly identify the β̂0 and β̂1 values in the regression output:

7) Use the estimated regression equation from 6) above to predict the average tip for a customer with a bill totaling $70.50.

8) The coefficient of determination R^2 measures how much of the variation in the dependent variable (the y-variable) is explained by the regression model. We can express this as a percent. Interpret the coefficient of determination (R^2) for this problem.

9) What is the difference between linear extrapolation and prediction?

10) Is it appropriate to use our regression equation to estimate the tip associated with a bill total, X < 25.10 or X > 200.54?
(outside of our observed data range)

11) The question of whether or not it is appropriate to use a linear regression model to predict a particular outcome is different from the question of how "good" the model is (the explanatory power of a particular linear regression model).

Asking whether a linear regression model is a good "fit" for the data addresses the question of whether or not it is appropriate to use linear regression. We can use a scatter plot of the residuals to test this. A residual plot with no pattern in the residuals (the error terms) indicates a good fit. A residual plot with some type of identifiable pattern indicates a poor fit (an inappropriate model). NOTE: This is a separate issue from whether the regression model has good explanatory power!

[Figure: scatter plot of the residuals, titled "Residuals"]

Does this linear regression model do a good job of explaining the variance in the tip based on the total bill amount (the explanatory power of THIS particular regression model)? There are more advanced diagnostic techniques, but for our purposes in this class, we can look at the values of Adjusted R^2 and R^2 (the coefficient of determination).

Use the data below to predict a student's final GPA upon graduation from college given the student's SAT score upon high school graduation. Enter the data in Excel and then use the Descriptive Statistics and Linear Regression functions, as well as appropriate scatter plots, to answer the following questions:

Observation   SAT score   Final GPA
1             980         3.0
2             1450        4.0
3             1120        3.2
4             1340        3.8
5             1500        3.4
6             1070        3.5
7             1150        2.7
8             960         2.9
9             1030        2.8
10            1090        2.5
11            1150        3.0
12            1300        3.9
13            1270        3.2
14            1210        2.5
15            950         2.6

1) What is the mean SAT score?
2) What is the mean Final GPA?
3) Write out the regression equation model.
4) What is your prediction for final GPA given an SAT score of 1100? Show your work!
5) What is the correlation coefficient (r)? Interpret the strength and direction.
6) Write out the coefficient of determination R^2 and explain what it means in this case (interpret the value in the context of this problem).
7) Does this regression model "fit" these data? Justify your answer.
8) Does this regression model have good explanatory power?
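If you want to double-check Excel's regression output, the intercept, slope, residuals and R^2 can all be computed from the usual least-squares formulas. The sketch below is one possible way to do that in Python; the helper names (fit_line, residuals, r_squared) are illustrative and not from any particular library. It is run on the restaurant tip data from earlier in this handout, so the printed values can be compared against the SUMMARY OUTPUT and RESIDUAL OUTPUT tables. The same functions can be pointed at any other pair of columns.

```python
# Restaurant data from the handout: bill total ($) and tip ($) for 10 customers.
bill = [98.84, 33.46, 63.60, 50.68, 107.34, 200.54, 57.32, 25.10, 76.99, 89.04]
tip = [15.00, 3.77, 9.81, 9.00, 17.33, 35.00, 12.56, 4.20, 10.55, 20.00]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed y minus predicted y for each observation."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def r_squared(y, resid):
    """Share of the total variation in y that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_resid = sum(e ** 2 for e in resid)
    return 1 - ss_resid / ss_total


b0, b1 = fit_line(bill, tip)
resid = residuals(bill, tip, b0, b1)
r2 = r_squared(tip, resid)

print("intercept:", round(b0, 5))
print("slope:", round(b1, 5))
print("R^2:", round(r2, 4))
print("residuals:", [round(e, 3) for e in resid])
```

Plotting resid against the predicted values gives the residual plot used to judge fit, while R^2 speaks to explanatory power; keeping them as separate functions mirrors the distinction drawn in the handout.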