Regression Analysis Student Project
Panera’s Nutritional Information
Fall November 2012- January 2013
xxxxx xxxxxxx
Objective:
Panera is one of my favorite places to eat, and sandwiches are one of my staple foods. Thus, I decided that I
would perform a regression analysis on the nutritional data for the most popular Panera sandwiches, provided
on the Panera website. The goal of this analysis is to examine the relationship between total calories and
several nutritional values and to find the model with the best fit. The dependent variable is calories, and there
are nine independent variables. I will perform Ordinary Least Squares regression at a 95% confidence level.
Data:
The data, shown below, were obtained from http://www.panerabread.com/pdf/nutr-guide.pdf.
Panera has quite a large menu, so I chose only full sized sandwiches from the Café and Signature lists. There
are fourteen items total.
Equation:
For the analysis, I will use the following equation,
Y = A + B1X1 + B2X2 + B3X3 + B4X4 + B5X5 + B6X6 + B7X7 + B8X8 + B9X9
The dependent variable (Y) is the calories per sandwich. A is the intercept, B1 through B9 are the regression
coefficients, and the independent variables (the X’s) are listed below:
X1=Fat (g)
X2=Saturated Fat (g)
X3=Trans Fat (g)
X4=Cholesterol (mg)
X5=Sodium (mg)
X6=Carbs (g)
X7=Fiber (g)
X8=Sugars (g)
X9=Protein (g)
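As an illustration of what an Ordinary Least Squares fit of this equation computes, here is a minimal pure-Python sketch. It is not the method used in this project (the models below were fit with Excel); the function name `fit_ols` and the tiny data set are hypothetical, and the data are synthetic, not Panera values. The sketch solves the normal equations directly, which is adequate for a small example but less numerically robust than the routines in statistical software.

```python
def fit_ols(rows, y):
    """Return [A, B1, ..., Bk] minimizing the sum of squared residuals.

    rows: list of observations, each a list of k predictor values.
    y:    list of responses, one per observation.
    """
    # Design matrix with a leading 1 in each row for the intercept A.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Synthetic check (NOT Panera data): y is built as exactly 1 + 2*x1 + 3*x2,
# so the fitted coefficients should come back close to 1, 2 and 3.
demo_rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
demo_y = [1 + 2 * a + 3 * b for a, b in demo_rows]
demo_coefficients = fit_ols(demo_rows, demo_y)
print(demo_coefficients)
```

With the real data, `rows` would hold the nine nutritional values for each of the fourteen sandwiches and `y` the calorie counts.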
Models:
To estimate the parameters of the following models, I used the regression function in Excel’s Data Analysis
Add-In. First, the full model is evaluated, including all nine explanatory variables. Based on the results,
variables are then eliminated one by one, starting with the variable that has the highest P-value.
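The elimination procedure described above can be sketched as a short loop. This is only an outline under stated assumptions: `p_values_for` is a hypothetical callback standing in for whatever tool fits the regression (Excel, in this project) and reports a P-value per variable, and `stop_below` is an assumed cutoff, since the report stops when the remaining P-values are "very close to 0" rather than at a stated threshold.

```python
def backward_eliminate(variables, p_values_for, stop_below=None):
    """Yield the list of explanatory variables used in each successive model.

    variables:    names of the explanatory variables in the full model.
    p_values_for: hypothetical callback; given a list of names, it fits the
                  regression and returns a dict {name: p_value}.
    stop_below:   assumed cutoff; stop once every P-value is under it.
                  If None, keep eliminating until no variables remain.
    """
    current = list(variables)
    while current:
        yield list(current)
        p_values = p_values_for(current)
        # The variable with the highest P-value is the next one to drop.
        worst = max(current, key=lambda name: p_values[name])
        if stop_below is not None and p_values[worst] < stop_below:
            break
        current.remove(worst)
```

Each yielded list corresponds to one model in the sequence that follows: the full model first, then the model with the weakest variable removed, and so on.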
Model 1- The Full Model
Utilizing all nine explanatory variables in the regression yields the following results:
The initial regression on all nine independent variables produces this equation:
Y = 10.7756 + 8.8872X1 + .8393X2 + 6.8358X3 - .0950X4 + .0041X5 + 3.8823X6 + .3519X7 - .1586X8 + 3.7147X9
The R2 of this regression is .99984, and the Adjusted R2, which adjusts for the degrees of freedom, is .99947.
Since these values are close to 1, this model is a good fit to the data. The F ratio is 2740.3738, and the
Significance F is a very small number. These results also indicate that the regression equation produces a close
prediction of the actual data.
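For reference, the Adjusted R2 and the F ratio can both be recovered from R2, the number of observations n, and the number of explanatory variables k, using the standard textbook formulas. The sketch below assumes n = 14 (the fourteen sandwiches) and uses the rounded R2 reported above; the function names are my own.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """F = (R2 / k) / ((1 - R2) / (n - k - 1))."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n = 14  # sandwiches in the data set
k = 9   # explanatory variables in the full model
print(round(adjusted_r2(0.99984, n, k), 5))
print(round(f_ratio(0.99984, n, k), 1))
```

For the full model this gives an Adjusted R2 of about .99948 and an F ratio of about 2777, against the reported .99947 and 2740.3738. The gaps are consistent with R2 having been rounded to five decimals before being plugged in; the F ratio in particular is very sensitive to the last digits of R2 when R2 is this close to 1.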
However, it may still be possible to improve the model, so we look at the P-values. A small P-value indicates
that an explanatory variable contributes significantly to explaining the response variable, while a large
P-value suggests that it may not. Therefore, we remove the explanatory variable with the highest P-value and
rerun the regression. In this case, we rerun the regression without X8=Sugars, which has the highest P-value (.7683).
Model 2- Eight Explanatory Variables
Removing Sugars from the regression and utilizing the other eight explanatory variables yields the following
results:
The second regression on eight independent variables produces this equation:
Y = 10.2641 + 8.8783X1 + .8044X2 + 7.3577X3 - .0968X4 + .0043X5 + 3.8668X6 + .3000X7 + 3.7408X9
With this regression, the R2 is .99983, almost exactly the same as in the full model. The Adjusted R2 is .99957,
which is slightly larger than that of the full model. The standard error of Model 2 is also less than Model 1’s
standard error (3.48537 vs. 3.84923). Lastly, Model 2 has an F ratio of 3760.1911, which is larger than
Model 1’s. Together, these statistics indicate that Model 2 may be an even better fit to the data. Still, to
improve the model further, we again remove the explanatory variable with the highest P-value, Fiber (X7).
Model 3- Seven Explanatory Variables
Removing Fiber from Model 2 and rerunning the regression yields the following results:
The third regression on seven independent variables produces this equation:
Y = 10.8761 + 8.8876X1 + .7632X2 + 7.533X3 - .1016X4 + .0038X5 + 3.8823X6 + 3.7648X9
The R2 did not change from Model 2 to Model 3; it is still .99983. However, the Adjusted R2 increased to .99963,
the F ratio increased to 5051.3079, and the Standard Error decreased to 3.21475. Hence, we can conclude that
Model 3 is not only a good fit to the data, but an improvement on Models 1 and 2. We continue refining
the model by removing X5=Sodium because its P-value is still not close to 0.
Model 4- Six Explanatory Variables
Removing Sodium from Model 3 and rerunning the regression yields the following results:
The fourth regression on six independent variables produces this equation:
Y = 6.9503 + 8.9243X1 + .8463X2 + 5.9666X3 - .1393X4 + 3.9183X6 + 4.0114X9
The R2 barely decreased in this model and is now .99981. Again, the Adjusted R2 increased (to .99964), the F ratio
increased (to 6058.5481), and the Standard Error decreased (to 3.17054). Model 4 is still a precise fit, but we
rerun the regression again, removing X3=Trans Fat, which has the highest P-value (.0818).
Model 5- Five Explanatory Variables
Removing Trans Fat from Model 4 and rerunning the regression yields the following results:
The fifth regression on five independent variables produces this equation:
Y = 2.6641 + 8.9607X1 + .9732X2 - .1486X4 + 3.9137X6 + 4.1566X9
For Model 5, the R2 decreased a little to .99969, the Adjusted R2 decreased a little to .99950, and the F ratio
decreased to 5228.3807. The Standard Error increased to 3.73852. This model is less precise than Model 4,
according to these statistics. Still, we run another model without X4=Cholesterol since it has the highest
P-value of the explanatory variables left.
Model 6- Four Explanatory Variables
Removing Cholesterol from Model 5 and rerunning the regression yields the following results:
The sixth regression on four independent variables produces this equation:
Y = 2.3764 + 8.7752X1 + .8842X2 + 3.9926X6 + 3.8485X9
As with Model 5, Model 6 saw a decrease in R2 (.99944), a decrease in Adjusted R2 (.99920), a decrease in the
F ratio (4038.4995), and an increase in the Standard Error (4.75526). By Adjusted R2 and Standard Error, Model 6’s
goodness of fit is worse than that of all the previous models. We will run one last regression since there is one
explanatory variable left with a P-value that is not close to 0, X2=Saturated Fat.
Model 7- Three Explanatory Variables
Removing Saturated Fat from Model 6 and rerunning the regression yields the following results:
The seventh regression on three independent variables produces this equation:
Y = .2363 + 9.0779X1 + 3.9935X6 + 3.9214X9
Once again, R2 has decreased (.99882), Adjusted R2 has decreased (.99847), the F ratio has decreased
(2823.2085), and Standard Error has increased (6.56518). These changes are large enough to say that
Model 7 is most likely not the best fit to the data. The P-values for the remaining independent variables are all
very close to 0, so no further variables will be removed.
Correlation:
According to Excel’s Correlation Analysis, Fat has the highest correlation with Calories, at .88181, and Sugars
has the lowest correlation with Calories, at .02829. The correlation between Fat and Calories is also the highest
correlation between any pair of variables in the data set.
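For readers who want to reproduce these figures outside Excel, the correlation reported here is the Pearson correlation coefficient. A minimal sketch follows; the function name `pearson` is my own, and the inputs would be two equal-length columns of the data, such as Fat and Calories across the fourteen sandwiches.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relationship between the two columns, and a value near 0 indicates little linear relationship.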
Conclusion:
Below is a summary of each model’s results:
Based on this table, I conclude that Model 4, with six explanatory variables, is the best fit to the data. This
model has the largest Adjusted R2 and F ratio. It also has the smallest Standard Error. Although Model 4’s R2 is
not the highest of the group, it is very close to 1 and within .00003 of the full model’s R2 (.99984). Also, the
Adjusted R2 is a better basis for comparison since it adjusts for degrees of freedom. Additionally, the P-values of
the explanatory variables in Model 4 are all fairly close to 0. For all these reasons, Model 4 is my choice for
the most precise model. Its regression equation is:
Y = 6.9503 + 8.9243X1 + .8463X2 + 5.9666X3 - .1393X4 + 3.9183X6 + 4.0114X9
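To show how this equation would be applied, the sketch below evaluates it for one sandwich. The coefficients are copied from the Model 4 equation; the function and parameter names are my own, and the nutritional values in the example call are hypothetical, not taken from the Panera guide.

```python
def predict_calories(fat_g, saturated_fat_g, trans_fat_g,
                     cholesterol_mg, carbs_g, protein_g):
    """Predicted calories from the Model 4 regression equation."""
    return (
        6.9503                      # intercept A
        + 8.9243 * fat_g            # X1
        + 0.8463 * saturated_fat_g  # X2
        + 5.9666 * trans_fat_g      # X3
        - 0.1393 * cholesterol_mg   # X4
        + 3.9183 * carbs_g          # X6
        + 4.0114 * protein_g        # X9
    )


# Hypothetical sandwich, for illustration only.
example = predict_calories(fat_g=20, saturated_fat_g=5, trans_fat_g=0,
                           cholesterol_mg=50, carbs_g=60, protein_g=30)
print(round(example, 1))
```

Because the model was fit on only fourteen full-sized sandwiches, predictions for items unlike those sandwiches should be treated with caution.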
Choosing Model 4 is saying that Fat, Saturated Fat, Trans Fat, Cholesterol, Carbs, and Protein are the main
drivers of a Panera sandwich’s calorie total.
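The comparison behind this conclusion can be checked mechanically. The sketch below collects the R2, Adjusted R2, F ratio, and Standard Error reported for each model in the sections above and picks the best model under each criterion; the `Model` tuple and its field names are my own.

```python
from collections import namedtuple

Model = namedtuple("Model", "number variables r2 adj_r2 f_ratio std_error")

# Values as reported in the Model 1 through Model 7 sections.
MODELS = [
    Model(1, 9, 0.99984, 0.99947, 2740.3738, 3.84923),
    Model(2, 8, 0.99983, 0.99957, 3760.1911, 3.48537),
    Model(3, 7, 0.99983, 0.99963, 5051.3079, 3.21475),
    Model(4, 6, 0.99981, 0.99964, 6058.5481, 3.17054),
    Model(5, 5, 0.99969, 0.99950, 5228.3807, 3.73852),
    Model(6, 4, 0.99944, 0.99920, 4038.4995, 4.75526),
    Model(7, 3, 0.99882, 0.99847, 2823.2085, 6.56518),
]

best_adj_r2 = max(MODELS, key=lambda m: m.adj_r2)   # larger is better
best_f = max(MODELS, key=lambda m: m.f_ratio)       # larger is better
best_se = min(MODELS, key=lambda m: m.std_error)    # smaller is better

print(best_adj_r2.number, best_f.number, best_se.number)  # prints: 4 4 4
```

All three criteria point to Model 4, which matches the choice made above.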