Stat 401 A – Lab 10

Goals: In this lab, we will learn tools for multiple regression. Specifically, we will see how to:
- draw scatterplot matrices and estimate correlation coefficients
- fit a multiple regression (more than one X variable)
- evaluate diagnostics
- calculate sequential (Type I) SS
- fit models with quadratic and interaction terms

We will use the grandfather clock data for all demonstrations. If you did the lab 9 self-assessment, you should already have the glclocks.txt data file. If not, download it from the labs or datasets page on the class web site. We will focus on the AGE, NUMBIDS, and PRICE variables. The AGE-BID variable contains the product of the age value and the number of bids value for each clock. We won't use AGE-BID for most of this lab.

Correlation coefficients and scatterplot matrices:

1. From the menu in the home window, select Analyze / Multivariate Methods / Multivariate. (Multivariate Methods brings up a menu of methods, so you need to make a second choice.)

2. Select AGE and PRICE and put them in the Y, Columns box, then click OK. You should see:

Correlations
        AGE     PRICE
AGE     1.0000  0.7296
PRICE   0.7296  1.0000

Scatterplot Matrix

3. The table at the top gives the correlation coefficient for all pairs of variables. The diagonals are always 1.00 (can you explain why?). The 0.7296 occurs twice because the correlation of age and price is the same as the correlation of price and age.

4. The plot below is called a scatterplot matrix. It is a very convenient way of plotting relationships among multiple variables. Variable names are along the diagonal. For all plots in the first ROW, the Y variable is AGE; for all plots in the first COLUMN, the X variable is AGE. So the two plots shown here are flipped versions of the same plot. The red ellipses are called density ellipses.
They contain 95% of the observations and are computed under the assumption that the two variables have a bivariate normal distribution (a two-variable generalization of the normal distributions we've been using all semester). You expect some points outside the ellipse, especially with a large data set (e.g., with 200 points, the ellipse is expected to contain 95% of 200 = 190 points, which puts 10 points outside the ellipse). I'm not emphasizing these ellipses, but some folks find them helpful for identifying unusual observations.

5. The power of the scatterplot matrix is better shown when more than 2 variables are plotted. Here's the matrix for all three variables: AGE, NUMBIDS, and PRICE. Now, three separate plots are shown compactly on the same page.

6. If you want a confidence interval for the population correlation or a test of population correlation = 0, click the red triangle by Multivariate and select Pairwise Correlations. Ask if you don't understand any of the results.

Fitting multiple linear regressions:

7. In the home window, select Analyze / Fit Model. That will bring up the following dialog box:

8. It should be clear how to specify the response variable: put the variable into the Y box.

9. The X variables are specified by putting them in the Construct Model Effects box. Select AGE and NUMBIDS (either one at a time, or together), then click Add to put the variables in the Model Effects box. I prefer to reduce the amount of output JMP gives me, so I click on Emphasis and change the default Effect Leverage to Minimal Report (this is optional). The dialog should now look like:

If you want to fit a regression, the X variables should be continuous (blue ramp) variables. This is what separates fitting an ANOVA model (X is nominal, red bars) from fitting a regression (X is continuous, blue ramp).

10. Fit the model by clicking the Run button.

11. The result window looks very similar to that from the Fit Y by X dialog. It should look like:

12.
You should know how to interpret the Root Mean Square Error number and the information in the Parameter Estimates box. Ask if you don't remember.

13. The big difference between Fit Model and Fit Y by X is the menus for additional information that are obtained by clicking on the red triangle by Response PRICE. Here's how to get results that you might want:

Confidence intervals on regression parameters: Click the red triangle, select Regression Reports / Show All Confidence Intervals. As with Fit Y by X, you can only get 95% intervals. If you want an interval with some other coverage, you have the hard pieces (estimate and standard error) to use in a hand calculation.

Predicted values for new observations: Click the red triangle, select Save Columns / Prediction Formula. The data table will have a new column, labelled Pred Formula. You get predictions for new observations by entering the X values in the data table. (Remember, if PRICE is predicted from AGE and NUMBIDS, you need to enter values for both variables.) Note: If you choose Predicted Values instead of Prediction Formula, you get the values for all observations in the data table, but predicted values are not computed when you enter new observations.

CI's or PI's for new observations: Click the red triangle, select Save Columns / Mean Confidence Interval Formula for CI's, or Save Columns / Indiv Confidence Interval Formula for PI's. Note: Again, there are two related options. The Formula versions compute values when you add new observations to the data table. The non-Formula versions compute values for what is in the table and nothing more.

Diagnostics: All are obtained using the menu under the red triangle by the Y variable (i.e., continuing the above list of additional information).

Plot of residuals vs. predicted values: Click the red triangle, select Row Diagnostics / Plot Residual by Predicted. You can also save predicted values and save residuals, then plot those two variables.
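If you are curious what JMP is computing when it saves predicted values and residuals, the same quantities come from an ordinary least-squares fit. Here is a minimal Python sketch; the AGE, NUMBIDS, and PRICE numbers below are made-up stand-ins, not the actual clock data:

```python
import numpy as np

# Made-up stand-in values for AGE, NUMBIDS, and PRICE (not the real clock data).
age     = np.array([127.0, 115.0, 187.0, 150.0, 156.0, 182.0])
numbids = np.array([13.0, 12.0, 8.0, 9.0, 6.0, 11.0])
price   = np.array([1235.0, 946.0, 1092.0, 1297.0, 854.0, 1979.0])

# Design matrix for PRICE ~ AGE + NUMBIDS, with an intercept column.
X = np.column_stack([np.ones_like(age), age, numbids])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

predicted = X @ beta           # the saved predicted values
residuals = price - predicted  # the saved residuals

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero, which is a quick sanity check on the fit.
print(abs(residuals.sum()) < 1e-6)
```

Plotting `residuals` against `predicted` reproduces the Plot Residual by Predicted diagnostic.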
Remember, the new variables are provided as new columns in the data set.

Normal quantile plots: Not shown automatically (unlike Fit Y by X). Here's how to get a plot: Click the red triangle, select Save Columns / Residuals, then Analyze / Distribution, looking at the residual column. Click the red triangle by the variable name (Residual PRICE, not Distributions) and click Normal Quantile Plot. The dots are the data; the straight line is what is expected if the errors have a normal distribution. The curved lines are approximate prediction intervals if the errors are normal. I haven't talked about computing these (and won't). You should be more concerned about lack of normality if the data fall outside the prediction intervals.

Sequential SS: Analyze / Fit Model reports the partial (Type III) SS for each term in the model by default. To get the sequential SS, fit the model, click the red triangle by PRICE, and select the Estimates menu. Select Sequential Tests. The sequential F tests are added to the Fit Model output; the box of output is labelled Sequential (Type I) Tests.

Interaction or quadratic terms in a multiple regression model: There are two ways to fit a model with an interaction or quadratic term:
- create a variable with the product or square of the appropriate X variable(s) by hand
- have JMP create that product or square "on the fly"
Both give the same results, so long as "on the fly" is done with some care. Here's how to do each:

14. Creating a new X variable: This should be familiar by now. Select the data set to make it the active window. Right-click on a blank column and select Column Info. Type in an appropriate column name, then select Column Properties / Edit Formula. Use the buttons to write the desired formula, AGE x NUMBIDS. Then click the OK buttons. You will get a new column with the product. You should see that the values are identical to those in the pre-existing AGE-BID column.
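The product and square columns themselves are just elementwise arithmetic. A quick Python sketch, with made-up AGE and NUMBIDS values standing in for the clock data:

```python
import numpy as np

# Made-up AGE and NUMBIDS values, standing in for the clock data.
age     = np.array([127.0, 115.0, 187.0])
numbids = np.array([13.0, 12.0, 8.0])

age_x_numbids = age * numbids  # the AGE x NUMBIDS product column (like AGE-BID)
age_squared   = age ** 2       # the quadratic AGE*AGE column

print(age_x_numbids)  # [1651. 1380. 1496.]
print(age_squared)    # [16129. 13225. 34969.]
```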
Use that product variable in a multiple regression model in the obvious way: just add it to the Construct Model Effects box in the Analyze / Fit Model dialog. Quadratic variables (e.g., AGE^2) are constructed as AGE*AGE or by using the power button (x^y), which gives you squares (y = 2) by default.

15. Creating an interaction or quadratic term "on the fly": The Fit Model dialog allows you to use "created" variables in a regression model without creating a new named variable. This is done with the Cross button in the Analyze / Fit Model dialog. Set up the linear effects in the model, as before, by putting PRICE in the Y variable box and Adding AGE and NUMBIDS to the Construct Model Effects box. To include an interaction term, select both AGE and NUMBIDS in the Select Columns box, then left-click Cross. An AGE*NUMBIDS term will be added to the Construct Model Effects box, which will look like:

To construct a quadratic term, select AGE in the Select Columns box and AGE in the Construct Model Effects box, then click Cross. The Model Specification dialog with the two terms selected and the result included as a model term looks like:

The AGE*AGE term is the square of AGE. If you also wanted NUMBIDS^2, you would repeat this with NUMBIDS.

IMPORTANT NOTE: To recreate class results (and to match results from adding variables), you need to do one more thing anytime you create variables "on the fly": Click the red triangle by Model Specification and uncheck Center Polynomials. We will talk in lecture later about centering polynomials. For now, it is a distraction. If the Parameter Estimates box includes terms like (AGE-144.938)*(AGE-144.938), you didn't turn off polynomial centering. Your results will not match mine. Recreate and rerun the model with Center Polynomials unchecked.

Model comparison F test for two nested models: JMP gives you the "all variables" vs. "only the intercept" model comparison automatically (the Analysis of Variance table in the JMP output).
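That automatic comparison is the overall F test: it compares the error SS of the fitted model to the error SS of the intercept-only model. As a sketch of the arithmetic (generic made-up data, not the clocks; `overall_f` is a name I've invented for illustration):

```python
import numpy as np

def overall_f(y, X):
    """F statistic for the 'all variables' vs 'intercept only' comparison.
    X must include an intercept column; k counts all fitted parameters."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)   # error SS for the full model
    sst = np.sum((y - y.mean()) ** 2)   # error SS for the intercept-only model
    return ((sst - sse) / (k - 1)) / (sse / (n - k))

# Generic example with two X variables (made-up numbers).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=20), rng.normal(size=20)
y = 3 + 2 * x1 - x2 + rng.normal(size=20)
X = np.column_stack([np.ones(20), x1, x2])
f_stat = overall_f(y, X)
# The intercept-only model is nested in the full model, so SST >= SSE
# and the F statistic is non-negative.
print(f_stat > 0)
```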
Other model comparisons can be obtained in either of two ways.

16. Fit each model separately (two runs of Analyze / Fit Model). Look at the Analysis of Variance result for each model and extract the Error DF and Sums of Squares. Calculate the F statistic by hand. The p-value has to be obtained from printed tables.

17. JMP can test arbitrarily complex null hypotheses about parameters. This includes the null hypotheses that correspond to model comparisons. My example will test whether both quadratic terms are needed in the model:

Price = β0 + β1 Age + β2 Numbids + β3 Age×Numbids + β4 Age^2 + β5 Numbids^2

i.e., the null hypothesis that β4 = 0 and β5 = 0.

Fit the full model (the 6-parameter model), then click the red triangle by Response PRICE and select Estimates / Custom Test. You should get a dialog box looking like:

This allows you to specify each component of the null hypothesis (here, β4 = 0 and β5 = 0). To see how to specify this to JMP, remember that the null hypothesis we want to test is exactly the same as: 1 × β4 = 0 and 1 × β5 = 0. There are two pieces to this null hypothesis. The first concerns β4; the second concerns β5. We have to specify each piece. Each piece is specified to JMP by entering the coefficient (1) adjacent to the appropriate parameter and the resulting value (0) as the result. Click on the number next to each parameter and enter the desired value. We only have to change the AGE*AGE coefficient because the default result (the value by the =) is 0. The first piece, β4 = 0, is the following in the Custom Test dialog:

Since we have two pieces to the null hypothesis, click Add Column to get a second column in which we can enter the second piece. When done, the Custom Test dialog should look like:

Click Done to run the test. The output is on the next page. The two columns are information about each piece separately. In this case (since the coefficients are 1 for each piece), that is the same information available from the Parameter Estimates box.
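The hand calculation in step 16 (and, equivalently, the custom test in step 17) is the usual nested-model F statistic. A Python sketch, using made-up data in place of the clocks and invented helper names (`sse_and_df`, `nested_f`):

```python
import numpy as np

def sse_and_df(y, X):
    """Error sum of squares and error df for a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), len(y) - X.shape[1]

def nested_f(y, X_reduced, X_full):
    """F = [(SSE_red - SSE_full) / (df_red - df_full)] / [SSE_full / df_full]."""
    sse_r, df_r = sse_and_df(y, X_reduced)
    sse_f, df_f = sse_and_df(y, X_full)
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

# Made-up age/numbids/price values; the full model adds the two quadratic terms.
rng = np.random.default_rng(1)
age = rng.uniform(100, 200, 30)
numbids = rng.integers(5, 16, 30).astype(float)
price = 500 + 5 * age + 80 * numbids + rng.normal(0, 50, 30)

X_red  = np.column_stack([np.ones(30), age, numbids, age * numbids])
X_full = np.column_stack([X_red, age ** 2, numbids ** 2])
f_quad = nested_f(price, X_red, X_full)  # numerator df = 2, one per quadratic term
# The reduced model is nested in the full model, so the F statistic is >= 0.
print(f_quad >= 0)
```

The numerator df here is 2, matching check (1) in the note of caution below: one df per equal sign in the null hypothesis.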
The output we want is in the box at the bottom of the window. Sum of Squares is the change in SS Error between the two models; Numerator DF is the change in DF Error between the two models. JMP goes directly to the F statistic and gives you the p-value. Here, my conclusion would be no evidence of a quadratic relationship for either AGE or NUMBIDS. Practically, I would then omit those two variables from my model.

Note of caution: JMP does exactly what you tell it to do. It can't read your mind. If you put a 1 in the wrong place, you will get the wrong results because JMP is testing the wrong hypothesis. In particular, it is easy to put two 1's in the same column instead of in two different columns. If you put two 1's in the same column, you are asking JMP to test the hypothesis 1 × β4 + 1 × β5 = 0. Very different! If you aren't sure that you've specified your test correctly, I suggest two checks:
1) Does the test have the correct d.f.? If there are two pieces in your null hypothesis, the test should have numerator df = 2. Alternatively, the numerator df should equal the number of equal signs in your null hypothesis.
2) Does the test have the correct SS? Check by fitting the two models you want (so the models being compared are clear), then hand-computing the change in SS. If that's correct, the rest of the output is almost certainly correct.

Variance inflation factors: These are not shown by default but can be requested from JMP. VIFs are characteristics of the parameters in the model, so they are accessed through the Parameter Estimates box in the Fit Model results.

18. Load the clocks data and fit a multiple regression model using the Fit Model dialog.

19. Right-click inside the Parameter Estimates box of results (yes, the JMP user interface would be more consistent if there were a red triangle, but there isn't). Mouse over Columns to bring up the menu of items shown in the Parameter Estimates box (see figure below).
You see that the items shown in the table are already checked (except for ~Bias, which is only relevant for some models; where relevant, it is shown by default). Click on VIF to add a column with the VIF values for each parameter to the results.

Standardized residuals and Cook's Distance:

20. After fitting the model, click the red triangle by Response PRICE and select Save Columns. The list that appears includes all sorts of statistics that could be computed for each observation. Note: the box below shows results for a model that includes the AGE-BID variable. The menus and procedure to get diagnostics are the same for any model. What I called standardized residuals are called Studentized residuals by JMP (6th item from the top of the list). Cook's Distance is the item labelled Cook's D Influence (5th from the bottom).

21. After you select your choice, a new column will be added to the data set. The columns are labelled with the quantity you requested and the response variable. That way, if you fit models to more than one response variable (or a transformation of the response variable), you can match column to model.

22. Although you could look through the lists of numbers to find interesting values, it is easy to miss a number in a long list. The most common way to scan a long list is to plot it: Click in the data table to make that window active. Select Graph / Overlay Plot. Select the variable to be plotted (e.g., Studentized Resid PRICE) and put that in the Y box. You have two options for the X variable: If you have an informative number for each observation (e.g., well number), you can select that variable and put it in the X box. If you don't, leave the X box blank. The row number (i.e., observation number) will be used as the X coordinate. There is no id variable for the clocks, so I left X blank. Click OK and you get the plot.

23. See below for how to identify interesting points in this plot.

24.
Another option is to sort the list of numbers so the interesting values (unusually large or unusually small) are easy to find. Click on the data window to make it active and right-click on the column to sort on (e.g., Cook's D Influence). Select Sort from the menu, then select either Ascending or Descending. Descending puts the largest values first; Ascending puts the smallest values first. You may get a warning from JMP about 'Can't reorder observations'. That's because there is a plot, like the plot of studentized residuals against row number, that would be messed up by reordering (and hence renumbering) observations. You can ignore this because JMP will open a second data set window with the observations in sorted order.

Identifying interesting observations: JMP implements linked graphics: when you select a point in one window, it is highlighted in all windows. This makes it easy to identify the interesting observations in a plot.

1. In the plot of studentized residuals, there is an observation with a studentized residual larger than 2. Which observation is this?

2. There are two ways to find out: a) Click in the plot window to activate it, then move the pointer over the point of interest. The row number should appear. b) Click in the plot window to activate it, then select the point of interest (left-click near the point). Look in the data window and scroll down until you find the selected observation: #16.

3. You can clear the selection by clicking somewhere far from any observation in the plot window.

4. You can select multiple points by holding down the left mouse button while selecting, or by shift-clicking multiple points.

5. You can work in the other direction: if you select rows in the data window, they are highlighted in the plot.

6. If you have multiple plots open, points are highlighted in all windows. To demonstrate this, select Graph / Overlay Plot and plot Y = PRICE vs. X = AGE, then select Graph / Overlay Plot and plot Y = NUMBIDS vs. X = AGE.
Clicking on a point or points in one graph highlights them in all graphs.

Some potentially useful JMP Fit Model shortcuts:

1. By default, the Model Specification dialog box closes when you run a model. You can keep it around by checking the "Keep dialog open" box below the Run button. This is very helpful when you plan to fit various models to the same data set.

2. If you decide to transform a variable (e.g., after looking at the residual vs. predicted value plot), you can do that "on the fly" in the Model Specification dialog. Select the variable to be transformed (Y or one or more of the X variables), click the red triangle by "Transform" and select the transformation.

3. If you have a lot of variables in the model and you want to create all (or some) pairs of interaction variables, JMP can do this for you. Select the desired X variables in the Select Columns box, then left-click on Macros. Choose Factorial to Degree from the drop-down menu. The default degree is 2 (box below Macros). This puts all variables and all products of variables into the Model Effects box. If you want all variables and some interactions, select all the variables and Add them to the Construct Model Effects box. Then, select the variables that participate in interactions and click Macros / Factorial to Degree. The desired interactions are added to the model box. (Technically, the selected variables are added a second time, then removed because they are redundant.) Note: This does not create quadratic terms, just the products.

4. If you want JMP to create a bunch of quadratic terms for you, select the desired variables (from the Select Columns box), then click Macros / Polynomial to Degree. JMP adds squared terms for each selected variable to the Model Effects box. Note: This does not create products, just the quadratic terms.

5. If you want both quadratics and products, i.e., a full quadratic model, select the desired variables and click Macros / Response Surface.
All products and quadratics are added to the Model Effects box. The desired variables change to NAME &RS, which turns on additional results options that are usually wanted for optimization (remember the mention of response surface models in lecture). Ask me if you want to know more (completely optional!).

6. If you want to delete a term from the Model Effects box, select it, then click Remove.
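To summarize what the Response Surface macro assembles, here is a sketch of the full quadratic model's columns in Python (illustrative only; the function name and the AGE/NUMBIDS values are made up):

```python
import numpy as np

def full_quadratic_columns(x1, x2):
    """Columns for a full quadratic model in two X variables:
    intercept, x1, x2, x1*x2 (interaction), x1^2, x2^2 (quadratics)."""
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1 * x2, x1 ** 2, x2 ** 2])

age     = np.array([127.0, 115.0, 187.0])
numbids = np.array([13.0, 12.0, 8.0])
X = full_quadratic_columns(age, numbids)
print(X.shape)  # (3, 6): one column per term in the full quadratic model
```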