Stat 401 A – Lab 10

Goals: In this lab, we will learn tools for multiple regression. Specifically, we will see how to:
- draw scatterplot matrices and estimate correlation coefficients
- fit a multiple regression (more than one X variable)
- evaluate diagnostics
- calculate sequential (Type I) SS
- fit models with quadratic and interaction terms

We will use the grandfather clock data for all demonstrations. If you did the lab 9 self-assessment, you should already have the glclocks.txt data file. If not, download it from the labs or datasets page on the class web site. We will focus on the AGE, NUMBIDS, and PRICE variables. The AGE-BID variable contains the product of the age value and the number of bids value for each clock. We won't use AGE-BID for most of this lab.

Correlation coefficients and scatterplot matrices:

1. From the menu in the home window, select Analyze / Multivariate Methods / Multivariate. (Multivariate Methods brings up a menu of methods, so you need to make a second choice.)

2. Select AGE and PRICE and put them in the Y, Columns box, then click OK. You should see:

Correlations
        AGE     PRICE
AGE     1.0000  0.7296
PRICE   0.7296  1.0000

Scatterplot Matrix

3. The table at the top gives the correlation coefficient for all pairs of variables. The diagonals are always 1.00 (can you explain why?). The 0.7296 occurs twice because the correlation of age and price is the same as the correlation of price and age.

4. The plot below is called a scatterplot matrix. It is a very convenient way of plotting relationships among multiple variables. Variable names are along the diagonal. For all plots in the first ROW, the Y variable is AGE; for all plots in the first COLUMN, the X variable is AGE. So the two plots shown here are flipped versions of the same plot. The red ellipses are called density ellipses.
They contain 95% of the observations and are computed under the assumption that the two variables have a bivariate normal distribution (a two-variable generalization of the normal distributions we've been using all semester). You expect some points outside the ellipse, especially with a large data set (e.g., with 200 points, the ellipse is expected to contain 95% of 200 = 190 points, which puts 10 points outside the ellipse). I'm not emphasizing these ellipses, but some folks find them helpful for identifying unusual observations.

5. The power of the scatterplot matrix is better shown when more than 2 variables are plotted. Here's the matrix for all three variables: AGE, NUMBIDS, and PRICE. Now, three separate plots are shown compactly on the same page.

6. If you want a confidence interval for the population correlation or a test of population correlation = 0, click the red triangle by Multivariate and select Pairwise Correlations. Ask if you don't understand any of the results.

Fitting multiple linear regressions:

7. In the home window, select Analyze / Fit Model. That will bring up the following dialog box:

8. It should be clear how to specify the response variable: put the variable into the Y box.

9. The X variables are specified by putting them in the Construct Model Effects box. Select AGE and NUMBIDS (either one at a time, or together), then click Add to put the variables in the Model Effects box. I prefer to reduce the amount of output JMP gives me, so I click on Emphasis and change the default Effect Leverage to Minimal Report (this is optional). The dialog should now look like:

If you want to fit a regression, the X variables should be continuous (blue ramp) variables. This is what separates fitting an ANOVA model (X is nominal, red bars) from fitting a regression (X is continuous, blue ramp).

10. Fit the model by clicking the Run button.

11. The result window looks very similar to that from the Fit Y by X dialog. It should look like:

12.
You should know how to interpret the Root Mean Square Error number and the information in the Parameter Estimates box. Ask if you don't remember.

13. The big difference between Fit Model and Fit Y by X is the menus for additional information that are obtained by clicking on the red triangle by Response PRICE. Here's how to get results that you might want:

Confidence intervals on regression parameters: Click the red triangle, select Regression Reports / Show All Confidence Intervals. As with Fit Y by X, you can only get 95% intervals. If you want an interval with some other coverage, you have the hard pieces (estimate and standard error) to use in a hand calculation.

Predicted values for new observations: Click the red triangle, select Save Columns / Prediction Formula. The data table will have a new column, labelled Pred Formula. You get predictions for new observations by entering the X values in the data table. (Remember, if PRICE is predicted from AGE and NUMBIDS, you need to enter values for both variables.) Note: If you choose Predicted Values instead of Prediction Formula, you get the values for all observations in the data table, but predicted values are not computed when you enter new observations.

CI's or PI's for new observations: Click the red triangle, select Save Columns / Mean Confidence Interval Formula for CI's, or Save Columns / Indiv Confidence Interval Formula for PI's. Note: Again, there are two related options. The Formula versions compute values when you add new observations to the data table. The non-Formula versions compute values for what is in the table and nothing more.

Diagnostics: All are obtained using the menu under the red triangle by the Y variable (i.e., continuing the above list of additional information).

Plot of residuals vs. predicted values: Click the red triangle, select Row Diagnostics / Plot Residual by Predicted. You can also save predicted values and save residuals, then plot those two variables.
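If you are curious what JMP is computing when it saves predicted values and residuals, the same quantities come from an ordinary least-squares fit. Here is a minimal Python sketch; the AGE, NUMBIDS, and PRICE numbers below are made-up stand-ins, not the actual clock data:

```python
import numpy as np

# Made-up stand-in values for AGE, NUMBIDS, and PRICE (not the real clock data).
age     = np.array([127.0, 115.0, 187.0, 150.0, 156.0, 182.0])
numbids = np.array([13.0, 12.0, 8.0, 9.0, 6.0, 11.0])
price   = np.array([1235.0, 946.0, 1092.0, 1297.0, 854.0, 1979.0])

# Design matrix for PRICE ~ AGE + NUMBIDS, with an intercept column.
X = np.column_stack([np.ones_like(age), age, numbids])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

predicted = X @ beta           # the saved predicted values
residuals = price - predicted  # the saved residuals

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero, which is a quick sanity check on the fit.
print(abs(residuals.sum()) < 1e-6)
```

Plotting `residuals` against `predicted` reproduces the Plot Residual by Predicted diagnostic.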
Remember, the new variables are provided as new columns in the data set.

Normal quantile plots: Not shown automatically (unlike Fit Y by X). Here's how to get a plot: Click the red triangle, select Save Columns / Residuals, then Analyze / Distribution, looking at the residual column. Click the red triangle by the variable name (Residual PRICE, not Distributions) and click Normal Quantile Plot. The dots are the data; the straight line is what is expected if the errors have a normal distribution. The curved lines are approximate prediction intervals if the errors are normal. I haven't talked about computing these (and won't). You should be more concerned about lack of normality if the data fall outside the prediction intervals.

Sequential SS: Analyze / Fit Model reports the partial (Type III) SS for each term in the model by default. To get the sequential SS, fit the model, click the red triangle by PRICE, and select the Estimates menu. Select Sequential Tests. The sequential F tests are added to the Fit Model output; the box of output is labelled Sequential (Type I) Tests.

Interaction or quadratic terms in a multiple regression model: There are two ways to fit a model with an interaction or quadratic term:
- create a variable with the product or square of the appropriate X variable(s) by hand
- have JMP create that product or square "on the fly"
Both give the same results, so long as "on the fly" is done with some care. Here's how to do each:

14. Creating a new X variable: This should be familiar by now. Select the data set to make it the active window. Right-click on a blank column and select Column Info. Type in an appropriate column name, then select Column Properties / Edit Formula. Use the buttons to write the desired formula, AGE x NUMBIDS. Then click the OK buttons. You will get a new column with the product. You should see that the values are identical to those in the pre-existing AGE-BID column.
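The product and square columns themselves are just elementwise arithmetic. A quick Python sketch, with made-up AGE and NUMBIDS values standing in for the clock data:

```python
import numpy as np

# Made-up AGE and NUMBIDS values, standing in for the clock data.
age     = np.array([127.0, 115.0, 187.0])
numbids = np.array([13.0, 12.0, 8.0])

age_x_numbids = age * numbids  # the AGE x NUMBIDS product column (like AGE-BID)
age_squared   = age ** 2       # the quadratic AGE*AGE column

print(age_x_numbids)  # [1651. 1380. 1496.]
print(age_squared)    # [16129. 13225. 34969.]
```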
Use that product variable in a multiple regression model in the obvious way: just add it to the Construct Model Effects box in the Analyze / Fit Model dialog. Quadratic variables (e.g., AGE^2) are constructed as AGE*AGE or by using the power button (x^y), which gives you squares (y = 2) by default.

15. Creating an interaction or quadratic term "on the fly": The Fit Model dialog allows you to use "created" variables in a regression model without creating a new named variable. This is done with the Cross button in the Analyze / Fit Model dialog. Set up the linear effects in the model, as before, by putting PRICE in the Y variable box and Adding AGE and NUMBIDS to the Construct Model Effects box. To include an interaction term, select both AGE and NUMBIDS in the Select Columns box, then left-click Cross. An AGE*NUMBIDS term will be added to the Construct Model Effects box, which will look like:

To construct a quadratic term, select AGE in the Select Columns box and AGE in the Construct Model Effects box, then click Cross. The Model Specification dialog with the two terms selected and the result included as a model term looks like:

The AGE*AGE term is the square of AGE. If you also wanted NUMBIDS^2, you would repeat this with NUMBIDS.

IMPORTANT NOTE: To recreate class results (and to match results from adding variables), you need to do one more thing anytime you create variables "on the fly": Click the red triangle by Model Specification and uncheck Center Polynomials. We will talk in lecture later about centering polynomials. For now, it is a distraction. If the Parameter Estimates box includes terms like (AGE-144.938)*(AGE-144.938), you didn't turn off polynomial centering. Your results will not match mine. Recreate and rerun the model with Center Polynomials unchecked.

Model comparison F test for two nested models: JMP gives you the "all variables" vs. "only the intercept" model comparison automatically (the Analysis of Variance table in the JMP output).
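That automatic comparison is the overall F test: it compares the error SS of the fitted model to the error SS of the intercept-only model. As a sketch of the arithmetic (generic made-up data, not the clocks; `overall_f` is a name I've invented for illustration):

```python
import numpy as np

def overall_f(y, X):
    """F statistic for the 'all variables' vs 'intercept only' comparison.
    X must include an intercept column; k counts all fitted parameters."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)   # error SS for the full model
    sst = np.sum((y - y.mean()) ** 2)   # error SS for the intercept-only model
    return ((sst - sse) / (k - 1)) / (sse / (n - k))

# Generic example with two X variables (made-up numbers).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=20), rng.normal(size=20)
y = 3 + 2 * x1 - x2 + rng.normal(size=20)
X = np.column_stack([np.ones(20), x1, x2])
f_stat = overall_f(y, X)
# The intercept-only model is nested in the full model, so SST >= SSE
# and the F statistic is non-negative.
print(f_stat > 0)
```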
Other model comparisons can be obtained in either of two ways.

16. Fit each model separately (two runs of Analyze / Fit Model). Look at the Analysis of Variance result for each model and extract the Error DF and Sums of Squares. Calculate the F statistic by hand. The p-value has to be obtained from printed tables.

17. JMP can test arbitrarily complex null hypotheses about parameters. This includes the null hypotheses that correspond to model comparisons. My example will test whether both quadratic terms are needed in the model:

Price = β0 + β1 Age + β2 Numbids + β3 Age×Numbids + β4 Age^2 + β5 Numbids^2

i.e., the null hypothesis that β4 = 0 and β5 = 0.

Fit the full model (the 6-parameter model), then click the red triangle by Response PRICE and select Estimates / Custom Test. You should get a dialog box looking like:

This allows you to specify each component of the null hypothesis (here, β4 = 0 and β5 = 0). To see how to specify this to JMP, remember that the null hypothesis we want to test is exactly the same as: 1 × β4 = 0 and 1 × β5 = 0. There are two pieces to this null hypothesis. The first concerns β4; the second concerns β5. We have to specify each piece. Each piece is specified to JMP by entering the coefficient (1) adjacent to the appropriate parameter and the resulting value (0) as the result. Click on the number next to each parameter and enter the desired value. We only have to change the AGE*AGE coefficient because the default result (the value by the =) is 0. The first piece, β4 = 0, is the following in the Custom Test dialog:

Since we have two pieces to the null hypothesis, click Add Column to get a second column in which we can enter the second piece. When done, the Custom Test dialog should look like:

Click Done to run the test. The output is on the next page. The two columns are information about each piece separately. In this case (since the coefficients are 1 for each piece), that is the same information available from the Parameter Estimates box.
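The hand calculation in step 16 (and, equivalently, the custom test in step 17) is the usual nested-model F statistic. A Python sketch, using made-up data in place of the clocks and invented helper names (`sse_and_df`, `nested_f`):

```python
import numpy as np

def sse_and_df(y, X):
    """Error sum of squares and error df for a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), len(y) - X.shape[1]

def nested_f(y, X_reduced, X_full):
    """F = [(SSE_red - SSE_full) / (df_red - df_full)] / [SSE_full / df_full]."""
    sse_r, df_r = sse_and_df(y, X_reduced)
    sse_f, df_f = sse_and_df(y, X_full)
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

# Made-up age/numbids/price values; the full model adds the two quadratic terms.
rng = np.random.default_rng(1)
age = rng.uniform(100, 200, 30)
numbids = rng.integers(5, 16, 30).astype(float)
price = 500 + 5 * age + 80 * numbids + rng.normal(0, 50, 30)

X_red  = np.column_stack([np.ones(30), age, numbids, age * numbids])
X_full = np.column_stack([X_red, age ** 2, numbids ** 2])
f_quad = nested_f(price, X_red, X_full)  # numerator df = 2, one per quadratic term
# The reduced model is nested in the full model, so the F statistic is >= 0.
print(f_quad >= 0)
```

The numerator df here is 2, matching check (1) in the note of caution below: one df per equal sign in the null hypothesis.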
The output we want is in the box at the bottom of the window. Sum of Squares is the change in SS Error between the two models; Numerator DF is the change in DF Error between the two models. JMP goes directly to the F statistic and gives you the p-value. Here, my conclusion would be no evidence of a quadratic relationship for either AGE or NUMBIDS. Practically, I would then omit those two variables from my model.

Note of caution: JMP does exactly what you tell it to do. It can't read your mind. If you put a 1 in the wrong place, you will get the wrong results because JMP is testing the wrong hypothesis. In particular, it is easy to put two 1's in the same column instead of in two different columns. If you put two 1's in the same column, you are asking JMP to test the hypothesis 1 × β4 + 1 × β5 = 0. Very different! If you aren't sure that you've specified your test correctly, I suggest two checks:
1) Does the test have the correct d.f.? If there are two pieces in your null hypothesis, the test should have numerator df = 2. Alternatively, the numerator df should equal the number of equal signs in your null hypothesis.
2) Does the test have the correct SS? Check by fitting the two models you want (so the models being compared are clear), then hand-computing the change in SS. If that's correct, the rest of the output is almost certainly correct.

Variance inflation factors: These are not shown by default but can be requested from JMP. VIFs are characteristics of the parameters in the model, so they are accessed through the Parameter Estimates box in the Fit Model results.

18. Load the clocks data and fit a multiple regression model using the Fit Model dialog.

19. Right-click inside the Parameter Estimates box of results (yes, the JMP user interface would be more consistent if there were a red triangle, but there isn't). Mouse over Columns to bring up the menu of items shown in the Parameter Estimates box (see figure below).
You see that the items shown in the table are already checked (except for ~Bias, which is only relevant for some models; where relevant, it is shown by default). Click on VIF to add a column with the VIF values for each parameter to the results.

Standardized residuals and Cook's Distance:

20. After fitting the model, click the red triangle by Response PRICE and select Save Columns. The list that appears includes all sorts of statistics that could be computed for each observation. Note: the box below shows results for a model that includes the AGE-BID variable. The menus and procedure to get diagnostics are the same for any model. What I called standardized residuals are called Studentized residuals by JMP (6th item from the top of the list). Cook's Distance is the item labelled Cook's D Influence (5th from the bottom).

21. After you select your choice, a new column will be added to the data set. The columns are labelled with the quantity you requested and the response variable. That way, if you fit models to more than one response variable (or a transformation of the response variable), you can match column to model.

22. Although you could look through the lists of numbers to find interesting values, it is easy to miss a number in a long list. The most common way to scan a long list is to plot it: Click in the data table to make that window active. Select Graph / Overlay Plot. Select the variable to be plotted (e.g., Studentized Resid PRICE) and put that in the Y box. You have two options for the X variable: If you have an informative number for each observation (e.g., well number), you can select that variable and put it in the X box. If you don't, leave the X box blank. The row number (i.e., observation number) will be used as the X coordinate. There is no id variable for the clocks, so I left X blank. Click OK and you get the plot.

23. See below for how to identify interesting points in this plot.

24.
Another option is to sort the list of numbers so the interesting values (unusually large or unusually small) are easy to find. Click on the data window to make it active and right-click on the column to sort on (e.g., Cook's D Influence). Select Sort from the menu, then select either Ascending or Descending. Descending puts the largest values first; Ascending puts the smallest values first. You may get a warning from JMP about 'Can't reorder observations'. That's because there is a plot, like the plot of studentized residuals against row number, that would be messed up by reordering (and hence renumbering) observations. You can ignore this because JMP will open a second data set window with the observations in sorted order.

Identifying interesting observations: JMP implements linked graphics: when you select a point in one window, it is highlighted in all windows. This makes it easy to identify the interesting observations in a plot.

1. In the plot of studentized residuals, there is an observation with a studentized residual larger than 2. Which observation is this?

2. There are two ways to find out: a) Click in the plot window to activate it, then move the pointer over the point of interest. The row number should appear. b) Click in the plot window to activate it, then select the point of interest (left-click near the point). Look in the data window and scroll down until you find the selected observation: #16.

3. You can clear the selection by clicking somewhere far from any observation in the plot window.

4. You can select multiple points by holding down the left mouse button while selecting, or by shift-clicking multiple points.

5. You can work in the other direction: if you select rows in the data window, they are highlighted in the plot.

6. If you have multiple plots open, points are highlighted in all windows. To demonstrate this, select Graph / Overlay Plot and plot Y = PRICE vs. X = AGE, then select Graph / Overlay Plot and plot Y = NUMBIDS vs. X = AGE.
Clicking on a point or points in one graph highlights them in all graphs.

Some potentially useful JMP Fit Model shortcuts:

1. By default, the Model Specification dialog box closes when you run a model. You can keep it around by checking the "Keep dialog open" box below the Run button. This is very helpful when you plan to fit various models to the same data set.

2. If you decide to transform a variable (e.g., after looking at the residual vs. predicted value plot), you can do that "on the fly" in the Model Specification dialog. Select the variable to be transformed (Y or one or more of the X variables), click the red triangle by "Transform" and select the transformation.

3. If you have a lot of variables in the model and you want to create all (or some) pairs of interaction variables, JMP can do this for you. Select the desired X variables in the Select Columns box, then left-click on Macros. Choose Factorial to Degree from the drop-down menu. The default degree is 2 (box below Macros). This puts all variables and all products of variables into the Model Effects box. If you want all variables and some interactions, select all the variables and Add them to the Construct Model Effects box. Then, select the variables that participate in interactions and click Macros / Factorial to Degree. The desired interactions are added to the model box. (Technically, the selected variables are added a second time, then removed because they are redundant.) Note: This does not create quadratic terms, just the products.

4. If you want JMP to create a bunch of quadratic terms for you, select the desired variables (from the Select Columns box), then click Macros / Polynomial to Degree. JMP adds squared terms for each selected variable to the Model Effects box. Note: This does not create products, just the quadratic terms.

5. If you want both quadratics and products, i.e., a full quadratic model, select the desired variables and click Macros / Response Surface.
All products and quadratics are added to the Model Effects box. The desired variables change to NAME &RS, which turns on additional results options that are usually wanted for optimization (remember the mention of response surface models in lecture). Ask me if you want to know more (completely optional!).

6. If you want to delete a term from the Model Effects box, select it, then click Remove.
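To summarize what the Response Surface macro assembles, here is a sketch of the full quadratic model's columns in Python (illustrative only; the function name and the AGE/NUMBIDS values are made up):

```python
import numpy as np

def full_quadratic_columns(x1, x2):
    """Columns for a full quadratic model in two X variables:
    intercept, x1, x2, x1*x2 (interaction), x1^2, x2^2 (quadratics)."""
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1 * x2, x1 ** 2, x2 ** 2])

age     = np.array([127.0, 115.0, 187.0])
numbids = np.array([13.0, 12.0, 8.0])
X = full_quadratic_columns(age, numbids)
print(X.shape)  # (3, 6): one column per term in the full quadratic model
```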