STAT 360 – Regression Analysis – Assignment 7 (75 pts.) Fall 2015

1 – Black Cherry Trees in R

Note: You must complete all parts of this problem in R. These data were also used in Section 14 – Response Transformations, so you can look at the analysis done in the notes for some guidance with this problem. However, for this problem you will look at an alternative to the model developed in that section.

These data record the girth in inches, height in feet, and volume of timber in cubic feet for each of a sample of 31 felled black cherry trees in Allegheny National Forest, Pennsylvania. Note that girth is the diameter of the tree (in inches) measured at 4 ft. 6 in. above the ground.

The variables in this dataset:
$Y = \mathrm{Vol}$ – volume of the black cherry tree (ft³)
$X_1 = D$ – girth/diameter of tree (in.)
$X_2 = \mathrm{Ht}$ – height of tree (ft.)

a) Construct a scatterplot matrix of these data using the pairs.plus function in the Regression.RData directory I sent to you via e-mail. Briefly comment on what you learn from looking at this plot. (3 pts.)

b) Fit the model $E(\mathrm{Vol} \mid D, \mathrm{Ht}) = \beta_0 + \beta_1 D + \beta_2 \mathrm{Ht}$ and call this model bc.lm1. Use plot(bc.lm1) to examine residual plots and case diagnostics for this model. Comment on any model deficiencies suggested by these plots. (3 pts.)

[Figure: Black Cherry Tree]

c) Even though there is clearly curvature in the plot of the residuals vs. the fitted values, conduct Tukey's Test for Nonadditivity by adding the squared fitted values from bc.lm1 as a term in the model above (see the haystack example in the notes). Does this test suggest curvature? Explain. (3 pts.)

d) In the notes from Section 14 we used an inverse fitted value (response) plot to suggest $T(Y) = \sqrt{\mathrm{Vol}}$ as a transformation to address the curvature. For this assignment, use the Box-Cox transformation method to identify a suitable transformation to achieve approximate normality of the response.
Box-Cox will give a range of possible $\lambda$ values; choose the common transformation closest to the optimal $\lambda$ value chosen by Box-Cox. R commands for doing this are shown below.

You can do this in R by using either (or both) of the functions below:

> BCtran(Vol)                     # this function is in the Regression.RData directory
> summary(powerTransform(Vol))    # function in the car library

What are the optimal $\lambda$ value and the nearest common transformation recommended by the Box-Cox method? What is the LR test p-value for testing $NH: \lambda = 0$ vs. $AH: \lambda \neq 0$? (4 pts.)

e) Fit the model using the common transformation for the response found in part (d) and call this model bc.lm2. Examine residual plots for this model and comment on the adequacy of this model, i.e. use the command plot(bc.lm2). (3 pts.)

f) If we assume that black cherry trees are conical (not comical), then we might expect that the volume of a tree is given by the volume of a cone. (See the notes linked here for a variety of geometric considerations that are used in estimating the volume of a tree: http://www2.latech.edu/~strimbu/Teaching/FOR306/T4.pdf)

$$\mathrm{Vol} = \frac{1}{3}\pi r^2 h, \quad \text{where } r = \text{radius of tree } \left(\frac{D}{2}\right) \text{ and } h = \mathrm{Ht}$$

Taking the logarithm of both sides gives

$$\log(\mathrm{Vol}) = \log\left(\frac{\pi}{3}\right) + 2\log(r) + \log(h)$$

This suggests using the following mean function with the response log transformed:

$$E(\log(\mathrm{Vol}) \mid D, \mathrm{Ht}) = \beta_0 + \beta_1 \log(D) + \beta_2 \log(\mathrm{Ht})$$

Fit the model with the response and both predictors log transformed and call this model bc.lm3. Examine the residuals from this fit and comment on the model's adequacy. (3 pts.)

g) What is the $R^2$ for this model? Interpret this quantity. (2 pts.)

h) Given the expression for the log of the volume above,

$$\log(\mathrm{Vol}) = \log\left(\frac{\pi}{3}\right) + 2\log(r) + \log(h)$$

we might expect the estimated coefficient for the diameter to be twice as large as the estimated coefficient for the height.
Does this appear to be the case? Explain. (2 pts.)

i) Conduct a test to see if there are any outliers for the model fit in part (f), i.e. bc.lm3. Summarize your results. (2 pts.)

j) Do the CERES plots for $\log(D)$ and $\log(\mathrm{Ht})$ in the model fit in part (f), i.e. bc.lm3, suggest that any further transformation of these terms would improve the model? Explain. (3 pts.)

2 – Abalones for the North Coast and Islands of Bass Strait, Tasmania, Australia
(Datafiles: Abalone (no sex).JMP and Abalone (no sex).txt)

The goal of this regression analysis is to develop a model for predicting the age of abalone (or paua) from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and very time-consuming task. It is hoped that other measurements, which are easier to obtain, can be used to predict the age. Also, the dust from cleaning a paua shell is toxic and can lead to health problems with routine exposure. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

[Figure: Abalone or Paua]

The variables in these data:
$Y = \text{rings}$ – number of rings
$X_1 = \text{length}$ – longest shell measurement (mm)
$X_2 = \text{diam}$ – perpendicular to length (mm)
$X_3 = \text{height}$ – height of abalone with meat inside (mm)
$X_4 = \text{whole.weight}$ – weight of the whole abalone (g)
$X_5 = \text{shucked.weight}$ – weight of meat (g)
$X_6 = \text{visc.weight}$ – gut weight (after bleeding) (g)
$X_7 = \text{shell.weight}$ – weight of shell after drying (g)

Data Source: Warwick J. Nash, Tracy L. Sellers, Simon R. Talbot, Andrew J. Cawthorn & Wes B. Ford, from their paper "The Population Biology of Abalone (Haliotis) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait" (1994), Sea Fisheries Division, Technical Report No.
48 (ISSN 1034-3288)

a) Examine a scatterplot matrix of the response ($Y$) and all of the potential predictors ($X_1, \ldots, X_7$). Discuss these plots in terms of the following: (5 pts.)
- marginal/univariate distributions of $Y$ and the $X_j$'s
- relationships between $Y$ and the $X_j$'s
- relationships between the $X_j$'s
- unusual cases
- multicollinearity issues
In R this can be done using the function pairs.plus in Regression.RData.

[Figure: Shell, before and after cleaning]

b) Fit the model $E(Y \mid \mathbf{X}) = \beta_0 + \beta_1 X_1 + \cdots + \beta_7 X_7$ and $\mathrm{Var}(Y \mid \mathbf{X}) = \sigma^2$ in R and call it ab.lm1. Construct plots of the residuals from this model and comment on any model deficiencies suggested. Use the code below to plot all the diagnostic displays in one plotting window and include the plot. (4 pts.)

> ab.lm1 = lm(rings~.,data=Abalone)
> par(mfrow=c(2,2))
> plot(ab.lm1)
> par(mfrow=c(1,1))

c) Construct an inverse fitted value (response) plot using the function invResPlot in the car library.

> invResPlot(ab.lm1)

What transformation $T(y)$ would you use for the response, given what you have seen in part (b) and what the inverse fitted value (response) plot shows? Explain why this plot may not give an accurate visual impression of the response transformation $T(y)$ for these data. (4 pts.)

d) Fit the model using the transformation you chose in part (c) and call it ab.lm2. Again examine the residual plots and comment on any model deficiencies exhibited by these plots. (3 pts.)

e) Use Tukey's Nonadditivity Test to test for curvature in the model from part (d). Summarize your findings, giving the t-statistic and associated p-value. (3 pts.)

f) To address the curvature not captured by our current model (ab.lm2), we will now consider creating terms that allow for nonlinear effects in some of the predictors. One tool for doing this is the component-plus-residual plot, or C+R Plot. Construct C+R Plots for each predictor in ab.lm2.
> crPlots(ab.lm2)

Which predictors show the greatest degree of nonlinearity? Explain why some of these plots may not be giving an accurate impression of the functional form for the predictors in this model. (4 pts.)

g) To address the problems with the C+R Plots for these data, examine CERES Plots for each predictor.

> ceresPlots(ab.lm2)

Which predictors show the greatest degree of nonlinearity? What terms or transformations are suggested by the CERES plots for the following variables: length, diam, height, whole.weight, and shell.weight? Note: you do not necessarily need to transform all of these predictors. (6 pts.)

h) Using your choices of predictor transformations or nonlinear functions of the predictors (e.g. polynomials), build a multiple regression model incorporating them. Write out the mean function for your final model. Summarize your final model and include that summary below. (8 pts.)

i) Use the plot function to examine residuals and case diagnostics for your final model from part (h). Discuss any problems/concerns exhibited by these plots and case diagnostics. (3 pts.)

j) Use the outlier t-test to check for outliers in your final model. Which observation(s) are classified as outliers? (3 pts.)

k) At this point you should have 2 – 5 observations/cases you find questionable due to their influence or the fact that they are poorly fit (i.e. outliers). Rerun your model with these observations deleted and compare this model summary to the summary from part (h), which included these observations.

> poo.sub = lm(formula(poo),data=Abalone,subset=-c(1,2,3,4))

Here poo is the model from part (h) and -c(1,2,3,4) will contain the observations you identified in parts (i) & (j), e.g. if the observations to remove are 337, 888, and 3144 then you would use -c(337,888,3144). Compare and contrast the differences in the model summaries. (4 pts.)
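
The outlier check and the delete-and-refit comparison in parts (j) and (k) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the assignment's solution: it uses R's built-in trees data as a stand-in so that it runs without the Abalone file, and the names fit, fit.sub, and flagged, as well as the rule of flagging the two cases with the largest Cook's distance, are hypothetical choices made for the example. In the assignment, the cases to drop are whichever ones you identify in parts (i) and (j).

```r
# Stand-in data: R's built-in `trees` (Girth, Height, Volume), used only so
# this sketch is self-contained. Replace with your own model and data.
fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Outlier t-test by hand: externally studentized residuals compared to a
# t distribution with n - p - 1 df, then Bonferroni-adjusted over n cases.
n <- nrow(trees)
p <- length(coef(fit))
rs <- rstudent(fit)
p.unadj <- 2 * pt(abs(rs), df = n - p - 1, lower.tail = FALSE)
p.bonf <- pmin(1, n * p.unadj)

# Hypothetical flagging rule for this sketch only: the two cases with the
# largest Cook's distance (row positions, usable in `subset`).
flagged <- order(cooks.distance(fit), decreasing = TRUE)[1:2]

# Refit the same formula with the flagged rows removed, as in part (k).
fit.sub <- lm(formula(fit), data = trees, subset = -flagged)

# Side-by-side comparison of coefficients and residual standard errors.
comparison <- cbind(all.cases = coef(fit), cases.removed = coef(fit.sub))
print(round(comparison, 4))
print(c(sigma.all = summary(fit)$sigma, sigma.sub = summary(fit.sub)$sigma))
```

Putting the two sets of coefficients in one matrix makes it easy to see which estimates move when the questionable cases are removed. Large shifts suggest those cases are influential, while small shifts suggest the original fit is stable.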