Assignment 7

STAT 360 – Regression Analysis – Assignment 7 (75 pts.)
Fall 2015
1 – Black Cherry Trees in R
Note: You must complete all parts of this problem in R. These data were also used
in Section 14 – Response Transformations, so you can look at the analysis done
in the notes for guidance with this problem. However, I am going to have you
look at an alternative to the model developed in that section.
These data record the girth in inches, height in feet, and volume of timber in cubic feet of each of a
sample of 31 felled black cherry trees in Allegheny National Forest, Pennsylvania. Note that
girth is the diameter of the tree (in inches) measured at 4 ft. 6 in. above the ground.
The variables in this dataset:
𝑌 = Vol – volume of the black cherry tree (ft³)
𝑋1 = 𝐷 – girth/diameter of tree (in.)
𝑋2 = 𝐻𝑡 – height of tree (ft.)
a) Construct a scatterplot matrix of these data using the
pairs.plus function in the Regression.RData directory
I sent to you via e-mail. Briefly comment on what you learn
from looking at this plot. (3 pts.)
b) Fit the model 𝐸(𝑉𝑜𝑙|𝐷, 𝐻𝑡) = 𝛽𝑜 + 𝛽1 𝐷 + 𝛽2 𝐻𝑡, call
this model bc.lm1. Use plot(bc.lm1) to
examine residual plots and case diagnostics for this model.
Comment on any model deficiencies suggested by these
plots. (3 pts.)
[Figure: Black cherry tree]
c) Even though there is clearly curvature in the plot of the residuals vs. the fitted
values, conduct Tukey’s Test for Nonadditivity by adding the squared fitted
values from bc.lm1 as a term in the model above (see the haystack example in the notes).
Does this test suggest curvature? Explain. (3 pts.)
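The mechanics of the test in part (c) can be sketched as follows. This is only an illustration, not the course solution: it assumes R's built-in trees data frame (31 black cherry trees, columns Girth, Height, Volume) stands in for the assignment data, and the names bc, fit.sq, bc.tukey, and tukey.row are made up for the example.

```r
# Assumption: the built-in 'trees' data stand in for the course data file.
data(trees)
bc <- data.frame(Vol = trees$Volume, D = trees$Girth, Ht = trees$Height)

# Model from part (b)
bc.lm1 <- lm(Vol ~ D + Ht, data = bc)

# Add the squared fitted values from bc.lm1 as an extra term
bc$fit.sq <- fitted(bc.lm1)^2
bc.tukey <- lm(Vol ~ D + Ht + fit.sq, data = bc)

# The t-test on the added term is the one-degree-of-freedom test for curvature
tukey.row <- summary(bc.tukey)$coefficients["fit.sq", ]
print(tukey.row)
```

A small p-value for fit.sq would indicate curvature that the linear mean function does not capture.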
d) In the notes from Section 14 we used an inverse fitted value (response) plot to
suggest 𝑇(𝑌) = √𝑉𝑜𝑙 as a transformation to address the curvature. For this
assignment, use the Box-Cox transformation method to identify a suitable
transformation to achieve approximate normality of the response. It will give a
range of possible 𝜆 values; choose the common transformation closest to the
optimal 𝜆 value chosen by Box-Cox. R commands for doing this are shown
below.
You can do this in R by using either (or both) of the functions below:
> BCtran(Vol)                     # this function is in the Regression.RData directory
> summary(powerTransform(Vol))    # function in the car library
What are the optimal 𝜆 value and the nearest common transformation
recommended by the Box-Cox method? What is the LR test p-value for testing
𝑁𝐻: 𝜆 = 0 𝑣𝑠. 𝐴𝐻: 𝜆 ≠ 0 ? (4 pts.)
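If neither function above is available, the same idea can be sketched with boxcox() from the MASS package, which ships with standard R installations. This is a grid-based approximation under stated assumptions: the built-in trees data stand in for the assignment data, the response is transformed on its own (intercept-only model), and the names bc.prof, lambda.hat, lambda.ci, and lr.pvalue are made up for the example.

```r
library(MASS)  # provides boxcox()

# Assumption: the built-in 'trees' data stand in for the course data file.
data(trees)
Vol <- trees$Volume

# Profile log-likelihood for lambda on a fine grid, without drawing the plot
bc.prof <- boxcox(Vol ~ 1, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value of lambda that maximizes the profile log-likelihood
lambda.hat <- bc.prof$x[which.max(bc.prof$y)]

# Approximate 95% likelihood-ratio interval for lambda
cutoff <- max(bc.prof$y) - qchisq(0.95, df = 1) / 2
lambda.ci <- range(bc.prof$x[bc.prof$y >= cutoff])

# Approximate LR test of lambda = 0 (log transformation)
at.zero <- which.min(abs(bc.prof$x))
lr.stat <- 2 * (max(bc.prof$y) - bc.prof$y[at.zero])
lr.pvalue <- pchisq(lr.stat, df = 1, lower.tail = FALSE)

print(c(lambda.hat = lambda.hat, lower = lambda.ci[1], upper = lambda.ci[2]))
print(lr.pvalue)
```

Because the likelihood is only evaluated on a grid, the values may differ slightly from those reported by powerTransform.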
e) Fit the model using the common transformation for the response found in part
(d) and call this model bc.lm2. Examine residual plots for this model and
comment on the adequacy of this model, i.e. use the command plot(bc.lm2).
(3 pts.)
f) If we assume that black cherry trees are conical (not comical) then we might
expect that the volume of a tree is given by the volume of a cone.
(See the notes linked below for a variety of geometric considerations that are used in estimating the volume
of a tree: http://www2.latech.edu/~strimbu/Teaching/FOR306/T4.pdf)
$Vol = \frac{1}{3}\pi r^2 h$, where $r$ = radius of the tree ($D/2$) and $h = Ht$
Taking the logarithm of both sides gives,
$\log(Vol) = \log\left(\frac{\pi}{3}\right) + 2\log(r) + \log(h)$
This suggests using the following mean function with the
response log transformed.
𝐸(log(𝑉𝑜𝑙) |𝐷, 𝐻𝑡) = 𝛽𝑜 + 𝛽1 log(𝐷) + 𝛽2 log(𝐻𝑡)
Fit the model with the response and both predictors log transformed and call this
model bc.lm3. Examine the residuals from this fit and comment on the model's
adequacy. (3 pts.)
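To connect the cone formula to the coefficients of this mean function, it can help to write the radius in terms of the measured diameter. The following is a worked sketch under the assumption that the tree is an exact cone. Because $D$ is recorded in inches and $Ht$ in feet, the radius in feet is $r = D/24$.

```latex
\begin{align*}
Vol &= \frac{1}{3}\pi\left(\frac{D}{24}\right)^{2} Ht = \frac{\pi}{1728}\, D^{2}\, Ht \\
\log(Vol) &= \log\left(\frac{\pi}{1728}\right) + 2\log(D) + \log(Ht)
\end{align*}
```

Matching terms with $E(\log(Vol) \mid D, Ht) = \beta_0 + \beta_1\log(D) + \beta_2\log(Ht)$, an exact cone would correspond to $\beta_0 = \log(\pi/1728)$, $\beta_1 = 2$, and $\beta_2 = 1$. Real trees are not exact cones, so the fitted values need not match these.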
g) What is the 𝑅² for this model? Interpret this quantity.
(2 pts.)
h) Given the expression for the log of the volume above
$\log(Vol) = \log\left(\frac{\pi}{3}\right) + 2\log(r) + \log(h)$
we might expect the estimated coefficient for the diameter to be twice as
large as the estimated coefficient for the height. Does this appear to be the case?
Explain. (2 pts.)
i) Conduct a test to see if there are any outliers for the model fit in part (f), i.e.
bc.lm3. Summarize your results. (2 pts.)
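One common form of the outlier test compares each externally studentized residual to a t distribution and applies a Bonferroni correction. The sketch below uses base R only and is not necessarily the function used in the course notes. It assumes the built-in trees data stand in for the assignment data, and the names t.i, p.raw, p.bonf, and worst are made up for the example.

```r
# Assumption: the built-in 'trees' data stand in for the course data file.
data(trees)
bc.lm3 <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Externally studentized (deleted) residuals, one per case
t.i <- rstudent(bc.lm3)
n <- length(t.i)

# Each deleted residual follows a t distribution with (residual df - 1) df
p.raw <- 2 * pt(-abs(t.i), df = df.residual(bc.lm3) - 1)

# Bonferroni adjustment for testing all n cases at once
p.bonf <- pmin(1, n * p.raw)

# Case with the largest absolute studentized residual
worst <- which.max(abs(t.i))
print(data.frame(case = worst, t = t.i[worst], p.bonf = p.bonf[worst]))
```

A case would be flagged as an outlier only if its Bonferroni-adjusted p-value falls below the chosen significance level.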
j) Do the CERES plots for log(𝐷) & log(𝐻𝑡) in the model fit in part (f), i.e. bc.lm3
suggest any further transformation of these terms would improve the model?
Explain. (3 pts.)
2 – Abalones for the North Coast and Islands of Bass Strait Tasmania, Australia
(Datafiles: Abalone (no sex).JMP and Abalone (no sex).txt)
The goal of this regression analysis is to develop a model for predicting the age of
abalone (or paua) from physical measurements. The age of abalone is determined by
cutting the shell through the cone, staining it, and counting the number of rings
through a microscope -- a boring and very time-consuming task. It is hoped that other
measurements, which are easier to obtain, can be used to predict the age. Also, the dust
from cleaning a paua shell is toxic and can lead to health problems with routine
exposure. Further information, such as weather patterns and location (hence food
availability), may be required to solve the problem.
[Figure: Abalone or paua]
The variables in these data:
𝑌 = rings – number of rings
𝑋1 = length – longest shell measurement (mm)
𝑋2 = diam – perpendicular to length (mm)
𝑋3 = height – height of abalone with meat inside (mm)
𝑋4 = whole.weight – weight of the whole abalone (g)
𝑋5 = shucked.weight – weight of meat (g)
𝑋6 = visc.weight – gut weight (after bleeding) (g)
𝑋7 = shell.weight – weight of shell after drying (g)
Data Source:
Warwick J. Nash, Tracy L. Sellers, Simon R. Talbot, Andrew J.
Cawthorn & Wes B. Ford from their paper:
"The Population Biology of Abalone (Haliotis species) in Tasmania.
I. Blacklip Abalone (H. rubra) from the North Coast
and Islands of Bass Strait" (1994)
Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
a) Examine a scatterplot matrix of the response (𝑌) and all of the potential
predictors (𝑋1 , … , 𝑋7 ).
Discuss these plots in terms of the following: (5 pts.)
• marginal/univariate distributions of 𝑌 and the 𝑋𝑗's
• relationships between 𝑌 and the 𝑋𝑗's
• relationships between the 𝑋𝑗's
• unusual cases
• multicollinearity issues
In R this can be done using the function pairs.plus in Regression.RData.
[Figure: Shell, before and after cleaning]
b) Fit the model 𝐸(𝑌|𝑿) = 𝛽𝑜 + 𝛽1 𝑋1 + ⋯ + 𝛽7 𝑋7 and 𝑉𝑎𝑟(𝑌|𝑿) = 𝜎² in R and call it
ab.lm1. Construct plots of the residuals from this model and comment on any
model deficiencies suggested. Use the code below to plot all the diagnostic
displays in one plotting window and include the plot. (4 pts.)
> ab.lm1 = lm(rings~.,data=Abalone)
> par(mfrow=c(2,2))
> plot(ab.lm1)
> par(mfrow=c(1,1))
c) Construct an inverse fitted value (response) plot using the function
invResPlot in the car library.
> invResPlot(ab.lm1)
What transformation (𝑇(𝑦)) would you use for the response given what you saw
in part (b) and what the inverse fitted value (response) plot shows? Explain why
this plot may not give an accurate visual impression of the response
transformation 𝑇(𝑦) for these data. (4 pts.)
d) Fit the model using the transformation you chose in part (c) and call it ab.lm2.
Again examine the residual plots and comment on any model deficiencies
exhibited by these plots. (3 pts.)
e) Use Tukey’s Nonadditivity Test to test for curvature in the model from part (d).
Summarize your findings giving the t-statistic and associated p-value. (3 pts.)
f) In order to address the curvature not addressed by our current model (ab.lm2)
we will now consider creating terms that allow for nonlinear effects in some of the
predictors. One tool to do this is the component-plus-residual plot or C+R Plot.
Construct C+R Plots for each predictor in ab.lm2.
> crPlots(ab.lm2)
Which predictors show the greatest degree of nonlinearity? Explain why some of
these plots may not be giving an accurate impression of the functional form for the
predictors in this model. (4 pts.)
g) To address the problems with the C+R Plots for these data, examine CERES
Plots for each predictor.
> ceresPlots(ab.lm2)
Which predictors show the greatest degree of nonlinearity? What terms or
transformations are suggested by the CERES plot for the following variables:
length, diam, height, whole.weight, and shell.weight? Note: you do not
necessarily need to transform all of these predictors. (6 pts.)
h) Using your choices for predictor transformations or nonlinear functions of the
predictors (e.g. polynomials), build a multiple regression model incorporating
them. Write out the mean function for your final model. Summarize your final model
and include that summary below. (8 pts.)
i) Use the plot function to examine residuals and case diagnostics for your final
model from part (h). Discuss any problems/concerns exhibited by these plots
and case diagnostics. (3 pts.)
j) Use the outlier t-test to check for outliers in your final model. Which
observation(s) are classified as outliers? (3 pts.)
k) At this point you should have 2 – 5 observations/cases you find questionable due
to their influence or the fact they are poorly fit (i.e. outliers). Rerun your model
with these observations deleted and compare this model summary to the one
from part (h), which includes these observations.
> poo.sub = lm(formula(poo),data=Abalone,subset=-c(1,2,3,4))
Here poo is the model from part (h) and -c(1,2,3,4) will contain the
observations you identified in parts (i) & (j), e.g. if the observations to remove are
337, 888, and 3144 then you would use -c(337,888,3144). Compare and
contrast the differences in the model summaries. (4 pts.)
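The refit-and-compare step can be sketched on a small example. Because the Abalone data file is not bundled with R, this illustration uses the built-in trees data instead. The row numbers in drop.rows are arbitrary placeholders, not cases identified by any diagnostic, and the names full, reduced, and comparison are made up for the example.

```r
# Assumption: the built-in 'trees' data are used purely for illustration.
data(trees)
full <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Hypothetical rows to delete (placeholders only)
drop.rows <- c(17, 18)

# Refit the same formula with those rows excluded via a negative subset
reduced <- lm(formula(full), data = trees, subset = -drop.rows)

# Side-by-side comparison of coefficients, R-squared, and residual SE
comparison <- rbind(
  cbind(full = coef(full), reduced = coef(reduced)),
  r.squared = c(summary(full)$r.squared, summary(reduced)$r.squared),
  sigma = c(summary(full)$sigma, summary(reduced)$sigma)
)
print(round(comparison, 4))
```

Large shifts in the coefficients or the residual standard error after deletion suggest the removed cases were influential.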
