
ST512
HOMEWORK 2
NCSU Sum2 2011
Q2.1
The data are the numbers of visits to Hawaii from each of the other 49 states in the
year 2000, along with each state's population, income, and distance to Hawaii, which might
help "explain" the variation in visits.
In the data step we compute the natural logarithm of each variable: visits (LVISITS),
Dist (Ldist), Pop (Lpop), and Income (Lincome).
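The log transformation described above, and a correlation of the kind PROC CORR reports, can also be sketched outside SAS. The following is a minimal Python sketch, not the course's SAS solution: the four data rows are made-up placeholders (not the homework data), and the helper name `pearson` is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up placeholder rows (visits, pop, income, dist); NOT the homework data.
rows = [
    (12000, 500000, 25000, 4000),
    (90000, 4000000, 30000, 2500),
    (30000, 1500000, 27000, 4500),
    (150000, 9000000, 32000, 2400),
]

# Natural logs, mirroring LVISITS, Lpop, Lincome and Ldist in the data step.
lvisits = [math.log(r[0]) for r in rows]
logged = {
    "Lpop": [math.log(r[1]) for r in rows],
    "Lincome": [math.log(r[2]) for r in rows],
    "Ldist": [math.log(r[3]) for r in rows],
}

# Correlation of each logged explanatory variable with LVISITS.
correlations = {name: pearson(values, lvisits) for name, values in logged.items()}

# Variable with the largest absolute correlation with LVISITS.
best = max(correlations, key=lambda name: abs(correlations[name]))
```

With the real data, `correlations` plays the role of the LVISITS column of the PROC CORR output, and `best` names the variable to use in the simple regression.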
1. Run the procedure PROC CORR on the log-transformed data.
2. Which explanatory variable is most highly correlated with LVISITS?
a. Using only that variable, what is the predicting equation of LVISITS?
b. Write down the R-square for comparison to the next model.
3. Regress the log transformed visits on similarly transformed income, population, and
distance.
a. Write out the fitted equation.
b. What proportion of variation is explained by the multiple regression?
c. How much improvement in R-square did you get (versus the simple linear model)?
d. Test the residuals for normality. Use alpha=0.05.
Our model contains the variable population. We would expect a state with a large population
to produce more visits to Hawaii than a small state. Perhaps we should have looked at the
proportion of people visiting Hawaii, that is, log(visits/pop).
Notice that log(visits/pop) = log(visits)-log(pop).
So if
$\log(\text{visits}/\text{pop}) = \beta_0 + \beta_1 \log(\text{income}) + \beta_2 \log(\text{distance})$
then
$\log(\text{visits}) = \beta_0 + \beta_1 \log(\text{income}) + \beta_2 \log(\text{distance}) + 1 \cdot \log(\text{pop})$.
In other words, the coefficient on log(pop) would be 1 in the regression model in 3).
4. Compute a t test for testing the null hypothesis that the regression coefficient for
log(pop) is 1. Write down the null hypothesis, the t test and its degrees of freedom, and
your decision.
5. Present a 95% confidence interval for the conditional mean of the number of visits from a
city with a population of 321344, a distance of 3314, and an average income of 27436.
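For the test that the log(pop) coefficient equals 1, the t statistic has the usual form for a null value other than zero. This is a generic sketch under stated assumptions: $b_{\text{pop}}$ and $\mathrm{se}(b_{\text{pop}})$ are labels chosen here for the estimate and standard error read from the regression output, the alternative is taken to be two-sided, and the degrees of freedom assume all 49 states are used with an intercept and three explanatory variables.

```latex
H_0: \beta_{\text{pop}} = 1 \quad \text{vs.} \quad H_a: \beta_{\text{pop}} \neq 1,
\qquad
t = \frac{b_{\text{pop}} - 1}{\mathrm{se}(b_{\text{pop}})},
\qquad
\text{df} = n - p - 1 = 49 - 3 - 1 = 45
```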
July 08 2011
MLR
Q2.2
A medical study examined the relationship between systolic blood pressure and two
explanatory variables, weight (kg) and age (days), for infants. The data for 25 infants
are shown here.
1. Run the procedure PROC CORR for Age, Weight and SystolicBP.
a. Which explanatory variable shows the highest linear correlation with SystolicBP?
b. Are the explanatory variables linearly correlated?
2. Run a regression analysis for systolic blood pressure with both explanatory variables
and their interaction in the regression model, and write the estimated prediction
equation for SystolicBP.
3. Interpret the partial regression coefficient for weight.
4. An infant of 5 weeks was found to have a weight of 3.1 kg. Compute a 95% prediction
interval for her SystolicBP.
5. Compute a 95% confidence interval for the conditional mean of systolic blood pressure
for 5-week-old infants with a weight of 3.1 kg.
6. Interpret the regression coefficient for the interaction of age and weight. What is the
effect on SystolicBP of a one-kilogram increase in the weight of 3-week-old infants?
What would the effect be if the infant is 6 weeks old?
7. Based on the residual diagnostics, do you suspect any violation of the assumptions?
8. Based on the residual diagnostics, should we be concerned about the presence of extreme
values or outliers?
9. If your answer in 8) is affirmative, rerun the analysis without the potential outlier.
Explain the changes in the regression coefficients and R-squared.
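For the interaction question, the effect of weight depends on age: in a model with an Age*Weight term, a one-kilogram increase in weight changes the predicted SystolicBP by the weight coefficient plus the interaction coefficient times age. The minimal Python sketch below illustrates that arithmetic; the function name and coefficient values are hypothetical (they are not estimates from the homework data), and age is taken in days as in the data description.

```python
def weight_effect(b_weight, b_interaction, age_days):
    """Change in predicted SystolicBP per 1 kg increase in weight at a given age.

    Assumes the model
        BP = b0 + b_age*Age + b_weight*Weight + b_interaction*Age*Weight
    with Age measured in days.
    """
    return b_weight + b_interaction * age_days

# Hypothetical coefficient values for illustration only;
# substitute the estimates from your own regression output.
b_weight, b_interaction = 2.0, 0.1

effect_3_weeks = weight_effect(b_weight, b_interaction, age_days=21)  # 3 weeks = 21 days
effect_6_weeks = weight_effect(b_weight, b_interaction, age_days=42)  # 6 weeks = 42 days
```

The two results differ by the interaction coefficient times 21 days, which is why the interaction coefficient is read as the change in the weight slope per additional day of age.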
Q2.3
A construction science project compared the daily gas consumption of 20
homes with a new form of insulation to that of 20 similar homes with standard
insulation. Instruments recorded the temperature both inside and
outside the homes over a six-month period (October-March). The
average differences in these values are given below. The
average daily gas consumption (in kilowatt hours) was also obtained. All the homes
were heated with gas.
1. Run a regression analysis of gas consumption on temperature
difference for each type of insulation. Compare the fit of the two
lines.
2. Run a joint regression analysis of gas consumption on temperature
difference for both types of insulation, fitting a separate intercept
and a separate slope for each type of insulation. Compare the intercepts and
slopes with the values in 1.
3. Run a joint regression analysis of gas consumption on temperature
difference for both types of insulation, fitting a separate intercept
and a common slope for the two types of insulation. Test whether separate
slopes are needed in the model.
4. Predict the average gas consumption for homes with the new and with the
standard insulation when the temperature difference is °F.
5. Place 95% confidence intervals on your predicted values in part 4).
6. Is there evidence that average gas consumption has been reduced by
using the new form of insulation?
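One common way to fit separate intercepts and slopes in a single regression is to code insulation type as a 0/1 indicator and add its product with temperature difference. The Python sketch below shows that coding only; the function names, the coefficient values, and the temperature difference of 20 are all hypothetical placeholders, not values from the homework.

```python
def design_row(temp_diff, is_new, separate_slopes=True):
    """One row of the design matrix for the joint regression.

    Columns: intercept, temp_diff, indicator (1 = new insulation) and,
    when separate_slopes is True, indicator * temp_diff.
    """
    d = 1.0 if is_new else 0.0
    row = [1.0, float(temp_diff), d]
    if separate_slopes:
        row.append(d * float(temp_diff))
    return row

def predict(coefs, row):
    """Predicted gas consumption: dot product of coefficients and design row."""
    return sum(c * v for c, v in zip(coefs, row))

# Hypothetical coefficients (b0, b1, b2, b3), for illustration only.
# Standard insulation: intercept b0,      slope b1.
# New insulation:      intercept b0 + b2, slope b1 + b3.
coefs = [10.0, 2.0, -3.0, -0.5]

# Hypothetical temperature difference of 20; the handout's value is not assumed.
pred_standard = predict(coefs, design_row(20, is_new=False))
pred_new = predict(coefs, design_row(20, is_new=True))
```

Dropping the last column (`separate_slopes=False`) gives the common-slope model, so testing whether the coefficient on that product column is zero is the test for separate slopes.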
Q2.4
The data gives the normal average January minimum temperature in
degrees Fahrenheit with the latitude and longitude of 56 U.S.
cities. (For each year from 1931 to 1960, the daily minimum
temperatures in January were added together and divided by 31.
Then, the averages for each year were averaged over the 30
years.)
Ref: Peixoto, J.L. (1990) A property of well-formulated
polynomial regression models. American Statistician, 44, 26-30.
Observed facts
1. A partial regression plot of Lat vs JanTemp shows that the
relationship between JanTemp and Lat, after removing the
effects of Long, is linear and negative.
2. A partial regression plot of Long shows that the
relationship between JanTemp and Long, after removing the
effects of Lat, is not linear.
3. The regression model assumes that the relationship between
JanTemp and Long is linear. This plot shows that this
assumption is clearly violated.
4. Peixoto (1990) reports a study in which a linear
relationship is assumed between JanTemp and Lat; then,
after removing the effects of Lat, a cubic polynomial in
Long is used to predict JanTemp.
Run the analyses that support the facts above.
1. Run a multiple linear regression with Latitude and
Longitude as explanatory variables.
a. Do a residual diagnostic analysis.
b. Is there evidence that the multiple linear regression is
not the best fit?
2. Run a simple regression model of JanTemp on longitude,
a. output the residuals,
b. plot Latitude (axis X) vs residuals (axis Y)
c. describe the type of relationship between latitude and
residuals
3. Run a simple regression model of JanTemp on latitude,
a. output the residuals,
b. plot Longitude (axis X) vs residuals (axis Y)
c. describe the type of relationship between longitude
and residuals
4. Run a multiple regression model with a linear term for Latitude and a
cubic polynomial for Longitude.
5. Write the regression equation, indicate R squared.
6. Plot residuals against Longitude, describe changes.
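The residual steps above follow one pattern: fit a simple regression, save the residuals, then pair them with the other variable for plotting. The Python sketch below shows that pattern and how the linear, squared, and cubed longitude terms could be built. The five data points are made up for illustration (they are not the 56-city data), and the function names are hypothetical.

```python
def simple_ols_residuals(x, y):
    """Least-squares fit of y on x; returns (intercept, slope, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, residuals

def cubic_terms(v):
    """Linear, squared and cubed terms for a cubic polynomial in v."""
    return [v, v ** 2, v ** 3]

# Made-up illustrative values; NOT the homework data.
longitude = [80.0, 90.0, 100.0, 110.0, 120.0]
latitude = [30.0, 35.0, 40.0, 42.0, 45.0]
jan_temp = [45.0, 30.0, 20.0, 18.0, 35.0]

# Regress JanTemp on longitude and keep the residuals.
intercept, slope, resid = simple_ols_residuals(longitude, jan_temp)

# (x, y) pairs for the plot of latitude against those residuals.
points = list(zip(latitude, resid))

# Polynomial columns for longitude in the cubic model.
long_poly = [cubic_terms(v) for v in longitude]
```

Swapping the roles of `longitude` and `latitude` gives the other residual plot; the residuals from a least-squares fit with an intercept sum to zero, which is a quick sanity check on the output.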
From Peixoto’s article.
Polynomial models
Well Formulated model
Reference: Peixoto, J.L. (1990) A property of well-formulated polynomial regression
models. American Statistician, 44, 26-30. Also found in: Hand, D.J., et al. (1994) A
Handbook of Small Data Sets, London: Chapman & Hall, 208-210.