ST512 HOMEWORK 2 NCSU Sum2 2011

Q2.1
The data presented are the numbers of visits to Hawaii from the other 49 states in the year 2000, along with population, income, and distance to Hawaii, which might help "explain" the variation in visits. In the data step we compute the natural logarithm of each variable: visits (LVISITS), Dist (Ldist), Pop (Lpop), and Income (Lincome).

1. Run the procedure PROC CORR on the log-transformed data.
2. Which explanatory variable is most highly correlated with LVISITS?
   a. Using only that variable, what is the prediction equation for LVISITS?
   b. Write down the R-square for comparison to the next model.
3. Regress the log-transformed visits on the similarly transformed income, population, and distance.
   a. Write out the fitted equation.
   b. What proportion of variation is explained by the multiple regression?
   c. How much improvement in R-square did you get (versus the simple linear model)?
   d. Test the residuals for normality. Use alpha = 0.05.

Our model contains the variable population. We would expect a state with a large population to produce more visits to Hawaii than a small state. Perhaps we should have looked at the proportion of people visiting Hawaii, that is, log(visits/pop). Notice that log(visits/pop) = log(visits) - log(pop). So if

   log(visits/pop) = β0 + β1 log(income) + β2 log(distance)

then

   log(visits) = β0 + β1 log(income) + β2 log(distance) + 1 log(pop).

In other words, the coefficient on log(pop) would be 1 in the regression model in 3).

4. Compute a t test of the null hypothesis that the regression coefficient for log(pop) is 1. Write down the null hypothesis, the t statistic and its degrees of freedom, and your decision.
5. Present a 95% confidence interval for the conditional mean of the number of visits from a city with population = 321344, distance = 3314, and average income = 27436.
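As a sketch of how the test in question 4 of Q2.1 can be set up: the t statistic printed in standard regression output tests a coefficient against 0, so the statistic for a hypothesized value of 1 has to be recomputed from the estimate and its standard error. The symbols below are notation introduced here, not the handout's. The two-sided alternative is an assumption, and the degrees of freedom assume all 49 states are used with the three explanatory variables of question 3.

```latex
\[
H_0:\ \beta_{\mathrm{pop}} = 1
\quad\text{versus}\quad
H_a:\ \beta_{\mathrm{pop}} \neq 1
\]
\[
t \;=\; \frac{\hat{\beta}_{\mathrm{pop}} - 1}{\mathrm{SE}\!\left(\hat{\beta}_{\mathrm{pop}}\right)},
\qquad
\mathrm{df} \;=\; n - p - 1 \;=\; 49 - 3 - 1 \;=\; 45
\]
```

Here n is the number of observations and p the number of explanatory variables. The decision follows from comparing |t| with the appropriate critical value of the t distribution with 45 degrees of freedom.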
July 08 2011 MLR Page 1

Q2.2
A medical study was conducted to examine the relationship between systolic blood pressure and the explanatory variables weight (kg) and age (days) for infants. The data for 25 infants are shown here.

1. Run the procedure PROC CORR for Age, Weight and SystolicBP.
   a. Which explanatory variable shows the highest linear correlation with SystolicBP?
   b. Are the explanatory variables linearly correlated?
2. Run a regression analysis for systolic blood pressure with both explanatory variables and their interaction in the model, and write the estimated prediction equation for SystolicBP.
3. Interpret the partial regression coefficient for weight.
4. A 5-week-old infant was found to have a weight of 3.1 kg. Compute a 95% prediction interval for her SystolicBP.
5. Compute a 95% confidence interval for the conditional mean of systolic blood pressure for 5-week-old infants with a weight of 3.1 kg.
6. Interpret the regression coefficient for the interaction of age and weight. What is the effect on systolic BP of a one-kilogram increase in the weight of 3-week-old infants? What would the effect be for a 6-week-old infant?
7. Based on the residual diagnostics, do you suspect any violation of the assumptions?
8. Based on the residual diagnostics, should we be concerned about the presence of extreme values or outliers?
9. If your answer in 8) is affirmative, rerun the analysis without the potential outlier. Explain the changes in the regression coefficients and R-square.

Q2.3
A construction science project was to compare the daily gas consumption of 20 homes with a new form of insulation to that of 20 similar homes with standard insulation. Instruments were set up to record the temperature both inside and outside the homes over a six-month period (October-March). The average differences in these values are given below.
They also obtained the average daily gas consumption (in kilowatt hours). All the homes were heated with gas.

1. Regress gas consumption on temperature difference separately for each type of insulation. Compare the fit of the two lines.
2. Run a joint regression analysis of gas consumption on temperature difference for both types of insulation, fitting a separate intercept and a separate slope for each type. Compare the intercepts and slopes with the values in 1.
3. Run a joint regression analysis of gas consumption on temperature difference for both types of insulation, fitting a separate intercept for each type and a common slope. Test whether separate slopes are needed in the model.
4. Predict the average gas consumption for homes with new insulation and for homes with standard insulation when the temperature difference is °F.
5. Place 95% confidence intervals on your predicted values in part 4).
6. Is there evidence that average gas consumption has been reduced by using the new form of insulation?

Q2.4
The data give the normal average January minimum temperature in degrees Fahrenheit, with the latitude and longitude, of 56 U.S. cities. (For each year from 1931 to 1960, the daily minimum temperatures in January were added together and divided by 31. Then the averages for each year were averaged over the 30 years.)
Ref: Peixoto, J.L. (1990) A property of well-formulated polynomial regression models. American Statistician, 44, 26-30.

Observed facts
1. A partial regression plot of Lat vs JanTemp shows that the relationship between JanTemp and Lat, after removing the effects of Long, is linear and negative.
2. A partial regression plot of Long shows that the relationship between JanTemp and Long, after removing the effects of Lat, is not linear.
3. The regression model assumes that the relationship between JanTemp and Long is linear. This plot shows that this assumption is clearly violated.
4.
Peixoto (1990) reports a study in which a linear relationship is assumed between JanTemp and Lat; then, after removing the effects of Lat, a cubic polynomial in Long is used to predict JanTemp.

Run the analyses that support the facts above:
1. Run a multiple linear regression with Latitude and Longitude as explanatory variables.
   a. Do a residual diagnostic analysis.
   b. Is there evidence that the multiple linear regression is not the best fit?
2. Run a simple linear regression with longitude as the only explanatory variable,
   a. output the residuals,
   b. plot Latitude (X axis) vs residuals (Y axis),
   c. describe the type of relationship between latitude and the residuals.
3. Run a simple linear regression with latitude as the only explanatory variable,
   a. output the residuals,
   b. plot Longitude (X axis) vs residuals (Y axis),
   c. describe the type of relationship between longitude and the residuals.
4. Run a multiple regression model with a linear term in Latitude and a cubic polynomial in Longitude.
5. Write the regression equation and report the R-square.
6. Plot the residuals against Longitude and describe the changes.

From Peixoto's article: polynomial models and the well-formulated model.

Reference: Peixoto, J.L. (1990) A property of well-formulated polynomial regression models. American Statistician, 44, 26-30.
Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 208-210.
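For orientation, the model requested in step 4 of Q2.4 can be sketched as follows. The coefficient symbols and the error term are notation assumed here, not taken from the handout. The cubic in Long is written with its linear and quadratic terms included, which is the usual sense of a well-formulated polynomial model: every lower-order term of a higher-order term stays in the model.

```latex
\[
\mathrm{JanTemp}
= \beta_0
+ \beta_1\,\mathrm{Lat}
+ \beta_2\,\mathrm{Long}
+ \beta_3\,\mathrm{Long}^2
+ \beta_4\,\mathrm{Long}^3
+ \varepsilon
\]
```

Dropping the Long or Long^2 term while keeping Long^3 would give a model that is not well formulated in this sense.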