STATISTICS 110, FALL 2015
Homework #3 Solutions
Assigned Wed, October 14, Due Wed, October 21

1. Do Problem 2.8 (page 81).

2.8 (This is a multiple choice question, but since that was not clear I’ll address each part. You only need to choose the correct answer for credit.) A regression equation was fit to a set of data for which the correlation, r, between X and Y was 0.6. Which of the following must be true?

a. The slope of the regression line is 0.6.
False. The correlation is related to the slope, but is only equal to it if the sample standard deviations of X and Y are equal.

b. The regression model explains 60% of the variability in Y.
False. It explains 36% of the variability, found by squaring the correlation to get $R^2$.

c. The regression model explains 36% of the variability in Y.
True. $R^2$ is the proportion of the variability in Y explained by the model, and here $R^2 = r^2 = (0.6)^2 = 0.36$.

d. At least half of the residuals are smaller than 0.6 in absolute value.
False. The value of the correlation tells us nothing of this sort about the residuals.

2. To receive any credit for this problem, you must show all of the relevant R output for all 3 parts and mark on the output where you found your answers. (This is because some answers are in the back of the book.)

a. Do Part (a) of Exercise 2.13 (page 82), but show two different test statistics for testing the hypotheses. Make sure you include all 5 steps of hypothesis testing, clearly marked.

2.13a.
Step 1, hypotheses: Testing whether sugar content has a linear relationship with calories is equivalent to testing $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$ for the regression with Y = Calories and X = Sugar.

Step 2, conditions and test statistic: Ideally, we would look at the residuals to make sure the conditions appear to be met, and we would look at a plot of Sugar versus Calories to see if the relationship is linear and not curved. But we have not focused on that step, so you are not required to show plots.
(The plots do show a somewhat non-normal distribution, but nothing extreme, and the relationship looks weak but approximately linear.) The R commands and results for the test statistics are shown below, with the relevant parts highlighted. The two test statistics and their values are t = 3.507 and F = 12.3.

Step 3: In both cases (t and F tests), the p-value is given as 0.0013 (or, in the case of F, to more decimal places as 0.001296).

Step 4: Because the p-value of 0.0013 < 0.05, reject the null hypothesis and conclude that there is a linear relationship.

Step 5: In context, we conclude that sugar content has a linear relationship with calories (although not a strong one) in cereals similar to the ones in the sample.

    > CerModel<-lm(Calories~Sugar)
    > summary(CerModel)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  87.4277     5.1627  16.935   <2e-16 ***
    Sugar         2.4808     0.7074   3.507   0.0013 **
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 19.27 on 34 degrees of freedom
    Multiple R-squared: 0.2656, Adjusted R-squared: 0.244
    F-statistic: 12.3 on 1 and 34 DF, p-value: 0.001296

b. Do Part (b) of Exercise 2.13. The R command confint will be useful.

2.13b. The R code and resulting confidence interval (highlighted) are shown below. Interpretation: We are 95% confident that for each additional gram of sugar in a serving of cereal, there are between 1.04 and 3.92 additional calories in the serving.

    > confint(CerModel)
                    2.5 %    97.5 %
    (Intercept) 76.935859 97.919521
    Sugar        1.043205  3.918421

c. Repeat the test you did in Part (a), except now use a one-sided alternative hypothesis. Use the direction that makes sense for this example, and explain why you chose that direction. Explain whether or not you can use the same two different test statistics you used in Part (a).
Step 1: The one-sided test has the same null hypothesis, but the alternative hypothesis is that if there is a relationship, calories go up as sugar goes up, so the slope is positive: $H_a: \beta_1 > 0$. (It would not make sense to hypothesize that calories go down as sugar goes up.)

Step 2: The t-test can be used but the F-test cannot, because the F-test does not give information on whether the slope was positive or negative. In this case, t = +3.507, so that is in the correct direction to support $H_a$.

Step 3: Because this is a one-sided test, we only need to include the tail area that supports the alternative hypothesis when computing the p-value. In this case that is the area above the test statistic, 3.507. We can find this area by dividing the original p-value in half, because the two-sided p-value included both the area above 3.507 and the area below $-3.507$. So the p-value is now 0.0013/2, or 0.00065.

Step 4: Reject the null hypothesis because 0.00065 < 0.05.

Step 5: Conclude that there is a positive linear relationship between sugar content and calories in cereals similar to the ones in this sample.

3. Do Problem 2.12 (page 82).

$r^2 = \frac{\text{SSModel}}{\text{SSTotal}} = \frac{110}{150} \approx 0.733$, or 73.3%. Interpretation: It’s a bit hard to interpret without a context, but the general interpretation is that 73.3% of the total variability in the y values is explained by knowing the x values and using the linear model to predict y using x.

4. Do Problem 2.34 (page 89), parts (a) and (c) only.

To do this problem you will need to read in the data and delete the first 4 observations, then run the linear model and get the value of $r^2$ from that. Here are the commands and the relevant result of the summary:

    > NewUSstamps <- USstamps[ -c(1,2, 3, 4), ]
    > StampModel<-lm(Price ~ Year, data = NewUSstamps)
    > summary(StampModel)
    Multiple R-squared: 0.9853, Adjusted R-squared: 0.9845

a. The percent of the variation in postal rates explained by Year is $r^2$ (expressed as a percent) = 98.53%.

c.
To do this part you need to use the anova command in R:

    > anova(StampModel)
    Analysis of Variance Table

    Response: Price
              Df Sum Sq Mean Sq F value    Pr(>F)
    Year       1 3841.2  3841.2  1273.1 < 2.2e-16 ***
    Residuals 19   57.3     3.0

From this output, SSModel = 3841.2, SSE = 57.3, and SSTotal (not given by R) is 3841.2 + 57.3 = 3898.5. For the interpretation, you could mention some of the following:

The total variation in Price is 3898.5, of which 3841.2 is explained by knowing the year and using the regression model and 57.3 is unexplained residual variation.

The anova table can be used to find $r^2$, which is SSModel/SSTotal = 3841.2/3898.5 = 0.9853, or 98.53%.

The anova table can be used to find the F test statistic, which is given in the table as 1273.1.

5. Suppose a test of $H_0: \beta_1 = 0$ versus $H_a: \beta_1 > 0$ resulted in a calculated t-statistic of 2.10 with df = 15. The p-value for that test is 0.0265. (I found this using the R command 1-pt(2.1,15).)

a. Draw a picture of a t-distribution showing how the p-value in this situation would be found.
See answer under part b.

b. For this same situation, if instead the alternative is two-sided, $H_a: \beta_1 \neq 0$, give the p-value and draw a picture illustrating it.
The p-value for a two-sided test is double what it is for the one-sided test, so in this case it’s 2(0.0265) = 0.0530. The pictures for both parts (a) and (b) are shown here.

[Figure: two t distributions with df = 15. For the one-sided test, an area of 0.0265 is shaded above 2.10. For the two-sided test, an area of 0.0265 is shaded in each tail, below -2.10 and above 2.10.]

c. Would the same conclusion be reached about the null and alternative hypotheses for the one-sided and two-sided test, using a significance level of 0.05?
No. The p-value for the one-sided test is < 0.05, but the p-value for the two-sided test is > 0.05. So the null hypothesis would be rejected for the one-sided test but not for the two-sided test.

d. For this same situation with the two-sided alternative hypothesis given in part (b), give the value of the F-statistic that would be used to test the hypotheses.
F-statistic = $(t\text{-statistic})^2 = (2.1)^2 = 4.41$.

e. For the F-statistic found in part (d), what is the corresponding p-value?
The same as for the two-sided t-test, so it’s 0.053.

f. For the F statistic you found in part (d), draw a picture of an F distribution and illustrate the area representing the p-value. (It doesn’t have to be exact, just draw it by hand.)
Here is what the picture looks like as an F distribution with df1 = 1, df2 = 15 drawn by computer, but as long as you show a picture that starts at 0 and is skewed to the right, and show the test statistic and an approximately correct shaded area, you should get credit.

[Figure: F distribution with df1 = 1, df2 = 15, with an area of 0.053 shaded to the right of F = 4.41.]

6. Suppose a linear model is fit to predict Y = weight using X = height. If instead a new linear model is fit using Y = height and X = weight (i.e., reversing X and Y), state whether each of the following would be the same for this new model as it was for the original model, or if it would be different.

a. The values of the sample slope and intercept.
Different. (You don’t need to explain, but here is the explanation. The least squares line minimizes the vertical distances between the points and the line. If the roles of X and Y were switched, it would be like minimizing the horizontal distances of the original graph, and that would result in a different line.)

b. The value of $r^2$.
The same. (The correlation between X and Y is the same as the correlation between Y and X.)

c. The residuals for each person.
Different. (The reasoning is the same as in part (a).)

d. The value of the F statistic for testing $H_0: \beta_1 = 0$.
The same. (This one is not so obvious, but remember that if you square the t-statistic for testing correlation = 0 you get the F-statistic for testing $H_0: \beta_1 = 0$. And since the correlation between X and Y is the same as between Y and X, the test for correlation gives the same value as well.)

7. Do Problem 2.10 (page 82).
Describe the effect (if any) on the width (difference between the upper and lower endpoints) of a prediction interval if all else remains the same except:

For all parts, you need to refer to the formula for the part that is added and subtracted to create a prediction interval. The formula is

$SE_{\hat{y}} = \hat{\sigma}_\epsilon \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$,

referred to as SE in what follows.

a. If the sample size is increased.
If the sample size is increased, the interval will be narrower, because the SE will be smaller. This occurs because both the term 1/n and the term that follows it get smaller as n gets larger.

b. If the variability in the values of the predictor variable increases.
This means that the value of $\sum (x_i - \bar{x})^2$ increases, because that represents how spread out the x values are. If that sum increases, then the SE decreases, so the width of the interval decreases.

c. If the variability in the values of the response variable increases.
The variability in the response variable (y) is estimated by $\hat{\sigma}_\epsilon$, so if that increases, the SE increases, and the width of the interval increases.

d. If the value of interest for the predictor variable moves further from the mean value of the predictor variable.
In this case, the value of $(x^* - \bar{x})^2$ increases, so SE increases as well, and the width of the interval increases.

8. The data set http://www.ics.uci.edu/~jutts/110/HtWt.txt (used in class) has heights and weights for 43 college males. Use linear regression with Y = weight and X = height to do the following.

The R commands and output for parts a and b are as follows:

    > HtWtModel<-lm(Weight ~ Height, data = HtWt)
    > predict( HtWtModel, list( Height = 73 ) , interval = "confidence")
           fit      lwr      upr
    1 192.7788 180.9709 204.5866
    > predict( HtWtModel, list( Height = 73 ) , interval = "prediction")
           fit      lwr      upr
    1 192.7788 142.8911 242.6664

a. Find and interpret a 95% confidence interval for the mean weight when height = 73 inches.
The interval is 180.97 to 204.59.
Interpretation: We are 95% confident that the mean weight for the population of all men 73 inches tall (and similar to those in the sample) is between 180.97 pounds and 204.59 pounds.

b. Find and interpret a 95% prediction interval for height = 73 inches.
The interval is 142.89 to 242.67. Interpretation: We predict that 95% of all men who are 73 inches tall weigh between 142.89 pounds and 242.67 pounds.
or
We are 95% sure that a randomly selected male who is 73 inches tall will weigh between 142.89 pounds and 242.67 pounds.

c. How do the midpoints of the two intervals compare? Explain why this makes sense.
The midpoint for both intervals is 192.78, the value of “fit,” which is the predicted value of y when x = 73. This makes sense because it is the best guess both for the mean of the y values when x = 73 and for predicting one individual’s weight when x = 73.

d. How do the widths of the two intervals compare? Explain why this makes sense.
The confidence interval for the mean is much narrower than the prediction interval for an individual. This makes sense because we need a much wider interval to capture individual weights than we do to estimate the mean weight.

e. What height would result in the narrowest possible prediction interval for weight? Explain.
The height equal to the mean of the x values, which in this case is 70.116 inches. (You learned how to find this in R in week 1.) That height gives the narrowest possible interval because the term $(x^* - \bar{x})^2 = 0$, so the standard error is as small as it can be.

f. Would it make sense to use this data set to report a 95% prediction interval for the weight of a 12-year-old boy who is 66 inches tall? If so, give the interval. If not, explain why it would not make sense.
No. The model was fit using adult college students. The relationship between height and weight might be different for 12-year-old boys.
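
The answers to parts (c) through (e) of Problem 8 can be checked numerically. What follows is a minimal sketch in R, not the class analysis: it uses simulated heights and weights, and the seed, the simulated coefficients, and the object names SimData, SimModel and NewHeights are all assumptions made for illustration. It compares the widths of the confidence and prediction intervals at the sample mean height and at heights 5 inches on either side of it.

```r
# Simulated stand-in for the HtWt data (assumed values, NOT the class data set).
set.seed(110)
n <- 43
Height <- rnorm(n, mean = 70, sd = 3)
Weight <- -200 + 5.5 * Height + rnorm(n, sd = 25)
SimData <- data.frame(Height, Weight)

SimModel <- lm(Weight ~ Height, data = SimData)

# Heights at which to compare intervals: the sample mean and 5 inches either side.
xbar <- mean(SimData$Height)
NewHeights <- data.frame(Height = c(xbar - 5, xbar, xbar + 5))

CI <- predict(SimModel, NewHeights, interval = "confidence")
PI <- predict(SimModel, NewHeights, interval = "prediction")

# Interval widths (upper endpoint minus lower endpoint).
CIwidth <- CI[, "upr"] - CI[, "lwr"]
PIwidth <- PI[, "upr"] - PI[, "lwr"]

print(cbind(Height = NewHeights$Height, fit = CI[, "fit"], CIwidth, PIwidth))
```

With any data set of this form, the two intervals share the same midpoint (the fit column), the prediction interval is wider than the confidence interval at every height, and both are narrowest at the sample mean height, where $(x^* - \bar{x})^2 = 0$.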