Chapter 14: Inference for Regression

Lesson 14-1, Part 1: Inference for Regression

Review: Least-Squares Regression
A family doctor is interested in examining the relationship between patients' age and total cholesterol. He randomly selects 14 of his female patients and obtains the data presented in Table 1. The data are based upon results obtained from the National Center for Health Statistics.

Table 1
Age   Total Cholesterol     Age   Total Cholesterol
25    180                   42    183
25    195                   48    204
28    186                   51    221
32    180                   51    243
32    210                   58    208
32    197                   62    228
38    239                   65    269

Review: Least-Squares Regression
1. What is the least-squares regression line for predicting total cholesterol from age for women?
The least-squares regression equation is ŷ = 151.3537 + 1.3991x, where ŷ represents the predicted total cholesterol for a female whose age is x.

Review: Least-Squares Regression
2. What is the correlation coefficient between age and cholesterol? Interpret the correlation coefficient in the context of the problem.
The linear correlation coefficient is r = 0.718. There is a moderate, positive linear relationship between female age and total cholesterol.

Review: Least-Squares Regression
3. What is the predicted cholesterol level of a 67-year-old female?
ŷ = 151.3537 + 1.3991x
predicted cholesterol = 151.3537 + 1.3991(age) = 151.3537 + 1.3991(67) ≈ 245

Review: Least-Squares Regression
4. Interpret the slope of the regression line in the context of the problem.
For each increase in age of one year, total cholesterol is predicted to increase by 1.3991.

Statistics and Parameters
• When doing inference for regression, we use ŷ = a + bx to estimate the population regression line.
▫ a and b are estimators of the population parameters α and β, the intercept and slope of the population regression line.

Conditions
• The conditions necessary for doing inference for regression are:
▫ For each given value of x, the y-values of the response variable are independent and normally distributed.
▫ For each value of x, the standard deviation, σ, of the y-values is the same.
▫ The mean responses of the y-values for the fixed values of x are linearly related by the equation μy = α + βx.

Standard Error of the Regression Line
• Gives the variability of the vertical distances of the y-values from the regression line.
• Remember that a residual is the error involved when making a prediction from the regression equation.
• The spread around the line is measured with the standard deviation of the residuals, s:

  s = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ] = √[ Σ residuals² / (n − 2) ]

Standard Error of the Slope of the Regression Line
• Gives the variability of the estimates of the slope of the regression line:

  SE_b = s / √[ Σ(xᵢ − x̄)² ],  where  s = √[ Σ(yᵢ − ŷᵢ)² / (n − 2) ]

Summary
• Inference for regression depends upon estimating μy = α + βx with ŷ = a + bx.
• For each x, the response values of y are independent and follow a normal distribution, each distribution having the same standard deviation.
• Inference for regression depends on the following statistics:
▫ a, the estimate of the y-intercept, α, of μy
▫ b, the estimate of the slope, β, of μy
▫ s, the standard error of the residuals
▫ SE_b, the standard error of the slope of the regression line
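Here is a short Python sketch, not part of the original slides, that reproduces these review computations with numpy/scipy (both assumed to be available): the least-squares line, r, the standard deviation of the residuals s, and the standard error of the slope SE_b used in the following slides.

# A minimal sketch (not from the text): least-squares line, r, s, and SE_b
# for the age/cholesterol data in Table 1.
import numpy as np
from scipy import stats

age  = np.array([25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65])
chol = np.array([180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269])

fit = stats.linregress(age, chol)           # slope, intercept, r, p-value, SE of slope
y_hat = fit.intercept + fit.slope * age     # predicted cholesterol for each age
resid = chol - y_hat                        # residuals y - y_hat
n = len(age)

s = np.sqrt(np.sum(resid**2) / (n - 2))             # std. deviation of the residuals
se_b = s / np.sqrt(np.sum((age - age.mean())**2))   # standard error of the slope

print(f"y-hat = {fit.intercept:.4f} + {fit.slope:.4f}x")   # ~151.3537 + 1.3991x
print(f"r = {fit.rvalue:.3f}")                             # ~0.718
print(f"s = {s:.2f}, SE_b = {se_b:.4f}")                   # s should be ~19.48 (next slide)
print(f"predicted cholesterol at age 67: {fit.intercept + fit.slope*67:.0f}")  # ~245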
Computing the Standard Error of the Residuals

Age, x   Total Cholesterol, y   ŷ = 151.3537 + 1.3991x   Residual (y − ŷ)   Residual² (y − ŷ)²
25       180                    186.33                    −6.33               40.0689
25       195                    186.33                     8.67               75.1689
28       186                    190.53                    −4.53               20.5209
32       180                    196.12                   −16.12              259.8544
32       210                    196.12                    13.88              192.6544
32       197                    196.12                     0.88                0.7744
38       239                    204.52                    34.48             1188.8704
⋮        ⋮                      ⋮                         ⋮                   ⋮
62       228                    238.10                   −10.10              102.01
65       269                    242.30                    26.70              712.89
                                                                Σ residuals² = 4553.708

Computing the Standard Error
s = √[ Σ residuals² / (n − 2) ] = √[ 4553.708 / (14 − 2) ] ≈ 19.48

Example – Page 787, #14.2
Body weights and backpack weights were collected for eight students.

Body weight (lbs)       120  187  109  103  131  165  158  116
Backpack weight (lbs)    26   30   26   24   29   35   31   28

These data were entered into a statistical package and a least-squares regression of backpack weight on body weight was requested. Here are the results.

Predictor   Coef      Stdev     t-ratio   p
Constant    16.265    3.937     4.13      0.006
BodyWT      0.09080   0.02831   3.21      0.018
S = 2.270   R-sq = 63.2%   R-sq(adj) = 57.0%

A) What is the equation of the least-squares line?
Backpack weight = 16.265 + 0.09080(body weight)

B) The model for regression inference has three parameters, which we call α, β and σ. Can you determine the estimates for α and β from the computer printout?
a = 16.265 estimates the true intercept α, and b = 0.09080 estimates the true slope β.

C) The computer output reports that s = 2.270. This is an estimate of the parameter σ. Use the formula for s to verify the computer's value of s. Use your TI to verify this.

Example – Page 788, #14.4
Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again.

Speed (ft/s)        15.86  16.88  17.50  18.62  19.97  21.06  22.11
Steps per second     3.05   3.12   3.17   3.25   3.36   3.46   3.55

A) Enter the data into your calculator, perform least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?

Steps = 1.77 + 0.0803(speed). There is a very strong, positive linear relationship between speed and steps per second; r = 0.999. Nearly all of the variation in steps per second (r² = 0.998, or 99.8% of it) is explained by the linear relationship.

[Scatterplot: steps per second versus speed (feet per second), with the least-squares line.]

C) The model for regression inference has three parameters, α, β and σ. Estimate these parameters from the data.
a = 1.766 is the estimate of α
b = 0.0803 is the estimate of β
s = 0.0091 is the estimate of σ
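As a software alternative to the TI verification asked for in #14.2 part C, here is a short Python sketch (not part of the original slides; numpy and scipy assumed available) that checks s for the backpack data and for the runner data above against the printed values.

# A minimal sketch (not from the text): verify s = sqrt(sum(residuals^2)/(n-2))
# against the printout values for #14.2 (s = 2.270) and #14.4 (s = 0.0091).
import numpy as np
from scipy import stats

body = np.array([120, 187, 109, 103, 131, 165, 158, 116])   # body weight (lbs)
pack = np.array([26, 30, 26, 24, 29, 35, 31, 28])            # backpack weight (lbs)

fit = stats.linregress(body, pack)
resid = pack - (fit.intercept + fit.slope * body)
s = np.sqrt(np.sum(resid**2) / (len(body) - 2))
print(f"a = {fit.intercept:.3f}, b = {fit.slope:.5f}, s = {s:.3f}")  # ~16.265, 0.09080, 2.270

# The same check for the #14.4 runner data should give s ~ 0.0091.
speed = np.array([15.86, 16.88, 17.50, 18.62, 19.97, 21.06, 22.11])
steps = np.array([3.05, 3.12, 3.17, 3.25, 3.36, 3.46, 3.55])
fit2 = stats.linregress(speed, steps)
resid2 = steps - (fit2.intercept + fit2.slope * speed)
print(f"s = {np.sqrt(np.sum(resid2**2) / (len(speed) - 2)):.4f}")    # ~0.0091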
Lesson 14-1, Part 2: Inference for Regression

Significance Test for the Slope of a Regression Line
• We want to test whether the slope of the regression line is zero or not.
▫ If the slope of the line is zero, then there is no linear relationship between the x and y variables.
▫ Remember (from the formula for b) that if r = 0, then b = 0.
• Hypotheses
▫ Two-tailed: H₀: β = 0 and Hₐ: β ≠ 0
▫ Left-tailed: H₀: β = 0 and Hₐ: β < 0
▫ Right-tailed: H₀: β = 0 and Hₐ: β > 0

Test Statistic and Confidence Interval
• Test statistic:  t = (b − β₀) / SE_b = b / SE_b  (since H₀: β = 0)
• Confidence interval:  b ± t*·SE_b
• Use the t distribution with n − 2 degrees of freedom.
• SE_b, the standard error of the slope:  SE_b = s / √[ Σ(xᵢ − x̄)² ]

Reading Computer Printouts

Example – Page 794, #14.6
Exercise 14.1 (page 786) presents data on the lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx. Here is part of the output from the S-PLUS statistical software when we regress the length y of the humerus on the length x of the femur.

Coefficients   Value     Std. Error   t value    Pr(>|t|)
(Intercept)    −3.6596   4.4590       −0.8207    0.4719
Femur           1.1969   0.0751

A) What is the equation of the least-squares regression line?
humerus = −3.6596 + 1.1969(femur)

B) We left out the t statistic for testing H₀: β = 0 and its P-value. Use the output to find t.
t = b / SE_b = 1.1969 / 0.0751 ≈ 15.94

C) How many degrees of freedom does t have? Use Table C to approximate the P-value of t against the one-sided alternative Hₐ: β > 0.
df = n − 2 = 3; since t > 12.92, we know the P-value < 0.0005.
tcdf(15.9374, 1E99, 3) ≈ 2.685 × 10⁻⁴

D) Write a sentence to describe your conclusion about the slope of the true regression line.
There is very strong evidence that β > 0, that is, that the line is useful for predicting the length of the humerus given the length of the femur.

E) Determine a 99% confidence interval for the true slope of the regression line.
b ± t*·SE_b = 1.1969 ± 5.841(0.0751) = (0.758, 1.636)

Example – Page 794, #14.8
There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. Exercise 3.63 (page 183) gives data on yearly wine consumption (liters of alcohol from drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations.

A) Is there statistically significant evidence of a negative association between wine consumption and heart disease deaths? Carry out the appropriate test of significance and write a summary statement about your conclusions.

Let β be the slope of the true regression line of heart disease deaths on wine consumption; a negative β means a negative association between wine consumption and heart disease deaths.
H₀: β = 0
Hₐ: β < 0

Linear regression t-test conditions:
1. The observations are independent.
2. The true relationship is linear (check the scatterplot to see that the overall pattern is linear, or plot the residuals against the predicted values).
3. The standard deviation of the response about the true line is the same everywhere (make sure the spread around the line is nearly constant).
4. The response varies normally about the true regression line (a normal probability plot of the residuals is quite straight).

With b = −22.97 and SE_b = s / √[ Σ(x − x̄)² ] = 3.557, the test statistic is t = b / SE_b ≈ −6.46.
P-value ≈ 2.96 × 10⁻⁶ from technology; from Table C, P-value < 0.0005.
Reject H₀, since the P-value < 0.0005 < α = 0.05, and conclude that there is a significant negative linear relationship between wine consumption and heart disease deaths.
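The hand calculations in #14.6 can be reproduced directly from the printout values. Here is a short Python sketch (not part of the original slides; scipy assumed available) that computes t = b/SE_b, a one-sided P-value from the t(n − 2) distribution, and the 99% confidence interval b ± t*·SE_b.

# A minimal sketch (not from the text): slope t-test and CI from printout values (#14.6).
from scipy import stats

b, se_b = 1.1969, 0.0751      # slope and its standard error from the S-PLUS output
n = 5                         # five fossil specimens, so df = n - 2 = 3
df = n - 2

t = b / se_b                          # ~15.94
p_one_sided = stats.t.sf(t, df)       # P(T > t) for Ha: beta > 0, ~2.7e-4

t_star = stats.t.ppf(0.995, df)       # critical value for a 99% CI, ~5.841
ci = (b - t_star * se_b, b + t_star * se_b)   # ~(0.758, 1.636)

print(f"t = {t:.2f}, one-sided P = {p_one_sided:.2e}, "
      f"99% CI = ({ci[0]:.3f}, {ci[1]:.3f})")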
Example – Page 795, #14.10
Exercise 14.4 (page 788) presents data on the relationship between the speed of runners (x, in feet per second) and the number of steps y that they take in a second. Here is part of the Data Desk regression output for these data:

R squared = 99.8%
s = 0.0091 with 7 − 2 = 5 degrees of freedom

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   1.76608       0.0307          57.6      <0.0001
speed      0.080284      0.0016          49.7      <0.0001

A) How can you tell from this output, even without the scatterplot, that there is a very strong straight-line relationship between running speed and steps per second?
r² is very close to 1, which means that nearly all the variation in steps per second is accounted for by foot speed. Also, the P-value for β is small.

B) What parameter in the regression model gives the rate at which steps per second increase as running speed increases? Give a 99% confidence interval for this rate.
β (the slope) is this rate; the estimate is listed as the coefficient of "speed," 0.080284.
b ± t*·SE_b = 0.080284 ± 4.032(0.0016) = (0.074, 0.087)

Lesson 14-2, Part 1: Predictions and Conditions

Confidence Intervals
• Write the given value of the explanatory variable x as x*.
▫ The distinction between predicting a single outcome and predicting the mean of all outcomes when x = x* determines which margin of error is correct.
• To estimate the mean response, µy = α + βx*, we use a confidence interval.
• To estimate an individual response y, we use a prediction interval.

Confidence Interval for a Mean Regression Response
A level C confidence interval for the mean response µy when x takes the value x* is

  ŷ ± t*·SE_μ̂,  where the standard error is  SE_μ̂ = s·√[ 1/n + (x* − x̄)² / Σ(x − x̄)² ]

Prediction Interval for a Single Regression Response
A level C prediction interval for a single observation on y when x takes the value x* is

  ŷ ± t*·SE_ŷ,  where the standard error is  SE_ŷ = s·√[ 1 + 1/n + (x* − x̄)² / Σ(x − x̄)² ]

A code sketch of both interval formulas appears after the conditions below.

Conditions for Regression Inference
• The observations are independent.
• The true relationship is linear.
• The standard deviation of the response about the true line is the same everywhere.
• The response varies normally about the true regression line.
• Check conditions using the residuals. Examine the residual plot to check that the relationship is roughly linear and that the scatter about the line is the same from end to end.

[Figure: violation of the regression conditions in which the variation of the residuals is not constant.]
[Figure: violation of the regression conditions in which there is a curved relationship between the response variable and the explanatory variable.]
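Here is a short Python sketch (not part of the original slides; numpy and scipy assumed available) of the two standard-error formulas above, applied to the #14.4 runner data. The value x* = 20 ft/s is an arbitrary illustrative choice, not one from the book.

# A minimal sketch (not from the text): CI for the mean response and
# prediction interval for a single response at x = x_star.
import numpy as np
from scipy import stats

def regression_intervals(x, y, x_star, conf=0.95):
    """Return (CI for mean response, prediction interval) at x = x_star."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    fit = stats.linregress(x, y)
    y_hat = fit.intercept + fit.slope * x_star

    resid = y - (fit.intercept + fit.slope * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))            # std. deviation of residuals
    sxx = np.sum((x - x.mean())**2)

    se_mu = s * np.sqrt(1/n + (x_star - x.mean())**2 / sxx)      # mean response
    se_y  = s * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)  # single observation

    t_star = stats.t.ppf((1 + conf) / 2, n - 2)
    ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
    pi = (y_hat - t_star * se_y,  y_hat + t_star * se_y)
    return ci, pi

speed = [15.86, 16.88, 17.50, 18.62, 19.97, 21.06, 22.11]
steps = [3.05, 3.12, 3.17, 3.25, 3.36, 3.46, 3.55]
ci, pi = regression_intervals(speed, steps, x_star=20.0)
print("95% CI for mean steps at 20 ft/s:", ci)   # narrower
print("95% PI for one runner at 20 ft/s:", pi)   # wider: note the extra 1 under the root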
Example – Page 802, #14.12
A) The residuals for the crying and IQ data appear in Example 14.3 (page 785). Make a stemplot to display the distribution of the residuals. Are there outliers or signs of strong departures from normality?

Residuals:
19.20  31.13  22.65  15.18  12.18  15.15  16.63   6.18   1.70   9.14
 9.82  20.85  22.60   1.66  10.82  24.35   6.68   6.14   0.37  18.94
 6.17  12.60   8.85  32.89   9.15   0.34  10.87  18.47  23.58   9.14
 8.62   2.85  19.34  10.89  51.32   2.80  14.30   2.55

[Stemplot of the residuals not reproduced here.]
One residual (51.32) may be a high outlier, but the stemplot does not show any other deviations from normality.

B) What other assumptions or conditions are required for using inference for regression on these data? Check that those conditions are satisfied and then describe your findings.
The scatter of the data points about the regression line varies to some extent as we move along the line, but the variation is not serious, as a residual plot shows. The other conditions can be assumed to be satisfied.

C) Would a 95% prediction interval for x = 25 be narrower, the same size, or wider than a 95% confidence interval? Explain your reasoning.
A prediction interval would be wider. For a fixed confidence level, the margin of error is always larger when we are predicting a single observation than when we are estimating the mean response.

D) A computer package reports that the 95% prediction interval for x = 25 is (91.85, 165.33). Explain what this interval means in simple language.
We are 95% confident that when x (crying intensity) = 25, the corresponding value of y (IQ) will be between 91.85 and 165.33.

Example – Page 802, #14.14
In Exercise 14.11 (page 795) we regressed the lean of the Leaning Tower of Pisa on year to estimate the rate at which the tower is tilting. Here are the residuals from that regression, in order by year across the rows:

4.220  3.099  0.418  1.264  5.011  0.670  2.055
3.626  2.308  4.648  5.967  1.714  7.396

Use the residuals to check the regression conditions, and describe your findings. Is the regression in Exercise 14.11 trustworthy?

[Residual plot versus year and normal probability plot of the residuals not shown.]
The scatterplot of the residuals versus year does not suggest any problems. The regression in Exercise 14.11 should be fairly reliable.
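Condition checks like those in #14.12 and #14.14 are easy to automate. Here is a short Python sketch (not part of the original slides; numpy, scipy, and matplotlib assumed available) that draws a residual-versus-fitted plot and a normal probability plot for the #14.2 backpack data, since the raw crying/IQ and Pisa data are not listed in these notes.

# A minimal sketch (not from the text): graphical checks of the regression
# conditions -- linearity/constant spread and normality of the residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

body = np.array([120, 187, 109, 103, 131, 165, 158, 116])
pack = np.array([26, 30, 26, 24, 29, 35, 31, 28])

fit = stats.linregress(body, pack)
fitted = fit.intercept + fit.slope * body
resid = pack - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.scatter(fitted, resid)                      # look for a patternless band around 0
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("fitted backpack weight")
ax1.set_ylabel("residual")
ax1.set_title("Check: linear, constant spread")

stats.probplot(resid, dist="norm", plot=ax2)    # roughly straight => normality plausible
ax2.set_title("Check: normal residuals")

plt.tight_layout()
plt.show()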
Example – Page 809, #14.24
Here are data on the time (in minutes) Professor Moore takes to swim 2000 yards and his pulse rate (beats per minute) after swimming:

Time:  34.12 35.72 34.72 34.05 34.13 35.72 36.17 35.57
Pulse: 152 124 152 146 128 136 144
Time:  35.57 35.43 36.05 34.85 34.70 34.75 33.93
Pulse: 148 144 124 148 144 140 156
Time:  34.00 34.35 35.62 35.68 35.28 35.97 35.37 34.60
Pulse: 148 132 124 132 139 136 140 136 148

A scatterplot shows a negative linear relationship: a faster time (fewer minutes) is associated with a higher heart rate. Here is part of the output from the regression function in an Excel spreadsheet.

             Coefficients    Standard Error   t Stat        P-value
Intercept    479.9341457     66.22779275      7.246718119   3.87075E−07
X Variable   −9.694903394    1.888664503      −5.1332057    4.37908E−05

Give a 90% confidence interval for the slope of the true regression line. Explain what your result tells us about the relationship between the professor's swimming time and heart rate.

b ± t*·SE_b with df = 21:  −9.6949 ± 1.721(1.8887) = −12.9454 to −6.4444 bpm per minute of swimming time.
With 90% confidence, we can say that for each 1-minute increase in swimming time, pulse rate drops by about 6 to 13 bpm.

Using the TI: [calculator output not shown.]

Example – Page 809, #14.25
Exercise 14.24 gives data on a swimmer's time and heart rate. One day the swimmer completes his laps in 34.3 minutes but forgets to take his pulse. Minitab gives this prediction for heart rate when x* = 34.3:

Fit      StDev Fit   90.0% CI            90.0% PI
147.40   1.97        (144.02, 150.78)    (135.79, 159.01)

A) Verify that "Fit" is the predicted heart rate from the least-squares line found in Exercise 14.24. Then choose one of the intervals from the output to estimate the swimmer's heart rate that day and explain why you chose this interval.

ŷ (pulse) = 479.9 − 9.6949x (time); when x = 34.3 minutes,
ŷ (pulse) = 479.9 − 9.6949(34.3) ≈ 147.37, which agrees with the output.
The prediction interval is appropriate for estimating one value (as opposed to the mean of many values): 135.79 to 159.01 bpm.

B) Minitab gives only one of the two standard errors used in prediction. It is SE_μ̂, the standard error for estimating the mean response. Use this fact and a critical value from Table C to verify Minitab's 90% confidence interval for the mean heart rate on days when the swimming time is 34.3 minutes.

ŷ ± t*·SE_μ̂ with df = 21:  147.40 ± 1.721(1.97) = 144.01 to 150.79, which agrees with the computer output.
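These #14.25 checks can also be done with software instead of Table C. Here is a short Python sketch (not part of the original slides; scipy assumed available) that recomputes the fitted pulse at x* = 34.3 from the Excel coefficients and rebuilds the 90% confidence interval for the mean response from Minitab's reported standard error of 1.97 with df = 21.

# A minimal sketch (not from the text): verify Minitab's "Fit" and 90% CI.
from scipy import stats

a, b = 479.9341457, -9.694903394    # intercept and slope from the Excel output
x_star = 34.3
fit = a + b * x_star                # ~147.4, matching Minitab's "Fit"

se_mu, df = 1.97, 21                # StDev Fit and degrees of freedom from the output
t_star = stats.t.ppf(0.95, df)      # ~1.721 for a 90% interval
ci = (fit - t_star * se_mu, fit + t_star * se_mu)   # ~(144.0, 150.8)

print(f"Fit = {fit:.2f}, 90% CI for mean pulse = ({ci[0]:.2f}, {ci[1]:.2f})")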