Chapter 11: Simple Linear Regression

McClave: Statistics, 11th ed.

Where We've Been
• Presented methods for estimating and testing population parameters for a single sample
• Extended those methods to allow for a comparison of population parameters for multiple samples

Where We're Going
• Introduce the straight-line linear regression model as a means of relating one quantitative variable to another quantitative variable
• Introduce the correlation coefficient as a means of relating one quantitative variable to another quantitative variable
• Assess how well the simple linear regression model fits the sample data
• Use the simple linear regression model to predict the value of one variable given the value of another variable

11.1: Probabilistic Models

There may be a deterministic reality connecting two variables, y and x, but we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown or unknowable influence is referred to as the random error. Our probabilistic models therefore combine a specific connection between the variables with influences we cannot specify exactly in each case:

y = f(x) + random error

The relationship between home runs and runs in baseball seems at first glance to be deterministic.

[Figure: Runs (0 to 6) plotted against Home Runs (0 to 6).]

But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.

[Figure: Runs (0 to 6) plotted against Home Runs (0 to 6).]
General Form of Probabilistic Models

y = Deterministic component + Random error

where y is the variable of interest. The mean value of the random error is assumed to be 0, so that E(y) = Deterministic component.

The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously.

[Figure: scatter plot of Attendance (0 to 300) against Week (0 to 160), with fitted line y = -0.192x + 150.3.]

A First-Order Probabilistic Model

y = β0 + β1x + ε

where
y = dependent variable
x = independent variable
β0 + β1x = E(y) = deterministic component
ε = random error component
β0 = y-intercept
β1 = slope of the line

β0, the y-intercept, and β1, the slope of the line, are population parameters and are invariably unknown. Regression analysis is designed to estimate these parameters.

11.2: Fitting the Model: The Least Squares Approach

Step 1: Hypothesize the deterministic component of the probabilistic model, E(y) = β0 + β1x.
Step 2: Use sample data to estimate the unknown parameters in the model.

[Figure: scatter plot of Total Offering (0 to 4000) against Average Offering (0 to 50), with fitted line.]

Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction.
The line's estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares.

Model: $y = \beta_0 + \beta_1 x + \varepsilon$
Estimates: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Deviation: $(y_i - \hat{y}_i) = [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]$
SSE: $\sum [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$

The least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is the line that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.

Formulas for the Least Squares Estimates

Slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$, where

$$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$

$$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$

y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Can home runs be used to predict errors? Is there a relationship between the number of home runs a team hits and the quality of its fielding?

Home Runs (x)   Errors (y)   xi²             xiyi
158             126          24964           19908
155             87           24025           13485
139             65           19321           9035
191             95           36481           18145
124             119          15376           14756
Σxi = 767       Σyi = 492    Σxi² = 120167   Σxiyi = 75329
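The shortcut formulas for the slope and intercept can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that applies them to the five-team sample in the table. The function name `least_squares` and the variable names are illustrative assumptions.

```python
# Five-team sample from the table: home runs (x) and errors (y).
home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]


def least_squares(x, y):
    """Return (intercept, slope) using the SS_xy / SS_xx shortcut formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # SS_xy = sum(x*y) - (sum x)(sum y)/n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    # SS_xx = sum(x^2) - (sum x)^2/n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = least_squares(home_runs, errors)

# First property of the least squares line: the errors sum to (numerically) zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(home_runs, errors)]

print(f"intercept = {b0:.2f}, slope = {b1:.4f}")
print(f"sum of errors = {sum(residuals):.6f}")
```

Because the sums are recomputed from the raw x and y columns, the sketch does not rely on the intermediate columns of the table.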
For the home runs and errors sample, the least squares estimates are:

Slope:

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{75{,}329 - \frac{(767)(492)}{5}}{120{,}167 - \frac{(767)^2}{5}} = \frac{-143.8}{2509.2} = -.0573$$

y-intercept:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 98.4 - (-.0573)(153.4) = 107.2$$

These results suggest that teams that hit more home runs are (slightly) better fielders (maybe not what we expected). There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.

11.3: Model Assumptions

Assumptions
1. The mean of the probability distribution of ε is 0.
2. The variance, σ², of the probability distribution of ε is constant.
3. The probability distribution of ε is normal.
4. The values of ε associated with any two values of y are independent.

The variance, σ², is used in every test statistic and confidence interval used to evaluate the model. Invariably, σ² is unknown and must be estimated.

Estimation of σ² for a (First-Order) Straight-Line Model

$$s^2 = \frac{SSE}{n-2} = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{SS_{yy} - \hat{\beta}_1 SS_{xy}}{n-2}$$

where $SS_{yy} = \sum (y_i - \bar{y})^2$ and $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$.

The estimated standard error of ε is the square root of the variance:

$$s = \sqrt{s^2} = \sqrt{\frac{SSE}{n-2}}$$
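The two expressions for SSE in the estimator of σ² can be compared directly. The sketch below is an illustration in Python rather than anything taken from the slides. It computes SSE from the residuals and from the shortcut SS_yy minus the slope times SS_xy for the home runs and errors sample, then takes s as the square root of SSE/(n - 2). All variable names are assumptions made for the example.

```python
import math

# Five-team sample used in this chapter: home runs (x) and errors (y).
home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(errors) / n

ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_yy = sum((y - y_bar) ** 2 for y in errors)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# SSE computed directly from the residuals ...
sse_direct = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(home_runs, errors))
# ... and from the shortcut SS_yy - slope * SS_xy.
sse_shortcut = ss_yy - slope * ss_xy

s_squared = sse_direct / (n - 2)  # estimate of sigma^2, with n - 2 degrees of freedom
s = math.sqrt(s_squared)          # estimate of sigma

print(f"SSE = {sse_direct:.1f} (shortcut: {sse_shortcut:.1f}), s = {s:.2f}")
```

The divisor is n - 2 rather than n because two parameters, the intercept and the slope, are estimated from the same data.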
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

Note: There may be many different patterns in the scatter plot when there is no linear relationship.

A critical step in the evaluation of the model is to test whether β1 = 0.

H0: β1 = 0
Ha: β1 ≠ 0

[Figure: three scatter plots of y against x. Positive relationship: β1 > 0. No relationship: β1 = 0. Negative relationship: β1 < 0.]

The four assumptions described above produce a normal sampling distribution for the slope estimate:

$$\hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}), \quad \text{where } \sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{xx}}}$$

The estimate of this standard deviation,

$$s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}},$$

is called the estimated standard error of the least squares slope estimate.

A Test of Model Utility: Simple Linear Regression

One-Tailed Test                          Two-Tailed Test
H0: β1 = 0                               H0: β1 = 0
Ha: β1 < 0 (or Ha: β1 > 0)               Ha: β1 ≠ 0
Rejection region: t < -tα (or t > tα)    Rejection region: |t| > tα/2

Test statistic:

$$t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s/\sqrt{SS_{xx}}}$$

Degrees of freedom = n - 2
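The test of model utility can be sketched in a few lines of Python. The function below is an illustrative helper and is not taken from the slides; `slope_t_test` and its parameter names are assumptions. It computes the slope, its estimated standard error, and the t statistic, then compares |t| with a tabled critical value supplied by the caller. The example call uses the chapter's home runs and errors sample with 3.182, the tabled value of t.025 for 3 degrees of freedom.

```python
import math


def slope_t_test(x, y, t_crit):
    """Two-tailed test of H0: beta1 = 0 for a straight-line model.

    t_crit is the tabled value t_{alpha/2} with n - 2 degrees of freedom.
    Returns (t statistic, estimated standard error of the slope, reject H0?).
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = ss_xy / ss_xx
    s = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))  # s = sqrt(SSE / (n - 2))
    se_slope = s / math.sqrt(ss_xx)                   # s / sqrt(SS_xx)
    t = slope / se_slope
    return t, se_slope, abs(t) > t_crit


t_stat, se_slope, reject = slope_t_test(
    [158, 155, 139, 191, 124],  # home runs (x)
    [126, 87, 65, 95, 119],     # errors (y)
    t_crit=3.182,               # t_.025 with 3 degrees of freedom
)
print(f"t = {t_stat:.3f}, standard error = {se_slope:.3f}, reject H0: {reject}")
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; a one-tailed version would compare t itself, not |t|, with tα.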
Home Runs (x)   Errors (y)   xi - x̄   (xi - x̄)²    ŷi = E(Errors|HRs)   yi - ŷi   (yi - ŷi)²
158             126          4.6       21.16         98.14                27.86     776.4
155             87           1.6       2.56          98.31                -11.31    127.9
139             65           -14.4     207.4         99.23                -34.23    1171
191             95           37.6      1414          96.25                -1.245    1.55
124             119          -29.4     864.4         100.1                18.92     357.8
Σxi = 767       Σyi = 492              SSxx = 2509                                  SSE = 2435

Here x̄ = 153.4, and

$$s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{2435}{3}} = 28.49$$

The test statistic for H0: β1 = 0 is

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{s/\sqrt{SS_{xx}}} = \frac{-.0573}{28.49/\sqrt{2509}} = \frac{-.0573}{.569} = -.101$$

Since the t-value does not lead to rejection of the null hypothesis, any of the following may be true:
• A different set of data may yield different results.
• There is a more complicated relationship.
• There is no relationship (non-rejection does not lead to this conclusion automatically).

Interpreting p-Values for β

Software packages report two-tailed p-values. To conduct one-tailed tests of hypotheses, the reported p-value, p, must be adjusted:

Upper-tailed test (Ha: β1 > 0):

$$p\text{-value} = \begin{cases} p/2 & \text{if } t > 0 \\ 1 - p/2 & \text{if } t < 0 \end{cases}$$

Lower-tailed test (Ha: β1 < 0):

$$p\text{-value} = \begin{cases} p/2 & \text{if } t < 0 \\ 1 - p/2 & \text{if } t > 0 \end{cases}$$

A Confidence Interval on β1

$$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}$$

where the estimated standard error is $s_{\hat{\beta}_1} = s/\sqrt{SS_{xx}}$ and $t_{\alpha/2}$ is based on (n - 2) degrees of freedom.

In the home runs and errors example, the estimated β1 was -.0573, and the estimated standard error was .569. With 3 degrees of freedom, t.025 = 3.182. The confidence interval is, therefore,

$$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1} = -.0573 \pm 3.182(.569) = -.0573 \pm 1.81$$

which includes 0, so there may be no relationship between the two variables.

11.5: The Coefficients of Correlation and Determination

The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$

[Figure: three scatter plots of y against x. Positive linear relationship: r → +1. No linear relationship: r ≈ 0. Negative linear relationship: r → -1.]

Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.

In the example about home runs and errors, SSxy = -143.8 and SSxx = 2509. SSyy is computed as

$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 50{,}856 - \frac{492^2}{5} = 2443.2$$

so

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{-143.8}{\sqrt{(2509)(2443.2)}} = -.058$$

An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on β1.

The coefficient of determination, r², represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y.

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}, \qquad 0 \le r^2 \le 1$$
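The two coefficients can be computed side by side to confirm that the coefficient of determination is the square of the coefficient of correlation. The following Python sketch is illustrative only, with variable names chosen for the example, and uses the home runs and errors sample from this chapter.

```python
import math

# Five-team sample used in this chapter: home runs (x) and errors (y).
home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]

n = len(home_runs)
x_bar = sum(home_runs) / n
y_bar = sum(errors) / n

ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_yy = sum((y - y_bar) ** 2 for y in errors)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

# Coefficient of correlation: r = SS_xy / sqrt(SS_xx * SS_yy)
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Coefficient of determination: r^2 = 1 - SSE / SS_yy,
# with SSE obtained from the shortcut SS_yy - slope * SS_xy.
slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy
r_squared = 1 - sse / ss_yy

print(f"r = {r:.3f}, r^2 = {r_squared:.4f}")
```

The sign of r always matches the sign of the estimated slope, since both share the numerator SS_xy.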
Using the coefficient of determination:
1. Predict values of y with the mean of y if no other information is available.
2. Predict values of y|x based on a hypothesized linear relationship.
3. Evaluate the power of x to predict values of y with the coefficient of determination.

High r²
• x provides important information about y
• Predictions are more accurate based on the model

Low r²
• Knowing values of x does not substantially improve predictions on y
• There may be no relationship between x and y, or it may be more subtle than a linear relationship

11.6: Using the Model for Estimation and Prediction

Statistical inference based on the linear regression model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$:
• Estimate the mean of y for a specific value of x: E(y)|x (over many experiments with this x-value)
• Estimate an individual value of y for a given x value (for a single experiment with this value of x)

Sampling Error for the Estimator of the Mean of y | xp

$$\sigma_{\hat{y}} = \sigma\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \quad \text{(the standard error of } \hat{y}\text{)}$$

where σ is the standard deviation of the error term ε. Using s to estimate σ, a 100(1 - α)% confidence interval on the mean of y is

$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

with n - 2 degrees of freedom.

Based on our model results, a team that hits 140 home runs is expected to make 99.2 errors:

$$\hat{y} = 107.2 - .0573x = 107.2 - .0573(140) = 99.2$$

In our example, s = 28.49, n = 5, x̄ = 153.4 and SSxx = 2509, so

$$s_{\hat{y}} = s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 28.49\sqrt{\frac{1}{5} + \frac{(140 - 153.4)^2}{2509}} = 28.49(.5211) = 14.846$$

A 95% confidence interval for the mean of y at xp = 140 is

$$\hat{y} \pm t_{.025,\,df=3}\, s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 99.2 \pm 3.182(14.846) = 99.2 \pm 47.2$$
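The confidence interval on the mean of y at a chosen x-value can be sketched as a small Python function. This is an illustration under stated assumptions, not the slides' own code. The name `mean_response_interval` is hypothetical, and the critical value is passed in by the caller (3.182 is the tabled t.025 for 3 degrees of freedom).

```python
import math


def mean_response_interval(x, y, x_p, t_crit):
    """Return (y_hat, half_width) of the confidence interval for E(y) at x_p.

    t_crit is the tabled value t_{alpha/2} with n - 2 degrees of freedom.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    s = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))  # estimate of sigma

    y_hat = intercept + slope * x_p
    # Standard error of y_hat: s * sqrt(1/n + (x_p - x_bar)^2 / SS_xx)
    half_width = t_crit * s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat, half_width


y_hat, half_width = mean_response_interval(
    [158, 155, 139, 191, 124],  # home runs (x)
    [126, 87, 65, 95, 119],     # errors (y)
    x_p=140,
    t_crit=3.182,
)
print(f"{y_hat:.1f} +/- {half_width:.1f}")
```

The half-width grows as x_p moves away from the sample mean of x, because the term (x_p - x̄)² appears under the square root.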
Sampling Error for the Predictor of an Individual Value of y | xp

$$\sigma_{(y - \hat{y})} = \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \quad \text{(the standard error of prediction)}$$

where σ is the standard deviation of the error term ε. Using s to estimate σ, a 100(1 - α)% prediction interval for y|xp is

$$\hat{y} \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

with n - 2 degrees of freedom.

A 95% Prediction Interval for an Individual Team's Errors

$$\hat{y} = 107.2 - .0573x = 107.2 - .0573(140) = 99.2$$

In our example, s = 28.49, n = 5, x̄ = 153.4 and SSxx = 2509, so

$$s_{(y - \hat{y})} = s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 28.49\sqrt{1 + \frac{1}{5} + \frac{(140 - 153.4)^2}{2509}} = 28.49(1.1276) = 32.13$$

A 95% prediction interval for y at xp = 140 is

$$\hat{y} \pm t_{.025,\,df=3}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 99.2 \pm 3.182(32.13) = 99.2 \pm 102.2$$

Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error.

[Figure: the error in predicting a mean value of y|xp is the error in estimating E(y|xp); the error in predicting a specific value of y|xp is the error in estimating E(y|xp) plus the sampling error from the y population.]

Estimating y beyond the range of values associated with the observed values of x can lead to large prediction errors. Beyond the range of observed x values, the relationship may look very different.

[Figure: estimated relationship and true relationship, shown over and beyond the range of observed values of x (Xi to Xj).]

11.7: A Complete Example

Step 1: How does the proximity of a fire house (x) affect the damages (y) from a fire?
y = f(x)
y = β0 + β1x + ε

Step 2: The data (found in Table 11.7) produce the following estimates (in thousands of dollars):

$$\hat{\beta}_0 = 10.28 \qquad \hat{\beta}_1 = 4.92$$

The estimated damages equal $10,280 + $4,920 for each mile from the fire station, or

$$\hat{y} = 10.28 + 4.92x$$

Step 3: The estimate of the standard deviation, σ, of ε is s = 2.31635. Most of the observed fire damages will be within 2s ≈ 4.63 thousand dollars of the predicted value.

Step 4: Test that the true slope is 0.

H0: β1 = 0
Ha: β1 > 0

SAS automatically performs a two-tailed test, with a reported p-value < .0001. The one-tailed p-value is < .00005, which provides strong evidence to reject the null hypothesis.

A 95% confidence interval on β1 from the SAS output is 4.071 ≤ β1 ≤ 5.768.

The coefficient of determination, r², is .9235.

The coefficient of correlation, r, is

$$r = +\sqrt{r^2} = \sqrt{.9235} = .96$$

Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates.

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 10.28 + 4.92(3.5) = 27.5$$

The 95% prediction interval is (22.324, 32.667). We're 95% confident that the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667.
Since the x-values in our sample range from .7 to 6.1, predictions about y for x-values beyond this range will be unreliable.
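One way to make the warning about extrapolation concrete is to have the prediction code refuse x-values outside the sampled range. The Python sketch below is a hypothetical helper, not something from the slides. It uses the fitted line quoted in this example (10.28 + 4.92x, in thousands of dollars) and the observed range of .7 to 6.1 miles; the function name `predict_damage` and the choice to raise an error instead of issuing a warning are assumptions.

```python
# Estimates quoted in the complete example (damage in thousands of dollars).
INTERCEPT = 10.28
SLOPE = 4.92

# Range of distances (miles) observed in the sample.
X_MIN, X_MAX = 0.7, 6.1


def predict_damage(distance_miles):
    """Predict fire damage (thousands of dollars) from distance to the station.

    Raises ValueError for distances outside the sampled range, where the
    fitted line has not been checked against any data.
    """
    if not X_MIN <= distance_miles <= X_MAX:
        raise ValueError(
            f"{distance_miles} miles is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the prediction would be an extrapolation"
        )
    return INTERCEPT + SLOPE * distance_miles


print(f"{predict_damage(3.5):.1f}")  # distance used in the example above
```

Raising an error is the stricter of two reasonable designs; a version that returns the value together with a warning would suit exploratory work better.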