STP231 Brief Class Notes                                Instructor: Ela Jackiewicz

Chapter 12  Regression and Correlation

In this chapter we consider bivariate data in the form of ordered pairs (x, y). Both x and y can be observed, or y can be observed for specific values of x selected by the researcher. The data must display a linear trend on the scatterplot before we consider fitting a linear equation expressing y in terms of x.

We will use the following vocabulary for x and y:
y – response variable or dependent variable
x – predictor variable/explanatory variable or independent variable

Our line is called the Least Squares Regression Line and has the form

    ŷ = b₀ + b₁x,

where ŷ is the predicted value of y for a given x. The name of the line reflects the fact that it has the smallest possible sum of squared errors of all lines that could possibly be fitted to the given data. Errors, or residuals, are e = y − ŷ.

Formulas for the slope and y-intercept of the least-squares regression equation ŷ = b₀ + b₁x:

    b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²  =  r·s_y / s_x,        b₀ = ȳ − b₁x̄

where

    s_x = √( Σ(xᵢ − x̄)² / (n − 1) ),    s_y = √( Σ(yᵢ − ȳ)² / (n − 1) )

(the standard deviations of the x-s and y-s respectively).

The following example illustrates the computations. We consider people on a diet:
X = # of times per week a person eats out, Y = # of pounds lost dieting during a one-month period.

    xᵢ    yᵢ    xᵢ−x̄   yᵢ−ȳ   (xᵢ−x̄)²   (yᵢ−ȳ)²   (xᵢ−x̄)(yᵢ−ȳ)
     3    10     -2      2       4          4           -4
     7     2      2     -6       4         36          -12
     6     9      1      1       1          1            1
     7     4      2     -4       4         16           -8
     2    15     -3      7       9         49          -21
    25    40                    22        106          -44
                                       SS(total)

x̄ = 25/5 = 5,  ȳ = 40/5 = 8
b₁ = −44/22 = −2,  b₀ = 8 − (−2)(5) = 18, so our equation is: ŷ = 18 − 2x

The equation tells us that for every 1-unit increase in x (times eating out per week), y (weight loss) decreases by 2 pounds.
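As a check on the arithmetic, the slope and intercept formulas can be computed directly. The following Python sketch (ours, not part of the original notes) reproduces b₁ = −2 and b₀ = 18 from the diet data:

```python
# Diet example from the notes: x = times eating out per week,
# y = pounds lost in one month.
x = [3, 7, 6, 7, 2]
y = [10, 2, 9, 4, 15]
n = len(x)
xbar = sum(x) / n            # 5.0
ybar = sum(y) / n            # 8.0

# b1 = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))   # -44
sxx = sum((a - xbar) ** 2 for a in x)                      # 22
b1 = sxy / sxx               # -2.0
b0 = ybar - b1 * xbar        # 18.0
print(f"y-hat = {b0} + ({b1})x")
```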
The units of the slope are (units of y)/(units of x).

Total Sum of Squares: SS(total) = Σ(yᵢ − ȳ)²  gives the total variability in the y values.

Making predictions using our equation: predict the number of pounds lost if a dieting person eats out 5 times per week: ŷ = 18 − 10 = 8 lb (this prediction is for x within the data range).

Warning about making predictions: it is safe to predict y only for x within the data range.
Extrapolation – making predictions for values of the predictor variable outside the range of the observed values of the predictor variable (x). Grossly incorrect predictions can result from extrapolation. For example, the prediction for x = 15 is ŷ = −12 (a weight gain of 12 lb). One may think that value is reasonable, since weight gain is to be expected if you eat out so often, but because we have no data in that range we cannot be sure that our particular trend holds that far outside the range of the x values.

Besides the total sum of squares there are two other important sums of squares that we will use. We obtain the residual sum of squares and the regression sum of squares by computing the predicted value of y for each x using our equation.

    xᵢ    yᵢ    ŷᵢ    yᵢ−ŷᵢ   (yᵢ−ŷᵢ)²   ŷᵢ−ȳ   (ŷᵢ−ȳ)²
     3    10    12     -2        4         4       16
     7     2     4     -2        4        -4       16
     6     9     6      3        9        -2        4
     7     4     4      0        0        -4       16
     2    15    14      1        1         6       36
    25    40                    18                 88
                            SS(resid)           SS(reg)

Residual Sum of Squares: SS(resid) = Σ(yᵢ − ŷᵢ)²  gives the variability in the y values left unexplained by the regression equation.
Regression Sum of Squares: SS(reg) = Σ(ŷᵢ − ȳ)²  gives the variability in the y values explained by the regression equation.

Regression Identity: SS(total) = SS(reg) + SS(resid)
We can verify this in our example: 106 = 88 + 18.

We will use the three sums of squares in further definitions.

Problems we may encounter with our data:
Outlier – a data point that lies far from the regression line, relative to the other data points.
Influential observation – a data point whose removal causes the regression equation to change considerably.
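The three sums of squares and the regression identity can be verified numerically. A minimal Python sketch (ours, not from the notes), using the fitted line ŷ = 18 − 2x:

```python
# Verify SS(total) = SS(reg) + SS(resid) on the diet example.
x = [3, 7, 6, 7, 2]
y = [10, 2, 9, 4, 15]
ybar = sum(y) / len(y)                 # 8.0
yhat = [18 - 2 * xi for xi in x]       # predictions from y-hat = 18 - 2x

ss_total = sum((yi - ybar) ** 2 for yi in y)                 # 106
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # 18
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)                # 88
print(ss_total, ss_reg, ss_resid)      # identity: 106 = 88 + 18
```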
It is usually separated in the x-direction from the other data points, and it pulls the regression line toward itself.

Assessing the fit of our regression line:

We can use SS(resid) to compute the Residual Standard Deviation:

    s_e = √( SS(resid) / (n − 2) )

This measure tells us how close the data points are to the fitted regression line; the smaller this value, the better the fit of the regression line. If s_e is smaller than s_y (the standard deviation of the y-s), that indicates a good fit. The units of s_e are the same as the units of y. In our example s_e = √(18/3) = 2.45 lb.

For "nice" data, if n is not too small, we expect roughly:
68% of observed y-s to be within ±s_e of the regression line,
95% of observed y-s to be within ±2s_e of the regression line, and
99.7% of observed y-s to be within ±3s_e of the regression line.
In other words, these (rough) percentages of data points are expected to lie within these vertical distances (parallel to the y axis) above and below the regression line.

Another value that tells us about the fit of the regression line is the

Coefficient of Determination:  r² = SS(reg)/SS(total) = 1 − SS(resid)/SS(total),  with 0 ≤ r² ≤ 1

This measure gives the percentage of the total variability in the y values that is explained by the regression line. A value close to 1 (100%) indicates a good fit.

In our example r² = 1 − 18/106 = 88/106 = 0.83, indicating a good fit: 83% of the total variability in the y-values is explained by the regression line.

Related to r² is the Linear Correlation Coefficient of X and Y, r:

    r = (1/(n−1)) Σ ((xᵢ − x̄)/s_x)·((yᵢ − ȳ)/s_y)  =  Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

The linear correlation coefficient gives information about the strength and direction of a linear trend in our data. Values close to 1 or −1 indicate an extremely strong linear association between X and Y. Linear correlation is symmetric: if X and Y are reversed, the value of r remains the same.
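The r and r² computations can be checked the same way. A short Python sketch (ours) using the sum-of-products form of r:

```python
import math

# Diet example: correlation via r = S_xy / sqrt(S_xx * S_yy).
x = [3, 7, 6, 7, 2]
y = [10, 2, 9, 4, 15]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))   # -44
sxx = sum((a - xbar) ** 2 for a in x)                      # 22
syy = sum((b - ybar) ** 2 for b in y)                      # 106 = SS(total)

r = sxy / math.sqrt(sxx * syy)   # about -0.91: strong negative trend
r2 = r ** 2                      # about 0.83, matching SS(reg)/SS(total)
print(round(r, 2), round(r2, 2))
```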
sign(r) = sign of the slope; r is unit free and −1 ≤ r ≤ 1. In our example r = −0.91, indicating a very strong negative linear trend.

There is an approximate relationship between r, s_e and s_y. The approximation is good for large data sets:

    s_e / s_y ≈ √(1 − r²)     (if this ratio is small, it indicates a good fit)

The formula above can also be written as r² ≈ 1 − s_e²/s_y² and used to approximate r² for large data sets. In our example √(1 − 0.83) = 0.41 while s_e/s_y = 2.45/5.15 = 0.48, giving r² ≈ 0.77 — not a great approximation, since our data set is small.

Hypothesis test and CI for the slope and correlation coefficient:

We assume that the following linear model applies to X and Y:

    Y = μ_{Y|X} + ε,  where μ_{Y|X} = β₀ + β₁X

Here μ_{Y|X}, the population mean Y value for a given X, is a linear function of X; σ_{Y|X} is the population standard deviation of the Y-s for a given X value; and the population correlation coefficient is ρ (rho). The term ε in our model represents random error; we include it to reflect the fact that Y varies even when X is fixed.

We can test hypotheses about the true slope and the true correlation coefficient. The quantities we compute for least squares regression estimate the parameters of the model:

    b₀ estimates β₀,  b₁ estimates β₁,  r estimates ρ,  s_e estimates σ_{Y|X}

If we want to test whether the slope is zero (indicating no relationship between x and y), our hypotheses are:

    H₀: β₁ = 0  vs  Hₐ: β₁ ≠ 0, Hₐ: β₁ > 0, or Hₐ: β₁ < 0

The choice of alternative depends on the question.

Test statistic:  t_s = b₁ / SE_{b₁},  df = n − 2

Standard error of the slope b₁:

    SE_{b₁} = s_e / √( Σ(xᵢ − x̄)² )  =  s_e / ( s_x √(n − 1) )

(the second formula is easier to compute).

We can also compute a confidence interval for the true slope. 95% CI for β₁:  b₁ ± t_{.025} · SE_{b₁}

Equivalently, we may test hypotheses involving the true correlation coefficient.
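The slope test and CI can be carried out numerically for the diet example. A Python sketch (ours; the critical value t_.025 = 3.182 for df = 3 is taken from a t table). The resulting t ≈ −3.83 differs slightly from the value obtained via the r-based formula because r² = 0.83 is rounded there:

```python
import math

# Slope test for the diet example: t = b1 / SE_b1, df = n - 2.
x = [3, 7, 6, 7, 2]
y = [10, 2, 9, 4, 15]
n = len(x)
b0, b1 = 18.0, -2.0                      # least-squares fit from the notes

ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # 18
se = math.sqrt(ss_resid / (n - 2))       # residual SD, sqrt(6), about 2.45
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)  # 22
se_b1 = se / math.sqrt(sxx)              # standard error of the slope
t = b1 / se_b1                           # about -3.83

t_crit = 3.182                           # t_.025 with df = 3 (t table)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% CI for beta1
print(round(t, 2), tuple(round(v, 2) for v in ci))
```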
Our hypotheses are:  H₀: ρ = 0  vs  Hₐ: ρ ≠ 0, Hₐ: ρ > 0, or Hₐ: ρ < 0

Test statistic:  t_s = r √( (n − 2) / (1 − r²) ),  df = n − 2

This test is equivalent to the one before, but the statistic is easier to compute.

In our example we may ask the question: assuming that a linear model applies, test the hypothesis that there is no relationship between x and y against the alternative that y decreases with increasing x. Use a 5% significance level.

Our test hypotheses are:  H₀: β₁ = 0 and Hₐ: β₁ < 0  (or H₀: ρ = 0 and Hₐ: ρ < 0)

We will use the second test statistic:

    t = −0.91 √( (5 − 2) / (1 − 0.83) ) = −3.82,   P-value = 0.0157 < 0.05

The null is rejected; we have evidence for the alternative. Yes, y decreases with increasing x: there is a statistically significant negative linear relationship between x and y.

Least Squares Linear Regression Example

Is the number of beers consumed a good predictor of BAC (blood alcohol content)? Can you drive legally after 6 beers (the legal limit is .08)? A random sample of 16 students from Ohio State University participated in a study. Each student was randomly assigned a number of beers to drink, and after 30 minutes a police officer measured their BAC in grams of alcohol per deciliter of blood. The results are given in the table below.

    # of beers = x:   5    2    9    8    3    7    3    5     3    5    4    6    5     7    1    4
    BAC = y:         .11  .03  .19  .12  .04  .095 .07  .065  .02  .06  .07  .11  .085  .09  .01  .05

Answer the following questions:
• Graph these data to confirm the existence of a linear trend.
• Obtain the LS regression line for these data.
• What is the slope of your line telling you about the change in BAC after you drink the next beer?
• Do you think the y-intercept value is sensible?
• Compute the linear correlation coefficient and the coefficient of determination.
• What % of the total variability in BAC is explained by the regression line? Are r and r² indicating a good fit?
• Clearly answer both questions posed at the beginning of the problem.
• Can we use our equation to safely predict BAC for someone who drinks 15 beers?
• Given SS(resid) = .00594, compute s_e, compare it to s_y and decide if it indicates a good fit.
• Test the hypothesis of no linear relationship between x and y against the directional alternative hypothesis that y increases with increasing x.
• Obtain a 95% CI for the true slope β₁.

Different (computational) formulas for the slope, r and all three sums of squares:

    b₁ = S_xy / S_xx,   SS(reg) = S_xy² / S_xx,   SS(resid) = S_yy − S_xy²/S_xx,   SS(total) = S_yy

    r = S_xy / √( S_xx S_yy ),   r² = S_xy² / ( S_xx S_yy )

where:

    S_xx = Σxᵢ² − (Σxᵢ)²/n,   S_yy = Σyᵢ² − (Σyᵢ)²/n,   S_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n

CALCULATOR

1) To use the computational formulas above and/or find basic statistics for x and y:
STAT EDIT, place the x-s and y-s in L1 and L2 respectively, then STAT CALC, select 2-Var Stats <enter>, 2-Var Stats L1, L2 <enter>.

2) To find the regression line:
STAT EDIT, place the x-s and y-s in L1 and L2 respectively, then STAT CALC, select LinReg(ax+b) <enter>, LinReg(ax+b) L1, L2 <enter>.
The output should display the slope, y-intercept, r and r². If you do not see r and r², go to the CATALOG, select DIAGNOSTICS ON <enter><enter>; the next time you run the regression it will show both values.

3) To perform the test: STAT TESTS, use LinRegTTest, place Xlist L1, Ylist L2, then select the appropriate alternative hypothesis. (The calculator uses β for β₁; the test is for both β₁ and ρ.) In the output, the value s = s_e; you also get the values of s_x and s_y there.

4) To compute a CI for β₁: STAT TESTS, use LinRegTInt, place Xlist L1, Ylist L2, then select the appropriate C-level.
If you do not have the CI on your calculator, use s from the LinRegTTest output to compute SE_{b₁} = s_e / ( s_x √(n − 1) ) and compute the CI by hand.
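As a sketch of how the computational formulas apply, here is Python code (ours, not part of the notes) run on the beer/BAC data. The resulting values are our own computations and can serve as a check on your calculator work; note that S_yy − S_xy²/S_xx reproduces the given SS(resid) = .00594:

```python
import math

# Beer/BAC example: shortcut formulas S_xx, S_yy, S_xy.
x = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
y = [.11, .03, .19, .12, .04, .095, .07, .065,
     .02, .06, .07, .11, .085, .09, .01, .05]
n = len(x)

sxx = sum(a * a for a in x) - sum(x) ** 2 / n                 # S_xx
syy = sum(b * b for b in y) - sum(y) ** 2 / n                 # S_yy
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # S_xy

b1 = sxy / sxx                        # slope, about 0.0182 g/dL per beer
b0 = sum(y) / n - b1 * sum(x) / n     # intercept, about -0.0116
r = sxy / math.sqrt(sxx * syy)        # about 0.895: strong positive trend
ss_resid = syy - sxy ** 2 / sxx       # about .00594, as given above
print(round(b1, 4), round(b0, 4), round(r, 3))
```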