
STP231 Brief Class Notes Instructor: Ela Jackiewicz
Chapter 12
Regression and Correlation
In this chapter we will consider bivariate data in the form of ordered pairs (x, y).
Both x and y can be observed, or y can be observed for specific values of x that are selected by the
researcher. The data must display a linear trend on the scatterplot for us to consider fitting a linear
equation that expresses y in terms of x. We will use the following vocabulary with respect to x and y:
y – response variable or dependent variable
x – predictor variable/explanatory variable or independent variable
Our line is called the Least Squares Regression Line and has the form ŷ = b0 + b1·x , where ŷ is
the predicted value of y for a given x.
The name of the line reflects the fact that our line has the smallest possible sum of squared errors (of
all lines that could possibly be fitted to the given data).
Errors or residuals are: e = y − ŷ
Formulas for the slope and y-intercept of the Least-Squares Regression equation ŷ = b0 + b1·x :

b1 = ∑(x_i − x̄)(y_i − ȳ) / ∑(x_i − x̄)² = r·(s_y/s_x) ,   b0 = ȳ − b1·x̄

where s_x = √( ∑(x_i − x̄)² / (n−1) ) and s_y = √( ∑(y_i − ȳ)² / (n−1) )
(the standard deviations of the x's and y's respectively)
The following example illustrates the computations. We consider people on a diet, with
X = # of times per week a person eats out and Y = # of pounds lost dieting during a one-month period.
x_i    y_i    x_i−x̄    y_i−ȳ    (x_i−x̄)²    (y_i−ȳ)²        (x_i−x̄)(y_i−ȳ)
 3     10      −2        2         4           4                 −4
 7      2       2       −6         4          36                −12
 6      9       1        1         1           1                  1
 7      4       2       −4         4          16                 −8
 2     15      −3        7         9          49                −21
25     40                         22         106 SS(total)      −44
x̄ = 25/5 = 5 , ȳ = 40/5 = 8 , b1 = −44/22 = −2 , b0 = 8 − (−2)(5) = 18 ,
so our equation is: ŷ = 18 − 2x
The equation tells us that for every 1-unit increase in x (times eating out per week), predicted y
(weight loss) decreases by 2 pounds. The units of the slope are: (units of y)/(units of x)
Total Sum of Squares: SS(total) = ∑(y_i − ȳ)² gives the total variability in the y values
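These computations are easy to verify in software. Below is a minimal Python sketch (plain
Python, no libraries; the variable names are mine, not from the notes) that reproduces the slope
and intercept from the table above:

    # Diet example from the notes: x = times eating out per week, y = pounds lost.
    x = [3, 7, 6, 7, 2]
    y = [10, 2, 9, 4, 15]

    n = len(x)
    x_bar = sum(x) / n                                              # 5.0
    y_bar = sum(y) / n                                              # 8.0
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # -44.0
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 22.0

    b1 = sxy / sxx            # slope: -2.0
    b0 = y_bar - b1 * x_bar   # intercept: 18.0
    print(b1, b0)             # -2.0 18.0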
Making predictions using our equation:
Predict the number of pounds lost if a dieting person eats out 5 times per week: ŷ = 18 − 2(5) = 8 lb
(this prediction is for x within the data range)
Warning about making predictions:
It is safe to predict y only for x within the data range.
Extrapolation - making predictions for values of the predictor variable (x) outside the range of the
observed values of the predictor variable.
Grossly incorrect predictions can result from extrapolation.
For example, the prediction for x = 15 is ŷ = 18 − 2(15) = −12 (a weight gain of 12 lb). One may think
that value is reasonable, since weight gain is to be expected if you eat out so often, but because we
have no data in that range, we cannot be sure that our particular trend holds that far outside the range
of observed x values.
Besides the Total Sum of Squares, there are two other important sums of squares that we will use.
We can obtain the Residual Sum of Squares and the Regression Sum of Squares by using our equation
to compute the predicted value of y for each x.
x_i    y_i    ŷ_i    y_i−ŷ_i    ŷ_i−ȳ    (y_i−ŷ_i)²      (ŷ_i−ȳ)²
 3     10     12       −2         4          4               16
 7      2      4       −2        −4          4               16
 6      9      6        3        −2          9                4
 7      4      4        0        −4          0               16
 2     15     14        1         6          1               36
25     40                                   18 SS(resid)     88 SS(reg)
 2
Residual Sum of Squares: SSresid=∑  y i− y i  gives total variability in y values unexplained
by the regression equation
2

Regression Sum of Squares: SSreg =∑  y i− y  gives total variability in y values explained by
the regression equation
Regression Identity: SS(total) = SS(reg) + SS(resid)
We can verify this in our example: 106 = 88 + 18
We will use the three sums of squares in further definitions.
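A short, self-contained Python check of the three sums of squares and the regression identity on
the same example (again a sketch with my own variable names):

    x = [3, 7, 6, 7, 2]
    y = [10, 2, 9, 4, 15]
    y_bar = sum(y) / len(y)                # 8.0
    y_hat = [18 - 2 * xi for xi in x]      # predictions from ŷ = 18 − 2x

    ss_total = sum((yi - y_bar) ** 2 for yi in y)               # 106.0
    ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 18.0
    ss_reg   = sum((yh - y_bar) ** 2 for yh in y_hat)           # 88.0
    print(ss_total == ss_reg + ss_resid)   # True: 106 = 88 + 18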
Problems we may encounter with our data:
Outlier – a data point that lies far from the regression line, relative to the other data points
Influential observation – a data point whose removal causes the regression line to change considerably.
It is usually separated in the x-direction from the other data points, and it pulls the regression line
towards itself.
Assessing the fit of our regression line:
We can use SS(resid) to compute the Residual Standard Deviation: s_e = √( SS(resid) / (n−2) )
This measure tells us how close the data points are to our fitted regression line; the smaller this value,
the better the fit of our regression line. If s_e is smaller than s_y (the standard deviation of the y's),
that indicates a good fit. The units of s_e are the same as the units of y.
In our example s_e = √(18/3) = 2.45 lb.
For "nice" data, if the data set is not too small, we expect roughly:
68% of observed y's to be within ±s_e of the regression line,
95% of observed y's to be within ±2s_e of the regression line, and
99.7% of observed y's to be within ±3s_e of the regression line.
In other words, these percentages (roughly) of the data points are expected to lie within these vertical
distances (parallel to the y-axis) above and below the regression line.
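A quick Python check of the s_e computation (the 18 and n = 5 come from the table above):

    import math

    ss_resid, n = 18.0, 5
    s_e = math.sqrt(ss_resid / (n - 2))   # residual standard deviation
    print(round(s_e, 2))                  # 2.45 (pounds)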
Another value that tells us about the fit of the regression line is the
Coefficient of Determination: r² = SS(reg)/SS(total) = 1 − SS(resid)/SS(total)
This measure gives us the percentage of the total variability in y values that is explained by our
regression line. If it is close to 1 (100%), that indicates a good fit. 0 ≤ r² ≤ 1
In our example r² = 1 − 18/106 = 88/106 = 0.83; it indicates a good fit: 83% of the total variability
in y-values is explained by the regression line.
Related to r² is the Linear Correlation Coefficient of X and Y, r:

r = (1/(n−1)) ∑ ((x_i − x̄)/s_x)·((y_i − ȳ)/s_y) = ∑(x_i − x̄)(y_i − ȳ) / ((n−1)·s_x·s_y)
  = ∑(x_i − x̄)(y_i − ȳ) / √( ∑(x_i − x̄)²·∑(y_i − ȳ)² )

The Linear Correlation Coefficient gives information about the strength and direction of a linear trend
in our data. Values close to 1 or −1 indicate an extremely strong linear association between X and Y.
Linear correlation r is symmetric: if X and Y are reversed, the value of r remains the same.
sign(r) = sign of the slope, r is unit-free, and −1 ≤ r ≤ 1.
In our example r = −0.91, indicating a very strong negative linear trend.
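Checking r in Python, using the column sums from the tables above (−44, 22, and 106):

    import math

    sxy, sxx, syy = -44.0, 22.0, 106.0
    r = sxy / math.sqrt(sxx * syy)
    print(round(r, 2))   # -0.91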
There is an approximate relationship of r to s_e and s_y. The approximation is good for large data sets:

s_e/s_y ≈ √(1 − r²) . If this ratio is small, it indicates a good fit.

The formula above can also be expressed as r² ≈ 1 − s_e²/s_y² and used to approximate r² for large
data sets.
In our example √(1 − 0.83) = 0.41 while s_e/s_y = 2.45/5.15 = 0.48, and r² ≈ 1 − (0.48)² = 0.77,
not a great approximation, since our data set is small.
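A numeric check of the approximation in Python (numbers from the example; 5.15 = √(106/4) is s_y):

    import math

    s_e, s_y, r2 = 2.45, 5.15, 0.83
    print(round(s_e / s_y, 2), round(math.sqrt(1 - r2), 2))  # 0.48 vs 0.41
    print(round(1 - (s_e / s_y) ** 2, 2))                    # 0.77, vs the exact r² = 0.83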
Hypothesis test and CI for the slope and correlation coefficient:
Assume that the following linear model applies to X and Y:
The Linear Model has the form Y = μ_Y|X + ε , where μ_Y|X = β0 + β1·X
μ_Y|X is a linear function of X; μ_Y|X = the population mean Y value for a given X;
σ_Y|X = the population standard deviation of the Y's for a given X value; the population correlation
coefficient is ρ (rho).
The term ε in our model represents random error; we include it in the model to reflect that Y varies
even if X is fixed. We can test hypotheses related to the true slope and true correlation coefficient.
The quantities we compute for Least Squares Regression estimate the parameters in our model:
b0 estimates β0 , b1 estimates β1 , r estimates ρ , s_e estimates σ_Y|X
If we want to test whether the slope is zero (indicating no linear relationship between x and y), our
hypotheses are:
H0: β1 = 0 vs Ha: β1 ≠ 0 or Ha: β1 > 0 or Ha: β1 < 0
The choice of alternative depends on the question.
Test statistic: t_s = b1 / SE_b1 , df = n−2, where the Standard Error of the slope b1 is:
SE_b1 = s_e / √( ∑(x_i − x̄)² ) = s_e / (s_x·√(n−1)) ; the second formula is easier to compute.
We can also compute a confidence interval for the true slope:
95% CI for β1: b1 ± t.025·SE_b1
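A sketch of this test statistic and the 95% CI for the diet example, using scipy only for the t critical
value (numbers in the comments are rounded):

    import math
    from scipy import stats

    b1, s_e, sxx, n = -2.0, math.sqrt(6), 22.0, 5    # s_e = √(18/3) from above
    se_b1 = s_e / math.sqrt(sxx)                     # ≈ 0.52
    t_crit = stats.t.ppf(0.975, df=n - 2)            # t.025 with 3 df ≈ 3.18
    print(b1 / se_b1)                                # t_s ≈ -3.8 (matches the r-based value below, up to rounding)
    print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI ≈ (-3.66, -0.34)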
Equivalently, we may test hypotheses involving the true correlation coefficient.
Our hypotheses are:
H0: ρ = 0 vs Ha: ρ ≠ 0 or Ha: ρ > 0 or Ha: ρ < 0
Test statistic: t_s = r·√( (n−2) / (1−r²) ) , df = n−2. This test is equivalent to the one before, but
easier to compute.
In our example we may ask the question: assuming that a linear model applies, test the hypothesis that
there is no relationship between x and y against the alternative that y decreases with increasing x.
Use a 5% significance level.
Our test hypotheses are: H0: β1 = 0 and Ha: β1 < 0 (or H0: ρ = 0 and Ha: ρ < 0 )
We will use the second test statistic: t = −0.91·√( (5−2)/(1−0.83) ) = −3.82 , P-value = 0.0157 < 0.05
The null is rejected; we have evidence for the alternative. Yes, y decreases with increasing x; there is a
statistically significant negative linear relationship between x and y.
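The same test in Python, with scipy supplying the one-sided P-value:

    import math
    from scipy import stats

    r, n = -0.91, 5
    t_s = r * math.sqrt((n - 2) / (1 - r ** 2))  # ≈ -3.8
    p = stats.t.cdf(t_s, df=n - 2)               # one-sided P(T ≤ t_s), for Ha: ρ < 0
    print(round(t_s, 2), round(p, 4))            # ≈ -3.8, p ≈ 0.016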
Least Squares Linear Regression Example
Is the number of beers consumed a good predictor of BAC (blood alcohol content)? Can you drive
legally after 6 beers (the legal limit is .08)?
A random sample of 16 students from Ohio State University participated in a study. Each student was
randomly assigned a number of beers to drink, and after 30 minutes a police officer measured their
BAC in grams of alcohol per deciliter of blood. The results are given in the table below.
# of beers = x:   5    2    9    8    3     7    3     5    3    5    4    6     5    7    1    4
BAC = y:        .11  .03  .19  .12  .04  .095  .07  .065  .02  .06  .07  .11  .085  .09  .01  .05

Answer the following questions:
• Graph these data to confirm the existence of a linear trend.
• Obtain the LS regression line for these data.
• What does the slope of your line tell you about the change in BAC after you drink one more beer?
• Do you think the y-intercept value is sensible?
• Compute the linear correlation coefficient and the coefficient of determination.
• What % of the total variability in BAC is explained by the regression line?
• Are r and r² indicating a good fit?
• Clearly answer both questions posed at the beginning of the problem.
• Can we use our equation to safely predict BAC for someone who drinks 15 beers?
• Given SS(resid) = .00594, compute s_e , compare it to s_y , and decide if it indicates a good fit.
• Test the hypothesis of no linear relationship between x and y against the directional alternative
  hypothesis that y increases with increasing x.
• Obtain a 95% CI for the true slope β1.
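For the plotting and fitting steps, here is a hedged starting-point sketch (scipy's linregress and
matplotlib; the data pairing follows the table above, and the printed values are left for you to check
against your own work):

    import matplotlib.pyplot as plt
    from scipy import stats

    beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
    bac   = [.11, .03, .19, .12, .04, .095, .07, .065,
             .02, .06, .07, .11, .085, .09, .01, .05]

    fit = stats.linregress(beers, bac)       # slope, intercept, r, two-sided p, stderr
    print(fit.slope, fit.intercept, fit.rvalue ** 2)

    plt.scatter(beers, bac)                  # first step: check for a linear trend
    plt.xlabel("# of beers"); plt.ylabel("BAC (g/dL)")
    plt.show()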
Different (computational) formulas for the slope, r, and all three sums of squares:

b1 = S_xy / S_xx ,  r = S_xy / √(S_xx·S_yy) ,  r² = S_xy² / (S_xx·S_yy) ,

SS(total) = S_yy ,  SS(reg) = S_xy² / S_xx ,  SS(resid) = S_yy − S_xy² / S_xx

where:
S_xx = ∑x_i² − (∑x_i)²/n ,  S_yy = ∑y_i² − (∑y_i)²/n ,  S_xy = ∑x_i·y_i − (∑x_i)(∑y_i)/n
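The computational formulas in Python, checked against the diet example from earlier in the notes:

    x = [3, 7, 6, 7, 2]
    y = [10, 2, 9, 4, 15]
    n = len(x)

    s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n                  # 22.0
    s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n                  # 106.0
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # -44.0

    print(s_xy / s_xx)              # b1 = -2.0
    print(s_yy)                     # SS(total) = 106.0
    print(s_xy ** 2 / s_xx)         # SS(reg) = 88.0
    print(s_yy - s_xy ** 2 / s_xx)  # SS(resid) = 18.0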
CALCULATOR
1) To use the above computational formulas and/or find basic statistics for x and y:
STAT EDIT: place the x's and y's in L1 and L2 respectively, then
STAT CALC, select 2-Var Stats <enter>
2-Var Stats L1, L2 <enter>
2) To find the regression line: STAT EDIT: place the x's and y's in L1 and L2 respectively, then
STAT CALC, select LinReg(ax+b) <enter>
LinReg(ax+b) L1, L2 <enter>
The output should display the slope, y-intercept, r and r². If you do not see r and r², go to the
CATALOG, select DIAGNOSTICS ON <enter> <enter>; the next time you run the regression it will
show both values.
3) To perform the test: STAT TESTS, use LinRegT-Test; place
Xlist: L1
Ylist: L2, then select the appropriate alternative hypothesis.
(The calculator uses β for β1; the test is for both β1 and ρ.)
In the output, the value s = s_e ; you also get the values of s_x and s_y there.
4) To compute a CI for β1: STAT TESTS, use LinRegT-Interval; place
Xlist: L1
Ylist: L2, then select the appropriate C-level.
If you do not have the CI on your calculator, use s from the LinRegT-Test output to compute
SE_b1 = s_e / (s_x·√(n−1)) and compute the CI by hand.