
CH 15 Regression and Correlation
15.1 Stochastic Relationships and Scatter Diagrams
Two Quantitative Variables
Plot observed data on a graph.
Horizontal (X axis): independent variable
Vertical (Y axis): dependent variable
We call the graph a scatter diagram or scatter plot.
Example
X = Dosage of Drug
Y = Reduction in Blood Pressure
X      Y
100    10
200    18
300    32
400    44
500    56
[Scatter diagram of the dosage and blood pressure data.]
Note the strong relationship between X and Y.
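A plot like this takes only a few lines of Python; here is a minimal sketch using matplotlib (the variable names are ours):

    import matplotlib.pyplot as plt

    # dosage of drug (X) and reduction in blood pressure (Y) from the example
    x = [100, 200, 300, 400, 500]
    y = [10, 18, 32, 44, 56]

    plt.scatter(x, y)                          # one point per (x, y) pair
    plt.xlabel("Dosage of Drug")               # independent variable on the horizontal axis
    plt.ylabel("Reduction in Blood Pressure")  # dependent variable on the vertical axis
    plt.title("Scatter Diagram")
    plt.show()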
[Six scatter plot panels (C1 vs. C2) illustrating: perfect positive linear correlation, perfect negative linear correlation, positive linear correlation, negative linear correlation, non-linear correlation, and no correlation.]
15.9 The Correlation Coefficient
We wish to quantify the strength of a linear relationship with the Pearson correlation
coefficient, r.
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}
A positive r suggests that large values of X tend to occur with large values of Y, and small values of X with small values of Y.
Ex: Experience and Salary
A negative r suggests that large values of one variable tend to occur with small values of the other.
Ex: Weight of a car and gas mileage
-1 ≤ r ≤ 1
r = 1: all of the data lie exactly on a straight line with positive slope.
r = -1: all of the data lie exactly on a straight line with negative slope.
r = 0: no linear relationship.
The stronger the linear relationship, the larger |r| is.
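The computational form of the formula translates directly into code. Here is a minimal Python sketch using only the standard library (the function name pearson_r is ours):

    import math

    def pearson_r(xs, ys):
        """Pearson correlation coefficient, computational formula."""
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        sum_xy = sum(x * y for x, y in zip(xs, ys))
        sum_x2 = sum(x * x for x in xs)
        sum_y2 = sum(y * y for y in ys)
        num = n * sum_xy - sum_x * sum_y
        den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return num / den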
Example
Back to the dosage of drug and reduction in blood pressure data.
x      y     x²         y²       xy
100    10    10,000     100      1,000
200    18    40,000     324      3,600
300    32    90,000     1,024    9,600
400    44    160,000    1,936    17,600
500    56    250,000    3,136    28,000

Σx = 1500   Σy = 160   Σx² = 550,000   Σy² = 6,520   Σxy = 59,800

r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}
  = \frac{5(59,800) - (1500)(160)}{\sqrt{[5(550,000) - 1500^2][5(6,520) - 160^2]}}
  = \frac{59,000}{\sqrt{(500,000)(7,000)}}
  = \frac{59,000}{\sqrt{3,500,000,000}}
  \approx 0.99728
Existence of correlation does not imply a cause-and-effect relationship.
Ex: Yields of tomatoes and beans have positive correlation (the driving force is weather).
15.2 The Simple Linear Regression Model
Purpose of linear regression: to predict the value of a difficult-to-measure variable, Y (the response variable), from an easy-to-measure variable, X (the explanatory variable).
Example
Predict reaction time from blood alcohol level
In order to use linear regression, make sure the model is reasonable: the points on the scatter plot should fall around a straight line, and the correlation should be strong (|r| close to 1).
If the model is not reasonable, do not fit a straight line.
The linear regression model is:

\hat{y} = b_0 + b_1 x

where b_0 is the y-intercept and b_1 is the slope.
Example
Back to the dosage of drug and reduction in blood pressure data
The strong relationship between these variables has been established. We will now
predict the Reduction of Blood Pressure, y, based on the Dosage of Drug, x.
[Regression plot: Y = -3.4 + 0.118X, R-Sq = 99.5%; Reduction in Blood Pressure (vertical axis) versus Dosage of Drug (horizontal axis).]
Notice b_0 = -3.4 and b_1 = 0.118.
Let's predict the reduction in blood pressure if the dosage of drug is 250.

\hat{y} = -3.4 + 0.118x
\hat{y} = -3.4 + 0.118(250) = 26.1
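This prediction is plain arithmetic; in Python (the names b0 and b1 are ours):

    b0, b1 = -3.4, 0.118
    y_hat = b0 + b1 * 250   # predicted reduction in blood pressure at dosage 250
    print(y_hat)            # approximately 26.1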
Interpolation – Predicting Y values for X values that are within the range of the scatter
plot. (This is what regression should be used for)
Extrapolation – Predicting Y values for X values beyond the range of the observations.
(This should not be done with a basic regression model; it is a complex problem.)
The R^2 value is the percent of the variation in y explained by the model.
Example
For the dosage of drug and blood pressure example
R^2 = 0.99728^2 ≈ 0.995 = 99.5%
0% ≤ R^2 ≤ 100%
The higher R 2 is, the better the model is.
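Since this is simple linear regression, R^2 is just the square of r, so the earlier sketch extends directly:

    r = pearson_r(x, y)
    print(r ** 2)  # approximately 0.995, i.e., about 99.5% of the variation in y is explained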
15.3 Method of Least Squares
The regression model expresses Y as a function of X plus random error.
Random error reflects variation in Y values among items or individuals having the same
X value.
Draw picture
We need a line that is the "best" fit for our data. We will use the method of least squares: choose the line for which the sum of the squares of the vertical distances from the points to the line is minimized.
Draw picture
It can be shown that

b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}

b_0 = \bar{y} - b_1 \bar{x} = \frac{\sum y - b_1 \sum x}{n}
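These two formulas translate into a short Python sketch (the function name least_squares_fit is ours):

    def least_squares_fit(xs, ys):
        """Intercept b0 and slope b1 of the least-squares line."""
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        sum_xy = sum(x * y for x, y in zip(xs, ys))
        sum_x2 = sum(x * x for x in xs)
        b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
        b0 = (sum_y - b1 * sum_x) / n
        return b0, b1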
Note: The least-squares line can be affected greatly by extreme data points.
Example
b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(59,800) - (1500)(160)}{5(550,000) - 1500^2} = \frac{59,000}{500,000} = 0.118

b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{160 - 0.118(1500)}{5} = \frac{-17}{5} = -3.4
Notice these match the model we used before
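Running the least_squares_fit sketch on the drug data reproduces these coefficients:

    b0, b1 = least_squares_fit(x, y)
    print(b0, b1)  # approximately -3.4 and 0.118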
Residual – the difference between an actual value and the fitted value
e = y - (b_0 + b_1 x)
Example
Residual for the point (400, 44)
e = y - (b_0 + b_1 x) = 44 - [-3.4 + 0.118(400)] = 44 - 43.8 = 0.2
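The same residual can be checked with the fitted coefficients from above:

    e = 44 - (b0 + b1 * 400)  # residual for the point (400, 44)
    print(round(e, 1))        # 0.2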