Lecture 3
Selected material from:
Ch. 5 Summarizing bivariate data
Example: Bus in the US
One-way bus fares ($) from Rochester, New York to cities less than 750 miles from it.
We got the data from the bus company's website because we want to know how fares depend on distance.
Therefore, we make a scatterplot with distance on the x axis and fare on the y axis.
Destination City     Distance (miles)   Standard One-Way Fare ($)
Albany, NY                240                  39
Baltimore, MD             430                  81
Buffalo, NY                69                  17
Chicago, IL               607                  96
Cleveland, OH             257                  61
Montreal, QU              480                  70.5
New York City, NY         340                  65
Ottawa, ON                467                  82
Philadelphia, PA          335                  67
Potsdam, NY               239                  47
Syracuse, NY               95                  20
Toronto, ON               178                  35
Washington, DC            496                  87
Making a scatterplot
[Figure: "Greyhound Bus Fares Vs. Distance" scatterplot; x-axis: Distance from Rochester, NY (miles); y-axis: Standard One-Way Fare.]
• 13 cities; each city gets one dot.
• One dot represents 2 variables: x = distance and y = fare for the city.
Scatterplot of bus data
[Figure: "Greyhound Bus Fares Vs. Distance" scatterplot of the 13 data pairs from the table above; distance on the x-axis, fare on the y-axis.]
Interpretation of data
[Figure: "Greyhound Bus Fares Vs. Distance" scatterplot for the 13 cities.]
As would be expected, the fare (cost) increases the farther you travel.
But for two cities nearly the same distance away from Rochester, the fare can be lower for one!
Scatterplot of bus data
As would be expected, the cost increases the farther you travel.
But for two cities nearly the same distance away from Rochester, the cost can be lower for one!
Potsdam, NY: distance = 239 miles, fare = $47
Albany, NY: distance = 240 miles, fare = $39
[Figure: the scatterplot with the Potsdam and Albany points labeled.]
Why is it more expensive to travel to Potsdam than to Albany?
Albany, NY: population ≈ 100,000, close to major cities, has businesses and tourist attractions.
Potsdam, NY: population ≈ 15,000, far from any other city, no businesses or tourism (named after Potsdam, Germany).
It usually costs more to go to less travelled places.
The value of y (cost) is not determined solely by x (distance).
Bus in the US
The y value tends to increase as x increases. We say that there is a positive relationship between the variables distance and fare.
It appears that the y value (fare) could be predicted reasonably well from the x value (distance) by finding a line that is close to the points in the plot. We say the relationship is linear.
[Figure: "Greyhound Bus Fares Vs. Distance" scatterplot, showing the roughly linear upward trend.]
Scatterplot
[Figure: "Greyhound Bus Fares Vs. Distance" scatterplot; x-axis: Distance from Rochester, NY (miles); y-axis: Standard One-Way Fare.]
The y and x axes need not start at (0,0). Here, x starts at 50 miles and y at $10; choose convenient x- and y-axes so that all data points appear on the graph.
Association
Positive Association - Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values similarly tend to occur together. (I.e., the y values tend to increase as the x values increase.)
Negative Association - Two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa. (I.e., the y values tend to decrease as the x values increase.)
Pearson correlation coefficient
Pearson correlation coefficient (r): a measure of the strength of the linear relationship between two variables. Formula for r:

r = Σ zx·zy / (n − 1), where zx = (x − x̄)/sx and zy = (y − ȳ)/sy

and sx = √( Σ(x − x̄)² / (n − 1) ) is the sample standard deviation of x, and similarly for sy.
Properties of r (correlation)
The value of r does not depend on the unit of measurement (e.g. miles or $) for each variable.
The value of r does not depend on which of the two variables is labeled x.
The value of r is between −1 and +1.
The correlation coefficient is −1 only when all the points lie on a downward-sloping line, and +1 only when all the points lie on an upward-sloping line.
The value of r is a measure of the extent to which x and y are linearly related.
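These properties are easy to check numerically. A quick sketch in R (illustrative; using the first five bus data pairs, but any numeric vectors would do):

x <- c(240, 430, 69, 607, 257)
y <- c(39, 81, 17, 96, 61)
cor(x, y)            # r, always between -1 and +1
cor(y, x)            # identical: r does not depend on which variable is labeled x
cor(x * 1.60934, y)  # identical: changing units (miles to km) does not change r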
Example: Strong negative correlation
[Figure: scatterplot showing a strong negative correlation.]
No correlation
[Figure: scatterplot showing no correlation.]
Example: Strong positive association
[Figure: scatterplot showing a strong positive association.]
Interpretation of r
In general the correlation is:
• Strong: r ≤ −0.8 or r ≥ 0.8
• Moderate: −0.8 < r < −0.5 or 0.5 < r < 0.8
• Weak: −0.5 ≤ r ≤ 0.5
How to calculate: Bus example
x̄ = 325.615, sx = 164.2125, ȳ = 59.0385, sy = 25.506

r = Σ zx·zy / (n − 1), with zx = (x − x̄)/sx and zy = (y − ȳ)/sy:

  x      y     (x−x̄)/sx   (y−ȳ)/sy   product
 240     39     −0.5214    −0.7856    0.4096
 430     81      0.6357     0.8610    0.5473
  69     17     −1.5627    −1.6481    2.5755
 607     96      1.7135     1.4491    2.4831
 257     61     −0.4178     0.0769   −0.0321
 480    70.5     0.9402     0.4494    0.4225
 340     65      0.0876     0.2337    0.0205
 467     82      0.8610     0.9002    0.7751
 335     67      0.0571     0.3121    0.0178
 239     47     −0.5275    −0.4720    0.2489
  95     20     −1.4044    −1.5305    2.1494
 178     35     −0.8989    −0.9424    0.8472
 496     87      1.0376     1.0962    1.1374
                               Sum:  11.6021

r = 11.6021 / (13 − 1) = 0.9668

Strong positive association!
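The same calculation can be reproduced in R; a minimal sketch using the 13 bus data pairs (the vector names are ours):

distance <- c(240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496)
fare <- c(39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87)
zx <- (distance - mean(distance)) / sd(distance)  # standardized x values
zy <- (fare - mean(fare)) / sd(fare)              # standardized y values
sum(zx * zy) / (length(distance) - 1)             # r = 0.9668, as in the table
cor(distance, fare)                               # built-in function, same value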
Warning! The Pearson correlation coefficient only measures linear association.
You must look at the scatterplot first to check that it looks approximately linear before just computing r. If it looks nonlinear, then do not compute r.
Example to illustrate what can go wrong:
  x      y
 1.2   23.3
 2.5   21.5
 6.5   12.2
13.1    3.9
24.2    4.0
34.1   18.0
20.8    1.7
37.5   26.1
Example
x̄ = 17.488, sx = 13.951, ȳ = 13.838, sy = 9.721

  x      y    (x−x̄)/sx   (y−ȳ)/sy   product
 1.2   23.3    −1.167      0.973    −1.136
 2.5   21.5    −1.074      0.788    −0.847
 6.5   12.2    −0.788     −0.168     0.133
13.1    3.9    −0.314     −1.022     0.322
24.2    4.0     0.481     −1.012    −0.487
34.1   18.0     1.191      0.428     0.510
20.8    1.7     0.237     −1.249    −0.296
37.5   26.1     1.434      1.261     1.810
                              Sum:   0.007

r = Σ [((x − x̄)/sx)·((y − ȳ)/sy)] / (n − 1) = (1/7)(0.007) = 0.001

r near 0 means no association between x and y.
Example
[Figure: scatterplot of the (x, y) pairs above; the points trace a U-shaped curve.]
But the scatterplot reveals an almost perfect quadratic relationship between x and y, so to say there is no association or relationship between them is incorrect!
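A short R check of this pitfall, using the eight data pairs above:

x <- c(1.2, 2.5, 6.5, 13.1, 24.2, 34.1, 20.8, 37.5)
y <- c(23.3, 21.5, 12.2, 3.9, 4.0, 18.0, 1.7, 26.1)
cor(x, y)   # about 0.001: r is near 0 despite the clear quadratic pattern
plot(x, y)  # the scatterplot reveals the curved relationship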
Linear relations
Mathematical formula for linearly related variables x and y, where x is the explanatory variable and y the response variable:
y = a + bx is the equation of a straight line.
b is the slope of the line: the amount by which y increases when x increases by 1 unit.
a is the intercept (sometimes called the vertical intercept) of the line: the height of the line above the value x = 0, i.e. the value of y when x = 0.
[Figure: the line y = 7 + 3x plotted for x from 0 to 8; when x increases by 1 unit, y increases by b = 3, and the intercept is a = 7.]
How well does the line a + bx fit the data?
Suppose a line is drawn through the data points. How do we measure how well it fits the data?
The measure we use is based on vertical differences (sometimes called deviations) between the data values and the line:
vertical difference = y − (a + bx)
The closer the differences are to 0, the better the fit.
[Figure: Regression Plot: Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%; one vertical difference between a data point and the line is marked.]
Least squares criterion
The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x1, y1), (x2, y2), ..., (xn, yn) is the sum of the squared deviations about the line:

Σ [y − (a + bx)]² = [y1 − (a + bx1)]² + ... + [yn − (a + bxn)]²

where the first term is the squared vertical distance of the first data value from the line, and the last term that of the last data value.

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.
We find the best line by finding the a and b which give the smallest value of the sum of squared vertical differences. (Technically, set the partial derivatives with respect to a and b to 0.)
Solution for a and b
Slope of the least squares line:
b = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²
y intercept:
a = ȳ − b·x̄
Equation of the least squares line:
ŷ = a + bx
where the ^ above y (read as "y-hat") emphasizes that ŷ is a prediction of y resulting from substitution of the fitted a and b (above) into the line equation.
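A minimal sketch of these formulas as an R function (the function name is ours):

least_squares <- function(x, y) {
  b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
  a <- mean(y) - b * mean(x)                                      # y intercept
  c(intercept = a, slope = b)
}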
Calculations for US bus example

   x      y       x−x̄       (x−x̄)²       y−ȳ     (x−x̄)(y−ȳ)
  240     39    −85.615    7329.994   −20.038     1715.60
  430     81    104.385   10896.148    21.962     2292.45
   69     17   −256.615   65851.456   −42.038    10787.72
  607     96    281.385   79177.302    36.962    10400.41
  257     61    −68.615    4708.071     1.962     −134.59
  480    70.5   154.385   23834.609    11.462     1769.49
  340     65     14.385     206.917     5.962       85.75
  467     82    141.385   19989.609    22.962     3246.41
  335     67      9.385      88.071     7.962       74.72
  239     47    −86.615    7502.225   −12.038     1042.72
   95     20   −230.615   53183.456   −39.038     9002.87
  178     35   −147.615   21790.302   −24.038     3548.45
  496     87    170.385   29030.917    27.962     4764.22
 4233   767.5             323589.08               48596.19
Calculations for US bus example
Σ (x − x̄)(y − ȳ) = 48596.19 and Σ (x − x̄)² = 323589.08
So
b = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)² = 48596.19 / 323589.08 = 0.15018
Also n = 13, Σx = 4233 and Σy = 767.5, so
x̄ = 4233/13 = 325.615 and ȳ = 767.5/13 = 59.0385
This gives
a = ȳ − b·x̄ = 59.0385 − 0.15018(325.615) = 10.138
The regression line is ŷ = 10.138 + 0.15018x.
The fare increases on average 15 cents for each mile travelled.
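The hand calculation can be checked in R (a sketch; distance, fare and least_squares as defined in the earlier sketches):

least_squares(distance, fare)  # intercept = 10.138, slope = 0.15018
lm(fare ~ distance)            # R's built-in least squares fit gives the same coefficients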
Regression plot
Overlay the fitted line on the scatterplot.
[Figure: Regression Plot: Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%; the regression line drawn through the fare-vs-distance scatterplot.]
This is the best line you can fit to the data according to the least squares criterion.
Another formula for b
The formula
b = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²
can be rewritten as
b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]
which gives the same result and is easier to do on a calculator/computer.
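The computational formula in R (a sketch; same bus data vectors as before):

n <- length(distance)
(sum(distance * fare) - sum(distance) * sum(fare) / n) /
  (sum(distance^2) - sum(distance)^2 / n)  # b = 0.15018, the same slope as before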
Steps in a linear regression
1.) Determine which variable is the explanatory variable (x) and which is the response (y); use that the goal is to see how y changes with x.
2.) Look at the scatterplot to make sure the data could follow a linear relationship. If not, stop and consider a different model (e.g. quadratic).
3.) After fitting the regression model, check if there are unusual values (outliers) or patterns indicating something was wrong with the assumption of a linear relationship. (Next.)
4.) Determine predictions from the regression equation and their accuracy. (Next.)
Predicted/fitted values
The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives
ŷ1 = a + bx1 = 1st predicted value
ŷ2 = a + bx2 = 2nd predicted value
...
ŷn = a + bxn = nth predicted value
Residuals
The residuals for the least squares line are the values:
y1 − ŷ1, y2 − ŷ2, ..., yn − ŷn
Residuals are the vertical differences: the difference between each data value (y) and the fitted regression line at its corresponding x.
[Figure: Regression Plot: Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%; one vertical difference (residual) is marked.]
Bus in the US

   x      y     Predicted value     Residual
                ŷ = 10.1 + 0.150x   y − ŷ
  240     39        46.18            −7.181
  430     81        74.72             6.285
   69     17        20.50            −3.500
  607     96       101.30            −5.297
  257     61        48.73            12.266
  480    70.5       82.22           −11.724
  340     65        61.20             3.801
  467     82        80.27             1.728
  335     67        60.45             6.552
  239     47        46.03             0.969
   95     20        24.41            −4.405
  178     35        36.87            −1.870
  496     87        84.63             2.373
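Fitted values and residuals in R (a sketch; distance and fare as in the earlier sketches):

fit <- lm(fare ~ distance)
fitted(fit)     # the 13 predicted values, as in the table above
residuals(fit)  # the 13 residuals y - y-hat
plot(distance, residuals(fit)); abline(h = 0)  # the residual plot shown on the next slide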
Residual plot for bus example
A residual plot is a scatterplot of the data pairs (x, residual).
Since residuals are vertical differences between the response (y) and what is estimated as (fitted to) y, near 0 is good, large is bad.
[Figure: Residuals Versus x (response is y): residuals against distance, scattered between about −10 and +12.]
Residual plot for bus example
From the scatterplot of the data you can see that the data value for Cleveland, OH (257 miles, $61) is far from the regression line ... but it becomes more obvious in the residual plot.
These two differences are the same; different y-scales just make them look different.
[Figure: the Regression Plot (Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%) and the residual plot "Residuals Versus x (response is y)" side by side, with the Cleveland point marked in both.]
US bus data
[Figure: Residuals Versus x (response is y): residuals above 0 ("predicted fares are too low") for most distances between 200 and 500 miles, and below 0 ("predicted fares are too high") for the shortest distances.]
It appears that the line systematically predicts fares that are too high for cities close to Rochester and predicts fares that are too low for most cities between 200 and 500 miles.
US bus data
You can also see in the scatterplot of the data that at the beginning the line falls above the data values and afterwards below.
[Figure: the residual plot "Residuals Versus x (response is y)" shown together with the Regression Plot (Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%); both show the same systematic pattern.]
Other kinds of residual plots
Another common type of residual plot is a scatterplot of the data pairs (ŷ, residual), i.e. with a + bx instead of x on the horizontal axis. The following plot was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model.
[Figure: Residuals Versus the Fitted Values (response is y): residuals against fitted values from about 20 to 100.]
Residual plot - What to look for
Isolated residuals or patterns in residuals indicate potential problems.
Ideally residuals should be randomly spread out above and below zero, as in this example.
Interpretation:
1. Residuals below 0 indicate over-prediction.
2. Residuals above 0 indicate under-prediction.
[Figure: an ideal residual plot, with residuals scattered randomly above and below the zero line.]
Definitions and formulas
The total sum of squares, denoted by SSTo, is defined as
SSTo = (y1 − ȳ)² + (y2 − ȳ)² + ... + (yn − ȳ)² = Σ (y − ȳ)²
← difference from the mean of all y's
The residual sum of squares, denoted by SSResid, is defined as
SSResid = (y1 − ŷ1)² + (y2 − ŷ2)² + ... + (yn − ŷn)² = Σ (y − ŷ)²
← difference from the fitted regression line
SSTo versus SSResid
SSTo measures the vertical differences of all data values from the horizontal line at ȳ = 59.
SSResid measures the vertical differences from the regression line.
[Figure: Regression Plot (Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%) with a horizontal line drawn at y = 59; at one data value, the difference for SSTo is much larger than the difference for SSResid.]
Calculation formulas
SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:
SSTo = Σy² − (Σy)²/n
SSResid = Σy² − a·Σy − b·Σxy
The coefficient of determination, r², can be computed as
r² = 1 − SSResid/SSTo
A higher r² means the slope improved the predictions more.
Coefficient of determination
The coefficient of determination, denoted by r² or R², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
Mathematics can show that the coefficient of determination (r²) is the square of the Pearson correlation coefficient (r). Recall that the Pearson correlation coefficient measured the strength of a linear relationship between x and y and did not depend on which variable was called x.
Bus in the US
n = 13, Σy = 767.5, Σy² = 53119.25, Σxy = 298506
a = 10.138, b = 0.150179

SSTo = Σy² − (Σy)²/n = 53119.25 − 767.5²/13 = 7807.23
SSResid = Σy² − a·Σy − b·Σxy = 53119.25 − (10.138)(767.5) − (0.150179)(298506) = 509.00

SSResid is much smaller than SSTo, as expected because these data showed a strong linear relationship!
Interpretation
r² = 1 − SSResid/SSTo = 1 − 509/7807.23 = 0.9348
93.5% of the variation in bus fares can be attributed to the linear relationship between distance and fare.
Distance traveled is a strong predictor of how much you expect to pay on the bus; the farther you travel, the more you pay.
Use this information to determine how far you can travel with only a certain amount of money to spend.
More on variability
The standard deviation about the least squares line is denoted se and given by
se = √( SSResid / (n − 2) )
se is interpreted as the "typical" amount by which an observation deviates from the least squares line.
Bus in the US: se = √( 509 / 11 ) = $6.80
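These quantities in R (a sketch; fare and fit as in the earlier sketches):

SSTo <- sum((fare - mean(fare))^2)    # total sum of squares, 7807.23
SSResid <- sum(residuals(fit)^2)      # residual sum of squares, 509.0
1 - SSResid / SSTo                    # r-squared = 0.935
sqrt(SSResid / (length(fare) - 2))    # se = 6.80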
Computer programs can do regression for you: Example output from Minitab

Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = 10.1 + 0.150 Distance        ← least squares regression line

Predictor    Coef      SE Coef    T       P
Constant     10.138    4.327      2.34    0.039    ← a
Distance     0.15018   0.01196    12.56   0.000    ← b

S = 6.803 (= se)    R-Sq = 93.5% (= r²)    R-Sq(adj) = 92.9%

Analysis of Variance
Source           DF    SS        MS       F        P
Regression        1    7298.1    7298.1   157.68   0.000
Residual Error   11     509.1      46.3            ← SS = SSResid
Total            12    7807.2                      ← SS = SSTo
Bus in the US: Add more data
Add 7 new cities throughout the US to increase the data from the previous 13 cities to 20 cities. Seattle is 4578.8 km (2848 miles) away from Rochester, but its fare is not that much higher!

Destination         Distance (miles)   Standard Fare ($)
Buffalo, NY               69                 17
New York City            340                 65
Cleveland, OH            257                 61
Baltimore, MD            430                 81
Washington, DC           496                 87
Atlanta, GA              998                115
Chicago, IL              607                 96
San Francisco           2861                159
Seattle, WA             2848                159
Philadelphia, PA         335                 67
Orlando, FL             1478                109
Phoenix, AZ             2569                149
Houston, TX             1671                129
New Orleans, LA         1381                119
Syracuse, NY              95                 20
Albany, NY               240                 39
Potsdam, NY              239                 47
Toronto, ON              178                 35
Ottawa, ON               467                 82
Montreal, QU             480                 70.5
First: Look at the scatterplot
[Figure: scatterplot of Standard Fare vs. Distance for the 20 cities; the x-axis is expanded from 650 to 3000 miles.]
These new points do not follow the same trend as the previous ones. They are not rising as fast (different slope).
Second: Are the results expected?
[Figure: the 20-city scatterplot with the new cities labeled: Atlanta, Orlando, New Orleans, Houston, Phoenix, San Francisco, Seattle.]
Examine these cities. All are major cities with tourist attractions, and all lie to the south or west; travel to them is cheaper than expected.
Linear regression
Correlation coefficient, r = 0.921
R² = 0.849, se = $17.42
Regression line: Standard Fare = 46.058 + 0.043535 Distance
[Figure: Regression Plot (Standard Fare = 46.0582 + 0.0435354 Distance; S = 17.4230, R-Sq = 84.9%, R-Sq(adj) = 84.1%) next to a bad residual plot, "Residuals Versus Distance (response is Standard)", which shows a clear curved pattern.]
Even though the correlation coefficient is reasonably high and 84.9% of the variation in fares is explained, the regression line does not fit the data well and will not give good predictions.
Nonlinear regression
The scatterplot does not look linear; it appears to have a curved shape.
[Figure: scatterplot of Standard Fare vs. Distance for the 20 cities, showing the curved trend.]
We sometimes replace x or y or both with a transformation and then perform a linear regression using the transformed variables. This may lead to a better fit.
For this particular data, the shape of the curve is almost logarithmic, so we might try to replace the distance with log10(distance) (the logarithm to the base 10 of the distance).
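A sketch of the transformed fit in R (the vector names are ours; data from the 20-city table above):

distance20 <- c(69, 340, 257, 430, 496, 998, 607, 2861, 2848, 335,
                1478, 2569, 1671, 1381, 95, 240, 239, 178, 467, 480)
fare20 <- c(17, 65, 61, 81, 87, 115, 96, 159, 159, 67,
            109, 149, 129, 119, 20, 39, 47, 35, 82, 70.5)
summary(lm(fare20 ~ log10(distance20)))  # compare R-squared and s with the untransformed fit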
Nonlinear regression for bus example
Compute log10 of each of the distances.

Destination         Distance   Log10(distance)   Standard Fare
Buffalo, NY             69        1.83885             17
New York City          340        2.53148             65
Cleveland, OH          257        2.40993             61
Baltimore, MD          430        2.63347             81
Washington, DC         496        2.69548             87
Atlanta, GA            998        2.99913            115
Chicago, IL            607        2.78319             96
San Francisco         2861        3.45652            159
Seattle, WA           2848        3.45454            159
Philadelphia, PA       335        2.52504             67
Orlando, FL           1478        3.16967            109
Phoenix, AZ           2569        3.40976            149
Houston, TX           1671        3.22298            129
New Orleans, LA       1381        3.14019            119
Syracuse, NY            95        1.97772             20
Albany, NY             240        2.38021             39
Potsdam, NY            239        2.37840             47
Toronto, ON            178        2.25042             35
Ottawa, ON             467        2.66932             82
Montreal, QU           480        2.68124             70.5
Bus in the US

Regression Analysis: Standard Fare versus Log10(Distance)

The regression equation is
Standard Fare = -163 + 91.0 Log10(Distance)

Predictor    Coef       SE Coef    T        P
Constant     -163.25    10.59      -15.41   0.000
Log10(Di      91.039     3.826      23.80   0.000

S = 7.869    R-Sq = 96.9%    R-Sq(adj) = 96.7%

Typical error = $7.87: better accuracy (decreased from $17.42).
High r²: 96.9% of the variation is attributed to the model (increased from 84.9%).
Bus in the US
Rest of computer output:

Analysis of Variance
Source           DF    SS       MS       F        P
Regression        1    35068    35068    566.30   0.000
Residual Error   18     1115       62
Total            19    36183

Unusual Observations
Obs   Log10(Di   Standard   Fit      SE Fit   Residual   St Resid
 11   3.17       109.00     125.32   2.43     -16.32     -2.18R

R denotes an observation with a large standardized residual.
The only outlier is Orlando and, as you'll see from the next two slides, it is not too bad.
Bus in the US
When we plot the prediction curve on a graph with Standard Fare and log10(Distance) as axes, the result looks reasonably linear.
[Figure: Regression Plot: Standard Fare = -163.246 + 91.0389 Log10(Distance); S = 7.86930, R-Sq = 96.9%, R-Sq(adj) = 96.7%.]
Bus in the US
When we plot the prediction curve on a graph with Distance instead of log10(Distance) on the x axis, the data and curve are nonlinear.
Prediction model: Fare = -163.25 + 91.0·Log10(Distance)
The prediction model from transforming distance now appears to work well, and shows clearly that fares level off at higher distances.
[Figure: scatterplot of Standard Fare vs. Distance for the 20 cities with the fitted logarithmic prediction model overlaid.]
Tutorial: Call of the Amazonian frog
From Herpetologica, 2004: pp. 420-429.
The Amazonian tree frog uses vocal communication to call for a mate. A study was performed to assess whether the daily rainfall (mm) was associated with the proportion of frogs calling (call rate). For proportions p that are between 0 and 1, a logistic transformation is often used:
• Logistic transform = ln[p/(1−p)], ln = natural logarithm (see the R sketch below).
• For 0 < p < 1, −∞ < ln[p/(1−p)] < +∞, as assumed for an outcome variable that follows the Normal distribution.
• Choosing the right transformation for data can be tricky; there are tips in Chapter 5 of the book. For exams and homework, we will guide you, as here.
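A sketch of this transform in R (using the variable names rainfall and callrate that appear in the data summaries later in this tutorial):

logit <- log(callrate / (1 - callrate))  # ln[p/(1-p)]; log() is the natural logarithm in R
plot(rainfall, logit)                    # does this look more linear than callrate vs. rainfall?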
Data: Call of the Amazonian frog
1.) Plots of p = call rate versus rainfall and the logistic transform ln[p/(1−p)] versus rainfall are given below. Which plot suits a linear regression better?
[Figure: two panels, "p vs. rainfall" (callrate against rainfall (mm)) and "ln(p/(1-p)) vs. rainfall" (ln(callrate/(1-callrate)) against rainfall (mm)).]

       rainfall   call rate
 [1]     0.2        0.17
 [2]     0.3        0.19
 [3]     0.4        0.20
 [4]     0.5        0.21
 [5]     0.7        0.27
 [6]     0.8        0.28
 [7]     0.9        0.29
 [8]     1.1        0.34
 [9]     1.2        0.39
[10]     1.3        0.41
[11]     1.5        0.46
[12]     1.6        0.49
[13]     1.7        0.53
[14]     2.0        0.60
[15]     2.2        0.67
[16]     2.4        0.71
[17]     2.6        0.75
[18]     2.8        0.82
[19]     2.9        0.84
[20]     3.2        0.88
[21]     3.5        0.90
[22]     4.2        0.97
[23]     4.9        0.98
[24]     5.0        0.98
Data: Call of the Amazonian frog
2.) Using the best transformation, calculate the intercept and slope of the fitted regression line.
3.) Interpret the meaning of the intercept and slope in 2.)
4.) Give an estimate of the regression line at 4 mm daily rainfall. What is this an estimate of?

Recall the computational formula for the slope b of the regression of y on x:
b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

Some useful summaries calculated in the stat package R:
sum(log(callrate/(1-callrate)))              11.6
sum(rainfall)                                47.9
sum(rainfall^2)                             141.27
sum(rainfall*log(callrate/(1-callrate)))     77.56