
Chapter 11: Simple Linear Regression
Where We’ve Been


- Presented methods for estimating and testing population parameters for a single sample
- Extended those methods to allow for a comparison of population parameters for multiple samples
Where We’re Going




- Introduce the straight-line linear regression model as a means of relating one quantitative variable to another quantitative variable
- Introduce the correlation coefficient as a means of measuring the strength of the linear relationship between two quantitative variables
- Assess how well the simple linear regression model fits the sample data
- Use the simple linear regression model to predict the value of one variable given the value of another variable
11.1: Probabilistic Models
There may be a deterministic reality connecting two variables, y and x. But we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown (or unknowable) influence is referred to as the random error.

So our probabilistic models specify both a systematic connection between the variables and the influences we can't specify exactly in each case:

y = f(x) + random error
11.1: Probabilistic Models
The relationship between home runs and runs in baseball seems at first glance to be deterministic ...

[Scatter plot of Runs (0-6) versus Home Runs (0-6), with the points falling on a straight line]
11.1: Probabilistic Models
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.

[Scatter plot of Runs versus Home Runs, with the points now scattered around, rather than on, a straight line]
11.1: Probabilistic Models
General Form of Probabilistic Models

y = Deterministic component + Random error

where y is the variable of interest, and the mean value of the random error is assumed to be 0, so that E(y) = Deterministic component.
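To make the role of the random error concrete, here is a minimal simulation sketch in Python. The deterministic component f(x) = 2 + 3x and the error scale are illustrative assumptions, not values from the text:

```python
# Minimal sketch of the general probabilistic model, with an assumed
# deterministic component f(x) = 2 + 3x and normal random errors.
import numpy as np

rng = np.random.default_rng(0)

x = 5.0                                   # one fixed value of the independent variable
deterministic = 2 + 3 * x                 # hypothetical deterministic component
error = rng.normal(loc=0.0, scale=4.0, size=100_000)  # random error with mean 0

y = deterministic + error                 # y = deterministic component + random error

# Because E(random error) = 0, the average of y approaches the deterministic part.
print(deterministic)   # 17.0
print(y.mean())        # close to 17.0
```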
11.1: Probabilistic Models
The goal of regression analysis is to find the
straight line that comes closest to all of the points in
the scatter plot simultaneously.
[Scatter plot of Attendance versus Week with the fitted line y = -0.192x + 150.3]
11.1: Probabilistic Models

A First-Order Probabilistic Model

y = β0 + β1x + ε

where
  y = dependent variable
  x = independent variable
  E(y) = β0 + β1x = deterministic component
  ε = random error component
  β0 = y-intercept of the line
  β1 = slope of the line
11.1: Probabilistic Models
β0, the y-intercept, and β1, the slope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters.
11.2: Fitting the Model: The Least Squares Approach

Step 1: Hypothesize the deterministic component of the probabilistic model:

E(y) = β0 + β1x

Step 2: Use sample data to estimate the unknown parameters in the model.
11.2: Fitting the Model: The Least Squares Approach

[Scatter plot of Total Offering versus Average Offering with a fitted line. Values on the line are the predicted values of total offerings given the average offering; the distances between the scattered dots and the line are the errors of prediction.]
11.2: Fitting the Model: The Least Squares Approach

[The same Total Offering scatter plot.] The line's estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares.
11.2: Fitting the Model: The Least Squares Approach

Model:      y = β0 + β1x + ε
Estimates:  ŷ = β̂0 + β̂1x
Deviation:  yi - ŷi = yi - (β̂0 + β̂1xi)
SSE:        Σ[yi - (β̂0 + β̂1xi)]²
11.2: Fitting the Model: The Least Squares Approach

The least squares line ŷ = β̂0 + β̂1x is the line that has the following two properties:

1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.
11.2: Fitting the Model: The Least Squares Approach

Formulas for the Least Squares Estimates

Slope:       β̂1 = SSxy/SSxx

where  SSxy = Σ(xi - x̄)(yi - ȳ) = Σxiyi - (Σxi)(Σyi)/n
       SSxx = Σ(xi - x̄)² = Σxi² - (Σxi)²/n

y-intercept: β̂0 = ȳ - β̂1x̄
11.2: Fitting the Model: The Least Squares Approach

Can home runs be used to predict errors? Is there a relationship between the number of home runs a team hits and the quality of its fielding?
11.2: Fitting the Model: The Least Squares Approach

Home Runs (x)   Errors (y)   xi²              xi·yi
158             126          24,964           19,908
155             87           24,025           13,485
139             65           19,321           9,035
191             95           36,481           18,145
124             119          15,376           14,756
Σxi = 767       Σyi = 492    Σxi² = 120,167   Σxiyi = 75,329
11.2: Fitting the Model: The Least Squares Approach

Formulas for the Least Squares Estimates

Slope:

β̂1 = SSxy/SSxx = [Σxiyi - (Σxi)(Σyi)/n] / [Σxi² - (Σxi)²/n]
    = [75,329 - (767)(492)/5] / [120,167 - (767)²/5]
    = -143.8/2509.2
    = -.0573

y-intercept:

β̂0 = ȳ - β̂1x̄ = 98.4 - (-.0573)(153.4) = 107.2
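The same estimates can be reproduced in a few lines of code. The sketch below (Python/NumPy, using the five-team data from the table above) computes SSxy, SSxx, and the two least squares estimates:

```python
# Least squares estimates for the home runs (x) / fielding errors (y) example.
import numpy as np

x = np.array([158, 155, 139, 191, 124], dtype=float)  # home runs
y = np.array([126, 87, 65, 95, 119], dtype=float)     # errors

n = len(x)
ss_xy = (x * y).sum() - x.sum() * y.sum() / n   # SSxy = -143.8
ss_xx = (x ** 2).sum() - x.sum() ** 2 / n       # SSxx = 2509.2

beta1_hat = ss_xy / ss_xx                       # slope, about -.0573
beta0_hat = y.mean() - beta1_hat * x.mean()     # intercept, about 107.2

print(beta1_hat, beta0_hat)
```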
11.2: Fitting the Model: The Least Squares Approach

β̂1 = -.0573 and β̂0 = ȳ - β̂1x̄ = 98.4 - (-.0573)(153.4) = 107.2

These results suggest that teams which hit more home runs are (slightly) better fielders (maybe not what we expected). There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.
11.3: Model Assumptions

Assumptions

1. The mean of the probability distribution of ε is 0.
2. The variance, σ², of the probability distribution of ε is constant.
3. The probability distribution of ε is normal.
4. The values of ε associated with any two values of y are independent.
11.3: Model Assumptions

- The variance, σ², is used in every test statistic and confidence interval used to evaluate the model.
- Invariably, σ² is unknown and must be estimated.
11.3: Model Assumptions

Estimation of σ² for a (First-Order) Straight-Line Model

s² = SSE/(n - 2)

where SSE = Σ(yi - ŷi)² = SSyy - β̂1SSxy,
with SSyy = Σ(yi - ȳ)² and SSxy = Σ(xi - x̄)(yi - ȳ).

The estimate of the standard deviation σ is the square root of the variance estimate:

s = √s² = √[SSE/(n - 2)]
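Continuing the code sketch from Section 11.2 (same arrays x and y and the estimates beta1_hat and beta0_hat), the estimate of σ for the example follows directly from the residuals:

```python
# Estimate sigma^2 and sigma from the residuals of the fitted line.
y_hat = beta0_hat + beta1_hat * x           # predicted values
sse = ((y - y_hat) ** 2).sum()              # SSE, about 2435

ss_yy = (y ** 2).sum() - y.sum() ** 2 / n   # SSyy = 2443.2
print(ss_yy - beta1_hat * ss_xy)            # shortcut SSE = SSyy - beta1_hat*SSxy, also about 2435

s = (sse / (n - 2)) ** 0.5                  # s, about 28.49
print(s)

# The least squares residuals sum to zero (property 1 of the least squares line).
print(round((y - y_hat).sum(), 6))          # 0.0 up to rounding
```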
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

Note: There may be many different patterns in the scatter plot when there is no linear relationship.
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

A critical step in the evaluation of the model is to test whether β1 = 0.

[Three scatter plots: a positive relationship (β1 > 0), no relationship (β1 = 0), and a negative relationship (β1 < 0)]
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

H0: β1 = 0
Ha: β1 ≠ 0

[The same three scatter plots: positive relationship (β1 > 0), no relationship (β1 = 0), and negative relationship (β1 < 0)]
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

The four assumptions described above produce a normal sampling distribution for the slope estimate:

β̂1 ~ N(β1, σβ̂1), where σβ̂1 = σ/√SSxx

and its estimate, sβ̂1 = s/√SSxx, is called the estimated standard error of the least squares slope estimate.
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

A Test of Model Utility: Simple Linear Regression

One-Tailed Test                  Two-Tailed Test
H0: β1 = 0                       H0: β1 = 0
Ha: β1 < 0 (or β1 > 0)           Ha: β1 ≠ 0

Test statistic: t = β̂1/sβ̂1 = β̂1/(s/√SSxx)

Rejection region (one-tailed): t < -tα (or t > tα)
Rejection region (two-tailed): |t| > tα/2

Degrees of freedom = n - 2
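Here is a sketch of the two-tailed test for the running example, continuing the Python computations above; scipy's t distribution supplies the critical value and p-value:

```python
# Two-tailed t-test of H0: beta1 = 0 at alpha = .05, df = n - 2 = 3.
from scipy import stats

se_beta1 = s / ss_xx ** 0.5                       # estimated standard error, about .569
t_stat = beta1_hat / se_beta1                     # about -.10

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # 3.182
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # about .93

print(t_stat, t_crit, p_value)                    # |t| < 3.182: do not reject H0
```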
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

Home Runs (x)  Errors (y)  xi - x̄   (xi - x̄)²   ŷ = E(Errors|HRs)  yi - ŷi   (yi - ŷi)²
158            126         4.6       21.16       98.14              27.86     776.4
155            87          1.6       2.56        98.31              -11.31    127.9
139            65          -14.4     207.4       99.23              -34.23    1171
191            95          37.6      1414        96.25              -1.245    1.55
124            119         -29.4     864.4       100.1              18.92     357.8
Σxi = 767      Σyi = 492             SSxx = 2509                              SSE = 2435

x̄ = 153.4

s = √[SSE/(n - 2)] = √(2435/3) = 28.49
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

s = 28.49

t = (β̂1 - 0)/sβ̂1 = -.0573/(28.49/√2509) ≈ -.10

Since the t-value does not lead to rejection of the null hypothesis, we can conclude that

- a different set of data may yield different results;
- there may be a more complicated relationship; or
- there may be no relationship (although non-rejection does not lead to this conclusion automatically).
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

Interpreting p-Values for β1

Software packages report two-tailed p-values. To conduct one-tailed tests of hypotheses, the reported p-value p must be adjusted:

Upper-tailed test (Ha: β1 > 0): p-value = p/2 if t > 0, or 1 - p/2 if t < 0
Lower-tailed test (Ha: β1 < 0): p-value = p/2 if t < 0, or 1 - p/2 if t > 0
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

A Confidence Interval on β1

β̂1 ± tα/2·sβ̂1

where the estimated standard error is sβ̂1 = s/√SSxx, and tα/2 is based on (n - 2) degrees of freedom.
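Continuing the example in code, the interval is one line once the standard error is in hand:

```python
# 95% confidence interval on the slope, df = n - 2 = 3.
t_crit = stats.t.ppf(0.975, df=n - 2)       # 3.182
half_width = t_crit * se_beta1              # about 1.81

print(beta1_hat - half_width, beta1_hat + half_width)  # roughly (-1.87, 1.75)
```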
11.4: Assessing the Utility of the Model: Making Inferences about the Slope β1

In the home runs and errors example, the estimated β1 was -.0573, and the estimated standard error was .569. With 3 degrees of freedom, tα/2 = 3.182. The confidence interval is, therefore,

β̂1 ± tα/2·sβ̂1 = -.0573 ± 3.182(.569) = -.0573 ± 1.81

which includes 0, so there may be no relationship between the two variables.
11.5: The Coefficients of Correlation and Determination

The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows:

r = SSxy/√(SSxx·SSyy)
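A sketch of the computation for the running example (ss_xy, ss_xx, and ss_yy as defined in the earlier code); NumPy's built-in corrcoef gives the same value:

```python
# Coefficient of correlation for the example.
r = ss_xy / (ss_xx * ss_yy) ** 0.5          # about -.058
print(r)
print(np.corrcoef(x, y)[0, 1])              # same value
```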
11.5: The Coefficients of Correlation and Determination

[Three scatter plots: a positive linear relationship (r → +1), no linear relationship (r ≈ 0), and a negative linear relationship (r → -1)]

Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.
11.5: The Coefficients of Correlation and Determination

In the example about home runs and errors, SSxy = -143.8 and SSxx = 2509. SSyy is computed as

SSyy = Σyi² - (Σyi)²/n = 50,856 - (492)²/5 = 2443.2

so

r = SSxy/√(SSxx·SSyy) = -143.8/√[(2509)(2443.2)] = -.058
11.5: The Coefficients of Correlation and Determination

An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on β1.
11.5: The Coefficients of Correlation and Determination

The coefficient of determination, r², represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y:

r² = (SSyy - SSE)/SSyy = 1 - SSE/SSyy,   0 ≤ r² ≤ 1
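Continuing the sketch, r² can be computed from its definition and checked against the square of r from above:

```python
# Coefficient of determination, from its definition and as r squared.
r_squared = 1 - sse / ss_yy                 # about .0034
print(r_squared, r ** 2)                    # the two agree up to rounding
```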
11.5: The Coefficients of Correlation and Determination

- Predict values of y with the mean of y if no other information is available.
- Predict values of y|x based on a hypothesized linear relationship.
- Evaluate the power of x to predict values of y with the coefficient of determination.

High r²:
- x provides important information about y
- Predictions are more accurate based on the model

Low r²:
- Knowing values of x does not substantially improve predictions of y
- There may be no relationship between x and y, or it may be more subtle than a linear relationship
11.6: Using the Model for Estimation and Prediction

Statistical inference based on the linear regression model ŷ = β̂0 + β̂1x takes two forms:

- Estimate the mean of y for a specific value of x, E(y)|x (over many experiments with this x-value).
- Estimate an individual value of y for a given x-value (for a single experiment with this value of x).
11.6: Using the Model for Estimation and Prediction

Sampling Error for the Estimator of the Mean of y | xp

σŷ (the standard error of ŷ) = σ·√[1/n + (xp - x̄)²/SSxx]

where σ is the standard deviation of the error term ε. Using s to estimate σ, a 100(1 - α)% confidence interval for the mean of y is

ŷ ± tα/2·s·√[1/n + (xp - x̄)²/SSxx]

with n - 2 degrees of freedom.
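A sketch of this interval for the running example at xp = 140 home runs, continuing the earlier code (with s = 28.49 and only n = 5 observations, the half-width is large, about 47):

```python
# 95% confidence interval for E(y) at x_p = 140 home runs.
x_p = 140.0
y_hat_p = beta0_hat + beta1_hat * x_p                      # about 99.2

se_mean = s * (1 / n + (x_p - x.mean()) ** 2 / ss_xx) ** 0.5
half_width = stats.t.ppf(0.975, df=n - 2) * se_mean        # about 47

print(y_hat_p - half_width, y_hat_p + half_width)
```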
11.6: Using the Model for Estimation and Prediction

Based on our model results, a team that hits 140 home runs is expected to make about 99.2 errors:

ŷ = 107.2 - .0573x = 107.2 - .0573(140) = 99.2

In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so

σ̂ŷ = s·√[1/n + (xp - x̄)²/SSxx] = 28.49·√[1/5 + (140 - 153.4)²/2509] = 28.49(.5211) = 14.85

A 95% confidence interval for the mean of y | x = 140 is

ŷ ± t.025,df=3·s·√[1/n + (xp - x̄)²/SSxx] = 99.2 ± 3.182(14.85) = 99.2 ± 47.2

(The interval is very wide because s is large and n is only 5.)
11.6: Using the Model for Estimation and Prediction

Sampling Error for the Predictor of an Individual Value of y | xp

σ(y-ŷ) (the standard error of prediction) = σ·√[1 + 1/n + (xp - x̄)²/SSxx]

where σ is the standard deviation of the error term ε. Using s to estimate σ, a 100(1 - α)% prediction interval for y|xp is

ŷ ± tα/2·s·√[1 + 1/n + (xp - x̄)²/SSxx]

with n - 2 degrees of freedom.
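The corresponding sketch for the example, continuing from the confidence-interval code above; the only change is the extra "1 +" under the square root:

```python
# 95% prediction interval for a single new y at x_p = 140.
se_pred = s * (1 + 1 / n + (x_p - x.mean()) ** 2 / ss_xx) ** 0.5
half_width = stats.t.ppf(0.975, df=n - 2) * se_pred        # about 102

print(y_hat_p - half_width, y_hat_p + half_width)
```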
11.6: Using the Model for Estimation and Prediction

A 95% Prediction Interval for an Individual Team's Errors

ŷ = 107.2 - .0573x = 107.2 - .0573(140) = 99.2

In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so

σ̂(y-ŷ) = s·√[1 + 1/n + (xp - x̄)²/SSxx] = 28.49·√[1 + 1/5 + (140 - 153.4)²/2509] = 28.49(1.128) = 32.13

A 95% prediction interval for y | x = 140 is

ŷ ± t.025,df=3·s·√[1 + 1/n + (xp - x̄)²/SSxx] = 99.2 ± 3.182(32.13) = 99.2 ± 102.2
11.6: Using the Model for Estimation and Prediction

Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error:

- Error in predicting a mean value of y|xp: the error in estimating E(y|xp).
- Error in predicting a specific value of y|xp: the error in estimating E(y|xp) plus the sampling error from the y population.
11.6: Using the Model for Estimation and Prediction

- Estimating y beyond the range of values associated with the observed values of x can lead to large prediction errors.
- Beyond the range of observed x-values, the relationship may look very different.

[Plot contrasting the estimated straight-line relationship with the true relationship, which curves away outside the range of observed values of x]
11.7: A Complete Example

Step 1: How does the proximity of a fire house (x) affect the damages (y) from a fire?

y = f(x)
y = β0 + β1x + ε
11.7: A Complete Example

Step 2: The data (found in Table 11.7) produce the following estimates (in thousands of dollars):

β̂0 = 10.28
β̂1 = 4.92

The estimated damages equal $10,280 plus $4,920 for each mile from the fire station, or

ŷ = 10.28 + 4.92x
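Step 2 can also be reproduced outside SAS. The sketch below uses statsmodels with the fire damage data as commonly reproduced for this textbook example (distances in miles, damages in thousands of dollars); if your copy of Table 11.7 differs, substitute those values:

```python
# OLS fit of fire damage (y) on distance to the nearest fire station (x).
import numpy as np
import statsmodels.api as sm

distance = np.array([3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0,
                     2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8])
damage = np.array([26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3,
                   19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1])

X = sm.add_constant(distance)        # adds the intercept column
model = sm.OLS(damage, X).fit()

print(model.params)                  # near (10.28, 4.92)
print(model.summary())               # includes s, r^2, the t-test, and p-values
```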
11.7: A Complete Example

Step 3: The estimate of the standard deviation, σ, of ε is s = 2.31635. Most of the observed fire damages will be within 2s ≈ 4.64 thousand dollars of the predicted value.
11.7: A Complete Example

Step 4: Test that the true slope is 0:

H0: β1 = 0
Ha: β1 ≠ 0

SAS automatically performs a two-tailed test, with a reported p-value < .0001. The one-tailed p-value, < .00005, provides strong evidence to reject the null.
11.7: A Complete Example

Step 4 (continued):

- A 95% confidence interval on β1 from the SAS output is 4.071 ≤ β1 ≤ 5.768.
- The coefficient of determination is r² = .9235.
- The coefficient of correlation is r = √r² = √.9235 = .96.
11.7: A Complete Example

Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates:

ŷ = β̂0 + β̂1x = 10.28 + 4.92(3.5) = 27.5

The 95% prediction interval is (22.324, 32.667): we're 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667.
11.7: A Complete Example

Since the x-values in our sample range from .7 to 6.1, predictions about y for x-values beyond this range will be unreliable.