ST430: Introduction to Regression Analysis
Chapter 3, Sections 3.1-3.6
Luo Xiao
August 31, 2015
Simple Linear Regression
Recall:

A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates) X1, X2, ..., Xp.
The straight-line probabilistic model

Simplest case of a regression model:

One independent variable, p = 1, X1 ≡ X;
Linear dependence.

Model equation:

E(Y) = β0 + β1 X,

or equivalently

Y = β0 + β1 X + ε.
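As a quick illustration of the probabilistic part of the model (a minimal sketch; the values β0 = 1, β1 = 0.5, σ = 0.3 and the seed are arbitrary choices, not from the text), we can simulate data from this model in R:

# simulate n = 50 observations from Y = beta0 + beta1*X + eps
set.seed(1)
n   = 50
x   = runif(n, 0, 5)                 # arbitrary design points
eps = rnorm(n, mean = 0, sd = 0.3)   # random errors
y   = 1 + 0.5 * x + eps
plot(x, y, pch = 19)
abline(a = 1, b = 0.5, col = "red")  # the true line E(Y) = 1 + 0.5 x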
Interpreting the parameters:

β0 is the intercept (so called because it is where the graph of y = β0 + β1 x meets the y-axis, x = 0);
β1 is the slope; that is, the change in E(Y) as X = x is changed to X = x + 1.

Note: if β1 = 0, X has no effect on E(Y); that will often be an interesting hypothesis to test.
Graph for a straight line

[Figure: a straight line y = β0 + β1 x, crossing the y-axis at β0, with slope β1.]
Advertising and Sales example
x = monthly advertising expenditure, in hundreds of dollars;
y = monthly sales revenue, in thousands of dollars;
β0 = expected revenue with no advertising;
β1 = expected revenue increase per $100 increase in advertising, in
thousands of dollars.
Sample data for five months:

Advertising:  1  2  3  4  5
Revenue:      1  1  2  2  4
Scatterplot

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
plot(x, y, type = "p", pch = 19,
     xlim = c(0, 5), ylim = c(0, 4),
     xlab = "Advertising expenditure", ylab = "Sales revenue")
R output: [Scatterplot of sales revenue vs. advertising expenditure.]
We could try various values of β0 and β1.

For given values of β0 and β1, we get predictions

pi = β0 + β1 xi, i = 1, 2, 3, 4, 5.

The difference between the observed value yi and the prediction pi is the residual

ri = yi − pi, i = 1, 2, 3, 4, 5.

A good choice of β0 and β1 gives accurate predictions, and generally small residuals.
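A sketch of this computation on the example data (the candidate values here are the pair β0 = −0.1, β1 = 0.7 tried on the next slide):

x  = c(1,2,3,4,5)
y  = c(1,1,2,2,4)
b0 = -0.1; b1 = 0.7   # one candidate choice of beta0 and beta1
p  = b0 + b1 * x      # predictions p_i
r  = y - p            # residuals r_i
r                     # 0.4 -0.3 0.0 -0.7 0.6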
One candidate line (β0 = −0.1, β1 = 0.7):

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
plot(x, y, type = "p", pch = 19,
     xlim = c(0, 5), ylim = c(0, 4),
     xlab = "Advertising expenditure", ylab = "Sales revenue")
abline(a = -0.1, b = 0.7, col = "red")
R output: [Scatterplot of sales revenue vs. advertising expenditure, with the candidate line drawn in red.]
Fitting the model

How to measure the overall size of the residuals?

The most common measure (but not the only possibility) is the sum of squares of the residuals:

Σ ri² = Σ (yi − pi)²
      = Σ {yi − (β0 + β1 xi)}²
      = S(β0, β1).

The least squares line is the one with the smallest sum of squares.

Note: the least squares line has the property that Σ ri = 0; Definition 3.1 (page 95) does not need to impose that as a constraint.
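A minimal R sketch of S as a function (the name S is our own choice), evaluated at the candidate line from the previous slide:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
S = function(b0, b1) sum((y - (b0 + b1 * x))^2)  # sum of squared residuals
S(-0.1, 0.7)   # 1.1 for the candidate line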
The least squares estimates of β0 and β1 are the coefficients of the least squares line.

Some algebra shows that the least squares estimates are

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²)

and

β̂0 = ȳ − β̂1 x̄.
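Applying these formulas to the example data (a quick check; it reproduces exactly the candidate line tried earlier):

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
b1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0.hat = mean(y) - b1.hat * mean(x)
c(b0.hat, b1.hat)   # -0.1 and 0.7, the same as coef(lm(y ~ x))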
Other criteria

Why square the residuals?

We could use least absolute deviations estimates, minimizing

S1(β0, β1) = Σ |yi − (β0 + β1 xi)|.

Convenience: we have equations for the least squares estimates, but to find the least absolute deviations estimates we have to solve a linear programming problem.

Optimality: least squares estimates are BLUE if the errors are uncorrelated with constant variance, and MVUE if additionally ε is normal.
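A numerical sketch of the contrast (here optim is a general-purpose optimizer used only for illustration, not a proper linear programming solver; quantreg is a third-party package whose rq() fits least absolute deviations):

S1 = function(b) sum(abs(y - (b[1] + b[2] * x)))   # sum of absolute residuals
optim(c(0, 1), S1)$par   # numerical least absolute deviations estimates
# or, if the quantreg package is installed:
# coef(quantreg::rq(y ~ x))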
Model assumptions

The least squares line gives point estimates of β0 and β1.

These estimates are always unbiased.

To use the other forms of statistical inference:

interval estimates, such as confidence intervals;
hypothesis tests;

we need some assumptions about the random errors ε.
1. Zero mean: E(εi) = 0; as noted earlier, this is not really an assumption, but a consequence of the definition ε = Y − E(Y).

2. Constant variance: V(εi) = σ²; this is a nontrivial assumption, often violated in practice.

3. Normality: εi ∼ N(0, σ²); this is also a nontrivial assumption, always violated in practice, but sometimes a useful approximation.

4. Independence: εi and εj are statistically independent for i ≠ j; another nontrivial assumption, often true in practice, but typically violated with time series and spatial data.
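In practice these assumptions are checked graphically from the residuals of the fitted model. A minimal sketch (fit is the model fitted later in these slides):

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y ~ x)
plot(fitted(fit), resid(fit))  # a fan shape suggests nonconstant variance (2)
abline(h = 0, lty = 2)
qqnorm(resid(fit))             # departures from the line flag non-normality (3)
qqline(resid(fit))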
Notes:
Assumptions 2 and 4 are the conditions under which least squares estimates
are BLUE (Best Linear Unbiased Estimators);
Assumptions 2, 3, and 4 are the conditions under which least squares
estimates are MVUE (Minimum Variance Unbiased Estimators).
Estimating σ²

Recall that σ² is the variance of εi, which we have assumed to be the same for all i.

That is,

σ² = V(εi) = V[Yi − E(Yi)] = V[Yi − (β0 + β1 Xi)], i = 1, 2, ..., n.

We observe Yi = yi and Xi = xi; if we knew β0 and β1, we would estimate σ² by

(1/n) Σ {yi − (β0 + β1 xi)}² = (1/n) S(β0, β1).
We do not know β0 and β1, but we have least squares estimates β̂0 and β̂1.

So we could use S(β̂0, β̂1) as an approximation to S(β0, β1).

But we know that S(β̂0, β̂1) < S(β0, β1), so

(1/n) S(β̂0, β̂1)

would be a biased estimate of σ².
We can show that, under Assumptions 2 and 4,

E[S(β̂0, β̂1)] = (n − 2) σ².

So

s² = S(β̂0, β̂1) / (n − 2) = Σ (yi − ŷi)² / (n − 2),

where ŷi = β̂0 + β̂1 xi, is an unbiased estimate of σ².

This is sometimes written

s² = Mean Square Error = MSE = SSE / (degrees of freedom for Error).
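Checking this in R for the example data (sigma() extracts s from a fitted lm object):

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y ~ x)
sum(resid(fit)^2) / (length(y) - 2)   # SSE/(n - 2) = 1.1/3 = 0.3667
sigma(fit)^2                          # the same value, computed by R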
Inferences about the line

We are often interested in the question of whether X has any effect on E(Y).

Since

E(Y) = β0 + β1 X,

the independent variable X has some effect whenever β1 ≠ 0.

So we need to test the null hypothesis H0: β1 = 0.
We also need to construct a confidence interval for β1, to indicate how precisely we know its value.

For both purposes, we need the standard error

σβ̂1 = σ / √SSxx,

where

SSxx = Σ (xi − x̄)².

As always, since σ is unknown, we replace it by its estimate s, to get the estimated standard error

σ̂β̂1 = s / √SSxx.
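Computing the estimated standard error directly for the example data, and comparing with the value reported by lm:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit   = lm(y ~ x)
SSxx  = sum((x - mean(x))^2)      # 10 for the example data
se.b1 = sigma(fit) / sqrt(SSxx)   # s / sqrt(SSxx), about 0.1915
summary(fit)$coefficients["x", "Std. Error"]   # matches se.b1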
A confidence interval for β1 is

β̂1 ± tα/2,n−2 × σ̂β̂1.

Note that we use the t-distribution with n − 2 degrees of freedom, because that is the degrees of freedom associated with s².

To test H0: β1 = 0, we use the test statistic

t = β̂1 / σ̂β̂1,

and reject H0 at the significance level α if

|t| > tα/2,n−2.
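A sketch reproducing both the interval and the test by hand for the example, with α = 0.05 (the results match the confint and anova output shown on the slides that follow):

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit    = lm(y ~ x)
n      = length(y)
se.b1  = sigma(fit) / sqrt(sum((x - mean(x))^2))
b1.hat = coef(fit)["x"]
t.crit = qt(1 - 0.05/2, df = n - 2)    # t_{alpha/2, n-2}
b1.hat + c(-1, 1) * t.crit * se.b1     # 95% CI: (0.0906, 1.3094)
t.stat = b1.hat / se.b1                # 3.656
2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE)   # p-value: 0.03535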
Compare the confidence interval and the hypothesis test
Note that we reject H0 if and only if the corresponding confidence interval
does not include 0.
Linear regression in R: Advertising and Sales example

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y ~ x)   # fit the simple linear regression of y on x
summary(fit)      # estimates, standard errors, t tests

Try it yourself to get the R output!
Construct a confidence interval in R:

Use the R command: confint(fit)

R output:

##                   2.5 %   97.5 %
## (Intercept) -2.12112485 1.921125
## x            0.09060793 1.309392
ANOVA table

Use the R command: anova(fit)

R output:

## Analysis of Variance Table
##
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x          1    4.9  4.9000  13.364 0.03535 *
## Residuals  3    1.1  0.3667
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
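Note that with a single predictor the F test is equivalent to the t test of H0: β1 = 0: the F value is the square of the t statistic (3.656² ≈ 13.364), and the two p-values agree. A quick check:

coef(summary(fit))["x", "t value"]^2   # 13.364, the F value above
anova(fit)[1, "F value"]               # the same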