ST430 Introduction to Regression Analysis
ST430: Introduction to Regression Analysis, Ch3, Sec 3.6-3.8
Luo Xiao
September 2, 2015
Simple Linear Regression
Acknowledgement
1. This template for R Markdown is adapted from the LaTeX template
   designed by Professor Peter Bloomfield.
2. Some notes are also adapted from Professor Peter Bloomfield’s lecture
   notes.
Linear regression in R: Advertising and Sales example
x = c(1,2,3,4,5) # x is a vector of 5 scalars
y = c(1,1,2,2,4) # y is a vector of 5 scalars
fit = lm(y~x) # lm() is the linear regression function
summary(fit) # summary of the linear regression
Try the code yourself to get the R output!
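As a cross-check on lm(), the least-squares estimates can also be computed by hand. The sketch below assumes the usual simple-regression formulas β̂1 = SSxy/SSxx and β̂0 = ȳ − β̂1 x̄; the names SSxy, SSxx, b1 and b0 are illustrative and are not part of the lecture code.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

# sums of squares about the means
SSxy = sum((x - mean(x)) * (y - mean(y)))
SSxx = sum((x - mean(x))^2)

b1 = SSxy / SSxx             # slope estimate: 0.7
b0 = mean(y) - b1 * mean(x)  # intercept estimate: -0.1

c(b0, b1)    # hand-computed estimates
coef(fit)    # estimates reported by lm(); these should agree
```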
Confidence interval in R:
Use R function: confint()
R output
##                   2.5 %   97.5 %
## (Intercept) -2.12112485 1.921125
## x            0.09060793 1.309392
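A minimal sketch of the call behind this output, using the advertising and sales data from the earlier slide. The second part rebuilds the same intervals as estimate ± t × standard error, assuming the usual t-based interval with n − 2 degrees of freedom; the names ci, est, se, tcrit and ci_by_hand are illustrative.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

ci = confint(fit)            # 95% intervals by default
ci
confint(fit, level = 0.90)   # the level argument changes the coverage

# the same 95% intervals by hand
est = coef(summary(fit))[, "Estimate"]
se = coef(summary(fit))[, "Std. Error"]
tcrit = qt(0.975, df = fit$df.residual)   # df = n - 2 = 3
ci_by_hand = cbind(est - tcrit * se, est + tcrit * se)
ci_by_hand
```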
Compare the confidence interval and the hypothesis test
Note that we reject H0 (the coefficient equals 0) in a two-sided test at
level α if and only if the corresponding 100(1 − α)% confidence interval
does not include 0.
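The equivalence can be checked numerically for the slope in the advertising and sales example. This is only a sketch at the 5% level; the names pval, ci, reject and excludes0 are illustrative.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

# two-sided p-value for H0: slope = 0
pval = coef(summary(fit))["x", "Pr(>|t|)"]

# 95% confidence interval for the slope (a 1 x 2 matrix: lower, upper)
ci = confint(fit, "x", level = 0.95)

reject = pval < 0.05                 # decision from the test
excludes0 = ci[1] > 0 | ci[2] < 0    # does the interval miss 0?
c(reject, excludes0)                 # the two answers agree
```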
ANOVA table
Use R function: anova()
R output:
## Analysis of Variance Table
##
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x          1    4.9  4.9000  13.364 0.03535 *
## Residuals  3    1.1  0.3667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
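A short sketch of the call that produces this table, again with the advertising and sales data. It also illustrates a fact specific to simple linear regression: the F statistic equals the square of the t statistic for the slope. The names tab, Fval and tval are illustrative.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

tab = anova(fit)   # ANOVA table for the fitted model
tab

Fval = tab["x", "F value"]                      # F statistic from the table
tval = coef(summary(fit))["x", "t value"]      # t statistic for the slope
c(Fval, tval^2)                                 # equal in simple linear regression
```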
How to read data in R?
Standard data format for R: "xx.Rdata" or "xx.RData"
Two simple ways:
1. Open the file directly in R or RStudio.
2. Load the data with the R function load("filename") (note: the file
   must be in the working directory, or the full path must be given).
Other file formats such as ".txt" and ".xlsx" can also be read into R
using special R functions.
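The following sketch shows a save/load round trip and a plain-text read with base R functions. The object ADSALES_demo and the file names demo.RData and demo.txt are hypothetical stand-ins written to a temporary directory; they are not the course files. Reading ".xlsx" files requires an add-on package and is not shown.

```r
# hypothetical stand-in data, written to a temporary directory
ADSALES_demo = data.frame(ADVEXP_X = 1:5, SALES_Y = c(1,1,2,2,4))

# .RData: save() writes named objects, load() restores them by name
rfile = file.path(tempdir(), "demo.RData")
save(ADSALES_demo, file = rfile)
rm(ADSALES_demo)   # remove the object from the workspace
load(rfile)        # ADSALES_demo exists again

# .txt: write.table() and read.table() handle plain-text tables
tfile = file.path(tempdir(), "demo.txt")
write.table(ADSALES_demo, file = tfile, row.names = FALSE)
from_txt = read.table(tfile, header = TRUE)
from_txt
```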
Read data in R: example
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("ADSALES.Rdata")
ADSALES #name of the loaded data
##   ADVEXP_X SALES_Y
## 1        1       1
## 2        2       1
## 3        3       2
## 4        4       2
## 5        5       4
The loaded data is a data frame
What is a data frame in R?
a matrix-like structure
each column may have a name
different columns can be of different types (numeric/factor/logical/character)
Example: a data frame in R
# vectors of different types
name = c("John","Ann","Tom") #character
sex = c("Male","Female","Male") #character
height = c(5.9,5.3,5.7) #numeric
# create a data frame here
data = data.frame(name=name, sex=sex, height=height)
names(data) # display column names of the data frame
data #display the data frame
## [1] "name"   "sex"    "height"
##   name    sex height
## 1 John   Male    5.9
## 2  Ann Female    5.3
## 3  Tom   Male    5.7
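To see the column types and pull out parts of a data frame, a few base R functions are enough. This is a small sketch on the same three-person example; the name males is illustrative. Whether character columns are stored as character or as factor depends on the R version and on the stringsAsFactors setting.

```r
name = c("John","Ann","Tom")
sex = c("Male","Female","Male")
height = c(5.9,5.3,5.7)
data = data.frame(name=name, sex=sex, height=height)

str(data)             # compact summary: one line per column with its type
sapply(data, class)   # class of each column

data$height                         # one column, as a vector
males = data[data$sex == "Male", ]  # rows selected by a logical condition
males
mean(data$height)                   # functions apply to a column directly
```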
Working with data frames in R
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("ADSALES.Rdata")
x = ADSALES$ADVEXP_X # x is the advertising expenditure
y = ADSALES$SALES_Y # y is the sales revenue
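Instead of copying columns into x and y, lm() can take the column names in the formula together with a data argument. In the sketch below the data frame is typed in by hand as a stand-in for load("ADSALES.Rdata"), using the values printed on the earlier slide, so the block runs without the course file.

```r
# stand-in for load("ADSALES.Rdata"): same values as the printed data frame
ADSALES = data.frame(ADVEXP_X = c(1,2,3,4,5), SALES_Y = c(1,1,2,2,4))

# columns are referred to by name; data= says where to find them
fit = lm(SALES_Y ~ ADVEXP_X, data = ADSALES)
coef(fit)      # same estimates as lm(y~x) with the extracted vectors
summary(fit)
```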
Useful measures of linear regression
Coefficient of correlation
Coefficient of determination
Coefficient of correlation
The regression equation
Y = β0 + β1 X + ε shows the linear relationship between X and Y.
The correlation coefficient r measures the strength of that relationship.
r always lies between -1 and +1.
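Calculate r directly as
$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
  = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}.
$$
Note that
$$
\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}.
$$
Hence, calculate r from β̂1 as
$$
r = \hat{\beta}_1 \times \sqrt{\frac{SS_{xx}}{SS_{yy}}}.
$$
Note that r always has the same sign as β̂1.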
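Both routes to r can be sketched in a few lines for the advertising and sales data. The names SSxy, SSxx, SSyy, r_direct, b1 and r_from_b1 are illustrative; cor() is included only as a check.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)

# sums of squares about the means
SSxy = sum((x - mean(x)) * (y - mean(y)))   # 7
SSxx = sum((x - mean(x))^2)                 # 10
SSyy = sum((y - mean(y))^2)                 # 6

r_direct = SSxy / sqrt(SSxx * SSyy)   # r from the definition

b1 = SSxy / SSxx                      # slope estimate
r_from_b1 = b1 * sqrt(SSxx / SSyy)    # r from the slope

c(r_direct, r_from_b1, cor(x, y))     # all three agree
```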
Calculation of correlation in R
Use R function: cor()
x = c(1,2,3,4,5) # x is a vector of 5 scalars
y = c(1,1,2,2,4) # y is a vector of 5 scalars
cor(x,y) # cor() calculates correlation
## [1] 0.9036961
Correlation and causation
Not the same thing!
A 1999 article in the journal Nature found “a strong association between
myopia and night-time ambient light exposure during sleep in children
before they reach two years of age”.
The article noted that no causal link was established, but continued “it
seems prudent that infants and young children sleep at night without
artificial lighting in the bedroom”.
Much anguish for parents of myopic children!
Later studies found that myopic parents tend to leave the light on, and also
tend to have myopic children.
One study, in particular, found that “the proportion of myopic children in
those subjected to a range of nursery-lighting conditions is remarkably
uniform”.
This suggests that the association observed in the first study resulted from
parental behavior and inheritance, not from a causal effect of night-time
lighting.
The moral: “Correlation does not imply causation”.
Coefficient of determination
The coefficient of determination R² also measures the strength of the
relationship between x and y.
With only one independent variable, R² = r².
When we have more than one independent variable, R² measures the
strength of the relationship of y to all of them.
The correlation coefficient r is always computed between a pair of
individual variables.
We interpret R² as the fraction of the variance of y that is “explained” by
the regression.
The definition is
$$
R^2 = \frac{SS_{yy} - SSE}{SS_{yy}}
    = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.
$$
If the regression is strong, we expect ŷi to be a good predictor of yi, so
SSE is much smaller than SSyy, whence the ratio SSE/SSyy is small and R² is
close to 1.
Conversely, if the regression is weak, ŷi is not much better than ȳ as a
predictor of yi, so the ratio SSE/SSyy is close to 1 and R² is close to 0.
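The definition can be applied directly to the advertising and sales data. In this sketch, yhat, SSE, SSyy and R2 are illustrative names; fitted() returns the fitted values ŷi from the lm() fit.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

yhat = fitted(fit)             # fitted values from the regression
SSE = sum((y - yhat)^2)        # residual sum of squares: 1.1
SSyy = sum((y - mean(y))^2)    # total sum of squares about the mean: 6

R2 = 1 - SSE / SSyy            # coefficient of determination
R2
```

Here SSE/SSyy = 1.1/6, so R² is about 0.82: the regression accounts for roughly 82% of the variation in y.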
Find R² in R output
x = c(1,2,3,4,5) # x is a vector of 5 scalars
y = c(1,1,2,2,4) # y is a vector of 5 scalars
fit = lm(y~x) # lm() is the linear regression function
summary(fit) # summary of the linear regression
Try the above code in R!
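In the printed summary, R² appears on the line labelled "Multiple R-squared". As a sketch, the value can also be pulled out of the summary object and compared with the squared correlation; the name s is illustrative.

```r
x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y~x)

s = summary(fit)
s$r.squared    # the value printed as "Multiple R-squared"
cor(x, y)^2    # with one independent variable this is the same number
```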