
ST430 Introduction to Regression Analysis
ST430: Introduction to Regression Analysis, Chapter 5,
Sections 5.1-5.6
Luo Xiao
October 14, 2015
Model Building
Model building means finding a model that will:
provide a good fit to a set of data;
give good estimates of E (Y |X1 , X2 , . . . , Xk );
give good predictions of Y .
In some situations, such a model may not exist!
Finding the best (or least bad) model may still provide insights into the
problem at hand.
Using background information
In most situations, you can use knowledge of the context to guide
model-building.
Example: summertime daily peak electricity demand and temperature:
Example 5.2 in textbook;
Y : Peak load (unit: megawatts);
X : Temperature (unit: degrees Fahrenheit);
Scatter plot on next slide.
The temperature-dependent component of demand is largely for air-conditioning.
[Figure: Scatter plot of electricity data; LOAD against TEMP.]
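“Cooling degree days”:
\[
\text{degree days} =
\begin{cases}
T - T_0 & \text{if } T > T_0, \\
0 & \text{if } T \le T_0,
\end{cases}
\]
where $T_0$ is the “base temperature”.
Simple model:
\[
\begin{aligned}
E(Y) &= \text{temperature-independent load} + \text{degree day load} \\
     &= \beta_0 + \beta_1 \max(T - T_0, 0).
\end{aligned}
\]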
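Try T0 = 80. R code (output in “output1.txt”):
T0 = 80  # base temperature (degrees Fahrenheit)
# pmax() takes the elementwise maximum, so the regressor is max(TEMP - T0, 0)
fit = lm(LOAD ~ pmax(TEMP - T0, 0), data = POWERLOADS)
summary(fit)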
[Figure: Model fit with T0 = 80; LOAD against TEMP.]
What is the best T0?
We search over all candidate values of T0 and find the one that gives the highest adjusted R-squared, $R_a^2$.
[Figure: Adjusted R-squared against T0, for T0 from 70 to 100.]
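The search over T0 can be sketched as a simple grid search. This is only an illustration: the data frame `sim` is simulated (it is not the textbook POWERLOADS data), and the helper `adj_r2` and the grid `T0_grid` are hypothetical names introduced here.

```r
set.seed(430)
# Simulated stand-in for POWERLOADS (NOT the textbook data): flat load below
# a true base temperature of 82, rising linearly above it
sim <- data.frame(TEMP = round(runif(60, 68, 100)))
sim$LOAD <- 100 + 4 * pmax(sim$TEMP - 82, 0) + rnorm(60, sd = 5)

# Adjusted R-squared of the degree-day model for one candidate base temperature
adj_r2 <- function(T0, data) {
  fit <- lm(LOAD ~ pmax(TEMP - T0, 0), data = data)
  summary(fit)$adj.r.squared
}

# Evaluate every candidate on the grid and keep the best one
T0_grid <- 70:92
r2a <- sapply(T0_grid, adj_r2, data = sim)
best_T0 <- T0_grid[which.max(r2a)]
best_T0
```

With the real data, replacing `sim` by `POWERLOADS` and plotting `r2a` against `T0_grid` should give a curve like the one on the slide.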
[Figure: Model fit with T0 = 82; LOAD against TEMP.]
A different, smoother, model (but linear in the parameters):
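\[ E(Y) = \beta_0 + \beta_1 T + \beta_2 T^2 \]
R code for fitting the model:
# I() makes ^ mean arithmetic squaring rather than formula syntax
fit = lm(LOAD ~ TEMP + I(TEMP^2), data = POWERLOADS)
Output in “output2.txt”. Note the improvement in adjusted R-squared, $R_a^2$.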
[Figure: Model fit with the quadratic model; LOAD against TEMP.]
Models with two (or more) quantitative variables
First order model:
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2. \]
Second order models:
Interaction model:
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2; \]
Complete second order model:
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \beta_4 X_1^2 + \beta_5 X_2^2. \]
Which to use?
Consider the response surface: a graph of E (Y ) against X1 and X2 :
For the first order model, the response surface is a plane.
For second order models, the response surface is quadratic; it may be
bowl-shaped, or an inverted bowl, or have a saddle point.
Note that
\[ X_1 X_2 = \frac{(X_1 + X_2)^2 - (X_1 - X_2)^2}{4}, \]
so the response surface for the interaction model (with $\beta_3 \neq 0$) is always saddle-shaped.
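To spell out why the identity implies a saddle, one can rotate the axes. The variables $u$ and $v$ below are introduced only for this illustration:

```latex
\[
u = \frac{X_1 + X_2}{2}, \qquad v = \frac{X_1 - X_2}{2}
\quad\Longrightarrow\quad
X_1 X_2 = u^2 - v^2 ,
\]
\[
\beta_3 X_1 X_2 = \beta_3 u^2 - \beta_3 v^2 .
\]
```

For $\beta_3 > 0$ the surface curves upward along $u$ and downward along $v$ (and the reverse for $\beta_3 < 0$). The linear terms $\beta_1 X_1 + \beta_2 X_2$ shift and tilt the surface but do not change its curvature, so the shape is a saddle whenever $\beta_3 \neq 0$.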
Product quality data
Many products are produced using chemicals. The quality of a product
depends on the temperature and pressure at which the chemical reactions
take place.
Example 5.3 in textbook
Y : quality (percentage)
X1 : temperature
X2 : pressure
Visualization
Scatter plots (next slide)
3-dimensional plots (see R code in file "3d_rgl.R")
[Figure: Scatter plots of QUALITY against TEMP and QUALITY against PRESSURE.]
Model fit with the complete 2nd order model
Model:
\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \beta_4 X_1^2 + \beta_5 X_2^2 \]
R code:
setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets
load("PRODQUAL.Rdata")  # load the data
# TEMP*PRESSURE expands to TEMP + PRESSURE + TEMP:PRESSURE (the interaction)
fit = lm(QUALITY ~ TEMP*PRESSURE +
           I(TEMP^2) + I(PRESSURE^2), data = PRODQUAL)
summary(fit)
Fitted response surface
R code:
# Grid of (TEMP, PRESSURE) values at which to evaluate the fitted surface
X1 <- seq(80, 100, length = 40)
X2 <- seq(50, 60, length = 40)
X <- expand.grid(TEMP = X1, PRESSURE = X2)
# Predicted QUALITY at each grid point
yhat <- predict(fit, newdata = X)
library(rgl) # works only in R, not in RStudio
plot3d(X$TEMP, X$PRESSURE, yhat, pch = 20, col = "blue")
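Besides plotting, the shape of a fitted second order surface (bowl, inverted bowl, or saddle) can be read from the signs of the eigenvalues of its matrix of second derivatives. The sketch below uses simulated data, not the textbook PRODQUAL values; `d`, `fit_sim`, `H`, `shape` and `stationary` are names made up for this illustration.

```r
# Simulated stand-in for the product quality data (NOT the textbook values):
# a 3 x 3 design with 3 replicates and a true peak at TEMP = 90, PRESSURE = 55
set.seed(1)
d <- expand.grid(TEMP = c(80, 90, 100), PRESSURE = c(50, 55, 60))
d <- d[rep(seq_len(nrow(d)), each = 3), ]
d$QUALITY <- 90 - 0.13 * (d$TEMP - 90)^2 - 1.1 * (d$PRESSURE - 55)^2 -
  0.15 * (d$TEMP - 90) * (d$PRESSURE - 55) + rnorm(nrow(d), sd = 1)

fit_sim <- lm(QUALITY ~ TEMP * PRESSURE + I(TEMP^2) + I(PRESSURE^2), data = d)
b <- coef(fit_sim)

# Matrix of second derivatives (Hessian) of the fitted surface
H <- matrix(c(2 * b["I(TEMP^2)"], b["TEMP:PRESSURE"],
              b["TEMP:PRESSURE"], 2 * b["I(PRESSURE^2)"]), nrow = 2)
ev <- eigen(H, symmetric = TRUE)$values

# Both negative: inverted bowl; both positive: bowl; mixed signs: saddle
shape <- if (all(ev < 0)) "inverted bowl" else if (all(ev > 0)) "bowl" else "saddle"

# Stationary point: set both partial derivatives of the fitted surface to zero
stationary <- solve(H, -c(b["TEMP"], b["PRESSURE"]))
shape
stationary
```

The same steps should apply to the real fit by using `coef(fit)` in place of `coef(fit_sim)`.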
Coded variables
Some variables can be represented on different scales.
E.g., temperature in degrees Celsius or Fahrenheit.
Suppose some response Y is modeled as a linear function of temperature:
E (Y ) = β0 + β1 X ,
with X = temperature in degrees Fahrenheit.
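If $X^*$ = temperature in degrees Celsius, then $X = 32 + 1.8X^*$.
So
\[
\begin{aligned}
E(Y) &= \beta_0 + \beta_1 (32 + 1.8X^*) \\
     &= (\beta_0 + 32\beta_1) + (1.8\beta_1) X^* \\
     &= \beta_0^* + \beta_1^* X^*,
\end{aligned}
\]
where
\[ \beta_0^* = \beta_0 + 32\beta_1 \quad\text{and}\quad \beta_1^* = 1.8\beta_1. \]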
So if Y is linearly related to $X$, then it is also linearly related to $X^*$, with
different coefficients $\beta_0^*$ and $\beta_1^*$. Furthermore, the statistical significance
of the effect of $X$ remains the same: the t statistic for the slope is unchanged.
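The effect of changing units can be checked numerically. The following is a small sketch on simulated data; the variables `temp_f`, `temp_c` and `y` are hypothetical and not from the textbook.

```r
set.seed(1)
# Simulated data (hypothetical): response linear in Fahrenheit temperature
temp_f <- runif(30, 60, 100)
y <- 5 + 0.3 * temp_f + rnorm(30)
temp_c <- (temp_f - 32) / 1.8   # the same temperatures in Celsius

fit_f <- lm(y ~ temp_f)
fit_c <- lm(y ~ temp_c)
b <- unname(coef(fit_f))        # beta0, beta1 (Fahrenheit scale)
b_star <- unname(coef(fit_c))   # beta0*, beta1* (Celsius scale)

# Coefficients transform as beta0* = beta0 + 32 beta1 and beta1* = 1.8 beta1
all.equal(b_star[1], b[1] + 32 * b[2])
all.equal(b_star[2], 1.8 * b[2])

# The t statistic for the slope does not depend on the units
t_f <- summary(fit_f)$coefficients["temp_f", "t value"]
t_c <- summary(fit_c)$coefficients["temp_c", "t value"]
all.equal(t_f, t_c)
```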
We sometimes code variables to make an equation more easily interpreted.
When a variable takes only two distinct values, we often code them as −1
and +1.
E.g., if $X$ is temperature with levels 80°F and 100°F, and
\[ X^* = \frac{X - 90}{10}, \]
then $X^* = -1$ when $X = 80$, and $X^* = +1$ when $X = 100$.
A variable with three levels can similarly be coded as −1, 0, and +1,
provided the three levels are equally spaced.
The interpretation of the corresponding coefficient $\beta^*$ is, as always, the
change in $E(Y)$ when $X^*$ changes by 1, with all other variables fixed.
But with a variable coded like this, a change of 1 in $X^*$ means moving, say,
from the midpoint value to the high value.
The corresponding change in E (Y ) is often called the effect of the
variable.
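As a worked illustration, assume three equally spaced temperature levels 80, 90 and 100 (an assumed extension of the two-level example) coded as $X^* = (X - 90)/10$, so that $X = 90 + 10X^*$. Substituting into a first order model in $X$ gives:

```latex
\[
\begin{aligned}
E(Y) &= \beta_0 + \beta_1 (90 + 10X^*) \\
     &= (\beta_0 + 90\beta_1) + (10\beta_1) X^* \\
     &= \beta_0^* + \beta_1^* X^* .
\end{aligned}
\]
```

Under this coding, $\beta_0^* = \beta_0 + 90\beta_1$ is $E(Y)$ at the midpoint level, and $\beta_1^* = 10\beta_1$ is the change in $E(Y)$ when moving from the midpoint (90) to the high level (100).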
When a variable takes more than two or three values, it is sometimes
standardized:
\[ X_i^* = u_i = \frac{X_i - \bar{X}}{s_x}. \]
All coefficients are then in the units of Y , so they can be compared
numerically.
If Y is also standardized, the coefficients are dimensionless.
These are called standardized regression coefficients, and are widely
used in some fields.
Despite what the textbook says, with modern algorithms standardization has no
effect on computational errors.
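A short sketch of standardized regression coefficients, again on simulated data: the predictors `x1` and `x2` and their scales are invented for illustration only.

```r
set.seed(2)
n <- 50
x1 <- rnorm(n, mean = 50, sd = 10)    # predictor on a large scale
x2 <- rnorm(n, mean = 0.5, sd = 0.1)  # predictor on a small scale
y <- 3 + 0.2 * x1 + 15 * x2 + rnorm(n)

# Fit on the original scales
fit_raw <- lm(y ~ x1 + x2)

# scale() centers each column and divides by its sample standard deviation
z <- as.data.frame(scale(cbind(y = y, x1 = x1, x2 = x2)))
fit_std <- lm(y ~ x1 + x2, data = z)

b_raw <- unname(coef(fit_raw)[-1])  # raw slopes
b_std <- unname(coef(fit_std)[-1])  # standardized (dimensionless) slopes

# Each standardized slope equals the raw slope times s_x / s_y
manual <- b_raw * c(sd(x1), sd(x2)) / sd(y)
all.equal(b_std, manual)
```

The raw slopes differ widely in size only because of the units of `x1` and `x2`; the standardized slopes are on a common scale and can be compared directly.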