Lecture 6 Polynomial Regression

Polynomial regression models are useful when the true
curvilinear response function is a polynomial function or
can be well approximated by a polynomial function.
Introduction
Polynomial regression can be considered as a special case
of the general linear model where $X_1 = X$, $X_2 = X^2$, ..., $X_{p-1} = X^{p-1}$.
The model for polynomial regression is then
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_{p-1} X_i^{p-1} + \epsilon_i, \qquad (1)$$
where $\epsilon_i \sim N(0, \sigma^2)$ for $i = 1, 2, \dots, n$.
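To make the "special case of the general linear model" concrete, here is a minimal sketch on simulated data. The names x and y and the coefficient values are illustrative assumptions, not part of the lecture. It shows that the powers of X simply become columns of the design matrix, so the usual least-squares solution applies.

```r
# Simulated data; all names and values here are hypothetical.
set.seed(1)
x <- runif(50, 0.5, 1.2)
y <- 1 + 2 * x - 3 * x^2 + rnorm(50, sd = 0.1)

# A quadratic polynomial regression (p = 3 parameters).
fit <- lm(y ~ x + I(x^2))

# The design matrix has an intercept column plus one column per power of x.
X <- model.matrix(fit)
head(X)

# Solving the normal equations gives the same estimates as lm().
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
cbind(lm = coef(fit), normal_eq = drop(beta_hat))
```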
How can we decide the order of the polynomial, i.e.
determine a value for p?
Determining the Order of the Polynomial
Step 1 Fit a simple linear regression and test the hypothesis that
β1 = 0. If the test is significant, continue.
Step 2 Fit the quadratic term and test the hypothesis that β2 = 0
given that β1 has already been fitted. Continue if the test is
significant.
Step 3 Continue fitting terms until a non-significant result is
obtained.
Step 4 Check diagnostics to see if the model provides an
adequate fit to the data.
Step 5 If the fit is adequate, accept the current model. If the fit is
not adequate, continue adding terms and repeat steps three, four and five.
(Note: If an adequate model is not obtained after fitting a
moderate number of terms, it may be that a different type of model
is needed.)
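Steps 1 to 3 can be sketched as a loop over nested models. The helper choose_degree below is hypothetical (it is not from the lecture), the data are simulated, and the diagnostic checks in Steps 4 and 5 are deliberately left to the analyst.

```r
# Hypothetical helper: add powers of x one at a time and stop at the
# first term whose F-test (given all lower-order terms) is not significant.
choose_degree <- function(x, y, alpha = 0.05, max_degree = 8) {
  fit_prev <- lm(y ~ 1)  # intercept-only model
  for (d in 1:max_degree) {
    fit_d <- lm(y ~ poly(x, d, raw = TRUE))
    # Nested-model F-test for the single term of degree d
    p_value <- anova(fit_prev, fit_d)[["Pr(>F)"]][2]
    if (p_value >= alpha) return(d - 1)
    fit_prev <- fit_d
  }
  max_degree
}

# Simulated cubic relationship (illustrative values only).
set.seed(42)
x <- seq(-2, 2, length.out = 100)
y <- 1 + x - 2 * x^2 + 0.5 * x^3 + rnorm(100, sd = 0.2)
deg <- choose_degree(x, y)
deg
```

In the ethanol example that follows, a rule based on Steps 1 to 3 alone would stop at the quadratic, because the cubic term is not significant. Steps 4 and 5 are what push the fit to higher degrees.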
Example: In an industrial experiment, the nitric oxide
(NOx) emissions from the exhaust of an experimental
one-cylinder engine using ethanol as a fuel were measured.
The variable E gives the ratio of air to ethanol in the mix.
[Figure: scatter plot of NOx against E]
What does this plot suggest?
To begin we fit the linear and quadratic terms.
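# Fit the quadratic model (linear and quadratic terms in E).
# Assumes a data frame `ethanol` with columns NOx and E is already loaded;
# the lattice package ships a data set of this name, but that it is the
# one used here is an assumption.
ethanol2.lm <- lm(NOx ~ E + I(E^2), data = ethanol)
print(anova(ethanol2.lm))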
#########################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value Pr(>F)
E          1    1.2     1.2    3.97  0.049
I(E^2)     1   85.8    85.8  295.00 <2e-16
Residuals 85   24.7     0.3
(Note: In R we can add additional polynomial terms by using
I(E^2), I(E^3), ...)
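The I() wrapper matters because, inside a model formula, ^ is a formula operator rather than arithmetic. A minimal sketch on simulated data (the names x and y are illustrative assumptions):

```r
# Simulated data; names and values are hypothetical.
set.seed(2)
x <- runif(40, 0.5, 1.2)
y <- 1 + 2 * x - 3 * x^2 + rnorm(40, sd = 0.1)

# Without I(), x^2 is read as a formula operation and collapses to x,
# so only an intercept and a slope are estimated.
fit_plain <- lm(y ~ x + x^2)
names(coef(fit_plain))

# With I(), the square is computed arithmetically and a third
# coefficient appears.
fit_I <- lm(y ~ x + I(x^2))
names(coef(fit_I))

# poly(x, 2, raw = TRUE) is another way to request the same raw powers.
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))
all.equal(fitted(fit_I), fitted(fit_poly))
```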
The F-test for predictor I(E^2) tests whether the quadratic
term is needed (H0: β2 = 0) after the linear term has been
fitted, and is clearly significant (P ≈ 0). The quadratic
term is required in the model. That is, the second-degree
polynomial provides a better fit than the linear.
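As a quick check on where the F value comes from, the statistic can be rebuilt from the table entries. This sketch uses the rounded values printed above, so it only approximately matches the reported 295.00.

```r
# Values copied from the ANOVA table for the quadratic fit (rounded).
ss_quad <- 85.8; df_quad <- 1
ss_res  <- 24.7; df_res  <- 85

# F = mean square for the term / residual mean square
f_quad <- (ss_quad / df_quad) / (ss_res / df_res)
f_quad  # close to the reported value; the gap is due to rounding

# Upper-tail probability under F(1, 85)
pf(f_quad, df_quad, df_res, lower.tail = FALSE)
```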
The next step is to examine the cubic term and if that is
required, the quartic term, and so on. To fit the cubic
polynomial we can simply add the cubic term to the
quadratic using the update command.
# update by adding cubic term to ethanol2.lm
ethanol3.lm<-update(ethanol2.lm,.~.+I(E^3))
print(anova(ethanol3.lm))
##########################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value Pr(>F)
E          1    1.2     1.2    3.96   0.05
I(E^2)     1   85.8    85.8  294.36 <2e-16
I(E^3)     1    0.2     0.2    0.82   0.37
Residuals 84   24.5     0.3
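The test for the last term in a sequential table can be cross-checked by comparing the two nested models directly, or by squaring the t statistic for that coefficient. The sketch below assumes the lecture's ethanol data frame is the one shipped with the lattice package; the names m2 and m3 are illustrative.

```r
# Assumption: `ethanol` is the data set from the lattice package.
data(ethanol, package = "lattice")

m2 <- lm(NOx ~ E + I(E^2), data = ethanol)
m3 <- update(m2, . ~ . + I(E^3))

# F for the cubic term from the sequential table of the cubic model
f_seq <- anova(m3)["I(E^3)", "F value"]
# F from the direct comparison of quadratic and cubic models
f_nested <- anova(m2, m3)[2, "F"]
# t statistic for the cubic coefficient in the cubic model
t_cubic <- summary(m3)$coefficients["I(E^3)", "t value"]

c(sequential = f_seq, nested = f_nested, t_squared = t_cubic^2)
```

All three numbers agree because each one tests the cubic term given the linear and quadratic terms, using the residual mean square of the cubic model.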
Exercise: Write the appropriate hypothesis to test the
cubic term and state your conclusion.
Since the cubic term is non-significant, we check the
adequacy of the model by plotting the residuals against
the fitted values.
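A residual plot of this kind can be sketched as follows. It is shown for the quadratic fit, which is an assumption about which model the lecture plotted; the same calls work for any of the lm objects. The data set is assumed to be the one in the lattice package.

```r
# Assumption: `ethanol` is the data set from the lattice package.
data(ethanol, package = "lattice")
ethanol2.lm <- lm(NOx ~ E + I(E^2), data = ethanol)

# Standardized residuals against fitted values
std_res <- rstandard(ethanol2.lm)
plot(fitted(ethanol2.lm), std_res,
     xlab = "Fitted Values", ylab = "Std. Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

A systematic pattern around the zero line, rather than a random scatter, indicates that the polynomial is missing structure in the data.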
[Figure: standardized residuals plotted against fitted values]
Question: What does this plot show?
Since the cubic model is not adequate we continue by
fitting the quartic term.
ethanol4.lm<-update(ethanol3.lm,.~.+I(E^4))
print(anova(ethanol4.lm))
#############################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value  Pr(>F)
E          1    1.2     1.2    8.51  0.0045
I(E^2)     1   85.8    85.8  631.85 < 2e-16
I(E^3)     1    0.2     0.2    1.75  0.1890
I(E^4)     1   13.2    13.2   97.31 1.2e-15
Residuals 83   11.3     0.1
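Note that the F value for E rose from 3.96 in the cubic table to 8.51 here even though its sum of squares is unchanged. Each F value divides by the residual mean square of the current model, which shrinks as useful terms are added. A rough check using the rounded table entries (so the results only approximate the printed F values):

```r
# Sum of squares for E is the same in both tables (rounded).
ss_E <- 1.2

# Residual mean squares of the cubic and quartic models
mse_cubic   <- 24.5 / 84
mse_quartic <- 11.3 / 83

# Approximate F values for E under each model
f_E <- c(cubic_model = ss_E / mse_cubic, quartic_model = ss_E / mse_quartic)
f_E
```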
Since the quartic term is significant the next step is to
test the significance of the 5th degree term.
ethanol5.lm<-update(ethanol4.lm,.~.+I(E^5))
print(anova(ethanol5.lm))
########################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value  Pr(>F)
E          1    1.2     1.2    9.08  0.0034
I(E^2)     1   85.8    85.8  674.03 < 2e-16
I(E^3)     1    0.2     0.2    1.87  0.1751
I(E^4)     1   13.2    13.2  103.80 3.3e-16
I(E^5)     1    0.8     0.8    6.54  0.0124
Residuals 82   10.4     0.1
What does this suggest?
Again since the 5th degree term is significant we test for
the significance of the 6th degree term.
ethanol6.lm<-update(ethanol5.lm,.~.+I(E^6))
print(anova(ethanol6.lm))
########################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value  Pr(>F)
E          1    1.2     1.2   10.26 0.00194
I(E^2)     1   85.8    85.8  762.06 < 2e-16
I(E^3)     1    0.2     0.2    2.12 0.14972
I(E^4)     1   13.2    13.2  117.36 < 2e-16
I(E^5)     1    0.8     0.8    7.40 0.00800
I(E^6)     1    1.3     1.3   11.71 0.00098
Residuals 81    9.1     0.1
What does this suggest?
Since the sixth degree term is significant, we test the 7th
degree term.
ethanol7.lm<-update(ethanol6.lm,.~.+I(E^7))
print(anova(ethanol7.lm))
########################
Analysis of Variance Table
Response: NOx
          Df Sum Sq Mean Sq F value  Pr(>F)
E          1    1.2     1.2   10.46 0.00177
I(E^2)     1   85.8    85.8  777.04 < 2e-16
I(E^3)     1    0.2     0.2    2.16 0.14587
I(E^4)     1   13.2    13.2  119.66 < 2e-16
I(E^5)     1    0.8     0.8    7.54 0.00745
I(E^6)     1    1.3     1.3   11.94 0.00088
I(E^7)     1    0.3     0.3    2.59 0.11130
Residuals 80    8.8     0.1
What does this suggest?
Comments
1. The cubic term is non-significant (after adding the
linear and quadratic terms).
2. However it appears the fourth degree term is
definitely required as are the fifth and sixth degree
terms.
3. The seventh degree term is not significant so we
conclude the sixth degree polynomial is probably
appropriate.
4. Only one order is considered in fitting polynomials: begin
with the linear term, and then enter the quadratic, cubic,
quartic, ... terms until an adequate model is found.
5. We include ALL the terms up to the sixth degree!
The linear and cubic terms are not usually omitted
just because they are non-significant. They can only
be omitted if we also have a theoretical reason to
omit them. For example if the intercept term is to be
omitted, we should first have reason to believe the
true relationship does go through the origin and then
a test of the intercept being zero must be
non-significant.
6. We must now examine the fit of the proposed model
before accepting it.
Diagnostics
[Figure: four diagnostic plots for the fitted model: Residuals vs Fitted, Normal Q-Q, Scale-Location (each labelling observations 64, 66 and 88), and Cook's distance against observation number (labelling observations 66, 87 and 88)]
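The four standard diagnostic panels (Residuals vs Fitted, Normal Q-Q, Scale-Location and Cook's distance) can be produced with the plot method for lm objects. This is a sketch that assumes the ethanol data frame is the one in the lattice package and refits the sixth-degree model so that it runs on its own.

```r
# Assumption: `ethanol` is the data set from the lattice package.
data(ethanol, package = "lattice")
ethanol6.lm <- lm(NOx ~ E + I(E^2) + I(E^3) + I(E^4) + I(E^5) + I(E^6),
                  data = ethanol)

# 2 x 2 layout; which = 1:4 selects Residuals vs Fitted, Normal Q-Q,
# Scale-Location and Cook's distance, in that order.
op <- par(mfrow = c(2, 2))
plot(ethanol6.lm, which = 1:4)
par(op)  # restore the previous graphics settings
```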
What does this suggest?
NB: There is one observation (87) with a large Cook’s
distance (approx. 1.6) and a corresponding percentile of
85%. Note that observation 87 has the lowest value of E
(0.54). Often, first and last observations are identified as
influential points.
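The Cook's distance and its percentile can be computed along the following lines. This is a sketch: it assumes the lattice version of the ethanol data, and it assumes that the percentile refers to the F distribution with p and n - p degrees of freedom, which is a common convention but is not stated in the lecture.

```r
# Assumption: `ethanol` is the data set from the lattice package.
data(ethanol, package = "lattice")
ethanol6.lm <- lm(NOx ~ E + I(E^2) + I(E^3) + I(E^4) + I(E^5) + I(E^6),
                  data = ethanol)

d <- cooks.distance(ethanol6.lm)
i <- unname(which.max(d))         # index of the most influential observation

p <- length(coef(ethanol6.lm))    # number of regression parameters
n <- nrow(ethanol)                # number of observations

# Percentile of the largest Cook's distance under F(p, n - p)
pct <- pf(d[[i]], p, n - p)
c(obs = i, E = ethanol$E[i], cooks_d = d[[i]], percentile = pct)
```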
The final model
$$\mathrm{NOx} = -1346 + 9915E - 29766E^2 + 46551E^3 - 39953E^4 + 17851E^5 - 3248E^6$$
Call:
lm(formula = NOx ~ E + I(E^2) + I(E^3) + I(E^4)
    + I(E^5) + I(E^6), data = ethanol, x = T)
Residuals:
    Min      1Q  Median      3Q     Max
-0.7252 -0.1769 -0.0302  0.1926  0.8732
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1346        390   -3.45  0.00088
E               9915       2814    3.52  0.00070
I(E^2)        -29766       8338   -3.57  0.00060
I(E^3)         46551      12981    3.59  0.00057
I(E^4)        -39953      11204   -3.57  0.00061
I(E^5)         17851       5086    3.51  0.00074
I(E^6)         -3248        949   -3.42  0.00098
To plot the fitted line and the 95% confidence bands we
need to order the data points.
# calculate predicted values
# and confidence intervals
pred <- predict(ethanol6.lm, se.fit = TRUE,
                interval = "confidence")
# pred$fit contains the fitted values and
# upper and lower bounds of the CI
fit <- pred$fit[, 1]
lower <- pred$fit[, 2]
upper <- pred$fit[, 3]
# plot data
par(mfrow = c(1, 1))
plot(NOx ~ E, data = ethanol, pch = 16, cex = 0.5)
# Note use of order() to order the values of E
# before plotting
r <- order(ethanol$E)
# lines() adds fitted curve and confidence bands
lines(ethanol$E[r], fit[r])
lines(ethanol$E[r], lower[r], lty = 2)
lines(ethanol$E[r], upper[r], lty = 2)
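An alternative to ordering the observations is to predict on an evenly spaced, already sorted grid of E values, which also gives a smoother curve. This is a sketch of that variation, not the lecture's method; the names grid and ci are illustrative, and the data set is assumed to be the one in the lattice package.

```r
# Assumption: `ethanol` is the data set from the lattice package.
data(ethanol, package = "lattice")
ethanol6.lm <- lm(NOx ~ E + I(E^2) + I(E^3) + I(E^4) + I(E^5) + I(E^6),
                  data = ethanol)

# Evenly spaced grid over the observed range of E
grid <- data.frame(E = seq(min(ethanol$E), max(ethanol$E), length.out = 200))

# Matrix with columns fit, lwr, upr (95% confidence limits for the mean)
ci <- predict(ethanol6.lm, newdata = grid,
              interval = "confidence", level = 0.95)

plot(NOx ~ E, data = ethanol, pch = 16, cex = 0.5)
# Solid line for the fit, dashed lines for the confidence bands
matlines(grid$E, ci, lty = c(1, 2, 2), col = 1)
```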
[Figure: scatter plot of NOx against E with the fitted curve and 95% confidence bands]
The fitted curve, overlaid on the scatter plot, follows the
data reasonably well. The model appears to provide an
adequate fit to the data.