Lecture 6
Polynomial Regression

Polynomial regression models are useful when the true curvilinear response function is a polynomial function or can be well approximated by a polynomial function.

Introduction

Polynomial regression can be considered a special case of the general linear model in which $X_1 = X$, $X_2 = X^2$, ..., $X_{p-1} = X^{p-1}$. The model for polynomial regression is then

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots + \beta_{p-1} X_i^{p-1} + \epsilon_i, \qquad (1)$$

where $\epsilon_i \sim N(0, \sigma^2)$ for $i = 1, 2, \dots, n$.

How can we decide the order of the polynomial, i.e. determine a value for $p$?

Determining the Order of the Polynomial

Step 1 Fit a simple linear regression and test the hypothesis that $\beta_1 = 0$. If the test is significant, continue.

Step 2 Fit the quadratic term and test the hypothesis that $\beta_2 = 0$ given that $\beta_1$ has already been fitted. Continue if the test is significant.

Step 3 Continue fitting terms until a non-significant result is obtained.

Step 4 Check diagnostics to see if the model provides an adequate fit to the data.

Step 5 If the fit is adequate, accept the current model. If the fit is not adequate, repeat steps three, four and five.

Note: If an adequate model is not obtained after fitting a moderate number of terms, it may be that a different type of model is needed.

Example: In an industrial experiment the nitric oxide (NOx) emissions from the exhaust of an experimental one-cylinder engine using ethanol as a fuel were measured. The variable E gives the ratio of air to ethanol mix.

[Figure: scatter plot of NOx against E]

What does this plot suggest?

To begin we fit the linear and quadratic terms.

    # fit quadratic
    # E variable contains ethanol data
    ethanol2.lm <- lm(NOx ~ E + I(E^2), data = ethanol)
    print(anova(ethanol2.lm))
    #########################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value Pr(>F)
    E          1    1.2     1.2    3.97  0.049
    I(E^2)     1   85.8    85.8  295.00 <2e-16
    Residuals 85   24.7     0.3

(Note: In R we can add additional polynomial terms by using I(E^2), I(E^3), ...)

The F-test for the predictor I(E^2) tests the hypothesis that the quadratic term is needed after the linear term has been fitted, and it is clearly significant (P < 2e-16). The quadratic term is required in the model. That is, the second-degree polynomial provides a better fit than the linear model.

The next step is to examine the cubic term and, if that is required, the quartic term, and so on. To fit the cubic polynomial we can simply add the cubic term to the quadratic model using the update command.

    # update by adding cubic term to ethanol2.lm
    ethanol3.lm <- update(ethanol2.lm, . ~ . + I(E^3))
    print(anova(ethanol3.lm))
    ##########################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value Pr(>F)
    E          1    1.2     1.2    3.96   0.05
    I(E^2)     1   85.8    85.8  294.36 <2e-16
    I(E^3)     1    0.2     0.2    0.82   0.37
    Residuals 84   24.5     0.3

Exercise: Write the appropriate hypothesis to test the cubic term and state your conclusion.

Since the cubic term is non-significant, we check the adequacy of the model by plotting the residuals against the fitted values.

[Figure: standardized residuals against fitted values]

Question: What does this plot show?

Since the cubic model is not adequate we continue by fitting the quartic term.

    ethanol4.lm <- update(ethanol3.lm, . ~ . + I(E^4))
    print(anova(ethanol4.lm))
    #############################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value  Pr(>F)
    E          1    1.2     1.2    8.51  0.0045
    I(E^2)     1   85.8    85.8  631.85 < 2e-16
    I(E^3)     1    0.2     0.2    1.75  0.1890
    I(E^4)     1   13.2    13.2   97.31 1.2e-15
    Residuals 83   11.3     0.1

Since the quartic term is significant, the next step is to test the significance of the 5th degree term.

    ethanol5.lm <- update(ethanol4.lm, . ~ . + I(E^5))
    print(anova(ethanol5.lm))
    ########################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value  Pr(>F)
    E          1    1.2     1.2    9.08  0.0034
    I(E^2)     1   85.8    85.8  674.03 < 2e-16
    I(E^3)     1    0.2     0.2    1.87  0.1751
    I(E^4)     1   13.2    13.2  103.80 3.3e-16
    I(E^5)     1    0.8     0.8    6.54  0.0124
    Residuals 82   10.4     0.1

What does this suggest?

Again, since the 5th degree term is significant, we test for the significance of the 6th degree term.

    ethanol6.lm <- update(ethanol5.lm, . ~ . + I(E^6))
    print(anova(ethanol6.lm))
    ########################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value  Pr(>F)
    E          1    1.2     1.2   10.26 0.00194
    I(E^2)     1   85.8    85.8  762.06 < 2e-16
    I(E^3)     1    0.2     0.2    2.12 0.14972
    I(E^4)     1   13.2    13.2  117.36 < 2e-16
    I(E^5)     1    0.8     0.8    7.40 0.00800
    I(E^6)     1    1.3     1.3   11.71 0.00098
    Residuals 81    9.1     0.1

What does this suggest?

Since the sixth degree term is significant, we test the 7th degree term.

    ethanol7.lm <- update(ethanol6.lm, . ~ . + I(E^7))
    print(anova(ethanol7.lm))
    ########################
    Analysis of Variance Table

    Response: NOx
              Df Sum Sq Mean Sq F value  Pr(>F)
    E          1    1.2     1.2   10.46 0.00177
    I(E^2)     1   85.8    85.8  777.04 < 2e-16
    I(E^3)     1    0.2     0.2    2.16 0.14587
    I(E^4)     1   13.2    13.2  119.66 < 2e-16
    I(E^5)     1    0.8     0.8    7.54 0.00745
    I(E^6)     1    1.3     1.3   11.94 0.00088
    I(E^7)     1    0.3     0.3    2.59 0.11130
    Residuals 80    8.8     0.1

What does this suggest?

Comments

1. The cubic term is non-significant (after adding the linear and quadratic terms).

2. However, the fourth degree term is clearly required, as are the fifth and sixth degree terms.

3. The seventh degree term is not significant, so we conclude that the sixth degree polynomial is probably appropriate.

4. Only one order of entry is considered when fitting polynomials: begin with the linear term, then enter the quadratic, cubic, quartic, ... terms until an adequate model is found.

5. We include ALL the terms up to the sixth degree! The linear and cubic terms are not usually omitted just because they are non-significant. They can only be omitted if we also have a theoretical reason to omit them. For example, if the intercept term is to be omitted, we should first have reason to believe the true relationship does go through the origin, and then a test of the intercept being zero must be non-significant.

6. We must now examine the fit of the proposed model before accepting it.

Diagnostics

[Figure: diagnostic plots for the sixth degree model, with panels Residuals vs Fitted, Normal Q-Q, Scale-Location and Cook's distance]

What does this suggest?

NB: There is one observation (87) with a large Cook's distance (approx. 1.6) and a corresponding percentile of 85%. Note that observation 87 has the lowest value of E (0.54). Often, first and last observations are identified as influential points.

The final model

$$\mathrm{NOx} = -1346 + 9915E - 29766E^2 + 46551E^3 - 39953E^4 + 17851E^5 - 3248E^6$$

    Call:
    lm(formula = NOx ~ E + I(E^2) + I(E^3) + I(E^4) + I(E^5) + I(E^6),
        data = ethanol, x = T)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.7252 -0.1769 -0.0302  0.1926  0.8732

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    -1346        390   -3.45  0.00088
    E               9915       2814    3.52  0.00070
    I(E^2)        -29766       8338   -3.57  0.00060
    I(E^3)         46551      12981    3.59  0.00057
    I(E^4)        -39953      11204   -3.57  0.00061
    I(E^5)         17851       5086    3.51  0.00074
    I(E^6)         -3248        949   -3.42  0.00098

To plot the fitted line and the 95% confidence bands we need to order the data points by E, because lines() joins points in the order they are supplied.

    # calculate predicted values
    # and confidence intervals
    pred <- predict(ethanol6.lm, se.fit = TRUE, interval = "confidence")

    # pred$fit contains the fitted values and
    # upper and lower bounds of the CI
    fit <- pred$fit[, 1]
    lower <- pred$fit[, 2]
    upper <- pred$fit[, 3]

    # plot data
    par(mfrow = c(1, 1))
    plot(NOx ~ E, data = ethanol, pch = 16, cex = 0.5)

    # Note use of order() to order the values of E
    # before plotting
    r <- order(ethanol$E)

    # lines() adds fitted curve and confidence bands
    lines(ethanol$E[r], fit[r])
    lines(ethanol$E[r], lower[r], lty = 2)
    lines(ethanol$E[r], upper[r], lty = 2)

[Figure: scatter plot of NOx against E with the fitted sixth degree curve and 95% confidence bands]

The fitted curve, overlaid on the scatter plot, follows the data reasonably well. The model appears to provide an adequate fit to the data.
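
As a rough illustration of the sequential testing steps used in this lecture, the following R sketch fits polynomials of increasing degree to simulated data and reports the P-value for each newly added term. The simulated data, the variable names x and y, and the helper sequential_p() are assumptions made for illustration only; they are not part of the ethanol analysis. Orthogonal polynomials from poly() are used here in place of I(E^k) terms: they span the same model space, so the F-test for the added term is the same, and they tend to be numerically more stable at higher degrees.

```r
# Simulated data (hypothetical): a true quadratic with small noise.
set.seed(1)
n <- 100
x <- runif(n, 0.5, 1.3)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.05)
sim <- data.frame(x = x, y = y)

# sequential_p() is a hypothetical helper, not a function from any package.
# For d = 1, ..., max_degree it tests H0: beta_d = 0, given that all
# lower-order terms are already in the model, using the F-test that
# compares the degree d - 1 fit with the degree d fit.
sequential_p <- function(data, max_degree = 5) {
  pvals <- numeric(max_degree)
  smaller <- lm(y ~ 1, data = data)            # intercept-only model
  for (d in seq_len(max_degree)) {
    larger <- lm(y ~ poly(x, d), data = data)  # polynomial of degree d
    pvals[d] <- anova(smaller, larger)[2, "Pr(>F)"]
    smaller <- larger
  }
  data.frame(degree = seq_len(max_degree), p_value = pvals)
}

res <- sequential_p(sim)
print(res)
```

As in the ethanol example, a non-significant term at one degree does not by itself settle the order of the polynomial; the diagnostics for the candidate model still need to be checked before it is accepted.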