Homework set 10 - Due 4/26/2013
Math 3200 – Renato Feres

Preliminaries

The point of this assignment is to show how to do simple linear regression analysis in R. Not everything we describe in these preliminaries is needed for the homework problems. Much of this material is taken from Using R for Introductory Statistics, by John Verzani, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf, R Cookbook, by Paul Teetor, and Introductory Statistics with R, by Peter Dalgaard.

The simple linear regression model. The basic model for linear regression is that pairs of data (x_i, y_i) are related through the equation
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$
where β0 and β1 are constants and the ε_i are random deviations from the straight-line model; they are assumed to be independent values from a N(0, σ²) distribution. We will see later in these preliminaries how the validity of the various assumptions involved in this model of the data can be tested.

The quantity
$$ Q(\beta_0,\beta_1) = \sum \bigl( y_i - (\beta_0 + \beta_1 x_i) \bigr)^2 $$
is minimized when the coefficients β0, β1 take on the values
$$ \hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x. $$
Denoting by ŷ_i = β̂0 + β̂1 x_i the fitted values of y and defining the residuals e_i = y_i − ŷ_i, we can write the minimum value of Q, also known as the error sum of squares, or SSE, as follows:
$$ Q_{\min} = SSE = \sum_{i=1}^{n} e_i^2. $$

Example. The maximum heart rate of a person is known to be related to age. The formula y = 220 − x is often assumed, where y is maximum heart rate in beats per minute and x is the person's age in years. To test this formula empirically, 15 persons are tested for their maximum heart rate and the following data are found:

> x=c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
> y=c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)

A scatter plot of the data and the regression line are shown in the next graph. The graph is obtained using the commands:

> plot(x,y)        #make a scatter plot of the data
> abline(lm(y~x))  #plot the regression line

The key R function used is lm, which stands for linear model. The resulting graph is shown next. A linear relation between x and y is readily apparent.

[Figure: scatter plot of y (maximum heart rate) against x (age) with the fitted regression line.]

In order to find the coefficients of the regression line, use again the function:

> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   210.0485      -0.7977

From this we can read the equation of the regression line: ŷ = 210.049 − 0.798x. The coefficients can also be obtained by calling the summary of the lm function:

> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.9258 -2.5383  0.3879  3.1867  6.6242

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
x            -0.79773    0.06996  -11.40 3.85e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.578 on 13 degrees of freedom
Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
F-statistic: 130 on 1 and 13 DF,  p-value: 3.848e-08

There is a great deal of information to be interpreted here. For the moment I simply draw your attention to the column ‘Estimate’ under ‘Coefficients’, where the intercept and slope estimates are shown.

Statistical inference on β1. There are three parameters in the model: σ, β0, β1. Recall that σ² is the variance of the random error ε. The slope β1 is of particular interest since it tells how strong the influence of x on y is; the closer β1 is to zero, the smaller the effect of x on y.
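Before turning to the estimate of σ², here is a quick sketch verifying the least-squares formulas above directly in R, using the heart rate vectors x and y already entered (the names b0.hand and b1.hand are just illustrative); the results should agree with the lm coefficients shown above.

> b1.hand=sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)  #slope formula
> b0.hand=mean(y)-b1.hand*mean(x)                          #intercept formula
> c(b0.hand,b1.hand)   #about 210.05 and -0.80, matching coef(lm(y~x))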
We first need an estimate for σ². From the general theory we know that
$$ s^2 = \frac{1}{n-2} \sum e_i^2 $$
is an unbiased estimator of σ² with n − 2 degrees of freedom, where n is the sample size. The standard error for β̂1 can now be written as
$$ SE(\hat\beta_1) = \frac{s}{\sqrt{\sum (x_i - \bar x)^2}}. $$
If the null hypothesis is H0: β1 = a versus the alternative hypothesis H1: β1 ≠ a, then a test of H0 is done by means of the t-statistic
$$ t = \frac{\hat\beta_1 - a}{SE(\hat\beta_1)}. $$
From this we can find the P-value from the Student's t-distribution with n − 2 degrees of freedom.

Example. Continuing with the maximum heart rate example, we can do a test to see if a slope of −1 is correct. Let H0: β1 = −1 against the alternative hypothesis H1: β1 ≠ −1. We can create the test statistic and find the P-value by hand as follows:

> m=lm(y~x)
> e=resid(m)           #This extracts the vector of residuals e1, ..., en
> b1=(coef(m))[['x']]  #The x part of the coefficients
> n=length(x)
> s=sqrt(sum(e^2)/(n-2))
> SE=s/sqrt(sum((x-mean(x))^2))
> t=(b1-(-1))/SE       #This finds the t-statistic
> 2*pt(-t,13)          #P-value from the cdf of the t-distribution with 13 df
[1] 0.01262031

This shows that it is not very likely that the true slope is −1. The summary function automatically does a hypothesis test of H0: β1 = 0 vs. H1: β1 ≠ 0. The corresponding P-value is shown under ‘Coefficients’ (row ‘x’, last column). Note that it is extremely unlikely that the true slope is 0.

Inference on β0. We can similarly do inferences on β0. Just as for β1, the above summary includes the test for H0: β0 = 0. The estimator β̂0 of β0 is also unbiased and has standard error
$$ SE(\hat\beta_0) = s \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2}}. $$
The test statistic for H0: β0 = a is then
$$ t = \frac{\hat\beta_0 - a}{SE(\hat\beta_0)}, $$
which has a t-distribution with n − 2 degrees of freedom.

“Dissecting” summary. Let us go back and look further into the above summary of the regression analysis done by the function lm.

The first interesting information is under Residuals. This gives a superficial view of the distribution of the residuals e_i, which can be used for a quick check of the distributional assumptions in the linear model. The average of the residuals is 0 by definition, so the median should not be far from 0, and the minimum and maximum should be roughly equal in absolute value.

Then, under Coefficients, we see the regression coefficient and the intercept with their standard errors, t-tests, and P-values for the null hypotheses that β0 and β1 are zero. The symbols on the right are graphical indicators of the level of significance. The line below the table shows the definition of these indicators; for example, one star means 0.01 < p < 0.05.

Next is the line for the Residual standard error. This is the residual variation, an expression of the variation of the observations around the regression line, estimating the model parameter σ. It is the value of s defined above.

Then there is Multiple R-squared. This is the square of the sample correlation coefficient r (see textbook).

And finally, the F-statistic. This is an F test for the hypothesis that the regression coefficient is zero. This test is not really interesting in a simple (as opposed to multiple) linear regression since it just duplicates information already given. It becomes interesting when there is more than one explanatory (or predictor) variable. In fact, the F-statistic is simply the square of the t-statistic for the slope. This is true in any model with 1 degree of freedom (i.e., with a single predictor variable).
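As a quick check of this last remark, here is a sketch using the heart rate model m=lm(y~x) from the example above and the standard components returned by summary for an lm fit:

> m=lm(y~x)
> coef(summary(m))['x','t value']^2   #squared t-statistic for the slope
> summary(m)$fstatistic['value']      #F-statistic; both are about 130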
Functions of lm. I summarize here a few functions of interest that you can explore. Each of the following functions acts on m, defined by

> m=lm(y~x)

• coefficients(m) – model coefficients
• coef(m) – the same as coefficients(m)
• confint(m,level=0.99) – confidence intervals for the regression coefficients
• deviance(m) – residual sum of squares (SSE)
• fitted(m) – vector of fitted y values
• residuals(m) – vector of model residuals
• resid(m) – the same as residuals(m)
• summary(m) – the summary function already described
• vcov(m) – variance-covariance matrix of the main parameters

As an example, suppose we want a scatter plot of the residuals vs. the fitted values. The following will do it:

> m=lm(y~x)
> e=resid(m)
> yhat=fitted(m)
> plot(yhat,e)
> abline(lm(e~yhat))

The result is the graph shown next.

[Figure: scatter plot of the residuals e against the fitted values yhat, with a regression line through the residuals.]

Regressing on transformed variables. It is possible to use the linear regression method even when the model of the relationship between x and y is not given by a linear function. This is typically done by applying linear regression to functions of x and/or y. For example, if we expect y to be an exponential function of x, it makes sense to apply linear regression to log y and x. In this case we would do the above analysis starting from

> m=lm(log(y)~x)

In general, we can apply lm to two transformed variables g(y) and f(x). Even when the model relationship between x and y is linear, we may still want to transform y so as to make the hypotheses H0: β0 = 0 and H0: β1 = 0 meaningful, since these are the default null hypotheses for the regression summary function. In the maximum heart rate example we wanted to test the initial (hypothetical) model y = 220 − x. The transformed variable y − 220 + x would have, under H0, an intercept and slope equal to zero. In this case we may start our analysis with

> m=lm(y-220+x~x)

Diagnosing a linear regression. Several types of diagnostic plots are produced when calling plot(lm(y~x)). For example, to obtain a scatter plot of the residuals vs. the fitted values (compare this to Figure 10.7, page 368 of the textbook) we do:

> plot(lm(y~x),which=1)

Note that, rather than drawing a regression line as I did in the previous graph, R draws a smoothed line through the residuals as a visual aid to finding significant patterns, indicating deviation from zero mean. Ideally, the points in the residuals vs. fitted plot should be randomly scattered with no particular pattern. Another useful plot for diagnostic purposes is the normal Q-Q plot for the residuals; it should roughly fall on a line, indicating that the residuals follow a normal distribution. It is obtained by

> plot(lm(y~x),which=2)

The above two types of plots are shown in the next figure. The assumptions of normality and zero mean seem to hold tolerably.

[Figure: ‘Residuals vs Fitted’ and ‘Normal Q-Q’ diagnostic plots for lm(y ~ x).]

Prediction and confidence bands. The concepts of prediction values and intervals are discussed in Section 10.3.3 of the textbook. Fitted lines are often presented with uncertainty bands around them. There are two kinds of bands, often referred to as the “narrow” and “wide” bands. The narrow, or confidence, bands reflect uncertainty about the line itself; if there are many observations, they will be quite narrow. These bands are typically curved, becoming narrower near the center of the point cloud. The wide, or prediction, bands include the uncertainty about future observations.
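To see where the “narrow” band comes from, here is a minimal sketch of the textbook formula for a 95% confidence interval for the mean response at a single value x0, using the heart rate model m (x0 = 20 is just an illustrative age). The predict function described next does this for every point at once.

> m=lm(y~x); n=length(x)
> s=summary(m)$sigma                  #residual standard error
> x0=20
> yhat0=coef(m)[[1]]+coef(m)[[2]]*x0  #estimated mean response at x0
> half=qt(0.975,n-2)*s*sqrt(1/n+(x0-mean(x))^2/sum((x-mean(x))^2))
> c(yhat0-half,yhat0+half)            #95% confidence limits at x0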
Predicted values and prediction intervals may be extracted with the R function predict. With no arguments, it just gives the fitted values:

> predict(m)
       1        2        3        4        5        6        7        8        9
195.6894 191.7007 190.1053 182.1280 158.1962 166.9712 182.9258 165.3758 152.6121
      10       11       12       13       14       15
194.8917 191.7007 176.5439 195.6894 178.9371 180.5326

If you add to predict the argument interval='confidence' or interval='prediction', then you get the vector of predicted values augmented with lower and upper limits:

> predict(m,interval='confidence')
        fit      lwr      upr
1  195.6894 191.8087 199.5700
2  191.7007 188.3520 195.0495
3  190.1053 186.9437 193.2668
4  182.1280 179.5503 184.7058
5  158.1962 153.2965 163.0959
6  166.9712 163.3843 170.5582
7  182.9258 180.3230 185.5285
8  165.3758 161.5704 169.1811
9  152.6121 146.7833 158.4410
10 194.8917 191.1235 198.6598
11 191.7007 188.3520 195.0495
12 176.5439 173.8948 179.1931
13 195.6894 191.8087 199.5700
14 178.9371 176.3712 181.5030
15 180.5326 177.9786 183.0866

Here is what to do if you want to predict new values of y (level=0.95 is the default value):

> newdata=data.frame(x=20)
> predict(m,newdata,interval='predict',level=0.95)
       fit      lwr      upr
1 194.0939 183.5492 204.6386

Plotting the two bands requires some care. Here is a way to do it:

> pred.frame=data.frame(x=18:72)
> pp=predict(m,interval='prediction',newdata=pred.frame)
> pc=predict(m,interval='confidence',newdata=pred.frame)
> plot(x,y,ylim=range(y,pp,na.rm=TRUE))
> pred.x=pred.frame$x
> matlines(pred.x,pc,lty=c(1,2,2),col='black')
> matlines(pred.x,pp,lty=c(1,3,3),col='black')
> grid()

This produced the following plot:

[Figure: scatter plot of y against x with the fitted line, the confidence band (dashed), and the prediction band (dotted).]

A few explanations for the above script: First we create a new data frame in which the x variable contains the values at which we want predictions to be made; pp and pc are then made to contain the result of predict for the new data in pred.frame, with prediction limits and confidence limits, respectively. For the plotting, we first create a standard scatter plot, except that we ensure that it has enough room for the prediction limits. This is obtained by setting ylim=range(y,pp,na.rm=TRUE). The function range returns a vector of length 2, containing the minimum and maximum value of its arguments. We need the na.rm=TRUE argument (“remove missing data”) to cause missing values to be skipped in the range computation, and y is included to ensure that points outside the prediction limits are not missed. Finally, the curves are added using the matlines (“multiple lines”) function.

Problems

1. Problem 10.6, page 388. The following data give the barometric pressure (in inches of mercury) and the boiling point (in °F) of water in the Alps. (I have imported the data under the variable name Data.)

> Data
    Pres  Temp
1  20.79 194.5
2  20.79 194.3
3  22.40 197.9
4  22.67 198.4
5  23.15 199.4
6  23.35 199.9
7  23.89 200.9
8  23.99 201.1
9  24.02 201.4
10 24.01 201.3
11 25.14 203.6
12 26.57 204.6
13 28.49 209.5
14 27.76 208.6
15 29.04 210.7
16 29.88 211.9
17 30.06 212.2

(a) Make a scatter plot of the boiling point by barometric pressure. Does the relationship appear to be approximately linear?
(b) Fit a least squares regression line. What proportion of variation in the boiling point is accounted for by linear regression on the barometric pressure?
(c) Calculate the mean square error estimate of σ.

Solutions:

(a) The following commands produce the desired scatter plot, shown next. The linear relationship is very pronounced.
> Pressure=Data[,1]
> Temperature=Data[,2]
> m=lm(Temperature~Pressure)
> plot(Temperature~Pressure)
> abline(m)
> grid()

[Figure: scatter plot of Temperature against Pressure with the fitted regression line.]

(b) The least squares line is also obtained by the above script.

(c) Recall that the mean square error estimate of σ is
$$ s = \sqrt{\frac{1}{n-2} \sum e_i^2}, $$
where the sample size is n = 17. One way to obtain s is by using the definition:

> TemperatureHat=fitted(m)
> s=sqrt((1/15)*sum((Temperature-TemperatureHat)^2))
> s
[1] 0.4440299

Therefore, the mean square error estimate of σ is s = 0.444. We can obtain the same value simply by extracting it from the summary report for lm:

> summary(lm(Temperature~Pressure))

Call:
lm(formula = Temperature ~ Pressure)

Residuals:
     Min       1Q   Median       3Q      Max
-1.22687 -0.22178  0.07723  0.19687  0.51001

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 155.29648    0.92734  167.47   <2e-16 ***
Pressure      1.90178    0.03676   51.74   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.444 on 15 degrees of freedom
Multiple R-squared: 0.9944, Adjusted R-squared: 0.9941
F-statistic: 2677 on 1 and 15 DF,  p-value: < 2.2e-16

We can read the value under Residual standard error: 0.444 on 15 degrees of freedom.

2. Problem 10.11, page 389. Refer to the Old Faithful data in Exercise 10.4.

(a) Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.
(b) Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare this CI with the PI obtained in (a).
(c) Repeat (a) if the last eruption lasted 1 minute. Do you think this prediction is reliable? Why or why not?

Solution: We begin by importing the data into two variables Last and Next. The sample size is n = 21.

(a) To find the prediction interval using R's regression analysis functions, simply call:

> m=lm(Next~Last)
> newdata=data.frame(Last=3)
> predict(m,newdata,interval='predict')
       fit     lwr      upr
1 60.38332 47.2377 73.52893

This gives the prediction interval [47.24, 73.53]. To do the same by hand we would use the values in inequality (10.19), page 362 of the textbook.

(b) To find the confidence interval for the mean of Next given that Last=3:

> m=lm(Next~Last)
> newdata=data.frame(Last=3)
> predict(m,newdata,interval='confidence')
       fit      lwr      upr
1 60.38332 57.51009 63.25654

Therefore, the confidence interval is [57.51, 63.26], which is considerably narrower than the PI found in (a). To find this interval by hand we would use inequalities (10.18), page 362 of the textbook.

(c) If Last=1:

> m=lm(Next~Last)
> newdata=data.frame(Last=1)
> predict(m,newdata,interval='predict')
       fit      lwr      upr
1 40.80318 26.33021 55.27614

Now the prediction interval is [26.33, 55.28]. We should not expect this interval to be very reliable, since the value 1 lies outside the range of observed values of Last, which is the interval [1.7, 4.9]. (See the brief discussion at the middle of page 363 of the textbook.)
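Incidentally, the “by hand” computation mentioned in part (a), based on inequality (10.19), can be sketched in R as follows (assuming the vectors Last and Next have been loaded as above); it should reproduce the interval [47.24, 73.53].

> m=lm(Next~Last); n=length(Last)
> s=summary(m)$sigma
> x0=3
> fit=coef(m)[[1]]+coef(m)[[2]]*x0
> half=qt(0.975,n-2)*s*sqrt(1+1/n+(x0-mean(Last))^2/sum((Last-mean(Last))^2))
> c(fit-half,fit+half)   #95% prediction limits at Last=3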
3. Problem 10.16, page 391. A prime number is a positive integer that has no integer factors other than 1 and itself. (1 is not regarded as a prime.) The number of primes in any given interval of whole numbers is highly irregular. However, the proportion of primes less than or equal to any given number x (denoted by p(x)) follows a regular pattern as x increases. The following table gives the number and proportion of primes for x = 10^n for n = 1, 2, ..., 10. The objective of the present exercise is to discover this pattern.

      x    No. of Primes   Proportion of Primes (p(x))
  10^1                 4                      0.400000
  10^2                25                      0.250000
  10^3               168                      0.168000
  10^4             1,229                      0.122900
  10^5             9,592                      0.095920
  10^6            78,498                      0.078498
  10^7           664,579                      0.066458
  10^8         5,761,455                      0.057615
  10^9        50,847,534                      0.050848
  10^10      455,052,512                      0.045505

(a) Plot the proportion of primes, p(x), against 10,000/x, 1000/√x, and 1/log10 x. Which relationship appears most linear?
(b) Estimate the slope of the line p(x) = β0 + β1 (1/log10 x) and show that β̂1 ≈ log10 e = 0.4343.
(c) Explain how the relationship found in (b) roughly translates into the prime number theorem: for large x, p(x) ≈ 1/loge x.

Solution:

(a) We first import the data. The variable primes will represent the number of primes for each value of x (second column of the table) and proportion the column for p(x). (R seems to interpret the values of x given in the data table provided by the textbook as strings of characters; it didn't allow me to do any calculation with them, so I'm entering the values here “by hand.”) Now define the variables and the three functions of x and plot:

> x=10^c(1:10)
> primes=c(4,25,168,1229,9592,78498,664579,5761455,50847534,455052512)
> proportion=primes/x
> f1=10000/x
> f2=1000/sqrt(x)
> f3=log(10)/log(x)   #this equals 1/log10(x)
> plot(f1,proportion)

The last line above produces the scatter plot of proportion against f1; the plots against f2 and f3 are obtained in the same way. The three graphs are shown next. It is obvious that the graph of proportion versus the reciprocal of the logarithm of x is the most linear.

[Figure: scatter plots of proportion against f1, f2, and f3; only the plot against f3 is close to linear.]

(b) We now regress proportion on f3. Here is the result (using the functions coefficients and confint):

> m=lm(proportion~f3)
> coefficients(m)
(Intercept)          f3
 0.01518797  0.40419159
> confint(m,level=0.95)
                   2.5 %     97.5 %
(Intercept) -0.002615247 0.03299118
f3           0.358967988 0.44941519

We can read from this the estimated value of the slope, β̂1 = 0.404, and the confidence interval (not asked for by the problem) [0.36, 0.45], which contains log10 e = 0.4343. The estimated value of the intercept is 0.015.

(c) The correct asymptotic relation is p(x) ≈ 1/loge x. This is a mathematical rather than statistical result. Assuming that the correct value of the slope is β1 = 0.4343 = log10 e = 1/loge 10 and the intercept is β0 = 0, then for large x
$$ p(x) \approx \beta_1 \cdot \frac{1}{\log_{10} x} = \frac{1}{\log_e 10 \cdot \log_{10} x} = \frac{1}{\log_e x}, $$
which is what the problem asks to show.
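As a numerical aside, here is a sketch (using m=lm(proportion~f3) from part (b)) comparing the theoretical slope log10 e directly with the estimate and its confidence interval:

> log10(exp(1))        #0.4342945, the theoretical slope log10(e)
> coef(m)[['f3']]      #estimated slope, about 0.404
> confint(m)['f3',]    #the 95% CI [0.36, 0.45] covers log10(e)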
4. Problem 10.20, page 393. To relate the stopping distance of a car to its speed, ten cars were tested at five different speeds, two cars at each speed. The following data were obtained.

Speed x (mph)        20    20    30    30    40    40     50     50     60     60
Stop. Dist. y (ft) 16.3  26.7  39.2  63.5  65.7  98.4  104.1  155.6  217.2  160.8

(a) Fit an LS straight line to these data. Plot the residuals against the speed.
(b) Comment on the goodness of the fit based on the overall F-statistic and the residual plot. Which two assumptions of the linear regression model seem to be violated?
(c) Based on the residual plot, what transformation of stopping distance should be used to linearize the relationship with respect to speed? A clue to finding this transformation is provided by the following engineering argument: in bringing a car to a stop, its kinetic energy is dissipated as its braking energy, and the two are roughly equal. The kinetic energy is proportional to the square of the car's speed, while the braking energy is proportional to the stopping distance, assuming a constant braking force.
(d) Make this linearizing transformation and check the goodness of fit. What is the predicted stopping distance according to this model if the car is traveling at 40 mph?

Solution:

(a) After importing the data as a table Data with two columns, we can produce the scatter plot with the regression line:

> speed=Data[,1]
> distance=Data[,2]
> m=lm(distance~speed)
> plot(speed,distance)
> abline(m)

[Figure: scatter plot of distance against speed with the fitted regression line.]

To plot the residuals against speed:

> e=residuals(m)
> plot(speed,e)

[Figure: scatter plot of the residuals e against speed.]

(b) The residual graph shows systematic deviation from the assumption e_i ~ N(0, σ²): the mean of the residuals follows a clear nonlinear curve, and the variance appears to increase with the speed. So the two assumptions that seem to be violated are the linearity of the mean response and the constancy of the error variance. The F-statistic is 58.77 with P-value 5.9 × 10^-5. The very small P-value means that we should reject the hypothesis that β1 = 0. Moreover, r² = 0.88 means that most (nearly 90%) of the variation in distance is accounted for by the regression on speed rather than random error. All of this suggests that there is a strong dependence of distance on speed, although this dependence is not well described by a linear relation.

(c) The physical argument proposed is that the kinetic energy, which is proportional to the square of the speed, must equal the energy dissipated through braking, which is proportional to the stopping distance (assuming the braking force to be approximately constant). This suggests the following relationship: distance = β0 + β1 (speed)².

(d) Based on (c) we try the regression of distance on the square of speed. The coefficients are

> speedsq=speed^2
> lm(distance~speedsq)

Call:
lm(formula = distance ~ speedsq)

Coefficients:
(Intercept)      speedsq
    1.62064      0.05174

So the fitted line for the transformed model is distance = 1.62 + 0.052 × (speed)². The graph is shown next.

[Figure: scatter plot of distance against speedsq with the fitted regression line.]

To evaluate the goodness of fit, we can look at the residual plots produced by R. The graphs below show that the residuals follow the regression model assumptions much better, although the variance still shows a clear increase with distance (or speed). The predicted stopping distance at a speed of 40 mph is 1.62064 + 0.05174 × (40)² ≈ 84.4 feet. In order to obtain the prediction interval at speed 40 mph (which the problem does not ask for) at the 95% level, we do the following:

> m=lm(distance~speedsq)
> newdata=data.frame(speedsq=1600)
> predict(m,newdata,interval='predict')
       fit      lwr      upr
1 84.40229 31.35449 137.4501

This gives the fitted value 84.40 and the (fairly wide) prediction interval [31.35, 137.45].

[Figure: ‘Residuals vs Fitted’ and ‘Normal Q-Q’ diagnostic plots for lm(distance ~ speedsq).]
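For reference, the two diagnostic panels above can be reproduced side by side with the plot method for lm objects (a sketch, assuming m=lm(distance~speedsq) as above):

> par(mfrow=c(1,2))   #two panels side by side
> plot(m,which=1)     #residuals vs fitted values
> plot(m,which=2)     #normal Q-Q plot of the standardized residuals
> par(mfrow=c(1,1))   #restore the default single-panel layout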