Problem Set 7: Answer Key
PSYCH 613, Fall 2015

1. Continuing with the Olympic data set...

The file olympics.sav on the Canvas site has the winning times for the men's 1500-meter run in the Olympics from 1900-2012. Olympics were not held in 1916, 1940, and 1944 due to wars; some countries boycotted in 1980 and 1984 (from Ryan, Joiner, and Ryan, 1985).

(a) Use a statistics package to fit a regression line for predicting winning time from year. Interpret the numerical values for the intercept and slope (i.e., what do those two numbers mean?).

SPSS Syntax

    * NOTE: All we need is the regression and /dependent lines, but I'm
    * including the other information for the other questions in this homework.
    * In practice, you'll always want to include and look at diagnostics.
    REGRESSION VARIABLES time year
      /STATISTICS R COEFF ANOVA OUTS
      /DEPENDENT time
      /METHOD ENTER
      /CASEWISE ALL PLOT(ZRESID)
      /RESIDUALS DEFAULTS
      /SCATTERPLOT (*zresid, *pred).

SPSS Output

[SPSS regression output not reproduced]

R Syntax

    lm1 <- lm(time ~ year, data=olympics)
    summary(lm1)

R Output

    Call:
    lm(formula = time ~ year, data = olympics)

    Residuals:
        Min      1Q  Median      3Q     Max
    -8.4291 -2.9810  0.0075  3.3436  5.7684

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 812.14737   45.46514   17.86 2.29e-15 ***
    year         -0.30006    0.02321  -12.93 2.63e-12 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.04 on 24 degrees of freedom
    Multiple R-squared:  0.8744, Adjusted R-squared:  0.8692
    F-statistic: 167.1 on 1 and 24 DF,  p-value: 2.632e-12

Interpretation of the intercept: At year zero (which doesn't actually exist; there's only 1 BC and 1 AD), our model predicts a winning 1500m time of about 812 seconds (even though there were no Olympics then). The intercept is an extrapolation far outside the observed years, so it has no practical meaning on its own.

Interpretation of the slope: For each additional year, the predicted winning 1500m time decreases by about 0.30 seconds, or about 1.2 seconds per four-year Olympiad.

(b) The t test for the correlation you computed in the previous problem set is identical to which t test in this regression?
What does this equivalence tell you about testing the correlation and testing this particular regression parameter?

Mathematical explanation: The t test for the correlation is equal to the observed t for the slope in a regression with one predictor (here t = -12.93 on 24 degrees of freedom). Equivalently, t = r\sqrt{n-2}/\sqrt{1-r^2} = b_1/se(b_1). This is because the linear regression decomposes the variance in a similar way to the correlation: it uses least squares to minimize the squared vertical distances between the points and the best fit line, which ends up being mathematically equivalent to what the correlation calculates.

Conceptual explanation: Both the slope and the correlation are measuring the same thing: how the two variables, time and year, change together. The correlation, however, is the standardized measure of that change. Standardized means that the units have been divided out of the estimate. If you look at the SPSS regression output, you will see that the standardized estimate of the slope (Beta) is identical to the value of the correlation.

(c) Superimpose the regression line on the scatterplot. Are there any appreciable departures from the straight line?

SPSS Syntax

    IGRAPH
      /VIEWNAME='Scatterplot'
      /X1 = VAR(year)
      /Y = VAR(time)
      /FITLINE METHOD = REGRESSION LINEAR LINE = TOTAL
      /SCATTER.

SPSS Output

[Figure: scatterplot of time against year with the fitted regression line]

R Syntax

    with(olympics, plot(year, time))
    abline(coef(lm1))
    # Alternative if you haven't run the model yet
    abline(lm(time ~ year, data=olympics))

R Output

[Figure: scatterplot of time against year with the fitted regression line]

The linear fit seems adequate, but I also notice some departures from the line. Specifically, in the more recent years the pattern of decreasing times seems to be plateauing. This is evident in the times around 1960 falling to the left of (below) the fit line, while the times around 1990 fall to the right of (above) it.

(d) Use residuals to check the normality assumption. You can use either a histogram or a normal probability plot. Explain what patterns you are checking for in the plot you use.

SPSS Syntax

I won't reprint my SPSS syntax for this since I already included the line /RESIDUALS DEFAULTS previously.
This prints a histogram of the residuals as well as the normal P-P plot of the residuals.

SPSS Output

[Figures: histogram of the residuals; normal P-P plot of the residuals]

R Syntax

    # plot() on an lm object cycles through diagnostic plots; the second is the normal Q-Q plot
    plot(lm1)

R Output

[Figure: diagnostic plots from plot(lm1), including the normal Q-Q plot]

What we see in the plots is generally consistent with the assumption of normality. In the histogram, the residuals roughly follow a normal distribution. There is a slight negative skew, but considering the small sample size it isn't much of a problem. The normal probability plots reinforce this conclusion: in both the SPSS plot and the R plot the points fall roughly along the reference line.

(e) What winning time does the regression predict for the 2016 Olympics? Compute the 95% interval around this prediction. What worry do you have about predicting to the year 2016 from data ranging from 1900-2012? FYI: The current world record is 206.0 (July, 1998).

SPSS Syntax

    * Add a new case to the dataset: year = 2016, time = 9999.
    * Flag 9999 as missing so the case gets a predicted value but is not used in the fit.
    MISSING VALUES time(9999).
    REGRESSION VARIABLES time year
      /STATISTICS R COEFF ANOVA OUTS
      /DEPENDENT time
      /METHOD ENTER
      /CASEWISE ALL PLOT(ZRESID) DEFAULT COOK SEPRED
      /RESIDUALS DEFAULTS
      /SCATTERPLOT (*zresid, *pred).

SPSS Output

[SPSS casewise output and scatterplot not reproduced]

With the SPSS graph we have to manually plot the predicted point. We can then calculate the 95% prediction interval around the estimate using the SEPRED value from SPSS.

\[ se(\hat{Y}) = \sqrt{MSE + SEPRED^2} = \sqrt{16.32 + 1.55^2} = 4.33 \]

\[ CI = E(\hat{Y}) \pm t_{\alpha/2,\,df} \cdot se(\hat{Y}) = 207.2258 \pm (2.06 \cdot 4.33) = [198.31, 216.15] \]

R Syntax

    # Find the predicted value and the 95% prediction interval
    p <- data.frame(year=2016)
    predict(lm1, p, interval='prediction')
    # Plot the observed times and the predicted value
    with(olympics, plot(year, time, xlim=c(1900, 2020), ylim=c(205, 247)))
    points(2016, 207.2258, col='red')

R Output

           fit      lwr      upr
    1 207.2258 198.2949 216.1567

I'm still somewhat worried about calculating the estimate for 2016 from these data because the pattern seems somewhat nonlinear and appears to level out around 1960, as I noted in (1c). Additionally, this estimate looks much faster than the 2005-2012 times.
It would be good to compare this to other models, such as using the square root of year as the predictor or adding a quadratic (parabolic) term. When working with nonlinear data one needs to be careful about extrapolating to data points outside the range of observation.
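The model comparison suggested above could be sketched in R roughly as follows. The olympics data are not reproduced in this key, so the block simulates a stand-in data frame; the object names (sim, year_c, fit_lin, fit_quad, new_obs) and the simulated times are assumptions made for illustration, not the real winning times. With the real file you would substitute the olympics data frame for sim.

```r
# Simulated stand-in for the olympics data frame. These are NOT the real
# winning times; the curve shape and noise level are made up for illustration.
set.seed(613)
year <- seq(1900, 2012, by = 4)
year <- year[!year %in% c(1916, 1940, 1944)]   # Games not held in these years
sim <- data.frame(
  year = year,
  time = 205 + 45 * exp(-(year - 1900) / 60) + rnorm(length(year), sd = 3)
)

# Center year so the linear and quadratic terms are less collinear
sim$year_c <- sim$year - mean(sim$year)

fit_lin  <- lm(time ~ year_c, data = sim)
fit_quad <- lm(time ~ year_c + I(year_c^2), data = sim)

# F test: does the quadratic term improve on the straight line?
print(anova(fit_lin, fit_quad))

# 95% prediction intervals for 2016 under each model
new_obs   <- data.frame(year_c = 2016 - mean(sim$year))
pred_lin  <- predict(fit_lin,  new_obs, interval = "prediction")
pred_quad <- predict(fit_quad, new_obs, interval = "prediction")
print(rbind(linear = pred_lin[1, ], quadratic = pred_quad[1, ]))
```

If the quadratic term is significant and the two 2016 predictions differ noticeably, that is a sign the straight-line extrapolation should not be trusted on its own. A quadratic is only one option, and it can also extrapolate poorly since a parabola eventually turns back upward.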