Problem Set 7: Answer Key
PSYCH 613, Fall 2015
1. Continuing with the Olympic data set... The file olympics.sav on the Canvas site has
the winning times for the men's 1500-meter run in the Olympics from 1900-2012. Olympics
were not held in 1916, 1940, and 1944 due to wars; some countries boycotted in 1980 and
1984 (from Ryan, Joiner, and Ryan, 1985).
(a) Use a statistics package to fit a regression line for predicting winning time from year.
Interpret the numerical values for the intercept and slope (i.e., what do those two numbers
mean?).
SPSS Syntax
* NOTE: All we need is the regression and /dependent lines, but I'm
  including the other information for the other questions in this homework.
* In practice, you'll always want to include and look at diagnostics.
REGRESSION VARIABLES time year
/STATISTICS R COEFF ANOVA OUTS
/DEPENDENT time
/METHOD ENTER
/CASEWISE ALL PLOT(ZRESID)
/RESIDUALS DEFAULTS
/SCATTERPLOT (*zresid, *pred).
SPSS Output
R Syntax
lm1 <- lm(time ~ year, data=olympics)
summary(lm1)
R Output
Call:
lm(formula = time ~ year, data = olympics)

Residuals:
    Min      1Q  Median      3Q     Max
-8.4291 -2.9810  0.0075  3.3436  5.7684

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 812.14737   45.46514   17.86 2.29e-15 ***
year         -0.30006    0.02321  -12.93 2.63e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.04 on 24 degrees of freedom
Multiple R-squared: 0.8744, Adjusted R-squared: 0.8692
F-statistic: 167.1 on 1 and 24 DF, p-value: 2.632e-12
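Interpretation of the intercept: At year zero (which doesn't actually exist;
there's only 1 BC and 1 AD), our model predicts a winning 1500m time of
about 812 seconds, roughly 13.5 minutes (even though they didn't have the
Olympics then). The intercept is an extrapolation far outside the observed
years, so it has no practical meaning here.
Interpretation of the slope: For each additional calendar year, the model
predicts that the winning 1500m time decreases by about 0.30 seconds, or
about 1.2 seconds from one Olympics to the next (four years).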
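To make the two interpretations concrete, here is a minimal sketch that rebuilds the fitted line from the coefficients printed in the R output. The helper name predicted_time is hypothetical and not part of the original key.

```r
# Coefficients copied from the R output above
b0 <- 812.14737   # intercept: predicted winning time (s) at year 0
b1 <- -0.30006    # slope: change in predicted time (s) per calendar year

# Hypothetical helper: predicted winning time for a given year
predicted_time <- function(year) b0 + b1 * year

# Evaluating the line at year 0 returns the intercept
predicted_time(0)

# A one-year step changes the prediction by the slope
predicted_time(1961) - predicted_time(1960)

# A four-year step (one Olympic cycle) changes it by four times the slope
change_per_games <- predicted_time(2004) - predicted_time(2000)
change_per_games
```

The years 1960, 1961, 2000, and 2004 are arbitrary; any pair with the same spacing gives the same difference because the model is linear.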
(b) The t test for the correlation you computed in the previous problem set is identical
to which t test in this regression? What does this equivalence tell you about testing the
correlation and testing this particular regression parameter?
Mathematical explanation: The t test for the correlation is equal to the
observed t for the slope in a regression with one predictor. This is because
simple linear regression decomposes the variance in the same way the
correlation does: it uses least squares to minimize the sum of squared
vertical distances between the points and the best fit line, which ends up
being mathematically equivalent to what the correlation calculates. Testing
whether the correlation is zero is therefore the same as testing whether the
slope is zero.
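As a worked check of this equivalence, the standard formula for the t test of a correlation can be evaluated with the values in the R output. This sketch assumes n = 26 (inferred from the 24 residual degrees of freedom) and takes r as the negative square root of the reported R-squared, since the slope is negative.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{b_1}{SE(b_1)},
\qquad
r = -\sqrt{0.8744} \approx -0.935,
\qquad
t \approx \frac{-0.935\,\sqrt{24}}{\sqrt{1-0.8744}} \approx -12.93
```

This agrees with the t value reported for year, and its square (about 167.1) agrees with the F-statistic.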
Conceptual explanation: Both the slope and the correlation are measuring
the same thing: how the two variables, time and year, change together. The
correlation, however, is the standardized measure of that change.
Standardized means that the units have been divided out of the estimate. If
you look at the SPSS regression output, you will see that the standardized
estimate of the slope (Beta) is identical to the value of the correlation.
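Both equivalences can be checked numerically. The sketch below uses R's built-in cars data set purely as a stand-in, since the Olympic data file is not reproduced here; with the homework data you would substitute time, year, and olympics.

```r
# Stand-in data: R's built-in `cars` data set (NOT the Olympic data)
fit <- lm(dist ~ speed, data = cars)

# t statistic for the slope in the one-predictor regression
t_slope <- summary(fit)$coefficients["speed", "t value"]

# t statistic from the test of the Pearson correlation
t_cor <- unname(cor.test(cars$speed, cars$dist)$statistic)

# Standardized slope: raw slope rescaled by sd(x) / sd(y)
beta_std <- unname(coef(fit)["speed"]) * sd(cars$speed) / sd(cars$dist)
r <- cor(cars$speed, cars$dist)

all.equal(t_slope, t_cor)   # the two t statistics agree
all.equal(beta_std, r)      # the standardized slope equals r
```

Rescaling the raw slope by sd(x) / sd(y) is one way to get the standardized estimate; fitting the regression on z-scored variables would give the same number.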
(c) Superimpose the regression line on the scatterplot. Are there any appreciable departures
from the straight line?
SPSS Syntax
IGRAPH /VIEWNAME='Scatterplot'
/X1 = VAR(year)
/Y = VAR(time)
/FITLINE METHOD = REGRESSION LINEAR LINE = TOTAL
/SCATTER.
SPSS Output
R Syntax
with(olympics, plot(year, time))
abline(coef(lm1))
# Alternative if you haven't run the model yet
abline(lm(time ~ year, data=olympics))
R Output
The linear fit seems adequate, but I also notice some departures from the
line. Specifically, in the more recent years the pattern of decreasing times
seems to be plateauing. This is evident in the times around 1960 falling to
the left of (below) the best fit line, while times around 1990 fall to the
right of (above) it.
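Departures like these are often easier to see in a plot of residuals against the predictor than in the raw scatterplot. The following is a minimal sketch of that check, again using the built-in cars data as a stand-in for the Olympic data; the lowess smooth is an added visual aid and not part of the original answer.

```r
# Stand-in data: R's built-in `cars` data set (NOT the Olympic data)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residuals against the predictor; a flat, patternless band supports linearity
plot(cars$speed, res, xlab = "speed", ylab = "residual")
abline(h = 0, lty = 2)

# A lowess smooth that bends away from zero suggests curvature
smooth <- lowess(cars$speed, res)
lines(smooth, col = "red")
```

For the homework model the equivalent calls would use olympics$year and resid(lm1).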
(d) Use residuals to check the normality assumption. You can use either a histogram or
a normal probability plot. Explain what patterns you are checking for in the plot you
use.
SPSS Syntax
I won't reprint my SPSS syntax for this since I already included the line
/RESIDUALS DEFAULTS previously. This prints a histogram of the residuals as well as the
normal P-P plot of the residuals.
SPSS Output
R Syntax
plot(lm1)
R Output
The plots are generally consistent with the assumption of normality. In the
histogram of the residuals we see that the residuals roughly follow a normal
distribution. There is a slight negative skew, but considering the small
sample size it isn't much of a problem. The normal probability plots (P-P in
SPSS, Q-Q in R) reinforce this conclusion: in both, the points roughly fall
along the reference line. Systematic curvature away from that line would
signal non-normal residuals.
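In R, plot(lm1) produces the Q-Q plot but not a histogram of the residuals. A minimal sketch of drawing both by hand follows, using the built-in cars data as a stand-in for the Olympic data.

```r
# Stand-in data: R's built-in `cars` data set (NOT the Olympic data)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Histogram: look for a roughly symmetric, single-peaked shape
hist(res, main = "Histogram of residuals", xlab = "residual")

# Normal Q-Q plot: points should fall close to the reference line
qq <- qqnorm(res)
qqline(res)
```

With the homework model, replacing fit with lm1 gives the corresponding plots for the Olympic residuals.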
(e) What winning time does the regression predict for the 2016 Olympics? Compute the
95% interval around this prediction. What worry do you have about predicting to the year
2016 from data ranging from 1900-2012?
FYI: The current world record is 206.0 (July, 1998).
SPSS Syntax
*add new data point to dataset.
*2016 9999.
MISSING VALUES time(9999).
REGRESSION VARIABLES time year
/STATISTICS R COEFF ANOVA OUTS
/DEPENDENT time
/METHOD ENTER
/CASEWISE ALL PLOT(ZRESID) DEFAULT COOK SEPRED
/RESIDUALS DEFAULTS
/SCATTERPLOT (*zresid, *pred).
SPSS Output
With the SPSS graph we have to manually plot the predicted point. We
can then calculate the 95% prediction interval around the estimate using the
SEPRED value from SPSS.
\[
se(\hat{Y}) = \sqrt{\mathrm{MSE} + \mathrm{SEPRED}^2} = \sqrt{16.32 + 1.55^2} = 4.33
\]
\[
CI = \hat{Y} \pm t_{\alpha/2,\,df} \cdot se(\hat{Y}) = 207.2258 \pm (2.06 \cdot 4.33) = [198.31,\ 216.15]
\]
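The same interval arithmetic can be reproduced in R from the reported quantities. This sketch assumes the values shown in the outputs (predicted value 207.2258, residual standard error 4.04, SEPRED 1.55, 24 degrees of freedom) and uses qt() for the critical value instead of the rounded 2.06; the variable names are illustrative.

```r
# Values copied from the SPSS and R output
fit_2016 <- 207.2258   # predicted winning time for 2016
rse      <- 4.04       # residual standard error, so MSE = rse^2
sepred   <- 1.55       # SEPRED: standard error of the predicted mean
df       <- 24         # residual degrees of freedom

# Standard error for predicting a single new observation
se_pred <- sqrt(rse^2 + sepred^2)

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df)

# 95% prediction interval: lower and upper bounds
interval <- fit_2016 + c(-1, 1) * t_crit * se_pred
interval
```

Because the critical value is not rounded here, the bounds land slightly closer to the predict() output than the hand calculation does.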
R Syntax
# Find the predicted value and its 95% prediction interval
p <- data.frame(year=2016)
predict(lm1, p, interval='prediction')
# Plot the observed times and the predicted value
with(olympics, plot(year, time,
                    xlim=c(1900, 2020),
                    ylim=c(205, 247)))
points(2016, 207.2258, col='red')
R Output
       fit      lwr      upr
1 207.2258 198.2949 216.1567
I'm still somewhat worried about calculating the estimate for 2016 from these
data because the pattern seems somewhat nonlinear and seems to level out
around 1960, as I noted in (1c). Additionally, this estimate looks
much faster than the 2005-2012 times. It would be good to compare this
to other models, such as using the square root of year as the predictor,
or adding a quadratic (parabolic) term. When working with nonlinear
data one needs to be careful about extrapolating to data points outside the
range of observation.
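One way such a comparison might look in R is sketched below, with the built-in cars data standing in for the Olympic data. The use of anova() for the nested quadratic model and AIC() for the non-nested square-root model is an assumption about how to compare them, not something specified in the key.

```r
# Stand-in data: R's built-in `cars` data set (NOT the Olympic data)
fit_lin  <- lm(dist ~ speed, data = cars)                # straight line
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)   # adds a quadratic term
fit_sqrt <- lm(dist ~ sqrt(speed), data = cars)          # square-root predictor

# F test for the added quadratic term (the linear model is nested in it)
anova(fit_lin, fit_quad)

# The square-root model is not nested in the others, so compare by AIC
aic_table <- AIC(fit_lin, fit_quad, fit_sqrt)
aic_table
```

With the homework data, year would typically be centered before squaring to reduce collinearity between year and its square, and any chosen model would still be extrapolating when asked about 2016.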