Indianapolis_SpeedData_HW2Q1.pdf

ST512
HOMEWORK 2
SSII10
Q1. Linear regression: fitting two line segments to a single set of points
Objectives
1. Run a linear regression on a subset of the data.
2. Create a new indicator variable.
3. Create a new variable that counts the number of years from an origin year.
4. Compare a linear regression fit with a simpler fit.
5. Fit segmented lines to a set of data points.
6. Select the best model.
7. Analyze the adequacy of the model and any violations of assumptions.
8. Conclusion.
The data below give the winning speeds for the Indianapolis 500 automobile
race each year from 1911 to 2010.
July 10, 2010
INDY 500 WINNING SPEEDS 1911-2010

Obs  YEAR  DRIVER   distance    Speed  FIRSTPART  RAMP
  1  1911  Ray_Har       500   74.602          1   -65
  2  1912  Joe_Daw       500   78.719          1   -64
  3  1913  Jules_G       500   75.933          1   -63
  4  1914  René_Th       500   82.474          1   -62
  5  1915  Ralph_D       500   89.840          1   -61
  6  1916  Dario_R       300   84.001          1   -60
  7  1919  Howdy_W       500   88.050          1   -57
  8  1920  Gaston_       500   88.618          1   -56
  9  1921  Tommy_M       500   89.621          1   -55
 10  1922  Jimmy_M       500   94.484          1   -54
 11  1923  Tommy_M       500   90.545          1   -53
 12  1924  Lora_L.       500   98.234          1   -52
 13  1925  Peter_D       500  101.127          1   -51
 14  1926  Frank_L      400*   95.904          1   -50
 15  1927  George_       500   97.545          1   -49
 16  1928  Louis_M       500   99.482          1   -48
 17  1929  Ray_Kee       500   97.585          1   -47
 18  1930  Billy_A       500  100.448          1   -46
 19  1931  Louis_S       500   96.629          1   -45
 20  1932  Fred_Fr       500  104.144          1   -44
 21  1933  Louis_M       500  104.162          1   -43
 22  1934  Bill_Cu       500  104.863          1   -42
 23  1935  Kelly_P       500  106.240          1   -41
 24  1936  Louis_M       500  109.069          1   -40
 25  1937  Wilbur_       500  113.580          1   -39
 26  1938  Floyd_R       500  117.200          1   -38
 27  1939  Wilbur_       500  115.035          1   -37
 28  1940  Wilbur_       500  114.277          1   -36
 29  1941  Floyd_D       500  115.117          1   -35
 30  1946  George_       500  114.820          1   -30
 31  1947  Mauri_R       500  116.338          1   -29
 32  1948  Mauri_R       500  119.814          1   -28
 33  1949  Bill_Ho       500  121.327          1   -27
 34  1950  Johnnie      345*  124.002          1   -26
 35  1951  Lee_Wal       500  126.244          1   -25
 36  1952  Troy_Ru       500  128.922          1   -24
 37  1953  Bill_Vu       500  128.740          1   -23
 38  1954  Bill_Vu       500  130.840          1   -22
 39  1955  Bob_Swe       500  128.209          1   -21
 40  1956  Pat_Fla       500  128.490          1   -20
 41  1957  Sam_Han       500  135.601          1   -19
 42  1958  Jimmy_B       500  133.719          1   -18
 43  1959  Rodger_       500  135.875          1   -17
 44  1960  Jim_Rat       500  138.767          1   -16
 45  1961  A.J._Fo       500  139.130          1   -15
 46  1962  Rodger_       500  140.293          1   -14
 47  1963  Parnell       500  143.137          1   -13
 48  1964  A.J._Fo       500  147.350          1   -12
 49  1965  Jim_Cla       500  150.686          1   -11
 50  1966  Graham_       500  144.317          1   -10
 51  1967  A.J._Fo       500  151.207          1    -9
 52  1968  Bobby_U       500  152.882          1    -8
 53  1969  Mario_A       500  156.867          1    -7
 54  1970  Al_Unse       500  155.749          1    -6
 55  1971  Al_Unse       500  157.735          1    -5
 56  1972  Mark_Do       500  162.692          1    -4
 57  1973  Gordon_    332.5*  159.063          1    -3
 58  1974  Johnny_       500  158.589          1    -2
 59  1975  Bobby_U      435*  149.213          1    -1
 60  1976  Johnny_      255*  148.725          0     0
 61  1977  A.J._Fo       500  161.331          0     0
 62  1978  Al_Unse       500  161.363          0     0
 63  1979  Rick_Me       500  158.899          0     0
 64  1980  Johnny_       500  142.862          0     0
 65  1981  Bobby_U       500  139.084          0     0
 66  1982  Gordon_       500  162.029          0     0
 67  1983  Tom_Sne       500  162.117          0     0
 68  1984  Rick_Me       500  163.612          0     0
 69  1985  Danny_S       500  152.982          0     0
 70  1986  Bobby_R       500  170.722          0     0
 71  1987  Al_Unse       500  162.175          0     0
 72  1988  Rick_Me       500  144.809          0     0
 73  1989  Emerson       500  167.581          0     0
 74  1990  Arie_Lu       500  185.981          0     0
 75  1991  Rick_Me       500  176.457          0     0
 76  1992  Al_Unse       500  134.477          0     0
 77  1993  Emerson       500  157.207          0     0
 78  1994  Al_Unse       500  160.872          0     0
 79  1995  Jacques       500  153.616          0     0
 80  1996  Buddy_L       500  147.956          0     0
 81  1997  Arie_Lu       500  145.827          0     0
 82  1998  Eddie_C       500  145.155          0     0
 83  1999  Kenny_B       500  153.176          0     0
 84  2000  Juan_Pa       500  167.607          0     0
 85  2001  Hélio_C       500  153.601          0     0
 86  2002  Hélio_C       500  166.499          0     0
 87  2003  Gil_de_       500  156.291          0     0
 88  2004  Buddy_R      450*  138.518          0     0
 89  2005  Dan_Whe       500  157.603          0     0
 90  2006  Sam_Hor       500  157.085          0     0
 91  2007  Dario_F      415*  151.774          0     0
 92  2008  Scott_D       500  143.567          0     0
 93  2009  Hélio_C       500  150.318          0     0
 94  2010  Dario_F       500  161.623          0     0
* Race rain-shortened
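The FIRSTPART and RAMP columns in the table can be derived from YEAR alone. A minimal Python sketch, assuming the 1976 cutoff used throughout this homework; the function names first_part and ramp are illustrative and not part of the original assignment:

```python
CUTOFF = 1976  # break year used for the segmented fits in this homework

def first_part(year, cutoff=CUTOFF):
    """Indicator variable: 1 for races before the cutoff year, 0 otherwise."""
    return 1 if year < cutoff else 0

def ramp(year, cutoff=CUTOFF):
    """Years away from the cutoff (negative) before it, 0 from the cutoff on."""
    return min(year - cutoff, 0)

# Print the derived columns for a few years from the table
for year in (1911, 1967, 1975, 1976, 2010):
    print(year, first_part(year), ramp(year))
```

Because RAMP is 0 from 1976 on, a regression of SPEED on RAMP fits a sloped line before 1976 and a flat line afterwards, with the intercept equal to the fitted constant level.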
Model 1. Linear Fitting – 1911-2010
Model 2. Linear Fitting 1911-1975
Model 3. Overall fitting 1911-2010 - Linear 1911-1975 and constant value for 1976-2010
Summary and QUESTIONS
1. Model 1. Simple linear regression of Speed on Year for the period 1911-2010. Write the regression equation.
2. Model 2. The winning speeds appear to increase at a (surprisingly)
linear rate until about 1975. Simple linear regression of Speed on
Year for period 1911-1975. Write the regression equation.
3. A simple way we might estimate the relationship is to take the (YEAR,
SPEED) point for the first year of the race and the point for 1976. Draw a
line between them (you might draw it with a pencil on your graph) and figure out its
equation: SPEED = ____ + ____(YEAR). How much does this slope differ from that
of the least squares regression in Model 2?
4. The approach in question 3 also gives unbiased estimates and is easy. Why do we
prefer a regression fit?
5. Model 3. After some evaluation, I decided to fit a line for YEAR < 1976
and a constant value for YEAR 1976 or later. To do this, I regress
SPEED on RAMP for the whole set of points. Note that the variable RAMP
is equal to the number of years away from 1976 (a negative number) for
each YEAR < 1976 and 0 for YEAR 1976 or later.
Write the regression equation.
6. Compare this new slope and intercept with the values found in Model 1.
7. Model 4. Next, I fit a model with YEAR and RAMP (cutoff at 1976) as predictors
(explanatory variables). Note that this implies two nonzero slopes,
one before and one after 1976.
a. Write down the prediction equation.
b. Compute the predicted speed for year 1976. Compute the residual value for this
year.
c. Is this fit a significant improvement? (Ignore any assumption violations.)
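The arithmetic behind questions 3 and 7 can be checked with a few lines of Python. The sketch below takes the 1911 and 1976 points from the data table and the Model 2 and Model 4 coefficients from the summary table at the end of this handout. The function names are illustrative assumptions, and the output is meant as a check on hand calculations, not as a model answer.

```python
def two_point_line(x1, y1, x2, y2):
    """Return (intercept, slope) of the straight line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope

def model4_predict(year, b0=499.79091, b_year=-0.17235, b_ramp=1.43332,
                   cutoff=1976):
    """Model 4 fitted value: SPEED = b0 + b_year*YEAR + b_ramp*RAMP."""
    ramp = min(year - cutoff, 0)  # RAMP as defined in question 5
    return b0 + b_year * year + b_ramp * ramp

# Question 3: line through (1911, 74.602) and (1976, 148.725)
a, b = two_point_line(1911, 74.602, 1976, 148.725)
model2_slope = 1.27138  # least squares slope from the summary table
print(f"two-point line: SPEED = {a:.3f} + {b:.5f}(YEAR)")
print(f"Model 2 slope minus two-point slope: {model2_slope - b:.5f}")

# Question 7: before 1976, RAMP moves one-for-one with YEAR, so the slope in
# YEAR is b_year + b_ramp; from 1976 on, RAMP is 0 and the slope is b_year.
slope_before = -0.17235 + 1.43332
slope_after = -0.17235
pred = model4_predict(1976)
resid = 148.725 - pred  # observed 1976 speed minus fitted value
print(f"slopes: {slope_before:.5f} before 1976, {slope_after:.5f} after")
print(f"1976 prediction: {pred:.3f}, residual: {resid:.3f}")
```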
Segmented linear fitting: period 1911-1975 and 1976-2010
For the questions below, use Model 3, i.e., a linear fit for YEAR < 1976 and
a constant value for YEAR 1976 or later.
8. Diagnostic residual plots and a plot of squared residuals versus year are
presented below. Squared residuals may be considered an estimate
(albeit an inaccurate one) of the error variance for that year.
9. Write down the assumptions (on the errors) under which regression results are
justified. Does your residual plot call into question any of the other assumptions?
Explain.
10. Summarize your findings. Would you recommend any of these models? Support your
answer.
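The squared-residual idea from question 8 can be sketched numerically. The Python below uses the Model 3 coefficients from the summary table and a handful of rows copied from the data table. The rows were hand-picked for illustration, so the printed averages only show the mechanics; a real check would use all 94 observations. The function names are assumptions introduced for this sketch.

```python
# Model 3 coefficients from the summary table: SPEED = B0 + B1*RAMP
B0, B1, CUTOFF = 157.19354, 1.21420, 1976

# Hand-picked (YEAR, SPEED) rows from the data table, for illustration only
SAMPLE = [(1911, 74.602), (1950, 124.002), (1975, 149.213),
          (1976, 148.725), (1990, 185.981), (1992, 134.477), (2010, 161.623)]

def model3_residual(year, speed):
    """Observed speed minus the Model 3 fitted value."""
    ramp = min(year - CUTOFF, 0)
    return speed - (B0 + B1 * ramp)

def mean_squared_residual(rows):
    """Average of the squared Model 3 residuals over the given rows."""
    return sum(model3_residual(y, s) ** 2 for y, s in rows) / len(rows)

before = mean_squared_residual([r for r in SAMPLE if r[0] < CUTOFF])
after = mean_squared_residual([r for r in SAMPLE if r[0] >= CUTOFF])
print(f"mean squared residual before 1976: {before:.1f}")
print(f"mean squared residual from 1976 on: {after:.1f}")
```

If the two averages differ greatly on the full data set, that would point toward non-constant error variance; seven rows are far too few to conclude anything on their own.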
Summary table - Modeling

Model    b0         b1                MSE       RSQUARE           Observation
                                                (ADJ RSQUARE)
1. n=94  -1552.87   0.85904           136.6819  0.8211 (0.8192)   Overall linear reg., SPEED ON YEAR
2. n=59  159.68167  1.27138           13.3327   0.9786 (0.9783)   Linear regression, SPEED ON RAMP, YEAR < 1976
3. n=94  157.19354  1.21420           54.68212  0.9284 (0.9277)   Overall RAMP_76
4. n=94  499.79091  byear = -0.17235  53.10965  0.9311 (0.9297)   Segmented regression, year 1976 is cutoff
                    bramp =  1.43332

\[ \text{Adj}\,R^2 = 1 - \frac{(n - i)(1 - R^2)}{n - p} \]

n = the number of observations
p = the number of parameters, including the intercept
i = 1 if there is an intercept, 0 otherwise
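The summary table can be cross-checked with two small calculations, sketched here in Python. The first applies the adjusted R-square definition above, written as 1 - (n - i)(1 - R^2)/(n - p). The second is a standard partial F statistic for nested models, a technique the handout does not name explicitly, applied to Model 3 (SPEED on RAMP) versus Model 4 (SPEED on YEAR and RAMP). The function names are illustrative, and SSE is recovered as MSE times the error degrees of freedom.

```python
def adj_r2(r2, n, p, intercept=True):
    """Adjusted R-square: 1 - (n - i)(1 - R^2)/(n - p)."""
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r2) / (n - p)

def partial_f(mse_reduced, p_reduced, mse_full, p_full, n):
    """F statistic for the extra terms in a full model vs. a nested reduced one."""
    sse_reduced = mse_reduced * (n - p_reduced)  # SSE = MSE * error df
    sse_full = mse_full * (n - p_full)
    extra_terms = p_full - p_reduced
    return ((sse_reduced - sse_full) / extra_terms) / mse_full

# Model 1 from the summary table: n = 94, two parameters, R-square 0.8211
print(f"Model 1 adjusted R-square: {adj_r2(0.8211, 94, 2):.4f}")

# Model 3 (2 parameters) nested in Model 4 (3 parameters), both with n = 94
f_stat = partial_f(54.68212, 2, 53.10965, 3, 94)
print(f"partial F for adding YEAR to Model 3: {f_stat:.2f}")
```

The F statistic would be compared with an F distribution on 1 and 91 degrees of freedom. Small differences from the tabled adjusted R-square values for Models 2 to 4 are expected, because the R-square inputs are rounded to four decimals.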