ST512 HOMEWORK 2    SSII10
July 10, 2010

Q1. Linear regression: fitting two line segments to a single set of points

Objectives
1. Run a linear regression on a subset of the data.
2. Create a new indicator variable.
3. Create a new variable to count the number of years from an origin.
4. Compare a linear regression fit with a simpler fit.
5. Fit segmented lines to a set of data points.
6. Select the best model.
7. Analyze the adequacy of the model and violations of assumptions.
8. Conclusion.

The data below give the winning speeds for the Indianapolis 500 automobile race each year from 1911 to 2010.

INDY 500 WINNING SPEEDS 1911-2010

Obs  YEAR  DRIVER   distance    Speed  FIRSTPART  RAMP
  1  1911  Ray_Har       500   74.602          1   -65
  2  1912  Joe_Daw       500   78.719          1   -64
  3  1913  Jules_G       500   75.933          1   -63
  4  1914  René_Th       500   82.474          1   -62
  5  1915  Ralph_D       500   89.840          1   -61
  6  1916  Dario_R       300   84.001          1   -60
  7  1919  Howdy_W       500   88.050          1   -57
  8  1920  Gaston_       500   88.618          1   -56
  9  1921  Tommy_M       500   89.621          1   -55
 10  1922  Jimmy_M       500   94.484          1   -54
 11  1923  Tommy_M       500   90.545          1   -53
 12  1924  Lora_L.       500   98.234          1   -52
 13  1925  Peter_D       500  101.127          1   -51
 14  1926  Frank_L      *400   95.904          1   -50
 15  1927  George_       500   97.545          1   -49
 16  1928  Louis_M       500   99.482          1   -48
 17  1929  Ray_Kee       500   97.585          1   -47
 18  1930  Billy_A       500  100.448          1   -46
 19  1931  Louis_S       500   96.629          1   -45
 20  1932  Fred_Fr       500  104.144          1   -44
 21  1933  Louis_M       500  104.162          1   -43
 22  1934  Bill_Cu       500  104.863          1   -42
 23  1935  Kelly_P       500  106.240          1   -41
 24  1936  Louis_M       500  109.069          1   -40
 25  1937  Wilbur_       500  113.580          1   -39
 26  1938  Floyd_R       500  117.200          1   -38
 27  1939  Wilbur_       500  115.035          1   -37
 28  1940  Wilbur_       500  114.277          1   -36
 29  1941  Floyd_D       500  115.117          1   -35
 30  1946  George_       500  114.820          1   -30
 31  1947  Mauri_R       500  116.338          1   -29
 32  1948  Mauri_R       500  119.814          1   -28
 33  1949  Bill_Ho       500  121.327          1   -27
 34  1950  Johnnie      *345  124.002          1   -26
 35  1951  Lee_Wal       500  126.244          1   -25
 36  1952  Troy_Ru       500  128.922          1   -24
 37  1953  Bill_Vu       500  128.740          1   -23
 38  1954  Bill_Vu       500  130.840          1   -22
 39  1955  Bob_Swe       500  128.209          1   -21
 40  1956  Pat_Fla       500  128.490          1   -20
 41  1957  Sam_Han       500  135.601          1   -19
 42  1958  Jimmy_B       500  133.719          1   -18
 43  1959  Rodger_       500  135.875          1   -17
 44  1960  Jim_Rat       500  138.767          1   -16
 45  1961  A.J._Fo       500  139.130          1   -15
 46  1962  Rodger_       500  140.293          1   -14
 47  1963  Parnell       500  143.137          1   -13
 48  1964  A.J._Fo       500  147.350          1   -12
 49  1965  Jim_Cla       500  150.686          1   -11
 50  1966  Graham_       500  144.317          1   -10
 51  1967  A.J._Fo       500  151.207          1    -9
 52  1968  Bobby_U       500  152.882          1    -8
 53  1969  Mario_A       500  156.867          1    -7
 54  1970  Al_Unse       500  155.749          1    -6
 55  1971  Al_Unse       500  157.735          1    -5
 56  1972  Mark_Do       500  162.692          1    -4
 57  1973  Gordon_    *332.5  159.063          1    -3
 58  1974  Johnny_       500  158.589          1    -2
 59  1975  Bobby_U      *435  149.213          1    -1
 60  1976  Johnny_      *255  148.725          0     0
 61  1977  A.J._Fo       500  161.331          0     0
 62  1978  Al_Unse       500  161.363          0     0
 63  1979  Rick_Me       500  158.899          0     0
 64  1980  Johnny_       500  142.862          0     0
 65  1981  Bobby_U       500  139.084          0     0
 66  1982  Gordon_       500  162.029          0     0
 67  1983  Tom_Sne       500  162.117          0     0
 68  1984  Rick_Me       500  163.612          0     0
 69  1985  Danny_S       500  152.982          0     0
 70  1986  Bobby_R       500  170.722          0     0
 71  1987  Al_Unse       500  162.175          0     0
 72  1988  Rick_Me       500  144.809          0     0
 73  1989  Emerson       500  167.581          0     0
 74  1990  Arie_Lu       500  185.981          0     0
 75  1991  Rick_Me       500  176.457          0     0
 76  1992  Al_Unse       500  134.477          0     0
 77  1993  Emerson       500  157.207          0     0
 78  1994  Al_Unse       500  160.872          0     0
 79  1995  Jacques       500  153.616          0     0
 80  1996  Buddy_L       500  147.956          0     0
 81  1997  Arie_Lu       500  145.827          0     0
 82  1998  Eddie_C       500  145.155          0     0
 83  1999  Kenny_B       500  153.176          0     0
 84  2000  Juan_Pa       500  167.607          0     0
 85  2001  Hélio_C       500  153.601          0     0
 86  2002  Hélio_C       500  166.499          0     0
 87  2003  Gil_de_       500  156.291          0     0
 88  2004  Buddy_R      *450  138.518          0     0
 89  2005  Dan_Whe       500  157.603          0     0
 90  2006  Sam_Hor       500  157.085          0     0
 91  2007  Dario_F      *415  151.774          0     0
 92  2008  Scott_D       500  143.567          0     0
 93  2009  Hélio_C       500  150.318          0     0
 94  2010  Dario_F       500  161.623          0     0

* Race rain-shortened

[Figure] Model 1. Linear fit, 1911-2010
[Figure] Model 2. Linear fit, 1911-1975
[Figure] Model 3. Overall fit, 1911-2010: linear for 1911-1975 and constant value for 1976-2010

Summary and QUESTIONS

1. Model 1. Simple linear regression of SPEED on YEAR for the period 1911-2010. Write the regression equation.
2. Model 2. The winning speeds appear to increase at a (surprisingly) linear rate until about 1975. Simple linear regression of SPEED on YEAR for the period 1911-1975. Write the regression equation.
3. A simple way we might estimate the relationship is to take the (YEAR, SPEED) point for the first year of the race and that for 1976. Draw a line between them (you might draw that with a pencil on your graph) and figure out its equation: SPEED = ____ + ____(YEAR). How much does this slope differ from that of the least squares regression in Model 2?
4. The approach in question 3 also gives unbiased estimates and is easy. Why do we prefer a regression fit?
5. Model 3. After some evaluation, I decided to fit a line for YEAR < 1976 and a constant value for YEAR 1976 or greater. To do this, I regress SPEED on RAMP for the whole set of points. Note that the variable RAMP equals the number of years away from 1976 for each year before 1976 (so it is negative there, e.g. -65 for 1911) and 0 for 1976 or later.
   Write the regression equation.
6. Compare this new slope and intercept with the values found in Model 1.
7. Model 4. Next, I fit a model with YEAR and RAMP (cutoff at 1976) as predictors (explanatory variables). Note that this implies two nonzero slopes, one before and one after 1976.
   a. Write down the prediction equation.
   b. Compute the predicted speed for year 1976. Compute the residual value for this year.
   c. Is this fit a significant improvement? (Ignore any assumption violations.)

[Figure] Segmented linear fit: periods 1911-1975 and 1976-2010

For the questions below use Model 3, i.e., a linear fit for year < 1976 and a constant value for year 1976 or greater.

8. Diagnostic residual plots and a plot of squared residuals versus year are presented below. Squared residuals may be considered an estimate (albeit an inaccurate one) of the error variance for that year.
   Write down the assumptions (on the errors) under which regression results are justified. Does your residual plot call into question any of the other assumptions? Explain.
9. Summary table - Modeling

Model  n   b0          b1                 MSE        RSQUARE (ADJ RSQUARE)  Observation
1      94  -1552.87    0.85904            136.6819   0.8211 (0.8192)        Overall linear reg., SPEED ON YEAR
2      59  159.68167   1.27138            13.3327    0.9786 (0.9783)        Linear regression, SPEED ON RAMP, YEAR < 1976
3      94  157.19354   1.21420            54.68212   0.9284 (0.9277)        Overall, RAMP_76
4      94  499.79091   byear = -0.17235   53.10965   0.9311 (0.9297)        Segmented regression, year 1976 is cutoff
                       bramp =  1.43332

   $\mathrm{Adj}R^2 = 1 - \frac{(n - i)(1 - R^2)}{n - p}$
   where n = the number of observations,
         p = the number of parameters, including the intercept,
         i = 1 if there is an intercept, 0 otherwise.

10. Summarize your findings. Would you recommend any of these models? Support your answer.
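The constructed variables and the adjusted R-square formula in this handout can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original assignment. The function names `ramp`, `firstpart`, and `adj_r2` are illustrative. The parameter counts p (2 for Models 1 to 3, 3 for Model 4) are an assumption read off the predictors listed for each model. RAMP is taken as YEAR - 1976 before 1976 and 0 afterwards, which matches the data listing (for example -65 for 1911).

```python
CUTOFF = 1976  # cutoff year between the two segments, as stated in the handout


def ramp(year: int) -> int:
    """RAMP: years away from the cutoff before it (negative), 0 from the cutoff on."""
    return min(year - CUTOFF, 0)


def firstpart(year: int) -> int:
    """FIRSTPART: indicator equal to 1 for years before the cutoff, 0 otherwise."""
    return 1 if year < CUTOFF else 0


def adj_r2(r2: float, n: int, p: int, intercept: bool = True) -> float:
    """Adjusted R-square: 1 - (n - i)(1 - R^2) / (n - p), with i = 1 if intercept."""
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r2) / (n - p)


if __name__ == "__main__":
    # Spot-check the constructed variables against the data listing.
    for year in (1911, 1975, 1976, 2010):
        print(year, firstpart(year), ramp(year))

    # (model, n, p, RSQUARE, reported ADJ RSQUARE) copied from the summary table;
    # the p values are an assumption (intercept plus number of predictors).
    summary = [
        (1, 94, 2, 0.8211, 0.8192),
        (2, 59, 2, 0.9786, 0.9783),
        (3, 94, 2, 0.9284, 0.9277),
        (4, 94, 3, 0.9311, 0.9297),
    ]
    for model, n, p, r2, reported in summary:
        print(model, round(adj_r2(r2, n, p), 4), reported)
```

Because the tabulated R-square values are themselves rounded to four decimals, the recomputed adjusted values should agree with the reported ones only to about three decimal places.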