STOR 664 Homework 1 Solution

A. Exercise (Faraway book)

Summary:

> library(faraway)
> data(teengamb)
> teengamb$sex <- as.factor(teengamb$sex)
> summary(teengamb)
 sex        status          income           verbal          gamble
 0:28   Min.   :18.00   Min.   : 0.600   Min.   : 1.00   Min.   :  0.0
 1:19   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00   1st Qu.:  1.1
        Median :43.00   Median : 3.250   Median : 7.00   Median :  6.0
        Mean   :45.23   Mean   : 4.642   Mean   : 6.66   Mean   : 19.3
        3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00   3rd Qu.: 19.4
        Max.   :75.00   Max.   :15.000   Max.   :10.00   Max.   :156.0

There are no missing values. From this summary we see that more males than females were sampled. We also see that the mean of gamble is much larger than the median, suggesting that its distribution is right skewed or has large outliers; this is likely, since the maximum value is so much larger than the other quartile values.

Histograms of the income variable and the gamble variable:

> hist(teengamb$income)
> hist(teengamb$gamble)

[Figure: histograms of income and gamble]

As the summary above suggested, both histograms are skewed to the right, with a few large outliers.

Pairwise scatter plots:

> pairs(teengamb, col = as.numeric(teengamb$sex) + 2)

[Figure: pairwise scatter plot matrix of teengamb, colored by sex]

It appears that in this study males tend to spend more on gambling than females. Also, the variables verbal and status look like they may be positively correlated, and gamble and income may also be correlated. The following command confirms that those two correlations are greater than 0.5.

> cor(teengamb[,-1])
            status     income     verbal      gamble
status  1.00000000 -0.2750340  0.5316102 -0.05042081
income -0.27503402  1.0000000 -0.1755707  0.62207690
verbal  0.53161022 -0.1755707  1.0000000 -0.22005619
gamble -0.05042081  0.6220769 -0.2200562  1.00000000

The correlation between gamble and income makes sense, because people who earn more money have more money to spend on gambling. This concludes the preliminary analysis of these data.

B. Ex. 2.1

In the centered model $y_i = \beta_0 + \beta_1(x_i - \bar x) + \epsilon_i$, the estimator $\hat\beta_0 = \sum_i y_i / n$ is clearly linear and unbiased. To show it has minimum variance among all linear unbiased estimators, let $\tilde\beta_0 = \sum_{i=1}^n c_i y_i$ be any estimator with these properties. From unbiasedness we get $\sum_i c_i = 1$ and $\sum_i c_i (x_i - \bar x) = 0$. Since
\[
\mathrm{Cov}(\tilde\beta_0 - \hat\beta_0,\ \hat\beta_0)
= \frac{1}{n}\sum_i \sum_j c_i\, \mathrm{Cov}(y_i, y_j) - \sum_i \frac{1}{n^2}\mathrm{Var}(y_i)
= \sum_i \frac{c_i}{n}\,\sigma^2 - \sum_i \frac{1}{n^2}\,\sigma^2 = 0,
\]
we have
\[
\mathrm{Var}(\tilde\beta_0) = \mathrm{Var}(\tilde\beta_0 - \hat\beta_0 + \hat\beta_0)
= \mathrm{Var}(\tilde\beta_0 - \hat\beta_0) + \mathrm{Var}(\hat\beta_0) + 0 \ge \mathrm{Var}(\hat\beta_0),
\]
with equality only when $\tilde\beta_0 = \hat\beta_0$, showing that $\hat\beta_0$ has the smallest variance and is the BLUE.

B. Ex. 2.3

$\hat\beta = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2$ and $\hat\alpha = \bar y - \hat\beta \bar x$ are obtained by the standard least squares calculation, and the following are obtained by standard expectation calculations:
\[
E(\hat\alpha) = \alpha, \qquad E(\hat\beta) = \beta,
\]
\[
\mathrm{Var}(\hat\alpha) = \sigma^2 \,\frac{\sum_i x_i^2 / n}{\sum_i (x_i - \bar x)^2}
= \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2} \right],
\qquad
\mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2},
\qquad
\mathrm{Cov}(\hat\alpha, \hat\beta) = -\,\sigma^2 \frac{\bar x}{\sum_i (x_i - \bar x)^2}.
\]
Writing $e_i = y_i - \hat\alpha - \hat\beta x_i = (\alpha - \hat\alpha) + (\beta - \hat\beta)x_i + \epsilon_i$, together with
\[
\mathrm{Cov}(\epsilon_i, \hat\alpha) = \sigma^2 \left[ \frac{1}{n} - \frac{(x_i - \bar x)\,\bar x}{\sum_k (x_k - \bar x)^2} \right],
\qquad
\mathrm{Cov}(\epsilon_i, \hat\beta) = \sigma^2\, \frac{x_i - \bar x}{\sum_k (x_k - \bar x)^2},
\]
gives the following:
\[
E(e_i) = 0,
\]
\[
\mathrm{Var}(e_i) = \mathrm{Var}(\hat\alpha) + x_i^2\, \mathrm{Var}(\hat\beta) + \mathrm{Var}(\epsilon_i)
- 2\,\mathrm{Cov}(\epsilon_i, \hat\alpha) - 2x_i\, \mathrm{Cov}(\epsilon_i, \hat\beta) + 2x_i\, \mathrm{Cov}(\hat\alpha, \hat\beta)
= \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{\sum_k (x_k - \bar x)^2} \right].
\]
Thus $E\bigl(\sum_i e_i^2\bigr) = \sum_i E(e_i^2) = \sum_i \{\mathrm{Var}(e_i) + (E e_i)^2\} = (n-2)\sigma^2$.

B. Ex. 2.4

The least squares estimator $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$ is an unbiased linear estimator. To show it is the BLUE, let $\tilde\beta = \sum_i c_i y_i$ be another unbiased linear estimator. Since $E\tilde\beta = \beta \sum_i c_i x_i = \beta$, we have $\sum_i c_i x_i = 1$. Thus $\mathrm{Cov}(\tilde\beta - \hat\beta, \hat\beta) = 0$, and
\[
\mathrm{Var}(\tilde\beta) = \mathrm{Var}(\tilde\beta - \hat\beta + \hat\beta) = \mathrm{Var}(\tilde\beta - \hat\beta) + \mathrm{Var}(\hat\beta) + 0 \ge \mathrm{Var}(\hat\beta),
\]
with equality only when $\tilde\beta = \hat\beta$, which shows that $\hat\beta$ has the smallest variance and is the BLUE.

By writing $e_i = y_i - \hat\beta x_i = (\beta - \hat\beta)x_i + \epsilon_i$ and using the fact that the $\epsilon_i$'s are uncorrelated, we have $E(e_i) = 0$ and
\[
\mathrm{Cov}(e_i, e_j) =
\begin{cases}
\sigma^2 \left( 1 - \dfrac{x_i^2}{\sum_k x_k^2} \right) & \text{if } i = j, \\[2ex]
-\,\sigma^2\, \dfrac{x_i x_j}{\sum_k x_k^2} & \text{if } i \ne j.
\end{cases}
\]
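As an informal numerical check of the identity $E\bigl(\sum_i e_i^2\bigr) = (n-2)\sigma^2$ from Ex. 2.3, one can simulate the model many times and compare the average residual sum of squares with $(n-2)\sigma^2$. This is a minimal sketch, not part of the assigned solution; the values of n, sigma, and nsim below are arbitrary illustrative choices.

set.seed(1)
n <- 20; sigma <- 2; nsim <- 5000
x <- runif(n, 0, 10)
rss <- replicate(nsim, {
  y <- 1 + 0.5 * x + rnorm(n, sd = sigma)  ## true intercept 1, slope 0.5
  sum(resid(lm(y ~ x))^2)                  ## residual sum of squares
})
mean(rss)          ## should be close to (n - 2) * sigma^2 = 72
(n - 2) * sigma^2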
B. Ex. 2.13

(a) A straight line fits well except for the first observation, which could be treated as an outlier. However, a straight line is not realistic in the long run, since the winning time must always remain greater than zero seconds. In general, it seems that over the years the time it takes to win the race has been decreasing.

[Figure: "Plot Year x Time" — winning Time in seconds (roughly 210 to 270) against Year, 1900 to 2000]

(b) From the R output,

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   225.3517     0.7969  282.78  < 2e-16 ***
CenteredYear   -0.3206     0.0249  -12.87 1.02e-11 ***

Residual standard error: 3.904 on 22 degrees of freedom

we read off $\hat\beta_0 = 225.3517$, $\hat\beta_1 = -0.3206$, $\hat\sigma^2 = s^2 = 3.904^2 = 15.24122$, $se(\hat\beta_0) = 0.7969$, $se(\hat\beta_1) = 0.0249$.

(c) Residuals:

         1          2          3          4          5          6          7          8
 3.2314487  3.9136734  3.1958980 -2.1218774  5.4425719 -1.4752035 -0.5929788 -1.3107542
         9         10         11         12         13         14         15         16
-3.4285296  2.4181443 -0.9996310 -3.6174064 -7.9351818 -4.1529571 -6.0707325 -3.3885079
        17         18         19         20         21         22         23         24
 0.7937168  1.2759414 -3.3118340  1.4003907  6.8426153  3.7848399  1.3570646  4.7492892

It is easy to see a pattern by plotting the residuals: they show quadratic behavior in Year, so the error terms $\epsilon_i$ might not be independent.

[Figure: "Standardized residual plot" — standardized residuals against Year, 1900 to 2000]

(d) The predicted winning time at the 2008 Olympic Games is $\hat y^* = 208.1485$ seconds. The 95% confidence interval is
\[
\hat y^* \pm t_{1-0.05/2,\,24-2} \cdot s \cdot \sqrt{1/24 + (2008 - \bar x)^2 / S_{XX}} = (204.9213,\ 211.3756),
\]
and the 95% prediction interval is
\[
\hat y^* \pm t_{1-0.05/2,\,24-2} \cdot s \cdot \sqrt{1 + 1/24 + (2008 - \bar x)^2 / S_{XX}} = (199.4326,\ 216.8644),
\]
where $S_{XX} = \sum_i (x_i - \bar x)^2$.

B. Ex. 2.14

(a) $\hat\beta_0 = 2.21727$, $\hat\beta_1 = 1.60313$, $s^2 = \hat\sigma^2 = (0.2474)^2 = 0.061197$.
95% confidence interval for $\beta_0$: $(2.107256,\ 2.327290)$
95% confidence interval for $\beta_1$: $(1.476523,\ 1.729743)$
95% confidence interval for $\sigma^2$: $\left( \dfrac{20 s^2}{\chi^2_{20,\,1-0.05/2}},\ \dfrac{20 s^2}{\chi^2_{20,\,0.05/2}} \right) = (0.035704,\ 0.127206)$

(b) We are testing $H_0$: the $e_i$ are normally distributed. The Shapiro-Wilk normality test gives a p-value greater than 0.1, so we do not reject the null hypothesis: the residuals are consistent with normality.

        Shapiro-Wilk normality test

data:  bm$residual
W = 0.9401, p-value = 0.1986

(c) We are testing $H_0: y_{ij} = \beta_0 + \beta_1 (x_i - \bar x) + \epsilon_{ij}$ versus $H_1: y_{ij} = \mu_i + \epsilon_{ij}$, where $i = 1, \ldots, 6$ and $j = 1, \ldots, n_i$. The ANOVA table is

Analysis of Variance Table

Model 1: CO ~ Cr
Model 2: CO ~ as.factor(ratio)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     20 1.22395
2     16 1.12411  4   0.09983 0.3552 0.8366

Since the p-value is large, $H_0$ is not rejected: there is no evidence of lack of fit, so the data are consistent with a linear relationship.

(d) Let $x^* = K/C = 2$. The predicted CO% is $\hat y^* = 3.16822$.
(i) The 90% confidence interval for a single experiment is the same as the prediction interval:
\[
\hat y^* \pm t_{1-0.1/2,\,22-2} \cdot s \cdot \sqrt{1 + 1/22 + (x^* - \bar x)^2 / S_{XX}} = (2.729508,\ 3.606936).
\]
(ii) The 90% confidence interval for the long-run average is
\[
\hat y^* \pm t_{1-0.1/2,\,22-2} \cdot s \cdot \sqrt{1/22 + (x^* - \bar x)^2 / S_{XX}} = (3.058567,\ 3.277877).
\]
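The chi-square interval for $\sigma^2$ in Ex. 2.14(a) can be reproduced directly in R with qchisq. A minimal sketch, assuming s2 holds the variance estimate quoted above on 20 residual degrees of freedom (small discrepancies from the reported interval reflect rounding in s2):

s2 <- 0.061197; df <- 20
c(df * s2 / qchisq(1 - 0.05/2, df),  ## lower limit
  df * s2 / qchisq(0.05/2, df))      ## upper limit
## approximately (0.036, 0.127), matching the interval above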
C.

The model is $y_i = \beta_0 + \beta_1 (x_i - \bar x) + \epsilon_i$, where the $\epsilon_i$'s are uncorrelated random variables with mean 0 and variance $\sigma^2$. We have
\[
\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 \ (i \ne j), \qquad
\mathrm{Cov}(\epsilon_i, \hat\beta_0) = \sigma^2 / n, \qquad
\mathrm{Cov}(\epsilon_i, \hat\beta_1) = \sigma^2 (x_i - \bar x) \Big/ \textstyle\sum_k (x_k - \bar x)^2 .
\]
Combining these with results (2.9) and (2.10) from the textbook, for $i \ne j$,
\[
\begin{aligned}
\mathrm{Cov}(e_i, e_j)
&= \mathrm{Cov}\bigl(\epsilon_i - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)(x_i - \bar x),\ \epsilon_j - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)(x_j - \bar x)\bigr) \\
&= \mathrm{Cov}(\epsilon_i, \epsilon_j) + \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1)\,(x_i - \bar x)(x_j - \bar x) \\
&\quad - \mathrm{Cov}(\epsilon_i, \hat\beta_0) - \mathrm{Cov}(\epsilon_j, \hat\beta_0)
- \mathrm{Cov}(\epsilon_i, \hat\beta_1)(x_j - \bar x) - \mathrm{Cov}(\epsilon_j, \hat\beta_1)(x_i - \bar x)
+ \mathrm{Cov}(\hat\beta_0, \hat\beta_1)(x_i - \bar x + x_j - \bar x) \\
&= -\,\sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{\sum_k (x_k - \bar x)^2} \right],
\end{aligned}
\]
and
\[
\begin{aligned}
\mathrm{Var}(e_i)
&= \mathrm{Var}(\epsilon_i) + \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1)\,(x_i - \bar x)^2
- 2\,\mathrm{Cov}(\epsilon_i, \hat\beta_0) - 2\,\mathrm{Cov}(\epsilon_i, \hat\beta_1)(x_i - \bar x) + 2\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1)(x_i - \bar x) \\
&= \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{\sum_k (x_k - \bar x)^2} \right].
\end{aligned}
\]

D. Durbin-Watson test

• Sample code:

dwtest <- function(y, x, nsim=1000) {
  ## y: response; x: covariates; nsim: number of simulation repetitions
  n <- length(y)
  ## Compute the Durbin-Watson D statistic for the given model
  resid <- lm(y ~ x)$resid
  NU <- 0; DE <- (resid[1])^2   ## numerator and denominator of D
  for (j in 2:n) {
    NU <- NU + (resid[j] - resid[j-1])^2
    DE <- DE + (resid[j])^2
  }
  D <- NU/DE
  ## Simulate the distribution of D under the null hypothesis of
  ## independent (here, i.i.d. standard normal) errors
  Count <- 0
  Dist <- array(, nsim)
  for (i in 1:nsim) {
    temp <- rnorm(n)
    resid <- lm(temp ~ x)$resid
    NU <- 0; DE <- (resid[1])^2
    for (j in 2:n) {
      NU <- NU + (resid[j] - resid[j-1])^2
      DE <- DE + (resid[j])^2
    }
    Dist[i] <- NU/DE
    if (D > NU/DE) { Count <- Count + 1 }
  }
  ## Report the D-W D statistic and the simulated (one-sided) p-value
  hist(Dist, nclass=30, main="Distribution of D-W's D statistics")
  abline(v=D, lty=3)
  result <- c(D, Count/nsim)
  names(result) <- c("D statistic", "p-value")
  return(result)
}

• Computation results:

> am <- read.table('D:/STAT664/amherst.dat', header=T)
> dwtest(am$temp, am$year, 10000)
D statistic     p-value
   1.620461    0.005900

The null hypothesis that the residuals from the regression of the Amherst data are independent is rejected, so there is evidence of autocorrelation.

> CMA <- read.table('D:/STAT664/charles.dat', header=F)
> dwtest(CMA$V3, CMA$V2, 10000)
D statistic     p-value
   1.954750    0.413100

The null hypothesis that the residuals from the regression of the Mt. Airy-Charles data are independent is not rejected, so there is no evidence of autocorrelation in these residuals.
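As an optional cross-check of the simulation-based test above, the dwtest function in the lmtest package (assuming it is installed; it shares the name of our function, so it is called through the namespace) computes the Durbin-Watson statistic with a one-sided p-value for positive autocorrelation, which should be comparable to the simulated p-values reported above:

## install.packages("lmtest")
lmtest::dwtest(temp ~ year, data = am, alternative = "greater")
lmtest::dwtest(V3 ~ V2, data = CMA, alternative = "greater")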
Appendix: Sample code for Ex. 2.13 and Ex. 2.14

##### Ex. 2.13
olympic <- read.table("D:/STAT664/olympic.dat", col.names=c('Year','Time'))

##### 13. a
plot(olympic$Year, olympic$Time, xlab="Year", ylab="Time")
title("Plot Year x Time")

##### 13. b
attach(subset(olympic, Year != 1896))   ## exclude the outlier
CenteredYear <- Year - mean(Year)
bm <- lm(Time ~ CenteredYear)
summary(bm)

##### 13. c
plot(Year, bm$resid)
abline(h=0)
std <- bm$resid / summary(bm)$sigma     ## standardized residuals
plot(Year, std)
abline(h=c(0,-1,1,-2,2), lty="dotted")
title("Standardized residual plot")

##### 13. d
conf.i <- predict(bm, data.frame(CenteredYear=(2008-mean(Year))), se.fit=T)
pr.i <- conf.i$residual.scale*sqrt(1+(conf.i$se.fit/conf.i$residual.scale)^2)  ## prediction SE
Result <- c(conf.i$fit - qt(1-0.05/2,24-2)*conf.i$se.fit,
            conf.i$fit + qt(1-0.05/2,24-2)*conf.i$se.fit,
            conf.i$fit - qt(1-0.05/2,24-2)*pr.i,
            conf.i$fit + qt(1-0.05/2,24-2)*pr.i)

##### Ex. 2.14
co <- read.table("D:/STAT664/co.dat", header=T)
attach(co)
Cr <- ratio - mean(ratio)
bm <- lm(CO ~ Cr)
summary(bm)
summary(bm)$coefficients
Interval <- c(2.217273 - qt(1-0.05/2,length(Cr)-2)*0.05274178,
              2.217273 + qt(1-0.05/2,length(Cr)-2)*0.05274178,
              1.603133 - qt(1-0.05/2,length(Cr)-2)*0.06069627,
              1.603133 + qt(1-0.05/2,length(Cr)-2)*0.06069627)

##### 14. b
shapiro.test(bm$residual)
plot(bm$residual)

##### 14. c
fm <- lm(CO ~ as.factor(Cr))
anova(bm, fm)
anova(bm)
pvalue <- 1 - pf(0.3552, 4, 16)   ## F(4,16): 4 numerator df, 16 denominator df from the full model

##### 14. d
conf.i <- predict(bm, data.frame(Cr=(2-mean(ratio))), se.fit=T)
pr.i <- conf.i$residual.scale*sqrt(1+(conf.i$se.fit/conf.i$residual.scale)^2)  ## prediction SE
Result <- c(conf.i$fit - qt(1-0.10/2,22-2)*conf.i$se.fit,
            conf.i$fit + qt(1-0.10/2,22-2)*conf.i$se.fit,
            conf.i$fit - qt(1-0.10/2,22-2)*pr.i,
            conf.i$fit + qt(1-0.10/2,22-2)*pr.i)
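As a side note, the manual interval arithmetic in 13. d and 14. d above can also be obtained directly from predict(), which accepts an interval argument. A minimal sketch for the Ex. 2.13 prediction at 2008, assuming bm and Year are the objects from the 13. b block (before bm is reused for Ex. 2.14):

predict(bm, data.frame(CenteredYear = 2008 - mean(Year)),
        interval = "confidence", level = 0.95)  ## CI for the mean response
predict(bm, data.frame(CenteredYear = 2008 - mean(Year)),
        interval = "prediction", level = 0.95)  ## prediction interval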