STOR 664 Homework 1
Solutions
A. Exercise (Faraway book)
Summary:
> library(faraway)
> data(teengamb)
> teengamb$sex<-as.factor(teengamb$sex)
> summary(teengamb)
 sex        status          income           verbal          gamble
 0:28   Min.   :18.00   Min.   : 0.600   Min.   : 1.00   Min.   :  0.0
 1:19   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00   1st Qu.:  1.1
        Median :43.00   Median : 3.250   Median : 7.00   Median :  6.0
        Mean   :45.23   Mean   : 4.642   Mean   : 6.66   Mean   : 19.3
        3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00   3rd Qu.: 19.4
        Max.   :75.00   Max.   :15.000   Max.   :10.00   Max.   :156.0
There are no missing values. From this summary we see that more males were sampled than females. We also see that the mean of gamble is much larger than the median, suggesting that its distribution is right-skewed or has large outliers; this is likely, since the maximum is so much larger than the other quartile values.
Histograms of the income variable and the gamble variable:
> hist(teengamb$income)
> hist(teengamb$gamble)
As the summary above suggested, both histograms are skewed to the right, with a few large outliers.
Pairwise scatter plots:
> pairs(teengamb,col=as.numeric(teengamb$sex)+2)
It appears in this study that males tend to spend more on gambling than females. Also, the variables verbal and status look moderately positively correlated, and gamble and income may also be correlated. The following command confirms that those two correlations are greater than 0.5.
> cor(teengamb[,-1])
            status     income     verbal      gamble
status  1.00000000 -0.2750340  0.5316102 -0.05042081
income -0.27503402  1.0000000 -0.1755707  0.62207690
verbal  0.53161022 -0.1755707  1.0000000 -0.22005619
gamble -0.05042081  0.6220769 -0.2200562  1.00000000
The correlation between gamble and income makes sense, because people who make more money have
more money to spend on gambling. This concludes the preliminary analysis of this data.
B. Ex. 2.1
$\hat\beta_0 = \sum_i y_i / n = \bar y$ is clearly linear and unbiased. To show it has minimum variance among all linear, unbiased estimators, let $\tilde\beta_0 = \sum_{i=1}^n c_i y_i$ be any estimator with these properties. Then, from unbiasedness, we get
$$\sum_i c_i = 1, \qquad \sum_i c_i (x_i - \bar x) = 0.$$
Since
$$\mathrm{Cov}(\tilde\beta_0 - \hat\beta_0,\, \hat\beta_0) = \sum_i \sum_j \frac{c_i}{n} \mathrm{Cov}(y_i, y_j) - \sum_i \frac{1}{n^2} \mathrm{Var}(y_i) = \sum_i \frac{c_i}{n} \sigma^2 - \sum_i \frac{1}{n^2} \sigma^2 = 0,$$
we have
$$\mathrm{Var}(\tilde\beta_0) = \mathrm{Var}(\tilde\beta_0 - \hat\beta_0 + \hat\beta_0) = \mathrm{Var}(\tilde\beta_0 - \hat\beta_0) + \mathrm{Var}(\hat\beta_0) + 0 \ge \mathrm{Var}(\hat\beta_0),$$
showing that $\hat\beta_0$ has the smallest variance and is the BLUE.
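As a quick Monte Carlo sanity check (not part of the proof; all numerical values below are illustrative assumptions), one can simulate the centered model and compare $\mathrm{Var}(\hat\beta_0)$ with the variance of an arbitrary competitor whose weights satisfy the two unbiasedness constraints:
### Hedged sketch: ybar vs. an arbitrary linear unbiased competitor
set.seed(1)
n <- 20; beta0 <- 2; beta1 <- 0.5; sigma <- 1    # illustrative values
xc <- (1:n) - mean(1:n)                          # centered covariate
### Competitor weights: 1/n plus a perturbation orthogonal to both the
### constant vector and xc, so sum(cw)=1 and sum(cw*xc)=0 still hold
z <- residuals(lm(rnorm(n) ~ xc))
cw <- 1/n + 0.05*z
sims <- replicate(10000, {
  y <- beta0 + beta1*xc + rnorm(n, 0, sigma)
  c(mean(y), sum(cw*y))                          # (BLUE, competitor)
})
apply(sims, 1, var)                              # first variance should be smaller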
B. Ex. 2.3
$\hat\beta = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}$ and $\hat\alpha = \bar y - \hat\beta \bar x$ are obtained by the standard least squares calculation, and the following are obtained via standard expectation calculations:
$$E(\hat\alpha) = \alpha, \qquad E(\hat\beta) = \beta,$$
$$\mathrm{Var}(\hat\alpha) = \sigma^2 \frac{\sum x_i^2 / n}{\sum (x_i - \bar x)^2} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2} \right],$$
$$\mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum (x_i - \bar x)^2}, \qquad \mathrm{Cov}(\hat\alpha, \hat\beta) = -\frac{\sigma^2 \bar x}{\sum (x_i - \bar x)^2}.$$
Writing
$$e_i = y_i - \hat\alpha - \hat\beta x_i = (\alpha - \hat\alpha) + (\beta - \hat\beta) x_i + \epsilon_i$$
and using
$$\mathrm{Cov}(\epsilon_i, \hat\alpha) = \sigma^2 \left[ \frac{1}{n} - \frac{(x_i - \bar x)\bar x}{\sum_k (x_k - \bar x)^2} \right], \qquad \mathrm{Cov}(\epsilon_i, \hat\beta) = \frac{\sigma^2 (x_i - \bar x)}{\sum_k (x_k - \bar x)^2},$$
we obtain $E(e_i) = 0$ and
$$\mathrm{Var}(e_i) = \mathrm{Var}(\hat\alpha) + x_i^2 \mathrm{Var}(\hat\beta) + \mathrm{Var}(\epsilon_i) - 2\,\mathrm{Cov}(\epsilon_i, \hat\alpha) - 2 x_i \mathrm{Cov}(\epsilon_i, \hat\beta) + 2 x_i \mathrm{Cov}(\hat\alpha, \hat\beta) = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{\sum_k (x_k - \bar x)^2} \right].$$
Thus $E(\sum e_i^2) = \sum E(e_i^2) = \sum \{\mathrm{Var}(e_i) + (E e_i)^2\} = (n-2)\sigma^2$.
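A short simulation (a hedged sketch; the values of $n$, $\alpha$, $\beta$, $\sigma$ below are illustrative assumptions) confirms that the average residual sum of squares is close to $(n-2)\sigma^2$:
### Hedged check of E(sum of e_i^2) = (n-2)*sigma^2
set.seed(2)
n <- 15; alpha <- 1; beta <- 2; sigma <- 3       # illustrative values
x <- runif(n, 0, 10)
rss <- replicate(10000, {
  y <- alpha + beta*x + rnorm(n, 0, sigma)
  sum(residuals(lm(y ~ x))^2)
})
c(mean_rss = mean(rss), target = (n-2)*sigma^2)  # the two should be close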
V ar(ei )
B. Ex. 2.4
The least squares estimator $\hat\beta = \sum x_i y_i / \sum x_i^2$ is an unbiased linear estimator. To show it is the BLUE, let $\tilde\beta = \sum c_i y_i$ be another unbiased linear estimator. Then, since $E\tilde\beta = \beta \sum c_i x_i = \beta$, we have $\sum c_i x_i = 1$. Thus $\mathrm{Cov}(\tilde\beta - \hat\beta,\, \hat\beta) = 0$ and
$$\mathrm{Var}(\tilde\beta) = \mathrm{Var}(\tilde\beta - \hat\beta + \hat\beta) = \mathrm{Var}(\tilde\beta - \hat\beta) + \mathrm{Var}(\hat\beta) + 0 \ge \mathrm{Var}(\hat\beta)$$
shows that $\hat\beta$ has the smallest variance and is the BLUE.
By writing $e_i = y_i - \hat\beta x_i = (\beta - \hat\beta) x_i + \epsilon_i$ and using the fact that the $\epsilon_i$'s are uncorrelated, we have $E e_i = 0$, $\mathrm{Var}(e_i) = \sigma^2 \bigl(1 - \frac{x_i^2}{\sum_k x_k^2}\bigr)$, and
$$\mathrm{Cov}(e_i, e_j) = \begin{cases} \sigma^2 \left( 1 - \dfrac{x_i^2}{\sum_k x_k^2} \right) & \text{if } i = j, \\[1ex] -\sigma^2 \dfrac{x_i x_j}{\sum_k x_k^2} & \text{if } i \neq j. \end{cases}$$
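The covariance formula can be checked numerically: for regression through the origin, $\mathrm{Cov}(e) = \sigma^2 (I - x x^\top / \sum_k x_k^2)$. A hedged R sketch (all values are illustrative assumptions):
### Compare empirical residual covariance with sigma^2*(I - x x'/sum(x^2))
set.seed(3)
n <- 6; sigma <- 2; beta <- 0.7                  # illustrative values
x <- 1:n
E <- replicate(20000, {
  y <- beta*x + rnorm(n, 0, sigma)
  residuals(lm(y ~ x - 1))                       # no-intercept fit
})
emp  <- cov(t(E))                                # empirical Cov(e_i, e_j)
theo <- sigma^2 * (diag(n) - outer(x, x)/sum(x^2))
max(abs(emp - theo))                             # should be near 0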
B. Ex. 2.13
(a) A straight line fits well except for the first observation, which could be treated as an outlier. However, a straight line is not realistic in the long run, since the winning time must always stay above zero seconds. In general, it seems that over the years the time it takes to win the race has been decreasing.
[Figure: "Plot Year x Time" — scatter plot of winning Time (210–270 sec) against Year (1900–2000).]
(b) From the R result,
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   225.3517     0.7969  282.78  < 2e-16 ***
CenteredYear   -0.3206     0.0249  -12.87 1.02e-11 ***

Residual standard error: 3.904 on 22 degrees of freedom
$\hat\beta_0 = 225.3517$, $\hat\beta_1 = -0.3206$, $\hat\sigma^2 = s^2 = 3.904^2 = 15.24122$, $se(\hat\beta_0) = 0.7969$, $se(\hat\beta_1) = 0.0249$.
(c) Residuals:
          1           2           3           4           5           6           7           8
  3.2314487   3.9136734   3.1958980  -2.1218774   5.4425719  -1.4752035  -0.5929788  -1.3107542
          9          10          11          12          13          14          15          16
 -3.4285296   2.4181443  -0.9996310  -3.6174064  -7.9351818  -4.1529571  -6.0707325  -3.3885079
         17          18          19          20          21          22          23          24
  0.7937168   1.2759414  -3.3118340   1.4003907   6.8426153   3.7848399   1.3570646   4.7492892
It is easy to see the pattern in the residuals by plotting them against Year. Note that the residuals show quadratic behavior, so the error terms $\epsilon_i$ might not be independent.
[Figure: "Standardized residual plot" — standardized residuals (std) against Year (1900–2000).]
(d) The predicted winning time at the 2008 Olympic Games is $\hat y^* = 208.1485$ (sec). The 95% confidence interval is
$$\hat y^* \pm t_{1-0.05/2,\,24-2} \cdot s \cdot \sqrt{1/24 + (2008 - \bar x)^2 / S_{XX}} = (204.9213,\ 211.3756),$$
and the 95% prediction interval is
$$\hat y^* \pm t_{1-0.05/2,\,24-2} \cdot s \cdot \sqrt{1 + 1/24 + (2008 - \bar x)^2 / S_{XX}} = (199.4326,\ 216.8644),$$
where $S_{XX} = \sum (x_i - \bar x)^2$.
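For reference, the same intervals can be read off predict.lm directly (a sketch assuming the fitted model bm and the variable Year from the appendix code, with the 1896 observation excluded):
### Assumes bm <- lm(Time~CenteredYear) as in the appendix
new <- data.frame(CenteredYear = 2008 - mean(Year))
predict(bm, new, interval = "confidence", level = 0.95)
predict(bm, new, interval = "prediction", level = 0.95)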
B. Ex. 2.14
(a) $\hat\beta_0 = 2.21727$, $\hat\beta_1 = 1.60313$, $s^2 = \hat\sigma^2 = (0.2474)^2 = 0.061197$.
95% confidence interval for $\beta_0$: $(2.107256,\ 2.327290)$
95% confidence interval for $\beta_1$: $(1.476523,\ 1.729743)$
95% confidence interval for $\sigma^2$: $\left( \dfrac{20 s^2}{\chi^2_{20,\,1-0.05/2}},\ \dfrac{20 s^2}{\chi^2_{20,\,0.05/2}} \right) = (0.035704,\ 0.127206)$
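As a cross-check, these intervals can be computed in R (a sketch assuming the fit bm <- lm(CO~Cr) from the appendix code, with 22 - 2 = 20 residual degrees of freedom):
### Hedged sketch; assumes bm from the appendix
confint(bm, level = 0.95)                        # 95% CIs for beta0 and beta1
s2 <- summary(bm)$sigma^2
20*s2 / qchisq(c(0.975, 0.025), df = 20)         # 95% CI for sigma^2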
(b) We are testing $H_0$: the $e_i$ are normally distributed. The Shapiro-Wilk normality test gives a p-value greater than 0.1, so we fail to reject the null hypothesis; the residuals are consistent with normality.
Shapiro-Wilk normality test
data: bm$residual
W = 0.9401, p-value = 0.1986
(c) We are testing $H_0: y_{ij} = \beta_0 + \beta_1 (x_i - \bar x) + \epsilon_{ij}$ vs. $H_1: y_{ij} = \mu_i + \epsilon_{ij}$, where $i = 1, \dots, 6$, $j = 1, \dots, n_i$, and the ANOVA table is given by
Analysis of Variance Table

Model 1: CO ~ Cr
Model 2: CO ~ as.factor(ratio)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     20 1.22395
2     16 1.12411  4   0.09983 0.3552 0.8366
Since the p-value is large, $H_0$ is not rejected: there is no evidence of lack of fit, so the straight-line model is adequate.
(d) Let $x^* = K/C = 2$. The predicted CO% is $\hat y^* = 3.16822$.
(i) The 90% interval for a single experiment is the same as the prediction interval:
$$\hat y^* \pm t_{1-0.1/2,\,22-2} \cdot s \cdot \sqrt{1 + 1/22 + (x^* - \bar x)^2 / S_{XX}} = (2.729508,\ 3.606936).$$
(ii) The 90% confidence interval for the long-run average is
$$\hat y^* \pm t_{1-0.1/2,\,22-2} \cdot s \cdot \sqrt{1/22 + (x^* - \bar x)^2 / S_{XX}} = (3.058567,\ 3.277877).$$
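These intervals likewise match predict.lm applied to the centered fit (a sketch assuming bm and ratio from the appendix code):
### Assumes bm <- lm(CO~Cr) with Cr = ratio - mean(ratio)
new <- data.frame(Cr = 2 - mean(ratio))
predict(bm, new, interval = "prediction", level = 0.90)  # single experiment
predict(bm, new, interval = "confidence", level = 0.90)  # long-run average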
C.
The model is $y_i = \beta_0 + \beta_1 (x_i - \bar x) + \epsilon_i$, where the $\epsilon_i$'s are uncorrelated random variables with mean 0 and variance $\sigma^2$. We have $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$, $\mathrm{Cov}(\epsilon_i, \hat\beta_0) = \sigma^2 / n$, and $\mathrm{Cov}(\epsilon_i, \hat\beta_1) = \sigma^2 (x_i - \bar x) / \sum_k (x_k - \bar x)^2$. Combining these with results (2.9), (2.10) from the textbook (and noting that $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = 0$ for the centered model), for $i \neq j$,
$$\begin{aligned}
\mathrm{Cov}(e_i, e_j) &= \mathrm{Cov}\bigl(\epsilon_i - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)(x_i - \bar x),\ \epsilon_j - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)(x_j - \bar x)\bigr) \\
&= \mathrm{Cov}(\epsilon_i, \epsilon_j) + \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1)(x_i - \bar x)(x_j - \bar x) \\
&\quad - \mathrm{Cov}(\epsilon_i, \hat\beta_0) - \mathrm{Cov}(\epsilon_j, \hat\beta_0) - \mathrm{Cov}(\epsilon_i, \hat\beta_1)(x_j - \bar x) - \mathrm{Cov}(\epsilon_j, \hat\beta_1)(x_i - \bar x) + \mathrm{Cov}(\hat\beta_0, \hat\beta_1)(x_i - \bar x + x_j - \bar x) \\
&= -\sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{\sum_k (x_k - \bar x)^2} \right],
\end{aligned}$$
and
$$\begin{aligned}
\mathrm{Var}(e_i) &= \mathrm{Var}(\epsilon_i) + \mathrm{Var}(\hat\beta_0) + \mathrm{Var}(\hat\beta_1)(x_i - \bar x)^2 - 2\,\mathrm{Cov}(\epsilon_i, \hat\beta_0) - 2\,\mathrm{Cov}(\epsilon_i, \hat\beta_1)(x_i - \bar x) + 2\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1)(x_i - \bar x) \\
&= \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{\sum_k (x_k - \bar x)^2} \right].
\end{aligned}$$
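Equivalently, $\mathrm{Cov}(e) = \sigma^2 (I - H)$ with $H$ the hat matrix, so the formulas above can be verified entrywise. A hedged R sketch (the x values are illustrative assumptions):
### Check that I - H matches the centered-model formula
set.seed(4)
n <- 8; x <- runif(n, 0, 5); xc <- x - mean(x)   # illustrative design
X <- cbind(1, xc)
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix
theo <- -(1/n + outer(xc, xc)/sum(xc^2))         # off-diagonal formula (per sigma^2)
diag(theo) <- 1 - 1/n - xc^2/sum(xc^2)           # diagonal formula (per sigma^2)
max(abs((diag(n) - H) - theo))                   # should be ~0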
D. Durbin Watson test
• Sample code:
dwtest <- function(y, x, nsim=1000)
{
### y: response x: covariates nsim: Number of simulation repetition
n<-length(y)
### Compute Durbin-Watson D for given model
resid <- lm(y~x)$resid
NU <- 0; DE <- (resid[1])^2 ### Setting up Numerator and Denominator of D Statistic
for(j in 2:n){
NU <- NU + (resid[j]-resid[j-1])^2
DE <- DE + (resid[j])^2}
D <- NU/DE
### Look for the distribution of D under null hypothesis
Count <- 0
Dist <- array(,nsim)
for(i in 1:nsim){
temp<-rnorm(n)
resid <- lm(temp~x)$resid
NU <- 0; DE <- (resid[1])^2
for(j in 2:n){
NU <- NU + (resid[j]-resid[j-1])^2
DE <- DE + (resid[j])^2}
Dist[i]<- NU/DE
if(D > NU/DE){Count <- Count+1}
}
### Produce the result, D-W’s D statistic and p-value of the test
hist(Dist, nclass=30, main="Distribution of D-W's D statistics"); abline(v=D, lty=3)
result<-c(D, Count/nsim)
names(result) <- c("D statistic", "p-value")
return(result)
}
• Computation results:
> am <- read.table('D:/STAT664/amherst.dat', header=T)
> dwtest(am$temp,am$year,10000)
D statistic     p-value
   1.620461    0.005900
The null hypothesis that the residuals from the regression of the Amherst data are independent is rejected, so there is evidence of (positive) autocorrelation.
> CMA <- read.table('D:/STAT664/charles.dat', header=F)
> dwtest(CMA$V3,CMA$V2,10000)
D statistic     p-value
   1.954750    0.413100
The null hypothesis that the residuals from the regression of the Mt. Airy-Charles data are independent is not rejected, so there is no evidence of autocorrelation in these residuals.
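For comparison (an external cross-check, not part of the assignment), the dwtest function in the lmtest package performs the same test with an analytic p-value; alternative="greater" corresponds to positive autocorrelation (small D). Note that it shares its name with the simulation function above, so one definition would mask the other:
### Hedged cross-check; assumes the lmtest package is installed
library(lmtest)
lmtest::dwtest(temp ~ year, data = am, alternative = "greater")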
Appendix: Sample code for Ex. 2.13 and Ex. 2.14
##### Ch1. Ex13
olympic <- read.table("D:/STAT664/olympic.dat", col.names=c('Year','Time'))
##### 13. a
plot(olympic$Year,olympic$Time, xlab= "Year", ylab="Time")
title("Plot Year x Time")
##### 13. b
attach(subset(olympic, Year!=1896)) ## Exclude outlier
CenteredYear <- Year - mean(Year)
bm<-lm(Time~CenteredYear)
summary(bm)
##### 13. c
plot(Year, bm$resid)
abline(h=0)
std <- bm$resid / summary(bm)$sigma
plot(Year, std)
abline(h=c(0,-1,1,-2,2),lty="dotted")
title("Standardized residual plot")
##### 13. d
conf.i <- predict(bm, data.frame(CenteredYear=(2008-mean(Year))),se.fit=T)
pr.i <- conf.i$residual.scale*sqrt(1+(conf.i$se.fit/conf.i$residual.scale)^2) ## SE for a new observation (prediction interval)
Result <- c(conf.i$fit-qt(1-0.05/2,24-2)*conf.i$se.fit,
conf.i$fit+qt(1-0.05/2,24-2)*conf.i$se.fit,
conf.i$fit-qt(1-0.05/2,24-2)*pr.i,
conf.i$fit+qt(1-0.05/2,24-2)*pr.i)
##### Ch1. Ex14
co <- read.table("D:/STAT664/co.dat",header=T)
attach(co)
Cr <- ratio - mean(ratio)
bm <- lm(CO~Cr)
summary(bm)
Interval <- c(2.217273 - qt(1-0.05/2,length(Cr)-2)*0.05274178,
              2.217273 + qt(1-0.05/2,length(Cr)-2)*0.05274178,
              1.603133 - qt(1-0.05/2,length(Cr)-2)*0.06069627,
              1.603133 + qt(1-0.05/2,length(Cr)-2)*0.06069627)
summary(bm)$coefficients
##### 14. b
shapiro.test(bm$residual)
plot(bm$residual)
##### 14. c
fm <- lm(CO~as.factor(Cr))
anova(bm,fm)
anova(bm)
pvalue <- 1-pf(0.3552,4,20)
##### 14. d
conf.i <- predict(bm, data.frame(Cr=(2-mean(ratio))),se.fit=T)
pr.i <- conf.i$residual.scale*sqrt(1+(conf.i$se.fit/conf.i$residual.scale)^2) ## SE for a new observation (prediction interval)
Result <- c(conf.i$fit-qt(1-0.10/2,22-2)*conf.i$se.fit,
conf.i$fit+qt(1-0.10/2,22-2)*conf.i$se.fit,
conf.i$fit-qt(1-0.10/2,22-2)*pr.i,
conf.i$fit+qt(1-0.10/2,22-2)*pr.i)