1396, Time Series, Week 3-Regression/Residuals/Series Plot, Fall 2007
Using R, we will work through several examples of regression analysis: obtaining
the fitted model, performing hypothesis tests, computing R² (the coefficient of
determination), and so on. Residual analysis follows model fitting. Before
running these examples, you need to install the packages MASS, alr3, and car in R.
1 Example: Berkeley Guidance Study
The Berkeley Guidance Study was a longitudinal study of girls born in
Berkeley, California between January 1928 and June 1929 and followed for
at least 18 years. The data set can be found on the class website. The variables
are described as follows.
Variable   Description
Ht18       Age 18 height (cm)
Ht2        Age 2 height (cm)
Ht9        Age 9 height (cm)
Lg18       Age 18 leg circumference (cm)
Lg9        Age 9 leg circumference (cm)
Sex        all coded 1 for female
Soma       Somatotype, 1-7 body type scale
St18       Age 18 strength (kg)
St9        Age 9 strength (kg)
Wt18       Age 18 weight (kg)
Wt2        Age 2 weight (kg)
Wt9        Age 9 weight (kg)
Case       Case number
We are interested in the mean function
E(HT18 | x) = β0 + β1 WT2 + β2 HT2 + β3 WT9 + β4 HT9 + β5 LG9 + β6 ST9
> bgd <- read.table("BGD.dat", header=TRUE)
> bgd[1,]
  Sex  WT2  HT2  WT9   HT9  LG9 ST9 WT18  HT18 LG18 ST18 Soma Case Case.number
1   1 13.6 87.7 32.5 133.4 28.4  74 56.9 158.9 34.6  143    5  301           0
> HT18 <- bgd$HT18
> WT2 <- bgd$WT2
> HT2 <- bgd$HT2
> WT9 <- bgd$WT9
> HT9 <- bgd$HT9
> LG9 <- bgd$LG9
> ST9 <- bgd$ST9
> HT18[1]
[1] 158.9
> summary(HT18)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  153.6   163.0   166.8   166.5   170.2   183.2 
> summary(WT2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.20   11.80   12.70   12.82   13.47   17.00 
> summary(HT2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  80.90   85.67   87.10   87.25   88.90   97.30 
> summary(WT9)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   27.63   30.65   31.62   34.48   47.40 
> summary(HT9)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  121.4   132.9   135.7   135.1   138.8   152.5 
> summary(LG9)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.60   26.30   27.45   27.84   29.25   32.70 
> summary(ST9)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   51.25   59.00   60.46   68.75  107.00 
> bgd.fit <- lm(HT18 ~ WT2 + HT2 + WT9 +HT9 + LG9 + ST9)
> bgd.fit
Call:
lm(formula = HT18 ~ WT2 + HT2 + WT9 + HT9 + LG9 + ST9)
Coefficients:
(Intercept)          WT2          HT2          WT9          HT9          LG9          ST9
     4.6836       0.3465       0.1067      -0.4853       1.1986       0.2072      -0.0704

> summary(bgd.fit)
Call:
lm(formula = HT18 ~ WT2 + HT2 + WT9 + HT9 + LG9 + ST9)
Residuals:
     Min       1Q   Median       3Q      Max 
-7.32345 -1.96404 -0.03138  2.00687  8.29562 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.68358   15.92247   0.294   0.7696    
WT2          0.34648    0.42840   0.809   0.4217    
HT2          0.10665    0.20396   0.523   0.6029    
WT9         -0.48533    0.20656  -2.350   0.0219 *  
HT9          1.19857    0.15438   7.763 9.25e-11 ***
LG9          0.20721    0.39517   0.524   0.6019    
ST9         -0.07041    0.03443  -2.045   0.0451 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.274 on 63 degrees of freedom
Multiple R-Squared: 0.7348,     Adjusted R-squared: 0.7095 
F-statistic: 29.09 on 6 and 63 DF,  p-value: < 2.2e-16
Question 1: How do you obtain a confidence interval for each β̂?
Try the following two methods.
• Using R to produce these confidence intervals (see R code on
the class website)
• Using the Std. Error column in the table above to produce the
confidence intervals
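Both methods can be sketched as follows, using the fitted object bgd.fit from
above (the 63 degrees of freedom come from the model summary):

> ## Method 1: built-in 95% confidence intervals
> confint(bgd.fit, level = 0.95)
> ## Method 2: estimate +/- t-quantile * standard error, e.g. for HT9
> b <- coef(summary(bgd.fit))      # matrix of estimates and std. errors
> b["HT9", 1] + c(-1, 1) * qt(0.975, df = 63) * b["HT9", 2]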
1.1 F test
> ## Testing H0: beta2 = beta5 = 0 (drop HT2 and LG9 from the model)
> bgd.fit.null <- lm(HT18 ~ WT2 + WT9 + HT9 + ST9)
> summary(bgd.fit.null)

Call:
lm(formula = HT18 ~ WT2 + WT9 + HT9 + ST9)
Residuals:
     Min       1Q   Median       3Q      Max 
-7.34236 -2.12640 -0.02455  1.88621  8.17302 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.69966   12.69845   0.921 0.360277    
WT2          0.45004    0.37480   1.201 0.234200    
WT9         -0.41419    0.11007  -3.763 0.000363 ***
HT9          1.23276    0.11427  10.788 3.99e-16 ***
ST9         -0.07272    0.03267  -2.226 0.029489 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.241 on 65 degrees of freedom
Multiple R-Squared: 0.7319,     Adjusted R-squared: 0.7154 
F-statistic: 44.37 on 4 and 65 DF,  p-value: < 2.2e-16
> rss <- 3.274^2*63      ## RSS = (residual standard error)^2 * df
> rss.h0 <- 3.241^2*65
> rss
[1] 675.3018
> rss.h0
[1] 682.7653
Question 2: How do you perform the F test using rss and rss.h0?
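A minimal sketch of the computation: q = 2 coefficients are set to zero under
H0, and the full model has 63 residual degrees of freedom. The built-in anova
comparison should give the same test.

> ## F = ((RSS_H0 - RSS)/q) / (RSS/df), here q = 2 and df = 63
> f <- ((rss.h0 - rss)/2) / (rss/63)
> f                       # approximately 0.35
> 1 - pf(f, 2, 63)        # p-value, approximately 0.71: do not reject H0
> ## equivalently, R can compare the two models directly:
> anova(bgd.fit.null, bgd.fit)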
1.2 Normality and Constant Variance
We examine the normal Q-Q plot of the residuals and the plot of the residuals
against the fitted values of the response:
[Figure: "Normal Q-Q Plot" of the regression residuals (x-axis: Theoretical
Quantiles; y-axis: Sample Quantiles of Regression Residuals), and the residuals
(y-axis: Residuals) plotted against the HT18 fitted values.]
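The two panels can be reproduced along these lines (a sketch; the labels are
chosen to match the figure):

> qqnorm(bgd.fit$residuals,
+        ylab = "Sample Quantiles of Regression Residuals")
> qqline(bgd.fit$residuals)
> plot(bgd.fit$fitted.values, bgd.fit$residuals,
+      xlab = "HT18 fitted values", ylab = "Residuals")
> abline(h = 0)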
> shapiro.test(bgd.fit$residuals)
Shapiro-Wilk normality test
data: bgd.fit$residuals
W = 0.9936, p-value = 0.9784
> bgd.fit$residuals[bgd.fit$residuals > 5]
      32       36       63       65 
6.061456 8.295617 7.331361 5.143828 
Are cases 32, 36, 63, and 65 outliers or influential points?
1.3 Influential Points and Outliers
Unlike outliers, influential points are cases whose removal produces results
quite different from those of the original model. To assess influence, we use
Cook's distance. Cook's distance for case i is
D_i = (1 / (p σ̂²)) Σ_{j=1}^n (ŷ(i),j − ŷ_j)²                    (1)
where p is the number of estimated coefficients (here p = 7, including the
intercept, the convention used by cookd below) and ŷ(i),j is obtained as follows:
• Fit the regression without case i; we call this the leave-one-out model.
• ŷ(i),j is the prediction of yj using the estimates of β from the leave-one-out
model and the corresponding independent variables, as sketched below.
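As a minimal sketch of this definition for a single case (case 32 is picked
arbitrarily; the denominator uses p = 7 so that the result matches what cookd
reports below):

> fit.no32 <- lm(HT18 ~ WT2 + HT2 + WT9 + HT9 + LG9 + ST9,
+                data = bgd, subset = -32)        # leave-one-out fit
> yhat.32 <- predict(fit.no32, newdata = bgd)     # yhat_(32),j for all j
> sum((yhat.32 - fitted(bgd.fit))^2) / (7 * summary(bgd.fit)$sigma^2)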
We can obtain Cook's distance for cases 32, 36, 63, and 65.
> library(car)
> round(cookd(bgd.fit), 3)
    1     2     3     4     5     6     7     8     9    10    11    12    13 
0.018 0.003 0.172 0.031 0.016 0.001 0.000 0.008 0.000 0.000 0.008 0.013 0.006 
   14    15    16    17    18    19    20    21    22    23    24    25    26 
0.000 0.001 0.007 0.002 0.000 0.028 0.000 0.031 0.000 0.027 0.085 0.001 0.001 
   27    28    29    30    31    32    33    34    35    36    37    38    39 
0.035 0.002 0.000 0.011 0.006 0.177 0.002 0.003 0.013 0.080 0.005 0.007 0.000 
   40    41    42    43    44    45    46    47    48    49    50    51    52 
0.012 0.002 0.005 0.057 0.006 0.003 0.000 0.017 0.000 0.016 0.019 0.001 0.004 
   53    54    55    56    57    58    59    60    61    62    63    64    65 
0.002 0.001 0.002 0.001 0.057 0.000 0.000 0.005 0.012 0.020 0.106 0.021 0.030 
   66    67    68    69    70 
0.005 0.001 0.071 0.039 0.000 
Benchmark for using Cook's distance: cases with D_i > 0.5 are generally
considered influential on the model. The Cook's distances for cases 32, 36,
63, and 65 are 0.177, 0.080, 0.106, and 0.030, so these are not influential
points. No statistical test is associated with Cook's distance.
There is a test for a single outlier.
• Find ŷ(i),j, as we did when computing Cook's distance.
• Compute t_j = (y_j − ŷ(i),j) / se(ŷ(i),j).
• Compare t_j to the quantiles of a t-distribution with n − p − 1 degrees of
freedom.
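In R this can be sketched directly: the t_j are the externally studentized
residuals, and car's outlierTest adds a Bonferroni correction for testing the
most extreme case.

> tj <- rstudent(bgd.fit)           # externally studentized residuals
> round(tj[c(32, 36, 63, 65)], 3)
> qt(0.975, df = 62)                # two-sided 5% cutoff (62 = 63 - 1)
> outlierTest(bgd.fit)              # Bonferroni-adjusted test from car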
2 Example: Four Components of Real GNP
Now we consider the time series example shown in 3.4 of the textbook (page
62). The four time series, key components of U.S. real GNP, are manufacturing,
retail, services, and agriculture, recorded from 1960 to 1989 in millions of 1982
dollars. As a first step in exploring the data, we look at the time series plot.
2.1 Time Series and Scatter Plots

[Figure: time series plot of the four GNP components (Agriculture, Retail,
Services, Manufacturing), 1960-1989; x-axis: year; y-axis: GNP Components
(Millions of 1982 dollars).]
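A sketch of how such a plot can be drawn, assuming the four series have been
read into a data frame (the file and column names here are hypothetical):

> gnp <- read.table("GNP4.dat", header = TRUE)    # hypothetical file name
> year <- 1960:1989
> matplot(year,
+         gnp[, c("AGRICULTURE", "RETAIL", "SERVICES", "MANUFACTURING")],
+         type = "l", xlab = "year",
+         ylab = "GNP Components (Millions of 1982 dollars)")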
We can further look at the relations among these variables.
[Figure: "Scatter plot of the four components": scatterplot matrix of RETAIL,
MANUFACTURING, AGRICULTURE, and SERVICES.]
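The scatterplot matrix itself can be reproduced with pairs, using the same
(hypothetical) data frame as in the earlier sketch:

> pairs(gnp[, c("RETAIL", "MANUFACTURING", "AGRICULTURE", "SERVICES")])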
2.2 Relation between Two Time Series: Regression and Residuals
We see a strong linear relation between Retail and Services, so it may be interesting to regress Retail on Services.
> l1.fit <- lm(retail ~ services)
> l1.fit
Call:
lm(formula = retail ~ services)
Coefficients:
(Intercept)     services  
     55.490        0.542  
> summary(l1.fit)
Call:
lm(formula = retail ~ services)
Residuals:
    Min      1Q  Median      3Q     Max 
-20.021  -4.646   1.126   5.706  15.733 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.48994    4.94325   11.22 7.07e-12 ***
services     0.54197    0.01239   43.73  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.98 on 28 degrees of freedom
Multiple R-Squared: 0.9856,     Adjusted R-squared: 0.9851 
F-statistic: 1913 on 1 and 28 DF,  p-value: < 2.2e-16
The residuals still form a time series, so it is reasonable to check whether
there is serial correlation over time; that is, is e_t correlated with e_(t−1)?
The following plot shows some evidence of moderate correlation.
[Figure: scatter plot of the regression residuals against their first lag;
x-axis: e_(t−1), y-axis: e_t.]
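A sketch of how this lag plot can be produced from the residuals:

> e <- l1.fit$residuals
> n <- length(e)
> plot(e[1:(n - 1)], e[2:n], xlab = "e_(t-1)", ylab = "e_t")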
The Durbin-Watson test is performed as follows; the formula for the Durbin-Watson statistic is given on page 96 of the textbook.
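As a check against the output below, the statistic can also be computed
directly from its definition (a sketch):

> ## D = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2
> e <- l1.fit$residuals
> sum(diff(e)^2) / sum(e^2)    # should reproduce the D-W statistic below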
> durbin.watson(l1.fit, method="normal", alternative="two.sided")
 lag Autocorrelation D-W Statistic p-value
   1       0.6001668     0.7856335       0
 Alternative hypothesis: rho != 0
> durbin.watson(l1.fit, method="normal", alternative="positive")
 lag Autocorrelation D-W Statistic p-value
   1       0.6001668     0.7856335       0
 Alternative hypothesis: rho > 0
> durbin.watson(l1.fit, method="normal", alternative="negative")
 lag Autocorrelation D-W Statistic p-value
   1       0.6001668     0.7856335       1
 Alternative hypothesis: rho < 0
With D = 0.786 and a p-value of essentially zero against the alternative
rho > 0, there is strong evidence of positive serial correlation in the
residuals.