
R-Code for Assignment 2
Regression III
ICPSR Summer Program 2005
Dave Armstrong
University of Maryland
[email protected]
July 13, 2005
1 Question 1
First, you are asked to fit a bivariate linear regression of HAPPY on HHINCOME, then to show the
fitted regression line, and finally, referring to the statistical output, to describe the impact of
HHINCOME on HAPPY. Below is the code used to generate the model and the accompanying output.
mod <- lm(HAPPY ~ HHINCOME, data=health)
summary(mod)
Call:
lm(formula = HAPPY ~ HHINCOME, data = health)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4820  -1.8420   0.1740   2.1740   7.8139

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.5301     0.1432  150.33   <2e-16 ***
HHINCOME      0.3280     0.0201   16.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.133 on 6361 degrees of freedom
Multiple R-Squared: 0.04017,    Adjusted R-squared: 0.04002
F-statistic: 266.2 on 1 and 6361 DF,  p-value: < 2.2e-16
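If you want an interval estimate to go with this description, one quick addition (mine, not part of the original key) is confint():

## 95% confidence interval for the income coefficient,
## roughly 0.328 +/- 1.96 * 0.020
confint(mod, "HHINCOME", level=0.95)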
Now, you may just want a table, rather than the entire output from R. There is a package in
R called xtable that constructs LaTeX tables from R output. To see what you get from that,
you can type xtable(mod), where mod refers to your model object. Then, you can copy that text
out of R and paste it into your .tex editor. Table 1 shows the unedited output from that table.¹
¹ I did put a caption in, but other than that, the table is just as it was copied out of R.
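For reference, a minimal sketch of that workflow (the caption argument here is my addition; see ?xtable for the full set of options):

library(xtable)
## prints LaTeX table code to the console for copying into a .tex file
xtable(mod, caption="Bivariate Linear Regression of Happiness on Household Income")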
Table 1: Bivariate Linear Regression of Happiness on Household Income

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    21.5301      0.1432   150.33    0.0000
HHINCOME        0.3280      0.0201    16.32    0.0000
I actually prefer a different kind of output from my models: coefficients with standard
errors in parentheses. I wrote a little function in R that does this. Table 2 shows the output
from this function.
outreg <- function(object, fit=TRUE){
   out.mat <- matrix(NA, ncol=2, nrow=2*length(object$coef))
   l <- 1
   for(i in seq(from=1, to=2*length(object$coef), by=2)){
      ## coefficient, starred if p < 0.05
      out.mat[i, 1] <-
         if(summary(object)$coef[l,4] < 0.05)
            paste(round(object$coef[l], 3), "*", sep="")
         else round(object$coef[l], 3)
      ## standard error (square root of the corresponding diagonal
      ## element of the variance-covariance matrix), in parentheses
      se <- round(sqrt(diag(vcov(object))[l]), 3)
      out.mat[i+1, 1] <- paste("(",
         if(as.character(se) == "0") "0.000" else se, ")", sep="")
      out.mat[i, 2] <- names(object$coef)[l]
      out.mat[i+1, 2] <- c("")
      l <- l+1
   }
   row.names(out.mat) <- out.mat[,2]
   out.mat <- cbind(rep("&", length(out.mat[,1])), out.mat[,1],
      rep(c("\\\\", "\\\\[.1in]"), length(out.mat[,1])/2))
   colnames(out.mat) <- c("", "", "")
   cat("\\begin{table}\n")
   cat("\\caption{}\n")
   cat("\\begin{center}\n")
   cat("\\begin{tabular}{ld{3}}\n")
   cat("\\hline\n")
   cat(" & \\multicolumn{1}{c}{Estimate} \\\\\n")
   cat(" & \\multicolumn{1}{c}{(S.E.)} \\\\\n")
   cat("\\hline\\hline\n")
   print.table(out.mat)
   cat("\\hline\n")
   cat("\\multicolumn{2}{l}{N=")
   cat(paste(object$df.residual + object$rank, "}\\\\\n", sep=""))
   cat("\\multicolumn{2}{l}{*p$<$0.05, 2-sided}\\\\\n")
   if(fit == TRUE){
      if(class(object)[1] == "lm"){
         ## report R-squared for linear models
         cat("\\multicolumn{2}{l}{R$^{2}$=")
         cat(paste(round(summary(object)$r.squared, 3), "}\\\\\n", sep=""))
      }
      else if(class(object)[1] == "glm"){
         ## report deviance for generalized linear models
         cat("\\multicolumn{2}{l}{deviance=")
         cat(paste(round(summary(object)$deviance, 3), "}\\\\\n", sep=""))
      }
   }
   cat("\\hline\\hline\n")
   cat("\\end{tabular}\n")
   cat("\\end{center}\n")
   cat("\\end{table}\n")
}
my.out <- outreg(mod)
Table 2: Bivariate Regression Table

                 Estimate
                 (S.E.)
(Intercept)      21.53*
                 (0.021)
HHINCOME         0.328*
                 (0.000)
N=6363
*p<0.05, 2-sided
R² = 0.04
To use this code, you have to have \usepackage{dcolumn} and
\newcolumntype{d}[1]{D{.}{.}{#1}} in the preamble of your LaTeX document.
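Incidentally, if you would rather not copy the output from the console by hand, one option (my suggestion, not part of the original) is to divert it to a file with sink():

sink("table2.tex")   # send printed output to table2.tex
outreg(mod)
sink()               # turn the diversion off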
Now, you are supposed to construct a graph of the regression line. This can be done the
following way in R. Figure 1 shows the plot created by the code.
postscript("C:/documents and settings/darmstrong/desktop/regression III/2.1.ps",
height=6, width=6, horizontal=F)
plot(jitter(health$HAPPY, 3) ~ jitter(health$HHINCOME, 4), xlab="Household Income",
ylab="Happiness")
abline(mod, lty=2, lwd=2, col="red")
dev.off()
[Figure 1: Scatterplot of Happiness and Household Income with OLS Regression Line. Jittered points with Household Income (roughly 0 to 10) on the x-axis, Happiness (roughly 10 to 30) on the y-axis, and the fitted line overlaid.]
Looking back at the statistical output, you can see that a one-unit increase in income corresponds
to an increase in the average level of Happiness of about 0.33 units. Since there are no
other variables in the model, this effect does not hold anything else constant.
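To make that 0.33-unit slope concrete, you could compare fitted values at two income levels; a small sketch along these lines (not in the original key):

## predicted happiness at household incomes of 2 and 8
predict(mod, newdata=data.frame(HHINCOME=c(2, 8)))
## the two predictions differ by 6 * 0.328, or about 2 points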
2 Question 2
Here, you are asked first to construct a conditioning plot of HHINCOME versus HAPPY conditional
on CLASS and SEX. We did this for the last assignment, but as a reminder, the code for a trellis
display is below:
library(lattice)   # xyplot() and the panel functions live in lattice
postscript("C:/documents and settings/darmstrong/desktop/regression III/1.5b.ps",
   height=7, width=9, horizontal=F)
xyplot(jitter(health$HAPPY,2) ~ jitter(health$HHINCOME, 3) | health$CLASS +
   health$SEX, xlab="Household Income", ylab="Happiness",
   main="Happiness and Household Income Conditional on Class and Sex",
   panel=function(x, y){
      panel.xyplot(jitter(x, 2), jitter(y, 2), col="lightblue")
      panel.lmline(x, y, col="blue")
      panel.loess(x, y, col="red")
   })
dev.off()
[Figure 2: Scatterplot of Happiness on Household Income given Class and Sex. A trellis display titled "Happiness and Household Income Conditional on Class and Sex" with one panel for each combination of Sex (Men, Women) and Class (Manual labour, Self employed, Routine nonmanual, Lower salariat, Upper salariat); x-axis Household Income, y-axis Happiness, with linear and loess fits in each panel.]
Next, you are asked to fit a multiple regression and describe the changes (if any) in the effect
of Household Income on Happiness. You can do this in R as follows. Table 3 shows the results
of outreg here.
mod <- lm(HAPPY ~ HHINCOME + CLASS + SEX, data=health)
summary(mod)
outreg(mod)
As we can see from the table, the difference in the effect is not especially large; it is about
0.026 (0.328 versus 0.302). This makes sense, because the relationship between Income and Happiness
appears relatively constant across class and sex.
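If you want to compute that difference directly rather than reading it off the tables, a quick sketch (fitting both models under separate names so they coexist):

mod.biv <- lm(HAPPY ~ HHINCOME, data=health)
mod.mult <- lm(HAPPY ~ HHINCOME + CLASS + SEX, data=health)
## change in the income coefficient: about 0.026
coef(mod.biv)["HHINCOME"] - coef(mod.mult)["HHINCOME"]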
Next, you're asked to perform an Anova from the car package. Remember, this is different
from anova in base R. This can be done as follows:
library(car)   # Anova() (with a capital A) comes from the car package
Anova(mod)
Table 3: Multiple Regression of Happiness on Household Income, Class and Sex

                            Estimate
                            (S.E.)
(Intercept)                 21.494*
                            (0.023)
HHINCOME                    0.302*
                            (0.000)
CLASS: Self employed        0.469*
                            (0.044)
CLASS: Routine nonmanual    0.331*
                            (0.015)
CLASS: Lower salariat       0.577*
                            (0.012)
CLASS: Upper salariat       0.563*
                            (0.017)
SEX: Men                    -0.147
                            (0.008)
N=6363, Reference Category (Class)=Manual Labour
*p<0.05, 2-sided
R² = 0.048
Anova Table (Type II tests)

Response: HAPPY
          Sum Sq   Df  F value    Pr(>F)
HHINCOME    2017    1 207.0008 < 2.2e-16 ***
CLASS        353    4   9.0528 2.741e-07 ***
SEX           27    1   2.7232   0.09895 .
Residuals  61930 6356
It seems that sex is statistically insignificant after controlling for the effects of Income and
Class. We can then remove SEX from the model and re-estimate. The results are in Table 4:
mod <- lm(HAPPY ~ HHINCOME + CLASS, data=health)
outreg(mod)
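An equivalent check on dropping SEX is the incremental F-test from anova() on the two nested models; a sketch (the object names are mine):

full <- lm(HAPPY ~ HHINCOME + CLASS + SEX, data=health)
restricted <- lm(HAPPY ~ HHINCOME + CLASS, data=health)
## with one degree of freedom for SEX, this reproduces the F-test
## on the SEX line of the Anova table above
anova(restricted, full)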
Table 4: Regression of Happiness on Household Income and Class

                            Estimate
                            (S.E.)
(Intercept)                 21.422*
                            (0.021)
HHINCOME                    0.296*
                            (0.000)
CLASS: Self employed        0.464*
                            (0.044)
CLASS: Routine nonmanual    0.419*
                            (0.012)
CLASS: Lower salariat       0.627*
                            (0.011)
CLASS: Upper salariat       0.61*
                            (0.016)
N=6363, Reference Category (Class)=Manual Labour
*p<0.05, 2-sided
R² = 0.047
3 Question 3
Here, you are asked to execute the qvcalc command for the model estimated in the previous
question. This can be done as follows. Figure 3 shows a plot of the results.
library(qvcalc)   # quasi-variance calculations for factor effects
my.qv <- qvcalc(mod, "CLASS")
postscript("C:/documents and settings/darmstrong/desktop/regression III/2.3.ps",
   height=6, width=6, horizontal=F)
plot(my.qv)
dev.off()
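If you want the numbers behind the plot, the object returned by qvcalc stores them; per the qvcalc documentation, the component is called qvframe:

my.qv$qvframe   # estimates, SEs, quasi-SEs and quasi-variances by level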
What we can see is that manual laborers are unlike most other class types, especially
routine nonmanual, lower salariat, and upper salariat. These differences seem significant because
the intervals do not overlap. It also looks, from the output of qvcalc, like the estimates for
manual labour and self-employed are not significantly different from each other. However, the
statistical output from lm shows otherwise. We talked in class about how this could also be done
mathematically by using the variance-covariance matrix provided in the estimation. Below, I
will show how I would do this.
[Figure 3: Quasi-Variance Plot of CLASS. Estimates (y-axis, roughly -0.2 to 1.0) by class category (x-axis), with intervals based on quasi standard errors.]
We can see from the model that coefficients 3-6 are the coefficients for the class variables.
What we want is a confidence interval for the difference of each type with each other type. We
know, from the statistical output, the difference between manual laborer and all of the other
types. Those are given by the coefficients. Otherwise, we need to get an estimate of the difference
and the variance for that estimated difference. It is relatively easy to make a loop that will do
these calculations for us. Below is such a loop.
b <- coef(mod)
v <- vcov(mod)
b.diff <- b.sd <- matrix(NA, ncol=4, nrow=4)
b.diff[1, ] <- 0 - b[3:6]
b.sd[1, ] <- sqrt(diag(v)[3:6])
for(i in seq(from=3, to=5)){
   for(j in seq(from=i+1, to=6)){
      b.diff[i-1,j-2] <- b[i] - b[j]
      b.sd[i-1,j-2] <- sqrt(v[i,i] + v[j,j] + 2*(-1)*v[j,i])
   }
}
b.t <- b.diff/b.sd
row.names(b.t) <- c("Manual Laborer", "Self Employed",
   "Routine Nonmanual", "Lower Salariat")
colnames(b.t) <- c("Self Employed", "Routine Nonmanual",
   "Lower Salariat", "Upper Salariat")
row.names(b.diff) <- row.names(b.sd) <- row.names(b.t)
colnames(b.diff) <- colnames(b.sd) <- colnames(b.t)
The first two lines simply define b as the 1 × 6 coefficient vector and v as the 6 × 6 variance-covariance matrix of the estimates. The third line initializes the matrices b.diff and b.sd by
filling them with missing values. The fourth line fills in the first row of the b.diff matrix with what we know
already: the difference between manual laborers and all the other categories. Since the rest of
the matrix is set up in this way, we subtract the coefficient for each of the other categories from
the implied coefficient for manual laborers, which is 0. The fifth line fills in the first row of the
b.sd matrix with the square roots of elements 3 to 6 of the diagonal of the variance-covariance
matrix of the estimates.
The sixth line sets up the for loop for i, and the seventh line does the same thing for j.
Here, i goes from 3 to 5 and, for each value of i, j goes from i+1 to 6. This makes
sure that we calculate every difference between the different class types. We set the first
loop to run from 3 to 5 because the coefficients for the class variable occupy positions 3 to 6 in the
coefficient vector. Since we're concerned with differences across the coefficients, we don't need to make
the first loop go to 6, only 5, because we're not interested in the difference between b[6] and
b[6]. For the same reason, the second loop goes from i+1 to 6, because we're never interested
in the difference between b[i] and b[i].
Lines 8 and 9 of the code fill in the cells of the matrices. Specifically, line 8 takes the difference
between two coefficients, and line 9 finds the standard error for that estimate using the formula
that we have seen in class:
var(aX + bY) = a^2 var(X) + b^2 var(Y) + 2(ab) cov(X, Y)
Which, for our purposes, translates to:
var((1)b_i + (-1)b_j) = 1^2 var(b_i) + (-1)^2 var(b_j) + 2(1)(-1) cov(b_i, b_j)
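As a quick check of this formula, a single comparison can be computed by hand; for instance, self-employed (coefficient 3) versus routine nonmanual (coefficient 4):

b[3] - b[4]                          # estimated difference, about 0.045
sqrt(v[3,3] + v[4,4] - 2*v[3,4])     # its standard error, about 0.219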
The next row defines a matrix of t-statistics as t = (b_i - b_j) / S_(b_i - b_j). The final rows of the command
simply identify the row and column names for the matrices constructed by the previous commands. You could stop here, as the matrices b.diff, b.sd, and b.t provide all of the necessary
information. Let's take a look at those matrices:
> b.diff
                  Self Employed Routine Nonmanual Lower Salariat Upper Salariat
Manual Laborer       -0.4642506        -0.4190497     -0.6266219    -0.61014762
Self Employed                NA         0.0452009     -0.1623714    -0.14589704
Routine Nonmanual            NA                NA     -0.2075723    -0.19109795
Lower Salariat               NA                NA             NA     0.01647431
> b.sd
                  Self Employed Routine Nonmanual Lower Salariat Upper Salariat
Manual Laborer         0.210301         0.1084048      0.1040208      0.1255736
Self Employed                NA         0.2194799      0.2167699      0.2274021
Routine Nonmanual            NA                NA      0.1211217      0.1395646
Lower Salariat               NA                NA             NA      0.1339132
> b.t
                  Self Employed Routine Nonmanual Lower Salariat Upper Salariat
Manual Laborer        -2.207553        -3.8655995     -6.0240068     -4.8588828
Self Employed                NA         0.1789734     -0.6463672     -0.5583463
Routine Nonmanual            NA                NA     -1.1890737     -1.0133523
Lower Salariat               NA                NA             NA      0.0877518
The first matrix, b.diff, shows the pairwise differences between estimates. Here, the matrix is
defined by the estimate for the row class minus the estimate for the column class. The matrix
b.sd provides the standard errors for the corresponding differences in b.diff, and the final
matrix b.t contains the t-statistics for the differences. As you can see, manual laborers
have a significantly lower average than all other classes. However, there are no other significant differences
among the remaining classes. Constructing a graph of these results might be your best bet, especially if differences
between categories of a categorical variable are of substantive interest in your research. I would
suggest a graph that looks something like the one in Figure 4.
[Figure 4: Graphical Depiction of Average Differences Between Classes. A matrix of boxes with rows Manual Laborer, Self Employed, Routine Nonman, Lower Salariat and columns Self Employed, Routine Nonman, Lower Salariat, Upper Salariat. Each box shows b_row - b_col in bold and S.E.(b_row - b_col) in italics; boxes with |t(b_row - b_col)| > 1.96 are shaded gray, and boxes with |t(b_row - b_col)| <= 1.96 are unshaded.]
The code for this graph is not elegant; there is a lot of "brute force" manipulation with the
text and polygon commands in R. However, it would not actually take that long to construct
this graph, and it is visually striking, I think. The code for this graph is attached in Appendix 1.
If you have any questions about it, please let me know.
4 Question 4
Here, you are asked to estimate a model regressing Happiness on Household Income, Class and
the interaction of the two independent variables. Then, graph the effects of Income conditional
on class. First, we will estimate and summarize the model. Table 5 shows the results of this
model estimation.
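The model call itself does not appear in the original document; the standard specification (the * operator expands to both main effects plus their interaction) would be:

mod <- lm(HAPPY ~ HHINCOME * CLASS, data=health)
summary(mod)
outreg(mod)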
Table 5: Regression of HAPPY on Income, Class and Income × Class

                                      Estimate
                                      (S.E.)
(Intercept)                           20.987*
                                      (0.047)
HHINCOME                              0.364*
                                      (0.001)
CLASS: Self employed                  2.198*
                                      (0.488)
CLASS: Routine nonmanual              0.755
                                      (0.15)
CLASS: Lower salariat                 1.292*
                                      (0.154)
CLASS: Upper salariat                 1.911*
                                      (0.27)
HHINCOME × CLASS: Self employed       -0.256*
                                      (0.009)
HHINCOME × CLASS: Routine nonmanual   -0.053
                                      (0.003)
HHINCOME × CLASS: Lower salariat      -0.1
                                      (0.003)
HHINCOME × CLASS: Upper salariat      -0.18*
                                      (0.005)
N=6363, Reference Category (Class)=Manual Labour
*p<0.05, 2-sided
R² = 0.049
Now, we need to create two different graphs - one trellis display of the effects of income
conditional on class and one graph that simply places all of the lines on the same graph without
confidence bounds. These are shown in Figure 5.
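The code for these two displays is not reproduced in the original. One way to build them, assuming John Fox's effects package is available (an assumption on my part; the original figures may have been drawn differently), is:

library(effects)
eff <- effect("HHINCOME:CLASS", mod)
plot(eff)                    # trellis display, one panel per class
plot(eff, multiline=TRUE)    # all lines on a single panel, no confidence bands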
[Figure 5: Trellis Display of Effect of Income Conditional on Class. One panel for each class (Manual labour, Self employed, Routine nonmanual, Lower salariat, Upper salariat) showing fitted HAPPY against HHINCOME, plus a legend keyed to CLASS.]
5 Question 5
Here, you are asked to fit a linear model regressing prestige on type and education. The results
of this model are in Table 6.
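The model call is not shown in the original; with the Prestige data from the car package, it would look something like this:

library(car)   # provides the Prestige data
mod <- lm(prestige ~ education + type, data=Prestige)
outreg(mod)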
Table 6: Regression of Prestige on Education and Occupation Type

              Estimate
              (S.E.)
(Intercept)   -2.698
              (32.903)
education     4.573*
              (0.451)
type: prof    6.142
              (18.139)
type: wc      -5.458*
              (7.24)
N=98, Reference Category (type)=Blue Collar
*p<0.05, 2-sided
R² = 0.798
Now, it asks you to “attempt” to construct a plot of estimates and confidence intervals based
on quasi-standard errors. When you do this, you get the following result:
> my.qv <- qvcalc(mod, "type")
Warning message:
NaNs produced in: sqrt(qv)
> plot(my.qv)
Error in plot.qv(my.qv) : No comparison intervals available,
since one of the quasi variances is negative. See ?qvcalc for more.
If you look at the help file for qvcalc, you will see the following explanation:
Occasionally one (and only one) of the quasi variances is negative, and so the corresponding quasi standard error does not exist (it appears as 'NaN'). This is fairly rare in applications, and when it occurs it is because the factor of interest is strongly correlated with one or more other predictors in the model. It is not an indication that quasi variances are inaccurate. An example is shown below using data from the 'car' package: the quasi variance approximation is exact (since 'type' has only 3 levels), and there is a negative quasi variance. The quasi variances remain perfectly valid (they can be used to obtain inference on any contrast), but it makes no sense to plot 'comparison intervals' in the usual way since one of the quasi standard errors is not a real number.
Here, since there is only one comparison that we cannot obtain directly from the statistical output (prof versus wc), calculating this information would be relatively easy. Give it a shot yourself and see what you come
up with.
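As a sketch of one way to do it, the same variance-covariance logic from Question 3 applies; the coefficient names below follow R's default factor coding for type:

b <- coef(mod)
v <- vcov(mod)
## difference between the prof and wc estimates, and its standard error
d <- b["typeprof"] - b["typewc"]
se <- sqrt(v["typeprof","typeprof"] + v["typewc","typewc"] -
   2*v["typeprof","typewc"])
d/se   # t-statistic for the prof vs. wc comparison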
Appendix 1: Constructing Graphical Difference Matrix
squares <- function(x0, y0, length=1, col){
x.vert <- c(x0+0.05, x0+length-0.05, x0+length-0.05, x0+0.05)
y.vert <- c(y0+0.05, y0+0.05, y0+length-0.05, y0+length-0.05)
polygon(x.vert, y.vert, col=col)
}
squares is basically a wrapper function for polygon that helps when you want to draw squares.
It takes four arguments: the starting values for x and y (the coordinates of the lower left-hand
corner of the box), the length of the sides, and the color used to shade in the box. Writing
this function is not strictly necessary, but it ends up saving a lot of time below.
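For example, the following draws the empty 4 x 4 plotting region and shades a single cell, which is exactly how the first box in the code below is produced:

plot(c(0,4), c(0,4), type="n", xlab="", ylab="", axes=F)
squares(0, 3, length=1, col="gray75")   # gray box, lower-left corner at (0, 3)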
I'll go through the first few lines of the code below, as this should familiarize you with the
necessary commands. The first line is a postscript wrapper for the graph that saves everything
between postscript() and dev.off() to the file listed in the command. The second line sets
up the plotting space to be 4 units square, but with no axes and nothing inside the plot. The
third and fourth lines put axes in the space. The fifth line draws the first square, in the upper
left-hand corner of the graph. This one corresponds to the difference between manual laborers
and self-employed folks. I shaded it in with col="gray75" because its t-statistic is greater than
1.96; otherwise, I leave the boxes unshaded. The following two text commands place the estimated
difference in bold and the estimated standard error of the difference in italics within the box.
The remaining commands do the same thing for all of the other boxes.
postscript("C:/documents and settings/darmstrong/desktop/regression III/diffmat.ps",
height=6, width=6, horizontal=F)
plot(c(0,4), c(0, 4), type="n", main="", xlab="", ylab="", axes=F)
axis(3, at=c(0.5, 1.5, 2.5, 3.5), labels=c("Self\n Employed",
"Routine\n Nonman", "Lower\n Salariat", "Upper\n Salariat"), tick=F, lwd=0)
axis(2, at=c(0.5, 1.5, 2.5, 3.5), labels=c("Lower\n Salariat", "Routine\n Nonman",
"Self\n Employed", "Manual\n Laborer"), tick=F, lwd=0)
squares(0, 3, length=1, col="gray75")
text(0.5, 3.6, round(b.diff[1,1], 3), font=2)
text(0.5, 3.4, round(b.sd[1,1], 3), font=3)
squares(1, 3, length=1, col="gray75")
text(1.5, 3.6, round(b.diff[1,2], 3), font=2)
text(1.5, 3.4, round(b.sd[1,2], 3), font=3)
squares(2, 3, length=1, col="gray75")
text(2.5, 3.6, round(b.diff[1,3], 3), font=2)
text(2.5, 3.4, round(b.sd[1,3], 3), font=3)
squares(3, 3, length=1, col="gray75")
text(3.5, 3.6, round(b.diff[1,4], 3), font=2)
text(3.5, 3.4, round(b.sd[1,4], 3), font=3)
squares(1, 2, length=1, col="white")
text(1.5, 2.6, round(b.diff[2,2], 3), font=2)
text(1.5, 2.4, round(b.sd[2,2], 3), font=3)
squares(2, 2, length=1, col="white")
text(2.5, 2.6, round(b.diff[2,3], 3), font=2)
text(2.5, 2.4, round(b.sd[2,3], 3), font=3)
squares(3, 2, length=1, col="white")
text(3.5, 2.6, round(b.diff[2,4], 3), font=2)
text(3.5, 2.4, round(b.sd[2,4], 3), font=3)
squares(2, 1, length=1, col="white")
text(2.5, 1.6, round(b.diff[3,3], 3), font=2)
text(2.5, 1.4, round(b.sd[3,3], 3), font=3)
squares(3, 1, length=1, col="white")
text(3.5, 1.6, round(b.diff[3,4], 3), font=2)
text(3.5, 1.4, round(b.sd[3,4], 3), font=3)
squares(3, 0, length=1, col="white")
text(3.5, 0.6, round(b.diff[4,4], 3), font=2)
text(3.5, 0.4, round(b.sd[4,4], 3), font=3)
squares(0, 1.25, 0.25, col="gray75")
text(0.3, 1.375, expression(abs(t[(b[row]~~-~~b[col])])>1.96), pos=4)
squares(0, 0.95, 0.25, col="white")
text(0.3, 1.075, expression(abs(t[(b[row]~~-~~b[col])])<=1.96), pos=4)
text(0, 0.75, "bold", font=2)
text(0.3, 0.75, expression(b[row]-b[col]), font=2, pos=4)
text(0, 0.5, "ital", font=3)
text(0.3, 0.5, expression(S.E.(b[row]-b[col])), font=2, pos=4)
dev.off()