R-Code for Assignment 2
Regression III
ICPSR Summer Program 2005

Dave Armstrong
University of Maryland
[email protected]

July 13, 2005

1 Question 1

First, you are asked to fit a bivariate linear regression of HAPPY on HHINCOME. Then, show the fitted regression line. Finally, referring to the statistical output, describe the impact of HHINCOME on HAPPY. Below is the code used to generate the model and the accompanying output.

    mod <- lm(HAPPY ~ HHINCOME, data=health)
    summary(mod)

    Call:
    lm(formula = HAPPY ~ HHINCOME, data = health)

    Residuals:
         Min       1Q   Median       3Q      Max
    -14.4820  -1.8420   0.1740   2.1740   7.8139

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  21.5301     0.1432  150.33   <2e-16 ***
    HHINCOME      0.3280     0.0201   16.32   <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3.133 on 6361 degrees of freedom
    Multiple R-Squared: 0.04017, Adjusted R-squared: 0.04002
    F-statistic: 266.2 on 1 and 6361 DF, p-value: < 2.2e-16

Now, you may just want a table, rather than the entire output from R. There is a package in R called xtable that constructs LaTeX tables from R output. To see what you get from that, you can type xtable(mod), where mod refers to your model object. Then, you can copy that text out of R and paste it into your .tex editor. Table 1 shows the unedited output from that table. (I did put a caption in, but other than that, the table is just as it was copied out of R.)

Table 1: Bivariate Linear Regression of Happiness on Household Income

                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)   21.5301      0.1432   150.33    0.0000
    HHINCOME       0.3280      0.0201    16.32    0.0000

I actually like a different kind of output from my models: the coefficient with the standard error in parentheses. I wrote a little code for R that does this. Table 2 shows the output from this function.
    outreg <- function(object, fit=TRUE){
        out.mat <- matrix(NA, ncol=2, nrow=2*length(object$coef))
        l <- 1
        for(i in seq(from=1, to=2*length(object$coef), by=2)){
            out.mat[i, 1] <- if(summary(object)$coef[l, 4] < 0.05)
                paste(round(object$coef[l], 3), "*", sep="") else
                round(object$coef[l], 3)
            out.mat[i+1, 1] <- paste("(",
                paste(if(as.character(round(diag(vcov(object))[l], 3)) == "0")
                    "0.000" else round(diag(vcov(object))[l], 3),
                ")", sep=""), sep="")
            out.mat[i, 2] <- names(object$coef)[l]
            out.mat[i+1, 2] <- c("")
            l <- l+1
        }
        row.names(out.mat) <- out.mat[,2]
        out.mat <- cbind(rep("&", length(out.mat[,1])), out.mat[,1],
            rep(c("\\", "\\[.1in]"), length(out.mat[,1])/2))
        colnames(out.mat) <- c("", "", "")
        cat("\\begin{table}\n")
        cat("\\caption{}\n")
        cat("\\begin{center}\n")
        cat("\\begin{tabular}{ld{3}}\n")
        cat("\\hline\n")
        cat(" & \\multicolumn{1}{c}{Estimate} \\\\\n")
        cat(" & \\multicolumn{1}{c}{(S.E.)} \\\\\n")
        cat("\\hline\\hline\n")
        print.table(out.mat)
        cat("\\hline\n")
        cat("\\multicolumn{2}{l}{N=")
        cat(paste(object$df.residual + object$rank, "}\\\\\n", sep=""))
        cat("\\multicolumn{2}{l}{*p$<$0.05, 2-sided}\\\\\n")
        if(fit == TRUE){
            if(class(object)[1] == "lm"){
                cat("\\multicolumn{2}{l}{R$^{2}$=")
                cat(paste(round(summary(object)$r.squared, 3), "}\\\\\n", sep=""))
            } else if(class(object)[1] == "glm"){
                cat("\\multicolumn{2}{l}{deviance=")
                cat(paste(round(summary(object)$deviance, 3), "}\\\\\n", sep=""))
            }
        }
        cat("\\hline\\hline\n")
        cat("\\end{tabular}\n")
        cat("\\end{center}\n")
        cat("\\end{table}\n")
    }

    my.out <- outreg(mod)

Table 2: Bivariate Regression Table

                 Estimate
                   (S.E.)
    (Intercept)   21.53*
                  (0.021)
    HHINCOME       0.328*
                  (0.000)
    N=6363
    *p<0.05, 2-sided
    R^2 = 0.04

To use this code, you have to have \usepackage{dcolumn} and \newcolumntype{d}[1]{D{.}{.}{#1}} in the preamble of your LaTeX document.

Now, you are supposed to construct a graph of the regression line. This can be done the following way in R. Figure 1 shows the plot created by the code.
    postscript("C:/documents and settings/darmstrong/desktop/regression III/2.1.ps",
        height=6, width=6, horizontal=F)
    plot(jitter(health$HAPPY, 3) ~ jitter(health$HHINCOME, 4),
        xlab="Household Income", ylab="Happiness")
    abline(mod, lty=2, lwd=2, col="red")
    dev.off()

[Figure 1: Scatterplot of Happiness and Household Income with OLS Regression Line]

Looking back at the statistical output, you can see that a one-unit increase in income corresponds to an increase in the average level of Happiness of about 0.33 units. Since there are no other variables in the model, this is not holding anything constant.

2 Question 2

Here, you are asked first to construct a conditioning plot of HHINCOME versus HAPPY conditional on CLASS and SEX. We did this for the last assignment, but as a reminder, the code for a trellis display is below:

    postscript("C:/documents and settings/darmstrong/desktop/regression III/1.5b.ps",
        height=7, width=9, horizontal=F)
    xyplot(jitter(health$HAPPY, 2) ~ jitter(health$HHINCOME, 3) |
        health$CLASS + health$SEX,
        xlab="Household Income", ylab="Happiness",
        main="Happiness and Household Income Conditional on Class and Sex",
        panel=function(x, y){
            panel.xyplot(jitter(x, 2), jitter(y, 2), col="lightblue")
            panel.lmline(x, y, col="blue")
            panel.loess(x, y, col="red")
        })
    dev.off()

[Figure 2: Scatterplot of Happiness on Household Income given Class and Sex; one panel for each combination of sex (Men, Women) and class (Manual labour, Self employed, Routine nonmanual, Lower salariat, Upper salariat)]

Next, you are asked to fit a multiple regression and describe the changes (if any) in the effect of Household Income on Happiness. You can do this in R as follows.
Table 3 shows the results of outreg here.

    mod <- lm(HAPPY ~ HHINCOME + CLASS + SEX, data=health)
    summary(mod)
    outreg(mod)

Table 3: Multiple Regression of Happiness on Household Income, Class and Sex

                                 Estimate
                                   (S.E.)
    (Intercept)                   21.494*
                                  (0.023)
    HHINCOME                       0.302*
                                  (0.000)
    CLASS: Self employed           0.469*
                                  (0.044)
    CLASS: Routine nonmanual       0.331*
                                  (0.015)
    CLASS: Lower salariat          0.577*
                                  (0.012)
    CLASS: Upper salariat          0.563*
                                  (0.017)
    SEX: Men                      -0.147
                                  (0.008)
    N=6363, Reference Category (Class)=Manual Labour
    *p<0.05, 2-sided
    R^2 = 0.048

As we can see from the table, the difference in the effect is not especially large; it is about 0.026. This makes sense because it seems like the relationship between Income and Happiness is relatively constant across class and sex.

Next, you're asked to perform an Anova from the car package. Remember, this is different from the anova in base R. This can be done as follows:

    Anova(mod)

    Anova Table (Type II tests)

    Response: HAPPY
               Sum Sq   Df  F value    Pr(>F)
    HHINCOME     2017    1 207.0008 < 2.2e-16 ***
    CLASS         353    4   9.0528 2.741e-07 ***
    SEX            27    1   2.7232   0.09895 .
    Residuals   61930 6356

It seems that sex is statistically insignificant after controlling for the effects of Income and Class. We can then remove SEX from the model and re-estimate. The results are in Table 4:

    mod <- lm(HAPPY ~ HHINCOME + CLASS, data=health)
    outreg(mod)

Table 4: Regression of Happiness on Household Income and Class

                                 Estimate
                                   (S.E.)
    (Intercept)                   21.422*
                                  (0.021)
    HHINCOME                       0.296*
                                  (0.000)
    CLASS: Self employed           0.464*
                                  (0.044)
    CLASS: Routine nonmanual       0.419*
                                  (0.012)
    CLASS: Lower salariat          0.627*
                                  (0.011)
    CLASS: Upper salariat          0.61*
                                  (0.016)
    N=6363, Reference Category (Class)=Manual Labour
    *p<0.05, 2-sided
    R^2 = 0.047

3 Question 3

Here, you are asked to execute the qvcalc command for the model estimated in the previous question. This can be done as follows. Figure 3 shows a plot of the results.
    my.qv <- qvcalc(mod, "CLASS")
    postscript("C:/documents and settings/darmstrong/desktop/regression III/2.3.ps",
        height=6, width=6, horizontal=F)
    plot(my.qv)
    dev.off()

[Figure 3: Quasi-Variance Plot of CLASS; intervals based on quasi standard errors, one for each level of CLASS]

What we can see is that manual laborers are unlike most other class types, especially routine nonmanual, lower salariat and upper salariat. These differences seem significant because the intervals do not overlap. It also looks from the output of qvcalc like the estimates for manual labour and self-employed are not significantly different from each other. However, the statistical output from lm shows otherwise. We talked in class about how this could also be done mathematically by using the variance-covariance matrix provided in the estimation. Below, I will show how I would do this.

We can see from the model that coefficients 3-6 are the coefficients for the class variables. What we want is a confidence interval for the difference of each type with each other type. We know, from the statistical output, the difference between manual laborer and all of the other types; those are given by the coefficients. Otherwise, we need to get an estimate of the difference and the variance for that estimated difference. It is relatively easy to make a loop that will do these calculations for us. Below is such a loop.
    b <- coef(mod)
    v <- vcov(mod)
    b.diff <- b.sd <- matrix(NA, ncol=4, nrow=4)
    b.diff[1, ] <- 0 - b[3:6]
    b.sd[1, ] <- sqrt(diag(v)[3:6])
    for(i in seq(from=3, to=5)){
        for(j in seq(from=i+1, to=6)){
            b.diff[i-1, j-2] <- b[i] - b[j]
            b.sd[i-1, j-2] <- sqrt(v[i,i] + v[j,j] + 2*(-1)*v[j,i])
        }
    }
    b.t <- b.diff/b.sd
    row.names(b.t) <- c("Manual Laborer", "Self Employed",
        "Routine Nonmanual", "Lower Salariat")
    colnames(b.t) <- c("Self Employed", "Routine Nonmanual",
        "Lower Salariat", "Upper Salariat")
    row.names(b.diff) <- row.names(b.sd) <- row.names(b.t)
    colnames(b.diff) <- colnames(b.sd) <- colnames(b.t)

The first two lines simply define b as the vector of the six coefficients and v as the 6 x 6 variance-covariance matrix of the estimates. The third line initializes the matrices b.diff and b.sd by filling them with missing values. The fourth line fills in the first row of b.diff with what we know already, the difference between manual laborers and all the other categories. Since the matrix is set up this way, we subtract the coefficient for each of the other categories from the implied coefficient for manual laborers, which is 0. The fifth line fills in the first row of the b.sd matrix with the square roots of elements 3 to 6 of the diagonal of the variance-covariance matrix of the estimates. The sixth line sets up the for loop for i and the seventh line does the same thing for j. Here, i goes from 3 to 5, and for each value of i, j goes from i+1 to 6. This makes sure that we calculate every difference between the different types of classes. We set the first loop to run from 3 to 5 because the coefficients for the class variable are elements 3 to 6 of the coefficient vector. Since we're concerned with differences across the coefficients, the first loop only needs to go to 5, not 6, because we're not interested in the difference between b[6] and b[6].
For the same reason, the second loop goes from i+1 to 6, because we're never interested in the difference between b[i] and b[i]. Rows 8 and 9 of the code fill in the cells of the matrices. Specifically, row 8 takes the difference between two coefficients and row 9 finds the standard error for that estimate by the formula that we have seen in class:

    Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2(ab) Cov(X, Y)

which, for our purposes, translates to:

    Var((1)b_i + (-1)b_j) = (1)^2 Var(b_i) + (-1)^2 Var(b_j) + 2(1)(-1) Cov(b_i, b_j)

The next row defines a matrix of t-statistics as

    t = (b_i - b_j) / S.E.(b_i - b_j)

The final rows of the command simply assign the row and column names for the matrices constructed by the previous commands. You could stop here, as the matrices b.diff, b.sd, and b.t provide all of the necessary information. Let's take a look at those matrices:

    > b.diff
                      Self Employed Routine Nonmanual Lower Salariat Upper Salariat
    Manual Laborer       -0.4642506        -0.4190497     -0.6266219    -0.61014762
    Self Employed                NA         0.0452009     -0.1623714    -0.14589704
    Routine Nonmanual            NA                NA     -0.2075723    -0.19109795
    Lower Salariat               NA                NA             NA     0.01647431

    > b.sd
                      Self Employed Routine Nonmanual Lower Salariat Upper Salariat
    Manual Laborer         0.210301         0.1084048      0.1040208      0.1255736
    Self Employed                NA         0.2194799      0.2167699      0.2274021
    Routine Nonmanual            NA                NA      0.1211217      0.1395646
    Lower Salariat               NA                NA             NA      0.1339132

    > b.t
                      Self Employed Routine Nonmanual Lower Salariat Upper Salariat
    Manual Laborer         2.207553         3.8655995      6.0240068      4.8588828
    Self Employed                NA         0.1789734     -0.6463672     -0.5583463
    Routine Nonmanual            NA                NA     -1.1890737     -1.0133523
    Lower Salariat               NA                NA             NA      0.0877518

The first matrix, b.diff, shows the pairwise differences between estimates. Here, the matrix is defined as the estimate for the row class minus the estimate for the column class. The matrix b.sd provides the standard errors for the corresponding differences in b.diff, and the final matrix b.t contains the t-statistics for the differences.
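To make the arithmetic behind each cell concrete, here is a small numeric check. The document does all of this in R; the sketch below is in Python purely for illustration, and the variance and covariance values in it are made up for the example (they are not entries of vcov(mod)). Only the two coefficients are taken from Table 3.

```python
import math

def diff_se(var_i, var_j, cov_ij):
    # S.E. of (b_i - b_j): sqrt(Var(b_i) + Var(b_j) - 2*Cov(b_i, b_j)),
    # which is the formula above with a = 1 and b = -1.
    return math.sqrt(var_i + var_j - 2 * cov_ij)

# Hypothetical variance/covariance values, chosen to keep the arithmetic simple.
var_i, var_j, cov_ij = 0.04, 0.02, 0.01
b_i, b_j = 0.469, 0.331   # two of the class coefficients from Table 3

se = diff_se(var_i, var_j, cov_ij)   # sqrt(0.04 + 0.02 - 0.02) = 0.2
t = (b_i - b_j) / se                 # 0.138 / 0.2 = 0.69
print(se, round(t, 2))
```

A |t| below 1.96, as here, would mean the two class effects are statistically indistinguishable at the 0.05 level, which is exactly the comparison the b.t matrix reports for every pair of classes.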
As you can see, manual laborers have a significantly lower average than all other classes. However, there are no other significant differences between classes. Constructing a graph like this might be your best bet, especially if differences between the categories of a categorical variable are of substantive interest in your research. I would suggest a graph that looks something like the one in Figure 4.

[Figure 4: Graphical Depiction of Average Differences Between Classes. A matrix of cells, one for each pair of classes, showing b_row - b_col in bold and S.E.(b_row - b_col) in italics; cells with |t(b_row - b_col)| > 1.96 are shaded gray, the rest are left white.]

The code for this graph is not elegant; there is a lot of "brute force" manipulation with the text and polygon commands in R. However, it would not actually take that long to construct this graph, and it is visually striking, I think. The code for this graph is attached in Appendix 1. If you have any questions about it, please let me know.

4 Question 4

Here, you are asked to estimate a model regressing Happiness on Household Income, Class and the interaction of the two independent variables. Then, graph the effects of Income conditional on class. First, we will estimate and summarize the model. Table 5 shows the results of this model estimation.

Table 5: Regression of HAPPY on Income, Class and Income x Class

                                           Estimate
                                             (S.E.)
    (Intercept)                             20.987*
                                            (0.047)
    HHINCOME                                 0.364*
                                            (0.001)
    CLASS: Self employed                     2.198*
                                            (0.488)
    CLASS: Routine nonmanual                 0.755
                                            (0.15)
    CLASS: Lower salariat                    1.292*
                                            (0.154)
    CLASS: Upper salariat                    1.911*
                                            (0.27)
    HHINCOME x CLASS: Self employed         -0.256*
                                            (0.009)
    HHINCOME x CLASS: Routine nonmanual     -0.053
                                            (0.003)
    HHINCOME x CLASS: Lower salariat        -0.1
                                            (0.003)
    HHINCOME x CLASS: Upper salariat        -0.18*
                                            (0.005)
    N=6363, Reference Category (Class)=Manual Labour
    *p<0.05, 2-sided
    R^2 = 0.049

Now, we need to create two different graphs: one trellis display of the effects of income conditional on class, and one graph that simply places all of the lines on the same plot without confidence bounds. These are shown in Figure 5.

[Figure 5: Trellis display of the effect of income conditional on class, with one panel per class (Manual labour, Self employed, Routine nonmanual, Lower salariat, Upper salariat), along with a single plot overlaying all five class-specific regression lines]

5 Question 5

Here, you are asked to fit a linear model regressing prestige on type and education. The results of this model are in Table 6.

Table 6: Regression of Prestige on Education and Occupation Type

                   Estimate
                     (S.E.)
    (Intercept)     -2.698
                   (32.903)
    education        4.573*
                    (0.451)
    type: prof       6.142
                   (18.139)
    type: wc        -5.458*
                    (7.24)
    N=98, Reference Category (type)=Blue Collar
    *p<0.05, 2-sided
    R^2 = 0.798

Now, it asks you to "attempt" to construct a plot of estimates and confidence intervals based on quasi-standard errors. When you do this, you get the following result:

    > my.qv <- qvcalc(mod, "type")
    Warning message:
    NaNs produced in: sqrt(qv)
    > plot(my.qv)
    Error in plot.qv(my.qv) : No comparison intervals available,
    since one of the quasi variances is negative. See ?qvcalc for more.
If you look at the help file for qvcalc, you will see the following explanation:

    Occasionally one (and only one) of the quasi variances is negative, and so the
    corresponding quasi standard error does not exist (it appears as 'NaN'). This is
    fairly rare in applications, and when it occurs it is because the factor of
    interest is strongly correlated with one or more other predictors in the model.
    It is not an indication that quasi variances are inaccurate. An example is shown
    below using data from the 'car' package: the quasi variance approximation is
    exact (since 'type' has only 3 levels), and there is a negative quasi variance.
    The quasi variances remain perfectly valid (they can be used to obtain inference
    on any contrast), but it makes no sense to plot 'comparison intervals' in the
    usual way since one of the quasi standard errors is not a real number.

Here, since there is only one comparison we cannot obtain from the statistical output, calculating this information would be relatively easy. Give it a shot yourself and see what you come up with.

Appendix 1: Constructing Graphical Difference Matrix

    squares <- function(x0, y0, length=1, col){
        x.vert <- c(x0+0.05, x0+length-0.05, x0+length-0.05, x0+0.05)
        y.vert <- c(y0+0.05, y0+0.05, y0+length-0.05, y0+length-0.05)
        polygon(x.vert, y.vert, col=col)
    }

squares is basically a wrapper function for polygon that helps when you want to draw squares. It takes four arguments: the starting values for x and y (the coordinates of the lower left-hand corner of the box), the length of the sides, and the color to be used to shade in the box. Writing this function is not strictly necessary, but it ends up saving a lot of time below. I'll go through the first few lines of the code below, as this should familiarize you with the necessary commands. The first line is a postscript wrapper for the graph that saves everything between postscript() and dev.off() to the file listed in the command.
The second line sets up the plotting space to be 4 units square, but with no axes and nothing inside the plot. The third and fourth lines put axes in the space. The fifth line draws the first square, in the upper left-hand corner of the graph. This one corresponds to the difference between manual laborers and self-employed folks. I shaded it in with col="gray75" because its t-statistic is greater than 1.96; otherwise, I leave the boxes unshaded. The following two text commands place the estimated difference in bold and the estimated standard error of the difference in italics within the box. The remaining commands do the same thing for all of the other boxes.

    postscript("C:/documents and settings/darmstrong/desktop/regression III/diffmat.ps",
        height=6, width=6, horizontal=F)
    plot(c(0, 4), c(0, 4), type="n", main="", xlab="", ylab="", axes=F)
    axis(3, at=c(0.5, 1.5, 2.5, 3.5),
        labels=c("Self\n Employed", "Routine\n Nonman",
                 "Lower\n Salariat", "Upper\n Salariat"), tick=F, lwd=0)
    axis(2, at=c(0.5, 1.5, 2.5, 3.5),
        labels=c("Lower\n Salariat", "Routine\n Nonman",
                 "Self\n Employed", "Manual\n Laborer"), tick=F, lwd=0)
    squares(0, 3, length=1, col="gray75")
    text(0.5, 3.6, round(b.diff[1,1], 3), font=2)
    text(0.5, 3.4, round(b.sd[1,1], 3), font=3)
    squares(1, 3, length=1, col="gray75")
    text(1.5, 3.6, round(b.diff[1,2], 3), font=2)
    text(1.5, 3.4, round(b.sd[1,2], 3), font=3)
    squares(2, 3, length=1, col="gray75")
    text(2.5, 3.6, round(b.diff[1,3], 3), font=2)
    text(2.5, 3.4, round(b.sd[1,3], 3), font=3)
    squares(3, 3, length=1, col="gray75")
    text(3.5, 3.6, round(b.diff[1,4], 3), font=2)
    text(3.5, 3.4, round(b.sd[1,4], 3), font=3)
    squares(1, 2, length=1, col="white")
    text(1.5, 2.6, round(b.diff[2,2], 3), font=2)
    text(1.5, 2.4, round(b.sd[2,2], 3), font=3)
    squares(2, 2, length=1, col="white")
    text(2.5, 2.6, round(b.diff[2,3], 3), font=2)
    text(2.5, 2.4, round(b.sd[2,3], 3), font=3)
    squares(3, 2, length=1, col="white")
    text(3.5, 2.6, round(b.diff[2,4], 3), font=2)
    text(3.5, 2.4, round(b.sd[2,4], 3), font=3)
    squares(2, 1, length=1, col="white")
    text(2.5, 1.6, round(b.diff[3,3], 3), font=2)
    text(2.5, 1.4, round(b.sd[3,3], 3), font=3)
    squares(3, 1, length=1, col="white")
    text(3.5, 1.6, round(b.diff[3,4], 3), font=2)
    text(3.5, 1.4, round(b.sd[3,4], 3), font=3)
    squares(3, 0, length=1, col="white")
    text(3.5, 0.6, round(b.diff[4,4], 3), font=2)
    text(3.5, 0.4, round(b.sd[4,4], 3), font=3)
    squares(0, 1.25, 0.25, col="gray75")
    text(0.3, 1.375, expression(abs(t[(b[row]~~-~~b[col])]) > 1.96), pos=4)
    squares(0, 0.95, 0.25, col="white")
    text(0.3, 1.075, expression(abs(t[(b[row]~~-~~b[col])]) <= 1.96), pos=4)
    text(0, 0.75, "bold", font=2)
    text(0.3, 0.75, expression(b[row]-b[col]), font=2, pos=4)
    text(0, 0.5, "ital", font=3)
    text(0.3, 0.5, expression(S.E.(b[row]-b[col])), font=2, pos=4)
    dev.off()
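As a coda to the Question 5 exercise: the single comparison the statistical output does not give directly (prof versus wc) can be computed with the same variance-of-a-difference arithmetic used in Question 3. The sketch below is in Python for illustration only; the two coefficients come from Table 6, the variances are hypothetical stand-ins, and the covariance is likewise a made-up value. In practice you would take all three from vcov(mod) in R.

```python
import math

# Coefficients for type: prof and type: wc, from Table 6.
b_prof, b_wc = 6.142, -5.458

# Hypothetical variance and covariance values, for illustration only;
# the real numbers come from the model's variance-covariance matrix in R.
var_prof, var_wc = 18.139**2, 7.24**2
cov_pw = 30.0

diff = b_prof - b_wc
se = math.sqrt(var_prof + var_wc - 2 * cov_pw)
t = diff / se
print(round(diff, 3), round(se, 3), round(t, 3))
```

Whether the resulting |t| exceeds 1.96 tells you whether professional and white-collar occupations differ significantly in prestige once education is held constant, which is exactly the comparison the broken quasi-variance plot could not display.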