Handout 3

Statistical Programming Camp: An Introduction to R
Handout 3: Data Manipulation and Summarizing Univariate Data
Fox Chapters 1-3, 7-8
In this handout, we cover the following new materials:
Using logical operators: <, <=, >, >=, ==, !=, &, |, and is.na()
Subsetting data with [] and subset() using logical expressions
Using ifelse() for conditional statements
More functions for summary statistics: var() (variance), sd() (standard deviation), weighted.mean(), quantile(X, P), and IQR() (Inter-quartile Range)
Applying functions by indexes using tapply()
Using function() to create user-defined functions.
Common arguments for graphs: main (main title), xlab and ylab (axis labels), xlim and ylim (axis limits), pch (point symbol), lty (line type), lwd (line width), col (color), and cex (sizing)
Adding features to graphs with lines() and abline() (lines), points() (points), text() (text), and arrows() (arrows)
Using identify() to identify points on graphs.
Using \n to break lines.
Using par(mfrow = c(X, Y)) at the beginning of graphical commands to produce X by Y figure in one graphical window.
Using hist() to generate histograms.
Calculating a smooth density via density()
Adding a legend to an existing graph by legend()
Printing and saving graphs
We will cover the following Statistical Programming Camp Coding Rule:
Curly Brackets
1 Logical Operators and Values
Logical operators (<, <=, >, >=, ==, and !=) allow for data manipulation and subsetting by
determining whether a specified condition is TRUE or FALSE. Both values must be written in
uppercase and are special values in R, just like NA. The operators follow standard usage. For
instance, >= evaluates whether a number is greater than or equal to a specified value, and !=
corresponds to “not equal”. The output of a logical statement is of the class logical.
> "Hello" == "hello"
[1] FALSE
> y <- 3 < 4
> y
[1] TRUE
> class(y)
[1] "logical"
Logical operators may be applied to individual data entries or entire vectors (or even a dataframe!).
When applied to a vector, logical operators evaluate each element of the vector.
> x <- c(3, 2, 1, -2, -1)
> x != 1
[1]  TRUE  TRUE FALSE  TRUE  TRUE
Combine logical statements and operations with & (and) and | (or).
> x > 2 | x <= -1
[1]  TRUE FALSE FALSE  TRUE  TRUE
> x > 0 & x <= 2
[1] FALSE  TRUE  TRUE FALSE FALSE
Combining logical operators with other commands allows us to perform operations only on
elements that meet the logical condition. For instance, we can count the number of TRUE
statements using sum().
> sum(x > 0 & x <= 2) # Adds up the number of TRUE statements
[1] 2
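Because R treats TRUE as 1 and FALSE as 0 in arithmetic, the same idea gives proportions as well as counts. The following is a minimal sketch that rebuilds the vector x so it can be run on its own; the object names n.true, p.true, and idx are chosen for illustration.

```r
x <- c(3, 2, 1, -2, -1)

# number of elements that are positive and at most 2
n.true <- sum(x > 0 & x <= 2)

# proportion of elements that satisfy the same condition
p.true <- mean(x > 0 & x <= 2)

# positions of the elements that satisfy the condition
idx <- which(x > 0 & x <= 2)
```

Here mean() of a logical vector is simply the share of TRUE values, and which() converts the logical vector into positions.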
The function is.na() identifies missing data by returning TRUE for each missing element. Many
functions, such as mean(), also accept the argument na.rm = TRUE to remove missing data
before computing.
> x <- c(x, NA)
> is.na(x) # identifies missing data by returning a logical vector
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> mean(x) # cannot compute the mean due to missing data
[1] NA
> mean(x[!is.na(x)]) # calculates the mean for only non-missing data
[1] 0.6
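The two ways of handling missing data can be compared directly. The sketch below rebuilds x so that it runs on its own and computes the mean both by subsetting with !is.na() and by passing na.rm = TRUE to mean(); the object names m1, m2, and n.missing are illustrative.

```r
x <- c(3, 2, 1, -2, -1, NA)

m1 <- mean(x[!is.na(x)])     # subset first, then average
m2 <- mean(x, na.rm = TRUE)  # let mean() drop the NA itself
n.missing <- sum(is.na(x))   # count the missing entries
```

Both approaches should give the same value; na.rm = TRUE is usually shorter to write, while the subsetting form works with any function.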
2 Subsetting with Logical Expressions
For the remainder of this handout, we will use the following data, a collection of county-level
data used by D. Matthews and J. Prothro in Negroes and the New Southern Politics. We can
answer some interesting questions using this data set. Does the state-wide mean value of black voter
registration depend on the existence of a poll tax? What about the literacy requirement? Finally, what
do you find when considering the four combinations of these two? The variables of the data are:
Variable       Description
state          state name
county         county code
polltax        the existence of polltax (1 = Yes, 0 = No)
litreq         the existence of literacy requirement (1 = Yes, 0 = No)
blackPop       1960 % black of state population (100s)
pBlackReg      % black voting age population registered in 1964 (black registration rate)
fedex66        federal examiner present in county in 1966
pIncreaseReg   % increase in black registration rate from 1964 to 1968
Previously, we learned that vectors and data frames can be subsetted by using brackets ([ ]).
For example, a subset of a data frame can be obtained by specifying row numbers (or row names)
and column numbers (or column names) in brackets. Logical expressions can also be used within
brackets for subsetting.
> reg <- read.table("registration.txt", header=TRUE)
> ## black registration is lower where polltax is present
> mean(reg$pBlackReg[reg$polltax == 1])
[1] 17.95895
> mean(reg$pBlackReg[reg$polltax == 0])
[1] 38.24883
> ## black registration is lower where literacy requirement is imposed
> mean(reg$pBlackReg[reg$litreq == 1])
[1] 30.35514
> mean(reg$pBlackReg[reg$litreq == 0])
[1] 53.07164
> ## black registration is lowest where both requirements are present
> mean(reg$pBlackReg[(reg$polltax == 1) & (reg$litreq == 1)])
[1] 17.95895
> ## no observations returns NaN
> mean(reg$pBlackReg[(reg$polltax == 1) & (reg$litreq == 0)])
[1] NaN
> mean(reg$pBlackReg[(reg$polltax == 0) & (reg$litreq == 1)])
[1] 34.63745
> mean(reg$pBlackReg[(reg$polltax == 0) & (reg$litreq == 0)])
[1] 53.07164
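The registration data file is not reproduced here, so the following sketch uses a small made-up data frame (toy, with invented values) to show how a logical expression inside [ ] selects rows, and why an empty selection produces NaN.

```r
# hypothetical data, invented for illustration only
toy <- data.frame(polltax   = c(1, 1, 0, 0, 0),
                  litreq    = c(1, 1, 1, 0, 0),
                  pBlackReg = c(10, 20, 30, 50, 60))

# logical vector marking rows with no poll tax but a literacy requirement
keep <- (toy$polltax == 0) & (toy$litreq == 1)

# rows that satisfy the condition, all columns
toy[keep, ]

# mean registration rate by poll tax status
mean.tax   <- mean(toy$pBlackReg[toy$polltax == 1])
mean.notax <- mean(toy$pBlackReg[toy$polltax == 0])

# no row has a poll tax without a literacy requirement, so the
# selection is empty and its mean is NaN
mean.empty <- mean(toy$pBlackReg[(toy$polltax == 1) & (toy$litreq == 0)])
```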
In addition to [ ], subset() may be used to subset data. It takes a vector or data frame as
the first argument. Users can then specify subset and/or select as arguments.
The former should be a logical vector indicating the elements or rows to keep, while the latter should
specify the variables to keep (either by a vector of variable names or by a numeric vector indicating
column numbers).
> ## counties with a higher than average black population but lower than
> ## average registration rate
> lowreg <- subset(reg, subset = ((reg$blackPop >= mean(reg$blackPop))
+                  & (reg$pBlackReg <= mean(reg$pBlackReg))),
+                  select = c("blackPop", "pBlackReg", "polltax", "litreq"))
> ## How many impose both polltax and literacy requirement
> nrow(lowreg[(lowreg$polltax == 1) & (lowreg$litreq == 1), ])
[1] 34
> ## Another way
> sum((lowreg$polltax == 1) & (lowreg$litreq == 1))
[1] 34
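Inside subset(), column names can also be written without the data frame prefix, because the subset and select arguments are evaluated within the data frame. The sketch below uses a made-up data frame (toy and its values are invented) and checks that subset() and [ ] return the same rows and columns.

```r
# hypothetical data, invented for illustration only
toy <- data.frame(blackPop  = c(5, 40, 60, 10, 35),
                  pBlackReg = c(70, 10, 20, 55, 45),
                  polltax   = c(0, 1, 1, 0, 0))

# high population, low registration; keep two columns only
low <- subset(toy,
              subset = blackPop >= mean(blackPop) & pBlackReg <= mean(pBlackReg),
              select = c(blackPop, pBlackReg))

# the same rows and columns using [ ]
low2 <- toy[toy$blackPop >= mean(toy$blackPop) &
            toy$pBlackReg <= mean(toy$pBlackReg),
            c("blackPop", "pBlackReg")]
```

The shorter form is convenient at the console; the explicit toy$ form makes it clearer where each variable comes from.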
3 Using Conditional Statements via ifelse()
Conditional Statements evaluate a logical statement, then perform different actions depending on
whether the statement is true or false. The function ifelse(X, Y, Z) performs an action Y and
returns the result of this action as the output if the statement X is true and performs Z and returns
the output if X is false.
> ## Creating a new variable indicating counties with higher than average
> ## black population and polltax
> reg$highpoptax <- ifelse((reg$blackPop >= mean(reg$blackPop) & reg$polltax == 1),
+                          "Yes", "No")
> ## a more complex example creating region variable
> reg$region <- ifelse(reg$state=="Alabama" | reg$state=="Georgia" |
+                      reg$state=="Louisiana" | reg$state=="Mississippi" |
+                      reg$state=="South Carolina", "Deep South", "Peripheral South")
> reg$region <- as.factor(reg$region)
> table(reg$region)

      Deep South Peripheral South
             361               76
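Long chains of == joined by | can be written more compactly with the %in% operator, which tests whether each element belongs to a given vector. The sketch below uses a short made-up state vector; the object names state, deep, region, and counts are illustrative.

```r
# hypothetical state labels, invented for illustration only
state <- c("Alabama", "Florida", "Georgia", "North Carolina", "Mississippi")
deep  <- c("Alabama", "Georgia", "Louisiana", "Mississippi", "South Carolina")

# TRUE where the state is one of the Deep South states
region <- ifelse(state %in% deep, "Deep South", "Peripheral South")
region <- as.factor(region)
counts <- table(region)
```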
4 More Functions for Summarizing Data
In addition to the functions we learned last week (i.e., mean(), median(), min(), max(), and
range()), we have the following new functions that are useful for summarizing data.

var() (variance) and sd() (standard deviation) summarize numeric data.
> ## two ways of calculating standard deviation
> sd(reg$pBlackReg)
[1] 25.34743
> sqrt(var(reg$pBlackReg))
[1] 25.34743
Weighted mean can be computed using weighted.mean(X, Y), where the output is the
mean of X weighted by Y.
> ## overall registration rate should be weighted by county population
> weighted.mean(reg$pBlackReg, reg$blackPop)
[1] 32.04816
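For reference, the weighted mean of X with weights Y equals sum(X * Y) / sum(Y). The sketch below checks this with invented numbers; rate and pop are hypothetical stand-ins for county registration rates and populations.

```r
rate <- c(10, 50, 90)  # hypothetical county registration rates
pop  <- c(1, 1, 8)     # hypothetical county populations used as weights

wm.builtin <- weighted.mean(rate, pop)
wm.manual  <- sum(rate * pop) / sum(pop)
plain.mean <- mean(rate)  # ignores the weights
```

The large county pulls the weighted mean toward its own rate, which is why the weighted and unweighted means differ.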
The function quantile(X, P) provides the sample quantiles of a numeric vector X for each
element of another numeric vector P.
> quantile(reg$pBlackReg) # the default is quartiles plus min and max
  0%  25%  50%  75% 100%
 0.0 12.3 29.3 52.9 99.9
> quantile(reg$pBlackReg, seq(from = 0.2, to = 0.8, by = 0.2)) # quintiles
  20%   40%   60%   80%
 8.70 23.10 36.00 58.18
The function IQR() returns the interquartile range.
> IQR(reg$pBlackReg)
[1] 40.6
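The interquartile range is the difference between the 75th and 25th percentiles (in the output above, 52.9 - 12.3 = 40.6), so it can be recovered from quantile(). A sketch with invented values; v, q, iqr.manual, and iqr.builtin are illustrative names.

```r
v <- c(2, 4, 4, 5, 7, 9, 10, 12)  # invented values

q <- quantile(v, c(0.25, 0.75))   # first and third quartiles
iqr.manual  <- unname(q[2] - q[1])
iqr.builtin <- IQR(v)
```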
5 Applying Functions by Indexes
In many situations, we want to apply the same function repeatedly for different parts of the data. For
example, in the black registration data, we may want to compute the registration rate within each
state. Doing this manually is a pain especially if the number of states is large; you have to subset the
data for one state and then use mean() to compute the registration for that state, and this has to be
repeated for each state.
The function tapply() (t is shorthand for table) enables you to do such a computation in
one line. Specifically, tapply(X, INDEX, FUN) applies the function FUN to X for each of the
groups defined by a vector INDEX. Replace FUN with mean, median, sd, etc. to generate the desired
quantity.
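A small self-contained illustration with made-up data (the vectors rate and state below are invented) shows that tapply() gives the same answer as subsetting one group at a time.

```r
rate  <- c(20, 30, 50, 60, 70, 10)          # invented values
state <- c("A", "A", "B", "B", "B", "C")    # invented group labels

by.state <- tapply(rate, state, mean)  # one mean per state
manual.B <- mean(rate[state == "B"])   # the manual route for one state
```

The result of tapply() is named by group, so by.state["B"] picks out the value for a single state.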
> ## Calculate the mean of % black registration rates by state
> tapply(reg$pBlackReg, reg$state, mean)
       Alabama        Florida        Georgia      Louisiana    Mississippi
     24.142424      53.071642      31.707006      37.990476       3.886207
North Carolina South Carolina
     47.977777      37.436956
6 Writing Functions
One of the greatest benefits of R is the flexibility the software allows for users to write their own
functions. The syntax takes the form of name <- function(bar1, bar2, ...){ commands }, where
name is the function name, (bar1, bar2, ...) are the inputs, and the commands within the curly
braces { } define the function. We begin with a simple example, creating a function to compute the mean from a
vector with missing data. Note that an opening curly brace should never go on its own line. A closing
curly brace should always go on its own line. Additionally, code within the braces should be aligned
according to the text editor’s automatic alignment.
> x <- c(10:22, NA, 1:7, NA, 5)
> mean(x) # cannot compute mean due to missing data
[1] NA
> my.mean <- function(x){
+     x <- x[!is.na(x)] # removes missing data
+     sum <- sum(x)
+     length <- length(x)
+     mean <- sum/length
+     out <- c(sum, length, mean) # define the output
+     names(out) <- c("sum", "length", "mean")
+     out # end function by calling output
+ }
> my.mean(x)
      sum    length      mean
241.00000  21.00000  11.47619
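User-defined functions can also take arguments with default values and can be passed to other functions such as tapply(). The sketch below is one possible variation, not part of the original example; the function name my.spread, its argument na.rm, and the vectors g and v are chosen for illustration.

```r
# distance between the largest and smallest value of x;
# missing values are ignored unless na.rm = FALSE is given
my.spread <- function(x, na.rm = TRUE){
    out <- max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
    out
}

x <- c(10:22, NA, 1:7, NA, 5)
w <- my.spread(x)  # uses the default na.rm = TRUE

# the same function applied within groups
g <- c("a", "a", "b", "b", "b")
v <- c(1, 4, NA, 10, 2)
w.by.group <- tapply(v, g, my.spread)
```

Giving na.rm a default keeps the common case short while still letting the caller switch the behavior off.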
Programming Camp Coding Rule: Curly Brackets
An opening curly brace should never go on its own line. A closing curly brace should always go
on its own line. Code within brackets should be properly aligned.
GOOD Code:
name <- function(bar1, bar2, ...){
    command1 <- code1
    command2 <- code2
}
BAD Code:
name <- function(bar1, bar2, ...)
{command1 <- code1
command2 <- code2}
7 Graphs for Univariate Data: Histograms
Graphs are critical tools for summarizing data in a straightforward and easy-to-understand manner.
Great graphics strengthen projects and reports by illustrating central features of the data without much
additional explanation. Bad graphics are inefficient (leaving out critical information such as labels),
potentially misleading, or too complicated.
There are several common graphing arguments that specify basic features of the graph, including
the number of figures included on a graph, titles, axis labels, data range, etc. The following table
summarizes these arguments:
main         Main title of the graph.
xlab, ylab   Labels for the x-axis and y-axis.
xlim, ylim   Specifies the x-limits and y-limits, as in xlim = c(0, 10), for the interval [0, 10].
col          Specifies the color to use, e.g., "blue" or "red".
cex          Specifies size of plotted text or symbols.
cex.axis     Specifies size of axis annotation.
cex.main     Specifies size of plot title.
The second class of graphing commands adds features to an existing graph. These
functions include points() for adding points, lines() for lines, and text() for text.
lines()    Adds a plot-line to the figure
           e.g. lines(x, y) where x and y define coordinates
abline()   Adds a straight line
           e.g. abline(h = x) to place a horizontal line at height x
           e.g. abline(v = x) to place a vertical line at point x
           e.g. abline(a = x, b = y) to place a line with intercept x and slope y
points()   Adds points
           e.g. points(x, y) to place dots with x and y as the coordinates
           e.g. points(x, y, type = "l") connects the dots as a line
text()     Adds additional text
           e.g. text(x, y, z) to display z as text centered at coordinates (x, y)
arrows()   Adds arrows
           e.g. arrows(x0, y0, x1, y1, length, angle, code) to display an arrow beginning at
           coordinates (x0, y0) and ending at coordinates (x1, y1), with arrowhead edges of the
           length specified, at the angle specified, of the arrow type specified by code = ,
           and of the color specified by col.
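The functions in the table can be combined on a single plot. The sketch below uses invented coordinates and draws on whatever graphics device is active; every value in it is illustrative.

```r
x <- 1:10
y <- c(2, 4, 3, 6, 7, 6, 9, 8, 11, 10)  # invented values

# start a new plot, with a two-line title created by \n
plot(x, y, main = "Adding features\nto an existing plot",
     xlab = "x", ylab = "y", xlim = c(0, 11), ylim = c(0, 12), pch = 16)
lines(x, y, lty = 3)                       # connect the points
abline(h = mean(y), col = "red", lty = 2)  # horizontal line at the mean of y
abline(a = 1, b = 1, col = "blue")         # line with intercept 1 and slope 1
points(5, 2, pch = 4, cex = 1.5)           # one extra point
text(5, 1, "extra point")                  # label centered at (5, 1)
arrows(7, 3, 5.3, 2.1, length = 0.1)       # arrow from (7, 3) to (5.3, 2.1)
```

Each of these calls modifies the plot created by plot(); calling one of them before any plot exists produces an error.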
The function identify() allows us to click on points in our graphs and R will return meaningful
data about those datapoints. When done, press Esc.
The escape sequence \n inside a character string will force a line break. This is convenient to use with long plot titles.
The command par(mfrow = c(X, Y)) will produce an X by Y figure in one graphical window. mfrow
means the graphs will be filled by row, whereas mfcol means they will be filled by column.
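When par() is used to change a setting, it returns the previous settings, so they can be stored and restored once the multi-panel figure is finished. A sketch with simulated data; the object names a, b, old.par, layout.during, and layout.after are illustrative.

```r
set.seed(1)      # make the random draws reproducible
a <- rnorm(100)  # simulated values, for illustration only
b <- runif(100)

old.par <- par(mfrow = c(1, 2))  # 1 row, 2 columns; keep the old settings
hist(a, main = "Normal draws")
hist(b, main = "Uniform draws")
layout.during <- par("mfrow")    # layout while both panels are active
par(old.par)                     # restore the previous layout
layout.after <- par("mfrow")
```

Restoring the stored settings has the same effect here as calling par(mfrow = c(1, 1)), but it also works when the earlier layout was something else.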
The function hist() will produce a histogram to summarize the distribution of data. Setting
freq = FALSE within hist() will produce a density histogram, in which the areas of the bars
sum to one, rather than a frequency (count) plot. If you specify a single number as the argument
breaks, R treats it as a suggested number of equally spaced bins and may adjust it so that the
breakpoints fall on round values. If you give a numeric vector instead, it will specify the
breakpoints between histogram cells.
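On the density scale, each bar height is the count divided by the number of observations times the bin width, so the bar areas add up to one. This can be checked on the object that hist() returns. The data below are invented, and plot = FALSE suppresses the drawing.

```r
v <- c(3, 8, 12, 15, 21, 22, 27, 35, 41, 48, 52, 67, 73, 88, 95)  # invented values

# breakpoints given as a numeric vector: five bins of width 20
h <- hist(v, breaks = seq(0, 100, by = 20), plot = FALSE)

h$counts   # observations per bin
h$density  # bar heights used when freq = FALSE

# area of each bar is height times width; the areas sum to one
total.area <- sum(h$density * diff(h$breaks))
```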
The function density() will calculate the smooth density of a numeric object as an output,
which then in turn can be an input to the plot() function to draw the smooth histogram (use
the lines() function to add it to the existing graph).
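A sketch of the overlay just described, with simulated data standing in for a real variable; the object names v and d are illustrative.

```r
set.seed(2)                          # make the random draws reproducible
v <- rnorm(200, mean = 50, sd = 10)  # simulated values, for illustration only

d <- density(v)                      # object with components x and y

hist(v, freq = FALSE, main = "Histogram with smooth density",
     xlab = "Value")
lines(d, col = "blue", lwd = 2)      # add the density curve to the histogram
```

The histogram is drawn with freq = FALSE so that its vertical axis is on the same density scale as the curve.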
To add a legend to an existing graph, use legend(). The syntax legend(x, y, z) adds a
legend with text z at coordinates (x, y). The coordinates can also be replaced with a keyword such as "topleft" or "bottomright".
> ## begin by subsetting the data
> examiner <- reg[reg$fedex66 == 1, ]
> noexaminer <- reg[reg$fedex66 == 0, ]
> ## side by side histograms of registration rates
> par(mfrow = c(1, 2))
> hist(examiner$pBlackReg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "Federal Examiner Present",
+      xlab = "Registration Rates")
> hist(noexaminer$pBlackReg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "No Federal Examiner Present", cex.main = 0.995, ## smaller plot title
+      xlab = "Registration Rates")
[Figure: two side-by-side histograms titled "Federal Examiner Present" and "No Federal Examiner Present"; x-axis: Registration Rates, y-axis: Density]
> ## return to single graph
> par(mfrow = c(1,1))
> ## histogram for counties with examiner
> hist(examiner$pBlackReg, freq = FALSE, breaks = 10, xlim = c(0, 100),
+      main = "Registration Rates \n Federal Examiner Present", cex.main = 1.5,
+      xlab = "Registration Rates", cex.axis = 1.5)
> ## add counties with no-examiner as smooth density
> lines(density(noexaminer$pBlackReg))
> ## add lines to compare median of counties with/without examiner
> abline(v = median(examiner$pBlackReg), col = "red", lty = 2)
> abline(v = median(noexaminer$pBlackReg), col = "blue", lty = 2)
> ## add legend
> legend("topright", c("Examiner Median", "No Examiner Median"),
+        lty = c(2, 2), col = c("red", "blue"))
[Figure: histogram titled "Registration Rates / Federal Examiner Present" with a smooth density curve and dashed median lines; legend: Examiner Median, No Examiner Median; x-axis: Registration Rates (0 to 100), y-axis: Density]
8 Printing and Saving Graphs
There are a few ways to print and save the graphs you create in R.
In the window of your graph (if you are a Mac user, make sure your graphic window rather than
the R console is selected), you can click File: Save as: PDF... or File: Print....
You can also right-click on a figure in R and copy the image (if you are a Mac user, you need to
highlight the graph and type Apple+C to copy it). Then paste that image into Microsoft Word
or any other document.
You can also do it via a command by using pdf() before your plotting commands and then
dev.off() afterwards.
> pdf(file = "myplot.pdf", height = 3, width = 5) # height and width are in inches
> ## your plotting commands go here
> dev.off() ## This creates a pdf file in the working directory
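Put together, a complete sequence looks like the sketch below. It writes to a temporary file so that it can be run anywhere; the object names out.file and saved and the plotted values are illustrative.

```r
out.file <- tempfile(fileext = ".pdf")  # replace with e.g. "myplot.pdf"

pdf(file = out.file, height = 3, width = 5)  # open the pdf device
hist(c(1, 2, 2, 3, 3, 3, 4, 4, 5), main = "Example", xlab = "Value")
dev.off()                                    # close the device to finish the file

saved <- file.exists(out.file)               # check that the file was written
```

Until dev.off() is called the file may be incomplete, so it should not be opened before the device is closed.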
9 Practice Questions
9.1 Supreme Court Justice Ideal Points
In a 2002 article, Andrew Martin and Kevin Quinn explored the extent to which the ideal points (i.e.,
policy preferences) of Supreme Court Justices change throughout their tenure on the Court. The data
set contains the following:
term – Supreme Court Term
justice – Justice’s Last Name
idealpt – Justice’s Estimated Ideal Point, where negative values indicate liberal leanings and positive values indicate conservative leanings
pparty – President’s Political Party
1. Using the tapply() function, create a variable for the median ideal point of court justices for
each term of the court.
2. Generate a new variable in the justices data set to indicate whether each justice falls on the
Conservative or Liberal end of the ideal point spectrum. Using ifelse(), generate a new
variable that takes a value of Liberal if the justice’s ideal point is less than 0 and a value of
Conservative if the justice’s ideal point is greater than 0. Using table(), determine how
many justices in the data set were Conservative and how many were Liberal.
3. Create a histogram of justice’s ideal points. Using tapply(), calculate each justice’s median
ideal point. Generate a histogram of the justice’s ideal points. Be sure to add an informative
title and labels. Create a red, vertical dashed line indicating the median. Additionally, add the
density line to the plot. Save the graph you created as a pdf file using the file name xxx.pdf
where xxx is your netID. Submit it to Blackboard along with your R script file xxx.R (Do not
turn in your R console print out).
9.2 The Impact of Increases in the Minimum Wage
Many economists believe that increasing the minimum wage actually hurts the poor, the very part of
the population such a policy is supposed to help. The reason is that if employers have to pay higher
wages, they will simply hire fewer people. This means that those who are earning the minimum
wage may lose their jobs as a result of increasing the minimum wage. Two researchers, David Card
and Alan Krueger, tested this argument using data from the fast food industry in New Jersey and
Pennsylvania. We analyze their original data in this precept. The njmin.txt data file, available at
Blackboard, contains the following variables:
Variable     Description
chain        fast food chain
location     store location (southJ, centralJ, northJ, shoreJ & PA)
wageBefore   starting wage measured before the increase
wageAfter    starting wage measured after the increase
fullBefore   number of full-time employees before the increase
fullAfter    number of full-time employees after the increase
partBefore   number of part-time employees before the increase
partAfter    number of part-time employees after the increase
1. Load the data into R.
2. Create a factor variable called state, which takes two values NJ and PA. How many stores in
NJ and PA does the study sample contain, respectively? Which chain has the largest number of
restaurants in NJ and PA, respectively, in this study sample?
3. Create four histograms in one graph using the starting wage data; starting from the left upper
corner in a clockwise manner, NJ before the increase, NJ after the increase, PA after the increase,
and PA before the increase. Add informative labels to each graph. Are the starting wages similar
between NJ and PA before the increase? What about after the increase? Within each state,
does the histogram look similar before and after the increase?
4. Compute the average number of full-time employees in NJ separately before and after the increase. Do the same for PA. What do these numbers tell you about the impact of the increase
in minimum wage? Are these average differences large compared to the standard deviation of
full-time employees before the change in each state?
5. Calculate the difference in the number of full-time employees between before and after the increase within each state. Summarize the data using two smoothed histograms in one plot (red
solid line for NJ, and blue solid line for PA), with dashed lines for representing the mean difference of each state. Finally, calculate the difference in differences between the two states. (If
you are curious, go ahead and conduct the same calculation for part-time employment and see
if similar results are obtained.)