Rice University Summer Institute of Statistics RUSIS Lab 11

Statistical Computing Sessions
June 9, 2011
FOR THE FOLLOWING EXERCISES, DO NOT USE FOR LOOPS
1 .Rdata
We have already learned how to load external data into R when it is stored in “plain” text (.csv, .txt) files. R
also has its own file format, .Rdata, which can save any R object and preserves the structure of the object.
However, .Rdata files can only be opened by R, and for data the file can be “heavier” than a plain-text save
(recall write.csv), so choose wisely when using it.
Download the file data available online (Lab 10 > data). To load .Rdata files, use the function
load('name of your file.Rdata') if you saved the file in your working directory. Otherwise, you can use
load(file.choose()).
Loading the data file into R:
• If you load it without storing the result in a variable, nothing appears to happen after loading. This is
because load silently restores each object under the name it was saved with (here, data). Try head(data) now.
• If you store the result of load in a variable, calling that variable gives you back "data", meaning that
the name of the object stored in the .Rdata file is data. Try head(data) now.
i. What is the structure of the data set?
ii. How many variables are present?
iii. Who are the players in the data set? (Recall: unique)
To save R objects as .Rdata files, use the command save.
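The round trip between save and load can be sketched as follows. This is only an illustration: the object name scores and the temporary file are assumptions made for the example, not part of the lab data.

```r
# Hypothetical toy object; the name `scores` is not part of the lab data.
scores <- data.frame(player = c("a", "b"), hits = c(10, 12))

# Save the object to a temporary .Rdata file.
path <- tempfile(fileext = ".Rdata")
save(scores, file = path)

# Remove the object from the workspace, then restore it from disk.
rm(scores)
loaded_names <- load(path)

loaded_names   # names of the objects that load() restored
head(scores)   # the object is back under the name it was saved with
```

Note that load returns the names of the restored objects, not the objects themselves, which is why storing its result in a variable gives back a character string.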
2 subset and within
2.1 subset
We have already learned how to subset vectors, matrices, and data frames using square brackets and the $ sign.
As mentioned, data frames allow us to do much more, due to their structure. We can use the command
subset to simplify subsetting data frames.
Recall our example using the diamonds data set:
zero_dim <- diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0
With the command subset, we can select the matching rows directly:
subset(diamonds, x == 0 | y == 0 | z == 0)
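To see that the two approaches select the same rows, here is a small sketch on a made-up data frame (gems is a hypothetical stand-in for diamonds, so that the example runs without loading ggplot2).

```r
# Hypothetical toy data frame standing in for diamonds.
gems <- data.frame(
  x = c(4.1, 0.0, 5.2, 6.0),
  y = c(4.0, 3.9, 0.0, 6.1),
  z = c(2.5, 2.4, 3.1, 3.7)
)

# Bracket version: build a logical vector, then index the rows with it.
zero_dim <- gems$x == 0 | gems$y == 0 | gems$z == 0
by_brackets <- gems[zero_dim, ]

# subset version: column names are looked up inside the data frame.
by_subset <- subset(gems, x == 0 | y == 0 | z == 0)

# Compare the two results.
all.equal(by_brackets, by_subset)
```

One difference to keep in mind: subset drops rows where the condition evaluates to NA, whereas indexing with a logical vector that contains NA returns rows filled with NA.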
2.2 within
You may have noticed that we can create an additional variable in a data frame using the $ sign. What are the
dimensions of the mtcars data frame? What are the variables? Now try the following:
mtcars$origin <- NA
What are the dimensions now? What are the variables?
Once the column is there, we can edit it systematically using the within function:
mtcars <- within(mtcars, {
  origin[25 < mpg] <- 'Japanese'
  origin[18 <= mpg & mpg <= 25] <- 'German'
  origin[mpg < 18] <- 'American'
})
Now use table to find out how many cars are American.
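The same pattern, followed by a call to table, can be sketched on a made-up data frame. The names cars_toy and size, and all of the values, are assumptions for illustration only.

```r
# Hypothetical toy data frame; the values are made up for illustration.
cars_toy <- data.frame(mpg = c(30, 21, 15, 17, 26, 19))
cars_toy$size <- NA

# Inside within(), columns are referred to by name, without cars_toy$.
cars_toy <- within(cars_toy, {
  size[mpg > 25] <- "small"
  size[mpg >= 18 & mpg <= 25] <- "medium"
  size[mpg < 18] <- "large"
})

# table() counts how many rows fall in each category.
counts <- table(cars_toy$size)
counts
```

Because the three conditions cover every possible value of mpg, no NA remains in the new column after the call to within.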
FOR THE FOLLOWING EXERCISES, DO NOT USE FOR LOOPS
2.3 Practice
i. Historically, one measure of batting performance has been the batting average (AVG), defined as the
ratio of hits (H) to at bats (AB). Calculate the batting average for each player-year
(that is, each row), and add that average to the data frame as a new variable.
ii. One thing that the batting average doesn’t take into account is how good the hits are. For example, the
batting average treats single base hits and home runs the same. To get around this, in addition to the
batting average many people like to consider a weighted batting average called the slugging average.
In R it can be calculated as
TB <- (H - X2B - X3B - HR) + 2 * X2B + 3 * X3B + 4 * HR
and
SLG <- TB / AB
inside the within statement, where TB is the total number of bases. Create these variables as well.
iii. Using the subset function, form a new data frame containing just the data concerning Barry Bonds.
Use the qplot function to plot his batting average by year. Does anything seem to be strange about the
plot? Why or why not? Consider the data in light of the following Wikipedia quote (“batting average”):
“In modern times, a season batting average higher than .300 is considered to be excellent,
and an average higher than .400 a nearly unachievable goal. The last player to do so, with
enough plate appearances to qualify for the batting championship, was Ted Williams of the
Boston Red Sox, who hit .406 in 1941, though the best modern players either threaten to or
actually do achieve it occasionally, if only for brief periods of time.”
Add a red horizontal line to the batting average plot you just created at .300. This provides a nice
frame of reference for the reader and the analyst alike.
iv. Make the same plot using careerYear and Age as the independent variables. Use the which.max function
to determine how old Bonds was when he achieved his highest seasonal batting average. Notice that
the three plots with x-axes yearID, careerYear, and Age all convey very similar information. Do any of
them make the argument for his doping seem stronger or weaker?
v. Make plots containing all of the players’ batting averages by their age in two ways:
first by faceting (by rows, see ?facet_grid. Also look at http://had.co.nz/ggplot2) and then by
grouping (add group = nameFull in the qplot function). Color the latter plot by player. Which plot
is more useful and why?
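As a loop-free illustration of the techniques used in these exercises, the sketch below works on a made-up data frame for one hypothetical player. The name seasons and every number in it are assumptions; only the column names H, X2B, X3B, HR, AB, yearID, and Age follow the lab's data.

```r
# Made-up seasons for one hypothetical player (not the lab's data).
seasons <- data.frame(
  yearID = 2001:2004,
  Age    = 24:27,
  H      = c(140, 165, 150, 172),
  X2B    = c(25, 30, 28, 35),
  X3B    = c(3, 4, 2, 5),
  HR     = c(18, 27, 22, 30),
  AB     = c(520, 540, 510, 530)
)

# Column arithmetic is vectorized, so every row is handled at once.
seasons <- within(seasons, {
  AVG <- H / AB
  TB  <- (H - X2B - X3B - HR) + 2 * X2B + 3 * X3B + 4 * HR
  SLG <- TB / AB
})

# which.max returns the row index of the largest AVG;
# that index can then be used to look up other columns.
best <- which.max(seasons$AVG)
seasons$Age[best]
seasons[best, c("yearID", "Age", "AVG", "SLG")]
```

Note that which.max returns a position, not a value, so it is the indexing step afterwards that answers a question such as how old the player was in his best season.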
3 Confidence intervals
This exercise explores confidence intervals in the context of data analysis.
i. Read the data available online (Lab 11 > plain data). Note the structure of the data. What is separating
the values? See ?read.table to load the data into R.
ii. Each element of data in the file comes from the same distribution. Using exploratory techniques
(qplot, summary, etc), determine a reasonable family of distributions from which the data might have
been derived.
iii. Using the unbiased estimators $\bar{X}$ and $S^2$, estimate the parameters of your assumed distribution using
ALL your data. (Hint: use scan or unlist)
iv. Assuming that your estimate of the variance (from the entire data set) is the true variance $\sigma^2$, consider
each column of the data matrix you read in as a different sample from this distribution. Under
these assumptions, the 95% confidence interval for the mean is given by the two-dimensional vector
(a lower bound and an upper bound)
\[
\left( \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}} \right),
\]
where $\bar{x}$ is the sample mean of the column and $n$ is the number of observations in it.
Find a 95% confidence interval for the mean of each column of data. Make a data frame with
the variables lower, xbar, and upper, containing the sample mean and the lower and upper confidence
limits for the true mean based on each column, assuming $\sigma = \sqrt{s^2}$.
v. The true mean of the distribution is 1. Determine how many of the confidence intervals you generated
contain the true mean.
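The workflow in this section can be sketched on simulated data. Everything about the stand-in file is an assumption made for the example (normal draws, 20 rows by 50 columns, a comma separator, a temporary file); the lab's actual file may differ on every one of these points.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Simulated stand-in for the lab file: 20 rows, 50 columns, true mean 1.
# Normality, the dimensions, and the comma separator are assumptions.
sim <- matrix(rnorm(20 * 50, mean = 1, sd = 2), nrow = 20)
path <- tempfile(fileext = ".txt")
write.table(sim, path, sep = ",", row.names = FALSE, col.names = FALSE)

# sep must match the character that separates values in the file.
dat <- read.table(path, sep = ",")

# Pool every value to estimate the mean and the variance.
all_values <- unlist(dat)
mu_hat <- mean(all_values)
sigma  <- sqrt(var(all_values))  # treated as the true sigma below

# One interval per column, computed without a for loop.
n    <- nrow(dat)
xbar <- colMeans(dat)
half <- 1.96 * sigma / sqrt(n)
ci   <- data.frame(lower = xbar - half, xbar = xbar, upper = xbar + half)

# Count the intervals that contain the true mean of 1.
covered <- ci$lower <= 1 & 1 <= ci$upper
sum(covered)
```

If the assumptions hold, roughly 95% of such intervals should contain the true mean, so the count is expected to be close to, but not necessarily equal to, 95% of the number of columns.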