Getting Started with R and RStudio
1 What is R?
R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241,
Engineering Statistics, for the following reasons:
1. R is free and open-source.
2. R is user-extensible and user extensions can easily be made available to all users.
3. R is commercial quality. It is the package of choice for many engineers who use statistics frequently.
4. R is easy to use.
At first, you might wonder about point 4 above. Other data analysis tools (such as Excel, for example) appear easier
to use. But many things that an engineer might do are easier to do in R than Excel and some are impossible to do in
Excel (correctly). In Mathematics 241 we will focus on core statistical tools and so learn to use only a small fraction
of the capabilities of R. But since R is free, you will be able to keep R and add to your knowledge of it throughout
your career.
2 Using R on the “Cloud”
R can easily be downloaded and installed on your personal computer. You can download it from
http://cran.r-project.org/. However, we will use R over the internet by using a system called RStudio. There are
advantages and disadvantages to using R over the internet, but the principal advantages for this course are that
RStudio takes care of the installation and setup of R and that data can easily be shared with the
instructor and other students. Additionally, you can use R on different computers (as long as they have access to the
internet) and the status of your R session will always be just as you left it.
To use RStudio, go to http://beta.rstudio.org/ and use your Calvin email ID and password. The webpage that
comes up after logging in should look like this:
Notice that there are four windows. R commands are entered one line at a time into the Console window (which is
the lower left window by default). In the standalone version of R that you can install, there is only a console window
at the start. The other three windows are used by RStudio to interact with the file system, to show graphics plots,
and to provide an editor that can compose input for the console. (These notes are being produced in RStudio using
the window at the top left as an editor.)
In the remainder of these notes we will work entirely within the console. Thus these notes can be used to get started
in any version of R. The symbol > is the prompt symbol that signifies that R is ready for input. In general, in
response to the prompt we enter a one-line command and get some output (or define some object). We will look at
some of the basic kinds of objects and commands in the rest of these notes.
3 Basic features of R
In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these commands
or variations of them yourself.
3.1 Using R as a Calculator
R can be used as a calculator.
> 5 + 3
[1] 8
> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4
You can save values to named variables for later reuse:
> product = 15.3 * 23.4        # save result
> product                      # show the result
[1] 358.02
> .5 * product                 # half of the result
[1] 179.01
> log(product)                 # log of the result
[1] 5.880589
> product <- 15.3 * 23.4       # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct    # can assign to the right hand side
> newproduct
[1] 358.02
The semicolon can be used to place multiple commands on one line. One frequent use of this is to save and print a
value all in one go:
> product <- 15.3 * 23.4; product    # save result and show it
[1] 358.02
3.2 Functions and Objects
Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex objects
and has a large library of functions that operate on those objects. So most of the R commands that we will enter
will look like f(x,y,...) where f is the name of an R function (like log above) and x,y,... is a list of objects and
options. In the next section we illustrate by introducing the vector object and give some examples of functions that
operate on vectors.
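As a small illustration of the f(x,y,...) pattern, the sketch below calls a few standard R functions with a mix of objects and named options. The numbers and the variable names a, b and d are chosen only for illustration.

```r
# log() takes an object (the number) and an option (the base)
a <- log(100, base = 10)
a

# round() takes an object and the number of decimal digits to keep
b <- round(15.3 * 23.4, digits = 1)
b

# calls can be nested: the result of sqrt() is the input to round()
d <- round(sqrt(10), digits = 2)
d
```

Naming an option (base =, digits =) tells R which argument is meant and makes the command easier to read later.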
3.3 Vectors
A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements
of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable.
Usually vectors will be constructed by reading data from an R dataset or a file as we will soon see. But short vectors
can be constructed by entering the elements directly using the c() function.
> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2
Note that the [1] that precedes the elements of the vectors is not one of the elements but rather an indication that
the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are regular.
> y=1:10
> z=seq(0,5,.05)
> y;z
 [1]  1  2  3  4  5  6  7  8  9 10
  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
 [16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
 [31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
 [46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
 [61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
 [76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
 [91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00
Many functions operate on vectors component-wise.
> x=1:5
> y=6:10
> x^2
[1] 1 4 9 16 25
> x+y
[1] 7 9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
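The length and mode mentioned at the start of this section can be inspected directly. The following is a minimal sketch; the vectors v, species and mixed are made up for illustration.

```r
v <- c(1, 3, 5, 7, 9)
length(v)        # number of elements in the vector
mode(v)          # a vector of numbers has mode "numeric"

species <- c("setosa", "versicolor", "virginica")
mode(species)    # a vector of strings has mode "character"

# All elements must share one mode, so mixing kinds of values
# makes R convert every element to a common mode (here, character)
mixed <- c(1, "a", TRUE)
mode(mixed)
```

The last command is a reminder that a vector cannot hold a number and a string side by side.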
3.4 Data Frames
Data sets are usually stored in a special structure called a data frame. Data frames have a 2-dimensional structure.
Rows correspond to the individuals (units, cases, subjects) of our data set and the columns correspond to variables
(measurements collected on each individual). Data frames in R are named as are the individual variables of the data
frame. The columns (variables) are either vectors or factors (think of a factor as a vector that stores a categorical
variable). We will usually get our data frames from external files that we have prepared in some other way – Excel
is a good way to prepare a data frame as a data frame looks like a spreadsheet. Some datasets are included with the
default R installation.
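Although we will usually import data frames from files, a small one can be built by hand, which may help make the row and column structure concrete. This is only a sketch: the data frame parts and its values are hypothetical.

```r
# Made-up measurements on four parts (values are for illustration only)
parts <- data.frame(
  diameter = c(10.1, 9.8, 10.3, 10.0),        # quantitative variable (a vector)
  supplier = factor(c("A", "B", "A", "B"))    # categorical variable (a factor)
)

nrow(parts)     # number of individuals (rows)
ncol(parts)     # number of variables (columns)
names(parts)    # names of the variables
```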
The iris data frame contains 5 variables measured for each of 150 iris plants. The iris data set is included with
the default R installation.
> str(iris)    # summarizes the structure of the data frame
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris)    # gives summary information on each variable
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
> head(iris)    # prints the first several cases of the data frame
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In interactive mode, you can also try
> View(iris)
to see the data or
> ?iris
to get the documentation for the data set.
Access to an individual variable in a data frame uses the $ operator in the following syntax:
> dataframe$variable
For example,
> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
shows the contents of the Sepal.Length variable (a vector). But this isn’t very useful for a large data set. We would
prefer to compute a numerical or graphical summary.
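When a variable is long, it is often enough to check its size and look at a few values rather than print all of it. A minimal sketch using the same iris variable (the names n and first10 are chosen here for illustration):

```r
n <- length(iris$Sepal.Length)              # how many values the variable holds
n

first10 <- head(iris$Sepal.Length, n = 10)  # only the first ten values
first10

is.numeric(iris$Sepal.Length)               # TRUE for a quantitative variable
```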
4 Summaries of a single quantitative variable
Almost always, a quantitative variable is stored in a vector and that vector is one column of a data frame. Most
functions that give a numerical or graphical summary require a vector as argument. In this section we illustrate some
of the more important summary functions with the variable Sepal.Length of the data frame iris.
4.1 Numerical summaries
> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
> sd(iris$Sepal.Length)
[1] 0.8280661
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
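A few related summaries follow the same pattern of passing the vector to a function. The sketch below is an illustration only; sl is a shorthand name introduced here, and the choice of the 10th and 90th percentiles is arbitrary.

```r
sl <- iris$Sepal.Length               # shorter name for the variable

range(sl)                             # smallest and largest values
IQR(sl)                               # 75% quantile minus 25% quantile
quantile(sl, probs = c(0.1, 0.9))     # quantiles other than the default ones
```

The probs argument of quantile is optional; when it is omitted, R reports the five quantiles shown above.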
4.2 Graphical Summaries
There are several ways to make graphs in R. Many individuals have written R packages that give great control over
the way a graph is drawn. We will use the standard graphics functions that are built into R. Here we illustrate the
two most important graphical representations of a single quantitative variable. In RStudio, graphics output appears
in the plot window (lower right). You must click on the Plots tab to see it.
A histogram is drawn using the function hist.
> hist(iris$Sepal.Length)
[Figure: histogram titled "Histogram of iris$Sepal.Length"; horizontal axis iris$Sepal.Length, vertical axis Frequency]
Many functions in R have optional arguments that change the way that the function acts. Often we can omit these
arguments since R chooses reasonable default values. By default, hist produces a frequency histogram. To produce a
density histogram, we set the optional argument freq to false. Note that we name the argument. Optional arguments
usually have to be named so that R knows which arguments are being included. Other optional arguments control
the title of the histogram and the axis labels.
> hist(iris$Sepal.Length,freq=F,main="Sepal Length",xlab=" ")    # F is short for FALSE
[Figure: density histogram titled "Sepal Length"; vertical axis Density]
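The difference between the two kinds of histogram can also be checked numerically. The sketch below uses the plot = FALSE option of hist, which computes the bins without drawing anything; h is a name introduced here for illustration.

```r
# Compute the histogram bins without drawing the plot
h <- hist(iris$Sepal.Length, plot = FALSE)

h$breaks                           # the bin boundaries R chose
sum(h$counts)                      # frequencies add up to the number of cases
sum(h$density * diff(h$breaks))    # bar areas of a density histogram add up to 1
```

In a frequency histogram the bar heights are the counts, while in a density histogram the heights are scaled so that the total area of the bars is 1.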
Another common plot is called a boxplot. A boxplot is a graphical representation of a five number summary of a
quantitative variable. The default boxplot uses a vertical scale. Here we draw a horizontal boxplot.
> boxplot(iris$Sepal.Length, horizontal=T, main="Sepal Length")    # T is short for TRUE
[Figure: horizontal boxplot titled "Sepal Length"]
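The five number summary that a boxplot represents can be printed with the function fivenum. A minimal sketch (f is a name introduced here for illustration):

```r
# Minimum, lower hinge, median, upper hinge, maximum
f <- fivenum(iris$Sepal.Length)
f
```

The lower and upper hinges are close to the 25% and 75% quantiles, so these numbers can be compared with the output of quantile.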
5 Importing data
Each RStudio user has space to save files. You can see your personal directory using the Files tab of the same window
in which you look at plots. Each user has a Public directory which is visible to other RStudio users. All data files
used in Mathematics 241 and in the 241 textbook are in the instructor’s public directory. To load a file used in class,
use the Import Dataset tab under the Workspace tab of the top right window. Select From Text File and enter
as filename /shared/[email protected]/. You will see the following:
Class datasets are in the M241 directory and textbook datasets are in the DF directory. For example, to import
the dimes dataset, navigate to M241 and select dimes.csv. A window will pop up that enables you to tell R which
format the data is in, but in this case RStudio understands the CSV format of the dimes dataset. This procedure
defines the data frame dimes. To load a textbook dataset, navigate to the DF folder and select the appropriate file:
x10.4.23.csv is the data for exercise 23 in section 4 of chapter 10 and e3.7.csv is example 7 of chapter 3.
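A CSV file can also be read from the console with the standard function read.csv. The sketch below is only an illustration and is not the class procedure: it writes a tiny made-up file to a temporary location so that it runs anywhere, and the names csv_file and coins and the columns mass and year are hypothetical.

```r
# Write a tiny made-up CSV file so the example is self-contained
csv_file <- tempfile(fileext = ".csv")
writeLines(c("mass,year",
             "2.25,1999",
             "2.27,2003",
             "2.26,2010"), csv_file)

# read.csv() reads the file and returns a data frame;
# the first line of the file supplies the variable names
coins <- read.csv(csv_file)
str(coins)
```

With a real data file, the temporary file name would be replaced by the path to that file.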
6 Useful features of RStudio
One of the most useful features of RStudio is that it will save the state of your session even if you close your browser.
This includes all variables, plots, and other settings. This is very useful for class work since you might get stuck on
homework after attempting a problem and can pick up again after you get help in class or from the instructor.
Another useful feature is the History tab of the upper right hand window. In that window you can find all the lines
that you have entered into the console. These lines can be copied into the console window, for example.
Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of a
function (e.g., > hi) and enter the tab character, you get all the possible functions that begin with these characters
along with a short description of what they do. (Try entering the tab key after entering hi.) If you hit the up-arrow
key, the previous line that you entered now becomes the current line and you can edit it and enter it again. This is
very useful if you make a small typo on a very long line.
These notes were produced on January 31, 2011 using R version 2.12.1 (2010-12-16).