
Getting Started with R and RStudio
1 What is R?
R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics
241, Engineering Statistics, for the following reasons:
1. R is free and open-source.
2. R is user-extensible and user extensions can easily be made available to all users.
3. R is commercial quality. It is the package of choice for many engineers who use statistics frequently.
4. R is easy to use.
No doubt you will hear some disagreement about point 4 above. Other data analysis tools (such as Excel, for
example) appear easier to use at first. But many things that an engineer might do are easier to do in R than
Excel and some are impossible to do in Excel (correctly). In Mathematics 241 we will focus on core statistical
tools and so learn to use only a small fraction of the capabilities of R. But since R is free, you will be able to
keep R and add to your knowledge of it throughout your career.
2 Using R on the “Cloud”
R can easily be downloaded and installed on your personal computer. However, we will use R over the internet
by using a system called RStudio. There are advantages and disadvantages to using R over the internet, but
the principal advantages for this course are that the installation and setup of R are
taken care of and that data can easily be shared with the instructor and other students. (Instructions on how
to download and set up R and RStudio on your own computer are available at the course webpage.)
To use RStudio on the web, go to http://dahl.calvin.edu:8787. Initially, your Calvin ID works as both your
ID and password. (You can change your password by going to http://dahl.calvin.edu:4200 and entering
the yppasswd command when you finally get a command prompt.) The webpage that comes up after logging
in should look like this:
Notice that there are four panes. R commands are entered one line at a time into the Console pane (which is
the lower left pane by default but that can be changed). In the standalone version of R that you can install,
there is only a console window at the start. The other three panes are used by RStudio to interact with the file
system, to show graphics plots, and to provide an editor that can compose input for the console. (These notes
are being produced in RStudio using the pane at the top left as an editor.)
In the remainder of these notes we will work entirely within the console. Thus these notes can be used to get
started in any version of R.
The symbol > is the prompt symbol that signifies that R is ready for input. In general, in response to the
prompt we enter a one-line command and get some output (or define some object). We will look at some of the
basic kinds of objects and commands in the rest of these notes.
3 Basic features of R
In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these
commands or variations of them yourself.
3.1 Using R as a Calculator
R can be used as a calculator.
> 5 + 3
[1] 8
> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4
You can save values to named variables for later reuse.
> product = 15.3 * 23.4         # save result
> product                       # show the result
[1] 358.02
> .5 * product                  # half of the result
[1] 179.01
> log(product)                  # log of the result
[1] 5.880589
> product <- 15.3 * 23.4        # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct     # can assign to the right hand side
> newproduct
[1] 358.02
The semi-colon can be used to place multiple commands on one line. One frequent use of this is to save and
print a value all in one go:
> product <- 15.3 * 23.4; product     # save result and show it
[1] 358.02
3.2 Functions and Objects
Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex
objects and has a large library of functions that operate on those objects. So most of the R commands that
we will enter will look like f(x,y,...) where f is the name of an R function (like log above) and x,y,... is
a list of objects. In the next section we illustrate by introducing the vector object and give some examples of
functions that operate on vectors.
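As a small illustration of the f(x,y,...) pattern, the sketch below calls a few built-in functions with one or more arguments. The particular functions and values are chosen here only as examples and are not taken from the course material.

```r
# Each command has the form f(x, y, ...): a function name followed by
# its arguments in parentheses.
sqrt(16)                 # one argument
round(3.14159, 2)        # two arguments: the value and the number of digits
log(100, base = 10)      # an argument can also be given by name
max(3, 8, 5)             # some functions accept any number of arguments

# The result of a function call is itself an object and can be saved.
r <- round(3.14159, 2)
r
```

Naming an argument, as in base = 10, makes the call easier to read and is how optional arguments are usually supplied.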
3.3 Vectors
A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements
of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable.
Usually vectors will be constructed by reading data from an R dataset or a file as we will soon see. But short
vectors can be constructed by entering the elements directly.
> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2
Note that the [1] that precedes the elements of the vectors is not one of the elements but rather an indication
that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are
regular.
> y=1:10
> z=seq(0,5,.05)
> y;z
 [1] 1 2 3 4 5 6 7 8 9 10
 [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
[16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
[31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
[46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
[61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
[76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
[91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00
Many functions operate on vectors component-wise.
> x=1:5
> y=6:10
> x^2
[1] 1 4 9 16 25
> x+y
[1] 7 9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
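The sketch below collects the ideas of this section: it checks the length and mode of a vector and then applies a few functions that work on a whole vector at once. The vector names and values are made up for illustration.

```r
x <- c(1, 3, 5, 7, 9)            # a numeric vector
length(x)                        # number of elements: 5
mode(x)                          # "numeric"

w <- c("red", "green", "blue")   # a character vector
mode(w)                          # "character"

# Component-wise arithmetic returns a vector of the same length.
2 * x + 1                        # 3 7 11 15 19

# Other functions reduce a vector to a single number.
sum(x)                           # 25
mean(x)                          # 5
```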
3.4 Data Frames
Data sets are usually stored in a special structure called a data frame. Data frames have a 2-dimensional
structure. Rows correspond to the individuals (observational units, cases, subjects) of our data set and the
columns correspond to variables (measurements collected on each individual). Data frames in R are named, as
are the individual variables of the data frame. The columns (variables) are either vectors or factors (think of
a factor as a vector that stores a categorical variable). We will usually get our data frames from external files
that we have prepared in some other way. Excel is a good tool for preparing a data frame, since a data frame looks
like a spreadsheet. Some datasets are included with the default R installation.
The iris data frame contains 5 variables measured for each of 150 iris plants. The iris data set is included
with the default R installation.
> str(iris)     # summarizes the structure of the data frame
'data.frame':   150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris)     # gives summary information on each variable
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
> head(iris)     # prints the first several cases of the data frame
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In interactive mode, you can also try
> View(iris)
to see the data or
> ?iris
to get the documentation for the data set.
Access to an individual variable in a data frame uses the $ operator in the following syntax:
> dataframe$variable
For example,
> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
shows the contents of the Sepal.Length variable. But this isn’t very useful for a large data set. We would
prefer to compute a numerical or graphical summary.
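The same ideas apply to a data frame you build yourself. The following is a minimal sketch, assuming a small made-up data set: the name parts and all of its variables and values are hypothetical and do not come from the course files.

```r
# A small data frame built by hand; the names and values are hypothetical.
parts <- data.frame(
  id     = c(1, 2, 3, 4),
  length = c(10.2, 9.8, 10.1, 10.4),
  grade  = factor(c("A", "B", "A", "A"))
)

nrow(parts)          # number of rows (individuals)
names(parts)         # names of the variables
parts$length         # one variable, returned as a vector
mean(parts$length)   # a numerical summary of that variable
```

Here id and length are numeric vectors and grade is a factor, matching the two kinds of columns described above.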
4 Summaries of a single quantitative variable
Almost always, a quantitative variable is stored in a vector and that vector is one column of a data frame. Most
functions that give a numerical or graphical summary require a vector as argument. In this section we illustrate
some of the more important summary functions with the variable Sepal.Length of the data frame iris.
4.1 Numerical summaries
> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
> sd(iris$Sepal.Length)
[1] 0.8280661
> quantile(iris$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
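To see what mean and sd compute, the sketch below reproduces them from the usual formulas for the sample mean and the sample standard deviation (with divisor n - 1). The variable names x, n, xbar and s are chosen here for illustration.

```r
x <- iris$Sepal.Length
n <- length(x)

xbar <- sum(x) / n                          # sample mean
s    <- sqrt(sum((x - xbar)^2) / (n - 1))   # sample standard deviation

# These should agree with the built-in functions.
all.equal(xbar, mean(x))
all.equal(s, sd(x))

# quantile() accepts an optional argument giving the desired probabilities.
quantile(x, probs = c(0.10, 0.90))
```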
4.2 Graphical Summaries
There are several ways to make graphs in R. Many individuals have written R packages that give great control
over the way a graph is drawn. We will use the standard graphics functions that are built into R. Here
we illustrate the two most important graphical representations of a single quantitative variable. In RStudio,
graphics output appears in the plot window (lower right). You must click on the Plots tab to see them.
A histogram is drawn using the function hist.
> hist(iris$Sepal.Length)
[Figure: histogram titled "Histogram of iris$Sepal.Length"; x-axis: iris$Sepal.Length; y-axis: Frequency]
Many functions in R have optional arguments that change the way that the function acts. Often we can omit
these arguments since R chooses reasonable default values. Note that R produces frequency histograms. To
produce a density histogram, we need an optional argument freq. Note that we name the argument. Optional
arguments usually have to be named so that R knows which arguments are being included. Other optional
arguments control the title of the histogram and the axis labels.
> hist(iris$Sepal.Length,freq=F,main="Sepal Length",xlab=" ")     # F is short for false
[Figure: density histogram titled "Sepal Length"; y-axis: Density]
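The difference between a frequency histogram and a density histogram can also be checked numerically. The sketch below is one way to do this: it uses the optional argument plot of hist to get the computed values back instead of a picture. The name h is chosen here for illustration.

```r
# plot = FALSE asks hist() to return its computations instead of drawing.
h <- hist(iris$Sepal.Length, plot = FALSE)

h$breaks    # the class boundaries R chose
h$counts    # frequencies, as plotted in a frequency histogram
h$density   # heights of the bars in a density histogram

# In a density histogram the bar areas (height times width) sum to 1.
sum(h$density * diff(h$breaks))
```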
Another common plot is called a boxplot. A boxplot is a graphical representation of a five number summary of
a quantitative variable. The default boxplot uses a vertical scale. Here we draw a horizontal boxplot.
> boxplot(iris$Sepal.Length, horizontal=T, main="Sepal Length")     # T is short for true
[Figure: horizontal boxplot titled "Sepal Length"]
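The five number summary behind a boxplot can be computed directly. The sketch below uses the functions fivenum and boxplot.stats, which are part of the default R installation; the names x and b are chosen here for illustration.

```r
x <- iris$Sepal.Length

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
fivenum(x)

# boxplot.stats() returns the numbers that boxplot() actually draws;
# $stats holds the whisker ends, hinges and median, $out any outliers.
b <- boxplot.stats(x)
b$stats
b$out
```

The hinges are computed slightly differently from the quartiles given by quantile, so the two can differ a little for some data sets.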
5 Importing data
In this class, we will use data from several different sources. R has many built-in datasets. (The iris dataset
used earlier in these notes is one of those.) There are also many packages available that provide additional
datasets and also extend R by defining useful functions. Packages are installed and loaded via the Packages tab
of the files pane of RStudio.
We will also use datasets developed especially for this class. Each RStudio user has space to save files. You can
see your personal directory using the Files tab of the same window in which you look at plots. Each user has
a Public directory which is visible to other RStudio users. There are two collections of data that are available
through the instructor’s public directory. The directory Navidi contains the datasets from the textbook. Other
datasets used in this course are also included there. To load such datasets, use the Import Dataset tab of the
Workspace pane, select From Text File, and enter /home/stob/Data as the filename. You will then see the contents of that directory.
Class datasets are in this directory and textbook datasets are in the Navidi directory. For example, to import the
dimes dataset, simply select dimes.csv. A window will pop up that lets you tell R which format the data is
in, but in this case RStudio understands the CSV format that the dimes dataset is in. This procedure defines the
data frame dimes. To load a textbook dataset, navigate to the Navidi folder and select the appropriate chapter
and then the file in that chapter: for example, ex3-2-5.txt is the data for exercise 5 in section 2 of chapter 3. Note
that you will have to change the variable name (from ex3-2-5) since dashes are not acceptable characters in
variable names. Choose a short, memorable variable name!
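Data can also be read from the console. The sketch below is only an illustration of the idea: it writes a tiny CSV file to a temporary location and reads it back with read.csv. The file contents and the names path and coins are hypothetical; they are not the course's dimes data.

```r
# Write a tiny CSV file to a temporary location so the example is
# self-contained; the column names and values are hypothetical.
path <- tempfile(fileext = ".csv")
writeLines(c("year,mass", "1990,2.25", "1995,2.27", "2000,2.26"), path)

# read.csv() reads a comma-separated file into a data frame.
coins <- read.csv(path)
str(coins)

# make.names() converts a string into a syntactically valid name;
# here the dashes are replaced by periods.
make.names("ex3-2-5")
```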
6 Useful features of RStudio
One of the most useful features of RStudio is that it will save the state of your session even if you close your
browser. This includes all variables, plots, and other settings. This is very useful for class work since you might
get stuck on homework after attempting a problem and can pick up again after you get help in class or from
the instructor.
Another useful feature is the History tab of the upper right hand window. In that window you can find all the
lines that you have entered into the console. These lines can be copied into the console window, for example.
The Source pane can be used to edit and save your work. You can run command lines entered into this pane
using the appropriate buttons. If you have an error, you can simply edit that line in the Source pane.
Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of
a function (e.g., > hi) and enter the tab character, you get all the possible functions that begin with these
characters along with a short description of what they do. (Try entering the tab key after entering hi.) If you
hit the up-arrow key, the previous line that you entered now becomes the current line and you can edit it and
enter it again. This is very useful if you make a small typo on a very long line.