R for 243

R and RStudio for 243
1 What is R?
R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics
243 for the following reasons:
1. R is free and open-source.
2. R is user-extensible and user extensions can easily be made available to all users.
3. R is commercial quality. It is the package of choice for many professionals who use statistics frequently.
4. R is easy to use.
No doubt you will hear some disagreement about point 4 above. Other data analysis tools (such as Excel, for
example) appear easier to use at first. But many things that we will want to do are easier to do in R than in Excel
and some are impossible to do in Excel (correctly). In Mathematics 243 we will focus on core statistical tools
and so learn to use only a small fraction of the capabilities of R. But since R is free, you will be able to keep R
and add to your knowledge of it throughout your career.
2 Using R on the “Cloud”
R can easily be downloaded and installed on your personal computer. However, we will use R over the internet
by using a system called RStudio. There are advantages and disadvantages to using R over the internet, but the
principal advantages for this course are that the installation and setup of R are taken
care of and that data can easily be shared with the instructor and other students. (If you want to install R
or RStudio on your own computer, instructions on how to download and set them up are available at the course
webpage. It is not necessary to do this if you have access to the internet.)
To use RStudio, go to http://dahl.calvin.edu:8787. Initially, your Calvin ID works as both your ID and password. (You can change your password by going to http://dahl.calvin.edu:4200 and entering the yppasswd
command when you finally get a command prompt.) The webpage that comes up after logging in should look
like this:
Notice that there are four windows. R commands are entered one line at a time into the Console window (which
is the lower left window by default). In the standalone version of R that you can install, there is only a console
window at the start. The other three windows are used by RStudio to interact with the file system, to show
graphics plots, and to provide an editor that can compose input for the console. (These notes are being produced
in RStudio using the window at the top left as an editor.)
In the remainder of these notes we will work entirely within the console. Thus these notes can be used to get
started in any version of R.
The symbol > is the prompt symbol that signifies that R is ready for input. In general, in response to the
prompt we enter a one-line command and get some output (or define some object). We will look at some of the
basic kinds of objects and commands in the rest of these notes.
3 Basic features of R
In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these
commands or variations of them yourself.
3.1 Using R as a Calculator
R can be used as a calculator.
> 5 + 3
[1] 8
> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4
You can save values to named variables for later reuse.
> product = 15.3 * 23.4          # save result
> product                        # show the result
[1] 358.02
> .5 * product                   # half of the result
[1] 179.01
> log(product)                   # log of the result
[1] 5.880589
> product <- 15.3 * 23.4         # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct      # can assign to the right hand side
> newproduct
[1] 358.02
The semi-colon can be used to place multiple commands on one line. One frequent use of this is to save and
print a value all in one go:
> product <- 15.3 * 23.4; product    # save result and show it
[1] 358.02
3.2 Functions and Objects
Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex
objects and has a large library of functions that operate on those objects. So most of the R commands that
we will enter will look like f(x,y,...) where f is the name of an R function (like log above) and x,y,... is
a list of objects. In the next section we illustrate by introducing the vector object and give some examples of
functions that operate on vectors.
3.3 Vectors
A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements
of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable.
Usually vectors will be constructed by reading data from an R dataset or a file as we will soon see. But short
vectors can be constructed by entering the elements directly.
> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2
Note that the [1] that precedes the elements of the vector is not one of the elements but rather an indication
that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are
regular.
> y=1:10
> z=seq(0,5,.05)
> y;z
 [1]  1  2  3  4  5  6  7  8  9 10
  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
 [16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
 [31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
 [46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
 [61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
 [76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
 [91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00
Many functions operate on vectors component-wise.
> x=1:5
> y=6:10
> x^2
[1] 1 4 9 16 25
> x+y
[1] 7 9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
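The length and mode of a vector, described at the start of this section, can be inspected directly. The following is a small sketch using the standard functions length and mode; the variable names v and w are made up for illustration and do not appear elsewhere in these notes.

```r
v <- c(1, 3, 5, 7, 9)      # a numeric vector (illustrative name)
length(v)                  # number of elements: 5
mode(v)                    # "numeric"

# All elements must share one mode, so mixing kinds of values
# makes R convert everything to a common mode (here, character).
w <- c(1, "a", TRUE)
mode(w)                    # "character"
w                          # "1" "a" "TRUE"
```

This is why a column of numbers that contains even one stray text entry will usually end up stored as character data.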
3.4 Functions
From calculus, we know that a function f is a rule that takes inputs and produces for each input a unique
output. So in the expression y = f(x), x is a real-number input, f is the name of the function and f(x) is the
output of the function (another real number). Most of the work that we do in R will be by applying functions
to data. It is important to understand a few key features of R functions.
1. Functions might take a variable number of arguments.
For example, the function c that we used to define a vector takes any number of arguments.
2. The arguments to a function can be more complicated objects than just numbers.
We saw above that the log function can take a vector as an argument rather than just a number as we
are used to from calculus.
3. Functions often have several optional arguments that determine the exact way in which they behave.
Optional arguments usually have default values that are used when they are not included. The sort
function sorts a vector into increasing or decreasing order. The default is increasing.
> z=c(1,3,5,6,4,2)
> sort(z)
[1] 1 2 3 4 5 6
> sort(z,decreasing=TRUE)
[1] 6 5 4 3 2 1
> sort(z,decreasing=FALSE)
[1] 1 2 3 4 5 6
4. Arguments to R functions have names and an order.
In the above example, the second argument is named decreasing. Notice that we explicitly identified this
argument by using its name but did not identify the first argument by name (the first argument
to sort is named x). If we do use names, we can change the order of the arguments:
> z=c(1,3,5,6,4,2)
> sort(decreasing=T,x=z)
[1] 6 5 4 3 2 1
R matches up the arguments by first matching the named arguments and then by matching the unnamed
arguments in the order that they appear in the function call. It is good practice to name optional
arguments.
5. Functions return objects and sometimes have other “side effects.”
A function returns a value. This value can be assigned to a variable or immediately printed as in the
following where m=mean(x) causes m to receive the value of the function but nothing to be printed whereas
mean(x) causes the immediate printing of the result. Some functions may cause other things to happen
beyond returning a value or instead of returning a value; for example, write.csv will save a copy of an
R object to a file (in CSV format).
> x=1:100
> m=mean(x)
> m
[1] 50.5
> mean(x)
[1] 50.5
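The argument-matching rules in points 3 and 4 above apply to other functions as well. As a hedged illustration, the sketch below uses seq, whose first three arguments are named from, to and by; the variable names a, b and d are invented for this example.

```r
args(sort)                               # displays the argument names and defaults of sort

a <- seq(0, 1, 0.25)                     # all three arguments matched by position
b <- seq(by = 0.25, to = 1, from = 0)    # all matched by name, so the order is free
d <- seq(0, 1, by = 0.25)                # positional first, optional argument named

identical(a, b)                          # TRUE
identical(a, d)                          # TRUE
a                                        # 0.00 0.25 0.50 0.75 1.00
```

The third form follows the practice recommended above: required arguments by position, optional arguments by name.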
4 Coin Tossing
Investigation 1.1 of ISCAM2 relies on a coin-tossing applet. We will instead use R to do the simulations. In
so doing, we will meet some powerful R functions. In the following session, we define the variable coin as a
vector of two character strings. We then change coin to a “factor”. A factor is the usual way to represent a
categorical variable in R. Notice that the vector representation of coin and the factor representation of coin
print differently. The possible values of a factor are called “levels” and there are only a specific finite number
of levels for any particular factor variable. Note that when a factor is printed, so are all its possible levels.
It’s preferable to store categorical variables as factors rather than as character vectors. Character vectors are
reserved for things that really are character data (e.g., names of people).
> coin=c('Heads','Tails')
> coin
[1] "Heads" "Tails"
> coin=factor(coin)
> coin
[1] Heads Tails
Levels: Heads Tails
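To see what a factor stores, it can help to look at its levels directly. The sketch below uses made-up data (the names coin_chr and coin_fac are hypothetical) and the standard functions levels, nlevels and as.integer.

```r
coin_chr <- c("Heads", "Tails", "Tails")   # character vector (made-up data)
coin_fac <- factor(coin_chr)               # the same data as a factor

levels(coin_fac)       # "Heads" "Tails": the possible values
nlevels(coin_fac)      # 2
as.integer(coin_fac)   # 1 2 2: each element is stored as the index of its level
```

By default the levels are the distinct values in sorted order, which is why Heads comes before Tails.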
Now we flip a coin.
> sample(coin,1)
[1] Heads
Levels: Heads Tails
The sample function chooses elements from a vector or a factor “at random”, in this case such that each of the
two elements of the factor is equally likely to occur. To flip 16 coins, we do the following:
> sample(coin,16,replace=T)
[1] Heads Tails Heads Tails Tails Tails Heads Tails Heads Heads Heads Tails
[13] Tails Heads Tails Tails
Levels: Heads Tails
The optional argument replace says that for each of the 16 trials, the object selected on that trial is “replaced”
in coin before selecting another. Note also that T abbreviates TRUE and that R understands this abbreviation.
(If we were dealing 5 cards from a deck of cards, we would want replace=F.)
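The card-dealing remark can be sketched as follows. This is only an illustration: the deck is represented by the numbers 1 to 52 rather than by card names, and the call to set.seed (not used elsewhere in these notes) is an added assumption so that the same hand is dealt each time the code is run.

```r
set.seed(243)                              # assumption: fix the random numbers for repeatability
deck <- 1:52                               # cards labelled 1 to 52 for simplicity
hand <- sample(deck, 5, replace = FALSE)   # FALSE is also the default for replace
hand
any(duplicated(hand))                      # FALSE: no card can be dealt twice
```

Without replacement, the sample cannot be larger than the vector being sampled, so flipping 16 coins from a two-element factor requires replace=TRUE.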
We are dealing with fair coins here, but if we wanted to load the coin, we could. In the following example, heads
has probability 3/4 of occurring on any particular toss.
> sample(coin,16,replace=T,prob=c(3/4,1/4))
[1] Tails Heads Heads Heads Heads Heads Tails Heads Tails Heads Heads Heads
[13] Heads Heads Heads Heads
Levels: Heads Tails
A table of results from our 16 coin tosses can be produced using the table function.
> s=sample(coin,16,replace=T)
> table(s)
s
Heads Tails 
    7     9 
Instead of producing a table, we might want to produce a count of just the number of heads.
> s=sample(coin,16,T)
> s=='Heads'
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[13] FALSE  TRUE FALSE FALSE
> sum(s=='Heads')
[1] 11
Notice that == is the equality comparison operator, not the assignment operator. The result of s=='Heads' is a
logical vector of the same length as s whose elements are the results of that comparison. Thus TRUE corresponds
to 'Heads'. The sum of such a vector is simply the number of TRUEs, since TRUE is coded as 1 and FALSE
as 0.
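Because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is the proportion of TRUEs. A minimal sketch with made-up results (the names flips and is_head are hypothetical):

```r
flips <- c("Heads", "Tails", "Heads", "Heads")   # made-up outcomes, not simulated
is_head <- flips == "Heads"                      # TRUE FALSE TRUE TRUE
sum(is_head)                                     # 3: the number of heads
mean(is_head)                                    # 0.75: the proportion of heads
```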
We would like to repeat this experiment of tossing 16 coins many times. To do this, we will use the replicate
function.
> replicate(20, sum(sample(coin,16,replace=T)=='Heads'))
 [1]  4 12  7  9  9  7  9  6  5  8 10  9  9  8  7  4  8  5  8 11
The second argument to replicate is an R expression that evaluates to a number. In this case, the
second argument is an expression that simulates tossing 16 coins and counts the number of heads. The first
argument, in this case 20, instructs R to evaluate the second argument 20 times and put all 20 of the numbers
returned in a vector. So the resulting vector simulates tossing 16 coins 20 different times and counting the
number of heads each time.
> r=replicate(1000,sum(sample(coin,16,replace=T)=='Heads'))
> table(r)
r
  2   3   4   5   6   7   8   9  10  11  12  13  14 
  2   8  30  78 127 179 193 166 129  51  21  12   4 
In 1,000 trials, 8 heads occurred most often and we always had at least some heads and some tails. Interesting.
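Once the simulated counts are stored in a vector, they can be summarized with the same tools as before. The sketch below repeats the simulation and then computes a few summaries; the call to set.seed and the cutoff of 12 heads are assumptions chosen only for illustration, and the counts will differ from the table above because the tosses are random.

```r
set.seed(1)                                  # assumption: fix the random numbers for repeatability
coin <- factor(c("Heads", "Tails"))
r <- replicate(1000, sum(sample(coin, 16, replace = TRUE) == "Heads"))

length(r)              # 1000 simulated experiments
mean(r)                # average number of heads, expected to be close to 8
mean(r >= 12)          # proportion of experiments with 12 or more heads
table(r) / length(r)   # the table of counts converted to proportions
```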
5 Useful features of RStudio
One of the most useful features of RStudio is that it will save the state of your session even if you close your
browser. This includes all variables, plots, and other settings. This is very useful for class work since you might
get stuck on homework after attempting a problem and can pick up again after you get help in class or from
the instructor.
Another useful feature is the History tab of the upper right hand window. In that window you can find all the
lines that you have entered into the console. These lines can be copied into the console window, for example.
The Source pane can be used to edit and save your work. You can run command lines entered into this pane
using the appropriate buttons. If you have an error, you can simply edit that line in the Source pane.
Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of
a function (e.g., > hi) and enter the tab character, you get all the possible functions that begin with these
characters along with a short description of what they do. (Try entering the tab key after entering hi.) If you
hit the up-arrow key, the previous line that you entered now becomes the current line and you can edit it and
enter it again. This is very useful if you make a small typo on a very long line.
6 Binomial Random Variables
A binomial process is a random process characterized by the following conditions:
1. The process consists of a sequence of finitely many (n) trials of some simpler random process.
2. Each trial results in one of two possible outcomes, usually called success (S) and failure (F).
3. The probability of success on each trial is a constant denoted by π.
4. The trials are independent one from another.
Thus a binomial process is characterized by two parameters, n and π.
Given a binomial process, we are usually interested in counting the number of successes. If X is the number of
successes, we call X a binomial random variable. If X is a binomial random variable with parameters n and π,
we write X ∼ Bin(n, π). The symbol ∼ can be read as “has the distribution” or something to that effect.
R can compute various probabilities associated with a binomial process. R can also be used to simulate a
binomial random process. Note that R uses the variable name size to name what we have called n above.
function (& parameters)    explanation
dbinom(x,size,prob)        returns P(X = x) (the pmf).
pbinom(q,size,prob)        returns P(X ≤ q) (the cdf).
rbinom(n,size,prob)        makes n random draws of the random variable X and returns them in a vector.
The following R output computes some probabilities related to tossing 8 fair coins.
> dbinom(3,8,.5)
[1] 0.21875
> pbinom(3,8,.5)
[1] 0.3632813
> 1-pbinom(3,8,.5)
[1] 0.6367187
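These values can be checked against the binomial formula and against a simulation. The following is a sketch rather than part of the original session; the names p3, cdf3 and draws are hypothetical, and set.seed is an added assumption for repeatability.

```r
# P(X = 3) from the binomial formula: choose(8,3) * 0.5^3 * 0.5^5
p3 <- choose(8, 3) * 0.5^3 * 0.5^5
p3                                   # 0.21875, the same value as dbinom(3,8,.5)

# P(X <= 3) is the sum of the pmf over 0, 1, 2, 3
cdf3 <- sum(dbinom(0:3, 8, 0.5))
all.equal(cdf3, pbinom(3, 8, 0.5))   # TRUE

# rbinom simulates the process: 1000 experiments of 8 fair coin tosses each
set.seed(243)                        # assumption: fix the random numbers
draws <- rbinom(1000, size = 8, prob = 0.5)
mean(draws == 3)                     # simulated proportion, expected to be near p3
```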
7 Tests of Hypotheses for a Proportion
The function binom.test() performs hypothesis tests for a single proportion π. The syntax is
binom.test(x, n, p = 0.5,
           alternative = c("two.sided", "less", "greater"),
           conf.level = 0.95)
where x is the observed number of successes, n is the number of trials, and p is the hypothesized value of π.
The optional argument alternative is used to specify the alternative hypothesis. Note that the first letter of its value is enough
(for example, 'g' for "greater"). Also note that the names of optional arguments can be abbreviated by their first few letters
(enough to make them unambiguous), as in alt for alternative.
While a confidence interval is produced, this is not the preferred confidence interval for a proportion.
> binom.test(16,18,.5,alt='g')
Exact binomial test
data: 16 and 18
number of successes = 16, number of trials = 18, p-value = 0.0006561
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.6897373 1.0000000
sample estimates:
probability of success
0.8888889
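The p-value reported above is the binomial probability of 16 or more successes in 18 trials when π = 0.5, so it can be reproduced with the functions from the previous section. The sketch below also stores the test result in a variable (the name result is hypothetical) and extracts the p-value from it.

```r
result <- binom.test(16, 18, p = 0.5, alternative = "greater")
result$p.value                 # the p-value stored in the returned object

# The same probability computed directly: P(X >= 16) for X ~ Bin(18, 0.5)
sum(dbinom(16:18, 18, 0.5))
1 - pbinom(15, 18, 0.5)
```

Saving the result in a variable is an instance of point 5 in the section on functions: binom.test returns an object, and printing it is only one of the things that can be done with it.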
8 Simple x-y plots
If x and y are vectors of the same length, the plot function plots the points (x_i, y_i) where x_i and y_i are the ith
elements of the two vectors. This gives us an easy way to plot a function. The function plot has lots of optional
arguments to control the look of the plot. Note that the par function sets graphical options – in this case the
graphs are printed in 1 row and 2 columns (mfcol=c(1,2)) and with the axis labels horizontal (las=1).
> x=0:16
> y=dbinom(x,16,.5)
> par(mfcol=c(1,2),las=1)
> plot(x,y)
> plot(x,y,type='h')
[Figure: two side-by-side plots of y against x for x from 0 to 16, with the vertical axis running from 0.00 to 0.20. The left panel, from plot(x,y), shows points; the right panel, from plot(x,y,type='h'), shows vertical lines.]
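As an example of the optional arguments mentioned above, the following sketch adds axis labels and a title using the standard xlab, ylab and main arguments of plot. The loaded coin with probability 3/4 of heads is borrowed from the coin-tossing section; the label text is made up for illustration.

```r
x <- 0:16
y <- dbinom(x, 16, 0.75)        # pmf of the number of heads for a loaded coin, P(heads) = 3/4
plot(x, y, type = "h",          # vertical lines, as in the right-hand panel above
     xlab = "number of heads",
     ylab = "probability",
     main = "Bin(16, 0.75)")
x[which.max(y)]                 # 12: the most likely number of heads
```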
