Introduction to R

Arthur White
24th October, 2016
1 Getting Started
The aim of this lab is to familiarise you with R, a statistical software platform.
While R is used much more in later courses, you may find it useful for your
forthcoming assignments.
To open R, click on the Start menu → All Programs → RStudio. Usually,
you interact with R through a command-line interface: you type in a command
and R responds. Note that R is case sensitive, so that (in general) X ≠ x.
Basic Arithmetic
> 2 + 2
[1] 4
> 3 * 4 - 7 / 8 + sqrt(16) + 2^5
[1] 47.125
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> sum(1:10)
[1] 55
> x = c(1, -2, 1/7, sqrt(3))
> x
[1]  1.0000000 -2.0000000  0.1428571  1.7320508
> sum(x)
[1] 0.874908
R can be used as a calculator to perform simple tasks. Here +, -, *, / and ^
denote the addition, subtraction, multiplication, division, and exponentiation
operations respectively. The colon operator : creates a vector of consecutive
integers, in this case 1, ..., 10.
R has many built-in functions. Here the sum function adds up the elements of
its argument. In particular, the command sum(1:10) calculates
1 + 2 + ... + 10.
To better understand what a function does, type a question mark directly in
front of it, e.g., ?sqrt. This brings up a help file for the function.
We can assign values to a variable using the = operator. (The <- operator can
also be used.) In the above example, we assign a vector of values to the variable
x. This vector can then be stored for later use. The individual elements of the
vector are joined together using the function c, which stands for concatenate,
or, in simpler English, combine.
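The two assignment operators can be compared side by side. The following is a minimal sketch; the variable names y, z and w are illustrative only and do not appear elsewhere in this lab.

```r
# Assign the same vector with each operator (names y and z are illustrative)
y = c(1, -2, 1/7, sqrt(3))
z <- c(1, -2, 1/7, sqrt(3))

# identical() returns TRUE when two objects are exactly equal
identical(y, z)

# c() can also combine an existing vector with further values
w <- c(y, 10, 20)
length(w)
```

Either operator may be used for assignment at the command line; the important point is that the value is stored under the name on the left-hand side.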
Other useful functions in R include factorial, choose and exp. For example:
> factorial(1:5)
[1]   1   2   6  24 120
> choose(7, 3)
[1] 35
> exp(-1.5)
[1] 0.2231302
Note that the choose function takes two arguments.
We can use these functions to calculate probabilities. For example, an urn
contains 5 balls, 3 of which are red and 2 of which are black. If 3 balls are
selected, what is the probability that 2 will be red? This may be calculated as
follows:
> (choose(3, 2) * choose(2, 1)) / choose(5, 3)
[1] 0.6
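The counting argument behind this calculation can also be checked by brute force. The sketch below lists every possible draw of 3 balls with the built-in combn function and counts those containing exactly 2 red balls. The object names (balls, draws, n_red, prob_two_red) are illustrative assumptions, not part of the lab.

```r
# Label the five balls: 3 red ("R") and 2 black ("B")
balls <- c("R", "R", "R", "B", "B")

# combn() lists every way of choosing 3 of the 5 balls, one draw per column
draws <- combn(length(balls), 3)

# Count the red balls in each possible draw
n_red <- apply(draws, 2, function(idx) sum(balls[idx] == "R"))

# Each draw is equally likely, so the probability is the proportion
# of draws with exactly 2 red balls
prob_two_red <- mean(n_red == 2)
prob_two_red
```

Enumeration like this is only practical for small problems, but it is a useful sanity check on a formula.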
Exercise 1
Use the choose function to calculate the probability that 3 heads are obtained
after a fair coin has been tossed 5 times. Then calculate the probability of
each element of the sample space, i.e., the probability that 0, ..., 5 heads
are obtained.
2 Probability Distributions
R has a number of built-in functions to calculate probabilities associated with
many different probability distributions. These include functions for the hypergeometric, binomial and Poisson distributions. In particular, we will be using
the functions dhyper, dbinom and dpois to calculate probabilities associated
with these respective distributions.
Hypergeometric Distribution
Recall the question from class: An athlete conceals 2 performance-enhancing
drugs (PEDs) in a bottle containing 8 vitamin pills that are similar in appearance. If 3 tablets are selected at random, what is the probability that the
cheating will be detected?
The following code gives us the answer:
> dhyper(1, 2, 8, 3) + dhyper(2, 2, 8, 3)
[1] 0.5333333
Note that the dhyper function takes 4 arguments. These relate to 1) the number
of successes in the sample, 2) the number of successes in the population, 3) the
number of failures in the population, and 4) the size of the sample. Another
way to obtain this answer would be to enter
> sum(dhyper(1:2, 2, 8, 3))
[1] 0.5333333
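The four arguments can also be supplied by name, which makes the call easier to read. In the sketch below, x, m, n and k are the argument names used by dhyper, in the order described above; the object names p_detect and p_detect_alt are illustrative. The second calculation uses the complement: detection fails only if no PED is drawn.

```r
# Named arguments, in the order described in the text:
# x = successes in the sample, m = successes in the population,
# n = failures in the population, k = sample size
p_detect <- sum(dhyper(x = 1:2, m = 2, n = 8, k = 3))

# Complement: 1 minus the probability that no PED is drawn
p_detect_alt <- 1 - dhyper(x = 0, m = 2, n = 8, k = 3)

# The two routes should agree (up to floating-point rounding)
all.equal(p_detect, p_detect_alt)
```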
Exercise 2
Use the dhyper function to check the answer you obtained for the question relating to the red and black balls in Section 1.
Binomial Distribution
Now recall the question from class: A random sample of 5 components is selected
from a production line that produces, on average, 5% non-conforming components. What is the probability that the sample will contain 2 non-conforming
components?
> dbinom(x = 2, size = 5, prob = 0.05)
[1] 0.02143438
Can you explain what the three arguments for this function mean? Note that
we can also easily inspect the probabilities associated with each element of the
sample space, and confirm that these values sum to 1:
> dbinom(x = 0:5, size = 5, prob = 0.05)
[1] 0.7737809375 0.2036265625 0.0214343750 0.0011281250 0.0000296875
[6] 0.0000003125
> sum(dbinom(x = 0:5, size = 5, prob = 0.05))
[1] 1
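As a cross-check, the same probability can be computed directly from the binomial formula, choose(n, k) * p^k * (1 - p)^(n - k). The sketch below uses illustrative variable names (n, p, k, by_hand) that are not part of the lab.

```r
# Sample size, probability of a non-conforming component, and the
# number of non-conforming components of interest
n <- 5
p <- 0.05
k <- 2

# Binomial probability written out by hand
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
by_hand

# Compare with the built-in function
all.equal(by_hand, dbinom(x = k, size = n, prob = p))
```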
Exercise 3
Use the dbinom function to check your answer for Exercise 1 in Section 1.
Poisson Distribution
We can also use R to visualise data. For example, we can compare observed data
to the behaviour which would be expected from a suitable distribution. Here we
examine data recording flying bomb hits on wartime London, with mean number
of hits µ = 0.9323. These data are shown in Table 1. The following code compares
the observed data to the expected number of hits if X ∼ Poisson(0.9323).
Table 1: Flying bomb hits on wartime London.
Number of hits        0       1      2      3     4    ≥5
Expected number  226.74  211.39  98.54  30.62  7.14  1.57
Actual number       229     211     93     35     7     1
> actual = c(229, 211, 93, 35, 7, 1)
> # Scale by 576, the total of the actual counts, to match Table 1
> expected = 576 * c(dpois(0:4, 0.9323), 1 - sum(dpois(0:4, 0.9323)))
> plot(0:5, actual, type = "h", main = "Flying Bomb Hits",
+      xlab = "Number of times hit", ylab = "Frequency")
> points(0:5, expected, pch = 2, cex = 2, col = "red")
You should get a plot resembling that shown in Figure 1.
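The same comparison can be inspected numerically. The sketch below places the actual and expected counts side by side in a data frame. It assumes that the expected counts are scaled by the total of the actual counts, sum(actual) = 576, which is the scaling that reproduces the values in Table 1; the names n_total and comparison are illustrative.

```r
actual <- c(229, 211, 93, 35, 7, 1)

# Assumption: scale the Poisson probabilities by the total count, 576
n_total <- sum(actual)

# Probabilities for 0 to 4 hits, plus the remaining probability for 5 or more
expected <- n_total * c(dpois(0:4, 0.9323), 1 - sum(dpois(0:4, 0.9323)))

# Side-by-side comparison, with expected counts rounded to 2 decimal places
comparison <- data.frame(
  hits     = c("0", "1", "2", "3", "4", "5+"),
  actual   = actual,
  expected = round(expected, 2)
)
comparison
```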
3 Monty Hall Problem
We have barely scratched the surface of what R can be used for. As a simple
demonstration, we will create our own function and simulate a simple random
(also called stochastic) process, based on the infamous Monty Hall problem.
Briefly, the problem is named after the host of a game show in which contestants were asked to select one of three doors. The main prize, a sports car, lay
behind one door, whereas goats waited behind the other two. (Contestants usually preferred to win the sports car.) After the contestant had made their choice, the host would
reveal a goat behind one of the doors which had not been chosen. The contestant was then allowed to swap their pick. Is it advantageous for the contestant
to do so?
Figure 1: Comparison of actual vs. expected flying bomb hits in wartime London (plot title "Flying Bomb Hits"; x-axis "Number of times hit", y-axis "Frequency"). The black lines represent the actual data, the red triangles the expected
number of hits.
To start, we will use the sample function to randomly assign the car and
goats behind doors 1 to 3. For simplicity, we will assume that a contestant
always chooses door number 1, i.e., the first value in the sample.
> samp1 <- sample(c("Car", "Goat", "Goat"))
> samp1[1]
[1] "Goat"
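Because sample shuffles at random, the result changes from run to run. If a repeatable result is wanted, for example when comparing output with a classmate, the random number generator can be fixed with set.seed. This is a minimal sketch; the seed value 1 is arbitrary, and the names first_shuffle and second_shuffle are illustrative.

```r
# Fix the random number generator so the shuffle is repeatable
# (the seed value 1 is arbitrary)
set.seed(1)
first_shuffle <- sample(c("Car", "Goat", "Goat"))

# Resetting the same seed reproduces the same shuffle
set.seed(1)
second_shuffle <- sample(c("Car", "Goat", "Goat"))

# Same seed, same arrangement of prizes behind the doors
identical(first_shuffle, second_shuffle)
```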
Now we create a function called monty.hall that simulates the game, but
takes into account whether we choose to swap our pick or not. Inspect the code
below and try to see if you can follow what is being done.
> monty.hall <- function(swap = TRUE){
+   samp1 <- sample(c("Car", "Goat", "Goat"))
+   if(swap == FALSE){
+     pick1 <- samp1[1]
+   }
+   if(swap == TRUE){
+     if(samp1[2] == "Goat"){
+       pick1 <- samp1[3]
+     }
+     if(samp1[3] == "Goat"){
+       pick1 <- samp1[2]
+     }
+   }
+   return(pick1)
+ }
> monty.hall(swap = TRUE)
[1] "Goat"
> monty.hall(swap = FALSE)
[1] "Goat"
To inspect the long-term behaviour of our strategy, we can use the replicate
function. This simulates the game a chosen number of times, e.g., n = 100, for
a chosen strategy. We
can then see how often this strategy is successful. The table function simply
tabulates the number of times the contestant wins or loses over the course
of the simulation.
> table(replicate(100, monty.hall(swap = TRUE)))

 Car Goat 
  68   32 
Note that your answer may well be different from the one shown here. Does
the answer roughly correspond to what you would expect? What happens if
you increase the number of simulations?
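One way to explore these questions is to compute the proportion of wins under each strategy for a larger number of games. The sketch below repeats the monty.hall definition so that it runs on its own; the seed value, the number of games, and the names n_games, win_swap and win_stay are illustrative assumptions.

```r
# The monty.hall function from above, repeated so this block is self-contained
monty.hall <- function(swap = TRUE){
  samp1 <- sample(c("Car", "Goat", "Goat"))
  if(swap == FALSE){
    pick1 <- samp1[1]
  }
  if(swap == TRUE){
    if(samp1[2] == "Goat"){
      pick1 <- samp1[3]
    }
    if(samp1[3] == "Goat"){
      pick1 <- samp1[2]
    }
  }
  return(pick1)
}

set.seed(2016)    # arbitrary seed, so the run is repeatable
n_games <- 10000  # illustrative choice; try other values

# Proportion of simulated games in which the contestant wins the car
win_swap <- mean(replicate(n_games, monty.hall(swap = TRUE)) == "Car")
win_stay <- mean(replicate(n_games, monty.hall(swap = FALSE)) == "Car")

c(swap = win_swap, stay = win_stay)
```

Working with proportions rather than raw counts makes runs with different numbers of games directly comparable.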