University of Wollongong
School of Mathematics and Applied Statistics
STAT231 Probability and Random Variables 2014
Second Lab - Week 4
If you can’t finish the logbook questions in the lab, continue at home. If you have questions, ask
your tutor or come to the consultation hours. Remember that logbook questions are an
assessment task and that future assignments will rely on lab knowledge.
1 Matrices and Vectors and Lists
The basic data object in R is a vector. Even scalars are vectors of length 1. There are several
ways to create vectors. To create a vector from given values, use the c() function. The : operator creates
sequences incrementing/decrementing by 1. If the step size should be different, the seq()
function can be used. To repeat numbers in a certain pattern, use rep() and specify the
arguments times and each.
> b1<-c(0,0,0,1)
> #1:n stands for seq(from=1,to=n,by=1)
> b2<-1:10
> b3<-10:1
> b4<-seq(from=0,to=10,by=2)
> b5<-rep(1:4,times=5)
> b6<-rep(1:4,each=2)
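To see the difference between times and each at a glance, you might try a smaller example like the sketch below. The object names v.times, v.each and v.len are chosen here for illustration only and are not used elsewhere in the lab; the length.out argument of seq() is shown as an alternative to by.

```r
# times repeats the whole vector, each repeats every element in turn
v.times <- rep(1:3, times = 2)   # 1 2 3 1 2 3
v.each  <- rep(1:3, each = 2)    # 1 1 2 2 3 3

# length.out fixes the number of points instead of the step size
v.len <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

print(v.times)
print(v.each)
print(v.len)

# even a single number is a vector of length 1
length(5)
```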
Matrices are created with the command
> matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
where the first optional argument is a vector containing the data that is used to fill the
matrix, nrow specifies the number of rows and ncol the number of columns. The default
option byrow = FALSE is to fill column by column. The matrix multiplication operator in R
is %*%.
> #matrices
> A1<-matrix(c(1,2,3,4,5,6,7,8),nrow=2,ncol=4,byrow=TRUE)
> A2<-matrix(c(1,2,3,4,5,6,7,8),nrow=2,ncol=4) #byrow=FALSE
> A1>2
> #dimensions of A1 and A2
> dim(A1)
> dim(A2)
> #accessing the first row of matrix A1
> A1[1,]
> #accessing the second column of matrix A1
> A1[,2]
> #multiply matrices
> A1%*%A2 #doesn't work because dimensions don't fit
> A1%*%t(A2) #so only A1 times transpose of A2 works
> #matrix times vector
> A1%*%b1
> #or alternatively
> A2%*%as.matrix(b1)
> #element-wise multiplication
> A1*A2
> #row-sum and column-sum
> apply(A1,1,sum)
> apply(A1,2,sum)
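The effect of byrow is easiest to see on a small matrix. The following is a minimal sketch with made-up matrices A.row and A.col (names chosen here, not part of the lab). It also uses rowSums() and colSums(), which should give the same numbers as the apply() calls above.

```r
# same data, filled row by row versus column by column
A.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
A.col <- matrix(1:6, nrow = 2, ncol = 3)   # default byrow = FALSE
print(A.row)   # rows are 1 2 3 and 4 5 6
print(A.col)   # rows are 1 3 5 and 2 4 6

# shortcuts for apply(A.row, 1, sum) and apply(A.row, 2, sum)
rowSums(A.row)   # 6 15
colSums(A.row)   # 5 7 9

# a (2 x 3) matrix times a (3 x 2) matrix gives a (2 x 2) matrix
P <- A.row %*% t(A.col)
dim(P)
```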
A list is a convenient structure to store multiple objects simultaneously. These objects
can be of different types, lengths and structures:
> n <- c(2, 3, 5) # object 1
> s <- c("aa", "bb", "cc", "dd", "ee") # object 2
> b <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # object 3
> x <- list(n, s, b, 3) # list x contains n, s, b
To access, say, the first and second elements of a list, use double brackets
“[[]]”:
> x[[1]]
[1] 2 3 5
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
You can also name the elements and then refer to them by name:
> # list x contains n, s, b, now named "n", "s" and "b.vector"
> x <- list(n=n,s=s,b.vector=b,number=3)
> # option 1: using the entry number
> x[[1]]
[1] 2 3 5
> x[[3]]
[1]  TRUE FALSE  TRUE FALSE FALSE
> #option 2: using element's name
> x$n
[1] 2 3 5
> x$b.vector
[1]  TRUE FALSE  TRUE FALSE FALSE
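A common stumbling block is the difference between single and double brackets. The sketch below uses a small list of its own (the element names n, s and extra are illustrative) to show that single brackets return a shorter list, while double brackets return the stored object itself.

```r
x <- list(n = c(2, 3, 5), s = c("aa", "bb"))

class(x[1])     # "list": a list of length 1 that still wraps the vector
class(x[[1]])   # "numeric": the vector itself

# a name inside double brackets works like the $ notation
x[["n"]]
names(x)

# assigning to a new name appends an element to the list
x$extra <- TRUE
length(x)       # now 3
```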
Do you remember the data-frame object from the lab of week 1? A dataframe has
properties of both a matrix and a list. You can access its columns in the
same way as the elements of a list, by calling the column name, but also with the matrix notation
that is used to access, say, the columns of a matrix. There is one fundamental difference
between lists and dataframes: each column of a dataframe must have the same
length, whereas the elements of a list need not. A dataframe can also easily be converted into
a matrix (for example, if it consists only of numbers) using the command as.matrix().
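As a small illustration of both access styles, the sketch below builds a dataframe from invented values. The name df, its columns id and height, and the numbers are assumptions made for this example only.

```r
# hypothetical data: three observations, two numeric columns
df <- data.frame(id = 1:3, height = c(1.62, 1.75, 1.80))

# list-style access by column name
df$height
df[["height"]]

# matrix-style access by position
df[, 2]   # second column
df[2, ]   # second row (still a dataframe)

# all columns are numeric, so the conversion gives a numeric matrix
M <- as.matrix(df)
dim(M)
```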
2 Some Control Structures in R
2.1 Loops
R has several types of loops, for example for-loops and while-loops. Both are used
below to compute the product of the numbers 1 to 10.
> #loops
> #for loop
> f<-1
> index<-1:10
> #"for" loop executed as many times as length of index and
> # i takes values of the index-set "index"
> for(i in index){
+ f<-f*i
+ #cat is here to produce output of current values of f
+ cat("current value of f at ",i,"th iteration: ",f,"\n")
+ }
current value of f at  1 th iteration:  1
current value of f at  2 th iteration:  2
current value of f at  3 th iteration:  6
current value of f at  4 th iteration:  24
current value of f at  5 th iteration:  120
current value of f at  6 th iteration:  720
current value of f at  7 th iteration:  5040
current value of f at  8 th iteration:  40320
current value of f at  9 th iteration:  362880
current value of f at  10 th iteration:  3628800
> f
[1] 3628800
> #while loop
> f<-1
> i<-1
> while(i<=10){
+ #executes the statements that follow until
+ # condition not met, here as long as i<=10
+ f<-f*i
+ cat("current value of f at ",i,"th iteration: ",f,"\n")
+ i<-i+1
+ }
current value of f at  1 th iteration:  1
current value of f at  2 th iteration:  2
current value of f at  3 th iteration:  6
current value of f at  4 th iteration:  24
current value of f at  5 th iteration:  120
current value of f at  6 th iteration:  720
current value of f at  7 th iteration:  5040
current value of f at  8 th iteration:  40320
current value of f at  9 th iteration:  362880
current value of f at  10 th iteration:  3628800
> f
[1] 3628800
> #easier
> prod(1:10)
[1] 3628800
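If you also want the intermediate values that the loops print, cumprod() returns all running products in one call. This is offered as an optional shortcut, not as a replacement for understanding the loops; the name running is chosen here for illustration.

```r
# running products 1, 1*2, 1*2*3, ..., 1*2*...*10
running <- cumprod(1:10)
print(running)

# the last running product is the full product
running[10] == prod(1:10)
```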
2.2 If/Then Statement
The if/then statement in R has the following form.
> if(cond){
+ #evaluate expression if cond is true
+ }else{
+ #evaluate expression if cond is false
+ }
Alternatively you can use ifelse(cond,expr.true,expr.false).
> x<-5
> if(x<0){cat("Number <0\n")}else{cat("Number >=0\n")}
Number >=0
> x<--3
> if(x<0){cat("Number <0\n")}else{cat("Number >=0\n")}
Number <0
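Note that if() expects a single TRUE or FALSE, whereas ifelse() works element by element on a whole vector. The sketch below illustrates this with an invented vector; the names x and res are used only in this example. sqrt(abs(x)) is used because ifelse() computes the "true" branch for the full vector before selecting from it, so sqrt(x) alone would warn about the negative entries.

```r
x <- c(4, -1, 9, -16)

# one result per element: the square root where x >= 0, otherwise NA
res <- ifelse(x >= 0, sqrt(abs(x)), NA)
print(res)   # 2 NA 3 NA
```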
> x <- c(6:-4)
> for(i in x){
+ ifelse(i >= 0, print(sqrt(i)), print(NA))
+ }
[1] 2.44949
[1] 2.236068
[1] 2
[1] 1.732051
[1] 1.414214
[1] 1
[1] 0
[1] NA
[1] NA
[1] NA
[1] NA
3 Functions
Sometimes we want to apply an operation to many numbers or repeat an analysis
which consists of many lines of code. In this case it is often easier to define a function that
contains this code. When a function is defined we need to give it a name. Once
it is defined, it can be applied to several data sets simply by calling the name of the
function with its arguments, without repeating the possibly lengthy
code each time.
> myfunction <- function(arg1, arg2, ... ){
+ statements
+ return(object)
+ }
Here myfunction is an arbitrarily chosen name, and arg1, arg2, ... are the arguments of
the function. The definition usually finishes by returning an object, here achieved by
return(object).
In Section 2.1 we calculated the product of the first 10 numbers. This is in fact the factorial
function, implemented in R by factorial().
To illustrate functions in R we define the factorial function in several ways.
> factorial1<-function(n){return(prod(1:n))}
> factorial2<-function(n){prod(1:n)} #without using return
> # since prod(1:n) shows an output when entered in R console, it
> # returns this output automatically without using return()
> # to be "cleaner", using "return" is recommended
> #factorial3 calls itself, called recursion
> factorial3<-function(n){
+ cat("n=",n,"\n");
+ if(n==1){
+ return(n)
+ }else{
+ return(n*factorial3(n-1))
+ }#end if
+ }#end function factorial 3
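One detail worth checking when you test these functions is the case n = 0. The sequence 1:0 counts down and equals c(1, 0), so prod(1:0) gives 0 although 0! = 1. A possible variant, called factorial4 here purely for illustration, uses seq_len() instead. Treat it as a sketch of one way to handle the edge case, not as the intended lab solution.

```r
factorial4 <- function(n) {
  # seq_len(0) is an empty vector, and prod() of an empty vector is 1
  return(prod(seq_len(n)))
}

prod(1:0)        # 0, because 1:0 is the vector c(1, 0)
factorial4(0)    # 1
factorial4(10)   # 3628800
```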
Now test by yourself and try to understand why and how these functions work!
> #tests
> factorial1(10)
> factorial2(10)
> factorial3(10)
> factorial1(15)
> factorial2(15)
> factorial3(15)
> #compare with default R-function
> factorial(10)
> factorial(15)
In R a function can return only one object, not several. For example, if you have a data set and
you want to return ȳ and s², you need to set up a list which contains these two values
as elements.
> y<-seq(1,100,by=2)
> bary.s2<-function(data){
+ meany<-mean(data);
+ vary<-var(data)
+ list1<-list(bary=meany,s2=vary) #first element is called bary, 2nd s2
+ return(list1)
+ }
> #now apply function to data "y"
> bary.s2(data=y)
$bary
[1] 50

$s2
[1] 850

> #or storing
> test<-bary.s2(y)
> #show bary=sample mean
> test$bary
[1] 50
> #show s2=sample var
> test$s2
[1] 850
> #directly
> mean(y)
[1] 50
> var(y)
[1] 850
4 Hypergeometric Distribution
Create the probability function (p.f.) f (y) and the cumulative distribution function (c.d.f.)
F (y) for the hypergeometric distribution with parameters N = 42, K = 7, n = 21
(notation from the lecture). Recall (in case this has already been covered) that N is the
population size, n is the size of the sample taken (sample size), and K is the number of objects
with a certain characteristic (e.g. white balls).
First create a sequence of numbers from 0 to 7, the domain of Y .
The R-function dhyper (p.f. of the hypergeometric distribution) uses a different notation for the parameters:
> y<-0:7 #y<-seq(0,7,by=1)
> # m is the number of white balls (what we call K)
> # n is the number of black balls, therefore N=m+n and n=N-m
> # k=21 (what we usually call n)
> # The probability function in R gives
> # the probability of having x white balls!
> fY<- dhyper(x=y, m=7, n=42-7, k=21)
The vector fY contains the 8 probabilities of the events that the sample
contains 0, 1, . . . , 7 white balls.
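To connect R's parameters with the lecture notation, you can compute the probabilities by hand and compare. The sketch below assumes the usual hypergeometric formula f(y) = choose(K, y) * choose(N - K, n - y) / choose(N, n); check it against your lecture notes. The names f.manual and f.dhyper are chosen here for illustration.

```r
N <- 42; K <- 7; n <- 21   # lecture notation
y <- 0:7

# p.f. written out with binomial coefficients
f.manual <- choose(K, y) * choose(N - K, n - y) / choose(N, n)

# the same probabilities in R's notation: m = K, n = N - K, k = n
f.dhyper <- dhyper(x = y, m = K, n = N - K, k = n)

all.equal(f.manual, f.dhyper)   # agreement up to rounding error
sum(f.manual)                   # probabilities add up to 1
```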
Calculate the c.d.f. using cumsum(), a function for the cumulative sum, see help(cumsum):
> FY<-cumsum(fY)
> #alternatively use
> FY1<-phyper(q=y, m=7, n=42-7, k=21)
> #compare FY with FY1
> FY-FY1
[1] 0.000000e+00 0.000000e+00 2.775558e-17 -5.551115e-17 -2.220446e-16
[6] -2.220446e-16 -2.220446e-16 -2.220446e-16
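The tiny non-zero differences are rounding errors from floating-point arithmetic, not real disagreements between the two methods. For this reason an exact comparison with == is not a reliable check; all.equal() compares up to a small tolerance instead. A minimal, self-contained sketch:

```r
y   <- 0:7
FY  <- cumsum(dhyper(x = y, m = 7, n = 42 - 7, k = 21))
FY1 <- phyper(q = y, m = 7, n = 42 - 7, k = 21)

# element-wise exact comparison: some entries may be FALSE due to rounding
FY == FY1

# comparison up to a numerical tolerance
all.equal(FY, FY1)

# size of the largest discrepancy
max(abs(FY - FY1))
```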
Some plotting:
> #first plain plots
> plot(0:7,fY)
> points(0:7,FY)
> #doesn't look good
[Figure: plain scatter plot of fY against 0:7]
This looks rather uninformative. Instead we draw lines using a histogram style and a step-function style, with some coloring.
> plot(0:7,fY,type="h",col="red",ylim=c(0,1)) #histogram type
> points(0:7,FY,type="s",lty=1,col="blue") #step type
> # needs legend, where you need to define also colors and line-type (lty)
> legend("topleft",c("f(y)","F(y)"),col=c("red","blue"),lty=c(1,1))
[Figure: fY against 0:7 drawn as red vertical lines and FY drawn as a blue step function, y-axis from 0 to 1, with legend f(y), F(y)]
There are many graphical parameters for functions like plot(), points(), lines(),
etc. that allow you to change almost every aspect of a plot. This makes R very powerful, but
finding the correct parameters for the desired outcome can be complicated.
5 Simulating Random Numbers
Before we continue enter
> set.seed(studentID)
where studentID is your actual UOW student ID. This command will ensure that every
student creates unique random numbers, which guarantees that all of you have different
results for your logbook questions.
We first generate 150 random numbers from the uniform distribution on [0, 1]. Then
we transform them into a vector Y1 of random numbers in 0, 1, . . . , 7 that
follow the hypergeometric distribution described above. The transformation inside the
for-loop picks, for each uniform number, the smallest y whose c.d.f. value F (y) is at least that number.
> size<-150
> U<-runif(size)
> Y1<-rep(0,size)
> for(i in 1:size){
+ Y1[i]<- min((0:7)[U[i]<=FY])
+ }#end for loop
> #to check what happens fix i, e.g. i<-10
> # enter 1) U[i], 2) FY, 3) U[i]<=FY
> # then 4) (0:7)[U[i]<=FY]
> #simpler is to use R command
> Y2<-rhyper(nn=size, m=7, n=42-7, k=21)
> # Plot the histograms
> hist(Y1,breaks=seq(-0.5,7.5,by=1))
> hist(Y2,breaks=seq(-0.5,7.5,by=1))
[Figure: "Histogram of Y1" and "Histogram of Y2", each showing Frequency against the simulated values]
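The transformation used for Y1 can also be written without a loop. The sketch below is self-contained and makes a few assumptions of its own: the seed 1 is arbitrary and only there so the example is reproducible (it is not your student ID), FY is computed with phyper(), and the names Y.loop and Y.vec are illustrative. Because FY is increasing, the smallest y with U <= F(y) equals the number of c.d.f. values that lie strictly below U.

```r
set.seed(1)   # arbitrary seed for this illustration only
FY <- phyper(q = 0:7, m = 7, n = 42 - 7, k = 21)
U  <- runif(150)

# loop version, as in the lab
Y.loop <- rep(0, length(U))
for (i in seq_along(U)) {
  Y.loop[i] <- min((0:7)[U[i] <= FY])
}

# vectorised version: count the c.d.f. values strictly below each u
Y.vec <- sapply(U, function(u) sum(FY < u))

all(Y.loop == Y.vec)   # both versions give the same numbers
table(Y.vec)           # frequencies of the simulated values
```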
Next we use the simulated data to estimate the mean, the variance, P (Y = 3), P (2 ≤
Y ≤ 4) and P (1 ≤ Y ≤ 6) with the R-functions mean() and var().
> mean(Y1)
> mean(Y2)
> var(Y1)
> var(Y2)
> # Find
> #P(Y=3)
> Y1==3 #look at output and compare with Y1
> mean(Y1==3) # sum(Y1==3)/150
> mean(Y2==3) # sum(Y2==3)/150
> #P(2<=Y<=4)
> Y1>=2 & Y1<=4 #look at output and compare with Y1
> mean(Y1>=2 & Y1<=4)
> mean(Y2>=2 & Y2<=4)
> #P(1<=Y<=6)
> Y1>=1 & Y1<=6 #look at output and compare with Y1
> mean(Y1>=1 & Y1<=6)
> mean(Y2>=1 & Y2<=6)
To calculate an estimate for P (Y = 3), we first simulate the random numbers stored in
Y1, then we check whether each number equals 3 by Y1==3, which results in a logical
(TRUE/FALSE) vector of length 150. The mean()-function is then applied to this vector;
TRUE is converted to 1 and FALSE to 0, so mean() is applied to a
vector of 1’s and 0’s. The result is the proportion of simulated values equal to 3, which is an estimate
for the probability of observing a 3.
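It can be instructive to put such estimates next to the exact values from dhyper() and phyper(). The sketch below is one way to do this. The seed 1 is arbitrary (use your own seed for the logbook), and the object names are chosen for illustration. The exact mean uses the standard hypergeometric result n*K/N in lecture notation.

```r
set.seed(1)   # arbitrary seed for this illustration only
Y <- rhyper(nn = 150, m = 7, n = 42 - 7, k = 21)

# P(Y = 3): simulated proportion versus exact probability
est.p3   <- mean(Y == 3)
exact.p3 <- dhyper(x = 3, m = 7, n = 42 - 7, k = 21)

# P(2 <= Y <= 4) = F(4) - F(1)
est.p24   <- mean(Y >= 2 & Y <= 4)
exact.p24 <- phyper(q = 4, m = 7, n = 42 - 7, k = 21) -
             phyper(q = 1, m = 7, n = 42 - 7, k = 21)

# mean: sample mean versus n*K/N = 21*7/42
est.mean   <- mean(Y)
exact.mean <- 21 * 7 / 42

print(c(estimate = est.p3,   exact = exact.p3))
print(c(estimate = est.p24,  exact = exact.p24))
print(c(estimate = est.mean, exact = exact.mean))
```

With only 150 simulated values the estimates will differ somewhat from the exact values; increasing the sample size should bring them closer.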