Practical 1 Introductory R

BIO8052 Quantitative Methods: Practical 1 Introduction to R and RStudio
Aims of practical
• Introduce you to the R statistics package and RStudio environment
• Learn how to move data between Excel and R (import and export)
• Summarise and reshape data
• Produce simple plots of continuous and categorical data
What is R?
R is a free (open source) statistical package, originally developed in the 1990s by Ross Ihaka and
Robert Gentleman at the University of Auckland, New Zealand, with its first stable release (version
1.0.0) in 2000. It is named ‘R’ partly because it is very similar to the S language (sold commercially
as ‘S-plus’), and partly after the initial letters of the first names of its authors. It has rapidly
become one of the most widely used statistics packages in academia, including amongst environmental
scientists and ecologists. One reason for its popularity is that it is both a statistics package and a
computer programming environment. This allows anyone to write an add-on ‘package’ to provide new
functions for R; indeed there are many packages now available that are aimed specifically at
ecologists. One disadvantage of R is that, compared to ‘point-and-click’ menu-driven programs (such as
Minitab, SPSS or SAS), it has a steep initial learning curve. Once you have learnt the basics, however,
the structure of many analyses is relatively simple, and it is also much easier to document analyses
than in menu-driven software.
Getting Started with R
R is freely available to download from the main website if you wish to install it on your own PC (it
runs on Windows, Macs and Linux):
https://www.r-project.org/
Starting RStudio
Log in to the PC as usual, click on the Start button, then go to the option:
Start → All Programs → Statistical Software → R → RStudio
This will start up the RStudio program. If you get a pop-up box asking you whether to start the
64-bit or 32-bit version, use the 32-bit version. You should see the following screen:
[Screenshot: RStudio window showing the Console pane, the Environment \ History tabs, and the
Files \ Plots \ Packages \ Help \ Viewer tabs]
There are 3 parts to the screen:
• Console (left):
  o This is where you type R commands to read in data, and where the results of analyses are
    displayed.
• Environment \ History tabs (top-right):
  o Environment: displays a list of data objects in temporary memory in your RStudio session
  o History: stores a record of all the R commands you have issued in the current session
• Files \ Plots \ Packages \ Help \ Viewer tabs (bottom-right):
  o Files: lists files and folders that RStudio uses in its current ‘Working Directory’
  o Plots: any graphs you produce are stored here and can be exported to Word etc.
  o Packages: list of packages you have installed etc.
  o Help: the help for R commands
  o Viewer: displays web-style content such as HTML presentations (we will not use this feature).
The most important area, and the one you will use primarily in this first practical, is the Console,
but you will use some features from other tabs today also.
Working Directory or Folder
RStudio needs to know where to look to find data, or where to export data. It is best to create a
folder for any particular analysis, and store R scripts, data files, results etc. in that folder. So
before you begin your analyses, create a folder BIO8052 in your Documents Filespace and within that
folder create a subfolder called Practical 1.
You now have to tell RStudio that “BIO8052 – Practical 1” is the folder to use for the current
session. On the RStudio main menu, click on
Session -> Set Working Directory -> Choose Directory…
Navigate to the Practical 1 folder you have just created, and click on the “Select folder” button.
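The same step can also be done by typing commands, which is handy once you start keeping scripts. The lines below are a minimal sketch only: the folder is created under tempdir() purely so that the example runs on any machine, and practical.dir is a made-up name; you would substitute the path to your own Practical 1 folder.

```r
# Stand-in location for the practical folder (an assumption for illustration;
# replace it with the real path to your own "Practical 1" folder)
practical.dir <- file.path(tempdir(), "BIO8052", "Practical 1")

# Create the folder (and its parent folder) if it does not already exist
dir.create(practical.dir, recursive = TRUE, showWarnings = FALSE)

# Make it the working directory, then confirm where R is now looking
setwd(practical.dir)
getwd()

# List the files R can see in the working directory (none yet)
list.files()
```

getwd() is a quick check to run whenever R reports that it cannot find a file.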
Data import from Excel
R does not have a built-in spreadsheet, so you will usually find it easiest to enter your data in Excel,
export it from Excel in CSV or other text format, then import it into R. From the Blackboard
website, right-click on the file orgmatt.xlsx, select “Save link as…” and save it in your Practical 1
folder.
Open the file using Excel. Note the data format: it is from a simple experiment with one response
variable and one explanatory variable, listed in two columns. The data are from a study of soil
organic matter content at two different sites. The explanatory variable ‘site’ contains two levels
representing your two different sites, A and B.


Now use the File-Save As menu in Excel to save the data in your filespace, saving it as
filetype CSV (Comma delimited)(*.csv). This is a plain text format, with the columns separated
by commas. Excel will warn you that it is a non-standard format. Exit Excel; you will
receive another warning asking you to save the file, but it is safe to click on “Don’t Save”.
In Windows Explorer navigate to your Practical 1 folder and look at your two files:
If you have the standard Windows settings, which hide the filename “extension”, both files will
appear to have the same name. If your Windows Explorer is set to show all file details, the clue that
they are different is in the “Type” column, which distinguishes between the original and
comma-separated files. If your Windows Explorer only shows the icons and filenames, the icons beside
the names differ. If you want Windows to display the full filename please ask me or a demonstrator.
Now import the data into RStudio using the read.csv command, typing the command at the >
prompt symbol in the Console window. The command needs the name of your file in quotes; press
the large “Return” key at the end to enter the command:
orgmatt.dat <- read.csv("orgmatt.csv")
The <- sign is an “assignment” sign; here it assigns the data you have read to an R table of data
called orgmatt.dat.
R stores data, results of analyses etc. in a ‘workspace’. You can store your data inside R under any
name, but I like to put a short code at the end of the name to indicate what it contains, in this
case “.dat” to show this is raw data. If you click on the Environment tab at the top-right of your
RStudio screen you will see orgmatt.dat listed in your workspace, with an indication of the number
of rows and columns.
If you double-click on the name you will see the data displayed in a spreadsheet-like viewer, but
you cannot edit the data there (from RStudio version 0.99 onwards the viewer does let you sort and
filter them).
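If you want to check an import without opening the viewer, str and dim report the column types and the size of the dataframe. The sketch below is self-contained, so it writes a tiny CSV file of invented values to a temporary location first; demo.file and demo.dat are made-up names, and in the practical you would read orgmatt.csv instead. The stringsAsFactors = TRUE option is an addition that asks R to store text columns as categorical factors.

```r
# Invented example values written to a temporary CSV file, so that the
# sketch runs on its own (these are not the real orgmatt data)
demo.file <- tempfile(fileext = ".csv")
writeLines(c("orgmatt,site",
             "5.1,A",
             "4.9,A",
             "4.6,B",
             "4.4,B"), demo.file)

# Read the file; stringsAsFactors = TRUE stores text columns as factors
demo.dat <- read.csv(demo.file, stringsAsFactors = TRUE)

# Check what has been read in
str(demo.dat)   # column names and types: one numeric, one factor
dim(demo.dat)   # number of rows, then number of columns
```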
Summary statistics about your data
Use the summary command to find mean (average), median, min, max etc. of your data:
summary(orgmatt.dat)
    orgmatt      site
 Min.   :4.240   A:12
 1st Qu.:4.485   B:12
 Median :4.820
 Mean   :4.863
 3rd Qu.:5.140
 Max.   :5.790
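Notice how your explanatory variable is summarised differently to the response, as it is categorical
in this example. (If site is instead summarised as Length, Class and Mode, your version of R has read
it as plain text: since R 4.0.0, read.csv no longer converts text columns to categorical factors
automatically, and you can request this by adding stringsAsFactors=TRUE inside the brackets of
read.csv.) You have just put your data in a dataframe, which is an R ‘object’ that can store data of
multiple types, as in this first example. Dataframes and CSV input will probably be the main way you
enter data into R initially. To display your whole dataframe in the Console window simply enter its
name or use the print command: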
orgmatt.dat
print(orgmatt.dat)
If you are working interactively in the Console, you only need to type the name of the dataframe,
orgmatt.dat. If you have stored your commands in an R script (see later) then print is also needed.
The organic matter dataset is only 24 rows long, but displaying a whole dataframe becomes unwieldy
with large sets of data. It is often simpler to check the first few rows of a dataframe using the
head command:
head(orgmatt.dat)
  orgmatt site
1    5.79    A
2    5.58    A
3    5.18    A
4    5.62    A
5    4.50    A
6    4.61    A
To display selected rows or columns, add square brackets after the name of the dataframe and enter
the row or column number you want: the number before the comma refers to rows, the number after to
columns. To display only column number 2 of your dataframe, try:
orgmatt.dat[,2]
[1] A A A A A A A A A A A A B B B B B B B B B B B B
Levels: A B
which will display the second column of data (in a row across the screen). To display rows 5 to 10,
enter your numbers before the comma inside the square brackets:
orgmatt.dat[5:10,]
   orgmatt site
5     4.50    A
6     4.61    A
7     5.17    A
8     4.53    A
9     4.75    A
10    4.97    A
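A few more variations on the [row, column] pattern are sketched below. The dataframe soil.demo and its values are invented for illustration (they are not the real orgmatt data), so that the lines can be run on their own.

```r
# Small invented dataframe (not the real orgmatt data)
soil.demo <- data.frame(orgmatt = c(5.1, 4.9, 5.3, 4.6, 4.4, 4.8),
                        site    = factor(c("A", "A", "A", "B", "B", "B")))

soil.demo[2, ]       # row 2, all columns
soil.demo[, 1]       # all rows, column 1 (returned as a plain set of numbers)
soil.demo[4:6, 1]    # rows 4 to 6 of column 1

# A logical test before the comma keeps only the rows where it is TRUE
soil.demo[soil.demo$site == "B", ]
```

Leaving the space before or after the comma empty means "all rows" or "all columns" respectively.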
You may find the tapply command useful to provide summaries within categories. E.g. the mean
organic matter content (column 1), summarised by survey site (column 2) is:
tapply(orgmatt.dat[,1], orgmatt.dat[,2], mean)
       A        B
5.053333 4.673333
You can produce ‘tabular’ summaries of variance, standard deviation by substituting mean with
var or sd.
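As a rough illustration of swapping the summary function, the sketch below collects the mean, variance and standard deviation per site into one small table. The dataframe soil.demo and its values are invented, and the use of rbind to combine the three results is just one convenient option.

```r
# Small invented dataframe (not the real orgmatt data)
soil.demo <- data.frame(orgmatt = c(5.1, 4.9, 5.3, 4.6, 4.4, 4.8),
                        site    = factor(c("A", "A", "A", "B", "B", "B")))

# Same tapply pattern each time; only the last argument changes
site.mean <- tapply(soil.demo$orgmatt, soil.demo$site, mean)
site.var  <- tapply(soil.demo$orgmatt, soil.demo$site, var)
site.sd   <- tapply(soil.demo$orgmatt, soil.demo$site, sd)

# Stack the three results into one table, one row per statistic
rbind(mean = site.mean, variance = site.var, sd = site.sd)
```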
Plotting your data
To do a simple boxplot, type:
boxplot(orgmatt ~ site, data=orgmatt.dat)
[Figure: boxplot of orgmatt (y-axis) for sites A and B (x-axis)]
The ~ symbol inside the brackets separates the response variable on its left (orgmatt) from the
explanatory variable on its right (site). Think of it as analogous to an equals sign. The result is
the standard box and whiskers plot, indicating median, inter-quartile range etc. You can improve
the labelling on your plots with some options; type the following all on one line:
boxplot(orgmatt ~ site, data=orgmatt.dat,
        main="Summary of soil organic matter survey",
        ylab="Organic matter content (g)", xlab="Survey site")
[Figure: boxplot titled "Summary of soil organic matter survey", with "Organic matter content (g)"
on the y-axis and "Survey site" (A, B) on the x-axis]
Useful tip: click on the “Export” button on a graph and you can copy it to your clipboard, save as a
pdf file etc. to insert into Word reports or Powerpoint presentations.
Reshaping data
Sometimes you will be sent data that are not in the best format to analyse using R and you will need
to restructure the data. The following data are from an experiment to determine the impacts of
different cutting regimes (Low, Medium or High) on insect diversity. The data show the diversity of
insects caught in each of the five plots per treatment:
cuttingL  cuttingM  cuttingH
      60        64        87
      57        67        84
      67        74        69
      58        63        85
      61        70        85
Start Excel and enter the data as shown above, then use Save As in Excel to save into a CSV format
file; I’ll assume your file is called grassland.csv but you can give it any name. As before, make
sure you have saved it in the correct Folder so that R knows where to find the file, before using the
read.csv command:
insect.dat <- read.csv("grassland.csv")
print(insect.dat)
  cuttingL cuttingM cuttingH
1       60       64       87
2       57       67       84
3       67       74       69
4       58       63       85
5       61       70       85
Most univariate analyses in R (and many other statistical packages) expect the response and
explanatory variables to be in different columns; here we need two columns, one to contain the
response (insect diversity), and a second to contain the explanatory (cutting regime). We will use
the stack command to do this as it is relatively simple:
insect.stk <- stack(insect.dat)
print(insect.stk)
   values      ind
1      60 cuttingL
2      57 cuttingL
3      67 cuttingL
4      58 cuttingL
5      61 cuttingL
6      64 cuttingM
7      67 cuttingM
8      74 cuttingM
9      63 cuttingM
10     70 cuttingM
11     87 cuttingH
12     84 cuttingH
13     69 cuttingH
14     85 cuttingH
15     85 cuttingH
Your response and explanatory data are now in two columns, but you will want to rename the
columns to something clearer:
colnames(insect.stk) <- c("diversity", "cutting")
head(insect.stk)
summary(insect.stk)
boxplot(diversity ~ cutting, data=insect.stk)
[Figure: boxplot of diversity (y-axis, 60 to 85) against cutting (x-axis), with the groups in the
order cuttingH, cuttingL, cuttingM]
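Notice that the diversity seems to be higher where the cutting is higher, but there are outliers in the
data. By default R orders the levels of a factor alphabetically, so the level whose name is first in
the alphabet, here cuttingH, is plotted first and is used as the reference level. (Recent versions of
R may instead keep the column order when stacking, in which case your plot is already in the order
you want.) If you want to change your plot, so that the boxes go from Low, Medium to High, then: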
insect.stk$cutting <- factor(insect.stk$cutting,
levels=c("cuttingL", "cuttingM", "cuttingH"))
boxplot(diversity ~ cutting, data=insect.stk)
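The whole reshaping sequence can be tried without a CSV file by typing the values from the table above straight into a dataframe. This is only a sketch for experimenting: the names insect.wide and insect.long are made up here to avoid overwriting your own insect.dat and insect.stk.

```r
# The insect diversity values from the table above, entered directly
insect.wide <- data.frame(cuttingL = c(60, 57, 67, 58, 61),
                          cuttingM = c(64, 67, 74, 63, 70),
                          cuttingH = c(87, 84, 69, 85, 85))

# Stack the three columns into one response and one explanatory column
insect.long <- stack(insect.wide)
colnames(insect.long) <- c("diversity", "cutting")

# Set the order of the factor levels explicitly: Low, Medium, High
insect.long$cutting <- factor(insect.long$cutting,
                              levels = c("cuttingL", "cuttingM", "cuttingH"))

levels(insect.long$cutting)   # confirm the new order
table(insect.long$cutting)    # five plots in each cutting regime

boxplot(diversity ~ cutting, data = insect.long)
```

Checking levels() and table() after reshaping is a cheap way to confirm that no rows were lost and that the groups are in the intended order.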
Continuous data
In this section we look at an example where your explanatory data are continuous, and summarise and
plot the data. Download the file bluetit.csv from Blackboard; it shows the breeding condition
of blue tits in early spring relative to the amount of bird food made available in suburban gardens
over winter.
bluetit.dat <- read.csv("bluetit.csv")
summary(bluetit.dat)
   condition       gardenfood
 Min.   :0.000   Min.   : 1.000
 1st Qu.:2.350   1st Qu.: 4.000
 Median :5.200   Median : 8.000
 Mean   :5.136   Mean   : 6.909
 3rd Qu.:8.000   3rd Qu.:10.000
 Max.   :9.600   Max.   :12.000
Notice how R displays min, mean, max values etc. for your explanatory variable (gardenfood)
as this is a continuous explanatory variable, unlike your organic matter and insect diversity studies
which had categorical explanatory variables.
You now want a scatter-plot to display the data:
plot(condition ~ gardenfood, data=bluetit.dat)
[Figure: scatter plot of condition (y-axis) against gardenfood (x-axis)]
You can see that there is a general trend of increased bird condition with the provision of garden
food, although the data are noisier at the upper end of the scale. We will fit a statistical model to
these data in a later practical. To join the points with straight lines, use the type="b" option in
your plot command, but your output may look a bit strange:
plot(condition ~ gardenfood, data=bluetit.dat, type="b")
[Figure: plot of condition against gardenfood, with the points joined by lines that criss-cross the
plot]
This isn’t what you want! The problem has arisen because the original data were not in any
particular order, and R has simply joined the lines based on the order in which they were found in
the dataset. We will re-sort the bluetit data based on the values of the explanatory variable
(gardenfood); in the following command note that:
• the text order(bluetit.dat$gardenfood) is to the left of the comma, indicating that we are
  sorting the data by rows
• the $ symbol allows us to specify which column is used to sort the rows
• there is a comma before the closing square bracket to indicate we want all the columns selected
bluetit.dat2 <- bluetit.dat[order(bluetit.dat$gardenfood),]
print(bluetit.dat)
print(bluetit.dat2)
You can see that the values are unchanged, but in the second dataset the rows are ordered according
to the explanatory variable. Now replot, using bluetit.dat2:
plot(condition ~ gardenfood, data=bluetit.dat2, type="b")
[Figure: plot of condition against gardenfood, with the points joined from left to right]
You can improve your plot using xlab, ylab and main as earlier.
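To see what order is doing, it can help to try it on a handful of values. The sketch below uses an invented dataframe, food.demo (not the real bluetit data), whose rows are deliberately out of order; the axis labels and title are also just examples.

```r
# Invented values, deliberately out of order (not the real bluetit data)
food.demo <- data.frame(gardenfood = c(8, 2, 12, 4, 10, 6),
                        condition  = c(6.1, 1.5, 9.0, 3.2, 7.4, 4.8))

# order() returns the row positions that would put gardenfood in
# increasing order; here that is 2 4 6 1 5 3
order(food.demo$gardenfood)

# Use those positions before the comma to rearrange whole rows
food.sorted <- food.demo[order(food.demo$gardenfood), ]
print(food.sorted)

# Points are now joined from left to right; labels added as earlier
plot(condition ~ gardenfood, data = food.sorted, type = "b",
     xlab = "Garden food provided", ylab = "Breeding condition",
     main = "Sorted example data")
```

Note that order returns positions rather than the sorted values themselves, which is why it goes inside the square brackets.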
Simple summary statistics
It is easy to obtain most of the standard summary statistics that you might need. Returning to your
insect.stk dataset, the overall mean diversity is:
mean(insect.stk$diversity)
[1] 70.06667
or for each of the 3 cutting regimes:
tapply(insect.stk$diversity, insect.stk$cutting, mean)
cuttingL cuttingM cuttingH
    60.6     67.6     82.0
Try similar commands for median, var (variance) or sd (standard deviation). Note that R does
not have a built-in command for se (standard error), so we will create our own. Recall that the
standard error is the square root of the variance divided by the number of values (the division is
done before taking the square root). It gives a measure of the precision of your estimate of the
mean: the smaller the standard error, the more precise the estimate.
\[ \mathrm{s.e.} = \sqrt{\frac{\mathrm{variance}(x)}{\mathrm{length}(x)}} \]
You can write your own command or “function” in R to do this:
se <- function(x) sqrt(var(x) / length(x) )
and put your own function into tapply:
tapply(insect.stk$diversity, insect.stk$cutting, se)
cuttingL cuttingM cuttingH
1.749286 2.014944 3.286335
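To see where these numbers come from, the calculation can be worked through step by step for one group. The sketch below uses the five Low cutting values from the table earlier; low is a made-up name for that set of values.

```r
# The function from above: square root of (variance / number of values)
se <- function(x) sqrt(var(x) / length(x))

# The five diversity values for the Low cutting regime
low <- c(60, 57, 67, 58, 61)

var(low)                       # variance: 15.3
length(low)                    # number of values: 5
sqrt(var(low) / length(low))   # standard error worked out by hand
se(low)                        # the same value from the function, about 1.749

# An equivalent form: standard deviation divided by the square root of n
sd(low) / sqrt(length(low))
```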
95% confidence intervals
These are not automatically calculated by R, but I have added a simple script ci95.R onto the
Blackboard website. Download and save this script in your Practical 1 folder. The script creates a
function, ci95, that you can use to display Confidence Intervals. It is a little more complicated
than the one you have just created for standard errors, so ask if you do not understand any of it.
Open the script in RStudio by clicking on File -> Open File… from its main menu. Next, push the
commands into R by clicking on the Source button.
This then creates a function to calculate 95% CI which you can use in the usual way to find the
relevant values for the three cutting regimes:
tapply(insect.stk$diversity, insect.stk$cutting, ci95)
$cuttingH
[1] 72.87567 91.12433
$cuttingL
[1] 55.7432 65.4568
$cuttingM
[1] 62.00562 73.19438
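We can therefore be 95% confident that the true mean insect diversity under the High cutting regime
lies between 72.88 and 91.12 etc. More precisely, if you were to repeat the experiment many times,
about 95% of the intervals calculated in this way would contain the true mean.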
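The contents of ci95.R are not reproduced in this handout, so the function below is only a guess at one common way to calculate such an interval: the mean plus or minus the standard error multiplied by a critical value from the t distribution. The name ci95.sketch is made up to keep it distinct from the ci95 function on Blackboard, although for the High cutting values it gives the same interval as shown above.

```r
# A sketch of a 95% confidence interval for a mean (an assumption about the
# method, not a copy of the ci95.R script)
ci95.sketch <- function(x) {
  n      <- length(x)                # number of values
  se     <- sqrt(var(x) / n)         # standard error of the mean
  t.crit <- qt(0.975, df = n - 1)    # t value leaving 2.5% in each tail
  mean(x) + c(-1, 1) * t.crit * se   # lower and upper limits
}

# The five diversity values for the High cutting regime
high <- c(87, 84, 69, 85, 85)
ci95.sketch(high)   # about 72.88 and 91.12
```

With only five values per group the t value is about 2.78, noticeably larger than the 1.96 used for large samples, which is why the intervals are fairly wide.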
Workspace, exporting data, getting help etc.
Workspace and history
After working in R on a set of analyses for a while, you will have a set of R ‘objects’; dataframes,
model results etc in your working area. It is easy to lose track of what you have in this ‘workspace’,
so use the ls command to list everything:
ls()
[1] "bluetit.dat"  "bluetit.dat2" "insect.dat"   "insect.stk"   "orgmatt.dat"
This command has round brackets after it, and whilst it is rare that you will use any options in the
command, you always need the brackets. The listing will include results of analyses etc. The same
information is displayed in a more “user-friendly” form under the “Environment” tab at the
top-right of the RStudio screen.
R keeps track of what you have done, which can be useful in complicated analyses; to display a
record of your earlier commands:
history()
Again, as with the listing command, the round brackets are always needed, even when you give no
options. The same information is available on the “History” tab at the top-right of the RStudio screen.
Data export
It is best to export from R in CSV file format, to use in Excel, via the write.csv command:
write.csv(insect.stk, "insectstacked.csv")
If, in Windows Explorer (not R), you double-click on the resulting file insectstacked.csv it
will automatically open in Excel.
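By default write.csv also writes R's row numbers as an extra first column, which can look odd in Excel. The sketch below shows the optional row.names = FALSE setting and reads the file back in as a check. The names insect.out, out.file and check.dat are made up, the three rows are simply the first value from each cutting regime, and a temporary file is used so that nothing is written to your Practical 1 folder.

```r
# A small dataframe to export (first value from each cutting regime)
insect.out <- data.frame(diversity = c(60, 64, 87),
                         cutting   = c("cuttingL", "cuttingM", "cuttingH"))

# Temporary file for the sketch; in practice give a name such as
# "insectstacked.csv" to save into your working directory
out.file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the column of row numbers
write.csv(insect.out, out.file, row.names = FALSE)

# Look at the raw text that was written, then read it back in as a check
readLines(out.file)
check.dat <- read.csv(out.file)
print(check.dat)
```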
Help
To obtain help for a command, preface the command with a ? e.g.
?stack
The help will open in a window at the bottom-right of RStudio:
The help can be difficult to understand initially, but the help format is similar for all commands:
• Description – brief description of the function
• Usage – what arguments are expected in the brackets after the function. For stack there is at
  least one argument, x, typically the name of a data frame. You used the data frame insect.dat
  when you used stack. Other arguments (such as select and form) are optional, indicated
  by … Notice also that there is a related command, unstack, listed too.
• Arguments – a more detailed description of the arguments that can be used.
• Details – a more detailed description of the function
• Value – describes the output returned by the function. For stack, it is another dataframe, with
  two columns. You may recall that you re-named your columns.
• Examples – shows how the function can be used, often with the aid of a built-in dataset. The
  examples section can be particularly useful to help understand functions. You can even run
  an example:
example(stack, local=TRUE)

stack> require(stats)

stack> formula(PlantGrowth)         # check the default formula
weight ~ group
<environment: 0x0000000008af8cf8>

stack> pg <- unstack(PlantGrowth)   # unstack according to this formula

stack> pg
   ctrl trt1 trt2
1  4.17 4.81 6.31
2  5.58 4.17 5.12
3  5.18 4.41 5.54
4  6.11 3.59 5.50
5  4.50 5.87 5.37
6  4.61 3.83 5.29
7  5.17 6.03 4.92
8  4.53 4.89 6.15
9  5.33 4.32 5.80
10 5.14 4.69 5.26

stack> stack(pg)                    # now put it back together
   values  ind
1    4.17 ctrl
2    5.58 ctrl
3    5.18 ctrl
4    6.11 ctrl
5    4.50 ctrl
6    4.61 ctrl
7    5.17 ctrl
8    4.53 ctrl
9    5.33 ctrl
10   5.14 ctrl
11   4.81 trt1
12   4.17 trt1
13   4.41 trt1
14   3.59 trt1
15   5.87 trt1
16   3.83 trt1
17   6.03 trt1
18   4.89 trt1
19   4.32 trt1
20   4.69 trt1
21   6.31 trt2
22   5.12 trt2
23   5.54 trt2
24   5.50 trt2
25   5.37 trt2
26   5.29 trt2
27   4.92 trt2
28   6.15 trt2
29   5.80 trt2
30   5.26 trt2

stack> stack(pg, select = -ctrl)    # omitting one vector
   values  ind
1    4.81 trt1
2    4.17 trt1
3    4.41 trt1
4    3.59 trt1
5    5.87 trt1
6    3.83 trt1
7    6.03 trt1
8    4.89 trt1
9    4.32 trt1
10   4.69 trt1
11   6.31 trt2
12   5.12 trt2
13   5.54 trt2
14   5.50 trt2
15   5.37 trt2
16   5.29 trt2
17   4.92 trt2
18   6.15 trt2
19   5.80 trt2
20   5.26 trt2
In the above output, the commands listed in the example at the end of the manual page for the
stack function are run. Note that I added the option local=TRUE to ensure that none of the
variables created in the example interfered with your own workspace. If you run the example
without this option (the default) you will notice that the stack example datasets pg and
PlantGrowth will appear in your workspace. There are also a lot of excellent tutorials and
introductory videos on R usage on the internet which are worth exploring.
R as a calculator
Remember that R can be used any time as a calculator which is very useful, e.g.:
734 + 297
Use + - / * ^ and sqrt for addition, subtraction, division, multiplication, power, square root.
Similarly you can do arithmetic on whole sets of numbers in a dataframe, e.g. the ratio of bluetit
condition to feeding is:
bluetit.dat$condition / bluetit.dat$gardenfood
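A few more calculator-style lines are sketched below. The first part uses plain numbers; the second uses two short sets of invented values, cond and food (not the real bluetit data), to show that arithmetic is applied to each pair of values in turn.

```r
# Plain arithmetic
734 + 297      # addition: 1031
2^10           # power: 1024
sqrt(144)      # square root: 12
(5 + 3) / 4    # brackets control the order of calculation: 2

# Arithmetic on whole sets of numbers (invented values)
cond <- c(2.0, 5.0, 9.0)
food <- c(4, 8, 12)
cond / food    # one ratio per pair: 0.500 0.625 0.750

# Store a result under a name to use it again later
ratio <- cond / food
mean(ratio)
```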
You can save your workspace (R data frames, results of analyses) by clicking on the disk icon in the
Environment tab, but personally I prefer to create R scripts to repeat analyses.
R Scripts
An R Script is a plain text file containing a list of R commands, such as that shown in your History
window, with the advantage that you can annotate it with comments to remind yourself of what you
did. This is very useful if you want to repeat an analysis from a few weeks earlier, and gives you a
reliable record of what you did. It allows your analyses to become more reproducible.
From the Blackboard website, right-click on the file Pract1_commands.R and save it in your
Practical 1 folder. Open the file within RStudio by clicking on the File -> Open File… menu,
and you will see that it contains a full listing of all the commands you have run in today’s practical.
I’ve added comments, prefaced with the # symbol. I’ve also added print and readline
statements to pause the output and make it easier to understand.
To run the script, click on the Source button. It will run all of today’s analyses in a few seconds!
Practice using scripts to store your analyses, and you will find them invaluable.
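To get a feel for what the Source button does, the sketch below writes a three-command script to a temporary file and then runs it with the source command. The file and the names script.file, low and low.mean are made up for illustration; your own scripts would normally be typed in the RStudio editor and saved in your Practical 1 folder.

```r
# Write a tiny commented script to a temporary .R file
script.file <- tempfile(fileext = ".R")
writeLines(c(
  "# Mean diversity for the Low cutting regime (values from this practical)",
  "low <- c(60, 57, 67, 58, 61)",
  "low.mean <- mean(low)",
  "print(low.mean)   # print() is needed to see output from a sourced script"
), script.file)

# Show the script as plain text, then run every command in it
readLines(script.file)
source(script.file)
```

Lines starting with # are ignored by R, so comments can be as long as you need to remind yourself why each step was done.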
When I work with RStudio, I typically write commands interactively into the Console, and regularly
copy them from the History or Console window into an R Script, with comments to remind me of
why I did various analyses. You can copy and paste manually from the Console, or alternatively
click on the “To Source” button in the History window:
Summary
• Use read.csv to import comma-separated values from Excel
• summary, head, tapply commands for simple statistics
• stack and colnames to reshape structure of data
• boxplot and scatter plots using plot command
• xlab, ylab, main options to plot command to improve readability
• Simple user-defined command with function for standard error, 95% CI etc.
• Put R commands into commented scripts (.R files) to keep a record of your analyses