apply, lapply, sapply tapply

Basics2
1
Outline
Review of data structures
Data frames
• Review of data frame structure
• Importing/Exporting
• Data frame example
• Browsing
• Selecting/Subsetting
• Manipulating
• Table and unique functions
• Merging
• Summarizing with apply, by, aggregate
Basic Programming
• if/else
• for/while loops
• functions
• Summarizing with functions
Graphics
• Scatterplots
• Histograms
• Barplots
• Pie Charts
2
Initialize R Environment
## Initialize R Environment
# Set working directory..
setwd("C:/Rworkshop")
getwd()
# Create a new folder named 'Outfolder' in working directory
if(!file.exists("Outfolder")){
dir.create("Outfolder") }
# Set options
options(scipen=6)       # bias against scientific notation
# Load libraries
library(RODBC)          # ODBC database interface
3
Get Sample Data
## Get Sample data
# Load sample data files from website (SampleData_PartI.zip)
http://www.fs.fed.us/rm/ogden/research/Rworkshop/SampleData.shtml
# Extract zip file to working directory
unzip("SampleData_PartI.zip")
# List files in folder
list.files("PlotData")
4
R Data Structures
# vector
group of single objects or elements
(all elements must be the same mode)
# factors
vector of values that are limited to a fixed set of values (categories)
# list
group of objects called components
(can be different modes and lengths)
# array
group of vectors with more than 1 dimension
(all elements must be the same mode)
format: array(data=NA, dim=c(dim1, dim2, ..))
# matrix
2-dimensional array (group of vectors with rows and columns)
(all elements must be the same mode).
format: matrix(data=NA, nrow=1, ncol=1, byrow=FALSE)
# data frame
2-dimensional list (group of vectors of the same length)
(can be different modes)
format: data.frame(col1, col2, ..)
Images from: https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node64.html
5
Data Frames Review
## Data frames are the primary data structure in R.
## Data frames are used for storing data tables, such as forest inventory data
tables.
## A data frame is similar in structure to a matrix, except that a data frame may contain
columns of different modes, such as numeric measurements alongside categorical data
(e.g., species names).
## Each column of a data frame is treated as a vector component of a list and
therefore, columns can be different modes. Unlike a general list, each vector
component must have the same length.
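As a small illustration of the points above, the sketch below builds a data frame from columns of different modes. The object name trees and its values are made up for this example and are not part of the workshop data.

```r
# Hypothetical example data (not from testdat.csv)
trees <- data.frame(SPCD = c(746, 108, 202),
                    SPNM = c("aspen", "lodgepole pine", "Douglas-fir"),
                    LIVE = c(TRUE, FALSE, TRUE),
                    stringsAsFactors = FALSE)
is.list(trees)          # a data frame is stored as a list of columns
sapply(trees, mode)     # each column keeps its own mode
sapply(trees, length)   # every column has the same length
```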
6
Data Frames Review
# data frame
#   2-dimensional list (group of vectors of the same length; can be different modes)
#   data.frame(col1, col2, ..)
## Start with a matrix
mat <- matrix(data=c(seq(1,5), rep(2,5)), nrow=5, ncol=2)
df <- data.frame(mat)       # Convert matrix to data frame
is.data.frame(df)           # Check if object is a data frame
dim(df)                     # Get dimensions of data frame
str(df)                     # Get structure of data frame
matv <- c(mat)              # Create a vector from matrix
is.vector(matv)
dflst <- c(df)              # Create a list from data frame
is.list(dflst)
## Accessing elements of a data frame
## Data frames can be accessed exactly like matrices, but they can also be accessed by column names.
df[1,2]         # Identify value in row 1, column 2
df[3,]          # Identify values in row 3 (1-row data frame)
df[,2]          # Identify values in column 2 (vector)
df[2:4,1]       # Identify values in rows 2 thru 4 and column 1 (vector)
df[3,"X2"]      # Identify value in row 3, column 2
df[,"X2"]       # Identify values in column 2('X2')
df[2:4,"X1"]    # Identify values in rows 2 thru 4 and column 1('X1')
7
Data Frames Review
## Building your own data frame..
dat <- edit(data.frame())
## Build data frame from vectors of rows and columns..
# Column vectors
numbervect <- c(1, 2, 3)
stringvect <- c("a", "b", "c")
logicalvect <- c(TRUE, FALSE, TRUE)
# Combine column vectors into a data frame and name columns
df <- data.frame(cbind(numbervect, stringvect, logicalvect))
df <- data.frame(cbind(COLUMN1=numbervect, COLUMN2=stringvect,
COLUMN3=logicalvect))
# Row vectors
row1vect <- c(1, "a", TRUE)
row2vect <- c(2, "b", FALSE)
row3vect <- c(3, "c", TRUE)
# Combine row vectors into a data frame (rbind stacks the vectors as rows)
df <- data.frame(rbind(row1vect, row2vect, row3vect))
# Name data frame columns
names(df) <- c("COLUMN1", "COLUMN2", "COLUMN3")
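Note that cbind() and rbind() build a matrix first, so every value is coerced to a single mode (character in this case). The sketch below, which assumes you want each column to keep its own mode, passes the column vectors straight to data.frame(). The object names dfchar and df2 are made up for this illustration.

```r
numbervect <- c(1, 2, 3)
stringvect <- c("a", "b", "c")
logicalvect <- c(TRUE, FALSE, TRUE)
# cbind() builds a character matrix first, so every column becomes text
dfchar <- data.frame(cbind(numbervect, stringvect, logicalvect),
                     stringsAsFactors = FALSE)
sapply(dfchar, class)
# Passing the vectors directly keeps numeric, character, and logical columns
df2 <- data.frame(COLUMN1 = numbervect, COLUMN2 = stringvect,
                  COLUMN3 = logicalvect, stringsAsFactors = FALSE)
sapply(df2, class)
```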
8
Importing Data Frames
## Import data to R as data frame..
read.table()    # reads text file into a dataframe (file, header=FALSE, sep="", na.strings="NA", skip = 0)
scan()          # reads text file into a vector or a list
read.csv()      # reads comma-delimited text file (*.csv) into a dataframe
read.fwf()      # reads text file with fixed width format into a dataframe
read.dbf()      # reads DBF file into a dataframe (library foreign)
# dat <- read.table("c:/data/mydata.txt", header=TRUE)      # valid
# dat <- read.table("c:\\data\\mydata.txt", header=TRUE)    # valid
# dat <- read.table ("c:\data\mydata.txt", header=TRUE)     # invalid
# Notes..
help(read.table)
# read.table() will return an error if not all lines have the full set of values for variables.
# This is common with data from spreadsheets when the last (or right) columns have missing values.
# read.table() – the default delimiter (sep="") is white space: one or more spaces, tabs, or newlines.
# For comma-delimited files
sep=","
# For tab-delimited files
sep="\t"
# File can be a URL (ex. ftp:// urls)
# dat <- read.csv("ftp://ftp2.fs.fed.us/incoming/rmrs/testdat.csv", header=TRUE)
9
Importing Data Frames
## Excel files..
# It is recommended to export the data from Excel to a tab-delimited or comma-separated
# file before importing to R.
# Convert long numbers represented with scientific notation to a number or text before
# exporting (using Format Cells..).
# Check the imported dataframe in R for entire rows or columns of NA values caused by
# extraneous empty cells.
## Missing Values..
# By default, the string "NA" and blank numeric fields are read as missing values. If you have
# some other indication for a missing value, you need to let R know what that missing value
# character is by specifying na.strings (ex. na.strings="" or na.strings="." ).
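The na.strings argument can be tried without a file by passing text to read.csv(). This is a minimal sketch: the contents of csvtext are made up, and "." is assumed to be the missing-value marker.

```r
# Hypothetical file contents, passed as text instead of a file path
csvtext <- "PLT_CN,TREE,DIA,HT
1,1,7.5,40
1,2,.,35
2,1,10.2,"
dat0 <- read.csv(text = csvtext, header = TRUE, na.strings = c("NA", "."))
dat0
is.na(dat0$DIA)   # the "." marker is read as NA
is.na(dat0$HT)    # the blank numeric field is also read as NA
```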
10
Importing Data Frames
## Package: RODBC
# channel <- odbcConnect(dsn)                           # Data Source Name (dsn)
# channel <- odbcConnectAccess("filenm.mdb")            # ACCESS database
# channel <- odbcConnectAccess2007("filenm.accdb")      # ACCESS database (2007, 2010)
# channel <- odbcConnectExcel("filenm.xls")             # Excel workbook (*.xls)
# channel <- odbcConnectExcel2007("filenm.xlsx")        # Excel workbook (2007, 2010)
# channel <- odbcConnect(dsn="idbfia.mci")              # Oracle database (need Oracle driver)
# channel <- odbcConnect("mydb1", case="postgresql")    # PostgreSQL database (need PostgreSQL driver)
# Note: If channel = -1 , something is wrong
# Get data..
# Set channel to path of Excel spreadsheet
channel <- odbcConnectExcel("PlotData/testdat.xls")
# Get list of tables (sheets) in spreadsheet
sqlTables(channel)
# Fetch data from Excel spreadsheet (using sqlFetch or sqlQuery)
testdat <- sqlFetch(channel, "testdat")     # Specify sheet name
dat <- sqlQuery(channel, "select PLT_CN, TREE from [testdat$]")
odbcClose(channel)      # Close connection (frees resources)
odbcDataSources()       # Displays available data source names (dsn)
http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf
11
Exporting Data Frames
## Exporting data from R..
# write.table()     writes a dataframe to a text file (append = FALSE, sep=" ")
# write.csv()       writes a dataframe to a comma-delimited text file (*.csv)
# write.dbf()       writes a dataframe to a DBF file (library foreign)
write.table(testdat, "Outfolder/testdat2.txt")
write.table(testdat, "Outfolder/testdat2.txt",
row.names=FALSE, sep=",")
write.csv(testdat, "Outfolder/testdat2.csv", row.names=FALSE)
# Notes..
help(write.table)
# The default settings include row names and column names, with a space delimiter.
# write.table() – the default delimiter is a single space (sep=" ").
# For comma-delimited files:
sep=","
# For tab-delimited files:
sep="\t"
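A quick way to check an export is to read the file back and compare it with the original. The sketch below uses a made-up data frame (out) and a temporary file instead of the workshop's Outfolder.

```r
# Hypothetical data frame written to a temporary file
out <- data.frame(TREE = 1:3, DIA = c(7.5, 9.1, 10.2))
tmpfile <- tempfile(fileext = ".csv")
write.csv(out, tmpfile, row.names = FALSE)  # omit row names so no extra column is written
back <- read.csv(tmpfile, header = TRUE)
all.equal(out, back)                        # TRUE if the round trip preserved the data
unlink(tmpfile)                             # remove the temporary file
```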
12
Data Frame Example
## Read in sample data frame..
setwd("C:/Rworkshop") # use path to your Rworkshop
dat <- read.csv("PlotData/testdat.csv", header=TRUE, stringsAsFactors=FALSE)
dat
# Display dat. Check if data frame has extra rows of NA values.
PLT_CN      unique plot number      (numero parcela unico)
TREE        unique tree #           (numero arbol unico)
STATUSCD    live/dead               (codigo de estado (vivir/muertos))
SPCD        species code            (estado arbol)
DIA         diameter - inches       (dap)
HT          height - feet           (altura)
BA          basal area sqft         (area basal)
13
Data Frames - Browsing
## Browsing data frames
dat                 # Print dataframe to screen
View(dat)           # Display dataframe in spreadsheet format
dat <- edit(dat)    # Edit dataframe in spreadsheet format
head(dat)           # Display first rows of dataframe (default = 6 rows)
head(dat, n=10)     # Display first 10 rows of dataframe
tail(dat)           # Display last rows of dataframe (default = 6 rows)
names(dat)          # Display column names of dat
row.names(dat)      # Display row names of dat
summary(dat)        # Display summary statistics for columns of dat
str(dat)            # Display internal structure of dat
## Dimensions of data frames
dim(dat)
nrow(dat)
ncol(dat)
14
Data Frames – Selecting/Subsetting
## Selecting and subsetting columns and rows in data frames - df[rows, cols]
# To define or select data frame columns: df$colnm or df[,"colnm"]
# To define or select data frame rows: df[rowid, ]
dat[, c("TREE", "DIA")]         # Select TREE and DIA columns for all rows
dat[3:5,]                       # Select all columns for rows 3 thru 5
dat[c(4,6), c(2:4)]             # Select rows 4 and 6 and columns 2 thru 4
dat[dat$SPCD == 746,]           # Select all columns where SPCD = 746
dat[dat$DIA < 10,]              # Subset rows where DIA < 10
subset(dat, DIA < 10)           # Subset rows where DIA < 10
dfsub <- subset(dat, DIA<10)    # Assign new data frame object to subset of dat
dfsub                           # Display dfsub
# Select the TREE and DIA columns for dead trees only
dat[dat$STATUSCD == 2, c("TREE", "DIA")]
# Select rows where SPCD=746 and STATUSCD=2
dfsub[dfsub$SPCD == 746 & dfsub$STATUSCD == 2, ]
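The row conditions above can be combined with & (and) or | (or). The sketch below uses a small made-up table named trees so the results are easy to check by eye. It is an illustration only, not the workshop data.

```r
# Hypothetical tree records (values are made up for illustration)
trees <- data.frame(TREE = 1:5,
                    STATUSCD = c(1, 2, 1, 2, 1),
                    SPCD = c(746, 746, 108, 202, 746),
                    DIA = c(5.2, 11.0, 8.4, 14.3, 9.9))
# Rows where both conditions hold
trees[trees$SPCD == 746 & trees$DIA < 10, ]
# Rows where either condition holds, keeping two columns
either <- trees[trees$SPCD == 108 | trees$STATUSCD == 2, c("TREE", "SPCD")]
either
# subset() selects the same rows and can also select columns
sub <- subset(trees, SPCD == 746 & DIA < 10, select = c(TREE, DIA))
sub
```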
15
Data Frames - Manipulating
## Manipulating data in data frames..
# To remove column(s) from data frame
dfsub$HT <- NULL                                    # Removes one column
dfsub[,!(names(dfsub) %in% c("TREE", "DIA"))]       # Removes one or more columns
# To change name of a column in a data frame
names(dat)[names(dat)== "SPCD"] <- "SPECIES"
# To order a dataframe
dat <- dat[order(dat$SPECIES, dat$DIA),]            # Order by 2 columns
dat <- dat[order(dat$DIA, decreasing=TRUE),]        # Descending order
dat <- dat[order(dat$PLT_CN),]                      # Order by 1 column
16
Data Frames - Manipulating
## Manipulating data in data frames cont..
# To add a column of concatenated columns
dat$NEW <- paste(dat$TREE, ":", dat$SPECIES)            # With default spaces
dat$NEW <- paste(dat$TREE, ":", dat$SPECIES, sep="")    # With no spaces
# .. with leading 0 values
dat$NEW <- paste(dat$TREE, ":", formatC(dat$SPECIES, width=3, flag="0"), sep="")
# To add a column of percent of total basal area for each tree in dat (rounded).
dat$BApertot <- round(dat$BA / sum(dat$BA) * 100, 2)
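To see what the three paste() variants produce, the sketch below runs them on two short made-up vectors (tree and species) standing in for the TREE and SPECIES columns.

```r
tree <- c(1, 2, 10)        # hypothetical tree numbers
species <- c(19, 93, 746)  # hypothetical species codes
paste(tree, ":", species)                # default separator is a space
nospace <- paste(tree, ":", species, sep = "")
nospace
# formatC() pads each code to a width of 3 with leading zeros
padded <- paste(tree, ":", formatC(species, width = 3, flag = "0"), sep = "")
padded
```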
17
Data Frames - Exploring
## Exploring data frames with the unique and table functions..
# Display the unique values of species in dataset and sort
unique(dat$SPECIES)
sort(unique(dat$SPECIES))
# Display the number of trees by species
help(table)
tab <- table(dat$SPECIES)
class(tab)
# Display the number of trees by species and plot
table(dat$PLT_CN, dat$SPECIES)
table(dat[,c("PLT_CN", "SPECIES")])
data.frame(table(dat$PLT_CN, dat$SPECIES))
# Create an object with the number of trees by status, species, and plot
# .. and add column names
tabdf <- data.frame(table(dat$PLT_CN, dat$SPECIES, dat$STATUSCD))
names(tabdf) <- c("PLT_CN", "SPECIES", "STATUSCD", "FREQ")
# Get a count of dead tree species by plot
table(dat[dat$STATUSCD == 2, "SPECIES"], dat[dat$STATUSCD==2, "PLT_CN"])
table(dat[dat$STATUSCD ==2, "PLT_CN"], dat[dat$STATUSCD == 2, "SPECIES"])
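The conversion from a table to a data frame is easier to follow on a tiny example. In this sketch the vectors plt and spp are made up. Note that data.frame(table(...)) returns one row for every combination, including those with a count of 0.

```r
# Hypothetical plot and species codes
plt <- c("A", "A", "A", "B", "B")
spp <- c(746, 746, 108, 108, 202)
tab <- table(plt, spp)
tab                         # 2-way table of counts
tab["A", "746"]             # index by dimension names (as character strings)
tabdf <- data.frame(tab)    # long format: one row per combination
names(tabdf) <- c("PLT", "SPP", "FREQ")
tabdf[tabdf$FREQ > 0, ]     # keep only combinations that occur
```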
18
Data Frames - Exploring
## More ways to explore data frames..
# Get the maximum height (HT) for Douglas-fir (SPECIES=202)
max(dat[dat$SPECIES == 202, "HT"])
# Get the average height (HT) for both Douglas-fir (SPECIES=202) and lodgepole pine
# (SPECIES=108)
mean(dat[dat$SPECIES %in% c(202, 108), "HT"])
# Get the average diameter (DIA) for dead (STATUSCD=2) aspen trees (SPECIES=746)
mean(dat[(dat$SPECIES == 746 & dat$STATUSCD == 2), "DIA"])
# Get the average diameter (DIA) for live (STATUSCD=1) aspen trees (SPECIES=746)
mean(dat[(dat$SPECIES == 746 & dat$STATUSCD == 1), "DIA"])
19
Data Frames - Merging
# Example: merge variable code descriptions for species to data frame
1. Get data frame of code descriptions
2. Merge to table
20
Data Frames - Merging
# Example: merge variable code descriptions for species to data frame
## First, get data frame of code descriptions
# Build species look-up table (or import from file)
# Gets unique values of species, sorted:
SPECIES <- sort(unique(dat$SPECIES))
SPECIESNM <- c("subalpine fir", "Engelmann spruce", "lodgepole pine",
"Douglas-fir", "aspen")
sptab <- data.frame(cbind(SPECIES, SPECIESNM))
## Then, merge to table
help(merge)
dat <- merge(dat, sptab, by="SPECIES")
dat
dat[order(dat$PLT_CN, dat$TREE), ]
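When the key columns have different names in the two tables, merge() takes by.x and by.y. By default, rows without a match are dropped, and all.x=TRUE keeps them. The sketch below uses two made-up tables (trees and lut), and the code 999 is a deliberately unmatched value.

```r
# Hypothetical tree table and species look-up table
trees <- data.frame(TREE = 1:4, SPCD = c(746, 108, 746, 999))
lut <- data.frame(SPECIES = c(108, 746),
                  SPECIESNM = c("lodgepole pine", "aspen"),
                  stringsAsFactors = FALSE)
# Key columns have different names, so name them with by.x and by.y
m1 <- merge(trees, lut, by.x = "SPCD", by.y = "SPECIES")
m1      # code 999 has no match and is dropped
# all.x=TRUE keeps unmatched rows from the first table, filled with NA
m2 <- merge(trees, lut, by.x = "SPCD", by.y = "SPECIES", all.x = TRUE)
m2
```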
21
Exercise 7
## 7.1. Load testdat.csv again, assigning it to object ex1df.
## 7.2. Display the dimensions, the column names, and the structure of ex1df.
## 7.3. Display the first 6 rows of ex1df.
## 7.4. Display this table ordered by heights with the largest trees first and display the
maximum height.
## 7.5. Change name of column STATUSCD to 'ESTADO', DIA to 'DAP', and column HT to
'ALTURA'.
## 7.6. Merge the sptab table we made earlier (Data Frames - Merging) to the ex1df table,
using the SPCD column, and assign to ex1df2.
Hint: use by.x and by.y for merging to columns with different names.
## 7.7. Display only rows with subalpine fir and DAP greater than 10.0.
## 7.8. Display the number of trees by ESTADO and SPECIESNM.
## 7.9. Display the total basal area (BA) for lodgepole pine.
22
Exercise 7 cont.
## 7.10. Create a new object named 'aspen' with just the rows in ex1df2 that are aspen.
## 7.11. The 2 columns, SPCD and SPECIESNM, are not important in the aspen table.
Remove them and overwrite object 'aspen'.
## 7.12. Display the mean ALTURA of live aspen trees.
## 7.13. Create a look up table for ESTADO and merge this table to ex1df2. Assign the
merged table to ex1df3. Then order ex1df3 by PLT_CN.
## Hint:
##   1. Get vector of unique values of ESTADO
##   2. Define new vector called ESTADONM where 1 = Live and 2 = Dead
##   3. Create dataframe of vectors with names ESTADO and ESTADONM
##   4. Merge new dataframe to ex1df2
##   5. Order table by PLT_CN and TREE
## 7.14. Display the number of trees again, this time by ESTADONM and SPECIESNM.
23
Summarizing Data Frames
with apply, by, aggregate
24
Summarizing Data Frames
apply – applies a function to rows or columns of an array, matrix, or data frame (returns a
vector or array).
sapply – applies a function to each element of a list or vector (returns a vector or array).
lapply – similar to sapply, applies a function to each element of a list or vector (returns a list).
tapply – applies a function to a vector, separated into groups defined by a second vector or
list of vectors (returns an array).
by – similar to tapply, applies a function to a vector or matrix, separated into groups by a
second vector or list of vectors (returns an array or list).
aggregate – applies a function to one or more vectors, separated into groups defined by a
second vector or list of vectors (returns a data frame).
Function    Input                       Output
apply       array/matrix/data frame     vector/array
sapply      vector/list                 vector/array
lapply      vector/list                 list
tapply      vector                      array/matrix
by          data frame                  array/list
aggregate   data frame                  data frame
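The output types listed above can be checked with class(). The sketch below runs each function on a small made-up data frame named d, which is not the workshop data.

```r
# Hypothetical data (values made up for illustration)
d <- data.frame(PLT = c("A", "A", "B", "B"),
                DIA = c(5, 7, 9, 11),
                HT = c(20, 30, 40, 50))
class(apply(d[, c("DIA", "HT")], 2, mean))        # numeric vector
class(sapply(d, is.numeric))                      # logical vector
class(lapply(d, is.numeric))                      # list
class(tapply(d$DIA, d$PLT, mean))                 # 1-dimensional array
class(by(d[, c("DIA", "HT")], d$PLT, colMeans))   # object of class 'by'
class(aggregate(d$DIA, list(d$PLT), mean))        # data frame
```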
25
apply, lapply, sapply
# apply – applies a function to the rows or columns of an array, matrix, or data frame
#   (1-rows; 2-columns).
# Apply a mean across 3 columns of dat
meandat <- apply(dat[,c("DIA", "HT", "BA")], 2, mean)
meandat
is.vector(meandat)          # Returns a vector
# sapply/lapply – applies a function to each element of a list or vector.
# Check if columns of dat are numeric
is.list(dat)
sapply(dat, is.numeric)     # Returns a vector
lapply(dat, is.numeric)     # Returns a list
# Display the number of characters for each column name of dat.
sapply(names(dat), nchar)
26
tapply
tapply - applies a function to a vector, separated into groups defined by a second
vector or list of vectors (returns an array/matrix).
## Get max DIA for each SPECIESNM.
maxdia <- tapply(dat$DIA, dat$SPECIESNM, max)
maxdia
class(maxdia)
is.vector(maxdia)
# Get summed basal area by species and plot.
spba.tapply <- tapply(dat$BA, dat[,c("PLT_CN", "SPECIESNM")], sum)
spba.tapply
dim(spba.tapply)
class(spba.tapply)
is.array(spba.tapply)
# Change NA values to 0 (0 basal area) and round all values to 2 decimal places.
spba.tapply[is.na(spba.tapply)] <- 0
spba.tapply <- round(spba.tapply, 2)
spba.tapply
## Accessing output of tapply
spba.tapply[1,]
spba.tapply["4836418010690",]
spba.tapply["4836418010690", "lodgepole pine"]
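The NA cells come from group combinations that have no data. The sketch below reproduces this with three short made-up vectors (plt, spp, ba), where plot B has no lodgepole pine.

```r
# Hypothetical plot, species, and basal area values
plt <- c("A", "A", "A", "B")
spp <- c("aspen", "aspen", "lodgepole pine", "aspen")
ba <- c(0.5, 0.25, 1, 2)
batab <- tapply(ba, list(plt, spp), sum)
batab                       # the cell for plot B, lodgepole pine is NA
batab[is.na(batab)] <- 0    # treat empty combinations as 0 basal area
batab["A", "aspen"]         # access by row and column name
```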
27
tapply cont..
## Now, let's add one more variable.. and summarize by STATUSCD as well.
stspba.tapply <- tapply(dat$BA, dat[,c("PLT_CN", "SPECIESNM", "STATUSCD")], sum)
stspba.tapply[is.na(stspba.tapply)] <- 0
stspba.tapply
class(stspba.tapply)
## Accessing output of tapply
dim(stspba.tapply)
[1] 2 5 2       # An array with 3 dimensions (PLT_CN, SPECIESNM, STATUSCD)
# Display all elements of dimensions 2 and 3, for the first element of dimension 1.
stspba.tapply[1,,]
stspba.tapply["4836418010690",,]            # ..or the name of the element
# Display all elements of dimensions 1 and 3, for the third element of dimension 2.
stspba.tapply[,3,]
stspba.tapply[,"Engelmann spruce",]         # ..or the name of the element
# Display all elements of dimensions 1 and 2, for the second element of dimension 3.
stspba.tapply[,,2]
stspba.tapply[,,"2"]                        # ..or the name of the element
# Display one element of array.
stspba.tapply[2,4,2]
stspba.tapply["40374256010690","lodgepole pine","2"]
28
by
by – applies a function to a vector or matrix, separated into groups by a second vector
or list of vectors (returns an object of class 'by': an array, or a list if simplify=FALSE).
## Get max DIA for each SPECIESNM
## The default by statement returns an object of class 'by'
maxdia <- by(dat$DIA, dat$SPECIESNM, max)
maxdia
class(maxdia)
is.list(maxdia)
## Accessing output from by (default)
maxdia[1]               # Display element 1
maxdia[[1]]             # Display value of element 1
maxdia[["aspen"]]       # Display value of element named 'aspen'
## Adding simplify=FALSE returns an object that is stored as a list
maxdia2 <- by(dat$DIA, dat$SPECIESNM, max, simplify=FALSE)
maxdia2
class(maxdia2)
is.list(maxdia2)
## Accessing is the same as above, except you can also use '$', because it is a list.
maxdia2[1]              # Display element 1
maxdia2[[1]]            # Display value of element 1
maxdia2[["aspen"]]      # Display value of element named 'aspen'
maxdia2$aspen           # Display value of element named 'aspen'
29
by cont..
# Get summed basal area by species and plot
spba.by <- by(dat$BA, dat[,c("PLT_CN", "SPECIESNM")], sum, simplify=FALSE)
spba.by[sapply(spba.by, is.null)] <- 0      # Empty groups are NULL (not NA) when simplify=FALSE
spba.by
class(spba.by)
is.list(spba.by)
## Accessing output from by statement (same as tapply, except can use list option)
dim(spba.by)
[1] 2 5         # An array(list) with 2 dimensions (PLT_CN, SPECIESNM)
# Display all elements of dimension 2, for the first element of dimension 1.
spba.by[1,]
spba.by["4836418010690",]                       # ..or the name of the element
# Display all elements of dimension 1, for the fourth element of dimension 2.
spba.by[,4]
spba.by[,"lodgepole pine"]                      # ..or the name of the element
# Display one element and element value from array/list.
spba.by[2,4]                                    # Element
spba.by["40374256010690","lodgepole pine"]      # Element
spba.by[,"lodgepole pine"]$"40374256010690"     # Element value
spba.by["40374256010690",]$"lodgepole pine"     # Element value
30
by cont..
by – applies a function to vector or matrix.
tapply – applies a function to a vector only.
# Use by to apply a function to more than one variable. How correlated are DIA, BA, and
HT for each species?
by(dat[,c("DIA", "BA", "HT")], dat$SPECIESNM, cor)
31
aggregate
aggregate – applies a function to one or more vectors, separated into groups defined by a
second vector or list of vectors (returns a data frame).
# Get sum of basal area by species
sumbaspp <- aggregate(dat$BA, list(dat$SPECIESNM), sum)
names(sumbaspp) <- c("SPECIESNM", "SUMBA")
# Get sum of basal area by plot and species
spba.ag <- aggregate(dat$BA, list(dat$PLT_CN, dat$SPECIESNM), sum)
class(spba.ag)
# Add names to data frame
names(spba.ag) <- c("PLT_CN", "SPECIESNM", "SUMBA")
# Get average diameter and average height by SPECIESNM
avgdiahtspp <- aggregate(dat[,c("DIA", "HT")], list(dat$SPECIESNM), mean)
names(avgdiahtspp) <- c("SPECIESNM", "AVGDIA", "AVGHT")
# Merge summed basal area by species with average dia/height by species
datsum <- merge(sumbaspp, avgdiahtspp, by="SPECIESNM")
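One way to reduce the renaming step is to name the elements of the grouping list, since aggregate() uses those names for the grouping columns in its result. The sketch below shows this on a made-up data frame named d.

```r
# Hypothetical data (values made up for illustration)
d <- data.frame(SPECIESNM = c("aspen", "aspen", "lodgepole pine"),
                DIA = c(6, 8, 10),
                HT = c(30, 40, 50))
# Naming the list element names the grouping column in the result
avg <- aggregate(d[, c("DIA", "HT")], list(SPECIESNM = d$SPECIESNM), mean)
avg
names(avg)
```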
32
Comparison of Summary Functions
## Differences in output
# Example: Get the number of trees by plot
# Using tapply
treecnt <- tapply(dat$PLT_CN, dat$PLT_CN, length)
treecnt
class(treecnt)
# Using by
treecnt.by <- by(dat$PLT_CN, dat$PLT_CN, length)
treecnt.by
class(treecnt.by)
is.array(treecnt.by)
# Using aggregate
treecnt.ag <- aggregate(dat$PLT_CN, list(dat$PLT_CN), length)
names(treecnt.ag) <- c("PLT_CN", "NUM_TREES")
class(treecnt.ag)
treecnt.ag
# Using table function
treecnt.tab <- table(dat$PLT_CN)
treecnt.tab
class(treecnt.tab)
is.array(treecnt.tab)
33
Exercise 8
## 8.1 Use sapply to change the names of dat to all upper case letters.
## 8.2 Use apply to get the range of values for DIA and HT in dat. Set the results to an object
named 'ranges'. Is this object an array? What is the class of 'ranges'?
## 8.3 Use tapply to get the minimum 'HT' by SPECIESNM and STATUSNM and set to an
object named minht. Change any NA values to 0.
## 8.4 Use aggregate to get a sum of BA by SPECIESNM. Set this to an object named
'ba.sum'. Add names to ba.sum, SPECIESNM, SUMBA. What is the class of 'ba.sum'?
34
Basic Programming in R
35
if/else statements
## Format
# if(logical statement1){
#   then statement1
# }else if(logical statement2){
#   then statement2
# }else{
#   then statement3
# }
# Example: Print to screen if the value x is less than, equal to, or greater than 10.
x <- 5
if(x < 10){
print("less than 10")
}else if(x == 10){
print("equals 10")
}else{
print("greater than 10")
}
# Assign x to 10 and run the if statement again.
x <- 10
36
ifelse statement
## Format
# ifelse(logical statement1, yes, no)
# Example: Print to screen if the value x is less than 10 or greater than or equal to 10.
x <- 5
answer <- ifelse(x < 10, "less than 10", "greater than or equal to 10")
answer
x <- 10
ifelse(x < 10, "less than 10", "greater than or equal to 10")
# Example: Append a new column named STATUSNM to dat with 2 values, "live", where
# STATUSCD = 1, and "dead", where STATUSCD = 2.
dat$STATUSNM <- ifelse(dat$STATUSCD == 1, "live", "dead")
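Unlike if/else, ifelse() tests every element of a vector and returns a vector of the same length, which is why it can fill a whole column at once. The sketch below uses made-up vectors (statuscd, dia) and also shows that nested ifelse() calls can handle more than two outcomes.

```r
statuscd <- c(1, 2, 2, 1)   # hypothetical status codes
# One result per element of statuscd
statusnm <- ifelse(statuscd == 1, "live", "dead")
statusnm
# Nested ifelse() handles three outcomes
dia <- c(4.2, 10, 15.5)     # hypothetical diameters
diaclass <- ifelse(dia < 10, "less than 10",
                   ifelse(dia == 10, "equals 10", "greater than 10"))
diaclass
```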
37
for/while loops
## Format
# for(i in 1:n){
#   execute statement
# }
# Example: for loop
for(i in 1:10){
print(i)
}
# Example: while loop
i <- 1
while(i < 10){
print(i)
i <- i + 1
}
# Example: for loop
weekdays <- c("Mon", "Tues", "Wednes", "Thurs", "Fri", "Satur", "Sun")
for(day in weekdays){
print(paste(day, "day", sep=""))
}
38
if/else and for loops
## IFELSE statements and FOR loops
# Example: Create a loop using the previous example for if/else statements.
for(x in 1:20){
if(x < 10){
print("less than 10")
}else if(x == 10){
print("equals 10")
}else{
print("greater than 10")
}
}
# Example: Create a loop that compiles a vector with column names of dat that are less
than 5 characters wide.
datnames <- vector()
for(name in names(dat)){
name.length <- nchar(name)
if(name.length < 5){
datnames <- c(datnames, name)
}
}
datnames
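Because nchar() is vectorized, the same result can usually be obtained without a loop by using a logical index. The sketch below uses a stand-alone vector nms holding the original column names of the sample data. In the workshop session, names(dat) would take its place.

```r
# Original column names of the sample data, as a stand-alone vector
nms <- c("PLT_CN", "TREE", "STATUSCD", "SPCD", "DIA", "HT", "BA")
# Keep the names that are less than 5 characters wide
shortnames <- nms[nchar(nms) < 5]
shortnames
```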
39
Functions
# Format
# f <- function(<arguments>) {
#   Do something interesting
# }
# Example: Create a function using the previous example, with x as a user-defined
parameter.
getx <- function(x){
if(x < 10){
print("less than 10")
}else if(x == 10){
print("equals 10")
}else{
print("greater than 10")
}
}
getx(5)
getx(10)
# Example: Function to perform 2 calculations and return the result.
getx2 <- function(x, y){
x2 <- x * y
x2 <- x2 + 5
return(x2) }
x2 <- getx2(5, 100)
getx2(2, 20)
40
Functions cont.
# Example: Function to print the mean diameter of a specified species.
meanDIA <- function(SPECIES){
# Gets mean DIA for SPECIES in dat
dia.mean <- mean(dat[dat$SPECIES == SPECIES, "DIA"])
# Gets name for SPECIES in dat
spnm <- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"])
paste("Mean diameter for", spnm, "is:", round(dia.mean))
}
meanDIA(108)
meanDIA(746)
# Example: Function to print the mean of a specified variable for a specified species.
meanval <- function(SPECIES, var){
# Gets mean var for SPECIES in dat
var.mean <- mean(dat[dat$SPECIES == SPECIES, var])
# Gets name for SPECIES in dat
spnm <- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"])
paste("Mean", var, "for", spnm, "is:", round(var.mean))
}
meanval(746, "HT")
meanval(108, "DIA")
41
Functions cont.
# Example: Function to print the specified function of a specified variable for a specified
species. The default function is mean.
funval <- function(SPECIES, var, funnm=mean){
# Summarizes var by funnm for SPECIES in dat
var.mean <- funnm(dat[dat$SPECIES == SPECIES, var])
# Gets name for SPECIES in dat
spnm <- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"])
# Gets the string name for the function
funnm <- deparse(substitute(funnm))
paste(funnm, var, "for", spnm, "is:", round(var.mean))
}
funval(746, "HT", mean)
funval(746, "HT")
funval(108, "BA", sum)
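The deparse(substitute()) step is easier to see in isolation. The sketch below uses a made-up helper named summ that takes a function as an argument and reports which function was used, including when the default is used.

```r
# Hypothetical helper: summarize x with any function and label the result
summ <- function(x, fun = mean){
  funnm <- deparse(substitute(fun))   # text of the argument as typed (or of the default)
  paste(funnm, "is:", fun(x))
}
r1 <- summ(c(2, 4, 6))        # uses the default function, mean
r1
r2 <- summ(c(2, 4, 6), max)   # passes max as the function
r2
```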
42
Functions cont.
# Example: Function to return a table of pfun applied to pvar by xvar and yvar.
datPivot <- function(dat, pvar, xvar, yvar, pfun=sum){
ptab <- tapply(dat[,pvar], dat[,c(xvar, yvar)], pfun)
ptab[is.na(ptab)] <- 0
return(ptab)
}
datPivot(dat, "BA", "PLT_CN", "SPECIESNM")
datPivot(dat, "HT", "PLT_CN", "SPECIESNM")
datPivot(dat, "HT", "PLT_CN", "SPECIESNM",mean)
# Help on functions
https://ems-team.usda.gov/sites/fs-fia-iw-tt/R%20Help%20docs/Rfunctions.pdf
43
Summarizing with Functions
# Create a function to use with an apply statement to calculate a percent of the total basal
area for multiple variables by species.
# First, create a table summarizing multiple variables by species.
sptab <- aggregate(dat[,c("DIA", "BA")], list(dat$SPECIESNM), sum)
names(sptab) <- c("SPECIESNM", "SUMDIA", "SUMBA")
sptab
# Then, create a function to calculate percentages by species.
perfun <- function(x){ round(x / sum(x) * 100, 2) }
# Apply this function on all columns of sptab
apply(sptab[,-1], 2, perfun)
44
Summarizing with Functions
# Use the perfun function to calculate a percent of total basal area per plot.
# First, create a table summarizing BA by species and plot
sppltba <- aggregate(dat$BA, list(dat$SPECIESNM, dat$PLT_CN), sum)
sppltba
names(sppltba) <- c("SPECIESNM", "PLT_CN", "SUMBA")
sppltba
# Then use a by statement to calculate percent basal area per species
sppltba.by <- by(sppltba$SUMBA, sppltba$PLT_CN, perfun)
sppltba.by
# Append the percent calculation to sppltba.
sppltba$BApercplot <- unlist(sppltba.by)
sppltba
# Check the column totals by plot
by(sppltba$BApercplot, sppltba$PLT_CN, sum)
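Appending unlist() of a by result works here because the rows of sppltba are already grouped by plot. As an alternative sketch, ave() applies a function within groups and returns the values in the original row order, so it does not depend on how the rows are sorted. The data frame sp below is made up and deliberately unsorted.

```r
perfun <- function(x){ round(x / sum(x) * 100, 2) }
# Hypothetical summed basal areas, deliberately not sorted by plot
sp <- data.frame(PLT_CN = c("B", "A", "B", "A"),
                 SUMBA = c(1, 3, 3, 1))
# ave() applies perfun within each plot and keeps the row order
sp$BApercplot <- ave(sp$SUMBA, sp$PLT_CN, FUN = perfun)
sp
# Check that each plot totals 100
tapply(sp$BApercplot, sp$PLT_CN, sum)
```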
45
Exercise 9
## 9.1 Use the sapply function to determine if any columns of dat are factors.
## 9.2 Create a loop to check if any of the columns of dat is a factor. Print to screen the name
of the columns that are factors.
## 9.3 Append a column named HTCLASS to dat with 2 categories: "SHORT" or "TALL",
use the mean of HT as the break point.
So: for trees with HT less than the mean HT, set HTCLASS = "SHORT", and
for trees with HT greater than or equal to the mean HT, set HTCLASS = "TALL".
Check if this works using a table on HT and HTCLASS.
## 9.4 Create a function named 'getrows', that will print out all the records (or trees) of dat
for a specified SPECIES. Include one parameter named 'sp'. How many records have
lodgepole trees (108)?
## 9.5. Using the function you created in 9.4, how many records have aspen trees? What is
the average HT for these records (trees)?
## 9.6 Create a function to use with sapply to add a "_1" to all the column names of dat.
46
A Touch on Graphics in R
47
Scatterplots
# Plot a scatterplot of basal area by height
plot(dat$BA, dat$HT)
help(plot)
help(par)       # Help on graphical parameters
par()           # Display current settings of par
## Add labels
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height")
## Change symbol (pch; see help(points))
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19)
## Change color and size of points
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19, col="red", cex=0.5)
colors()        # List of color choices
# cex=0.5 makes the symbol half the size
48
Scatterplots
# Add multiple plots to graphics device (display window).
par(mfrow=c(1,3))
# 1*3 = 3 plots (1 row, 3 columns)
## Add plots
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height")
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19)
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19, col="red", cex=0.5)
# Change width of space surrounding plots in graphics device (display window).
# mar
#   Number of lines of margin to be specified on all 4 sides of plot.
#   Default is c(5, 4, 4, 2) + 0.1 (bottom, left, top, right).
par(mfrow=c(1,3), mar=c(5, 6, 4, 2))
## Add plots
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height")
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19)
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19, col="red", cex=0.5)
49
Scatterplots cont.
par(mfrow=c(1,1))
plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area",
ylab="Height", pch=19, col="red", cex=0.5)
# Add fit lines to scatterplot to look at trends
abline(lm(dat$HT ~ dat$BA))                     # Regression line (y~x)
lines(lowess(dat$BA, dat$HT), col="green")      # lowess line (x,y)
# Scatterplot matrix
pairs(dat[,c("BA", "HT", "DIA")])
help(pairs)
## Other graphics packages
# car
# scatterplot3d
50
Histograms
help(hist)
# Frequency Histogram
# Displays the frequency of observations (DIA).
par(mfrow=c(1,3))
hist(dat$DIA)
hist(dat$DIA, breaks=2)
hist(dat$DIA, breaks=2, ylim=c(0,10), main="Histogram of DIA", xlab="DIA")
# Probability Histogram
# Displays the probabilities (density) associated with all observations (DIA).
par(mfrow=c(1,2))
hist(dat$DIA, main="Frequency Histogram")
hist(dat$DIA, prob=TRUE, main="Probability Histogram")
# Density curve of observations
par(mfrow=c(1,1))
plot(density(dat$DIA))
# Overlay density curve on probability histogram
par(mfrow=c(1,1))
hist(dat$DIA, prob=TRUE)
lines(density(dat$DIA))
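The difference between the frequency and probability histograms can be inspected numerically by asking hist() not to draw. The sketch below uses a made-up vector dia, where counts are the bar heights of the frequency histogram and density values are those of the probability histogram.

```r
dia <- c(5.1, 6.3, 7.8, 8.2, 9.5, 10.4, 12.9, 14.0)  # hypothetical diameters
h <- hist(dia, plot = FALSE)       # compute the histogram without drawing it
h$breaks                           # bin boundaries
h$counts                           # frequency histogram bar heights
h$density                          # probability histogram bar heights
sum(h$density * diff(h$breaks))    # bar areas of the probability histogram sum to 1
```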
51
Barplots
help(barplot)
## Create a barplot of total basal area by species
par(mfrow=c(1,1))
# Get sum of basal area by species (using aggregate).
spbasum.ag <- aggregate(dat$BA, list(dat$SPECIESNM), sum)
names(spbasum.ag) <- c("SPECIESNM", "SUMBA")
## Create barplot
barplot(spbasum.ag$SUMBA)
barplot(spbasum.ag$SUMBA, names.arg=spbasum.ag$SPECIESNM,
main="Sum of Basal Area - aggregate")
# Get sum of basal area by species (using tapply) and create barplot.
spbasum.t <- tapply(dat$BA, dat$SPECIESNM, sum)
barplot(spbasum.t, main="Sum of Basal Area - tapply")
## Add colors
barplot(spbasum.t, col=c("yellow", "dark green", "blue", "orange", "red"),
main="Sum of Basal Area - tapply")
barplot(spbasum.t, col=rainbow(5), main="Sum of Basal Area - tapply")
help(rainbow)
## Make x and y axes labels smaller
barplot(spbasum.t, col=c("yellow", "dark green", "blue", "orange", "red"),
cex.axis=0.75, cex.names=0.75, space=0.1, ylim=c(0,3.0))
52
Pie Charts
## Create a pie chart of the proportion of basal area by species.
help(pie)
# Use the table we created previously
sptab <- aggregate(dat[,c("DIA", "BA")], list(dat$SPECIESNM), sum)
names(sptab) <- c("SPECIESNM", "SUMDIA", "SUMBA")
sptab
# Calculate the proportion of SUMBA by species
propba <- round(sptab$SUMBA / sum(sptab$SUMBA), 2)
## Create pie chart
par(mfrow=c(1,1))
pie(propba, labels=sptab$SPECIESNM)
## Now let's create a pie chart for each plot we measured.
# Again, use the table we created previously, with BApercplot
sppltba <- aggregate(dat$BA, list(dat$SPECIESNM, dat$PLT_CN), sum)
names(sppltba) <- c("SPECIESNM", "PLT_CN", "SUMBA")
sppltba$BApercplot <- unlist(by(sppltba$SUMBA, sppltba$PLT_CN, perfun))
par(mfrow=c(1,2))
for(cn in unique(sppltba$PLT_CN)){
plt <- sppltba[sppltba$PLT_CN == cn,]
pie(plt$BApercplot, labels=plt$SPECIESNM, cex=0.75, main=cn)
}
53
Exercise 10
## 10.1 Create a scatterplot of BA vs. DIA with proper labels. Are these variables
correlated?
## 10.2 Add a regression line to the scatterplot in 10.1 and color it red.
## 10.3 Create a histogram of HT with 5 breaks and labels. What is the range of heights
with the highest frequency?
## 10.4 Create a barplot of maximum heights by species with labels. Which species has
the maximum height?
54
R Help Links
# List of available packages in R
http://cran.r-project.org/web/packages/available_packages_by_name.html
# Link to help on RODBC package
http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf
# Data manipulation
https://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/manipulation.cfm
# apply/sapply/aggregate/..
http://lamages.blogspot.com/2012/01/say-it-in-r-with-by-apply-and-friends.html
# R Programming
http://en.wikibooks.org/wiki/R_Programming/
# Help on functions
https://ems-team.usda.gov/sites/fs-fia-iw-tt/R%20Help%20docs/Rfunctions.pdf
# R Reference Card
http://scicomp.evergreen.edu/docs/workshops/RReferenceCard.pdf
55
