EART125/225: Central tendency in R
Jan 6
As you work through today’s exercises, I would recommend typing the code yourself rather than
cutting and pasting from this document or from your previous work.
Part 1: Loading data files, assigning variables, working with data frames (review)
Remember that the R function read.csv() is used for reading comma-separated text files,
either from your hard drive or directly from an internet URL. In this class, you can just read the
files directly from the website. Even though you may have the file loaded from last time, it’s
good practice (both scientifically and for your coding skills) to load it each time you start R.
On its own, the function reads the csv file and converts it to an R object, but won’t actually store
it anywhere. To do that, you need to assign it to a variable (choose your own concise but
descriptive name) using the <- symbol. For example:
georoc <- read.csv("http://people.ucsc.edu/~mclapham/earth125/data/georoc.csv")
This file is a data frame object (containing a mixture of numeric and character columns). The
most common way to specify a particular column within a data frame is using the $ symbol. For
example: georoc$SIO2 or georoc$rock.type. Remember that R is case-sensitive.
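As a minimal sketch of assignment and the $ syntax, the block below reads inline text instead of downloading the file, so it runs without an internet connection. The name georoc_small is hypothetical, and the values are four columns of the five sample rows printed in Part 2.

```r
# Inline comma-separated text standing in for the real file on the website.
csv_text <- "tectonic.setting,rock.type,SIO2,TIO2
Convergent margin,Andesite,67.46,0.52
Intraplate,Rhyolite,69.86,0.41
Convergent margin,Rhyolite,73.27,0.33
Seamount,Dacite,71.67,0.46
Convergent margin,Rhyolite,64.83,0.87"

# read.csv() returns a data frame; <- stores it in a variable.
# georoc_small is a hypothetical name chosen for this sketch.
georoc_small <- read.csv(text = csv_text)

# Use $ to pull a single column out of the data frame.
georoc_small$SIO2        # numeric column
georoc_small$rock.type   # text column

# Column names are case-sensitive: georoc_small$sio2 returns NULL.
is.null(georoc_small$sio2)

# str() summarizes the column names and types.
str(georoc_small)
```

With the real file, only the read.csv() call changes: give it the URL in quotation marks instead of the text argument.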
Part 2: Working with data frames – the subset function
The data files for this class are organized so that each row is an individual observation (one
sample), with some columns containing the actual data and others containing relevant metadata
(supplementary information about each sample’s context). You should always set up data files in
this format. In a data frame, character columns will often be treated as an R object type called
a factor (read.csv() does this by default in R versions before 4.0; in later versions, add
stringsAsFactors = TRUE to the read.csv() call). Use levels(df$col) to list the values, called
the levels, in a factor column.
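The following sketch shows what a factor and its levels look like. It assumes R 4.0 or later, where read.csv() leaves text columns as plain character unless stringsAsFactors = TRUE is given, so the conversion is done explicitly with factor(). The vector names are hypothetical, and the rock names are the five from the sample rows in this handout.

```r
# Rock types from the five sample rows (plain character vector).
rock_type <- c("Andesite", "Rhyolite", "Rhyolite", "Dacite", "Rhyolite")

# levels() only works on factors; for a character vector it returns NULL.
levels(rock_type)

# Convert to a factor, then list the distinct categories (sorted alphabetically).
rock_factor <- factor(rock_type)
levels(rock_factor)   # "Andesite" "Dacite" "Rhyolite"

# table() counts how many observations fall in each level.
table(rock_factor)
```

If levels() returns NULL on one of your columns, unique() will list the distinct values of a character column directly.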
Here’s a snippet of the georoc dataset to illustrate that layout:
   tectonic.setting rock.type  SIO2 TIO2 AL2O3  CAO  MGO  MNO
1 Convergent margin  Andesite 67.46 0.52 15.57 3.01 1.37 0.08
2        Intraplate  Rhyolite 69.86 0.41 14.75 0.84 0.20 0.06
3 Convergent margin  Rhyolite 73.27 0.33 13.15 1.84 0.78 0.08
4          Seamount    Dacite 71.67 0.46 13.83 2.84 1.02 0.06
5 Convergent margin  Rhyolite 64.83 0.87 13.79 1.43 2.91 0.04
However, you often want to know details about (or perform statistical tests on) data from just one
of your categories. For example, what is the SiO2 content of andesite rocks, or what is the TiO2
content of rocks from convergent margins? The R function subset() is designed for this purpose.
The subset function requires two arguments (an argument is some parameter – a value or a
variable – given to the function when you run it).
The first argument is the name of a variable, typically a data frame object or column(s) from a
data frame. The second argument is a logical statement; it will be used to select the rows of the
data frame (or column) where that statement is true.
A logical statement is one where the outcome is either true or false. The most common logical
statements are:
==   “is equal to” – used for both numeric and character data
!=   “is not equal to” – used for both numeric and character data
>    “greater than” – used only for numeric data, can also use >=
<    “less than” – used only for numeric data, can also use <=
Other less common statements include:
%in% evaluates if one object (numeric or character) is found in a vector of multiple objects
is.na() evaluates if the value is NA (missing data)
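To see what these logical statements return, here is a small sketch on two short vectors. The vector names are hypothetical, and the NA was placed in the third position purely for illustration; it is not a gap in the real dataset.

```r
# Hypothetical example vectors; the NA is inserted for illustration only.
rock <- c("Andesite", "Rhyolite", "Rhyolite", "Dacite", "Rhyolite")
sio2 <- c(67.46, 69.86, NA, 71.67, 64.83)

# Each statement is evaluated element by element and returns TRUE or FALSE.
rock == "Rhyolite"                  # FALSE  TRUE  TRUE FALSE  TRUE
rock != "Rhyolite"                  #  TRUE FALSE FALSE  TRUE FALSE

# Comparisons involving a missing value give NA rather than TRUE or FALSE.
sio2 >= 65                          #  TRUE  TRUE    NA  TRUE FALSE

# %in% checks each element against a vector of several allowed values.
rock %in% c("Andesite", "Dacite")   #  TRUE FALSE FALSE  TRUE FALSE

# is.na() flags the missing entries.
is.na(sio2)                         # FALSE FALSE  TRUE FALSE FALSE
```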
Here are three examples:
Returns all columns of a data frame variable called georoc where the value in the rock.type
column is Andesite (remember – case-sensitive!)
subset(georoc, georoc$rock.type == "Andesite")
Returns just the SIO2 values from a data frame variable called georoc where the value in the
rock.type column is not Andesite
subset(georoc$SIO2, georoc$rock.type != "Andesite")
Returns just the TIO2 values from a data frame variable called georoc where the value in the
SIO2 column is greater than or equal to 65
subset(georoc$TIO2, georoc$SIO2 >= 65)
Note that you always write character data inside quotation marks, but numeric data is never
written with quotation marks.
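The three subset() calls can be tried on a small stand-in data frame. This is a sketch: georoc_small and the result names are hypothetical, and the values are three columns of the five sample rows shown earlier in this handout, not the full file.

```r
# Hypothetical stand-in built from the five sample rows of the georoc data.
georoc_small <- data.frame(
  rock.type = c("Andesite", "Rhyolite", "Rhyolite", "Dacite", "Rhyolite"),
  SIO2 = c(67.46, 69.86, 73.27, 71.67, 64.83),
  TIO2 = c(0.52, 0.41, 0.33, 0.46, 0.87)
)

# All columns, only the rows where rock.type is Andesite (one row here).
andesite_rows <- subset(georoc_small, georoc_small$rock.type == "Andesite")

# Just the SIO2 values for rows where rock.type is not Andesite.
other_sio2 <- subset(georoc_small$SIO2, georoc_small$rock.type != "Andesite")
other_sio2        # 69.86 73.27 71.67 64.83

# Just the TIO2 values for rows where SIO2 is at least 65.
high_silica_tio2 <- subset(georoc_small$TIO2, georoc_small$SIO2 >= 65)
high_silica_tio2  # 0.52 0.41 0.33 0.46
```

Storing each result in its own variable makes it easy to pass the subset to another function later.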
Part 3: Central tendency functions
The R functions to calculate central tendency measures are fairly straightforward. They are:
mean(x) #calculates the mean
median(x) #calculates the median
Each function requires a single numeric vector variable as the input. I have denoted that variable
as “x” in the examples above, so you will have to replace it with a named numeric variable from
your R environment. That vector can (and will often) be a column from a data frame or the
output from the subset function.
One complication: sometimes your data frame may contain missing values, which R treats as NA
(a logical constant so the variable is still considered to be numeric). If you try to calculate the
mean or median of a vector that contains one or more NA values, the result will be NA.
Conveniently, functions like mean and median contain an additional option to remove NA values
before calculating the output. The option is specified after listing the vector variable:
mean(x, na.rm = TRUE) #calculates the mean after removing NA
You can also abbreviate TRUE as T (note the single equals sign, not the logical ==).
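A short sketch of how NA affects these functions. The vector names are hypothetical; the five numbers are the SIO2 values from the sample rows in this handout, and the NA is appended only to demonstrate the behavior.

```r
# SIO2 values from the five sample rows.
sio2 <- c(67.46, 69.86, 73.27, 71.67, 64.83)

mean(sio2)     # 69.418
median(sio2)   # 69.86

# Append a missing value (for illustration only).
sio2_gap <- c(sio2, NA)

# With an NA present, the result is NA.
mean(sio2_gap)
median(sio2_gap)

# na.rm = TRUE drops the NA before calculating.
mean(sio2_gap, na.rm = TRUE)     # 69.418
median(sio2_gap, na.rm = TRUE)   # 69.86
```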
Part 4: Plotting histograms
You have learned that the shape of the data distribution (i.e., symmetrical or skewed) is an
important piece of information, because it helps you decide whether to use mean or median as a
measure of central tendency. The R function to make a histogram is simply:
hist(x) #where x is again a numeric vector variable
R allows you to tweak many parameters that control the appearance of your plot (see ?plot and
?par for the complete list), but here are two:
To set the x-axis label, use:
hist(x, xlab = "Some text")
To set the color (in this case of a histogram, the color of the bar fill), use:
hist(x, col = "tomato3")
R has 657 named colors (type colors() at the text prompt for a list or search online for images
showing the colors) and can accept rgb values (with the rgb() function) or hexadecimal colors,
if you know or care about such things.
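Both options can be combined in one call. The sketch below uses a hypothetical vector holding the five SIO2 values from the sample rows in this handout, so the histogram itself is only a toy; the axis label text and the rgb() values are likewise just examples.

```r
# Hypothetical vector: SIO2 values from the five sample rows.
sio2 <- c(67.46, 69.86, 73.27, 71.67, 64.83)

# Histogram with a custom x-axis label and a named fill color.
hist(sio2, xlab = "SiO2", col = "tomato3")

# colors() returns the full vector of named colors.
length(colors())          # 657
"tomato3" %in% colors()   # TRUE

# rgb() converts red, green, and blue values into a hexadecimal color string.
fill_hex <- rgb(255, 99, 71, maxColorValue = 255)
fill_hex                  # "#FF6347"

# A hexadecimal string can be given to col in place of a color name.
hist(sio2, xlab = "SiO2", col = fill_hex)
```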
Exercise
First, read the following data files and store each as a variable (make sure to give each variable
an appropriately descriptive name).
http://people.ucsc.edu/~mclapham/earth125/data/chesapeake.csv
http://people.ucsc.edu/~mclapham/earth125/data/georoc.csv
http://people.ucsc.edu/~mclapham/earth125/data/venuscrater.csv
After you have done that, go to the tests & quizzes section of eCommons and work through the
practice questions for in-class 2: central tendency.
