R Manual

An R Companion to Mathematical Methods in the
Natural Sciences (Vorwerk and Vorwerk)
Diane P. Genereux∗
Stephen V. O’Brien, editor
Westfield State University
Westfield, MA
∗
†
[email protected]
[email protected]
1
†
Contents
Unit 0 R Companion: Algebra Review
1 Unit 1 R Companion: Using Technology
1.1 What is R? . . . . . . . . . . . . . . . .
1.2 Some thoughts on learning R . . . . . .
1.3 How to credit R in published work . . .
1.4 Plots in R . . . . . . . . . . . . . . . . .
1.5 Installing R . . . . . . . . . . . . . . . .
1.6 Graphing Data in R . . . . . . . . . . .
1.7 Troubleshooting Data Entered in R . . .
1.8 Entering functions in R . . . . . . . . . .
1.9 Plotting functions in R . . . . . . . . . .
1.10 Troubleshooting functions in R . . . . .
iii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
2 Unit 2 R Companion: Scientific notation
1
1
1
1
2
2
3
4
6
7
8
9
3 Unit 3 R Companion: Solving Linear Functions
3.1 Using R to Solve Linear Equations Graphically . . . . . . . . . . . . . .
3.2 Using R to identify the point of intersection for two linear equations. . .
11
11
12
4 Unit 4 R Companion: Linear Regression
4.1 Plotting an inferred function . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Interpolation and Extrapolation . . . . . . . . . . . . . . . . . . . . . . .
15
17
17
5 Unit 5 R Companion: Quadratic Functions
5.1 Solving quadratic equations graphically in R . . . . . . . . . . . . . . . .
19
19
6 Unit 6 R Companion: Behavior of a Function
6.1 Plotting some special functions . . . . . . . . . . . . . . . . . . . . . . .
6.2 Scaling your plot window to examine the behavior of a function . . . . .
6.3 Some reminders about using graphs to assess the behavior of functions .
23
23
23
25
7 Unit 7 R Companion: Function Library and Non-Linear
7.1 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . .
7.2 Homework for R Users . . . . . . . . . . . . . . . . . . . .
7.3 Answers to Homework for R Users . . . . . . . . . . . . . .
27
27
31
32
Regression
. . . . . . . .
. . . . . . . .
. . . . . . . .
8 Unit 8 R Companion: Descriptive Statistics
33
9 Unit 9 R Companion: Hypothesis Testing
2
35
9.1
Question: You have taken a sample and calculated a proportion. Is the true
proportion in the population from which your sample is drawn different
from 50%? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 Question: Do the proportions calculated for samples from two populations
indicate that the two populations differ in their true proportions? . . . .
9.3 Question: Given an estimated mean of a sample, what can we say about
the mean for the population from which it was drawn? . . . . . . . . . .
9.4 Question: Do these samples come from populations with different means?
9.5 Question: Is there evidence of change a difference between ”before” and
”after” samples taken for a given set of individuals? . . . . . . . . . . . .
9.6 Implementing the χ2 Test in R . . . . . . . . . . . . . . . . . . . . . . . .
9.7 A sample data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8 Doing the χ2 test by hand . . . . . . . . . . . . . . . . . . . . . . . . . .
9.9 To Perform a χ2 test in Excel . . . . . . . . . . . . . . . . . . . . . . . .
9.10 To Perform a χ2 test in R . . . . . . . . . . . . . . . . . . . . . . . . . .
35
37
38
39
40
41
41
42
42
43
10 Unit 10 R Companion: Confidence Intervals
10.1 Proportions: computing the confidence interval and the margin of error .
10.2 Means: computing the confidence interval and the margin of error . . . .
45
45
47
11 Unit 11 R Companion: Experimental Design
51
A Appendix 1: Other hints that may be useful as you
A.1 Using R as a Calculator . . . . . . . . . . . . . . . .
A.2 File Types in R . . . . . . . . . . . . . . . . . . . . .
A.3 Saving Files in R . . . . . . . . . . . . . . . . . . . .
A.4 Typing Shortcuts in R . . . . . . . . . . . . . . . . .
A.5 Importing data into R, and exporting plots from R .
learn R
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
B Appendix B: Exponential Regression in R
B.1 Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.2 Enter your data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.3 Find the natural log of your output variable . . . . . . . . . . . . . . . .
B.4 Use linear regression to find the line that best relates your input variable,
x to the log of your output variable — that is, loge (y). . . . . . . . . . .
B.5 To find ”m” for the exponential function, use the slope inferred for the
best-fit line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.6 To find ”b” for the exponential function, raise e to the intercept inferred
for the best-fit line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.7 Write your exponential function! . . . . . . . . . . . . . . . . . . . . . . .
B.8 How well does your inferred function fit your data? . . . . . . . . . . . .
C Appendix 3: Implementing the
C.1 A sample data set . . . . . . .
C.2 Doing the χ2 test by hand . .
C.3 To Perform a χ2 test in Excel
C.4 To Perform a χ2 test in R . .
χ2 Test
. . . . .
. . . . .
. . . . .
. . . . .
3
by
. .
. .
. .
. .
Hand, in
. . . . . .
. . . . . .
. . . . . .
. . . . . .
Excel, and
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
in
. .
. .
. .
. .
R
. .
. .
. .
. .
53
53
54
54
54
55
57
57
58
58
58
59
59
59
60
61
61
62
62
62
Preface
Welcome to the first edition of our R Companion to the Biology 123 Coursebook by Vorwerk and Vorwerk. We hope that this Companion will be useful to you in your studies
of mathematical methods, and of the R statistical computing language.
We are grateful for your input on revisions that would make this Companion more
useful for future readers.
Diane P. Genereux and Stephen V. O’Brien, editor
Westfield, MA
January 2013
i
Unit 0 R Companion: Algebra
Review
Your Coursebook contains all of the material required for this section.
iii
Chapter 1
Unit 1 R Companion: Using
Technology
1.1
What is R?
R is a statistical computing language. It works both as a scientific calculator, and as a
computing environment with the capacity to to perform a wide range of analyses, and to
make complex plots and other graphics. R is currently the programming environment of
choice for many biologists and statisticians.
1.2
Some thoughts on learning R
Learning R is a little challenging and (I think) a lot of fun.
With the basic programming skills you’ll learn and apply in our class, you will be
prepared to use R in your other coursework. In past semesters, for example, students who
worked with R in Biology 0123 applied their R skills to complete projects in Introductory
Biology, and Genetics.
Strategies for coding in R are widely documented in online discussions. Should you
want to read about how to approach a new problem in R, a quick Google search is likely
to yield several useful strategies.
Working with R in our course will be an excellent way to become comfortable with
computer programming in general, and will be good preparation for learning a wide
variety of programming languages in the future.
The goal of this R Companion is to introduce you to some basic programming approaches that will be useful in completing the problems in your Coursebook by Vorwerk
and Vorwerk.
1.3
How to credit R in published work
R was written by Ross Ihaka and Robert Gentleman. Since their initial development of
this language, thousands of people have contributed software packages that extend R’s
capabilities to specific, new forms of data analysis.
Should you decide to publish a paper in which you use R, it will be important to give
credit to the team of people who developed it.
1
The preferred citation is:
R: A Language and Environment for Statistical Computing
R Development Core Team, The R Foundation for Statistical Computing
Vienna, Austria 2012
1.4
Plots in R
Using R, you can plot data sets large and small, and then tune those plots in almost
infinitely many ways.
Here is a plot I made using R (D.P. Genereux 2009, PLoSGenetics:e1000509). The plot
examines how Methylation Density (y-axis), a modification that helps to determine
whether individual genes are on or off in individual cells, depends on cell Division
Number (x-axis).
Epigenetic Costs of Genetic Fidelity?
Figure 2. Trajectories of methylation densities under asymmetric or symmetric strand segregation, with high initial methylation
As you
canThesee,
it isstrand
possible
to plot methylation
data points
and
in R,
well
as to add
density.
oldest-parent
and the population-mean
densities (filled
and functions
open circles, respectively),
are as
shown
for simulations
run under asymmetric and symmetric modes of strand segregation (circles and squares, respectively). For the simulations shown here, I used a
numerousstarting
informative
labels.
If the
you
want
to examine
wide densities
variety
graphs
made
methylation density
of m~0:8. For
scenarios
with parent
strand de novo a
methylation,
were of
calculated
with m~0:975,
and using
d ~d ~0:05. Under asymmetric strand segregation, these parameter values lead to monotonic increases in population-mean and oldest-parent
strandof
methylation
(upper
curves). Under symmetric strand
segregation,
these parameter values
lead to population-mean
and oldest-parent
R — some
themdensities
both
information-rich
and
aesthetically
impressive
— take
a look at
strand DNA methylation densities that were dynamic about the starting value (middle curves). With no parent strand de novo methylation
(d ~0:1, d ~0), densities were unchanged under both symmetric and asymmetric strand segregation (dashed line).
http://gallery.r-enthusiasts.com
doi:10.1371/journal.pgen.1000509.g002
p
d
1.5
d
p
achieved through asymmetric strand segregation, with implications for human disease.
The rate of increase will also depend on the initial DNA
methylation density (compare, for instance, Figures 2 and 3).
Lorincz et al. [16] found that progression to dense methylation is
especially likely for genomic regions that have already attained
intermediate methylation densities. In light of this finding, it seems
plausible that even slow or transient increases in DNA methylation
could raise methylation densities to a threshold sufficient to trigger
more substantial increases.
What might be the functional implications of the increased DNA
methylation densities predicted under asymmetric strand segregation? The accumulation of methyl groups on a long-lived DNA strand
could serve as a signal to guide asymmetric strand segregation itself
[17], or to distinguish stem cells from differentiated cells [13]. My
findings could also help to explain the positive correlation observed
between age and methylation density in endometrial [18] and
intestinal [19] tissues. Both of these are rapidly-dividing tissues of the
sort initially predicted by Cairns [1], and reported by some groups
[4], to undergo asymmetric strand segregation. In contrast, slowlydividing cells, such as those in the hematopoetic lineage, have
constant methylation densities [20–23] and have been reported not to
undergo asymmetric strand segregation [6]. Thus, the systematic
increases in DNA methylation densities predicted here may be
specific to the rapidly-dividing lineages Cairns initially discussed [1].
My results may also have implications for the etiology of cancer
in humans. Several epithelial cancers are associated with
reductions in epigenetic fidelity, including the accumulation of
aberrant methylation and abnormal gene silencing [24,25].
Barrett’s esophagus illustrates the potential relevance of these
findings. The esophageal epithelium in Barrett’s esophagus
contains abnormal intestinal crypt-like structures, and is characterized by abrupt increases in DNA methylation densities and
consequent silencing of loci critical to cell-cycle regulation [26].
Thus, it is possible that directional change in epigenetic
information may be a cost of the increased genetic fidelity
Installing R
Models
Modelling an Epithelial Crypt
Because R is an open-source program, you can download
it and install it, without charge.
I developed a simplified model of an epithelial crypt with which to
track methylation dynamics (Figure 1). Each crypt consists of one
It work equally well on Macs and PCs.
somatic stem cell, and four differentiated cells. At each round of
stem-cell division, one terminally differentiated cell is produced, and
one stem cell is produced. The top-most of the terminally
differentiated cells is sloughed off at the epithelial surface.
Segregation of the oldest DNA strand always to the stem cell
characterizes asymmetric strand segregation (Figure 1a); segregation
of the oldest DNA strand at random to the stem and terminally
differentiated cells characterizes symmetric segregation (Figure 1B).
Modelling DNA Methylation Events in an Epithelial Crypt
Installing R on a Mac
Maintenance and de novo methylation. I modelled replication and methylation dynamics for a single, methylated locus such
as one of those on the hemizygous X chromosome in a human male,
or on the inactive X chromosome in a human female. Prior to cell
division, the locus undergoes semi-conservative replication,
producing two double-stranded DNA molecules. Each molecule is
composed of a parent strand from the original double-stranded molecule, and a newly-synthesized daughter strand. The model to be described in detail below compares methylation dynamics under asymmetric (Figure 1A) and symmetric (Figure 1B) strand segregation.
Methyl groups are added to CpG cytosines in DNA by two
different processes: maintenance methylation and de novo
methylation. Maintenance methylation is performed by maintenance methyltranferases, which exhibit a preference for hemimethylated CpG/CpG dyads, are thought to localize to the
1. Check to see whether R is installed on your computer already. On a Mac, you
can use the Spotlight function (click on the magnifying glass in the upper-right
corner). Just type R in the search box. If the list that appears contains the symbol
shown below, click on it. R is already installed!
2
PLoS Genetics | www.plosgenetics.org
3
June 2009 | Volume 5 | Issue 6 | e1000509
2. If the symbol above does not appear on the list, you’ll need to download and install
R. To do so:
a) Go to http://cran.r-project.org/bin/macosx/
b) Click on R-2.15.1-signed.pkg
c) Install the downloaded file.
d) The R program should now be available in your Applications folder.
e) Need help? Please ask!
Installing R on a PC
Check to see whether R is installed on your computer already. Use the “Search” option
available at the bottom left of your screen to look for R. If it’s already installed, you can
skip the rest of this section.
If R is not installed, you’ll need to download it and install it. To do so:
1. Go to http://cran.r-project.org/bin/windows/base/
2. Click on ”Download R 2.15.2 for Window”
3. This will download a .exe file (“R-2.15-win.exe”)
4. Follow the prompts asking for information about where to install R. In most cases,
the default settings should be fine.
5. When the installation is complete, a shortcut-to-R icon should appear on your
desktop.
6. Need help? Please ask!
1.6
Graphing Data in R
Entering data in R
Many of the data sets you will collect in the natural sciences will consist of lists of
paired numbers. For example, you might gather information on the height of a plant at
different time points following its germination. For such a data set, “time” would be the
independent (input) variable, and “height” is the dependent (output) variable. Data of
this form could be displayed as follows:
Time (days)
1
2
3
4
5
Height (cm)
2
3
4.6
5
5.1
3
To enter such data in R, you would create two separate lists: one for “time”, the other
for “height”.
To create these two separate lists in R, you would type,
timevalues<-c(1,2,3,4,5)
and
heightvalues<-c(2,3,4.6,5,5.1)
These lists introduce some important conventions of R code. 10pt
A list must:
• Be given a name. Here, the names of the two lists are timevalues and heightvalues.
In R, you came name a list whatever you like, so long as the name contains no spaces.
It is conventional that list names not contain any capital letters or punctuation.
• Be indicated using the symbol < − This can be typed as a less-than sign (<),
followed by a hyphen(-). Together, these symbols create a left-pointing arrow, and
tell R that you want to assign the variable name you’ve chosen to refer to a list
containing the data points you’re about to enter.
• Have data entered in the form c(, , , ...) Here, the “c” stands for “concatenate”
(which means “link together”). Parentheses then are used to surround a list of data
points that are separated from one another by commas.
After you have established lists according to these conventions, you may choose to
display your data in a table. Below, for example, I’ve chosen to display these data as a
matrix called “mydata”. Here, too, you’re welcome to give your data any name you like,
provided that the name does not contain spaces.
This code:
mydata<-data.frame(timevalues,heightvalues)
would display your data as
timevalues
1
2
3
4
5
1.7
heightvalues
1
2.0
2
3.0
3
4.6
4
5.0
5
5.1
Troubleshooting Data Entered in R
Often, you’ll enter a set of (x, y) coordinates, where each x value corresponds to a specific
y value — as in our time/plant height example above.
It’s crucial to proofread your typed lists in R to ensure that no values have been omitted! R will not be able to make an (x, y) plot using two lists of different length.
4
If you do try to create a table or plot using lists of different length, R will return this
error message:
Error in xy.coords(x, y, xlabel, ylabel, log) :
’x’ and ’y’ lengths differ
Missing data are, of course, a reality of experimental science. Weather and illness are
just two of the many factors that can impact your ability to collect data at the times
originally planned. Fortunately, there is a way to enter and display a data set for which
one more or more data points are missing!
Say, for example, you had been collecting height data for the plant from the example
above, but weren’t able to collect data on day three. Your incomplete data set would
look like this:
Time (days)
1
2
3
4
5
Height (cm)
2
3
5
5.1
You could enter these data in R as follows:
timevalues<-c(1,2,3,4,5)
heightvalues<-c(2,3,NA,5,5.1)
Here “NA” (which stands for “Not Available”) is used to indicate that a data point
is missing.
You’d then update your data frame to use the revised data
mydata<-data.frame(timevalues,heightvalues)
which would produce the following:
1
2
3
4
5
timevalues heightvalues
1
2.0
2
3.0
3
NA
4
5.0
5
5.1
Plotting data that you have entered in R
Here are two different ways to plot (x, y) data in R:
1. Use your x and y lists separately: plot(timevalues,heightvalues)
2. Use your data frame directly: plot(mydata)
5
3.5
2.0
2.5
3.0
heightvalues
4.0
4.5
5.0
Each of these options produces the following plot:
1
2
3
4
5
timevalues
Note the “gap” left in the plot at time 3, indicating the time point for which no data
were collected.
1.8
Entering functions in R
To code the function,
z(p) = p + 6
(1.1)
in R, you would write the following:
z<-function(p) {p+6}
A few things to note:
• z is the name of the function
• the < − symbol once again tells R that we want the name (here, z) to take a
particular value
• function tells R that you’re coding a function (rather than, say, just doing a
calculation)
• (p) — the parentheses are required — tells R that the function z will have p as its
input variable.
• p+6 gives the function itself. For this part, you need to use special “curly” parentheses, which look like this: {}. You can enter curly parentheses by holding shift,
and using the keys just to the right of “p” on your computer keyboard.
6
1.9
Plotting functions in R
Once you have coded a function in R, you can use “plot” to graph the function over a
given range of input variables.
For examine, if we wanted to plot our function z, we would type,
plot(z)
Yielding this plot:
6.0
6.2
6.4
z (x)
6.6
6.8
7.0
Plot of z(p) over the default range of input values from 0 to 1
0.0
0.2
0.4
0.6
0.8
1.0
x
You’ll note that this only plots using input values from 0 to 1! If we want to explore
the function over a different range of input values — say, from -10 to 10, we would type:
plot(z, xlim=c(-10,10))
In this code, “xlim” is short for “the limits of the range of x values”. The “c(10,10)” refers to the range of inputs values over which we wish to examine our function.
7
0
5
z (x)
10
15
Plot of z(p) over a broader range of input values from -10 to 10
-10
-5
0
5
10
x
Note below that the graph for this broader range looks much the same as the default
— that’s because this is a linear function, with a rate of change that is constant for all
values of the input variable!
As we’ll see later in the course, however, some non-linear functions look very different
when examined over narrow as compared to broad ranges of parameter values. Thus, it’s
always a good idea to examine functions over a broad range of input values.
1.10
Troubleshooting functions in R
On occasion, R will return an error message alerting you that a function cannot be
evaluated because it contains typos or other errors. Here are some problems to check for
should you find that your function will not “work”:
• Error in Parentheses. Be sure that conventional parentheses surround your input
variable, and curly parentheses surround your functional expression.
• Missing Arrow. Be sure that your <- is typed properly.
• Problems with Input variable. R may issue an error message if your input
variable has previously been stored as a list of data. If you are using as an input
variable a name — for examplex — that you’ve previously used for a data list, it’s a
good idea to clear the name of your variable. To so, type rm(p), where “rm” stands
for “remove”, and the name of your function (here ”p”) is enclosed in parentheses.
8
Chapter 2
Unit 2 R Companion: Scientific
notation
R is able to convert from scientific notation to standard notation, and vice versa.
For example, type the following number
345698543985743985734
and then press Enter.
R will return
[1] 3.453985e+20
There are several things to note here:
• You can disregard the “[1]” at the beginning of the output line. This is, in essence,
a reminder that this is a value returned by R, rather than one that you entered.
• the “e” can be read as “times10” — that is, the “e” indicates that the output
provided is given in scientific notation; the number that follows gives the power to
which 10 should be raised.
• By default, R provides scientific notation with six digits following the decimal point.
If you would like to have only three digits total — one before the decimal point,
and two after– you can enter
– options(digits=3)
– Note that this will perform the appropriate rounding, and will return 3.45e+20
– Each time you want to shift the level of precision, simply reenter options(digits=2)
, being sure to indicate the appropriate number of digits.
Be sure you can use scientific notation both with and without R! While it
is often useful to ask R to do these conversions for you, you also be sure that you are
able to move from standard to scientific notation, and back.
You should aim to be able to convert quickly between standard and scientific notation, even in the absence of a computer. Please see your main Coursebook for more on
converting to and from scientific notation without R or other technologies.
9
Chapter 3
Unit 3 R Companion: Solving Linear
Functions
In Unit 1, you learned how to code and plot functions in R. You may find it useful to
review that section before proceeding.
3.1
Using R to Solve Linear Equations Graphically
The goal of solving a function graphically is to identify the value(s) of the input variable
for which the left and the right sides of the equation are equal. This is the same as saying
that our goal is to identify values of the input variable for which there is no difference
between the values of the expressions on left and rights side of the equations.
More specifically, in R, our strategy for solving linear equations is to identify values of
the input variable for which the difference between the left and right sides of the equation
is zero. Your solution will take the form of one or more values for the input variable, x.
Solving an equation with one solution
To find the solution for this equation:
4(x − 2) = 2x − 5
(3.1)
You would use the following steps:
1. Code each side of the function as a separate function in R. For this equation,
that would be:
y1<-function(x) {4*(x-2)}
y2<-function(x) {2*x-5}
2. Plot both functions on the same set of axes. For clarity, we will graph y1
in blue, and y2 in red.
R knows the names of lots of colors. Try altering your code to make one of these lines
green, yellow, or purple!
plot(y1, -5, 5, col=’blue’)
plot(y2, -5, 5, col=’red’, add=TRUE)
11
0
10
N.B. In this code, add=TRUE tells R that you want to code the two functions together
using one set of axes, rather than on separate plots.
-10
-20
y1 (x)
intersection!
-4
-2
0
2
4
x
From the plot, we see that these two lines cross! As noted above, the x value at which
the two lines cross is called the solution —the x value for which the two expressions are
equal. Using your plot, estimate the value of x at the point where the two lines cross.
This is an estimate of the solution to your equations.
Solving an equation with no solution
When two functions yield parallel lines, the lines never intersect, so we would say that
there is no solution. In this case, your “solution” would be “no solution”.
Solving an equation with infinitely many solutions
When two functions yield a single line, the values of the two functions are identical for
all values of x, so we would say that there are infinitely many solutions. In this case,
your “solution” would be “infinitely many solutions”.
3.2
Using R to identify the point of intersection for
two linear equations.
Sometimes, just looking at a graph is not good enough — you need more precise information on the x value for which two plots intersect! R can help with this, too.
12
As noted above, the point(s) at which the two plots intersect is an x value or set of x
values for which the two functions yield the same answer. Another way of saying this: the
solution represented on a graph marks the input-variable value for which the difference
between the two functions is zero! Therefore, we approach this problem by asking R to
tell us where the difference between these two function is zero.
First, we define a new function that will calculate the difference between y1 and y2.
To make things easier for ourselves, lets call that function difference
difference<-function(x) {y1(x)-y2(x)}
Next, we tell R that we want to find the place where the function difference is exactly
equal to zero. To do so, we type
uniroot(difference, c(-100, 100))[1]
The resulting number is the x value that is the solution to these equations! In the
code above, the -100 and 100 give the lower and upper bounds, respectively, of the range
over which we are asking R to look for solutions. You can change these values to examine
a smaller or larger range of values.
The meaning of “uniroot” Wondering about the meaning of the uniroot function
above? Basically, it identifies the value of the input variable (here, x) for which the difference equation we’ve entered has value of zero. In essence, we’re using this code to ask
for the place where the difference between the two functions is zero — that is, where they
have the same value! The returned value for x is the solution for our system of equations.
13
Chapter 4
Unit 4 R Companion: Linear
Regression
Finding the best-fit line and its equation
To infer a best-fit line in R, first enter lists of data corresponding to your x and y variables.
timevalues<-c(1,2,3,4,5)
heightvalues<-c(2,3,4.6,5,5.1)
It’s good practice to make sure that the two lists are of equal length — if they’re not,
R will not be able to infer a best-fit line. If there are values missing from your data —
that is, you have an x value without a corresponding y value or vice versa, you may wish
to use NA (see Unit 1 for more on how to do this).
For short lists, manual proofreading is often sufficient. To confirm that longer lists
are of appropriate length, the length function is helpful:
length(timevalues)
length(heightvalues)
If the returned lengths are equal, you’re ready to proceed to linear regression! If not,
you’ll want to examine your data for errors. You can always display your data as entered
by typing a variable name, and then pressing “Enter”.
Recall that you can make an (x, y) plot by typing
plot(timevalues, heightvalues)
where “timevalues”, the input/independent variable, is given before “heightvalues” the
output/dependent variable. Often, an (x, y) plot will provide an initial sense of whether
there is a linear relationship between your two variables.
To gather quantitative information on whether or not there is evidence of a linear
relationship of “height” to “time”, you would use the following code:
summary(lm(formula = heightvalues ~ timevalues))
Here, ”summary” tells R that you want it to summarize its findings (a summary
should be sufficient for our purposes here).
15
The most important part of the code is this:
lm(formula = heightvalues ~ timevalues)
Here, “lm” stands for “linear model” — which make sense because we are doing linear
regression! Technically, this means that we are asking R to find the ”linear model” that
will best fit the present data set.
This part:
formula = heightvalues ~ timevalues
tells R that we want to find a formula for height as a function of time.
Two things are especially important to note:
1. In linear regression, the y variable (here, “heightvalues”) is given before the x
variable (here, “timevalues”). It may be useful to think of this form as asking,
“what function would allow us to estimate height as a function of time?”
2. Here, the symbol between the x and y variables is a a tilde, and can be typed by
pressing the key just above the tab key, while holding down the shift key.
After you enter the above code into R, you will see the following output:
Call:
lm(formula = heightvalues ~ timevalues)
Residuals:
1
2
-0.30 -0.12
3
0.66
4
5
0.24 -0.48
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
1.4800
0.5510
2.686
0.0747 .
timevalues
0.8200
0.1661
4.936
0.0159 *
--Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 0.5254 on 3 degrees of freedom
Multiple R-squared: 0.8904,
Adjusted R-squared: 0.8538
F-statistic: 24.36 on 1 and 3 DF, p-value: 0.01595
R returns a great deal of information here! If you take a statistics course later in
your studies, you’ll learn much more about linear regression, and the meanings of these
various terms.
16
For the moment, only two components are essential:
1. The formula you entered!
Call:
lm(formula = heightvalues ~ timevalues)
This should look familiar. It’s just the relationship for which we asked R to infer a line.
Once again, be sure that you’ve entered the dependent (y) variable before the independent
(x) variable.
2. The intercept and slope for the line that R inferred to fit these data.
Coefficients:
Estimate Std.
(Intercept)
1.4800
timevalues
0.8200
Here, the y intercept is estimated to be 1.4800 (top line). The slope is estimated to
be 0.8200 (bottom line). Now that we have these two values, we can write the equation
for the line that R has inferred to be the best fit to these data.
A very important note on writing functions: You’ll see that the function below
uses “h” and “t” (not “heightvalues” and “timevalues”) to refer to the two variables of
interest. That’s because we want R to use “heightvalues” and “timevalues” to refer to our
lists of data, and “h” and “t” to refer to the variables in our function. I always use names
that end with the word “values” to name data sets, and single letters to name parameters
in functions. I recommend that you follow this convention when writing functions of your
own, to avoid confusion between data lists and variables!
The formula for our line of interest is:
h = 0.8200 ∗ t + 1.4800
4.1 Plotting an inferred function
Once you have inferred the slope and intercept for your best-fit line, you can code the
corresponding function in y = mx + b form! For the line inferred above, that function
would be:
h<-function(t) {0.8200 * t + 1.4800}
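Because h is an ordinary R function, you can evaluate it at many inputs in a single call. Here is a quick sketch (assuming, as in this Unit's example, that the observations were made at times t = 1 through 5):

```r
# Best-fit line inferred by lm() above
h <- function(t) {0.8200 * t + 1.4800}

# Evaluate the fitted line at t = 1, 2, 3, 4, 5 in one call
h(1:5)   # 2.30 3.12 3.94 4.76 5.58
```

Comparing these fitted heights to the raw data is exactly how the residuals in the lm() output are computed: each residual is an observed height minus the corresponding fitted height.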
4.2 Interpolation and Extrapolation
Once you have written a function to calculate h as a function of t, interpolation and
extrapolation are quite easy!
For any given value of the input variable t, you can use the function you’ve written
to calculate the corresponding h value. For example, say we wanted to interpolate to
estimate the plant’s height as t = 2.5 we would type,
17
h(2.5)
and the press enter, which would yield
3.53
Similarly, to extrapolate to predict the height of the plant at t = 200, you would enter
h(200)
which would yield
165.48
Finally, to display your best-fit line on a plot with your raw data, you would type
plot(timevalues, heightvalues); plot(h, xlim=c(1,5),add=TRUE)
Note that this code is really just a fusion of two things you’ve already learned to do:
plot a data set, and infer a best-fit line! This code will yield the following plot:
[Plot: “Plant height data and best-fit line” — time (1 to 5) on the x axis, height (2.0 to 5.0) on the y axis, showing the raw data points with the best-fit line overlaid.]
Chapter 5
Unit 5 R Companion: Quadratic Functions
5.1 Solving quadratic equations graphically in R
An overview
As noted in our earlier work on solving linear equations (Unit 3), the solution set for a
system of functions includes all values of x for which the functions are equal. In graphical
terms, this means that your goal is to find the value of x at the one or more places where
the plots for these functions intersect. The solution(s) to a set of equations is the x value
or values that yield equal values of y for the two equations.
Steps for Solving Functions Graphically in R
The strategy for solving a system of quadratic equations graphically in R has three steps
that closely parallel those for solving linear equations:
1. Enter your two functions as two separate functions.
2. Define a third function (here, we’ll call it “difference”) that calculates the difference between the first two functions you’ve entered.
3. Solve to determine the value of x at these points of intersection.
For example:
a) Define two functions (using the left- and right-hand sides of the initial function).
y1<-function(x) {4*x^2+3*x+1}
y2<-function(x) {(x-1)*(2*x+4)}
b) Graph both functions on the same graph. To do this in R:
plot(y1,xlim=c(-4,4), ylim=c(-5,100), lty=2)
plot(y2,xlim=c(-4,4), add=T)
[Plot: y1(x) (dashed) and y2(x) (solid), x from -4 to 4, y from 0 to 100, with the legend “This is y1(x)” / “This is y2(x)”.]
c) Here, you can see that the two curves do not intersect, so there is no solution.
Some Notes about the Code Above
The term
lty=2
in the code for y1 tells R to draw that curve with a dashed line, so that the two
functions can be told apart on the same plot.
Also, the code for y1 contains the term
ylim=c(-5,100)
As you will recall from our earlier work on plotting functions in R, this command
tells R to set the y axis to go from -5 on up to 100. Try changing these numbers,
and see how your graph changes!
NOTE: Depending on the range you plot, you may not be able to tell whether your
two curves intersect or not! Try adjusting the range to get a clearer sense of this. For
example, the graph above examines the interval from x = −4 to x = 4. If you wanted to
examine this interval to range from x = −100 to x = 100, you would use this code:
plot(y1, xlim=c(-100,100), lty=2)
plot(y2,xlim=c(-100,100), add=T)
Take a look at the resulting graph. While the two functions it plots are the same as
those in the earlier panel, the range of the y axis is much broader in the second plot than
in the first. As a result, we can no longer tell whether or not these two curves actually
intersect in the area where they draw close to one another. To avoid uncertainties and
possible misinterpretations of this sort, be sure to examine your functions on several
different scales!

[Plot: y1(x) (dashed) and y2(x) (solid), x from -100 to 100, y from 0 to 40000; at this scale the two curves appear to touch.]
Example: Solve x² − 3x = (1 − x) · 2x.
1. Define two functions: y1 = x² − 3x (left side) and y2 = (1 − x) · 2x (right side).
y1<-function(x) {x^2-3*x}
y2<-function(x) {(1-x)*2*x}
2. Graph both functions on the same graph:
plot(y1,ylim=c(-10,70),xlim=c(-4,4))
plot(y2, xlim=c(-4,4),add=T,lty=2);abline(v=0, lty=3);abline(h=0, lty=3)

[Plot: y1(x) (solid) and y2(x) (dashed), x from -4 to 4, y from 0 to 60, with dotted reference lines at x = 0 and y = 0; the two curves intersect twice.]
You’ll note that there are two different intersections here; our job in finding the solutions
is to identify both of these points. As previously, when our goal was to solve systems
of linear equations, our strategy here will be to make a new function that calculates
the difference between the two functions, and then identify the x value that makes
the difference equal to zero. So...
a) Define your difference function as one function minus the other:
difference<-function(x) {y1(x)-y2(x)}
And then, also as before, use the
uniroot
command to search for the x value at which the difference function equals zero.
Recall that the
uniroot
command takes two arguments: the name of the function, and the interval of x values over which to search.
Here, I look at the graph, and see that the graphs for the two functions cross
somewhere near x = 0. For this, I’ll start by examining on the interval (-1,1):
uniroot(difference, c(-1,1))
This gives several lines of output. For the moment, concentrate on the first line.
This gives the first root, which here is:
$root
[1] 8.733861e-07
This is a very small positive number; that is, it’s very, very close to zero.
For my second root, I look at the interval that, based on the graph, would seem to
contain my second solution. Once again, concentrate on just the first line of output:
uniroot(difference, c(1,2))
This yields a solution of ∼1.67.
Therefore, we say that the equation has two roots. They are ∼0 and ∼1.67.
Note: the ∼ symbol is called a tilde, and, in this context, is used to mean “approximately”. We state these roots as approximations, because they are not integers, and
to list the inferred roots precisely would very likely require lots and lots of significant
figures.
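The whole procedure for this example can be collected in one place. This is just a sketch of the steps described above:

```r
# Left- and right-hand sides of x^2 - 3x = (1 - x)*2x
y1 <- function(x) {x^2 - 3*x}
y2 <- function(x) {(1 - x)*2*x}

# The difference is zero wherever the two graphs intersect
difference <- function(x) {y1(x) - y2(x)}

# Search each interval that brackets one crossing
root1 <- uniroot(difference, c(-1, 1))$root   # very close to 0
root2 <- uniroot(difference, c(1, 2))$root    # very close to 5/3
```

Algebraically, the difference simplifies to 3x² − 5x = x(3x − 5), so the exact roots are 0 and 5/3 ≈ 1.67; uniroot recovers them numerically.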
Chapter 6
Unit 6 R Companion: Behavior of a Function
6.1 Plotting some special functions
Careful attention to notation is essential in all coding, and especially so when
working with complex functions! Here are a few examples of R code for functions
you will encounter in Unit 6:
Function          R Code
y = x³            y<-function(x){x^3}
y = sin(x)/x      y<-function(x){sin(x)/x}
y = 1/x + 5       y<-function(x){1/x+5}
y = 1/x²          y<-function(x){1/(x^2)}
6.2 Scaling your plot window to examine the behavior of a function
For many functions, behavior for x → ∞ and x → −∞ will be visible only when
you examine the function over a wide range of values for the input parameter.
Consider, for example, this code in R
y<-function(x) {sin(x)/x}
If you were to use this code to make a plot:
plot(y)
You would initially see a plot of the function on the range x = 0 to x = 1, as is the
default convention in R. The graph would look like this:
[Plot: y(x) = sin(x)/x on the default range, x from 0.0 to 1.0, y from about 0.85 to 1.00.]
While this would be a good start, it would not permit you to examine the behavior
of the function for negative values, or for x values greater than 1!
To examine the behavior of functions, it will be very useful to adjust the x and y
axes to permit viewing of the function over a broader range of input and output
values. To view this function for −500 < x < 500 and −0.06 < y < 0.06, we would
enter:
plot(y, xlim=c(-500,500), ylim=c(-.06,.06))

[Plot: y(x) = sin(x)/x, x from -500 to 500, y from -0.06 to 0.06; the curve oscillates and damps toward 0 in both directions.]
Note that narrowing the range of the y values plotted yields information that
was not at all apparent from our first plotting attempt. In particular:
• The function is undefined at x = 0 (that’s because the denominator would
have a value of 0!)
• As x → ∞, y → 0; as x → −∞, y → 0.
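You can also confirm this limiting behavior numerically, without a plot. A small sketch:

```r
y <- function(x) {sin(x)/x}

# Since |sin(x)| <= 1, the output can be no larger than 1/|x|
abs(y(1000))    # at most 0.001
abs(y(-1000))   # at most 0.001
```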
6.3 Some reminders about using graphs to assess the behavior of functions
When examining functions using a graph in R, be sure to:
• Adjust the scale of your axes so as to display the function over a wide range
of input and output values.
• Be on the lookout for x values for which the function is undefined. This should
be apparent from your graph in R. It should also be possible to identify these
points by examining the function itself to see whether there are any x values
for which the denominator will be 0.
Chapter 7
Unit 7 R Companion: Function Library and Non-Linear Regression
Here is some R code for several of the functions you’ll encounter in this chapter:

Type of Function                        Function                R Code
Exponential Growth                      y = a·b^x               y<-function(x){a*b^x}
Exponential Decay (negative exponent)   y = a·b^(-x)            y<-function(x){a*b^-x}
Logistic Growth                         y = c/(1+a·e^(-b·x))    y<-function(x){c/(1+a*exp(-b*x))}
Sine                                    y = sin(x)              y<-function(x){sin(x)}
Cosine                                  y = cos(x)              y<-function(x){cos(x)}
Surge Function                          y = A·x·e^(-b·x) + C    y<-function(x){A*x*exp(-b*x)+C}
Square Root Function                    y = √x                  y<-function(x){x^(1/2)}
Absolute Value Function                 y = |x|                 y<-function(x){abs(x)}
Logarithmic Function, base 10           y = log10(x)            y<-function(x){log(x,10)}
Logarithmic Function, base e            y = ln(x)               y<-function(x){log(x,exp(1))}
A Note on value e, used in the logistic growth, surge, and natural logarithmic functions: The symbol e is used to indicate a value of approximately
2.718. In your work with mathematical expressions, you’ll often need to raise e to
various powers. R has a special way of indicating the number e.
Say, for example, you want to raise the number e to power 2. You would type
exp(2). Here, exp indicates the number e, and the parenthetical number indicates
the power to which e should be raised.
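For example (a quick sketch you can type directly into the console):

```r
exp(1)   # e itself, about 2.718282
exp(2)   # e squared, about 7.389056
exp(0)   # any number raised to the power 0 is 1
```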
7.1 Nonlinear Regression
The procedure for non-linear regression in R is very similar to the linear case:
a) Enter your data as two separate lists. For example,
xvalues<-c(1,2,3,16,22)
yvalues<-c(4,5,6,7,8)
b) Plot your data:
plot(xvalues,yvalues)

[Plot: xvalues (5 to 20) on the x axis, yvalues (4 to 8) on the y axis.]
c) Have a look at your graph. These data sure look nonlinear!
You already know how to do linear regression in R:
summary(lm(yvalues~xvalues))
This infers the best-fit line to be y = 4.6565 + 0.1527 ∗ x
But if we add this function to the plot
y1<-function(x) {4.6565+0.1527*x}
plot(xvalues,yvalues); plot(y1,xlim=c(1,22),add=T)
[Plot: xvalues vs. yvalues with the best-fit line y1 overlaid.]
it doesn’t look like a good fit at all.
A poor fit to a linear function means that it’s time for non-linear regression,
which basically means that rather than finding the best-fit line, you are finding
some other function that is a good fit to your data. Such a function could,
for example, contain terms like x², x³, ..., xⁿ!
Here is how to ask R to find a second-order non-linear function (that is, a
function that contains x²) that is a good fit to your data:
summary(lm( yvalues ~ xvalues + I(xvalues^2)))
Some parts of this code should already be familiar from your work with linear
regression. If you compare to the line above, you’ll see that the only new part
is the
+ I(xvalues^2)
Basically, that part means that we are allowing for the term xvalues² to
appear in the function that is the best fit to our data. Note: just for notational
reasons, you’ll always need to surround each new term, like the xvalues²
here, with I(...), where that first letter is a capital I, as in ”I am studying in
Biol123”.
Using this notation, we get the long output:

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
       1        2        3        4        5
-0.69733  0.05821  0.82331 -0.36056  0.17636

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.443307   0.721808   6.156   0.0254 *
x            0.258799   0.251491   1.029   0.4116
I(x^2)      -0.004779   0.011163  -0.428   0.7102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.815 on 2 degrees of freedom
Multiple R-squared: 0.8671, Adjusted R-squared: 0.7343
F-statistic: 6.527 on 2 and 2 DF, p-value: 0.1329
As with linear regression, we can start by considering only a small part of
this. Focus, for the moment, on just this part:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.443307   0.721808   6.156   0.0254 *
x            0.258799   0.251491   1.029   0.4116
I(x^2)      -0.004779   0.011163  -0.428   0.7102
---
In this case, we can get our inferred function just by looking at the first
column. The intercept is inferred to be 4.44, the x coefficient is 0.258799, and
the x² coefficient is -0.004779.
So, the best-fit function containing x² is inferred to be
y = −0.00478 ∗ x² + 0.2588 ∗ x + 4.44
Just one more thing to mention here: if you want to fit your data to a function
that has x3 as a term, you apply this same exact logic to write your code:
summary(lm( yvalues ~ xvalues+ I(xvalues^2)+I(xvalues^3)))
You can keep doing this to add more and more powers of xvalues!
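One convenient shortcut (a sketch, assuming the xvalues and yvalues lists entered at the start of this section): rather than reading the estimates off the printed summary, you can pull them out directly with the coef() function.

```r
xvalues <- c(1, 2, 3, 16, 22)
yvalues <- c(4, 5, 6, 7, 8)

# Fit the second-order model and extract its coefficients as a vector
fit <- lm(yvalues ~ xvalues + I(xvalues^2))
coef(fit)   # intercept, xvalues coefficient, and I(xvalues^2) coefficient
```

The three numbers returned are exactly the Estimate column of the summary output.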
The curve inferred in that final step would give the graph below, which is a
much better fit to your data!

[Plot: xvalues (5 to 20) vs. yvalues (4 to 8) with the inferred curve overlaid.]
Note: For this chapter, do the homework on the next page of this R Companion
(not the homework in the Coursebook, which, for this Unit, is specific to the
TI-83 calculator).
7.2 Homework for R Users

x    y
1    1
2    16
3    85
4    240

i. Plot these data, and find the best-fit regression function going up to the power
4 (that is, you want a function that contains the term x⁴).
ii. Based on your previous work with R and linear regression, look at your
output and find the R² value.
iii. Write the function inferred to be the best fit for these data.

[Plot: the homework data with the inferred curve.]
7.3 Answers to Homework for R Users
i. My plot is above.
ii. My R2 value appears to be 1.0. (Added challenge: can you figure out
why?)
iii. The function would then be y = 20 − 40.5 · x + 30 · x2 − 8.5 · x3 + 0.83 · x4
Chapter 8
Unit 8 R Companion: Descriptive Statistics
For our exploration of descriptive statistics in R, let’s assume you start with
this list of data indicating the number of squirrels that you observed outside
your dorm. You collected data every day for two weeks:
squirrels<-c(1,0,0,0,0,4,5,6,6,7,7,8,10,2)
Here are some basic R functions that will allow you to perform descriptive
statistics for this data set. The code for a great many of these calculations
will be intuitive. Moreover, many of these functions use the same code as the
analogous function in Excel.
Here’s the code for some of the simple statistical tests you may want to perform:
Function                                      R Code
mean                                          mean(squirrels)
sum of values                                 sum(squirrels)
maximum value                                 max(squirrels)
minimum value                                 min(squirrels)
median value                                  median(squirrels)
standard deviation for sample                 sd(squirrels) (see notes)
variance                                      var(squirrels)
quartiles (to get 0%, 25%, 50%, 75%, 100%)    quantile(squirrels) (see notes)
number of data points                         length(squirrels)
first quartile                                quantile(squirrels, probs=c(0.25))
second quartile                               quantile(squirrels, probs=c(0.5))
third quartile                                quantile(squirrels, probs=c(0.75))
the 89th percentile                           quantile(squirrels, probs=c(0.89)) (see notes)
total of all squared values                   sum(squirrels^2)
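As a quick sketch of a few of these in action, using the squirrels list defined above:

```r
squirrels <- c(1,0,0,0,0,4,5,6,6,7,7,8,10,2)

mean(squirrels)     # 4
median(squirrels)   # 4.5 (average of the 7th and 8th sorted values)
max(squirrels)      # 10
length(squirrels)   # 14 days of observations
```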
i. Standard deviation for a population: Unless you tell it to do otherwise, R will calculate the sample standard deviation. To get the standard
deviation for the population, you can take advantage of the fact that the
calculations for population and sample standard deviations differ only
slightly (see your Coursebook for more on this!). The population standard deviation can be calculated as:
((sd(squirrels)^2*(length(squirrels)-1))/length(squirrels))^(1/2)
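Here is a quick sketch confirming that this rescaling trick agrees with the direct definition of the population standard deviation (the square root of the mean squared deviation from the mean):

```r
squirrels <- c(1,0,0,0,0,4,5,6,6,7,7,8,10,2)
n <- length(squirrels)

# The rescaling trick from above
pop_sd <- ((sd(squirrels)^2 * (n - 1)) / n)^(1/2)

# The direct definition of the population standard deviation
pop_sd_direct <- sqrt(mean((squirrels - mean(squirrels))^2))
```

Both expressions give the same number, because sd()² is the sum of squared deviations divided by n − 1, and multiplying by (n − 1)/n converts that denominator to n.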
ii. The ”quartiles” function: Instead of ”quartiles”, the term in Excel,
R uses the term ”quantiles” (note the N rather than R!), which is a more
general way of talking about dividing a data set into parts based on the
fraction of data points that fall into each of those parts.
iii. There are a variety of different conventions for calculating quantiles. The
differences between these methods have mostly to do with where to place
data points that lie exactly on the boundary calculated for two segments
of the data set. The default method in R calculates these the same way
that Excel does.
iv. To find the mode, you can take advantage of R’s ability to make a table
to sort your data, and then display the abundance of each datum in the
overall set.
table(squirrels)
Which yields this list:

squirrels
 0  1  2  4  5  6  7  8 10
 4  1  1  1  1  2  2  1  1
Here, the top line indicates 0, 1, 2, 4... squirrels, and the bottom line indicates
how many times each of those numbers of squirrels occurred in the original
data set. Here, for example, you can see that it was the finding of 0 squirrels
that occurred the largest number of times in the data set (it occurred four
times). Therefore, the mode of this data set is 0. By contrast, there was only
one day on which 10 squirrels were observed.
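The table-reading step can also be automated. Here is a sketch, using which.max() to pick out the most frequent value:

```r
squirrels <- c(1,0,0,0,0,4,5,6,6,7,7,8,10,2)

counts <- table(squirrels)
# names(counts) holds the observed values (as text);
# which.max() finds the position of the largest count
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value   # 0, which was observed on four days
```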
Chapter 9
Unit 9 R Companion: Hypothesis Testing
9.1 Question: You have taken a sample and calculated a proportion. Is the true proportion in the population from which your sample is drawn different from 50%?
To address questions of this sort, R uses the proportion test. Consider the
following example:
Say you toss a coin 100 times.
If the coin is fair — that is, if it has equal probabilities of yielding heads and
yielding tails — we might expect that 50 tosses would yield heads, and 50
tosses would yield tails. However, we’re not tossing the coin infinitely many
times, so we expect some random variation around a ”perfect” outcome of
50-50. So, we need to figure how how different the outcome can be from 50-50
before we should begin to suspect that the coin is not fair.
In your experiment, 100 tosses yield 61 heads and 39 tails.
Here, you would use the proportion test to investigate whether or not this
result is consistent with the null hypothesis, H0 , that the coin is fair.
To implement this test, enter:
heads<-c(61)
tosses<-c(100)
prop.test(heads, tosses, p=.5, alternative="greater")
where

heads        # of ”successes” observed
tosses       # of trials
p            null hypothesis, H0, in decimal form
”greater”    alternative hypothesis, H1. Here, that’s the possibility that p > 0.5
Using the code above yields the following R output:

1-sample proportions test with continuity correction

data: heads out of tosses, null probability 0.5
X-squared = 4.41, df = 1, p-value = 0.01786
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5228432 1.0000000
sample estimates:
   p
0.61
Some of the above lines will be familiar based on the code you entered.
For example, the line beginning with
data:
simply gives the numbers of heads, and total tosses, as you originally entered
them.
The next line contains the p-value which, as you know from your reading, tells you
the probability of gathering this particular data set if your null hypothesis is
true:
p-value = 0.01786
Here, the p-value is rather small, which can be interpreted to mean that
only 1.8% of the time would you observe 61 or more heads out of 100 tosses if the
truth were that heads and tails were equally likely outcomes. In this case, we
would be justified in rejecting the null hypothesis, and accepting the alternate
interpretation that the probability of tossing heads using this particular coin
is greater than 0.5.
If you want R to summarize by giving only information about the p-value,
you can write,
prop.test(x=61, n=100, p=.5, alternative="greater")[3]
This is just like the code above, except for the ”[3]” on the end. That little
bit of additional code tells R that you want it to return only the 3rd item in
the answer — that is, just the part where it tells you the p-value. (You’ll find
that if you change the 3 to some other number, R will return a different part
of the answer!)
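Equivalently (a sketch), you can ask for the p-value by name rather than by position, which can be easier to remember:

```r
# Store the full test result, then pull out just the p-value by name
result <- prop.test(x=61, n=100, p=.5, alternative="greater")
result$p.value   # the same p-value, about 0.0179
```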
9.2 Question: Do the proportions calculated for samples from two populations indicate that the two populations differ in their true proportions?
Let’s say that you now have samples for two different populations, and want
to ask whether the true proportions for those populations are significantly
different from one another. For example, imagine that you find that there are
10 males out of 24 students in one section of Biology 123, and 17 males out
of 25 students in another section. You want to ask whether the proportion of
males differs between the two sections.
We use the same logic as before:
males<-c(10,17)
students<-c(24,25)
prop.test(males, students)
Note that here, the lists give the value of each variable for the two populations.
There are 10 males in the first sample, and 17 in the second. Together, they
comprise the list of samples for males. Please ask if you have questions about
this!
The output is then:

2-sample test for equality of proportions with continuity correction

data: males out of students
X-squared = 2.4503, df = 1, p-value = 0.1175
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.57312711  0.04646044
sample estimates:
   prop 1    prop 2
0.4166667 0.6800000
Note that the p-value here is 0.1175, indicating that there is a 12% chance
that the proportions of males in the two populations differ this much by chance
alone, without any true difference between the populations.
Typically, a p-value this high will be interpreted to mean that there is no
justification for rejecting the null hypothesis that the proportions are equal
for the two populations.
9.3 Question: Given an estimated mean of a sample, what can we say about the mean for the population from which it was drawn?
Say you ask each student taking Biology 0123 this semester how many credits
he or she is taking, total. You want to test the hypothesis that the true mean
number of credits for Biology 0123 students is 15.
Here’s your data set:
creditstaken123<-c(12,13,14,15,15,18,12,13,14,15,18,12,12,12,12,13,
14,15,15,13,13,13,13,12,13,14,15,15,18,12,
13,14,15,18,12,12,12,12,13,14,15,15,13,13,13,13)
Implementing the t-test using raw data
The t-test will be useful here to ask whether the true mean for this population
is 15. To implement the t-test for our data set and question, type:
t.test(creditstaken123, mu=15)
There are just two terms to keep in mind here: the first is the name of your
data set (here, creditstaken123).
The second is mu=15 which, here, tells R that you want to test the hypothesis
that the true mean of your data set is 15 (you can think of ”mu” as the code
for ”mean”).
Entering the code above yields a very, very tiny p-value (only 9.94·10⁻⁶, which
can be written in standard notation as 0.00000994 - a very small number!)
Therefore, in this case, it seems safe to conclude that the true mean number
of credits is significantly different from the null hypothesis of 15.
9.4 Question: Do these samples come from populations with different means?
On occasion, you’ll want to ask whether there is evidence that two different
samples have different means. So that we can experiment with this, let’s
define a second data set: the numbers of credits taken by a group of students
who are not currently enrolled in Biology0123:
creditstakenNOT123<-c(15,15,15,16,13,13,14,15,15,16,17,12,
13,14,15,15,16,17,16,17,12,
13,14,15,15,16,17,16,17,12,13,14,15,15,16,17)
Here, we use a t-test, which is a common approach for comparing the mean
and distribution for two different sets of data.
t.test(creditstaken123, creditstakenNOT123)
which yields this output:
Welch Two Sample t-test
data: creditstaken123 and creditstakenNOT123
t = -3.2022, df = 78.666, p-value = 0.001969
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.8644710 -0.4350459
sample estimates:
mean of x mean of y
13.73913 14.88889
Here, too, we can tell R that we want to look directly at the p-value:
t.test(creditstaken123, creditstakenNOT123)[3]
Here, the very low p-value of 0.00197 tells us that we can reject the null
hypothesis that the means of the two populations are the same. Thus, we
conclude that there is evidence that the mean numbers of credits differ between these two groups of students (though this particular approach does not
permit us to make any claims about which mean is greater than the other).
9.5 Question: Is there evidence of a difference between ”before” and ”after” samples taken for a given set of individuals?
Many experiments have a “before” and “after” design, measuring pulse or
body temperature both before and after exercise, for example, and then asking
whether there is evidence that the before and after measures differ significantly
for individuals. In these cases, it’s most useful to ask about the statistical
significance of changes observed for individual patients, rather than for the
population as a whole.
Here, for example, we record students’ test scores before and the after a 3-hour
study session:
before<-c(90,80,80,80,80,83,83,71,90,90,87,77,74,72,90,80,80,
80,80,83,83,71,90,90,
87,77,74,72,90,80,80,80,80,83,83,71,90,90,87,77,74,72)
after<-c(98,98,98,98,60,87,87,85,82,81,85,86,89,
90,100,100,100,98,98,98,98,
60,87,87,85,82,81,85,86,89,90,100,100,100,89,89,82,82,90,90,92,100)
In this case, we must inform R prior to implementing the t-test that the two
sets of the samples are matched. Happily, we can do this by adding only a
tiny bit of code!
We write,
t.test(after, before, paired=TRUE)
Note that the ”paired=TRUE” component is telling R that the samples are
paired — that is, that the ”before” and ”after” data sets both contain data
for the same group of patients. Which yields this output:
Paired t-test
data: after and before
t = 4.9273, df = 41, p-value = 1.417e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.931844 11.782441
sample estimates:
mean of the differences
8.357143
40
As always, we can get just the p-value on its own by writing,
t.test(after, before, paired=TRUE)[3]
Which yields:
$p.value
[1] 1.416579e-05
That’s a pretty low p-value! Note that the full output also gives the ”mean
of the differences”, showing that individual students’ test scores increased by
an average of 8.36. Looks like most students benefited greatly from this study
session!
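The ”mean of the differences” in the output is just the ordinary mean of each student’s change in score, which you can verify directly; a sketch, reusing the before and after lists entered above:

```r
before <- c(90,80,80,80,80,83,83,71,90,90,87,77,74,72,90,80,80,
            80,80,83,83,71,90,90,
            87,77,74,72,90,80,80,80,80,83,83,71,90,90,87,77,74,72)
after <- c(98,98,98,98,60,87,87,85,82,81,85,86,89,
           90,100,100,100,98,98,98,98,
           60,87,87,85,82,81,85,86,89,90,100,100,100,89,89,82,82,90,90,92,100)

# Mean of the per-student changes, matching the paired t-test output
mean(after - before)   # about 8.357
```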
9.6 Implementing the χ² Test in R
The goal of the χ2 test is to assess whether the distribution of counts across
two or more groups is significantly different from the null expectation.
To do this, we compare each observed count to the expected count, and ask
about the probability of observing a total discrepancy this great or greater, by
chance alone. Larger total differences between observed and expected counts
lead to lower p-values, indicating lower probabilities that the discrepancies
arose by chance alone, suggesting that there is a significant difference between
the two groups.
9.7 A sample data set
Suppose we knew that 1/3 of Westfield students were from Hampden County,
1/3 were from Suffolk County, and 1/3 were from elsewhere. We then surveyed
198 students on a weekend afternoon to ask about the county where they live.
Our expected proportions are:

Hampden 33%    Suffolk 33%    Other 33%

And our expected counts for those 198 students surveyed would then be:

Hampden 66    Suffolk 66    Other 66

Let’s say our observed counts were as follows:

Hampden 48    Suffolk 80    Other 70
9.8 Doing the χ² test by hand

We could use the χ² test to ask whether the distribution of students into “Hampden”, ”Suffolk”, and ”Other” was significantly different from our expectation.
To do so, we would calculate the following for each category:

(O − E)²/E

and then take the sum of this quantity across the categories.
For the example above, that would be

County     Observed    Expected    (O − E)²/E
Hampden    48          66          4.91
Suffolk    80          66          2.97
Other      70          66          0.242
Summing the final column of this table, we would get 8.12! The test is said to
have two degrees of freedom (”d.f.”) because knowing the total number of
students surveyed (198) plus any two other values (number from Suffolk and
number from elsewhere, or number from Hampden and number from elsewhere,
or number from Suffolk and number from Hampden) would be sufficient to
complete the whole table.
To figure out the associated p-value, you have several options:
• do the entire test in Excel, as described below
• do the entire test in R, as described below
• use the sum calculated above (8.12) in conjunction with an online calculator, called ”P from chi2 ” that is available at
http://graphpad.com/quickcalcs/PValue1.cfm
9.9 To Perform a χ² test in Excel
The “chisq.test” function in Excel takes two arguments: first, the range of
cells that contain the observed values, and second, the range of cells that
contain the expected values.
To enter data in Excel, enter your observed and expected values into separate
blocks of cells.
Then, type ”=chisq.test(”.
Highlight your observed values, type a comma, highlight your expected values, and
press enter.
The value returned will be your p-value, indicating how probable your observed values are, given your expected values.
9.10 To Perform a χ² test in R
There are several ways to do a χ² test in R. For me, the easiest to remember
is ”by hand”, meaning that you enter the raw data, calculate the
(O − E)²/E
value described above, find the sum, and then ask how surprised you
should be about this result. Here’s how:
First, establish two lists: one with your expected values, and the other with
your observed values. For the data given above, that would be:
expected<-c(66,66,66)
observed<-c(48,80,70)
Then, find the sum of the values you calculated using (O − E)²/E:
chivalue=sum(((observed-expected)^2)/expected)
Finally, we want to ask how surprised we should be to observe this χ2 value.
The command below tells R to compare your results to the chi-square distribution. The ”chivalue” refers to the part you calculated above. The ”2” gives
the degrees of freedom — that is, the number of things we need to know (here,
counts for two of three counties) in order to be able to provide the complete
data set, assuming we already know the total number of students surveyed.
1-pchisq(chivalue,2)
And returns a p-value:
0.01723857
We interpret this to mean that only 1.7% of the time would the observed values
differ this much or more from the expected values by chance alone!
If we are using the conventional cut-off of 5%, we would note that our percentage
is even lower than the cutoff, and would conclude that these observations differ
significantly from the expectations.
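For completeness: R also has a built-in chisq.test() function that carries out the same calculation in one step. Here is a sketch (the p argument supplies the expected proportions):

```r
observed <- c(48, 80, 70)

# Expected proportions of 1/3 each; R multiplies them by the total (198)
chisq.test(observed, p = c(1/3, 1/3, 1/3))
# X-squared is about 8.12 on 2 degrees of freedom, p-value about 0.0172
```

This matches the ”by hand” result above, which is a good way to check your work.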
Chapter 10
Unit 10 R Companion: Confidence Intervals
10.1 Proportions: computing the confidence interval and the margin of error
Computing a confidence interval
The goal of computing a confidence interval on a proportion, using a sample,
is to determine the range of values that, with a given degree of certainty,
contains the true value of the proportion for that population.
For a data set that is binary — that is, it consists of only two possible answers,
like ”yes” or ”no”, ”left” or ”right”, etc. — you can compute a confidence
interval using the ”proportion test” or,
prop.test
(You’ll recall from earlier reading in our course that we used this very same
test to ask whether a sampled proportion was significantly different from an
expected proportion!)
Say, for example, you are interested in the proportion of students who take
Biology 0123 who are majoring in biology. There are several different sections
of Biology 123, so you decide to use our section as a sample with which to
estimate the proportion for all sections.
You interview the students in our section, and find that 10 of 22 students
are majoring in biology. Given this sampled proportion, what can we say
about the true fraction of all Biology 0123 students at Westfield State who
are majoring in biology?
Let’s assume, for the moment, that you want to compute the 95% confidence
interval on our estimate — that is, the range of values that, with 95% certainty, contains the true proportion for Biology 123 overall.
To compute this confidence interval, you would write:
biomajors<-c(10)
studentssurveyed<-c(22)
prop.test(biomajors,studentssurveyed)
where
R Code
biomajors<-c(10)
studentssurveyed<-c(22)
meaning
number of biology majors in our section
number of students in our section
The proportion of biology majors in our sample is then 10/22, which is 0.455.
As is often the case, R produces several lines of output — the confidence
interval is on one of these lines.
1-sample proportions test with continuity correction
data: biomajors out of studentssurveyed, null probability 0.5
X-squared = 0.0455, df = 1, p-value = 0.8312
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2507068 0.6732606
sample estimates:
p
0.4545455
You’ll note that the proportion of biology majors among the students sampled
is given by the last line — reassuringly, it’s equal to the value we calculated
above:
p
0.4545455
The line of interest for computing the confidence interval is this one:
95 percent confidence interval:
0.2507068 0.6732606
This tells us that our sample of 10 biology majors out of 22 Biology 0123 students
is, with 95% confidence, taken from a population where the true proportion
is between 0.25 and 0.67.
As you’ll recall from your reading, a 95% confidence interval is appropriate
when you want to be quite certain that the interval computed contains the true
answer. If you don’t specify the confidence level you want to use, R will
assume that you are looking for the 95% interval.
However, there are some cases in which you might be content to calculate an
interval that has a somewhat lower probability of containing the true answer
— computing 80% intervals, for example, is common in some fields.
To calculate the 80% confidence interval in R, you would write,
46
prop.test(biomajors,studentssurveyed, conf.level=.8)
Note that this yields an interval that is smaller than the 95% interval we
calculated above. Do you see why? Please let me know if you’d like to discuss
this further!
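If you’d like to see the narrowing directly, you can store each prop.test result and look at its conf.int component. Here is a sketch using our class-section numbers; storing the results in named variables is my own choice, not something R requires:

```r
# Sampled values from our section: 10 biology majors out of 22 students
biomajors <- 10
studentssurveyed <- 22

res95 <- prop.test(biomajors, studentssurveyed)                     # default: 95% interval
res80 <- prop.test(biomajors, studentssurveyed, conf.level = 0.80)  # narrower 80% interval

res95$conf.int   # 0.2507068 0.6732606
res80$conf.int   # a narrower interval around the same estimate
```

Note that prop.test accepts plain numbers as well as c(...) vectors.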
Computing the margin of error
Often, scientific papers contain sentences like this:
“Of all of the students taking Biology 0123, 45.5% ± 21.1% are biology majors.”
In this case, the ± statement indicates the margin of error. The margin of
error is half the width of the confidence interval, and can be computed by
finding the range spanned by the confidence interval,
0.6732606 − 0.2507068
which is
0.4225538
and then dividing by two, which is:
0.2112769
So, the margin of error on our estimate of the proportion is ±21.1%.
Note: Several times in this section, we have skipped back and forth between
thinking about proportions as decimals, and thinking about them as percentages. Either of these options is fine for thinking about proportions, so long
as you are clear about which you are using!
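The subtraction and halving above can also be done in one step, by extracting the interval from the prop.test result. A sketch; diff returns the width of the interval:

```r
res <- prop.test(10, 22)        # 10 biology majors out of 22 students surveyed
moe <- diff(res$conf.int) / 2   # half the width of the 95% confidence interval
moe                             # 0.2112769, i.e. a margin of error of about 21.1%
```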
10.2 Means: computing the confidence
interval and the margin of error
...using a complete data set
Say you have this list of squirrel counts taken on various plots in Westfield:
squirrels.per.acre<-c(50,10,16,6,12,13,14,15,16,19,20,21,25,25,70,91,30,
54,16,6,12,13,14,15, 16,19,20,21,25,
25,70,91,20,21,25,25,70,91,6)
Recall from before that you can use R to compute the mean of this list:
mean(squirrels.per.acre)
To compute the confidence interval on this estimate of the mean, we can use
the one-sample t-test, as we did in a previous unit. Here, we would enter,
t.test(squirrels.per.acre)
The 95% confidence interval is listed in the output:
95 percent confidence interval:
20.96036 36.88580
As above, we can use this interval to compute the margin of error:
(36.88580 − 20.96036)/2
which is equal to 7.96, so we write the interval as
28.92 ± 7.96.
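As with prop.test, you can store the t.test result and let R do the halving, rather than copying numbers off the screen. A sketch using the squirrel counts above:

```r
squirrels.per.acre <- c(50,10,16,6,12,13,14,15,16,19,20,21,25,25,70,91,30,
                        54,16,6,12,13,14,15,16,19,20,21,25,
                        25,70,91,20,21,25,25,70,91,6)

tt  <- t.test(squirrels.per.acre)   # one-sample t-test; output includes the 95% interval
moe <- diff(tt$conf.int) / 2        # margin of error: half the interval's width

mean(squirrels.per.acre)            # 28.92308
moe                                 # about 7.96
```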
...using summary statistics
On rare occasion, you will have summary information for a study, but no
access to the raw data. In such a case, you’ll want to use the known mean,
standard deviation and sample size to calculate the confidence interval.
To do this, you’ll have to install the “BSDA” package in R. Because R is an
open-source language, anyone who wants to can contribute new packages to
perform computations that are not part of the standard distribution. The
“BSDA” package is one such package. R as installed on the laptop you are
using likely does not include this package — but installing the package should
be quite easy! The two lines of code given below will tell R to download the
BSDA package from the internet, and install it on the computer that you are
using:
install.packages("BSDA", dependencies=T)
library(BSDA)
Note that R installs a package only once, but it does not keep packages loaded
from one session to the next: be sure to enter library(BSDA) each time you
start a new R session and want to calculate a confidence interval from summary
statistics.
After the BSDA package is installed, we can use its tsum.test function to
perform calculations using summary statistics. For example, imagine that
for the squirrels.per.acre data set given above, we already know the following
(but don’t have the original data set!):
mean: 28.92308
standard deviation: 24.56397
sample size: 39
Imagine we want to calculate the 80% confidence interval. To do so, we input
the values above into the “tsum.test” function, whose name comes from its
ability to calculate a t interval using summary statistics:
tsum.test(mean.x=28.92308, s.x=24.56397, n.x=39, conf.level=.80)
This yields a confidence interval of
23.79304 34.05312
And an associated margin of error of
(34.05312 − 23.79304)/2 = 5.13.
So, we can say, with 80% confidence, that the true mean number of squirrels
across one-acre plots in Westfield is
28.92 ± 5.13.
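If you’d like to check tsum.test’s arithmetic (or the BSDA package won’t install), the same 80% interval can be computed from the t distribution in base R. This is a sketch; qt supplies the t critical value:

```r
m <- 28.92308   # sample mean
s <- 24.56397   # sample standard deviation
n <- 39         # sample size

tstar  <- qt(0.90, df = n - 1)   # for an 80% interval, 10% sits in each tail
margin <- tstar * s / sqrt(n)    # margin of error

c(m - margin, m + margin)        # about 23.79 to 34.05, matching tsum.test
```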
Chapter 11
Unit 11 R Companion:
Experimental Design
Your Coursebook contains all of the material required for this section.
Appendix A
Appendix A: Other hints that
may be useful as you learn R
Evaluating functions at specific input-variable values
Once you have coded a function in R, it’s quite easy to calculate the value of
the function given a specific value for the input variable. Say, for example,
you wanted to evaluate your function z(p) for p = 10. If you were to type
z(10)
R would return
[1] 16
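The function z here comes from earlier in the manual; as a self-contained illustration, here is one hypothetical definition (my invention, not the book’s) that would produce exactly that output:

```r
# A hypothetical function chosen so that z(10) returns 16;
# any function you have defined can be evaluated the same way
z <- function(p) {
  p + 6
}

z(10)   # returns [1] 16
```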
A.1
Using R as a Calculator
R can perform all of the arithmetic you might otherwise perform on a typical
calculator.
If you type
(3+23)/4
and then press enter on your computer keyboard, R will return
[1] 6.5
(For the moment, you can ignore the “ [1]” that appears at the beginning of
each line of R output.)
If you want to do multiple calculations on a single line, you can use a
semicolon to separate the various commands. For example,
(3+23)/4; 2+5
returns two separate answers on two separate lines:
[1] 6.5
[1] 7
If you want to calculate the product of two numbers — say, for example, 17
and 20 — you’d type
17*20
In R, the asterisk (*) is used as a multiplication sign.
A.2
File Types in R
• There are three main types of windows in R:
i. Console where calculations are performed
ii. Text file where code can be written and stored
iii. Quartz window where graphical output is displayed.
• Sometimes, after a new plot is called, the Quartz file containing the plot
doesn’t immediately appear as the front window on your screen. To
bring that window to the front, click on the “Window” menu and select
the “Quartz” option. If the x-axis is partially obscured on the graph
displayed, try maximizing the Quartz window.
A.3
Saving Files in R
• There are two primary types of files that you’ll want to save in R:
i. Text files containing code: be sure to save these so that you can
adapt and use your code later!
ii. Quartz files that contain plots: you can save these as .pdf files for
later retrieval, printing, incorporation into papers, etc.
A.4
Typing Shortcuts in R
As you become more accustomed to R, you’ll develop your own methods for
saving time and accomplishing tasks efficiently. Here are a few to start with:
• Reenter code from earlier in your console: press the up-arrow key.
This will enable you to cycle through earlier lines until you reach the
code of interest.
• Use text recognition to save typing time: If you want to type a
new line that contains a term you’ve used at least once already, type the
beginning of the word, and then press the tab key. This will yield a
drop-down list; scroll through it to find the term you’d begun to type!
A.5 Importing data into R, and exporting
plots from R
At times, you may record raw data in one application (Excel, for example),
and want later to manipulate and analyze those data in R. You could reenter
your data directly into R. However, each time that you transcribe data, there
are new opportunities for typos – minimizing such opportunities is a useful
goal. Fortunately, R can read and import files of various different formats.
Getting R to read data from an Excel file.
Say you have the following Excel file,
and would like to plot and analyze these data in R.
To import this Excel file into R, use the following steps:
i. Save your Excel file as a “.csv” file, where “csv” stands for “comma-separated
values”. Be sure to save your file to a folder where you can
find it by name! For the purpose of this example, let’s assume that we’ve
named our file “runningtimes.csv”, and that we’ve saved it to the folder
R Addenda.
ii. Proceed to R, and change your “Working Directory” to be the folder
where you saved this file. You can change the working directory by
going to Misc -> Change Working Directory, and then clicking on the
folder that contains your .csv file.
iii. Next, from within R, read in the contents of your file. To do so, type:
p<-read.table("runningtimes.csv",sep=’,’,
header = TRUE)
iv. A couple of things to note in the above code: the “sep” argument makes
it possible to tell R that commas are used to separate our data points,
and the header setting tells R that we have included names for the various
columns in our data file (“Age”, etc.)
v. Your data should now be ready for manipulation in R! To test this, you
could type the name of your data set (here, we’ve set it to be “p”), and
examine the data — they should have the same essential structure as in
the Excel file.
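To rehearse the whole round trip without leaving R, you can write a small made-up data set to a .csv file and read it back with the same read.table call. The column names and values here are invented for illustration:

```r
# Hypothetical running-time data, standing in for an Excel export
runningtimes <- data.frame(Age = c(20, 30, 40),
                           Minutes = c(25, 28, 33))

csvfile <- tempfile(fileext = ".csv")   # a throwaway file location
write.csv(runningtimes, csvfile, row.names = FALSE)

# The same call used in step iii above, pointed at our temporary file
p <- read.table(csvfile, sep = ',', header = TRUE)
p   # should show the same three rows and two columns
```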
Appendix B
Appendix B: Exponential
Regression in R
B.1
Strategy
The goal of exponential regression is to find a function of the form
y = b · e^(m·x)
that is a good fit to a particular data set.
In R, exponential regression can be achieved through “exponential transformation”.
We find the natural log of the output (y) variable, and fit a line to
the transformed data. The value for m in the exponential function is the slope
of the inferred line. The value for b in the exponential function is calculated
as e raised to the intercept of the inferred line.
Here are the specific steps:
i. Find the natural log of the output variable, y.
ii. Use linear regression to find the line that best relates your input variable,
x to the log of your output variable — that is, loge (y).
iii. To find ”m” for the exponential function, use the slope inferred for the
best-fit line.
iv. To find ”b” for the exponential function, raise e to the intercept inferred
for the best-fit line.
v. Write your exponential function!
Here’s an example:
B.2
Enter your data
Imagine you record hourly counts of the number of bacterial cells present in
200 mL of a culture. The data can be written as follows:
hour<-c(1,2,3,4,5,6)
cells<-c( 1, 20, 30, 50, 80,200)
We can plot the data:
plot(hour,cells)
Yielding a graph:
[Scatterplot: cells (0 to 200) on the y-axis versus hour (1 to 6) on the x-axis]
B.3 Find the natural log of your output
variable
Here, our output variable is “cells”, so we write
natlogcells<-log(cells)
For the natural logs of these y values, I get
[1] 0.000000 2.995732 3.401197 3.912023 4.382027 5.298317
B.4 Use linear regression to find the line
that best relates your input variable, x to the
log of your output variable — that is, loge(y).
We’ve done this lots of times before! Enter
summary(lm(natlogcells~hour))
I get
Call:
lm(formula = natlogcells ~ hour)

Residuals:
      1       2       3       4       5       6
-1.1057  0.9997  0.5148  0.1353 -0.2850 -0.2590

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2154     0.7583   0.284   0.7904
hour          0.8903     0.1947   4.573   0.0102 *
---
Residual standard error: 0.8145 on 4 degrees of freedom
Multiple R-squared: 0.8394, Adjusted R-squared: 0.7993
F-statistic: 20.91 on 1 and 4 DF, p-value: 0.01024
B.5 To find ”m” for the exponential
function, use the slope inferred for the best-fit
line.
For the example above, I get
m = 0.8903.
B.6 To find ”b” for the exponential function,
raise e to the intercept inferred for the best-fit
line.
Here, the intercept for the line is inferred to be 0.2154, so I calculate b as
b = exp(0.2154) = 1.24.
B.7
Write your exponential function!
Remember, your function should have the form: y = b · e^(m·x)
So, we write:
cells = 1.24 · e^(0.8903·hour)
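Rather than reading the slope and intercept off the screen and retyping them, you can pull them out of the fitted model with coef. A sketch using the same data:

```r
hour  <- c(1, 2, 3, 4, 5, 6)
cells <- c(1, 20, 30, 50, 80, 200)

fit <- lm(log(cells) ~ hour)         # regress log(y) on x
m <- coef(fit)["hour"]               # slope of the fitted line
b <- exp(coef(fit)["(Intercept)"])   # e raised to the intercept

m   # about 0.8903
b   # about 1.24
```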
B.8 How well does your inferred function fit
your data?
Try plotting your raw data and inferred function on the same plot:
inferredfit<-function(x) {1.24*exp(.8903*x)}
plot(hour,cells, main="A pretty good fit!");plot(inferredfit,1,6,add=T)
[Plot titled “A pretty good fit!”: the raw cells-versus-hour points with the inferred curve overlaid]
It’s a pretty good fit!
Appendix C
Appendix C: Implementing the
χ2 Test by Hand, in Excel, and
in R
The goal of the χ2 test is to assess whether the distribution of counts across
two or more groups is significantly different from the null expectation.
To do this, we compare each observed count to the expected count, and ask
about the probability of observing a total discrepancy this great or greater, by
chance alone. Larger total differences between observed and expected counts
lead to lower p-values, indicating lower probabilities that the discrepancies
arose by chance alone, and suggesting that the observed counts differ
significantly from the expected counts.
C.1
A sample data set
Suppose we knew that 1/3 of Westfield students were from Hampden County,
1/3 were from Suffolk County, and 1/3 were from elsewhere. We then surveyed
198 students on a weekend afternoon to ask about the county where they live.
Our expected proportions are:

Hampden: 33%   Suffolk: 33%   Other: 33%

And our expected counts for those 198 students surveyed would then be:

Hampden: 66   Suffolk: 66   Other: 66

Let’s say our observed counts were as follows:

Hampden: 48   Suffolk: 70   Other: 80
C.2
Doing the χ2 test by hand
We could use the χ2 test to ask whether the distribution of students into
“Hampden”, “Suffolk”, and “Other” was significantly different from our expectation.
To do so, we would calculate the following for each category:
(O − E)²/E
and then take the sum of this quantity across the categories.
For the example above, that would be

            Hampden   Suffolk   Other
Observed    48        70        80
Expected    66        66        66
(O − E)²/E  4.91      0.242     2.97
Summing the bottom line of this table, we would get 8.12! The test is said to
have two degrees of freedom (“d.f.”) because knowing the total number of
students surveyed (198) plus any two other values (number from Suffolk and
number from elsewhere, or number from Hampden and number from elsewhere,
or number from Suffolk and number from Hampden) would be sufficient to
complete the whole table.
To figure out the associated p-value, you have several options:
• do the entire test in Excel, as described below
• do the entire test in R, as described below
• use the sum calculated above (8.12) in conjunction with an online calculator, called “P from chi2”, that is available at http://graphpad.com/quickcalcs/PValue1.c
C.3
To Perform a χ2 test in Excel
The “chisq.test” function in Excel takes two arguments: first, the range of
cells that contain the observed values, and second, the range of cells that
contain the expected values.
To enter data in Excel, enter your observed and expected values into separate
blocks of cells.
Then, type “=chisq.test(”,
highlight your observed values, press comma, highlight your expected values,
and press enter.
The value returned will be your p-value, indicating how probable your observed values are, given your expected values.
C.4
To Perform a χ2 test in R
There are several ways to do a χ2 test in R. For me, the easiest to remember
is “by hand”, meaning that you enter the raw data, and then calculate the
(O − E)²/E value described above, find the sum, and then ask how surprised
you should be about this result. Here’s how:
First, establish two lists: one with your expected values, and the other with
your observed values. For the data given above, that would be:
expected<-c(66,66,66)
observed<-c(48,70,80)
Then, find the sum of the (O − E)²/E values you calculated:
chivalue=sum(((observed-expected)^2)/expected)
Finally, we want to ask how surprised we should be to observe this χ2 value.
The command below tells R to compare your results to the chi-square distribution.
The “chivalue” refers to the sum you calculated above. The “2” gives
the degrees of freedom — that is, the number of things we need to know (here,
counts for two of three counties) in order to be able to provide the complete
data set, assuming we already know the total number of students surveyed.
1-pchisq(chivalue,2)
And returns a p-value —
0.01723857
We interpret this to mean that only 1.7% of the time would the observed values
differ this much or more from the expected values by chance alone!
If we are using the conventional cut-off of 5%, we would note that our percentage
is even lower than the cutoff, and would conclude that these observations differ
significantly from the expectations.
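R also bundles all of these steps into the built-in chisq.test function. Here is a sketch with the county data above; the p argument gives the expected proportions:

```r
observed <- c(48, 70, 80)   # Hampden, Suffolk, Other
result <- chisq.test(observed, p = c(1/3, 1/3, 1/3))
result   # X-squared = 8.1212, df = 2, p-value = 0.01724
```

This matches the by-hand result, so either route is fine.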