An R Companion to Mathematical Methods in the Natural Sciences (Vorwerk and Vorwerk)

Diane P. Genereux ([email protected])
Stephen V. O'Brien, editor ([email protected])

Westfield State University, Westfield, MA

Contents

Unit 0 R Companion: Algebra Review

Unit 1 R Companion: Using Technology
  1.1 What is R?
  1.2 Some thoughts on learning R
  1.3 How to credit R in published work
  1.4 Plots in R
  1.5 Installing R
  1.6 Graphing Data in R
  1.7 Troubleshooting Data Entered in R
  1.8 Entering functions in R
  1.9 Plotting functions in R
  1.10 Troubleshooting functions in R

Unit 2 R Companion: Scientific notation

Unit 3 R Companion: Solving Linear Functions
  3.1 Using R to Solve Linear Equations Graphically
  3.2 Using R to identify the point of intersection for two linear equations

Unit 4 R Companion: Linear Regression
  4.1 Plotting an inferred function
  4.2 Interpolation and Extrapolation

Unit 5 R Companion: Quadratic Functions
  5.1 Solving quadratic equations graphically in R

Unit 6 R Companion: Behavior of a Function
  6.1 Plotting some special functions
  6.2 Scaling your plot window to examine the behavior of a function
  6.3 Some reminders about using graphs to assess the behavior of functions

Unit 7 R Companion: Function Library and Non-Linear Regression
  7.1 Nonlinear Regression
  7.2 Homework for R Users
  7.3 Answers to Homework for R Users

Unit 8 R Companion: Descriptive Statistics

Unit 9 R Companion: Hypothesis Testing
  9.1 Question: You have taken a sample and calculated a proportion. Is the true proportion in the population from which your sample is drawn different from 50%?
  9.2 Question: Do the proportions calculated for samples from two populations indicate that the two populations differ in their true proportions?
  9.3 Question: Given an estimated mean of a sample, what can we say about the mean for the population from which it was drawn?
  9.4 Question: Do these samples come from populations with different means?
  9.5 Question: Is there evidence of a difference between "before" and "after" samples taken for a given set of individuals?
  9.6 Implementing the χ² Test in R
  9.7 A sample data set
  9.8 Doing the χ² test by hand
  9.9 To Perform a χ² test in Excel
  9.10 To Perform a χ² test in R

Unit 10 R Companion: Confidence Intervals
  10.1 Proportions: computing the confidence interval and the margin of error
  10.2 Means: computing the confidence interval and the margin of error
Unit 11 R Companion: Experimental Design

Appendix A: Other hints that may be useful as you learn R
  A.1 Using R as a Calculator
  A.2 File Types in R
  A.3 Saving Files in R
  A.4 Typing Shortcuts in R
  A.5 Importing data into R, and exporting plots from R

Appendix B: Exponential Regression in R
  B.1 Strategy
  B.2 Enter your data
  B.3 Find the natural log of your output variable
  B.4 Use linear regression to find the line that best relates your input variable, x, to the log of your output variable — that is, log_e(y)
  B.5 To find "m" for the exponential function, use the slope inferred for the best-fit line
  B.6 To find "b" for the exponential function, raise e to the intercept inferred for the best-fit line
  B.7 Write your exponential function!
  B.8 How well does your inferred function fit your data?

Appendix C: Implementing the χ² Test by Hand, in Excel, and in R
  C.1 A sample data set
  C.2 Doing the χ² test by hand
  C.3 To Perform a χ² test in Excel
  C.4 To Perform a χ² test in R
Preface

Welcome to the first edition of our R Companion to the Biology 123 Coursebook by Vorwerk and Vorwerk. We hope that this Companion will be useful to you in your studies of mathematical methods, and of the R statistical computing language. We are grateful for your input on revisions that would make this Companion more useful for future readers.

Diane P. Genereux and Stephen V. O'Brien, editor
Westfield, MA
January 2013

Unit 0 R Companion: Algebra Review

Your Coursebook contains all of the material required for this section.

Chapter 1
Unit 1 R Companion: Using Technology

1.1 What is R?

R is a statistical computing language. It works both as a scientific calculator and as a computing environment with the capacity to perform a wide range of analyses and to make complex plots and other graphics. R is currently the programming environment of choice for many biologists and statisticians.

1.2 Some thoughts on learning R

Learning R is a little challenging and (I think) a lot of fun. With the basic programming skills you'll learn and apply in our class, you will be prepared to use R in your other coursework. In past semesters, for example, students who worked with R in Biology 0123 applied their R skills to complete projects in Introductory Biology and Genetics. Strategies for coding in R are widely documented in online discussions. Should you want to read about how to approach a new problem in R, a quick Google search is likely to yield several useful strategies. Working with R in our course will be an excellent way to become comfortable with computer programming in general, and will be good preparation for learning a wide variety of programming languages in the future. The goal of this R Companion is to introduce you to some basic programming approaches that will be useful in completing the problems in your Coursebook by Vorwerk and Vorwerk.
1.3 How to credit R in published work

R was written by Ross Ihaka and Robert Gentleman. Since their initial development of this language, thousands of people have contributed software packages that extend R's capabilities to specific, new forms of data analysis. Should you decide to publish a paper in which you use R, it will be important to give credit to the team of people who developed it. The preferred citation is:

    R Development Core Team (2012). R: A Language and Environment for
    Statistical Computing. The R Foundation for Statistical Computing,
    Vienna, Austria.

1.4 Plots in R

Using R, you can plot data sets large and small, and then tune those plots in almost infinitely many ways. Here is a plot I made using R (D.P. Genereux 2009, PLoS Genetics: e1000509). The plot examines how Methylation Density (y-axis), a modification that helps to determine whether individual genes are on or off in individual cells, depends on cell Division Number (x-axis).

[Figure: reproduction of Figure 2 from Genereux (2009), PLoS Genetics 5(6): e1000509 — "Trajectories of methylation densities under asymmetric or symmetric strand segregation, with high initial methylation density," showing oldest-parent-strand and population-mean methylation densities plotted against division number.]

As you can see, it is possible to plot data points and functions in R, as well as to add numerous informative labels. If you want to examine a wide variety of graphs made using R — some of them both information-rich and aesthetically impressive — take a look at http://gallery.r-enthusiasts.com

1.5 Installing R

Because R is an open-source program, you can download it and install it without charge. It works equally well on Macs and PCs.

Installing R on a Mac

1. Check to see whether R is installed on your computer already. On a Mac, you can use the Spotlight function (click on the magnifying glass in the upper-right corner). Just type R in the search box. If the list that appears contains the R icon, click on it. R is already installed!

2. If the R icon does not appear on the list, you'll need to download and install R. To do so:
   a) Go to http://cran.r-project.org/bin/macosx/
   b) Click on R-2.15.1-signed.pkg
   c) Install the downloaded file.
   d) The R program should now be available in your Applications folder.
   e) Need help? Please ask!

Installing R on a PC

Check to see whether R is installed on your computer already. Use the "Search" option available at the bottom left of your screen to look for R. If it's already installed, you can skip the rest of this section. If R is not installed, you'll need to download it and install it. To do so:

1. Go to http://cran.r-project.org/bin/windows/base/
2. Click on "Download R 2.15.2 for Windows"
3. This will download a .exe file ("R-2.15-win.exe")
4. Follow the prompts asking for information about where to install R. In most cases, the default settings should be fine.
5.
When the installation is complete, a shortcut-to-R icon should appear on your desktop.
6. Need help? Please ask!

1.6 Graphing Data in R

Entering data in R

Many of the data sets you will collect in the natural sciences will consist of lists of paired numbers. For example, you might gather information on the height of a plant at different time points following its germination. For such a data set, "time" would be the independent (input) variable, and "height" would be the dependent (output) variable. Data of this form could be displayed as follows:

    Time (days):  1  2  3    4  5
    Height (cm):  2  3  4.6  5  5.1

To enter such data in R, you would create two separate lists: one for "time", the other for "height". To create these two separate lists in R, you would type

    timevalues<-c(1,2,3,4,5)

and

    heightvalues<-c(2,3,4.6,5,5.1)

These lists introduce some important conventions of R code. A list must:

• Be given a name. Here, the names of the two lists are timevalues and heightvalues. In R, you can name a list whatever you like, so long as the name contains no spaces. It is conventional that list names not contain any capital letters or punctuation.

• Be indicated using the symbol <- . This can be typed as a less-than sign (<) followed by a hyphen (-). Together, these symbols create a left-pointing arrow, and tell R that you want to assign the variable name you've chosen to refer to a list containing the data points you're about to enter.

• Have data entered in the form c(, , , ...). Here, the "c" stands for "concatenate" (which means "link together"). Parentheses are then used to surround a list of data points that are separated from one another by commas.

After you have established lists according to these conventions, you may choose to display your data in a table. Below, for example, I've chosen to display these data as a data frame called "mydata". Here, too, you're welcome to give your data any name you like, provided that the name does not contain spaces.
This code:

    mydata<-data.frame(timevalues,heightvalues)

would display your data as

      timevalues heightvalues
    1          1          2.0
    2          2          3.0
    3          3          4.6
    4          4          5.0
    5          5          5.1

1.7 Troubleshooting Data Entered in R

Often, you'll enter a set of (x, y) coordinates, where each x value corresponds to a specific y value — as in our time/plant-height example above. It's crucial to proofread your typed lists in R to ensure that no values have been omitted! R will not be able to make an (x, y) plot using two lists of different length. If you do try to create a table or plot using lists of different length, R will return this error message:

    Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ

Missing data are, of course, a reality of experimental science. Weather and illness are just two of the many factors that can impact your ability to collect data at the times originally planned. Fortunately, there is a way to enter and display a data set for which one or more data points are missing! Say, for example, you had been collecting height data for the plant from the example above, but weren't able to collect data on day three. Your incomplete data set would look like this:

    Time (days):  1  2  3  4  5
    Height (cm):  2  3     5  5.1

You could enter these data in R as follows:

    timevalues<-c(1,2,3,4,5)
    heightvalues<-c(2,3,NA,5,5.1)

Here "NA" (which stands for "Not Available") is used to indicate that a data point is missing. You'd then update your data frame to use the revised data

    mydata<-data.frame(timevalues,heightvalues)

which would produce the following:

      timevalues heightvalues
    1          1          2.0
    2          2          3.0
    3          3           NA
    4          4          5.0
    5          5          5.1

Plotting data that you have entered in R

Here are two different ways to plot (x, y) data in R:

1. Use your x and y lists separately:

    plot(timevalues,heightvalues)

2.
Use your data frame directly:

    plot(mydata)

Each of these options produces the following plot:

[Figure: scatterplot of heightvalues (y-axis, 2.0 to 5.0) against timevalues (x-axis, 1 to 5).]

Note the "gap" left in the plot at time 3, indicating the time point for which no data were collected.

1.8 Entering functions in R

To code the function

    z(p) = p + 6                                                    (1.1)

in R, you would write the following:

    z<-function(p) {p+6}

A few things to note:

• z is the name of the function.

• The <- symbol once again tells R that we want the name (here, z) to take a particular value.

• function tells R that you're coding a function (rather than, say, just doing a calculation).

• (p) — the parentheses are required — tells R that the function z will have p as its input variable.

• p+6 gives the function itself. For this part, you need to use special "curly" braces, which look like this: {}. You can enter curly braces by holding Shift and using the keys just to the right of "p" on your computer keyboard.

1.9 Plotting functions in R

Once you have coded a function in R, you can use "plot" to graph the function over a given range of input variables. For example, if we wanted to plot our function z, we would type

    plot(z)

yielding this plot:

[Figure: plot of z(p) over the default range of input values from 0 to 1, with z rising from 6.0 to 7.0.]

You'll note that this only plots using input values from 0 to 1! If we want to explore the function over a different range of input values — say, from -10 to 10 — we would type:

    plot(z, xlim=c(-10,10))

In this code, "xlim" is short for "the limits of the range of x values". The "c(-10,10)" refers to the range of input values over which we wish to examine our function.
[Figure: plot of z(p) over a broader range of input values from -10 to 10.]

Note that the graph for this broader range looks much the same as the default — that's because this is a linear function, with a rate of change that is constant for all values of the input variable! As we'll see later in the course, however, some non-linear functions look very different when examined over narrow as compared to broad ranges of parameter values. Thus, it's always a good idea to examine functions over a broad range of input values.

1.10 Troubleshooting functions in R

On occasion, R will return an error message alerting you that a function cannot be evaluated because it contains typos or other errors. Here are some problems to check for should you find that your function will not "work":

• Error in parentheses. Be sure that conventional parentheses surround your input variable, and curly braces surround your functional expression.

• Missing arrow. Be sure that your <- is typed properly.

• Problems with the input variable. R may issue an error message if your input variable has previously been stored as a list of data. If you are using as an input variable a name — for example, p — that you've previously used for a data list, it's a good idea to clear the name of your variable. To do so, type rm(p), where "rm" stands for "remove", and the name of your variable (here, "p") is enclosed in parentheses.

Chapter 2
Unit 2 R Companion: Scientific notation

R is able to convert from scientific notation to standard notation, and vice versa. For example, type the following number

    345698543985743985734

and then press Enter. R will return

    [1] 3.456985e+20

There are several things to note here:

• You can disregard the "[1]" at the beginning of the output line. This is, in essence, a reminder that this is a value returned by R, rather than one that you entered.
• The "e" can be read as "times 10 raised to the power of" — that is, the "e" indicates that the output provided is given in scientific notation; the number that follows gives the power to which 10 should be raised.

• By default, R provides scientific notation with six digits following the decimal point. If you would like to have only three digits total — one before the decimal point, and two after — you can enter

    options(digits=3)

Note that this will perform the appropriate rounding, and will return

    [1] 3.46e+20

Each time you want to shift the level of precision, simply reenter the options command, being sure to indicate the appropriate number of digits.

Be sure you can use scientific notation both with and without R! While it is often useful to ask R to do these conversions for you, you should also be sure that you are able to move from standard to scientific notation, and back. You should aim to be able to convert quickly between standard and scientific notation, even in the absence of a computer. Please see your main Coursebook for more on converting to and from scientific notation without R or other technologies.

Chapter 3
Unit 3 R Companion: Solving Linear Functions

In Unit 1, you learned how to code and plot functions in R. You may find it useful to review that section before proceeding.

3.1 Using R to Solve Linear Equations Graphically

The goal of solving an equation graphically is to identify the value(s) of the input variable for which the left and right sides of the equation are equal. This is the same as saying that our goal is to identify values of the input variable for which there is no difference between the values of the expressions on the left and right sides of the equation. More specifically, in R, our strategy for solving linear equations is to identify values of the input variable for which the difference between the left and right sides of the equation is zero. Your solution will take the form of one or more values for the input variable, x.
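Before working through a full graphical example, the "difference between sides" idea can be sketched numerically. This sketch uses a hypothetical equation, 3x + 1 = 7, rather than one from your Coursebook:

```r
# A minimal sketch: code each side of the hypothetical equation
# 3x + 1 = 7 as a function, then look for where their difference is zero.
left  <- function(x) {3*x + 1}   # left-hand side
right <- function(x) {7 + 0*x}   # right-hand side (a constant)
gap   <- function(x) {left(x) - right(x)}

gap(1)   # returns -3: the sides differ, so x = 1 is not a solution
gap(2)   # returns 0: the sides are equal, so x = 2 solves the equation
```

The function name gap is just an illustration; the sections below use the name difference for the same idea.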
Solving an equation with one solution

To find the solution for this equation:

    4(x − 2) = 2x − 5                                               (3.1)

you would use the following steps:

1. Code each side of the equation as a separate function in R. For this equation, that would be:

    y1<-function(x) {4*(x-2)}
    y2<-function(x) {2*x-5}

2. Plot both functions on the same set of axes. For clarity, we will graph y1 in blue, and y2 in red. R knows the names of lots of colors. Try altering your code to make one of these lines green, yellow, or purple!

    plot(y1, -5, 5, col='blue')
    plot(y2, -5, 5, col='red', add=TRUE)

N.B. In this code, add=TRUE tells R that you want to plot the two functions together using one set of axes, rather than on separate plots.

[Figure: plots of y1 (blue) and y2 (red) for x from -5 to 5, with the intersection marked.]

From the plot, we see that these two lines cross! As noted above, the x value at which the two lines cross is called the solution — the x value for which the two expressions are equal. Using your plot, estimate the value of x at the point where the two lines cross. This is an estimate of the solution to your equation.

Solving an equation with no solution

When two functions yield parallel lines, the lines never intersect, so we would say that there is no solution. In this case, your "solution" would be "no solution".

Solving an equation with infinitely many solutions

When two functions yield a single line, the values of the two functions are identical for all values of x, so we would say that there are infinitely many solutions. In this case, your "solution" would be "infinitely many solutions".

3.2 Using R to identify the point of intersection for two linear equations

Sometimes, just looking at a graph is not good enough — you need more precise information on the x value at which two plots intersect! R can help with this, too. As noted above, the point(s) at which the two plots intersect give an x value or set of x values for which the two functions yield the same answer.
Another way of saying this: the solution represented on a graph marks the input-variable value for which the difference between the two functions is zero! Therefore, we approach this problem by asking R to tell us where the difference between these two functions is zero. First, we define a new function that will calculate the difference between y1 and y2. To make things easier for ourselves, let's call that function difference:

    difference<-function(x) {y1(x)-y2(x)}

Next, we tell R that we want to find the place where the function difference is exactly equal to zero. To do so, we type

    uniroot(difference, c(-100, 100))[1]

The resulting number is the x value that is the solution to the equation! In the code above, the -100 and 100 give the lower and upper bounds, respectively, of the range over which we are asking R to look for solutions. You can change these values to examine a smaller or larger range of values.

The meaning of "uniroot"

Wondering about the meaning of the uniroot function above? Basically, it identifies the value of the input variable (here, x) for which the difference function we've entered has a value of zero. In essence, we're using this code to ask for the place where the difference between the two functions is zero — that is, where they have the same value! The returned value of x is the solution for our system of equations.

Chapter 4
Unit 4 R Companion: Linear Regression

Finding the best-fit line and its equation

To infer a best-fit line in R, first enter lists of data corresponding to your x and y variables:

    timevalues<-c(1,2,3,4,5)
    heightvalues<-c(2,3,4.6,5,5.1)

It's good practice to make sure that the two lists are of equal length — if they're not, R will not be able to infer a best-fit line. If there are values missing from your data — that is, you have an x value without a corresponding y value, or vice versa — you may wish to use NA (see Unit 1 for more on how to do this). For short lists, manual proofreading is often sufficient.
To confirm that longer lists are of appropriate length, the length function is helpful:

    length(timevalues)
    length(heightvalues)

If the returned lengths are equal, you're ready to proceed to linear regression! If not, you'll want to examine your data for errors. You can always display your data as entered by typing a variable name and then pressing "Enter". Recall that you can make an (x, y) plot by typing

    plot(timevalues, heightvalues)

where "timevalues", the input/independent variable, is given before "heightvalues", the output/dependent variable. Often, an (x, y) plot will provide an initial sense of whether there is a linear relationship between your two variables. To gather quantitative information on whether or not there is evidence of a linear relationship of "height" to "time", you would use the following code:

    summary(lm(formula = heightvalues ~ timevalues))

Here, "summary" tells R that you want it to summarize its findings (a summary should be sufficient for our purposes here). The most important part of the code is this:

    lm(formula = heightvalues ~ timevalues)

Here, "lm" stands for "linear model" — which makes sense, because we are doing linear regression! Technically, this means that we are asking R to find the "linear model" that will best fit the present data set. This part:

    formula = heightvalues ~ timevalues

tells R that we want to find a formula for height as a function of time. Two things are especially important to note:

1. In linear regression, the y variable (here, "heightvalues") is given before the x variable (here, "timevalues"). It may be useful to think of this form as asking, "what function would allow us to estimate height as a function of time?"

2. The symbol between the y and x variables is a tilde (~), and can be typed by pressing the key just above the tab key while holding down the shift key.
After you enter the above code into R, you will see the following output:

    Call:
    lm(formula = heightvalues ~ timevalues)

    Residuals:
        1     2     3     4     5
    -0.30 -0.12  0.66  0.24 -0.48

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.4800     0.5510   2.686   0.0747 .
    timevalues    0.8200     0.1661   4.936   0.0159 *
    ---
    Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

    Residual standard error: 0.5254 on 3 degrees of freedom
    Multiple R-squared: 0.8904,    Adjusted R-squared: 0.8538
    F-statistic: 24.36 on 1 and 3 DF,  p-value: 0.01595

R returns a great deal of information here! If you take a statistics course later in your studies, you'll learn much more about linear regression, and the meanings of these various terms. For the moment, only two components are essential:

1. The formula you entered!

    Call:
    lm(formula = heightvalues ~ timevalues)

This should look familiar. It's just the relationship for which we asked R to infer a line. Once again, be sure that you've entered the dependent (y) variable before the independent (x) variable.

2. The intercept and slope for the line that R inferred to fit these data.

    Coefficients:
                Estimate
    (Intercept)   1.4800
    timevalues    0.8200

Here, the y intercept is estimated to be 1.4800 (top line). The slope is estimated to be 0.8200 (bottom line). Now that we have these two values, we can write the equation for the line that R has inferred to be the best fit to these data.

A very important note on writing functions: You'll see that the function below uses "h" and "t" (not "heightvalues" and "timevalues") to refer to the two variables of interest. That's because we want R to use "heightvalues" and "timevalues" to refer to our lists of data, and "h" and "t" to refer to the variables in our function. I always use names that end with the word "values" to name data sets, and single letters to name variables in functions. I recommend that you follow this convention when writing functions of your own, to avoid confusion between data lists and variables!
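If you would rather not copy the slope and intercept off the printed summary by hand, the coef function (part of base R) can report them directly. A minimal sketch, assuming the same timevalues and heightvalues lists entered above:

```r
# Fit the same linear model, then pull out the coefficients directly.
timevalues   <- c(1, 2, 3, 4, 5)
heightvalues <- c(2, 3, 4.6, 5, 5.1)

fit <- lm(heightvalues ~ timevalues)
coef(fit)                         # intercept 1.48 and slope 0.82, matching the summary

b <- coef(fit)[["(Intercept)"]]   # the y intercept
m <- coef(fit)[["timevalues"]]    # the slope
```

Storing the slope and intercept in named variables like m and b makes it harder to mistype them when you write out the fitted function.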
The formula for our line of interest is:

h = 0.8200 * t + 1.4800

4.1 Plotting an inferred function

Once you have inferred the slope and intercept for your best-fit line, you can code the corresponding function in y = mx + b form! For the line inferred above, that function would be:

h<-function(t) {0.8200 * t + 1.4800}

4.2 Interpolation and Extrapolation

Once you have written a function to calculate h as a function of t, interpolation and extrapolation are quite easy! For any given value of the input variable t, you can use the function you've written to calculate the corresponding h value. For example, say we wanted to interpolate to estimate the plant's height at t = 2.5. We would type

h(2.5)

and then press Enter, which would yield 3.53. Similarly, to extrapolate to predict the height of the plant at t = 200, you would enter

h(200)

which would yield 165.48. Finally, to display your best-fit line on a plot with your raw data, you would type

plot(timevalues, heightvalues); plot(h, xlim=c(1,5), add=TRUE)

Note that this code is really just a fusion of two things you've already learned to do: plot a data set and infer a best-fit line! This code will yield the following plot:

[Figure: "Plant height data and best-fit line" — the raw (time, height) points with the fitted line h(t) overlaid, for time from 1 to 5.]

Chapter 5

Unit 5 R Companion: Quadratic Functions

5.1 Solving quadratic equations graphically in R

An overview

As noted in our earlier work on solving linear equations (Unit 3), the solution set for a system of functions includes all values of x for which the functions are equal. In graphical terms, this means that your goal is to find the value of x at the one or more places where the plots for these functions intersect. The solution(s) to a set of equations is the x value or values that yield equal values of y for the two equations.
Steps for Solving Functions Graphically in R

The strategy for solving a system of quadratic equations graphically in R has three steps that closely parallel those for solving linear equations:

1. Enter your two functions as two separate functions.

2. Define a third function, which, here, we'll term "difference", that calculates the difference between the first two functions you've entered.

3. Solve to determine the value of x at these points of intersection.

For example:

a) Define two functions (using the left- and right-hand sides of the initial equation).

y1<-function(x) {4*x^2+3*x+1}
y2<-function(x) {(x-1)*(2*x+4)}

b) Graph both functions on the same graph. To do this in R:

plot(y1,xlim=c(-4,4), ylim=c(-5,100), lty=2)
plot(y2,xlim=c(-4,4), add=T)

[Figure: y1(x), dashed, and y2(x), solid, plotted for −4 ≤ x ≤ 4.]

c) Here, you can see that the two curves do not intersect, so there is no solution.

Some Notes about the Code Above

The term lty=2 tells R to draw the curve for y1 with a dashed line ("lty" is short for "line type"), making the two curves easy to tell apart. Also, the code for y1 contains the term ylim=c(-5,100). As you will recall from our earlier work on plotting functions in R, this command tells R to set the y axis to go from -5 up to 100. Try changing these numbers, and see how your graph changes!

NOTE: Depending on the range you plot, you may not be able to tell whether your two curves intersect or not! Try adjusting the range to get a clearer sense of this. For example, the graph above examines the interval from x = −4 to x = 4. If you wanted to examine the interval from x = −100 to x = 100, you would use this code:

plot(y1, xlim=c(-100,100), lty=2)
plot(y2,xlim=c(-100,100), add=T)

Take a look at the resulting graph. While the two functions it plots are the same as those in the earlier panel, the range of the y axis is much broader in the second plot than in the first. As a result, we can no longer tell whether or not these two curves actually intersect in the area where they draw close to one another.
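A graph alone can be ambiguous, so it is worth backing it up numerically. The sketch below (the helper name "difference" is our choice) asks optimize(), a base R function, for the smallest value of y1(x) − y2(x) on a wide interval; since that minimum stays well above zero, the curves truly never meet:

```r
# The two curves from the example above
y1 <- function(x) {4*x^2 + 3*x + 1}
y2 <- function(x) {(x - 1)*(2*x + 4)}

# The curves intersect exactly where this difference equals zero
difference <- function(x) {y1(x) - y2(x)}

# optimize() searches an interval for the minimum of a function
opt <- optimize(difference, interval = c(-100, 100))
opt$objective  # approximately 4.875: always positive, so no intersection
```

(Algebraically, the difference simplifies to 2x^2 + x + 5, whose minimum value is 4.875, at x = −0.25.)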
To avoid uncertainties and possible misinterpretations of this sort, be sure to examine your functions on several different scales!

Example: Solve x^2 − 3x = (1 − x)·2x.

1. Define two functions: y1 = x^2 − 3x (left side) and y2 = (1 − x)·2x (right side).

y1<-function(x) {x^2-3*x}
y2<-function(x) {(1-x)*2*x}

2. Graph both functions on the same graph:

plot(y1,ylim=c(-10,70),xlim=c(-4,4))
plot(y2, xlim=c(-4,4),add=T,lty=2);abline(v=0, lty=3);abline(h=0, lty=3)

[Figure: y1(x) and y2(x) plotted for −4 ≤ x ≤ 4, with dotted reference lines at x = 0 and y = 0.]

You'll note that there are two different intersections here; our job in finding solutions is to identify both of these points. As previously, when our goal was to solve systems of linear equations, our strategy here will be to make a new function that calculates the difference between the two functions, and then identify the x values that make the difference equal to zero. So...

a) Define your difference function as one function minus the other:

difference<-function(x) {y1(x)-y2(x)}

And then, also as before, use the uniroot function. Recall that uniroot finds the x value, within an interval you specify, at which a function equals zero. Here, I look at the graph, and see that the graphs for the two functions cross somewhere near x = 0. For this, I'll start by examining the interval (-1, 1):

uniroot(difference, c(-1,1))

This gives several lines of output. For the moment, concentrate on the first line. This gives the first root, which here is:

$root
[1] 8.733861e-07

This is a very small positive number; that is, it's very, very close to zero. For my second root, I look at the interval that, based on the graph, would seem to contain my second solution. Once again, concentrate on just the first line of output:

uniroot(difference, c(1,2))

This yields a solution of ∼1.667. Therefore, we say that the equation has two roots. They are ∼0 and ∼1.667. Note: the ∼ symbol is called a tilde and, in this context, is used to mean "approximately".
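The whole example can be run as one sketch. (Algebraically, y1(x) − y2(x) simplifies to 3x^2 − 5x, so the exact roots are x = 0 and x = 5/3 ≈ 1.667, which is what uniroot recovers numerically.)

```r
y1 <- function(x) {x^2 - 3*x}
y2 <- function(x) {(1 - x)*2*x}

# The equation y1(x) = y2(x) holds exactly where this difference is zero
difference <- function(x) {y1(x) - y2(x)}

# Search each interval suggested by the graph for one root
root1 <- uniroot(difference, c(-1, 1))$root  # approximately 0
root2 <- uniroot(difference, c(1, 2))$root   # approximately 1.667

c(root1, root2)
```

Note that uniroot requires the function to have opposite signs at the two ends of the interval you supply, which is why reading the intervals off the graph first is so helpful.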
We state these roots as approximations because they are not integers, and listing an inferred root precisely would very likely require lots and lots of significant figures.

Chapter 6

Unit 6 R Companion: Behavior of a Function

6.1 Plotting some special functions

Careful attention to notation is essential in all coding, and especially so when working with complex functions! Here are a few examples of R code for functions you will encounter in Unit 6:

Function           R Code
y = x^3            y<-function(x){x^3}
y = sin(x)/x       y<-function(x){sin(x)/x}
y = 1/x + 5        y<-function(x){1/x+5}
y = 1/x^2          y<-function(x){1/(x^2)}

6.2 Scaling your plot window to examine the behavior of a function

For many functions, behavior for x → ∞ and x → −∞ will be visible only when you examine the function over a wide range of values for the input parameter. Consider, for example, this code in R:

y<-function(x) {sin(x)/x}

If you were to use this code to make a plot:

plot(y)

you would initially see a plot of the function on the range x = 0 to x = 1, as is the default convention in R. The graph would look like this:

[Figure: y(x) = sin(x)/x plotted on the default range 0 ≤ x ≤ 1, declining smoothly from about 1.00 to about 0.84.]

While this would be a good start, it would not permit you to examine the behavior of the function for negative values, or for x values greater than 1! To examine the behavior of functions, it will be very useful to adjust the x and y axes to permit viewing of the function over a broader range of input and output values. To view this function for −500 < x < 500 and −0.06 < y < 0.06, we would enter:

plot(y, xlim=c(-500,500), ylim=c(-.06,.06))

[Figure: the same function over −500 ≤ x ≤ 500, oscillating with ever-shrinking amplitude.]

Note that narrowing the range of the y values plotted yields information that was not at all apparent from our first plotting attempt. In particular:

• The function is undefined at x = 0 (that's because the denominator would have a value of 0!)

• As x → ∞, y → 0; as x → −∞, y → 0.
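You can also probe this behavior numerically rather than graphically; a small sketch evaluating sin(x)/x at increasingly large inputs:

```r
y <- function(x) {sin(x)/x}

# As |x| grows, the outputs shrink toward 0 (|sin(x)/x| can never exceed 1/|x|)
y(c(10, 100, 1000, -1000))

# At x = 0 the expression is 0/0, which R reports as NaN ("not a number")
y(0)
```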
6.3 Some reminders about using graphs to assess the behavior of functions

When examining functions using a graph in R, be sure to:

• Adjust the scale of your axes so as to display the function over a wide range of input and output values.

• Be on the lookout for x values for which the function is undefined. This should be apparent from your graph in R. It should also be possible to identify these points by examining the function itself to see whether there are any x values for which the denominator will be 0.

Chapter 7

Unit 7 R Companion: Function Library and Non-Linear Regression

Here is some R code for several of the functions you'll encounter in this chapter:

Type of Function                         Function                   R Code
Exponential Growth                       y = a·b^x                  y<-function(x){a*b^x}
Exponential Decay (negative exponent)    y = a·b^(−x)               y<-function(x){a*b^(-x)}
Logistic Growth                          y = c/(1 + a·e^(−b·x))     y<-function(x){c/(1+a*exp(-b*x))}
Sine                                     y = sin(x)                 y<-function(x){sin(x)}
Cosine                                   y = cos(x)                 y<-function(x){cos(x)}
Surge Function                           y = A·x·e^(−b·x) + C       y<-function(x){A*x*exp(-b*x)+C}
Square Root Function                     y = √x                     y<-function(x){x^(1/2)}
Absolute Value Function                  y = |x|                    y<-function(x){abs(x)}
Logarithmic Function, base 10            y = log10(x)               y<-function(x){log(x,10)}
Logarithmic Function, base e             y = ln(x)                  y<-function(x){log(x,exp(1))}

A Note on the value e, used in the logistic growth, surge, and natural logarithmic functions: The symbol e is used to indicate a value of approximately 2.718. In your work with mathematical expressions, you'll often need to raise e to various powers. R has a special way of indicating the number e. Say, for example, you want to raise the number e to the power 2. You would type exp(2). Here, exp indicates the number e, and the parenthetical number indicates the power to which e should be raised.

7.1 Nonlinear Regression

The procedure for non-linear regression in R is very similar to the linear case:

a) Enter your data as two separate lists.
For example,

xvalues<-c(1,2,3,16,22)
yvalues<-c(4,5,6,7,8)

b) Plot your data:

plot(xvalues,yvalues)

[Figure: the five (x, y) points; they rise quickly at first and then level off.]

c) Have a look at your graph. These data sure look nonlinear! You already know how to do linear regression in R:

summary(lm(yvalues~xvalues))

This infers the best-fit line to be

y = 4.6565 + 0.1527 * x

But if we add this function to the plot

y1<-function(x) {4.6565+0.1527*x}
plot(xvalues,yvalues); plot(y1,xlim=c(1,22),add=T)

[Figure: the data with the best-fit line overlaid; the line misses the curvature of the points.]

it doesn't look like a good fit at all. A poor fit to a linear function means that it's time for non-linear regression, which basically means that rather than finding the best-fit line, you are finding some other function that is a good fit to your data. Such a function could, for example, contain terms like x^2, x^3 ... x^n! Here is how to ask R to find a second-order non-linear function (that is, a function that contains x^2) that is a good fit to your data:

summary(lm( yvalues ~ xvalues + I(xvalues^2)))

Some parts of this code should already be familiar from your work with linear regression. If you compare to the line above, you'll see that the only new part is the

+ I(xvalues^2)

Basically, that part means that we are allowing the term xvalues^2 to appear in the function that is the best fit to our data. Note: just for notational reasons, you'll always need to surround each new term, like the xvalues^2 here, with I(...), where that first letter is a capital I, as in "I am studying in Biol123". Using this notation, we get the long output:

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
       1        2        3        4        5
-0.69733  0.05821  0.82331 -0.36056  0.17636

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.443307   0.721808   6.156   0.0254 *
x            0.258799   0.251491   1.029   0.4116
I(x^2)      -0.004779   0.011163  -0.428   0.7102
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 .
0.1 1

Residual standard error: 0.815 on 2 degrees of freedom
Multiple R-squared: 0.8671, Adjusted R-squared: 0.7343
F-statistic: 6.527 on 2 and 2 DF, p-value: 0.1329

As with linear regression, we can start by considering only a small part of this. Focus, for the moment, on just this part:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.443307   0.721808   6.156   0.0254 *
x            0.258799   0.251491   1.029   0.4116
I(x^2)      -0.004779   0.011163  -0.428   0.7102
---

In this case, we can get our inferred function just by looking at the first column. The intercept is inferred to be 4.44, the x coefficient is 0.258799, and the x^2 coefficient is -0.004779. So, the best-fit function containing x^2 is inferred to be

y = −0.00478 * x^2 + 0.2588 * x + 4.44

Just one more thing to mention here: if you want to fit your data to a function that has x^3 as a term, you apply this same exact logic to write your code:

summary(lm( yvalues ~ xvalues + I(xvalues^2) + I(xvalues^3)))

You can keep doing this to add more and more powers of xvalues! The curve inferred in that final step would give the graph below, which is a much better fit to your data!

[Figure: the data with the fitted cubic curve overlaid, tracking the points closely.]

Note: For this chapter, do the homework on the next page of this R Companion (not the homework in the Coursebook, which, for this Unit, is specific to the TI-83 calculator).

7.2 Homework for R Users

x    1    2    3    4
y    1   16   85  240

i. Plot these data, and find the best-fit regression going up to the power 4 (that is, you want a function that contains the term x^4).

ii. Based on your previous work with R and linear regression, look at your output and find the R^2 value.

iii. Write the function inferred to be the best fit for these data.

[Figure: plot of the homework data with the fitted curve.]

7.3 Answers to Homework for R Users

i. My plot is above.

ii. My R^2 value appears to be 1.0. (Added challenge: can you figure out why?)

iii.
The function would then be

y = 20 − 40.5·x + 30·x^2 − 8.5·x^3 + 0.83·x^4

Chapter 8

Unit 8 R Companion: Descriptive Statistics

For our exploration of descriptive statistics in R, let's assume you start with this list of data indicating the number of squirrels that you observed outside your dorm. You collected data every day for two weeks:

squirrels<-c(1,0,0,0,0,4,5,6,6,7,7,8,10,2)

Here are some basic R functions that will allow you to perform descriptive statistics for this data set. The code for a great many of these calculations will be intuitive. Moreover, many of these functions use the same code as the analogous function in Excel. Here's the code for some of the simple statistical calculations you may want to perform:

Function                                       R Code
mean                                           mean(squirrels)
sum of values                                  sum(squirrels)
maximum value                                  max(squirrels)
minimum value                                  min(squirrels)
median value                                   median(squirrels)
standard deviation for sample                  sd(squirrels) (see notes)
variance                                       var(squirrels)
quartiles (to get 0%, 25%, 50%, 75%, 100%)     quantile(squirrels) (see notes)
number of data points                          length(squirrels)
first quartile                                 quantile(squirrels, probs=c(0.25))
second quartile                                quantile(squirrels, probs=c(0.5))
third quartile                                 quantile(squirrels, probs=c(0.75))
the 89th percentile                            quantile(squirrels, probs=c(0.89)) (see notes)
total of all squared values                    sum(squirrels^2)

i. Standard deviation for a population: Unless you tell it to do otherwise, R will calculate the sample standard deviation. To get the standard deviation for the population, you can take advantage of the fact that the calculations for population and sample standard deviations differ only slightly (see your Coursebook for more on this!). The population standard deviation can be calculated as:

((sd(squirrels)^2*(length(squirrels)-1))/length(squirrels))^(1/2)

ii.
The "quartiles" function: Instead of "quartiles", the term in Excel, R uses the term "quantiles" (note the N rather than the R!), which is a more general way of talking about dividing a data set into parts based on the fraction of data points that fall into each of those parts.

iii. There are a variety of different conventions for calculating quantiles. The differences between these methods have mostly to do with where to place data points that lie exactly on the boundary between two segments of the data set. The default method in R calculates these the same way that Excel does.

iv. To find the mode, you can take advantage of R's ability to make a table to sort your data, and then display the abundance of each datum in the overall set:

table(squirrels)

which yields this list:

squirrels
 0  1  2  4  5  6  7  8 10
 4  1  1  1  1  2  2  1  1

Here, the top line lists the distinct squirrel counts observed (0, 1, 2, 4, ...), and the bottom line indicates how many times each of those numbers of squirrels occurred in the original data set. Here, for example, you can see that it was the finding of 0 squirrels that occurred the largest number of times in the data set (it occurred four times). Therefore, the mode of this data set is 0. By contrast, there was only one day on which 10 squirrels were observed.

Chapter 9

Unit 9 R Companion: Hypothesis Testing

9.1 Question: You have taken a sample and calculated a proportion. Is the true proportion in the population from which your sample is drawn different from 50%?

To address questions of this sort, R uses the proportion test. Consider the following example: Say you toss a coin 100 times. If the coin is fair — that is, if it has equal probabilities of yielding heads and yielding tails — we might expect that 50 tosses would yield heads, and 50 tosses would yield tails. However, we're not tossing the coin infinitely many times, so we expect some random variation around a "perfect" outcome of 50-50.
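To build intuition for that random variation, here is a quick simulation sketch: rbinom() in base R draws the number of heads from repeated runs of 100 fair tosses. (The seed is arbitrary, chosen only to make the sketch reproducible.)

```r
set.seed(1)

# Ten simulated experiments, each tossing a fair coin 100 times
heads.per.run <- rbinom(10, size = 100, prob = 0.5)
heads.per.run

# Even with a perfectly fair coin, individual runs drift away from 50,
# typically by about sqrt(100 * 0.5 * 0.5) = 5 heads in either direction
```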
So, we need to figure out how different the outcome can be from 50-50 before we should begin to suspect that the coin is not fair. In your experiment, 100 tosses yield 61 heads and 39 tails. Here, you would use the proportion test to investigate whether or not this result is consistent with the null hypothesis, H0, that the coin is fair. To implement this test, enter:

heads<-c(61)
tosses<-c(100)
prop.test(heads, tosses, p=.5, alternative="greater")

where

heads       # of "successes" observed
tosses      # of trials
p           null hypothesis, H0, in decimal form
"greater"   alternative hypothesis, H1. Here, that's the possibility that p > 0.5

Using the code above yields the following R output:

1-sample proportions test with continuity correction

data: heads out of tosses, null probability 0.5
X-squared = 4.41, df = 1, p-value = 0.01786
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5228432 1.0000000
sample estimates:
   p
0.61

Some of the above lines will be familiar based on the code you entered. For example, the data line simply gives the numbers of heads, and total tosses, as you originally entered them. The next line contains the p-value which, as you know from your reading, tells you the probability of gathering this particular data set (or one even more extreme) if your null hypothesis is true:

p-value = 0.01786

Here, the p-value is rather small, which can be interpreted to mean that only 1.8% of the time would you observe 61 or more heads in 100 tosses if the truth were that heads and tails were equally likely outcomes. In this case, we would be justified in rejecting the null hypothesis, and accepting the alternate interpretation that the probability of tossing heads using this particular coin is greater than 0.5. If you want R to summarize by giving only information about the p-value, you can write,

prop.test(x=61, n=100, p=.5, alternative="greater")[3]

This is just like the code above, except for the "[3]" on the end.
That little bit of additional code tells R that you want it to return only the 3rd item in the answer — that is, just the part where it tells you the p-value. (You'll find that if you change the 3 to some other number, R will return a different part of the answer!)

9.2 Question: Do the proportions calculated for samples from two populations indicate that the two populations differ in their true proportions?

Let's say that you now have samples from two different populations, and want to ask whether the true proportions for those populations are significantly different from one another. For example, imagine that you find that there are 10 males out of 24 students in one section of Biology 123, and 17 males out of 25 students in another section. You want to ask whether the proportion of males differs between the two sections. We use the same logic as before:

males<-c(10,17)
students<-c(24,25)
prop.test(males, students)

Note that here, the lists give the value of each variable for the two populations. There are 10 males in the first sample, and 17 in the second. Together, they comprise the list of samples for males. Please ask if you have questions about this! The output is then:

2-sample test for equality of proportions with continuity correction

data: males out of students
X-squared = 2.4503, df = 1, p-value = 0.1175
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.57312711  0.04646044
sample estimates:
   prop 1    prop 2
0.4166667 0.6800000

Note that the p-value here is 0.1175, indicating that there is a 12% chance that the proportions of males in the two samples would differ this much by chance alone, without any true difference between the populations. Typically, a p-value this high will be interpreted to mean that there is no justification for rejecting the null hypothesis that the proportions are equal for the two populations.
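As a runnable sketch of that comparison, with the relevant pieces of the output pulled out by name ($estimate and $p.value are standard components of the object prop.test returns):

```r
males <- c(10, 17)
students <- c(24, 25)

# Two-sample proportion test, stored so its components can be inspected
result <- prop.test(males, students)

result$estimate  # the two sample proportions: 10/24 (about 0.417) and 17/25 (0.68)
result$p.value   # about 0.1175: no strong evidence of a true difference
```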
9.3 Question: Given an estimated mean of a sample, what can we say about the mean for the population from which it was drawn?

Say you ask each student taking Biology 0123 this semester how many credits he or she is taking, total. You want to test the hypothesis that the true mean number of credits for Biology 0123 students is 15. Here's your data set:

creditstaken123<-c(12,13,14,15,15,18,12,13,14,15,18,12,12,12,12,13,
14,15,15,13,13,13,13,12,13,14,15,15,18,12,
13,14,15,18,12,12,12,12,13,14,15,15,13,13,13,13)

Implementing the t-test using raw data

The t-test will be useful here to ask whether the true mean for this population is 15. To implement the t-test for our data set and question, type:

t.test(creditstaken123, mu=15)

There are just two terms to keep in mind here. The first is the name of your data set (here, creditstaken123). The second is mu=15, which tells R that you want to test the hypothesis that the true mean of your data set is 15 (you can think of "mu" as the code for "mean"). Entering the code above yields a very, very tiny p-value (only 9.94·10^−6, which can be written in standard notation as 0.00000994 - a very small number!). Therefore, in this case, it seems safe to conclude that the true mean number of credits is significantly different from the null hypothesis of 15.

9.4 Question: Do these samples come from populations with different means?

On occasion, you'll want to ask whether there is evidence that two different samples have different means. So that we can experiment with this, let's define a second data set: the numbers of credits taken by a group of students who are not currently enrolled in Biology 0123:

creditstakenNOT123<-c(15,15,15,16,13,13,14,15,15,16,17,12,
13,14,15,15,16,17,16,17,12,
13,14,15,15,16,17,16,17,12,13,14,15,15,16,17)

Here, we use a t-test, which is a common approach for comparing the means of two different data sets.
t.test(creditstaken123, creditstakenNOT123)

which yields this output:

Welch Two Sample t-test

data: creditstaken123 and creditstakenNOT123
t = -3.2022, df = 78.666, p-value = 0.001969
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.8644710 -0.4350459
sample estimates:
mean of x mean of y
 13.73913  14.88889

Here, too, we can tell R that we want to look directly at the p-value:

t.test(creditstaken123, creditstakenNOT123)[3]

Here, the very low p-value of 0.00197 tells us that we can reject the null hypothesis that the means of the two populations are the same. Thus, we conclude that there is evidence that the mean numbers of credits differ between these two groups of students (though this particular approach does not permit us to make any claims about which mean is greater than the other).

9.5 Question: Is there evidence of a difference between "before" and "after" samples taken for a given set of individuals?

Many experiments have a "before" and "after" design (measuring pulse or body temperature both before and after exercise, for example), and then ask whether there is evidence that the before and after measures differ significantly for individuals. In these cases, it's most useful to ask about the statistical significance of changes observed for individual subjects, rather than for the population as a whole. Here, for example, we record students' test scores before and after a 3-hour study session:

before<-c(90,80,80,80,80,83,83,71,90,90,87,77,74,72,90,80,80,
80,80,83,83,71,90,90,
87,77,74,72,90,80,80,80,80,83,83,71,90,90,87,77,74,72)

after<-c(98,98,98,98,60,87,87,85,82,81,85,86,89,
90,100,100,100,98,98,98,98,
60,87,87,85,82,81,85,86,89,90,100,100,100,89,89,82,82,90,90,92,100)

In this case, we must inform R prior to implementing the t-test that the two sets of samples are matched. Happily, we can do this by adding only a tiny bit of code!
We write,

t.test(after, before, paired=TRUE)

Note that the "paired=TRUE" component is telling R that the samples are paired — that is, that the "before" and "after" data sets both contain data for the same group of individuals. This yields the following output:

Paired t-test

data: after and before
t = 4.9273, df = 41, p-value = 1.417e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.931844 11.782441
sample estimates:
mean of the differences
               8.357143

As always, we can get just the p-value on its own by writing,

t.test(after, before, paired=TRUE)[3]

which yields:

$p.value
[1] 1.416579e-05

That's a pretty low p-value! Note that the full output also gives the "mean of the differences", showing that individual students' test scores increased by an average of 8.36. It looks like most students benefited greatly from this study session!

9.6 Implementing the χ2 Test in R

The goal of the χ2 test is to assess whether the distribution of counts across two or more groups is significantly different from the null expectation. To do this, we compare each observed count to the expected count, and ask about the probability of observing a total discrepancy this great or greater by chance alone. Larger total differences between observed and expected counts lead to lower p-values, indicating lower probabilities that the discrepancies arose by chance alone, and suggesting that there is a significant difference between the observed and expected distributions.

9.7 A sample data set

Suppose we knew that 1/3 of Westfield students were from Hampden County, 1/3 were from Suffolk County, and 1/3 were from elsewhere. We then surveyed 198 students on a weekend afternoon to ask about the county where they live.
Our expected proportions are:

Hampden 33%   Suffolk 33%   Other 33%

And our expected counts for those 198 students surveyed would then be:

Hampden 66   Suffolk 66   Other 66

Let's say our observed counts were as follows:

Hampden 48   Suffolk 70   Other 80

9.8 Doing the χ2 test by hand

We could use the χ2 test to ask whether the distribution of students into "Hampden", "Suffolk", and "Other" was significantly different from our expectation. To do so, we would calculate the following for each category:

(O−E)^2 / E

and then take the sum of that quantity across the categories. For the example above, that would be:

County    Observed   Expected   (O−E)^2/E
Hampden   48         66         4.91
Suffolk   70         66         0.242
Other     80         66         2.97

Summing the (O−E)^2/E column, we would get 8.12! The test is said to have two degrees of freedom ("d.f.") because knowing the total number of students surveyed (198) plus any two other values (number from Suffolk and number from elsewhere, or number from Hampden and number from elsewhere, or number from Suffolk and number from Hampden) would be sufficient to complete the whole table.

To figure out the associated p-value, you have several options:

• do the entire test in Excel, as described below
• do the entire test in R, as described below
• use the sum calculated above (8.12) in conjunction with an online calculator, called "P from chi2", that is available at http://graphpad.com/quickcalcs/PValue1.cfm

9.9 To Perform a χ2 test in Excel

The "chisq.test" function in Excel takes two arguments: first, the range of cells that contain the observed values, and second, the range of cells that contain the expected values. To enter data in Excel, enter your observed and expected values into separate blocks of cells. Then, type "=chisq.test(", highlight your observed values, type a comma, highlight your expected values, and press Enter. The value returned will be your p-value, indicating how probable your observed values are, given your expected values.
9.10 To Perform a χ2 test in R

There are several ways to do a χ2 test in R. For me, the easiest to remember is "by hand", meaning that you enter the raw data, calculate the (O−E)^2/E value described above, find the sum, and then ask how surprised you should be about this result. Here's how: First, establish two lists: one with your expected values, and the other with your observed values. For the data given above, that would be:

expected<-c(66,66,66)
observed<-c(48,70,80)

Then, find the sum of the (O−E)^2/E values:

chivalue=sum(((observed-expected)^2)/expected)

Finally, we want to ask how surprised we should be to observe this χ2 value. The command below tells R to compare your result to the chi-square distribution. The "chivalue" refers to the sum you calculated above. The "2" gives the degrees of freedom — that is, the number of things we need to know (here, counts for two of three counties) in order to be able to provide the complete data set, assuming we already know the total number of students surveyed.

1-pchisq(chivalue,2)

This returns a p-value: 0.01723857. We interpret this to mean that only 1.7% of the time would the observed values differ this much or more from the expected values by chance alone! If we are using the conventional cut-off of 5%, we would note that our percentage is even lower than the cutoff, and would conclude that these observations differ significantly from the expectations.

Chapter 10

Unit 10 R Companion: Confidence Intervals

10.1 Proportions: computing the confidence interval and the margin of error

Computing a confidence interval

The goal of computing a confidence interval on a proportion, using a sample, is to determine the range of values that, with a given degree of certainty, contains the true value of the proportion for that population. For a data set that is binary — that is, it consists of only two possible answers, like "yes" or "no", "left" or "right", etc.
— you can compute a confidence interval using the "proportion test", or

prop.test

(You'll recall from earlier reading in our course that we used this very same test to ask whether a sampled proportion was significantly different from an expected proportion!)

Say, for example, you are interested in the proportion of students who take Biology 0123 who are majoring in biology. There are several different sections of Biology 0123, so you decide to use our section as a sample with which to estimate the proportion for all sections. You interview the students in our section, and find that 10 of 22 students are majoring in biology. Given this sampled proportion, what can we say about the true fraction of all Biology 0123 students at Westfield State who are majoring in biology?

Let's assume, for the moment, that you want to compute the 95% confidence interval on our estimate — that is, the range of values that, with 95% certainty, contains the true proportion for Biology 0123 overall. To compute this confidence interval, you would write:

biomajors<-c(10)
studentssurveyed<-c(22)
prop.test(biomajors,studentssurveyed)

where

R Code                      Meaning
biomajors<-c(10)            number of biology majors in our section
studentssurveyed<-c(22)     number of students in our section

The proportion of biology majors in our sample is then 10/22, which is 0.455. As is often the case, R produces several lines of output — the confidence interval is on one of these lines.
1-sample proportions test with continuity correction

data: biomajors out of studentssurveyed, null probability 0.5
X-squared = 0.0455, df = 1, p-value = 0.8312
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2507068 0.6732606
sample estimates:
        p
0.4545455

You'll note that the proportion of biology majors among the students sampled is given by the last line — reassuringly, it's equal to the value we calculated above:

        p
0.4545455

The line of interest for computing the confidence interval is this one:

95 percent confidence interval:
 0.2507068 0.6732606

This tells us that our sample of 10 biology majors out of 22 Biol0123 students is, with 95% confidence, taken from a population where the true proportion is between 0.25 and 0.67. As you'll recall from your reading, a 95% confidence interval is appropriate when you want to be quite certain that the interval computed contains the true answer. If you don't specify the confidence level you want, R will assume that you are looking for the 95% interval. However, there are some cases in which you might be content to calculate an interval that has a somewhat lower probability of containing the true answer — computing 80% intervals, for example, is common in some fields. To calculate the 80% confidence interval in R, you would write,

prop.test(biomajors,studentssurveyed, conf.level=.8)

Note that this yields an interval that is narrower than the 95% interval we calculated above. Do you see why? Please let me know if you'd like to discuss this further!

Computing the margin of error

Often, scientific papers contain sentences like this: "Of all of the students taking Biology 0123, 45.5% ± 21.1% are biology majors." In this case, the ± statement indicates the margin of error.
The margin of error is half the width of the confidence interval. Here, the interval spans

0.6732606 − 0.2507068

which is 0.4225538. Dividing by two gives 0.2112769. So, the margin of error on our estimate of the proportion is ±21.1%.

Note: Several times in this section, we have skipped back and forth between thinking about proportions as decimals, and thinking about them as percentages. Either of these options is fine for thinking about proportions, so long as you are clear about which you are using!

10.2 Means: computing the confidence interval and the margin of error

...using a complete data set

Say you have this list of squirrel counts taken on various plots in Westfield:

squirrels.per.acre<-c(50,10,16,6,12,13,14,15,16,19,20,21,25,25,70,91,30,
54,16,6,12,13,14,15,16,19,20,21,25,25,70,91,20,21,25,25,70,91,6)

Recall from before that you can use R to compute the mean of this list:

mean(squirrels.per.acre)

To compute the confidence interval on this estimate of the mean, we can use the one-sample t-test, as we did in a previous unit. Here, we would enter:

t.test(squirrels.per.acre)

The 95% confidence interval is listed in the output:

95 percent confidence interval:
 20.96036 36.88580

As above, we can use this interval to compute the margin of error: (36.88580 − 20.96036)/2, which is equal to 7.96, so we write the interval as 28.92 ± 7.96.

...using summary statistics

On rare occasion, you will have summary information for a study, but no access to the raw data. In such a case, you'll want to use the known mean, standard deviation, and sample size to calculate the confidence interval. To do this, you'll have to install the "BSDA" package in R. Because R is an open-source language, anyone who wants to can contribute new packages to perform computations that are not part of the standard distribution. The "BSDA" package is one such package.
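If you aren't sure whether a given package is already on your machine, you can check before installing. One sketch, using requireNamespace from base R:

```r
# Returns TRUE if BSDA is already installed on this machine, FALSE otherwise
requireNamespace("BSDA", quietly = TRUE)
```

If this returns TRUE, you can skip straight to loading the package.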
R as installed on the laptop you are using likely does not include this package — but installing the package should be quite easy! The two lines of code given below tell R to download the BSDA package from the internet, install it on the computer that you are using, and load it:

install.packages("BSDA", dependencies=T)
library(BSDA)

Installing the package is a one-time step, but R doesn't keep packages loaded from one session to the next, so you should be sure to enter library(BSDA) each time you want to calculate a confidence interval from summary statistics.

After the BSDA package is installed, we can use its tsum.test function to perform calculations using summary statistics. For example, imagine that for the squirrels.per.acre data set given above, we already know the following (but don't have the original data set!):

mean                  28.92308
standard deviation    24.56397
sample size           39

Imagine we want to calculate the 80% confidence interval. To do so, we input the values above into the "tsum.test" function. The name comes from the ability of this function to calculate a t interval using summary statistics:

tsum.test(mean.x=28.92308, s.x=24.56397, n.x=39, conf.level=.80)

This yields a confidence interval of

23.79304 34.05312

and an associated margin of error of (34.05312 − 23.79304)/2 = 5.13. So, we can say, with 80% confidence, that the true mean number of squirrels across one-acre plots in Westfield is 28.92 ± 5.13.

Chapter 11

Unit 11 R Companion: Experimental Design

Your Coursebook contains all of the material required for this section.

Appendix A

Appendix 1: Other hints that may be useful as you learn R

Evaluating functions at specific input-variable values

Once you have coded a function in R, it's quite easy to calculate the value of the function given a specific value for the input variable. Say, for example, you wanted to evaluate your function z(p) for p = 10.
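(The definition of z isn't shown in this example; one hypothetical definition consistent with the output below would be:

```r
# Hypothetical function -- any definition with z(10) = 16 behaves the same way here
z <- function(p) {
  p + 6
}
```

Any function you have defined yourself can be evaluated the same way.)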
If you were to type

z(10)

R would return

[1] 16

A.1 Using R as a Calculator

R can perform all of the arithmetic you might otherwise perform on a typical calculator. If you type

(3+23)/4

and then press enter on your computer keyboard, R will return

[1] 6.5

(For the moment, you can ignore the "[1]" that appears at the beginning of each line of R output.)

If you want to do multiple calculations on a single line, you can use a semicolon to separate the various commands. For example,

(3+23)/4; 2+5

returns two separate answers on two separate lines:

[1] 6.5
[1] 7

If you want to calculate the product of two numbers — say, for example, 17 and 20 — you'd type

17*20

In R, the asterisk (*) is used as a multiplication sign.

A.2 File Types in R

• There are three main types of windows in R:
i. Console, where calculations are performed
ii. Text file, where code can be written and stored
iii. Quartz window, where graphical output is displayed.

• Sometimes, after a new plot is called, the Quartz window containing the plot doesn't immediately appear as the front window on your screen. To bring that window to the front, click on the "Window" menu and select the "Quartz" option. If the x-axis is partially obscured on the graph displayed, try maximizing the Quartz window.

A.3 Saving Files in R

• There are two primary types of files that you'll want to save in R:
i. Text files containing code: be sure to save these so that you can adapt and use your code later!
ii. Quartz files that contain plots: you can save these as .pdf files for later retrieval, printing, incorporation into papers, etc.

A.4 Typing Shortcuts in R

As you become more accustomed to R, you'll develop your own methods for saving time and accomplishing tasks efficiently. Here are a few to start with:

• Reenter code from earlier in your console: press the up-arrow key. This will enable you to cycle through earlier lines until you reach the code of interest.
• Use text recognition to save typing time: If you want to type a new line that contains a term you've used at least once already, type the beginning of the word, and then press the tab key. This will yield a drop-down list; scroll through it to find the term you'd begun to type!

A.5 Importing data into R, and exporting plots from R

At times, you may record raw data in one application (Excel, for example), and want later to manipulate and analyze those data in R. You could reenter your data directly into R. However, each time that you transcribe data, there are new opportunities for typos — minimizing such opportunities is a useful goal. Fortunately, R can read and import files of various different formats.

Getting R to read data from an Excel file. Say you have the following Excel file and would like to plot and analyze these data in R. To import this Excel file into R, use the following steps:

i. Save your Excel file as a ".csv" file, where "csv" stands for "comma-separated values". Be sure to save your file to a folder where you can find it by name! For the purpose of this example, let's assume that we've named our file "runningtimes.csv", and that we've saved it to the folder R Addenda.

ii. Proceed to R, and change your "Working Directory" to be the folder where you saved this file. You can change the working directory by going to Misc -> Change Working Directory, and then clicking on the folder that contains your .csv file.

iii. Next, from within R, read in the contents of your file. To do so, type:

p<-read.table("runningtimes.csv",sep=',', header = TRUE)

iv. A couple of things to note in the above code: the "sep" setting makes it possible to tell R that commas are used to separate our data points, and the "header" setting tells R that we have included names for the various columns in our data file ("Age", etc.).

v. Your data should now be ready for manipulation in R!
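As an aside, base R also provides read.csv, which is simply read.table with the comma separator and header already set for you, so the call above can be shortened:

```r
# read.csv presets sep="," and header=TRUE, so this is equivalent
# to read.table("runningtimes.csv", sep=',', header=TRUE)
p <- read.csv("runningtimes.csv")
```

Either form works; read.csv just saves a little typing for this very common file format.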
To test this, you could type the name of our data set (here, we've set it to be "p"), and examine the data — they should have the same essential structure as in the Excel file.

Appendix B

Appendix B: Exponential Regression in R

B.1 Strategy

The goal of exponential regression is to find a function of the form y = b · e^(m·x) that is a good fit to a particular data set. In R, exponential regression can be achieved through "exponential transformation". We find the natural log of the output (y) variable, and fit a line to the transformed data. The value for m in the exponential function is the slope of the inferred line. The value for b in the exponential function is calculated as e raised to the intercept of the inferred line. Here are the specific steps:

i. Find the natural log of the output variable, y.

ii. Use linear regression to find the line that best relates your input variable, x, to the log of your output variable — that is, log_e(y).

iii. To find "m" for the exponential function, use the slope inferred for the best-fit line.

iv. To find "b" for the exponential function, raise e to the intercept inferred for the best-fit line.

v. Write your exponential function!

Here's an example:

B.2 Enter your data

Imagine you record hourly counts of the number of bacterial cells present in 200 mL of a culture. The data can be written as follows:

hour<-c(1,2,3,4,5,6)
cells<-c(1,20,30,50,80,200)

We can plot the data:

plot(hour,cells)

[Figure: scatterplot of cells (0–200) versus hour (1–6)]

B.3 Find the natural log of your output variable

Here, our output variable is "cells", so we write

natlogcells<-log(cells)

For the natural logs of these y values, I get

[1] 0.000000 2.995732 3.401197 3.912023 4.382027
[6] 5.298317

B.4 Use linear regression to find the line that best relates your input variable, x, to the log of your output variable — that is, log_e(y).

We've done this lots of times before!
Enter

summary(lm(natlogcells~hour))

I get

Call:
lm(formula = natlogcells ~ hour)

Residuals:
      1       2       3       4       5       6
-1.1057  0.9997  0.5148  0.1353 -0.2850 -0.2590

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2154     0.7583   0.284   0.7904
hour          0.8903     0.1947   4.573   0.0102 *
---

Residual standard error: 0.8145 on 4 degrees of freedom
Multiple R-squared: 0.8394, Adjusted R-squared: 0.7993
F-statistic: 20.91 on 1 and 4 DF, p-value: 0.01024

B.5 To find "m" for the exponential function, use the slope inferred for the best-fit line.

For the example above, I get m = 0.8903.

B.6 To find "b" for the exponential function, raise e to the intercept inferred for the best-fit line.

Here, the intercept for the line is inferred to be 0.2154, so I calculate b as b = exp(0.2154) = 1.24.

B.7 Write your exponential function!

Remember, your function should have the form y = b · e^(m·x). So, we write:

cells = 1.24 · e^(0.8903·hour)

B.8 How well does your inferred function fit your data?

Try plotting your raw data and inferred function on the same plot:

inferredfit<-function(x) {1.24*exp(.8903*x)}
plot(hour,cells, main="A pretty good fit!");plot(inferredfit,1,6,add=T)

[Figure: scatterplot of cells versus hour with the inferred exponential curve overlaid, titled "A pretty good fit!"]

It's a pretty good fit!

Appendix C

Appendix 3: Implementing the χ2 Test by Hand, in Excel, and in R

The goal of the χ2 test is to assess whether the distribution of counts across two or more groups is significantly different from the null expectation. To do this, we compare each observed count to the expected count, and ask about the probability of observing a total discrepancy this great or greater by chance alone. Larger total differences between observed and expected counts lead to lower p-values, indicating lower probabilities that the discrepancies arose by chance alone, and suggesting that there is a significant difference between the observed counts and the expectations.
C.1 A sample data set

Suppose we knew that 1/3 of Westfield students were from Hampden County, 1/3 were from Suffolk County, and 1/3 were from elsewhere. We then surveyed 198 students on a weekend afternoon to ask about the county where they live. Our expected proportions are:

Hampden 33%
Suffolk 33%
Other 33%

And our expected counts for those 198 students surveyed would then be:

Hampden 66
Suffolk 66
Other 66

Let's say our observed counts were as follows:

Hampden 48
Suffolk 70
Other 80

C.2 Doing the χ2 test by hand

We could use the χ2 test to ask whether the distribution of students into "Hampden", "Suffolk", and "Other" was significantly different from our expectation. To do so, we would calculate the following for each category:

(O − E)²/E

and then take the sum of this quantity across the categories. For the example above, that would be:

County    Observed   Expected   (O−E)²/E
Hampden   48         66         4.91
Suffolk   70         66         0.242
Other     80         66         2.97

Summing the last column of this table, we would get 8.12!

The test is said to have two degrees of freedom ("d.f.") because knowing the total number of students surveyed (198) plus any two other values (number from Suffolk and number from elsewhere, or number from Hampden and number from elsewhere, or number from Suffolk and number from Hampden) would be sufficient to complete the whole table.

To figure out the associated p-value, you have several options:

• do the entire test in Excel, as described below
• do the entire test in R, as described below
• use the sum calculated above (8.12) in conjunction with an online calculator, called "P from chi2", that is available at http://graphpad.com/quickcalcs/PValue1.c

C.3 To Perform a χ2 test in Excel

The "chisq.test" function in Excel takes two arguments: first, the range of cells that contain the observed values, and second, the range of cells that contain the expected values. To enter data in Excel, enter your observed and expected values into separate blocks of cells.
Then, type "=chisq.test(", highlight your observed values, press comma, highlight your expected values, and press enter. The value returned will be your p-value, indicating how probable your observed values are, given your expected values.

C.4 To Perform a χ2 test in R

There are several ways to do a χ2 test in R. For me, the easiest to remember is "by hand", meaning that you enter the raw data, calculate the (O − E)²/E value described above, find the sum, and then ask how surprised you should be about this result. Here's how:

First, establish two lists: one with your expected values, and the other with your observed values. For the data given above, that would be:

expected<-c(66,66,66)
observed<-c(48,70,80)

Then, find the sum of the (O − E)²/E values:

chivalue=sum(((observed-expected)^2)/expected)

Finally, we want to ask how surprised we should be to observe this χ2 value. The command below tells R to compare your result to the chi-square distribution. The "chivalue" refers to the quantity you calculated above. The "2" gives the degrees of freedom — that is, the number of things we need to know (here, counts for two of the three counties) in order to be able to provide the complete data set, assuming we already know the total number of students surveyed.

1-pchisq(chivalue,2)

This returns a p-value:

0.01723857

We interpret this to mean that only 1.7% of the time would the observed values differ this much or more from the expected values! If we are using the conventional cut-off of 5%, we would note that our percentage is even lower than the cutoff, and would conclude that these observations differ significantly from the expectations.
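As an alternative to the by-hand sum, base R also has a built-in chisq.test function, which takes the observed counts and the expected proportions and performs the whole calculation in one step. A sketch using the data above:

```r
observed <- c(48, 70, 80)   # Hampden, Suffolk, Other

# p gives the expected proportions under the null hypothesis
chisq.test(observed, p = c(1/3, 1/3, 1/3))
```

This reports the same χ2 statistic of 8.12 with 2 degrees of freedom, and the same p-value of about 0.017, as the by-hand calculation — a handy way to check your work.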