Getting Started with R and RStudio

1 What is R?

R is a system for statistical computation and graphics. It is the statistical system used in Mathematics 241, Engineering Statistics, for the following reasons:

1. R is free and open-source.
2. R is user-extensible, and user extensions can easily be made available to all users.
3. R is commercial quality. It is the package of choice for many engineers who use statistics frequently.
4. R is easy to use.

At first, you might wonder about point 4 above. Other data analysis tools (such as Excel) appear easier to use. But many things that an engineer might do are easier to do in R than in Excel, and some are impossible to do correctly in Excel. In Mathematics 241 we will focus on core statistical tools and so learn to use only a small fraction of the capabilities of R. But since R is free, you will be able to keep R and add to your knowledge of it throughout your career.

2 Using R on the “Cloud”

R can easily be downloaded and installed on your personal computer. You can download it from here: http://cran.r-project.org/. However, we will use R over the internet by using a system called RStudio. There are advantages and disadvantages to using R over the internet, but the principal advantages for this course are that the installation and setup of R are taken care of and that data can easily be shared with the instructor and other students. Additionally, you can use R on different computers (as long as they have access to the internet), and the status of your R session will always be just as you left it.

To use RStudio, go to http://beta.rstudio.org/ and use your Calvin email ID and password. The webpage that comes up after logging in should look like this:

[Screenshot: the RStudio webpage after logging in]

Notice that there are four windows. R commands are entered one line at a time into the Console window (which is the lower left window by default).
In the standalone version of R that you can install, there is only a console window at the start. The other three windows are used by RStudio to interact with the file system, to show graphics plots, and to provide an editor that can compose input for the console. (These notes are being produced in RStudio using the window at the top left as an editor.) In the remainder of these notes we will work entirely within the console. Thus these notes can be used to get started in any version of R.

The symbol > is the prompt symbol that signifies that R is ready for input. In general, in response to the prompt we enter a one-line command and get some output (or define some object). We will look at some of the basic kinds of objects and commands in the rest of these notes.

3 Basic features of R

In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these commands or variations of them yourself.

3.1 Using R as a Calculator

R can be used as a calculator.

> 5 + 3
[1] 8
> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4

You can save values to named variables for later reuse.

> product = 15.3 * 23.4        # save result
> product                      # show the result
[1] 358.02
> .5 * product                 # half of the result
[1] 179.01
> log(product)                 # log of the result
[1] 5.880589
> product <- 15.3 * 23.4       # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct    # can assign to the right hand side
> newproduct
[1] 358.02

The semicolon can be used to place multiple commands on one line. One frequent use of this is to save and print a value all in one go:

> product <- 15.3 * 23.4; product    # save result and show it
[1] 358.02

3.2 Functions and Objects

Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex objects and has a large library of functions that operate on those objects. So most of the R commands that we will enter will look like f(x,y,...) where f is the name of an R function (like log above) and x,y,...
is a list of objects and options. In the next section we illustrate by introducing the vector object and give some examples of functions that operate on vectors.

3.3 Vectors

A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable. Usually vectors will be constructed by reading data from an R dataset or a file, as we will soon see. But short vectors can be constructed by entering the elements directly using the c() function.

> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2

Note that the [1] that precedes the elements of the vector is not one of the elements but rather an indication that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are regular.

> y=1:10
> z=seq(0,5,.05)
> y;z
 [1]  1  2  3  4  5  6  7  8  9 10
  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
 [16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
 [31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
 [46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
 [61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
 [76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
 [91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00

Many functions operate on vectors component-wise.

> x=1:5
> y=6:10
> x^2
[1]  1  4  9 16 25
> x+y
[1]  7  9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379

3.4 Data Frames

Data sets are usually stored in a special structure called a data frame. Data frames have a 2-dimensional structure. Rows correspond to the individuals (units, cases, subjects) of our data set and the columns correspond to variables (measurements collected on each individual).
Data frames in R are named, as are the individual variables of the data frame. The columns (variables) are either vectors or factors (think of a factor as a vector that stores a categorical variable). We will usually get our data frames from external files that we have prepared in some other way. Excel is a good way to prepare a data frame, since a data frame looks like a spreadsheet.

Some datasets are included with the default R installation. One of them is the iris data frame, which contains 5 variables measured for each of 150 iris plants.

> str(iris)    # summarizes the structure of the data frame
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris)    # gives summary information on each variable
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
> head(iris)    # prints the first several cases of the data frame
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In interactive mode, you can also try

> View(iris)

to see the data or

> ?iris

to get the documentation for the data set.
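
The description at the start of this section, that the columns of a data frame are vectors or factors, can be illustrated by assembling a small data frame by hand. This is only a minimal sketch: the names flowers, length_cm, and species and the four measurements are invented for illustration and are not part of the course data.

```r
# Hypothetical measurements, invented for illustration only.
length_cm <- c(5.1, 4.9, 6.3, 5.8)                        # a numeric vector
species <- factor(c("setosa", "setosa",
                    "virginica", "virginica"))            # a categorical variable

# data.frame() combines equal-length columns into a data frame;
# each row is one individual, each column is one variable.
flowers <- data.frame(length_cm = length_cm, species = species)

str(flowers)      # structure of the data frame
nrow(flowers)     # number of individuals (rows)
names(flowers)    # names of the variables (columns)
```

The same inspection functions used on iris above (str, summary, head) work on any data frame built this way.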
Access to an individual variable in a data frame uses the $ operator in the following syntax:

> dataframe$variable

For example,

> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

shows the contents of the Sepal.Length variable (a vector). But this isn’t very useful for a large data set. We would prefer to compute a numerical or graphical summary.

4 Summaries of a single quantitative variable

Almost always, a quantitative variable is stored in a vector, and that vector is one column of a data frame. Most functions that give a numerical or graphical summary require a vector as argument. In this section we illustrate some of the more important summary functions with the variable Sepal.Length of the data frame iris.

4.1 Numerical summaries

> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
> sd(iris$Sepal.Length)
[1] 0.8280661
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9

4.2 Graphical Summaries

There are several ways to make graphs in R. Many individuals have written R packages that give great control over the way a graph is drawn. We will use the standard graphics functions that are built into R. Here we illustrate the two most important graphical representations of a single quantitative variable. In RStudio, graphics output appears in the plot window (lower right).
You must click on the Plots tab to see them. A histogram is drawn using the function hist.

> hist(iris$Sepal.Length)

[Figure: frequency histogram titled "Histogram of iris$Sepal.Length"; horizontal axis iris$Sepal.Length, vertical axis Frequency]

Many functions in R have optional arguments that change the way that the function acts. Often we can omit these arguments since R chooses reasonable default values. Note that R produces frequency histograms by default. To produce a density histogram, we set the optional argument freq to false. Note that we name the argument. Optional arguments usually have to be named so that R knows which arguments are being included. Other optional arguments control the title of the histogram and the axis labels.

> hist(iris$Sepal.Length,freq=F,main="Sepal Length",xlab=" ")    # F is short for false

[Figure: density histogram titled "Sepal Length"; vertical axis Density]

Another common plot is called a boxplot. A boxplot is a graphical representation of a five number summary of a quantitative variable. The default boxplot uses a vertical scale. Here we draw a horizontal boxplot.

> boxplot(iris$Sepal.Length, horizontal=T, main="Sepal Length")    # T is short for true

[Figure: horizontal boxplot titled "Sepal Length"]

5 Importing data

Each RStudio user has space to save files. You can see your personal directory using the Files tab of the same window in which you look at plots. Each user has a Public directory which is visible to other RStudio users. All data files used in Mathematics 241 and in the 241 textbook are in the instructor’s public directory. To load a file used in class, use the Import Dataset tab under the Workspace tab of the top right window. Select From Text File and enter as filename /shared/[email protected]/. You will see the following:

[Screenshot omitted]

Class datasets are in the M241 directory and textbook datasets are in the DF directory. For example, to import the dimes dataset, navigate to M241 and select dimes.csv.
A window will pop up that enables you to tell R which format the data is in, but in this case RStudio understands the CSV format of the dimes dataset. This procedure defines the data frame dimes. To load a textbook dataset, navigate to the DF folder and select the appropriate file: x10.4.23.csv is the data for exercise 23 in section 4 of chapter 10, and e3.7.csv is example 7 of chapter 3.

6 Useful features of RStudio

One of the most useful features of RStudio is that it will save the state of your session even if you close your browser. This includes all variables, plots, and other settings. This is very useful for class work, since you might get stuck on a homework problem and can pick up where you left off after you get help in class or from the instructor.

Another useful feature is the History tab of the upper right hand window. In that window you can find all the lines that you have entered into the console. These lines can be copied into the console window, for example.

Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of a function (e.g., > hi) and press the tab key, you get all the possible functions that begin with these characters along with a short description of what they do. (Try pressing the tab key after entering hi.) If you hit the up-arrow key, the previous line that you entered becomes the current line, and you can edit it and enter it again. This is very useful if you make a small typo on a very long line.

These notes were produced on January 31, 2011 using R version 2.12.1 (2010-12-16).