R for Introductory Statistics

R for Introductory Statistics
Step 1: Is your data in the right place?
R uses “C:/Windows/system32” as its default location for data, (what R calls its “working directory”). It is
not a good idea to place files and change folders inside this directory, so we want to change the working
directory. For convenience, let’s set the working directory to be the desktop.
First, check your working directory.
> getwd()
[1] "C:/Windows/system32"
The “>” shape is your command prompt. Also,
you must include the parenthesis, even though
you are not putting anything inside them.
To set your working directory to your desktop, use the command setwd().
> setwd(C:/Users/computername/Desktop)
R is case-sensitive, so if the folder in your computer is capitalized,
you must capitalize it in your command. In some cases, your
computer might also display directory names with a “\” instead
of the “/” shown here (i.e. C:\Users\computername\Desktop).
Whatever your computer shows, use “/”. If R tells you it could
not find the directory, re-check you spelling, capitalization, and
your slashes. After you set your working directory, confirm it by
re-using the getwd() command.
If you are using a Purdue computer,
your “computername” will be the ID
you used to login. For a personal
computer, it will be the computer
name. To find your computer name,
right-click on a file on your desktop
and select “properties”. It will be
listed under “location”.
Now is a good time to give you the most important time-saving tip. If you enter any command wrong,
and R returns an error, you do not have to retype the whole command. On the empty command
prompt, press the “up” arrow key. This will cycle up through the previous commands you have entered.
Step 2: Loading Data
After you’ve set your desktop as the working directory, place your data file on the desktop. It must be in
a .txt or .csv file. If you have an excel file, use the “save as” function, and re-save the .xls or .xlsx file as a
.csv. (In doing this, Excel will say your file will lose some formatting, this is fine and you can click “yes”.)
#
1
2
3
4
5
6
7
8
9
10
glass
1
1
1
2
2
2
3
3
3
1
temp
100
100
100
100
100
100
100
100
100
125
light
580
568
570
550
530
579
546
575
599
1090
To read this file, use the command “read.table”.
>dataset=read.table("wolf.csv",header=T,sep=',’)
dataset is the title we wish to give our dataset in R. wolf.csv is the
data file you have on the desktop. header=T means that the cell at
the top of the column (for example, “glass”) is a label, not a data
point. sep=’,’ tells R to look for commas as separators for data
points. If you’re using a .csv file, this will be true. If you’re using a
.txt file, make sure that the data is separated by a comma.
You can also use the command read.csv(). Everything is the same
as read.table(), but you can leave out the sep=’,’ command. Of
course, make sure you are using a .csv file.
> dataset1=read.csv(“wolf.csv”,header=T)
Remember, R is always case-sensitive. If your file has a capital letter in the filename, that letter must be
capitalized in the command. If the column title is capitalized, you will have to capitalize it every time you
use it.
Step 3: Formatting Data
In the previous step, we provided a partial dataset. In this dataset, “glass” and “temp” are factors, or
numerical categorical variables. But R will always assume the data you provide is continuous. To make
R read a variable as a factor, use the as.factor() command.
This is the name of the variable you
want R to read as a factor.
> dataset$glass = as.factor(dataset$glass)
This is the name of the variable you
are creating. If you want to keep the
original “glass” variable the same, you
can change the name here to Fglass,
as in dataset$Fglass.
Step 4: Attaching Data
In the last step, we showed you how to format data with the as.factor() command. Every time we
referenced a variable, or created a new one, we had to specify the dataset. This can become annoying,
so to save time, you can attach data using the attach command.
> attach(dataset)
If we had done this before Step 3, we could have written the command:
> glass = as.factor(glass)
This will be time saving when your fitting models, and also simplifies your commands, making it easier to
avoid mistakes. You can only attach one dataset at a time.
Step 5: Means and T-tests
If you forget, at any time, the name of the variables in your dataset, you can use the names() command.
> names(variable)
or
> names(dataset$variable)
If you haven’t attached your data set,
you must specify the dataset.
To find the mean and standard deviation, use the mean() and sd() commands.
>mean(variable)
>sd(variable)
T-tests can be performed in a few ways, but all use the t.test() command.
> t.test(var1, mu=X)
> t.test(var1, var2, var.equal=T)
> t.test(var1, var2, paired=T)
Step 6: Fitting Models
This will test your variable (var1)
against a mean you decide (X)
This will test two
variables as a paired test
This is a two-variable T-test. When
var.equal=T, the test assumes equal
variance. When var.equal=F, the tes
will assume unequal variances)
In R, we must fit a linear model before we can run ANOVA or run analysis on our residuals. We do this
with the lm() command.
> model = lm(light~glass+temp+glass:temper, data=dataset)
This is the model name.
Response Variable
Interaction Term
Dataset
Factors
There are other ways to include interaction terms. With lm(light~glass*temp, data=dataset),
the asterisk between the factors tells R to include both main effects and the interaction in the model. If
we had three factors, lm(light~glass*temp*third, data=dataset), then the model would
include all three main effects, all three 2nd order interaction terms, and the higher level interaction
term.
If we had three main factors, but we only wanted the main effects and the 2nd order interaction terms,
we could write the model as lm(light~(glass+temp+third)^2, data=dataset).
To offer another example, the following three models are identical.
> model = lm(light~glass+temp+glass:temp, data=dataset)
> model = lm(light~glass*temp, data=dataset)
> model = lm(light~(glass+temp)^2, data=dataset)
It might also be useful for you to make a graph of your interaction terms, which we can use the
interaction.plot() command for.
> interaction.plot(temp,glass,light, xlab= “Coating”, ylab= “Response
Mean”, main= “Interaction Plot”, trace.label= “Temperature”)
The trace.label() provides a label for
the lines which connect the points in
the interaction plot, (i.e. the
difference temperatures used).
Step 7: ANOVA
Using the above model, we can perform ANOVA with a rather simple command: anova(model). You
must have previously constructed a model using the lm() command.
> anova(model)
Your output will look like:
Analysis of Variance Table
Response: light
glass
temper
glass:temper
Residuals
Df
2
2
4
18
Sum Sq
150865
1970335
290552
6579
Mean Sq
75432
985167
72638
366
F value
206.37
2695.26
198.73
Pr(>F)
3.886e-13
<2.2e-16
1.254e-14
***
***
***
Step 8: Type I and Type III Sum of Squares
The output from Step 7 will show you the Type I Sum of Squares. As with other software, you must
adjust the model to so that whichever term you wish to have unadjusted, (or on top) is the first term.
Unlike with other software, getting Type III Sum of Squares is somewhat onerous. As you may know, the
last term in the Type 1 SS table is equivalent to the Type III SS for that term. With R, you must re-do the
model with the term for which you wish to get the Type III SS last.
To save you a few steps, you can also use the aov() command to get ANOVA outputs. As you can see, the
aov() command fits your linear model within the command line. This way, you do not have to refit a
linear model with lm() to get all of your Type III SS.
> summary(aov(y~var1+var2+var1:var2+var3))
The aov() command returns a truncated ANOVA
table, so here we use the summary() command
to force a more complete table.
Step 9: Testing Assumptions
The names() command is also useful after you’ve fit a model.
Here, you’re fitting a linear model,
and the syntax rules we showed in
Step 6 apply here.
This is the output from the names()
command on your model. These are
all variables created when you fit a
model with lm().
> names(model)
# [1]
“coefficients” “residuals”
“effects” “rank”
# [5]
“fitted values” “assign”
“gr”
“dr.residual”
# [9]
“contrasts”
“xlevels” “call”
“terms”
#[13] “model”
We can use some of these variables to test assumptions about our model, like constant variance and
normality of residuals. Here, the linear model we created with the lm() command acts as a data set, so
you must reference the model when using a variable. You can also detach the dataset with detach()
command, and then attach the model.
This command partitions the window
into quarters. You don’t have to
> detach(dataset)
> attach(model)
partition it 2X2, but we’re going to
show you four graphs.
> par(mfrow=c(2,2))
> qqnorm(model$residuals)
The command qqnorm() draws a
> qqline(model$residuals)
normal QQ-plot of whatever variable
you specify, while qqline() draws a
If you wish to plot the residuals against the fitted values (also called the predicted values), we have to
line using the first and third quartile
use the plot() command.
> plot(model$fitted.values, model$residuals, xlab= “Fitted Values”,
ylab= “Residuals”, main= “Residual Plot vs. Fitted Values”)
Using the plot command, which graphs scatter plots, you must specify the variable for the x-axis first,
then the y-axis variable. xlab and ylab will label the x and y axes and main will supply a label for the
entire graph. Leaving any of these portions out will not affect the graph, except that the label will be
missing.
We can also use plot() to graph the residuals against factors. To simplify these instructions, we’re going
to assume you have not attached either the dataset or the model.
> plot(dataset$factor1, model$residuals, xlab= “Factor: 1”, ylab
“Residuals”, main “Residuals Plot vs. Factor 1”)
> plot(dataset$factor2, model$residuals, xlab= “Factor: 2”, ylab
“Residuals”, main “Residuals Plot vs. Factor 2”)
After you’ve got these plots, repartition the screen so nothing weird happens in the display when you do
further analysis.
> par(mfrow=c(1,1))
Step 10: Bartlett’s Test for Constant Variances and Tukey’s HSD
To do Bartlett’s Test, we can use the command bartlett.test(). Remember, at this point in this tutorial,
we’re assuming you’ve already loaded your dataset. We will assume you have not attached anything.
If your variables are factors, you must have also specified them as such with the as.factor() command.
> bartlett.test(var1~var2, data=dataset)
This will test variable 1 against variable 2. The output will be:
# Bartlett test of homogeneity of variances
#
#data: var1 by var2
#Bartlett's K-squared = 1.1508, df = 3, p-value = 0.7648
We can do Tukey’s HSD comparison using the command tukeyhsd(aov()).
> tukeyHSD(aov(y~var1*var2, dataset))$var1:var2
As before, you’re fitting a linear model,
and the syntax we showed in Step 6
applies here. If you want to test an
interaction term with Tukey’s HSD, make
sure to include it here.
This specifies that you want to test the
interaction term between var1 and
var2. To test a main effect, use $var1,
and repeat the entire command line to
test a different main effect.
Step 11: Randomized Complete and Balanced Incomplete Block Designs
R currently has no specialized functions for working with blocking designs. To work with RCBDs and
BIBDs, you should construct your .csv or .txt table so that you have treatment and block as factors. You
can then construct a linear model with the lm() command.
> blockmodel = lm(y~treatment+block, data=dataset)
> anova(blockmodel)
Again, you must decide on the proper
model to use here. R cannot decide
This will return an output like the one below.
for you what terms should be
included in the model.
#Analysis of Variance Table
#
#Response: y
#
Df Sum Sq Mean Sq F value
Pr(>F)
#trt
3 11.667 3.8889 5.9829 0.0414634 *
#block
3 66.083 22.0278 33.8889 0.0009528 ***
#Residuals 5 3.250 0.6500
Step 12: Split Plot and other Nested Data Designs
Using R to analyze split plot designs is a more complicated than the previous blocking designs. Let’s
assume we have the following experimental model that we
Coating
want to analyze.
Temp.
Sec. Sec. Sec. Sec.
Bar (⁰C)
1
2
3
4
6 metals bars are heated to one of three temperatures,
1
350
1
3
2
4
with two bars for each temperature. After cooling, the bars
2
360
2
3
1
4
are divided into four sections and each section is coated
3
370
2
4
3
1
with one of four salt solutions. The electrical resistance of
each section is then measured, with one measurement per
4
370
4
1
2
3
section.
5
360
2
3
1
4
6
350
3
1
4
2
Using this layout, each Bar is a whole plot, with the four
sections each being a split-plot. Our model for this analysis is:
𝑅𝑒𝑠𝑖𝑠𝑡𝑎𝑛𝑐𝑒𝑖𝑗𝑘 = 𝑊ℎ𝑜𝑙𝑒(𝑇𝑖 + 𝑃(𝑖)𝑗 ) + 𝑆𝑝𝑙𝑖𝑡(𝐶𝑘 + 𝐶𝑇𝑖𝑘 + 𝐶𝑃(𝑖)𝑗𝑘 + 𝑒𝑟𝑟𝑜𝑟, where T=temp, C=coating,
and P=plot.)
To test this, we use the aov() command.
>summary(aov(resistance~(temp+coating)^2+Error(plot/temp),
data=dataset))
Remember, this means the model
includes the main effects and all 2nd
order interaction terms. Refer to step
6 if you need to review the syntax for
linear models.
This tells the model that the error
term is the plot (bar) nested within
the Temperature factor.
This will provide you with the following output.
summary(aov(resistance))
#Error: plot
#
Df Sum Sq Mean Sq F value Pr(>F)
#temp
2 26519
13260
2.755 0.209
#Residuals 3 14440
4813
#
#Error: Within
#
Df Sum Sq Mean Sq F value Pr(>F)
#coating
3
4289 1429.7 11.480 0.00198 **
#temp:coating 6
3270
545.0
4.376 0.02407 *
#Residuals
9
1121
124.5
This is commonly labeled the
“Between Plots” or the “Whole Plot”
ANOVA. It shows the error term used
for F tests is the model term “Plot”.
This is the “Within Plots” or “Split
Plot” ANOVA. Here the error term is
the interaction effect of Plot*Coating.
If you try to fit the model using a the lm() command followed by anova(), as below,
>model=lm(resistance~(temp+coating)^2, data=dataset)
>anova(model)
You would receive the following output:
#Response: res
#
Df Sum Sq Mean Sq F value
Pr(>F)
#temp
2 26519.2 13259.6 10.2256 0.002557 **
#coating
3 4289.1 1429.7 1.1026 0.386016
#temp:coating 6 3269.7
545.0 0.4203 0.851799
#Residuals
12 15560.5 1296.7
As you can see, when we used a standard linear model, R assumed we were providing data for a
balanced complete block design, or a complete factorial design.
Addendum: There is a way to use the lm() command to do ANOVA with split-plot design, (and in the next
step, nested models). It is more complicated than using the aov() command, however it returns a
different ANOVA table which may be useful for educational purposes. We are using the same example
as above.
> model=lm(resistance~temp*coating + plot$in$temp +
coating:plot$in$temp, dataset)
> anova(model)
Your output will look like:
#Response: resistance
#
Df
#temp
2
#coating
3
#temp:coating
6
#temp:plot
3
#temp:coating:plot
9
#Residuals
0
Sum Sq Mean Sq F value Pr(>F)
26519 13260
4289
1429.7
3270
545
14440 4813
1121
124.5
0.000
The disadvantages of this method for analysis are obvious. Because it attempts to use the errors, (which
are irretrievable) for the F-tests, none can be completed. We would argue that the syntax is rather more
confusing as well. However, when compared with the output from using the aov() command, it is easy to
see where different terms come from, (i.e. that “residuals” in the Between Section is actually the plot
term nested within the temp term). The fact that nested terms, (plot within temp), are portrayed in R as
interaction terms might also supply insight into the confounded nature of nested terms.
Finally, other models which rely on nested data, (such as repeated measures experiments) use this same
command line. The important addition for nested data is the inclusion of Error term in the aov()
command line, which you will use to specify the between error term.
Step 13: Fractional Factorials
Conducting analysis of fractional factorial models with R uses syntax we’ve already learned, by fitting a
linear model with lm() and evaluating that model with anova(). As with other steps, the first part of
analyzing fractional factorials with R is making sure that your model is well constructed. With fractional
factorials, that also means knowing your generator.
The first step is labeling your factors. For example, let’s say we have a fractional factorial of 2(6-2) with
resolution=IV, and we make our generators 123 and 234. If we used the labels in our dataset, which are
complete words like “temperature”, the output for the ANOVA we’re about to do would be filled with
text and confusing, so first let’s rename our factors.
>
>
>
>
>
>
>
A=factor1
B=factor2
C=factor3
D=factor4
E=A*B*C
F=B*C*D
rep=block
Remember to replace “factor1” with
whatever your factor is named. Also,
we’re assuming that the dataset has
been attached.
Then we fit our linear model with lm().
> model= lm(y~(A+B+C+D+E+F)^6 + rep, dataset)
Then we conduct our ANOVA analysis with the anova() command.
> anova(model)
And the output will look something like the following.
#Analysis of Variance Table
#
#Response: data2$Hangtime
#
Df Sum Sq Mean Sq
F value
Pr(>F)
#A
1 4.5942 4.5942 196.8658 1.020e-14
#B
1 0.2201 0.2201
9.4294 0.0045078
#C
1 27.2255 27.2255 1166.6322 < 2.2e-16
#D
1 6.3438 6.3438 271.8368 < 2.2e-16
#E
1 0.0276 0.0276
1.1806 0.2858819
#F
1 4.4105 4.4105 188.9920 1.738e-14
#rep
2 0.0732 0.0366
1.5690 0.2248739
#A:B
1 0.3942 0.3942
16.8926 0.0002819
#A:C
1 1.0063 1.0063
43.1208 2.895e-07
#A:D
1 1.1563 1.1563
49.5484 7.980e-08
#A:E
1 1.7442 1.7442
74.7411 1.210e-09
#A:F
1 0.1938 0.1938
8.3046 0.0072398
#B:D
1 0.0063 0.0063
0.2700 0.6071123
#B:F
1 0.2626 0.2626
11.2506 0.0021694
#A:B:D
1 0.6888 0.6888
29.5157 6.863e-06
#A:B:F
1 1.1876 1.1876
50.8875 6.185e-08
#Residuals 30 0.7001 0.0233
***
**
***
***
***
***
***
***
***
**
**
***
***