ANOVA Calculations in R

STAT 461
01/27/2015
Class Example
Here is the ANOVA table from class:
Source      DF   Sum of Squares   Mean Square   F      p-value
Treatment    3            588.2         196.1   18.6   2.59e-05
Error       15            158.2          10.5
Total       18            746.4
There is enough evidence to suggest a difference in sales among the cereal packaging designs.
How to find the p-value:
1-pf(18.59,3,15)
## [1] 2.585817e-05
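The p-value above can be traced back to the table entries. Here is a minimal sketch that rebuilds the mean squares, the F statistic, and the p-value from the degrees of freedom and sums of squares in the class table. The variable names (df.trt, ss.trt, f.class, and so on) are chosen for illustration and are not from class.

```r
# Values taken from the class ANOVA table
df.trt <- 3
df.err <- 15
ss.trt <- 588.2
ss.err <- 158.2

# Mean squares are sums of squares divided by their degrees of freedom
ms.trt <- ss.trt / df.trt
ms.err <- ss.err / df.err

# F statistic and the area to its right under the F(3, 15) distribution
f.class <- ms.trt / ms.err
p.class <- pf(f.class, df.trt, df.err, lower.tail = FALSE)

round(f.class, 2)
p.class
```

Because this uses the unrounded F statistic, the p-value may differ from the one above in the later decimal places.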
Lab Example
Is there a difference in GPA among freshmen, sophomores, juniors, and seniors? We want to look at a
random sample of students from a college and test with α = 0.05.
Data Entry
When analyzing data for ANOVA, the best way to enter it is in a vertical format. That is, there are two
columns. One holds the variable of interest, and the other holds the grouping variable (called
the factor). Here is how you would enter the data by hand into R.
gpa <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,
3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3,
2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,
3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31)
class <- rep(c("Fresh","Soph","Junior","Senior"),c(7,10,9,8))
grades <- data.frame(class,gpa)
head(grades)
##   class  gpa
## 1 Fresh 2.34
## 2 Fresh 2.38
## 3 Fresh 3.31
## 4 Fresh 2.39
## 5 Fresh 3.40
## 6 Fresh 2.70
The other format is called the wide format. This is where you have a column for each group. Most statistical
packages don't handle this type of data well, and it is impossible to use with more complicated designs. We do
have some tools in R to help, however. Note that each column has to be the same length, so shorter groups
are padded with NA (I hate data in the wide format).
Fresh <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,NA,NA,NA)
Soph <- c(3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3)
Junior <- c(2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,NA)
Senior <- c(3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31,NA,NA)
grades2 <- data.frame(Fresh,Soph,Junior,Senior)
grades2.1 <- stack(grades2)
head(grades2.1)
colnames(grades2.1) <- c("gpa","class")
Note that stack() names the columns values and ind by default, since R doesn't really know what to call them.
You can change them with colnames(), but I would recommend entering data in the vertical format originally.
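One thing to watch for with stacked data is that the NA padding is carried along. The sketch below, which assumes you want only the real observations, drops the padded rows with na.omit() and checks the group sizes. The names wide and long are placeholders for illustration.

```r
# Wide-format data, padded with NA so every column has length 10
Fresh  <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,NA,NA,NA)
Soph   <- c(3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3)
Junior <- c(2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,NA)
Senior <- c(3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31,NA,NA)
wide <- data.frame(Fresh, Soph, Junior, Senior)

# stack() keeps the NA padding, so this has 4 * 10 = 40 rows
long <- stack(wide)
colnames(long) <- c("gpa", "class")
n.before <- nrow(long)

# Drop the padded rows to recover the real observations
long <- na.omit(long)
n.after <- nrow(long)

c(before = n.before, after = n.after)
table(long$class)
```

The group counts should match the 7, 10, 9, and 8 used in the vertical format.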
Plotting data
The first step in the analysis is to plot the data. There are two main choices in this case: box plots and dot
plots (strip charts).
boxplot(gpa~class,data=grades,main="Student GPA")
[Figure: box plots of gpa by class, titled "Student GPA". Groups appear in alphabetical order: Fresh, Junior, Senior, Soph.]
stripchart(gpa~class,data=grades,main="Student GPA",vertical=TRUE)
[Figure: vertical strip chart of gpa by class, titled "Student GPA". Groups appear in alphabetical order: Fresh, Junior, Senior, Soph.]
Does it look like there are differences in the means?
Data Hint
Sometimes, the alphabetical order of factors is not what we want. Here is how to fix that in R.
grades$class <- ordered(grades$class,levels=c("Fresh","Soph","Junior","Senior"))
boxplot(gpa~class,data=grades,main="Student GPA",horizontal=TRUE,cex.axis=.6)
[Figure: horizontal box plots of gpa by class, titled "Student GPA". Groups now appear in the order Fresh, Soph, Junior, Senior.]
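The ordered() call creates an ordered factor. If you only need to control the display order, factor() with an explicit levels argument is an alternative. This is a sketch of that alternative, not what the handout uses, and class.f is a name made up for the example.

```r
class <- rep(c("Fresh","Soph","Junior","Senior"), c(7,10,9,8))

# Default: levels are sorted alphabetically
default.levels <- levels(factor(class))
default.levels

# Specify the level order explicitly, without making the factor ordered
class.f <- factor(class, levels = c("Fresh","Soph","Junior","Senior"))
levels(class.f)
is.ordered(class.f)
```

Either version gives plots and tables in class-year order.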
Sums of Squares
Calculating the grand mean is easy. So is finding $SS_{total}$.
Y.bar <- mean(grades$gpa)
Y.bar
## [1] 2.905
SS.total <- sum((grades$gpa-Y.bar)^2)
SS.total
## [1] 4.94385
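Since the sample variance is the total sum of squares divided by N − 1, var() gives a quick way to check this number. A minimal sketch, with SS.total.check as a name chosen for illustration:

```r
gpa <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,
         3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3,
         2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,
         3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31)
N <- length(gpa)

# Total sum of squares, computed from the definition
SS.total <- sum((gpa - mean(gpa))^2)

# Same quantity recovered from the sample variance
SS.total.check <- (N - 1) * var(gpa)

c(SS.total, SS.total.check)
```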
Recall that $SS_{total} = SS_{trt} + SS_{error}$. $SS_{trt}$ is the between-group sum of squares, while $SS_{error}$ is the
within-group sum of squares.
In R, we have a command called tapply that can find group means in a very efficient manner. tapply applies a
function to a variable, grouping by some other variable. We can also get the counts, in case the design is
unbalanced.
Y.bar.j <- tapply(grades$gpa,grades$class,mean)
Y.bar.j
##    Fresh     Soph   Junior   Senior
## 2.694286 3.055000 2.853333 2.960000
n.j <- tapply(grades$gpa,grades$class,length)
n.j
##  Fresh   Soph Junior Senior
##      7     10      9      8
Given these calculations, it is easier to find $SS_{trt}$ than it is to find $SS_{error}$. I may have had that backwards
in class, but as you can see below, the calculation is simple in R.
SS.trt <- sum(n.j*(Y.bar.j-Y.bar)^2)
SS.err <- SS.total - SS.trt
SS.trt
## [1] 0.5840286
SS.err
## [1] 4.359821
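The error sum of squares can also be computed directly as the squared deviations of each observation from its own group mean. This sketch assumes the same data as above and uses ave(), which is not used elsewhere in this handout, to line up each observation with its group mean. The names group.mean and SS.err.direct are made up for the example.

```r
gpa <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,
         3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3,
         2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,
         3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31)
class <- factor(rep(c("Fresh","Soph","Junior","Senior"), c(7,10,9,8)),
                levels = c("Fresh","Soph","Junior","Senior"))

# Group mean for every observation (ave() uses FUN = mean by default)
group.mean <- ave(gpa, class)

# Within-group sum of squares, computed directly
SS.err.direct <- sum((gpa - group.mean)^2)

# Same quantity by subtraction, as done above
Y.bar   <- mean(gpa)
Y.bar.j <- tapply(gpa, class, mean)
n.j     <- tapply(gpa, class, length)
SS.trt  <- sum(n.j * (Y.bar.j - Y.bar)^2)
SS.err  <- sum((gpa - Y.bar)^2) - SS.trt

c(direct = SS.err.direct, by.subtraction = SS.err)
```

The two values should agree up to floating-point rounding.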
Mean Squares
Mean squares are the sums of squares divided by their degrees of freedom. With N total observations and r
groups, the formulas for the degrees of freedom are:
$df_{total} = N - 1$
$df_{trt} = r - 1$
$df_{error} = df_{total} - df_{trt} = N - r$
Then we can find the mean squares. We don’t find mean square total, as that is not used in our calculations.
MS.trt <- SS.trt/3       # df for treatment: r - 1 = 4 - 1
MS.err <- SS.err/(34-4)  # df for error: N - r = 34 - 4
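The degrees of freedom are typed in by hand above. As a sketch of an alternative, they can be computed from the data so the same code works for a different number of groups or observations. The names N, r, df.trt, and df.err are chosen here for illustration.

```r
gpa <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,
         3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3,
         2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,
         3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31)
class <- rep(c("Fresh","Soph","Junior","Senior"), c(7,10,9,8))

N <- length(gpa)            # total number of observations
r <- length(unique(class))  # number of groups

df.trt <- r - 1
df.err <- N - r

c(df.trt = df.trt, df.err = df.err, df.total = N - 1)
```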
Test Statistic
The test statistic for analysis of variance is the ratio of mean square between over mean square within. Or
$$F = \frac{MS_{trt}}{MS_{error}} = \frac{SS_{trt}/(r-1)}{SS_{error}/(N-r)}$$
f.stat <- MS.trt/MS.err
f.stat
## [1] 1.33957
Is this large?
F-distribution
It can be shown that, when the null hypothesis is true, the ratio of these two mean squares follows what is
called an F-distribution. Image courtesy of Wikipedia and copyright IkamusumeFan, 2014 (http://en.wikipedia.org/wiki/File:F_pdf.svg).
Figure 1: F-Distributions
As the means of the groups get further apart, the value of the F-statistic increases. "How large is large"
depends on the number of observations (through the denominator df, N − r) and the number of groups
(through the numerator df, r − 1). You can find the p-value for the hypothesis test by finding the area to the
right of the test statistic under the appropriate F-distribution.
pf(f.stat,3,30)
# Wait, that is to the left
## [1] 0.719899
1-pf(f.stat,3,30)
## [1] 0.280101
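pf() can also return the right-tail area directly through its lower.tail argument, which avoids the subtraction. The sketch below also uses qf() to find the cutoff for α = 0.05, which is one way to answer "is this large?". The F statistic is typed in from the output above, and p.value and f.crit are names made up for the example.

```r
f.stat <- 1.33957  # F statistic from the output above

# Area to the right of the test statistic under F(3, 30)
p.value <- pf(f.stat, 3, 30, lower.tail = FALSE)
p.value

# Critical value: the F value with 5% of the area to its right
f.crit <- qf(0.95, 3, 30)
f.crit

# TRUE would mean the test statistic falls in the rejection region
f.stat > f.crit
```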
Putting it all together
Here is how to answer the questions for the homework.
$H_0: \mu_F = \mu_{So} = \mu_J = \mu_{Sr}$
$H_A$: At least one pair of means is not equal
boxplot(gpa~class,data=grades,main="Student GPA")
[Figure: box plots of gpa by class, titled "Student GPA". Groups in the order Fresh, Soph, Junior, Senior.]
It appears that freshmen may have lower GPAs than the other classes.
Source      DF   Sum of Squares   Mean Square   F           p-value
Treatment    3        0.5840286     0.1946762   1.3395699   0.2801
Error       30        4.3598214     0.1453274
Total       33        4.94385
There is not enough evidence ($F = 1.340$, $df_n = 3$, $df_d = 30$, $p = 0.2801$) to suggest that at least one pair of
mean GPAs is different.
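The hand calculations can be checked against R's built-in aov() function, which is not used elsewhere in this handout. This is a minimal sketch assuming the same grades data as above. The names fit and tab are chosen for illustration.

```r
gpa <- c(2.34,2.38,3.31,2.39,3.40,2.70,2.34,
         3.26,2.22,3.26,3.29,2.95,3.01,3.13,3.59,2.84,3,
         2.80,2.6,2.49,2.83,2.34,3.23,3.49,3.03,2.87,
         3.31,2.35,3.27,2.86,2.78,2.75,3.05,3.31)
class <- factor(rep(c("Fresh","Soph","Junior","Senior"), c(7,10,9,8)),
                levels = c("Fresh","Soph","Junior","Senior"))
grades <- data.frame(class, gpa)

# Fit the one-way ANOVA and print the ANOVA table
fit <- aov(gpa ~ class, data = grades)
summary(fit)

# Pull individual entries out of the table (first row is the treatment)
tab <- summary(fit)[[1]]
tab[["Df"]]
tab[["F value"]][1]
tab[["Pr(>F)"]][1]
```

The treatment row should reproduce the F statistic and p-value found by hand, and the Residuals row corresponds to the Error row in the table above.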