
F73DA2
INTRODUCTORY DATA ANALYSIS
ANALYSIS OF VARIANCE
[Figure: y plotted against x. Regression: x is a quantitative explanatory variable.]

[Figure: y plotted against type, where type is a qualitative variable (a factor).]
Illustration
Company 1: 36 28 32 43 30 21 33 37 26 34
Company 2: 26 21 31 29 27 35 23 33
Company 3: 39 28 45 37 21 49 34 38 44
[Figure: the observed values plotted against company number (1 to 3).]
The explanatory variable is qualitative, i.e. categorical: a factor.
Analysis of variance: linear models for comparative experiments.
[Figure: company plotted against type (1.0 to 3.0), with type treated as numeric.]
Using Factor Commands
► The display is different if “type” is declared as a factor.
[Figure: box plots of company for each level of type (1, 2, 3), the default display when type is a factor.]
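As a minimal sketch in R (entering the company results above as a single vector, with a parallel group indicator):

  company <- c(36,28,32,43,30,21,33,37,26,34,
               26,21,31,29,27,35,23,33,
               39,28,45,37,21,49,34,38,44)
  type <- factor(c(rep(1,10), rep(2,8), rep(3,9)))
  plot(company ~ type)   # with type a factor, R draws one box plot per level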
► We could check for significant differences between two companies using t tests.
► t.test(company1,company2)
► This calculates a 95% confidence interval for the difference between the means. The interval includes 0, so there is no significant difference.
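For example (a sketch, with the first two samples entered by hand):

  company1 <- c(36,28,32,43,30,21,33,37,26,34)
  company2 <- c(26,21,31,29,27,35,23,33)
  t.test(company1, company2)   # two-sample t test; prints a 95% CI for the difference in means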
Instead, use an analysis of variance technique.
Taking all the results together
We calculate the total variation for the system, which is the sum of squared deviations of the individual values about the overall mean, 32.59259. This gives SST = 1470.519.
► We can also work out the sum of squares within each company (about that company's own mean). These sum to 1114.431.
► The total sum of squares must be made up of a contribution from variation WITHIN the companies and variation BETWEEN the companies.
► This means that the variation between the companies equals 1470.519 − 1114.431 = 356.0884.
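These sums of squares can be checked in R (a sketch, assuming company and type as defined above):

  total.mean <- mean(company)                        # 880/27 = 32.593
  sst <- sum((company - total.mean)^2)               # total SS: 1470.519
  ssw <- sum(tapply(company, type,
                    function(y) sum((y - mean(y))^2)))   # within SS: 1114.431
  ssb <- sst - ssw                                   # between SS: 356.088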
► This can all be shown in an analysis of variance table, which has the format:
Source of variation            Degrees of freedom   Sum of squares   Mean squares
Between treatments             k − 1                SSB              SSB/(k − 1)
Residual (within treatments)   n − k                SSRES            SSRES/(n − k)
Total                          n − 1                SST

F = [SSB/(k − 1)] / [SSRES/(n − k)]
For the company data (here k = 3, n = 27):

Source of variation            Degrees of freedom   Sum of squares   Mean squares
Between treatments             k − 1                356.088          SSB/(k − 1)
Residual (within treatments)   n − k                1114.431         SSRES/(n − k)
Total                          n − 1                1470.519
► Using the R package, the command is similar to that for linear regression.
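For example (assuming company and the factor type as defined earlier):

  anova(lm(company ~ type))   # one-way analysis of variance table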
Theory
Data: $y_{ij}$ is the $j$th observation using treatment $i$.

Model:
$$Y_{ij} = \mu_i + \epsilon_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n_i$$
where the errors $\epsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$.

The response variables $Y_{ij}$ are independent, with $Y_{ij} \sim N(\mu + \tau_i, \sigma^2)$.

Constraint: $\sum_i n_i \tau_i = 0$.
Derivation of least-squares estimators:
$$S = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right)^2$$
$$\frac{\partial S}{\partial \mu} = -2 \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right), \qquad \frac{\partial S}{\partial \tau_i} = -2 \sum_{j=1}^{n_i} \left( y_{ij} - \mu - \tau_i \right)$$
Normal equations for the estimators:
$$y_{\cdot\cdot} = n \hat{\mu} \qquad \text{and} \qquad y_{i\cdot} = n_i \hat{\mu} + n_i \hat{\tau}_i$$
(where $y_{\cdot\cdot}$ is the grand total and $y_{i\cdot}$ the total for treatment $i$), with solution
$$\hat{\mu} = \bar{y}, \qquad \hat{\tau}_i = \bar{y}_i - \bar{y}$$
The fitted values are the treatment means: $\hat{Y}_{ij} = \bar{Y}_i$.
Partitioning the observed total variation:
$$\underbrace{\sum_i \sum_j \left( y_{ij} - \bar{y} \right)^2}_{\text{SST}} = \underbrace{\sum_i \sum_j \left( y_{ij} - \bar{y}_i \right)^2}_{\text{SSRES}} + \underbrace{\sum_i n_i \left( \bar{y}_i - \bar{y} \right)^2}_{\text{SSB}}$$
SST = SSB + SSRES
The following results hold (those for SST and SSB, and hence the F result, under $H_0$: all $\tau_i = 0$):
$$\frac{\text{SST}}{\sigma^2} \sim \chi^2_{n-1}, \qquad \frac{\text{SSRES}}{\sigma^2} \sim \chi^2_{n-k}, \qquad \frac{\text{SSB}}{\sigma^2} \sim \chi^2_{k-1}$$
$$\frac{\text{SSB}/(k-1)}{\text{SSRES}/(n-k)} \sim F_{k-1,\, n-k}$$
Back to the example
$$\hat{\mu} = 880/27 = 32.593$$
$$\hat{\tau}_1 = 320/10 - 880/27 = -0.593, \quad \hat{\tau}_2 = 225/8 - 880/27 = -4.468, \quad \hat{\tau}_3 = 335/9 - 880/27 = 4.630$$
Fitted values:
Company 1: 320/10 = 32
Company 2: 225/8 = 28.125
Company 3: 335/9 = 37.222
Residuals:
Company 1: $e_{1j} = y_{1j} - 32$
Company 2: $e_{2j} = y_{2j} - 28.125$
Company 3: $e_{3j} = y_{3j} - 37.222$
$$\text{SST} = 30152 - 880^2/27 = 1470.52$$
$$\text{SSB} = \left( 320^2/10 + 225^2/8 + 335^2/9 \right) - 880^2/27 = 356.09$$
$$\Rightarrow \text{SSRES} = 1470.52 - 356.09 = 1114.43$$
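These shortcut formulas can be reproduced in R (a sketch, assuming company and type as before):

  totals <- tapply(company, type, sum)      # 320, 225, 335
  sizes  <- tapply(company, type, length)   # 10, 8, 9
  sst <- sum(company^2) - sum(company)^2/length(company)      # 1470.52
  ssb <- sum(totals^2/sizes) - sum(company)^2/length(company) # 356.09
  ssres <- sst - ssb                        # 1114.43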
ANOVA table

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F
Between treatments    2                    356.09           178.04         3.83
Residual              24                   1114.43          46.44
Total                 26                   1470.52
Testing $H_0: \tau_i = 0$, $i = 1, 2, 3$ v $H_1$: not $H_0$ (i.e. $\tau_i \neq 0$ for at least one $i$).
Under $H_0$, $F = 3.83$ on 2,24 df.
P-value = $P(F_{2,24} > 3.83) = 0.036$, so we can reject $H_0$ at levels of testing down to 3.6%.
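The P-value can be checked in R:

  pf(3.83, 2, 24, lower.tail = FALSE)   # about 0.036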
Conclusion
Results differ among the three companies (P-value 3.6%).
The fit of the model can be investigated by examining the residuals: the residual for response $y_{ij}$ is
$$e_{ij} = \hat{\epsilon}_{ij} = y_{ij} - \bar{y}_i$$
i.e. just the difference between the response and its fitted value (the appropriate sample mean).
Plotting the residuals in various ways may reveal:
● a pattern (e.g. lack of randomness, suggesting that an additional, uncontrolled factor is present)
● non-normality (a transformation may help)
● heteroscedasticity (error variance differs among treatments; for example it may increase with the treatment mean: again a transformation, perhaps log, may be required)
[Figure: company against type treated as numeric (1.0 to 3.0), as before.]

[Figure: box plots of company by type (levels 1, 2, 3), as before.]
► In this example, samples are small, but one might question the validity of the assumptions of normality (Company 2) and homoscedasticity (equality of variances, Company 2 v Companies 1/3).
[Figure: residuals(lm(company ~ type)) plotted against observation index.]
► plot(residuals(lm(company~type))~fitted.values(lm(company~type)),pch=8)
[Figure: residuals(lm(company ~ type)) plotted against fitted.values(lm(company ~ type)).]
► abline(h=0,lty=2) adds a dashed reference line at zero to the same plot.
10
5
0
-5
residuals(lm(company ~ type))
-10
-15
28
30
32
34
fitted.values(lm(company ~ type))
36
► It is also possible to compare with an analysis using “type” as a quantitative explanatory variable.
► type=c(rep(1,10),rep(2,8),rep(3,9))
► No “factor” command, so type is treated as numeric and a linear regression is fitted.
[Figure: scatter plot of company against type treated as numeric (1.0 to 3.0).]
Note the low R². The fitted equation is company = 27.666 + 2.510 × type.
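This fit can be reproduced in R (a sketch, with type the numeric vector just defined):

  summary(lm(company ~ type))   # coefficients about 27.666 and 2.510; note the low R-squared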
[Figure: box plots of count for each level of spray (A to F).]

[Figure: residuals(lm(count ~ spray)) against fitted.values(lm(count ~ spray)).]
Example
A school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, 6 are used. The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally. One approach to checking if the graders are using the same scale is to randomly assign each grader 50 exams and have them grade.
To illustrate, suppose we have just 24 tests and 3 graders (not 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is on the range 1-5, with 5 being the best, and the scores are reported as:
grader 1: 4 3 4 5 2 3 4 5
grader 2: 4 4 5 5 4 5 4 4
grader 3: 3 4 2 4 5 5 4 4
The 5% cut-off for the F distribution with 2,21 df is 3.467. The null hypothesis cannot be rejected: no significant difference between the graders.
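A sketch of this analysis in R (variable names are illustrative):

  scores <- c(4,3,4,5,2,3,4,5,
              4,4,5,5,4,5,4,4,
              3,4,2,4,5,5,4,4)
  grader <- factor(rep(1:3, each = 8))
  anova(lm(scores ~ grader))   # compare the F statistic with the cut-off below
  qf(0.95, 2, 21)              # 5% critical value, about 3.467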
Another Example
Class    I     II    III   IV    V     VI
        151   152   175   149   123   145
        168   141   155   148   132   131
        128   129   162   137   142   155
        167   120   186   138   161   172
        134   115   148   169   152   141
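In R the analysis might look like this (a sketch; variable names are illustrative):

  marks <- c(151,168,128,167,134,   # class I
             152,141,129,120,115,   # class II
             175,155,162,186,148,   # class III
             149,148,137,138,169,   # class IV
             123,132,142,161,152,   # class V
             145,131,155,172,141)   # class VI
  classno <- factor(rep(1:6, each = 5))
  anova(lm(marks ~ classno))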
Source of variation   df   Sum of squares   Mean squares   F
Between treatments    5    3046.7           609.3          2.54
Residual              24   5766.8           240.3
Total                 29   8813.5
[Figure: the observations (120 to 180) plotted against class number (1 to 6).]
Normality and homoscedasticity (equality of variance) assumptions both seem reasonable.
We now wish to calculate a 95% confidence interval for the underlying common standard deviation $\sigma$, using $\text{SSRES}/\sigma^2$ as a pivotal quantity with a $\chi^2_{24}$ distribution.
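A sketch of the calculation in R, using SSRES = 5766.8 on 24 df from the table:

  ssres <- 5766.8
  ci.var <- ssres / qchisq(c(0.975, 0.025), df = 24)   # 95% CI for sigma^2
  sqrt(ci.var)                                         # 95% CI for sigma, roughly (12.1, 21.6)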
It can easily be shown that class III has the largest sample mean, 165.20, and that class II has the smallest, 131.40. Consider performing a t test to compare these two classes.
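For example (a sketch, with the two class samples read from the table above):

  classII  <- c(152,141,129,120,115)
  classIII <- c(175,155,162,186,148)
  t.test(classII, classIII)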
There is no contradiction between this and the ANOVA results. It is wrong to pick out the largest and the smallest of a set of treatment means, test for significance, and then draw conclusions about the set.
Even if H0: "all equal" is true, the sample means would differ, and the largest and smallest sample means may well differ noticeably.