Analysis of Variance

Dr. Mohammed Alahmed
Ph.D. in BioStatistics
[email protected]
(011) 4674108
Introduction
• Analysis of variance (ANOVA), as the name
implies, is a statistical technique that is intended to
analyze variability in data in order to infer the
inequality among population means
• The purpose of ANOVA is much the same as that of the t-tests presented in the preceding sections.
• The goal is to determine whether the mean
differences that are obtained for sample data are
sufficiently large to justify a conclusion that there
are mean differences between the populations from
which the samples were obtained.
• The difference between ANOVA and the t-tests is
that ANOVA can be used in situations where
there are more than two means being
compared, whereas the t-tests are limited to
situations where only two means are involved.
• If more than two means are compared, repeated
use of the independent-samples t-test leads to an
overall Type I error rate that is higher than the α
level set for each individual t-test (the multiple
comparison problem).
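To make the multiple comparison problem concrete, the sketch below estimates the chance of at least one false rejection when every pair of k means is tested separately at α = 0.05. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so the numbers are only a rough guide; the function names are illustrative.

```python
def n_pairwise_comparisons(k: int) -> int:
    """Number of distinct pairs among k group means."""
    return k * (k - 1) // 2


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one Type I error) over n_tests tests.

    Assumption: the tests are independent, so this is an approximation
    for pairwise t-tests that share groups.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


alpha = 0.05
for k in (3, 4, 5):
    m = n_pairwise_comparisons(k)
    rate = familywise_error_rate(alpha, m)
    print(f"k = {k}: {m} pairwise tests, overall Type I error rate ~ {rate:.3f}")
```

Even with only three groups the overall error rate is well above the nominal 0.05, which is the motivation for a single ANOVA F-test.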
The basic ANOVA situation
• Variables in ANOVA:
– Dependent variable is metric.
– Independent variable(s) is nominal with two or more
levels – also called treatment, manipulation, or factor.
• One-Way ANOVA:
– Two variables: 1 Categorical, 1 Quantitative
– Main Question: Does the mean of the quantitative
variable depend on which group (given by the
categorical variable) the individual is in?
– If categorical variable has only 2 values:
• 2-sample t-test
– ANOVA allows for 3 or more groups.
ANOVA Hypotheses
• The null hypothesis:
– The means for all groups are the same (equal).
H0: μ1 = μ2 = … = μk
• The alternative hypothesis:
– The means are different for at least one pair of groups.
H1: μi ≠ μj for at least one pair of groups i and j
• If we reject H0, how do we determine which means are significantly different?
• Before we begin, we must consider
the assumptions required to use
ANOVA
– The underlying distributions of the
populations are normal.
– The variance of each group is equal
(This is critical for ANOVA).
• If all of the groups had
the same means, the
distributions for all of the
populations would look
exactly the same
(overlaid graphs)
• Now, if the means of the
populations were
different, the picture
would look like this.
Notice that the variability
between the groups is
much greater than within
a group
Sources of variance
• When we take samples from each group,
there will be two sources of variability:
– Within group variability - when we sample
from a group there will be variability from
person to person in the same group
– Between group variability – the difference
from group to group
• If the between group variability is large relative to the
within group variability, the means of the groups are
likely not to be the same
• We can use the two types of variability to
determine if the means are likely different
• Blue arrow: within group, red arrow: between
group
• Notice that when the distributions are well separated,
the between group variability is much greater
than the within group variability
Notation for ANOVA
All groups:
• n = total number of individuals in all groups combined
• k = number of groups
• x̄ = mean of the entire data set
Group i has:
• ni = number of individuals in group i
• xij = value for individual j in group i
• x̄i = mean for group i
• si = standard deviation for group i
Sources of variability
• ANOVA measures two sources of variation in the
data and compares their relative sizes:
1. variation BETWEEN groups:
• for each data value, look at the
difference between its group
mean and the overall mean: (x̄i − x̄)²
2. variation WITHIN groups:
• for each data value, look at
the difference between that
value and the mean of its group: (xij − x̄i)²
F-statistic
• The F-statistic assesses whether you can conclude that
statistically significant differences are present somewhere
among the group means.
• The F-statistic is the ratio of the between group variation
to the within group variation:
F = Between / Within = MSB / MSW
where MSB is the mean square between groups and MSW is the
mean square within groups.
• This test statistic is compared to an F-table with k-1 and n-k
degrees of freedom
• A large F is evidence against H0, since it indicates that there
is more difference between groups than within groups.
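As an illustration of the ratio, the sketch below computes F from per-group summary statistics (ni, x̄i, si) in the notation introduced earlier. The summary values are hypothetical, the function name is illustrative, and si is assumed to be the sample standard deviation (n − 1 denominator).

```python
from math import fsum


def f_from_summary(ns, means, sds):
    """Return (F, df_between, df_within) from group sizes, means and SDs.

    Assumption: each SD is a sample standard deviation (n - 1 denominator).
    """
    k = len(ns)
    n = sum(ns)
    grand_mean = fsum(n_i * m_i for n_i, m_i in zip(ns, means)) / n
    # Between: group means around the overall mean, weighted by group size.
    ssb = fsum(n_i * (m_i - grand_mean) ** 2 for n_i, m_i in zip(ns, means))
    # Within: pooled squared deviations recovered from each group's variance.
    ssw = fsum((n_i - 1) * s_i ** 2 for n_i, s_i in zip(ns, sds))
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, k - 1, n - k


# Hypothetical summary statistics for three groups of four individuals.
f_stat, df_between, df_within = f_from_summary(
    ns=[4, 4, 4], means=[10.0, 12.0, 17.0], sds=[2.0, 2.0, 2.0]
)
print(f"F = {f_stat:.2f} with {df_between} and {df_within} degrees of freedom")
```

The resulting F would then be compared with the F-table value for k − 1 and n − k degrees of freedom.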
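ANOVA Table
Source of variation   SS    df    MS    F         p-value
Between               SSB   k-1   MSB   MSB/MSW
Within                SSW   n-k   MSW
Total                 SST   n-1
where MSB = SSB/(k-1) and MSW = SSW/(n-k).

With i indexing groups and j indexing individuals within a group (as in the notation for ANOVA), the total sum of squares splits into two parts:

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(\bar{x}_i-\bar{x}\right)^2$$

Total sum of squares (SST) = Within group sum of squares (SSW) + Between group sum of squares (SSB)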
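The partition of the total sum of squares can be checked numerically. The sketch below uses small hypothetical data (not from the slides) and computes SST, SSW and SSB directly from their definitions.

```python
from math import fsum, isclose

# Hypothetical toy data: three groups of three observations each.
groups = [[1, 2, 3], [2, 4, 6], [5, 7, 9]]

all_values = [x for g in groups for x in g]
grand_mean = fsum(all_values) / len(all_values)
group_means = [fsum(g) / len(g) for g in groups]

# Total: every value around the overall mean.
sst = fsum((x - grand_mean) ** 2 for x in all_values)
# Within: every value around its own group mean.
ssw = fsum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
# Between: each group mean around the overall mean, once per individual.
ssb = fsum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

print(f"SST = {sst:.3f}, SSW = {ssw:.3f}, SSB = {ssb:.3f}")
print("SST equals SSW + SSB:", isclose(sst, ssw + ssb))
```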
Example
• A researcher wishes to try three different
techniques to lower the blood pressure of
individuals diagnosed with high blood
pressure.
• The subjects are randomly assigned to
three groups; the first group takes
medication, the second group exercises,
and the third group follows a special diet.
• After four weeks, the reduction in each
person’s blood pressure is recorded.
• The data are:
Medication   Exercise   Diet
10           6          5
12           8          9
9            3          12
15           0          8
13           2          4
• At α = 0.05, test the claim that there is no difference
among the means
• The hypotheses to be tested are:
H0: μ1 = μ2 = μ3
H1 : At least one mean is different from the others
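As a cross-check on software output, the F-statistic for these data can be computed by hand. The sketch below is one way to do it with the Python standard library. The p-value uses a closed form for the F distribution that is valid only when the numerator degrees of freedom equal 2 (as here, with k = 3 groups); the function names are illustrative.

```python
from math import fsum

data = {
    "Medication": [10, 12, 9, 15, 13],
    "Exercise": [6, 8, 3, 0, 2],
    "Diet": [5, 9, 12, 8, 4],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fsum(x for g in groups for x in g) / n
    means = [fsum(g) / len(g) for g in groups]
    ssb = fsum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = fsum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k)), k - 1, n - k


def f_upper_tail_df1_2(f, df2):
    """P(F > f) for 2 and df2 degrees of freedom.

    Assumption: numerator df is exactly 2; the formula does not
    generalise to other numerator degrees of freedom.
    """
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


alpha = 0.05
f_stat, df1, df2 = one_way_anova(list(data.values()))
p_value = f_upper_tail_df1_2(f_stat, df2)
reject_h0 = p_value < alpha

print(f"F({df1}, {df2}) = {f_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```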
Analyze → Compare Means → One-Way ANOVA
In the One-Way ANOVA menu window, place “Bp” in
the Dependent List box and “Treatment” in the Factor
box.
• To complete the process described in the text, select OK in this
window without doing anything else. The resulting output is the
ANOVA table shown below.
Since p < α, the null hypothesis is rejected.
• As mentioned in the text, this result allows us only to conclude
that at least one (true) treatment mean differs from the others;
we can say nothing about the relative sizes of the (true)
treatment means
• Further tests can be performed to determine which treatment
mean(s) differ and, consequently, determine which (true)
treatment mean(s) might have the highest (or lowest) values
• Verifying the Assumptions for the One-Way ANOVA F-test
• The assumptions for the one-way ANOVA
F-test, as expressed in the text, are:
1. The populations from which the samples
were obtained must be normally or
approximately normally distributed.
2. The samples must be independent of one
another.
3. The variances of the populations must be
equal.
Assessing the normality and constant
variance assumptions
Analyze → Descriptive Statistics → Explore
This table provides results of the test of the following hypotheses:
H0 : The population random variable is normally distributed
H1: The population random variable is not normally distributed
An ideal Normal QQ Plot will have
plotted points that approximately
follow a straight line.
If the error bars are close to each
other in length, as appears to be the
case here, one might expect the
constant variance assumption to be
approximately valid.
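For readers without the software, the idea behind the Normal QQ plot can be sketched in a few lines. The example below pools the residuals (each value minus its group mean) from the blood pressure data, pairs them with standard normal quantiles, and reports how linear the relationship is. Pooling the residuals and using plotting positions (i − 0.5)/n are assumptions of this sketch; statistical packages may plot each group separately and use a different plotting-position convention.

```python
from math import fsum, sqrt
from statistics import NormalDist

data = {
    "Medication": [10, 12, 9, 15, 13],
    "Exercise": [6, 8, 3, 0, 2],
    "Diet": [5, 9, 12, 8, 4],
}

# Residuals: each observation minus the mean of its own group, sorted.
residuals = sorted(x - fsum(g) / len(g) for g in data.values() for x in g)
n = len(residuals)

# Standard normal quantiles at plotting positions (i - 0.5) / n (assumed convention).
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = fsum(xs) / len(xs), fsum(ys) / len(ys)
    sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = fsum((x - mx) ** 2 for x in xs)
    syy = fsum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(theoretical, residuals)
for z, e in zip(theoretical, residuals):
    print(f"normal quantile {z:6.2f}   residual {e:5.1f}")
print(f"QQ correlation = {r:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line; it is an informal summary, not a formal test of normality.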
The second table of use is that of the “Test of Homogeneity of Variances”
shown below
The test to use here is the one that is “Based on Median”. This table
provides results of the test of the following hypotheses:
H0 : The population variances are equal
H1: The population variances are not equal.
The p-value given in the last column is large enough that the
hypothesis of equal variances is not rejected, so the constant
variance assumption may be treated as valid.
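A median-based test of equal variances can be reproduced by hand. The sketch below assumes the “Based on Median” row corresponds to the median-centred Levene test (often called the Brown-Forsythe test): a one-way ANOVA F computed on the absolute deviations of each value from its group median. The p-value again uses the closed form that holds only when the numerator degrees of freedom equal 2.

```python
from math import fsum
from statistics import median

data = {
    "Medication": [10, 12, 9, 15, 13],
    "Exercise": [6, 8, 3, 0, 2],
    "Diet": [5, 9, 12, 8, 4],
}


def levene_median(groups):
    """Return (W, df1, df2): ANOVA F on |x - group median|."""
    z = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(z)
    n = sum(len(g) for g in z)
    grand_mean = fsum(v for g in z for v in g) / n
    means = [fsum(g) / len(g) for g in z]
    ssb = fsum(len(g) * (m - grand_mean) ** 2 for g, m in zip(z, means))
    ssw = fsum((v - m) ** 2 for g, m in zip(z, means) for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k)), k - 1, n - k


w_stat, df1, df2 = levene_median(list(data.values()))
# Closed-form upper tail of F; valid only because df1 == 2 here.
p_value = (1.0 + 2.0 * w_stat / df2) ** (-df2 / 2.0)

print(f"W({df1}, {df2}) = {w_stat:.3f}, p = {p_value:.3f}")
```

A p-value well above 0.05 gives no evidence against equal population variances.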
Verifying the validity of the independence assumption:
The validity of the independence assumption can be difficult to assess. The
best approach is to secure the independence of the samples through
proper sampling and data collection practices.
Pair-wise Comparisons of Treatment
Means
• Where’s the Difference?
– Once ANOVA indicates that the groups do
not all appear to have the same means,
what do we do?
– We can do pairwise comparisons to
determine which specific means are
different, but we must still take into
account the multiple comparison
problem!
Click on Post Hoc
We conclude that there is a significant difference between the
medication group and the exercise group, but no significant difference
between the medication and diet groups or between the exercise and diet groups.
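The slides do not say which post hoc procedure produced this conclusion. As one possibility, the sketch below applies the Scheffé test to the blood pressure data: for each pair, Fs = (x̄i − x̄j)² / [MSW (1/ni + 1/nj)], and Fs/(k − 1) is referred to the F distribution with k − 1 and n − k degrees of freedom. The choice of Scheffé is an assumption, and the p-value formula is valid only because k − 1 = 2 here.

```python
from itertools import combinations
from math import fsum

data = {
    "Medication": [10, 12, 9, 15, 13],
    "Exercise": [6, 8, 3, 0, 2],
    "Diet": [5, 9, 12, 8, 4],
}

k = len(data)
n = sum(len(g) for g in data.values())
means = {name: fsum(g) / len(g) for name, g in data.items()}
ssw = fsum((x - means[name]) ** 2 for name, g in data.items() for x in g)
msw = ssw / (n - k)

alpha = 0.05
results = {}
for a, b in combinations(data, 2):
    diff = means[a] - means[b]
    f_s = diff ** 2 / (msw * (1 / len(data[a]) + 1 / len(data[b])))
    # Scheffe: f_s / (k - 1) follows F(k - 1, n - k) under H0.
    # Closed-form upper tail below holds only for k - 1 == 2.
    p = (1.0 + 2.0 * (f_s / (k - 1)) / (n - k)) ** (-(n - k) / 2.0)
    results[(a, b)] = (f_s, p, p < alpha)

for (a, b), (f_s, p, significant) in results.items():
    verdict = "significant" if significant else "not significant"
    print(f"{a} vs {b}: Fs = {f_s:.2f}, p = {p:.4f} ({verdict})")
```

Under these assumptions only the medication versus exercise comparison is significant at α = 0.05, in line with the conclusion stated above.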