The Chi-squared test (χ²)

The Chi-squared test is used to test whether there is a significant difference between sets of data, that is,
whether the observed frequencies differ from the frequencies that would be expected if there were no real
difference. For example, we can use it to test whether the type and amount of vegetation differ with
altitude. A common use is to test whether there are significant differences in levels of environmental
quality between areas.
The Chi-squared test can only be used on data which has the following characteristics:
1. The data must be in the form of frequencies counted in a number of groups.
2. Data must be on the interval or ratio scale (i.e. it has a precise numerical value) and can be grouped
   into categories.
3. The total number of observations must be greater than twenty.
4. The expected frequency in any one category must be greater than five.
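The last two conditions can be checked mechanically before the test is run. The following is a minimal Python sketch, not part of the original worksheet. The function name `check_chi_squared_conditions` is hypothetical, and the sketch assumes that every category shares the same expected frequency (the mean of the observed counts), as in the worked example later in this handout.

```python
def check_chi_squared_conditions(observed):
    """Return a list of problems found; an empty list means both checks pass.

    Assumption: the expected frequency of every category is the mean of
    the observed counts.
    """
    problems = []
    total = sum(observed)
    expected = total / len(observed)
    if total <= 20:
        problems.append("total number of observations must be greater than 20")
    if expected <= 5:
        problems.append("expected frequency must be greater than 5")
    return problems


# Observed counts from the worked example (five year groups).
print(check_chi_squared_conditions([16, 24, 40, 26, 39]))  # prints []

# Hypothetical small sample that fails both checks.
print(check_chi_squared_conditions([3, 4, 5]))
```

The first two conditions (frequency data, measured on a suitable scale) depend on how the data were collected and cannot be checked from the numbers alone.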
Method
1. State the hypothesis being tested: there is a significant difference between two or more sample
   groups. It is convention to give a null hypothesis (a negative test), that is, that there is no
   significant difference between the samples.
2. Tabulate the data as shown in the example below. The data being tested for significance is known
   as the ‘observed’ frequency, and the column is headed ‘O’.
3. Calculate the ‘expected’ number of frequencies that you would expect to find. These go in column
   ‘E’.
4. Calculate the Chi-squared statistic using the formula:

   $$\chi^2 = \sum \frac{(O - E)^2}{E}$$

   where χ² is the Chi-squared statistic,
   Σ is the sum of,
   O refers to the observed frequencies, and
   E are the expected frequencies.
5. Calculate the degrees of freedom. This is quite simply one less than the number of categories
   (N) into which the observations are grouped, i.e. N – 1.
6. Compare the calculated figure with the critical values in the significance tables using the
   appropriate degrees of freedom. Read off the probability that the data frequencies you are testing
   could have occurred by chance.
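Steps 3 to 5 of the method can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original worksheet: the function name `chi_squared` and its `expected` parameter are hypothetical. When no expected values are supplied, it assumes that every category shares the mean of the observed counts, and it takes the degrees of freedom as the number of categories minus one, as in the worked example (five year groups, 4 df).

```python
def chi_squared(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a list of observed counts.

    If `expected` is omitted, every category is assumed to share the same
    expected frequency: the mean of the observed counts.
    """
    if expected is None:
        mean = sum(observed) / len(observed)
        expected = [mean] * len(observed)
    if len(expected) != len(observed):
        raise ValueError("observed and expected must have the same length")
    # Sum of (O - E)^2 / E over all categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1  # number of categories minus one
    return statistic, degrees_of_freedom


# Hypothetical counts, chosen only to illustrate the call.
stat, df = chi_squared([12, 8, 10, 10])
print(round(stat, 2), df)  # prints 0.8 3
```

The result still has to be compared with the critical value for the appropriate degrees of freedom (step 6).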
© Pearson Education Ltd 2012. For more information about the Pearson Baccalaureate series please visit
www.pearsonbacconline.com.
Example
A survey asked students what they thought was the world’s most serious problem. Fifty
students in each of five year groups were asked. The number from each group that stated
‘global warming’ is shown below.
Number stating ‘global warming’ as the most serious problem the world is facing
Year 13   Year 12   Year 11   Year 10   Year 9   Total
16        24        40        26        39       145
1. State the hypothesis being tested: there is a significant difference in the number of students in
   each year group stating that global warming is the most serious problem. It is convention to give a
   null hypothesis (a negative test), that is, that there is no significant difference in the number of
   students in each year group stating that global warming is the most serious problem.
2. To work out the expected value, find the average number of students stating ‘global warming’ for the
   five year groups. If there is no significant difference between them, they should all have around the
   same number. In this case, the total of all observations comes to 145 and the mean is therefore 29.
   This becomes the expected value.
3. Tabulate the data, thus:

   Year      Obs.   Exp.   (O–E)   (O–E)²   (O–E)²/E
   Year 13   16     29     -13     169      5.83
   Year 12   24     29     -5      25       0.86
   Year 11   40     29     11      121      4.17
   Year 10   26     29     -3      9        0.31
   Year 9    39     29     10      100      3.45
                                            ∑ = 14.62
4. Degrees of freedom (df) = (N – 1) = (5 – 1) = 4
5. The critical values for 4 df are:

   0.05    0.01
   9.49    13.28
The computed value of 14.62 is higher than both critical values, including the one at the 0.01 (99%)
level of significance. This means that our computed value is statistically significant. Therefore we
reject the null hypothesis and accept the alternative hypothesis: there is a significant difference in
the number of students in different year groups who thought global warming was the world’s most
serious problem.
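The arithmetic in this example can be reproduced with a short Python sketch. It is offered as a check, not as part of the original worksheet. The variable names are illustrative, and the critical values are simply copied from step 5.

```python
observed = {"Year 13": 16, "Year 12": 24, "Year 11": 40, "Year 10": 26, "Year 9": 39}

# Expected value: the mean of the five observed counts (145 / 5 = 29).
expected = sum(observed.values()) / len(observed)

# Chi-squared statistic: sum of (O - E)^2 / E over the five year groups.
chi_sq = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Critical values for 4 df, copied from the worksheet.
critical = {0.05: 9.49, 0.01: 13.28}

print(f"expected = {expected}, chi-squared = {chi_sq:.2f}, df = {df}")
for level, value in critical.items():
    verdict = "significant" if chi_sq > value else "not significant"
    print(f"{level}: critical value {value}, result is {verdict}")
```

Run as written, this reports a statistic of 14.62 with 4 degrees of freedom, which exceeds both critical values.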
NB The next stage is to offer explanations for the results. Remember the statistic is only used as a
means of clarification: it is not an end in itself but a means to help you to explain.
The critical values show the probability that the calculated value of χ² is the result of a chance
distribution. The larger the value of χ², the smaller the probability that the null hypothesis is correct.
df    95%      99%
1     3.84     6.64
2     5.99     9.21
3     7.82     11.34
4     9.49     13.28
5     11.07    15.09
6     12.59    16.81
7     14.07    18.48
8     15.51    20.09
9     16.92    21.67
10    18.31    23.21
11    19.68    24.72
12    21.03    26.22
13    22.36    27.69
14    23.68    29.14
15    25.00    30.58
16    26.30    32.00
17    27.59    33.41
18    28.87    34.80
19    30.14    36.19
20    31.41    37.57
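Reading the table can also be expressed as a simple lookup. The Python sketch below is illustrative and not part of the original worksheet. The names `CRITICAL_VALUES` and `significance` are hypothetical, and the numbers are copied from the table above, so the lookup only covers 1 to 20 degrees of freedom.

```python
# Critical values copied from the table: df -> (95% level, 99% level).
CRITICAL_VALUES = {
    1: (3.84, 6.64), 2: (5.99, 9.21), 3: (7.82, 11.34), 4: (9.49, 13.28),
    5: (11.07, 15.09), 6: (12.59, 16.81), 7: (14.07, 18.48), 8: (15.51, 20.09),
    9: (16.92, 21.67), 10: (18.31, 23.21), 11: (19.68, 24.72), 12: (21.03, 26.22),
    13: (22.36, 27.69), 14: (23.68, 29.14), 15: (25.00, 30.58), 16: (26.30, 32.00),
    17: (27.59, 33.41), 18: (28.87, 34.80), 19: (30.14, 36.19), 20: (31.41, 37.57),
}


def significance(chi_sq, df):
    """Return the strictest tabulated level at which chi_sq is significant.

    Returns "99%", "95%", or None if chi_sq does not exceed the 95% value.
    Raises KeyError if df is outside the tabulated range of 1 to 20.
    """
    at_95, at_99 = CRITICAL_VALUES[df]
    if chi_sq > at_99:
        return "99%"
    if chi_sq > at_95:
        return "95%"
    return None


print(significance(14.62, 4))  # prints 99%
print(significance(10.0, 4))   # prints 95% (hypothetical statistic)
```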
Exercise
Some students completed an environmental quality index for eight parts of an urban area. Their
results are shown below.
Site           EQI (Obs)
Wolvercote     42
Cornmarket     36
St Ebbe's      25
Summertown     40
Cowley         28
Botley         36
St Clements    21
Osney          36
1. State the hypothesis being tested.
2. Work out the χ² statistic.
3. Assess the level of statistical significance from the data.
Answers
1. There is a significant difference in the environmental quality index in the selected areas. It is
   convention to give a null hypothesis (a negative test), that is, that there is no significant
   difference in the environmental quality index in the selected areas.
2. To work out the expected value, find the average environmental quality index for the eight areas. If
   there is no significant difference between them, they should all have around the same value. In this
   case, the total of all observations comes to 264 and the mean is therefore 33. This becomes the
   expected value.
3. Tabulate the data, thus:

   Site           EQI (Obs)   Exp.   (O–E)   (O–E)²   (O–E)²/E
   Wolvercote     42          33     9       81       2.45
   Cornmarket     36          33     3       9        0.27
   St Ebbe's      25          33     -8      64       1.94
   Summertown     40          33     7       49       1.48
   Cowley         28          33     -5      25       0.76
   Botley         36          33     3       9        0.27
   St Clements    21          33     -12     144      4.36
   Osney          36          33     3       9        0.27
                                                      ∑ = 11.82
4. Degrees of freedom (df) = (N – 1) = (8 – 1) = 7
5. The critical values for 7 df are:

   0.05     0.01
   14.07    18.48
The computed value of 11.82 is lower than both critical values, including the one at the 0.05 (95%)
level of significance. This means that our computed value is not statistically significant, even though
there are some variations in environmental quality index between the eight locations. Therefore we
cannot reject the null hypothesis, nor can we accept the alternative hypothesis. This means that there
is no significant difference in the environmental quality index between the eight locations.
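Because every site shares the same expected value of 33, the statistic can be checked in a single step by adding the squared differences first and dividing once. The working below is a supplementary check written in LaTeX, not part of the original answer sheet.

```latex
\chi^2 = \sum \frac{(O - E)^2}{E}
       = \frac{81 + 9 + 64 + 49 + 25 + 9 + 144 + 9}{33}
       = \frac{390}{33}
       \approx 11.82
```

Adding the rounded entries in the last column of the table gives 11.80 instead. The total of 11.82 comes from rounding only once, at the end of the calculation.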