Developed Mathematical Ability and the ITDAM Test

08/11/01
The Influence of Test Title on Performance
in a Test of Developed Abilities in Mathematics
William Hodgson
Paper presented to the British Educational Research Association Annual
Conference, University of Leeds, 13-15 September 2001
Abstract
The effect of different test titles (mathematics versus problem solving, and hard versus
easy) on performance, and on gender differences in performance, was investigated in a
randomised field experiment. The sample consisted of about 4600 students, aged 16 to 18,
studying for the GCE A level examination in the United Kingdom. Test
title did appear to affect performance. Students generally achieved higher scores when
the test was described as hard rather than easy. The effect of test title differed with
gender and student ability. More able female students scored higher when the test was
described as hard and problem solving, but not when the test was described as hard
and mathematics. These more able females actually achieved the same score on the
test described as hard and problem solving as that achieved by males on the test
described as easy and problem solving, a gender difference of zero.
A. Design of the Experiment
The instrument used to measure developed ability in mathematics was the
‘International Test of Developed Ability in Mathematics’ (ITDAM). The ITDAM test
consisted of 35 questions, and 25 minutes were allowed for its completion. Each
question was multiple choice, and many were similar in presentation, style, and
content to some GCE A-level mathematics questions. The
ITDAM test is considered in more detail in Section F.
The design of the experiment involved the modification of the title and description of
the ITDAM test to give four different versions. The administration and content of the
test remained unchanged. The front page of the test was modified to give the
impression that the test was, respectively, mathematical and hard (MH),
mathematical and easy (ME), problem solving and hard (PH), or problem
solving and easy (PE). For example, the rubric for the test described as
mathematical and hard was amended to that shown below:
Test Description
TEST OF DEVELOPED ABILITY
Mathematical Test
The questions are meant to be quite HARD, but you are not expected to finish the test.
Please read the questions carefully.
The words mathematical and problem solving were exchanged and the words hard and
easy were exchanged to give the four different versions of the test.
It was anticipated that describing the tests as mathematical and hard would depress
achievement of both sexes, but more so for females, and hence that the gender
difference would increase. It was also expected that this description might suppress the
use of iterative/estimation approaches to solving the problems, on the premise that
students might regard such processes as inappropriate to a
‘hard test of mathematics’. Iterative/estimation techniques might be more likely to be used when
the test was described as problem solving.
The experiment took place in 1993. Students were administered one of the four
versions of the same test as part of a process of data collection for the A Level
Information System.
B. Sample
The following analysis involved about 5000 students who sat A-level examinations in
1993. After listwise deletion of cases with missing values, the sample size was reduced
to 4606. The four alternative versions of the test were randomly administered to the
sample. The numbers of students taking each version of the test are shown in Table 1a.
Table 1a: Sample Composition
Title Key Words          Sample
Maths/Hard                 1148
Maths/Easy                 1290
Problem Solving/Hard       1024
Problem Solving/Easy       1144
C. Results
Table 2a summarises the scores achieved on each of the four versions of the test. The
overall average score on the test was 16.1 with scores varying between 15.8 and 16.5
for the different versions of the test. Males achieved higher scores than females on
average. The effect of different titles on performance on the ITDAM test was
examined by an analysis of variance (ANOVA) on test performance, with sex and the
two test title dimensions as main effects. The results of the experiment are presented in Table 2b.
Table 2a: Results on the ITDAM by Test Title and Sex
              Maths/Hard   Maths/Easy   Problem Solving/Easy   Problem Solving/Hard
Population          16.1         16.0                   15.8                   16.5
Males               17.8         17.8                   17.6                   18.1
Females             14.5         14.6                   14.3                   15.1
Table 2b: Test Performance by Title and Sex
Source of Variation     Sum of Squares     DF   Mean Square        F       p
Hard-Easy (H)                    79.27      1         79.27     2.57    0.11
Math-Prob (I)                     9.69      1          9.69     0.31    0.56
Sex (S)                          11450      1         11450   370.75   <0.01
Interaction (H*I)               126.90      1        126.90     4.10    0.04
Interaction (H*S)                 1.89      1          1.89     0.06    0.81
Interaction (I*S)                 0.92      1          0.92     0.03    0.86
Interaction (H*I*S)              16.60      1         16.60     0.54    0.46
Error                           141998   4598         30.88
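The F ratios in Table 2b can be checked from the reported sums of squares and degrees of freedom. The following is a minimal sketch of that arithmetic, not the analysis actually run for the paper; the dictionary keys and function names are illustrative choices, and it assumes the usual fixed-effects calculation in which each mean square is divided by the error mean square.

```python
# Sums of squares and degrees of freedom as reported in Table 2b.
effects = {
    "H": (79.27, 1),
    "I": (9.69, 1),
    "S": (11450.0, 1),
    "H*I": (126.90, 1),
    "H*S": (1.89, 1),
    "I*S": (0.92, 1),
    "H*I*S": (16.60, 1),
}
error_ss, error_df = 141998.0, 4598


def mean_square(sum_of_squares, df):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


def f_ratios(effects, error_ss, error_df):
    """F for each effect = effect mean square / error mean square."""
    mse = mean_square(error_ss, error_df)
    return {name: mean_square(ss, df) / mse for name, (ss, df) in effects.items()}


ratios = f_ratios(effects, error_ss, error_df)
for name, f in ratios.items():
    print(f"{name:6s} F = {f:.2f}")
```

The recomputed ratios agree with the tabulated F values to within rounding.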
There was no three-way interaction effect between sex and the two title dimensions on
the front of the test, but both the effect of sex and the interaction of the test title
dimensions were significant (p<0.01 and p=0.04 respectively). Males achieved higher
scores, on average, than females (17.8 and 14.6 respectively, corresponding to an
effect size of -0.58 and a raw score difference of 3.2 or about 20%). The magnitude of
the gender difference was similar to that reported previously for the ITDAM test (see
Hodgson, 1995). Figure 1, below, shows the scores on each of the four tests for the
sample as a whole.
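The gender difference figures quoted above can be reproduced with a short calculation. This is a sketch under an assumption: the paper does not state how the effect size was derived, so the code standardises the raw difference by the square root of the error mean square from Table 2b (30.88) and takes the difference as females minus males. The function and variable names are illustrative.

```python
import math


def standardised_difference(mean_a, mean_b, error_mean_square):
    """Return (mean_a - mean_b) divided by sqrt(error mean square).

    Assumption: the error mean square is used as the variance estimate.
    """
    return (mean_a - mean_b) / math.sqrt(error_mean_square)


female_mean, male_mean = 14.6, 17.8  # means quoted in the text
overall_mean = 16.1                  # overall average score on the test
error_mean_square = 30.88            # error mean square, Table 2b

effect_size = standardised_difference(female_mean, male_mean, error_mean_square)
raw_gap = male_mean - female_mean
relative_gap = raw_gap / overall_mean

print(f"effect size  = {effect_size:.2f}")
print(f"raw gap      = {raw_gap:.1f} points")
print(f"relative gap = {relative_gap:.0%} of the overall mean")
```

Under these assumptions the calculation gives an effect size of about -0.58, a raw difference of 3.2 points and a relative difference of about 20%.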
Figure 1: Scores on the ITDAM Test by Title
[Figure: mean test score (vertical axis, 15.6 to 16.6) for each test title: Easy-Prob, Easy-Math, Hard-Math, Hard-Prob.]
Both sexes achieved higher scores on the version of the ITDAM test that was
described as testing problem solving and being hard. Performance on the other three
versions of the test was about half a mark lower than the hard-problem solving test,
equivalent to an effect size of about 0.1, a small but significant difference,
corresponding to about 20% of the gender difference.
Student Ability and Test Performance
The results of the ITDAM test were used to band students into three categories, less
able (those scoring less than 13), intermediate ability (those scoring 13-18) and more
able (those scoring 19 or more). The effect of the two test title dimensions was then
considered by ability level. The rationale for this was to consider whether the effect of test
title might differ for students with different levels of developed mathematical ability.
Table 3a gives a breakdown of the sample composition and Table 3b summarises the
scores achieved by each of the three ability groups.
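The banding rule described above can be written as a small function. This is an illustrative sketch only; the function name and band labels are chosen here for the example, and the cut-offs are the ones stated in the text (below 13, 13 to 18, and 19 or more).

```python
def ability_band(score):
    """Assign an ITDAM score (0 to 35) to one of the three ability bands."""
    if score < 13:
        return "less able"
    if score <= 18:
        return "intermediate"
    return "more able"


# Example: band a handful of hypothetical scores.
for example_score in (9, 13, 18, 19, 30):
    print(example_score, ability_band(example_score))
```

Because the bands are defined from the same test score that serves as the outcome, band membership and performance are not independent, which is worth bearing in mind when reading the ability-level analyses.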
Table 3a: Sample Composition by Ability
              Maths/Hard      Maths/Easy      Problem Solving/Easy   Problem Solving/Hard   Totals
Less Able     331             380             304                    293                      1308
Middle 30%    450             511             428                    475                      1864
More Able*    367 (248/119)   399 (248/151)   292 (191/101)          376 (242/134)            1434
Totals        1148            1290            1024                   1144                     4606
* The numbers in brackets refer to the more able males and females respectively.
Table 3b: Results on the ITDAM by Test Title and Ability
              Maths/Hard         Maths/Easy         Problem Solving/Easy   Problem Solving/Hard
Less Able     9.5                9.6                9.7                    9.6
Middle 30%    15.4               15.4               15.4                   15.4
More Able*    23.0 (23.4/22.1)   22.9 (23.2/22.5)   22.6 (22.9/21.9)       23.3 (23.5/22.9)
* The numbers in brackets refer to the more able males and females respectively.
The effect of the test title was more pronounced for the most able students,
corresponding to about 30% of the sample, and test title made essentially no difference
to the performance of the other two groups of students.
Table 3c: Test Performance by Title, Sex and Ability Level
Source of Variation      Sum of Squares   DF   Mean Square       F       p
Hard-Easy (H)                      8.99    1          8.99    1.41    0.24
Math-Prob (I)                      3.61    1          3.61    0.57    0.45
Sex (S)                          196.96    1        196.96   30.85   <0.01
Ability Band (G)                 104910    2         52455    8215   <0.01
Interaction (H*I)                 18.53    1         18.53    2.90    0.09
Interaction (H*S)                 19.64    1         19.64    3.08    0.08
Interaction (H*G)                 26.40    2         13.20    2.07    0.13
Interaction (I*S)                  1.87    1          1.87    0.29    0.59
Interaction (I*G)                  2.11    2          1.06    0.17    0.85
Interaction (S*G)                 95.53    2         47.77    7.48   <0.01
Interaction (H*I*S)                3.74    1          3.74    0.59    0.44
Interaction (H*I*G)               37.84    2         18.92    2.96    0.05
Interaction (H*S*G)                8.86    2          4.43    0.69    0.50
Interaction (I*S*G)               10.78    2          5.39    0.84    0.43
Interaction (H*I*S*G)             17.26    2          8.63    1.35    0.26
More detailed consideration of the results for the most able 30% of students showed
that performance was greatest on the hard-problem solving test (23.3) and least on the
easy-problem solving test (22.6); see Table 3b. A four-way analysis of variance,
comprising the two test title dimensions, sex and ability level, was carried out and
these results are given in Table 3c. There was no four-way interaction effect. The
three-way interaction effect between the two test title dimensions and ability was significant
(p=0.05).
The final analysis considered gender differences in test performance amongst the most
able 30% of students. Test results for this group of students, broken down by sex, are
also presented in Table 3b. Inspection of these results suggested that for the two test
title dimensions there was a main effect for ‘hard’ for males, but for females there was
an interaction effect between the hard-easy dimension and the maths-problem solving
dimension.
Table 4a gives the results of a three-way ANOVA, between the two test title
dimensions and sex, for only the most able 30% of students. The interaction effect
between the two test title dimensions was significant (p=0.04), but the interaction with
sex was not (p=0.23). Table 4b gives, for the same group of students, the results of
two separate ANOVAs for males and females respectively. For males there was a
main effect for the hard-easy dimension that approached significance (p=0.08), and for females the
interaction effect between the two test title dimensions was significant (p=0.03).
Table 4a: Test Performance for the Most Able 30%
Source of Variation     Sum of Squares     DF   Mean Square       F       p
Hard-Easy (H)                    34.19      1         34.19    2.75    0.09
Math-Prob (I)                     0.04      1          0.04   0.003    0.96
Sex (S)                         250.76      1        250.76   20.14   <0.01
Interaction (H*I)                53.87      1         53.87    4.33    0.04
Interaction (H*S)                 3.14      1          3.14    0.25    0.61
Interaction (I*S)                 3.04      1          3.04    0.24    0.62
Interaction (H*I*S)              18.27      1         18.27    1.47    0.22
Error                            17753   1426         12.45
Table 4b: Test Performance by Title and Sex
Males                   Sum of Squares     DF   Mean Square       F       P
Hard-Easy (H)                    41.47      1         41.47    3.11    0.08
Math-Prob (I)                     1.70      1          1.70    0.13    0.72
Interaction (H*I)                 6.72      1          6.72    0.51    0.48
Error                            12309    925         13.31
Females                 Sum of Squares     DF   Mean Square       F       P
Hard-Easy (H)                     6.40      1          6.40    0.59    0.44
Math-Prob (I)                     1.45      1          1.45    0.13    0.72
Interaction (H*I)                51.88      1         51.88    4.77    0.03
Error                             5444    501         10.87
Figures 2 and 3 illustrate the gender difference just described for the most able 30% of
the sample as raw scores and effect sizes respectively. The effect sizes are relative to
the mean score of the most able group of students (23.0) and used 5.0 as a
conservative estimate of the standard deviation.
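The effect sizes described here follow directly from the subgroup means in Table 3b. The sketch below applies the stated rule (score minus the reference mean of 23.0, divided by a standard deviation of 5.0) to those means; the dictionary layout and names are illustrative, and the rounding to two decimal places is a presentation choice made for this example.

```python
REFERENCE_MEAN = 23.0  # mean score of the most able group, as stated in the text
ASSUMED_SD = 5.0       # conservative standard deviation, as stated in the text

# Mean scores of the more able males and females by test title (Table 3b).
more_able_means = {
    "Maths/Hard": {"males": 23.4, "females": 22.1},
    "Maths/Easy": {"males": 23.2, "females": 22.5},
    "Problem Solving/Easy": {"males": 22.9, "females": 21.9},
    "Problem Solving/Hard": {"males": 23.5, "females": 22.9},
}


def effect_size(score):
    """Distance of a score from the reference mean in assumed SD units."""
    return (score - REFERENCE_MEAN) / ASSUMED_SD


effect_sizes = {
    title: {sex: round(effect_size(mean), 2) for sex, mean in by_sex.items()}
    for title, by_sex in more_able_means.items()
}

for title, by_sex in effect_sizes.items():
    print(f"{title:22s} males {by_sex['males']:+.2f}   females {by_sex['females']:+.2f}")
```

On this scale the more able females range from about -0.22 (problem solving and easy) to about -0.02 (problem solving and hard), the latter matching the value for males on the test described as problem solving and easy.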
If the main effect for the hard-easy dimension, seen for males, is taken as the general
pattern then the anomalous result would appear to be the performance of females on
the test described as hard and mathematics. Describing a test as ‘hard’ might
encourage able students to ‘give of their best’; if so, this effect was lost for females
when the test was described as mathematical.
From Figure 2 it can be seen that the scores achieved by males on the test described as
‘problem solving-easy’ were the same as the scores achieved by females on the test
described as ‘problem solving-hard’. It would appear that the difference of one word
has had an effect of similar magnitude to the gender effect. The use of the word hard
linked to problem solving appeared to increase the performance of the more able
females by about 1 point. However, when problem solving-hard is changed to
mathematics-hard, performance drops back.
Figure 2: ITDAM Test Score by Title and Gender for More Able Students
[Figure: mean test score (vertical axis, 21.5 to 24) for males and females on each test title: Easy-Prob, Easy-Math, Hard-Math, Hard-Prob.]
Figure 3: Effect Size by Title and Gender for More Able Students
[Figure: effect size (vertical axis, -0.25 to 0.2) for males and females on each test title: Easy-Prob, Easy-Math, Hard-Math, Hard-Prob.]
D. Conclusions
These results suggest that the performance of males and females is susceptible to
factors such as test presentation. In particular, the mere choice of words, such as
hard or easy and mathematics or problem solving, might influence test performance.
The main conclusions from this experiment are:
1. For some groups of students, performance on the test of developed ability
in mathematics varied with the words used to describe the test. The effect
differed with gender and student ability.
2. Students performed slightly better when the test was described as
‘hard’ as opposed to ‘easy’. The magnitude of the effect was of the
order of 0.7 points, or 4%, between the tests described as ‘problem solving easy’
and ‘problem solving hard’.
3. For both male and female students, the impact of test title was greater for the
more able students.
4. For these more able students, the overall magnitude of the effect of test title
was greater for female students than for male students. The magnitude of the
effect was about 0.6 points, or 3%, for males and about 1 point, or 5%, for
females.
5. There appeared to be a general effect whereby students achieved higher scores
when the test was described as hard as opposed to easy. The more able female
students were a partial exception: they showed the same general trend,
except that their performance appeared to be depressed when the test was described as
hard and mathematics.
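The point and percentage figures in conclusions 2 and 4 can be recovered from Tables 2a and 3b. The sketch below is one reading that reproduces the quoted values: it assumes each difference is taken between the ‘problem solving easy’ and ‘problem solving hard’ versions and expressed as a percentage of the ‘easy’ score. The paper does not state the base used for the percentages, so that choice, like the function name, is an assumption.

```python
def gain(easy_score, hard_score):
    """Return (difference in points, difference as a % of the easy score)."""
    points = hard_score - easy_score
    percent = 100 * points / easy_score
    return points, percent


# Problem solving easy versus problem solving hard.
all_students = gain(15.8, 16.5)       # whole sample, Table 2a
more_able_males = gain(22.9, 23.5)    # more able males, Table 3b
more_able_females = gain(21.9, 22.9)  # more able females, Table 3b

for label, (points, percent) in [
    ("all students", all_students),
    ("more able males", more_able_males),
    ("more able females", more_able_females),
]:
    print(f"{label:18s} {points:.1f} points, about {percent:.0f}%")
```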
E. Discussion
The effect of test title could have been greater for the more able students because the
test was relatively difficult. These more able students would be better equipped to
tackle the questions and the test title might influence the students’ motivation.
However, weaker students might not have the necessary skills and hence test title
might not have such a marked effect on performance in the test for these students.
It may be that use of the word ‘hard’ rather than ‘easy’ had a general motivating
effect. If so, this appeared to be equivalent to about 1 point for the more able females.
The positive effect of describing the test as hard was almost negated for the females
when the test was described as mathematics rather than problem solving. The lower scores on
the test described as mathematics and hard, compared with problem solving and hard,
suggest that the term mathematics had a de-motivating effect on the females in
particular.
Educationally this finding might be important; however, it should be stressed that the
magnitude of the effect was relatively small. Nonetheless, if mind-set can affect
performance on a 25-minute test, what might be its cumulative effect over years of
study? The ways in which teachers present and describe mathematical problems,
including the language they use, might have a profound effect on the performance of
students on such problems.
The findings of this experiment would suggest that outcomes on any ability test must
be considered in relation to the students’ perception of the test. In particular,
interpretation of the results on tests of (developed) ability should consider the possible
effect of the test itself on student achievement behaviours. This might be of particular
importance for tests of ‘developed mathematical ability’ given to older students, such
as the ITDAM test.
References
Hodgson, W. (1994) ‘Gender Differences in Mathematics and Science: A Study of
GCE Advanced Level Examinations in the United Kingdom’, PhD thesis,
School of Education, University of Newcastle upon Tyne.