Uncertainties and Bias in PISA
Joachim Wuttke
Copyright (C) Joachim Wuttke 2007. Revised 20jul08.
Download locations:
http://www.messen-und-deuten.de/pisa,
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1159042.
Appeared in:
S. T. Hopmann, G. Brinek, and M. Retzl, eds.:
PISA zufolge PISA / PISA According to PISA.
Hält PISA, was es verspricht? / Does PISA Keep What It Promises?
Reihe Schulpädagogik und Pädagogische Psychologie, Bd. 6.
Wien: Lit-Verlag 2007, ISBN 978-3-8258-0946-1.
Abstract
This is a summary of a detailed report that has appeared in German [31]. It will be shown that statistical significance criteria of OECD/PISA are misleading because several sources of systematic bias and uncertainty are quantitatively more important than the standard errors communicated in the official reports.
1 Introduction
1.1 A huge framework
PISA is a long-term project. Starting in 2000, assessments are carried out every three years. One and a half years are needed for data processing until the international report First Results [15; 16] is published, and it takes even longer until a Technical Report [1; 17] appears and the raw data are made available for independent analysis. Therefore, although the third assessment was carried out in spring 2006, at present (summer 2007) only PISA 2000 and 2003 can be evaluated. In the following, we will concentrate on data from PISA 2003.
PISA 2003 was carried out in 30 OECD countries and in some partner countries.
As data from the latter were not used in the international calibration,
they will be disregarded in the following. The United Kingdom (UK), having
missed several participation criteria, was excluded from tables in the official report. However, data from the UK were fully used in calibrating the international data set and in calculating OECD averages, an inconsistency that is left unexplained [17, p. 128], [16, p. 31].
PISA rules required a minimum sample size of 4500 students per country except in very small countries (Iceland, Luxembourg), where all fifteen-year-old students were recruited. In several countries (Australia, Belgium, Canada, Italy, Mexico, Spain, Switzerland, UK), considerably larger samples of up to nearly 30,000 students [17, p. 168] were drawn so that separate analyses for regions or language areas became possible. For the comparison of the sixteen German länder, an even larger sample of 44,580 students was tested [23, p. 392], of which, however, only 4660 were contributed to the international sample [17, p. 168]. The Kultusministerkonferenz, fearing unauthorized cross-länder comparisons of school types, has imposed deletion of länder codes from the public-use data files.
Therefore, the inner-German comparison shall not be considered further.
The bulk of PISA data comes from a three-hour student testing session.
Some more information is gathered from school principals. The testing session
consists of a two-hour cognitive test and a third hour devoted to questionnaires. The main questionnaire enquires about the students' social background, educational environment, and learning habits. The questionnaire responses certainly constitute a valuable resource for studying the living and learning conditions of fifteen-year-olds in large parts of the world, even though participation
rate gradients introduce some bias.
Compared to the rich empirical material obtained from the questionnaires,
the outcome of the cognitive test is meagre: the official data analysis reduces it to just four scores per student, interpreted as competences in specific subject
domains (reading, mathematics, science, problem solving). Nevertheless, these
results are at the origin of PISA's political impact; communicated as league
tables of national mean values, they made PISA known to the general public,
causing an outright shock in some countries.
While controversy erupted about possible causes of results perceived as unsatisfactory, the three-digit precision of the underlying data has rarely been questioned. This is what shall be done in the present paper. The accuracy and validity of cognitive test results shall be reviewed from a statistical point of view.
1.2 A surprisingly simple measure of competence
As a first step of data reduction, student responses are digitally coded. The Technical Report discusses inter-coder and inter-country variance at length [17, pp. 218-232]; the conclusion that non-uniform coding is an important source of bias and uncertainty is left to the reader.
Some codes are kept secret because national authorities want to prevent certain analyses. In several multilingual countries the test language is kept secret. Except for such deletions, the international raw data set is available for download on the web site of OECD's main contractor ACER (Australian Council for Educational Research).
On the lowest level of data aggregation, single item response statistics (percentages of right, wrong, and invalid responses to one cognitive test item) can
be generated. In the international report not even one such statistic is shown.
PISA is decidedly not a study in Fachdidaktik (math education, science education, ...). PISA does not aim at gathering information about the understanding of scientific concepts or the mastery of specific mathematical techniques.
The data provide almost no handle to understand why students give wrong responses. Only Luxembourg has scanned and published some student solutions
to free-response items [2]; these examples show that students sometimes just
misunderstood what the item writer meant to ask.
PISA is designed to be analysed on a much coarser level. As anticipated above, cognitive test results are aggregated into just four competence values per student. The determination of these values is technically complicated because not all students worked on the same item set: thirteen different booklets were used, and in some countries some items turned out to be invalid because of misprints, translation errors, or other problems. This makes it necessary to establish an item difficulty scale prior to the quantification of student competences. For this calibration an elementary version of item response theory is used.
The importance of this theory tends to be overestimated by defenders and
critics of PISA alike. Misunderstandings are also provoked by the poor documentation in the official reports. For a functional understanding of what PISA measures it is not important that different booklets were used, and it is plainly
irrelevant that somewhere some items were deleted. Glossing over these technicalities, pretending that all students were assigned the same item set, and
ignoring the probabilistic aspect of item response theory, it becomes apparent
what the competence values actually measure: no more and no less than the
number of right responses.
In the mathematics subtest of PISA 2003, a student with a competence of
500 (the OECD mean) has solved about 46% of the items assigned to him. A
competence of 400 (one standard deviation below the mean) corresponds to a correct-response rate of 23%; 600 corresponds to 71% [31, Fig. 4]. Within this span the relationship between competence value and correct-response percentage is nearly linear. The slope is about 4 competence points per 1% of assigned items. This conversion gives the competence scale a much simpler meaning than the official reports allow one to suspect.
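As an illustration, the quoted anchor points can be turned into a rough conversion rule. The following sketch is my own illustration, not part of the official scaling; it simply interpolates linearly between the percentages given above.

    import numpy as np

    # Anchor points quoted in the text for the PISA 2003 mathematics subtest:
    # competence 400 ~ 23% correct, 500 ~ 46%, 600 ~ 71% [31, Fig. 4].
    competence = np.array([400.0, 500.0, 600.0])
    pct_correct = np.array([23.0, 46.0, 71.0])

    def pct_from_competence(c):
        """Rough percent-correct for a given competence value (linear interpolation)."""
        return np.interp(c, competence, pct_correct)

    def competence_from_pct(p):
        """Rough competence value for a given percent-correct."""
        return np.interp(p, pct_correct, competence)

    # Average slope between 400 and 600: about 4.2 competence points per 1% of items.
    print((600 - 400) / (71 - 23), "points per 1% correct")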
1.3 League Tables and Stochastic Uncertainties
Any analysis of PISA data aims at statistical statements about populations.
For instance, an elementary analysis of the cognitive test yields results like
the following: German students have a mean mathematics competence of 503;
the standard deviation is 103; the standard error of the mean is 3.3, and the
standard error of the standard deviation is 1.8 [22, p. 70]. To make sense of such numbers they need to be put into context. The PISA reports provide two kinds of interpretation guidance: verbal descriptions of proficiency levels give a rough idea of what competence differences of 60 or more points signify (see below), and comparisons between different populations insinuate that even differences of only a few points bear a message.
Since the assessment of competences within each of the four subject domains
is strictly one-dimensional, any inter-population comparison implies a ranking.
This explains the primordial role of league tables in PISA: they are not only a vehicle for gaining media attention, but they are deeply rooted in the study's conception (cf. Bottani/Vrignaud [4]). In the official reports almost all statistics are communicated in the form of country league tables. The ranks in these tables,
especially low ranks (and every country has low ranks in some tables), are then
easily turned into political messages. In this way PISA results can be interpreted
without any understanding of what has actually been measured.
Of course not all rank differences are statistically significant. This is duly noted in the official reports. For all statistics, standard errors are calculated. After processing these standard errors through a null-hypothesis testing machinery, some mean value differences are judged significant, others are not. Complicated tables [16, pp. 59, 71, 81, 88, 92, 281, 294] indicate which differences of competence means are significant. It turns out that in some cases 9 points are sufficient to say with confidence that the higher performance by sampled students in one country holds for the entire population of enrolled 15-year-olds
[16, p. 93].
Figure 1: Two Gaussian distributions with mean values differing by 9% of their standard deviation. Such small differences between two populations are considered significant in PISA.
This accuracy is formidable when compared to the intra-country spread of test performances. The standard deviation of the competence distribution is 100 points in the OECD country average and not much smaller within single nations. This is an order of magnitude more than an inter-country difference of 9 points. Figure 1 visualises the situation.
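To put a 9-point difference into perspective, one can ask how often a randomly drawn student from the higher-scoring population would outscore a randomly drawn student from the lower-scoring one. The following sketch is an illustration of my own, assuming the Gaussian approximation of Fig. 1.

    from math import sqrt
    from scipy.stats import norm

    # Two Gaussian populations with standard deviation 100 and means 9 points apart,
    # as in Figure 1. Probability that a random student from the higher-mean
    # population outscores a random student from the other population:
    sd, delta = 100.0, 9.0
    p = norm.cdf(delta / (sd * sqrt(2)))
    print(f"P(student from higher-mean country scores higher) = {p:.3f}")  # about 0.53

The two populations overlap almost completely; the "significant" difference is a statement about means, not about individual students.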
However, significant does not mean reliable, valid, or relevant. Statistical significance is achieved by nothing more than the law of large numbers. The standard errors on which the significance criteria are based account only for two specific sources of stochastic uncertainty: the student sampling and the item-response modelling of student behaviour. By testing more and more students on more and more items these uncertainties can be made arbitrarily small. At some point, however, this effort becomes inefficient because reliability and validity of the study remain limited by non-stochastic sources of bias and uncertainty, which do not decrease with increasing sample size.
Before entering into details, the likelihood of non-stochastic bias can be made plausible by just considering what a mean value difference of 9 competence points actually means: according to the conversion introduced above, 9 points correspond to about 2% of responses. On average, a student is assigned 26 mathematics items. Hence a significant difference between two populations can be brought about by no more than half a right response per student. This suggests that little bias is needed to distort test results far beyond their nominal standard errors.
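The arithmetic behind the half-response figure can be spelled out explicitly; a minimal sketch, using the roughly 4 points per 1% slope from Sect. 1.2 and the 26 assigned items quoted above:

    # 9 competence points, at roughly 4 points per 1% of assigned items,
    # correspond to about 2% of responses; with about 26 mathematics items
    # per student this amounts to roughly half a right response.
    points_per_percent = 4.0    # approximate slope from Sect. 1.2
    items_per_student = 26      # average number of assigned mathematics items

    percent_equivalent = 9.0 / points_per_percent
    responses_equivalent = percent_equivalent / 100 * items_per_student
    print(f"{percent_equivalent:.1f}% of items = {responses_equivalent:.2f} responses per student")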
In the following, I will argue that PISA indeed suffers from severe non-stochastic limitations and that the large sample sizes are therefore uneconomic. The paper is structured as follows: Part 2 describes disparities in student sampling, Part 3 shows that the projection of cognitive test results onto a one-dimensional competence scale is neither technically convincing nor culturally fair, and Part 4 adds some objections on the conceptual level.
2 Sampling disparities
In some countries it is clear from the outset that PISA cannot be representative
(Sect. 2.1). But even in countries where school is obligatory beyond the age of fifteen, low participation rates are likely to introduce some bias. Several imperfections and inconsistencies of the international sample are well documented in the Technical Report. Participation rate requirements were not strict enough to prevent significant bias, and violations of these predefined rules hardly led to any consequences.
2.1 Target population does not serve study objective
PISA claims to measure outcomes of education systems in terms of student achievements. This claim is not consistent with the choice of the target population, namely 15-year-olds enrolled full-time in educational institutions. In some countries (Mexico, Turkey, several partner countries), enrollment is less than 60%. Obviously, PISA says nothing about the outcome of the education systems of these countries.
On the other hand, in many countries school is obligatory beyond the age of 15. At fifteen, the ability for abstract reasoning is still in full development. PISA therefore systematically underestimates the abilities students have near the end of compulsory schooling [16, pp. 3, 298; 17, p. 46].
2.2 Target population too loosely defined: unequal exclusions
Rules allowed countries to exclude up to 5% of the target population: up to 0.5% for organizational reasons and up to 4.5% for intellectual or functional disabilities or limited language proficiency. Exclusions for intellectual disability depended on the professional opinion of the school principal or other qualified staff, a completely uncontrollable source of uncertainty. From the small print in the Technical Report it appears that some countries defined additional criteria: Denmark, Finland, Ireland, Poland, and Spain excluded students with dyslexia; Denmark also students with dyscalculia; Luxembourg, recently immigrated students [17, pp. 47, 65, 169, 183].
Actual student exclusion rates of OECD countries varied from 0.7% to 7.3%.
Canada, Denmark, New Zealand, Spain, and the USA exceeded the 5% limit.
Nevertheless, data from these countries were fully included in all analyses.
For a first-order estimate of the impact caused by the unequal use of student exclusions, let us approximate the competence distribution in every single country by a Gaussian with standard deviation 100, and let us assume that countries exclude with perfect precision the least competent students. Then, exclusion of the weakest 0.7% increases the country's mean by 2.0 points and reduces its standard deviation by 2.5 points, whereas exclusion of 7.3% increases the mean by 15.0 and reduces the standard deviation by 12.8. Of course, exclusion criteria are only correlates of potential test achievement, and they are never applied with perfect precision. When a probabilistic cut-off, spread over a range of ±100 points, is used to model soft exclusion criteria, the bias in the two countries' competence mean difference is reduced to about half of the initial 13 points.
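The two limiting cases of perfectly sharp exclusion can be checked with a standard truncated-normal calculation. The sketch below is my own check of the figures quoted above, not code from the official analysis.

    from scipy.stats import norm, truncnorm

    def sharp_exclusion(p, sd=100.0):
        """Mean shift and remaining SD when the weakest fraction p is cut off exactly."""
        z = norm.ppf(p)                        # cut-off in standard-normal units
        trunc = truncnorm(a=z, b=float("inf"))
        return trunc.mean() * sd, trunc.std() * sd

    for p in (0.007, 0.073):
        shift, new_sd = sharp_exclusion(p)
        print(f"exclude {p:.1%}: mean +{shift:.1f}, SD {new_sd:.1f}")
    # exclude 0.7%: mean +2.0,  SD 97.5  (reduced by about 2.5)
    # exclude 7.3%: mean +15.0, SD 87.2  (reduced by about 12.8)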
In Germany much public attention has been drawn to the percentage of
students in a so-called "risk group" defined by test scores below an arbitrary
threshold. International comparisons of such percentages are particularly unreliable, because they are extremely sensitive to non-uniform exclusion criteria.
2.3 On the fringe of the target population: unequal inclusion of learning-disabled students
The imprecision of exclusion criteria and the resulting bias are further illustrated by the unequal inclusion of students with learning disabilities. Seven countries cater for them in special schools. In these schools, the cognitive test was abridged to one hour, and a special booklet with a selection of easy items was used. In all other countries, student exclusions were decided case by case; but even in countries that used the special booklets, some learning-disabled students could be individually excluded (cf. [21, pp. 149, 158]).
The extent to which students were excluded from the test or given the short booklet varies widely between the seven countries. In Austria, 1.6% of the target population were completely excluded, and 0.9% of the participating students got the short test. In Hungary, 3.9% were excluded, and 6.1% did the short test. Given this discrepancy, it is hardly surprising that Hungarian students who did the short test achieved nearly 200 points more than Austrians.
For another rough estimate of the quantitative impact of unclear exclusion criteria, one can recalculate national means without short tests. If all short tests were excluded from the PISA sample, the mean reading scores of Belgium, Denmark, and Germany would increase by more than 7 points; in doing so, Belgium (1.5% exclusions, 3.0% short tests) would even remain within the 5% limit [17, p. 169]. A bias of the order of 7 points is in perfect accord with the estimate from the previous section.
2.4 Sampling problems: inconsistent input
The sampling is technically difficult. Often, governments do not have consistent databases. Sometimes, this leads to bewildering inconsistencies: in Sweden 102.5% of all 15-year-olds are reported to be enrolled in an educational institution; in the Italian region of Tuscany 107.7%; in the USA, in spite of a strong homeschooling movement, 100.000% [17, pp. 168, 183].
The sample is drawn in two stages: schools within strata (regions and/or school types), and students within schools. As a consequence of this stratification and of unequal participation rates, not all students are equally representative of the target population. To correct for this, students are assigned statistical weights, composed of several factors. The recommended way to calculate these weights is so difficult that international rules foresee three replacement procedures. In Greece, none of the four procedures worked, so that a uniform student weight had to be used [17, p. 52].
2.5 Sampling problems: inconsistent output
In the Austrian sample of PISA 2000, students from vocational schools were underrepresented. In consequence, the country's means were overestimated and other data were distorted as well. The error was only searched for, and found, three years later, when the disappointing outcome of PISA 2003 induced the government (which had changed in the meantime) to order an investigation [14]. In South Tyrol, a change of government is not in sight, and therefore nobody seems interested in verifying accusations that the excellent PISA results of this region are largely due to the underrepresentation of students from vocational schools [25].
In South Korea, only 40.5% of PISA participants are girls. In the 1980s, due to selective abortion and in part possibly also to hepatitis B, the proportion of girls among newborns in South Korea was as low as 47%, perhaps even 46%. Taking this into account, girls are still severely underrepresented in the PISA sample. According to the Technical Report, this cannot be explained by unequal enrollment or test compliance: the enrollment rate is 99.94%, the school participation rate 100%, the student participation rate 98.81%. Probably the sampling scheme was inappropriate. This conclusion is also supported by an anomalous distribution of birth months.
2.6 Insufficient response rates
Rules required a school response rate of 85%, within-school response rates of 25%, and a country-wide student response rate of 80% [17, pp. 48-50]. The United Kingdom breached more than one criterion, which led to its superficial disqualification. Canada profited from an illogical rule according to which initial response rates above 65% could be cured by replacement schools without the need of reaching 85%; the case was settled by negotiation [17, p. 238]. With 64.9%, the USA missed the required initial school response rate by a narrow margin, and the response from replacement schools was overwhelmingly negative, bringing the participation rate to no more than 68.1%. Nevertheless, US data were fully included in all analyses (note: the USA contribute 25% of OECD's budget).
Non-response can cause considerable bias because the propensity of school principals and students to partake in the testing is likely to be correlated with the potential outcome. Quantitative estimates are difficult because the international data base contains no information at all about those who refused the test. Nevertheless, there is ample indirect evidence that the correlation is quite high. To cite just one example: in Germany, schools with a student response of 100% had a mean math score of 553. Schools with participation below 90% achieved only 476 points. Even if the latter number is subject to some uncertainty (discussed at length in [31]), the strong correlation between student ability and test compliance is beyond any doubt.
In the official analysis, statistical weights provide a first-order correction for the between-school variation of response rates: when schools refuse to participate, the weight of other schools from the same stratum is increased accordingly. Similarly, in schools with low student response rates, the participating students are given higher weights.
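Schematically, a non-response adjustment of this kind inflates the base weight by the inverse of the response rate within the relevant unit. The sketch below uses my own notation; it is not the exact PISA weighting formula, which involves several further factors.

    def nonresponse_adjusted_weight(base_weight, n_sampled, n_participating):
        """Schematic non-response adjustment: participating units absorb the
        weight of refusing units from the same school or stratum."""
        return base_weight * n_sampled / n_participating

    # Example: if 35 students were sampled in a school but only 28 took the test,
    # each participant's weight is inflated by 35/28 = 1.25.
    print(nonresponse_adjusted_weight(1.0, 35, 28))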
However, these corrections do not cure within-school correlations between students' latent abilities and their propensity to partake in the test. In the absence of data from absent students, the possible bias can only be roughly estimated: in some countries, the student response rate is more than 15% lower than in others. Assuming very conservatively that the latent ability of the missing students is only half a standard deviation below the true national average, one finds that the absence of these students increases the measured national average by 8.8 points.
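The 8.8-point figure follows from elementary bookkeeping. A minimal sketch of the calculation, under the assumptions just stated (15% of students missing, their latent ability 50 points, i.e. half a standard deviation, below the true national average):

    # The true national mean mu is a mixture of present and missing students:
    #   mu = (1 - f) * mu_present + f * (mu - deficit)
    # Solving for the measured mean gives mu_present = mu + f * deficit / (1 - f).
    f = 0.15          # fraction of students missing from the test
    deficit = 50.0    # assumed latent-ability deficit of the missing students (0.5 SD)

    bias = f * deficit / (1 - f)
    print(f"measured national mean exceeds true mean by {bias:.1f} points")  # 8.8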
2.7 Gender-dependent response rates
In many other countries girls are overrepresented in the PISA sample. The discrepancy is largest in France, with 52.6% girls in PISA against an estimated 48.9% among 15-year-olds: compared to the age cohort, the PISA sample has more than 7% too many girls and more than 7% too few boys. Insofar as this is due to different enrollment, it reinforces the argument of Sect. 2.1. Otherwise, the most likely explanation is a gender-dependent propensity to participate in the testing.
2.8 Doubts about data transmission: missing missing responses
Normally, some students do not respond to all questions of the background questionnaire. Moreover, some students leave between the cognitive test and the questionnaire session. In Poland, however, such missing data are missing: there is not a single student who responded to fewer than 25 questionnaire items, and there are 7 items to which not a single student failed to respond. Unless this anomaly is explained otherwise, one must suspect that booklets with missing data have been suppressed.
3 Ignored dimensions of the cognitive test
PISA's competence scale depends on the assumption that all items from one subject domain measure essentially one and the same latent ability. In reality, any test outcome is also influenced by factors that cannot be subsumed under a subject-specific competence. While there is no generally accepted way to indicate the degree of multi-dimensionality of a test [9], simple first-order estimates allow one to demonstrate its impact: non-competence dimensions cause an amount of arbitrariness, uncertainty, and bias in PISA's competence measure that is by no means negligible when compared to the purely stochastic official standard errors.
3.1 Elimination of disturbing items
The evidence for multidimensionality to be presented in the following sections is all the more striking given that the cognitive items actually used in PISA have been preselected for unidimensionality: submissions from participating countries were streamlined by professional item writers, reviewed by national subject matter experts, tested with students in think-aloud interviews, tested in a pre-pilot study in a few countries, tested in a field trial in most participant countries, rated by expert groups, and selected by the consortium [17, pp. 20-30].
Only one third of the items that had reached the field trial were finally used in the main test. Items that did not fit into the idea that competence can be measured in a culturally neutral way on a one-dimensional scale were simply eliminated. Field test results remain unpublished, although one could imagine an open-ended analysis providing valuable insight into the diversity of education outcomes. This adds to Olsen's observation [19, p. 5] that in PISA-like studies the major portion of information is thrown away.
However, the strong preselection did not prevent seriously flawed items from being used in the main test: in the analysis of PISA 2000, the item Continent Area Q1 had to be disqualified, in 2003 Room Numbers Q1. Furthermore, several items had to be disqualified in specific countries.
3.2 Unfounded models
In PISA, a probabilistic psychological model is used to calibrate item difficulties and to estimate student competences. This model, named after Georg Rasch, is the most elementary incarnation of item response theory. It assumes that the probability of a correct response depends only on the difference of the student's competence value and the item's difficulty value. Mislevy [13] calls this attempt to explain problem-solving ability in terms of a single, continuous variable a caricature, based in 19th century psychology. The model does not even admit the possibility that some items are easier in one subpopulation than in another. The reason for its usage in PISA is neither theoretical nor empirical, but pragmatic: only one-dimensional models yield unambiguous rankings.
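For reference, the Rasch model can be stated in one line. In the usual notation (my transcription of the verbal description above), the probability that a student with competence value \theta solves an item with difficulty value \beta is

    P(\mathrm{correct} \mid \theta, \beta) = \frac{1}{1 + \exp(-(\theta - \beta))},

which indeed depends on the two values only through their difference \theta - \beta.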
Taking the Rasch model literally, there is no way to estimate the competence of students who solved all items or none: for them, the test has been too easy or too difficult, respectively. In PISA, this problem is circumvented by enhancing the probability of medium competences through a Bayesian prior, arbitrarily assumed to be a Gaussian. As distributions of achievement and psychometric measures are never Gaussian [12], this inappropriate prior causes bias in the competence estimates (Molenaar in [6, p. 48]), especially at extreme values [30]. This further undermines statements about risk groups with particularly low competence values.
3.3 Failure of the Rasch model
Various mathematical criteria have been developed to assist in the decision whether or not the Rasch model reasonably approximates an empirical data set. It appears that only one of them has been used to check the outcome of the PISA main test: an unexplained item infit mean square [17, pp. 123, 278]. A much more sensitive way to test the goodness of fit is a visual inspection of appropriate plots [8, p. 66]. An item characteristic or score curve is a plot of correct-response percentages as a function of competence values, each data point representing a quantile of examinees. In the Technical Report [17, p. 127] one single item characteristic is shown, an atypical one that agrees rather well with the Rasch model.
According to the model, all item characteristics from one subject domain should have strictly the same shape; the only degree of freedom is a horizontal shift, driven by the model's only item parameter, the difficulty. This is clearly inconsistent with the variety of shapes exhibited by the four item characteristics in Fig. 2. Whereas Water Q3b discriminates quite well between more or less competent students, the other three items have deficiencies that cannot be described without additional parameters.
The characteristic of Chair Lift Q1 has almost a plateau at low competence values. This is the typical signature of guessing. On the other hand, Freezer Q1 saturates at less than 35%. This indicates that many students did not find out the intention of the testers. Low discrimination strengths as in South Rainea Q2 may have several reasons: different difficulties in different subpopulations, different difficulties for different solution strategies (cf. Meyerhöfer [11]), qualified guessing, weak correlation of the latent ability measured here with the one measured by the majority of this domain's items.

Figure 2: Some item characteristics that show pronounced deviations from the Rasch model. Solid curves in (a) are fits with a 2-parameter model that accounts for different discrimination. The 4-parameter fit in (b) additionally models guessing and misunderstanding.
The solid lines in Fig. 2 show that satisfactory fits of the empirical data are possible when the Rasch model is extended by parameters that allow for a variable discrimination strength, for guessing, and for misunderstanding. Such multi-parameter item-response models still contain a linear shift parameter that may be interpreted as the item difficulty. However, best-fit estimates of this parameter deviate by typically ±30 points from the official Rasch difficulties [31, Fig. 11]. This model dependence of item difficulty estimates is not compatible with a one-dimensional ranking of items as is needed for the construction of proficiency levels (Sect. 4.1). Furthermore, as soon as one admits more than one item parameter, any student ranking becomes arbitrary because of the ad-hoc anchoring of the difficulty and competence scales.
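To make the extension concrete: one common parameterisation of such a multi-parameter item characteristic adds a discrimination a, a guessing floor c, and an upper asymptote d to the Rasch difficulty b. The sketch below is written in my notation; it is not the consortium's fitting code.

    import numpy as np

    def item_characteristic(theta, b, a=1.0, c=0.0, d=1.0):
        """Expected correct-response probability as a function of latent ability theta.

        theta and b are in logits, the natural Rasch units; PISA's 500/100
        reporting scale is a linear transformation of this scale.
        b : difficulty (horizontal shift), the only item parameter of the Rasch model
        a : discrimination (slope); a != 1 already violates the Rasch assumption
        c : lower asymptote, modelling guessing (plateau at low ability)
        d : upper asymptote below 1, modelling systematic misunderstanding
        """
        return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 7)
    print(item_characteristic(theta, b=0.0))                        # pure Rasch curve
    print(item_characteristic(theta, b=0.0, a=0.6, c=0.2, d=0.8))   # flat, guessing, ceiling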
The first data point of the characteristics of South Rainea and Chair Lift clearly lies below the fit curves: the weakest 4% of participants perform worse than modelled. This may be due to a lack of cooperation: yet another dimension that is not contained in elementary item-response theory. It may also be due to the inappropriateness of the Gaussian population model.
3.4 Between-booklet variance
The use of different test booklets makes it possible to employ a total of 165 different items, though every single student works on no more than 60 of them. This reduces the dependence of test results on the arbitrary choice of items. At the same time, it allows us to get an idea of how strong this dependence actually is. Calculating mathematics competence means for groups of students who have worked on the same booklet, inter-booklet standard deviations between 4 (Hungary) and 18 (Mexico) points are found. The largest difference occurs in the USA: students who worked on booklet 2 were estimated to have a math competence of 444, whereas those who worked on booklet 10 achieved 512 points. Eliminating either booklet 2 or booklet 10 would respectively increase or decrease the overall national mean by about three points. This variance only reflects the arbitrariness in choosing items from a pool that is already quite homogeneous due to the procedures described above (Sect. 3.1). Cultural bias in the submission, selection, and adaptation of items may have a far stronger impact.
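The three-point figure can be checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the thirteen booklet groups are equally large and that the national mean lies roughly midway between the two extreme booklet means; neither assumption is taken from the official report.

    # Removing one of 13 equally large booklet groups whose mean lies delta points
    # below (or above) the national mean shifts the remaining mean by delta / 12.
    n_booklets = 13
    national_mean = 483.0   # assumed for illustration, roughly between 444 and 512
    for booklet_mean in (444.0, 512.0):
        delta = national_mean - booklet_mean
        print(f"drop booklet with mean {booklet_mean:.0f}: "
              f"remaining mean shifts by {delta / (n_booklets - 1):+.1f}")
    # about +3.3 and -2.4 points, i.e. "about three points" either way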
3.5 Imputation with wrong normalisation
Each of the thirteen regular booklets consists of four blocks. Each item appears in four different blocks, in four different positions, in four different booklets. The major subject domain mathematics is covered by seven of the thirteen blocks; the other three subject domains are tested in two blocks each.
While all thirteen booklets contain at least one mathematics block, each minor domain appears only in seven booklets. Nevertheless, in the scaled data all students are attributed competence values in all four domains. If a student has not been tested in a domain, the competence estimate is based on his questionnaire responses and on his school's average math achievement. Such an imputation, when done correctly, reduces the standard error of population means without introducing bias.
In PISA, however, it is not done correctly. Bias is introduced because the imputation is anchored at only one of the seven booklets for which real data are available. This bias is plainly admitted in the Technical Report [17, p. 211], though it is quantified only for Canada. The case of Greece is more extreme: the official science competence mean of 481 is 16 points above the average achievement of those students who were actually tested in science [31, Sect. 3.10]; cf. Neuwirth in [14, p. 53]. This huge bias is certainly not justified by the benefits of imputation, which consist in a slight simplification of the secondary data structure and in a reduction of stochastic standard errors by probably no more than 10%.
3.6 Timing, tactics, fatigue
Since every item occurs in four different positions, one can easily investigate how response rates vary during the two-hour testing session: per-block response rates, averaged across booklets over all items, can be directly compared to each other.
One finds that the average rates of non-reached items, of missing responses, and of wrong responses systematically increase from block to block. The extent of this increase varies considerably between countries. The ratio of non-reached items in the fourth block is 1% in the Netherlands; in Mexico it is 25.3%. In the Netherlands, the ratio of items that were reached but not answered goes up from 2.5% in the first block to 4.0% in the fourth block; in Greece, from 11.1% to 24.4%. In Austria, the ratio of right to given responses decreases from 56.2% in the first block to 54.4% in the fourth block; in Iceland, from 58.5% to 53.1%.
All these data indicate that students lack time in the last of the four blocks. This alone is a strong argument against the applicability of one-dimensional item response theory [28, p. 43]. The ways students react to the lack of time vary considerably between countries:

• Dutch students try to answer almost every item. Towards the end of the test they become hasty and increasingly resort to guessing.

• Austrian and German students skip many items, and they do so from the first block on, which leaves them enough time to finish the test without accelerating their pace much.

• Greek students, in contrast, seem to be taken by surprise by the time pressure near the end. In the first block, their correct-response rate is better than in Portugal and not far away from the USA and Italy. In the last block, however, non-reached items and missing responses add up to 35%, bringing Greece down to one of the last ranks.

Aside from such extreme cases, it is hardly possible to disentangle the effects of test-taking tactics and fatigue.
3.7 Multiple responses to multiple choice items
In PISA 2003, 42 of 165 items are in a simple multiple-choice format. For each
of these items, four or five responses are proposed of which exactly one is meant
to be the right one. This essential rule is not clearly explained to the examinees.
In some countries, for some items, a considerable number of multiple responses
are given. They are denoted by a special code in the international data base,
but they are subsequently counted as incorrect.
In many countries, including Australia, Canada, Japan, Mexico, the Netherlands, New Zealand, and the USA, the quota of multiple responses is close to 0% (except for one particularly flawed item). In Austria, Germany, and Luxembourg, on the other hand, the fraction of multiple responses surpasses 4% for at least eleven items, and it reaches up to 10% for one of them.

Table 1: Percentages for the four possible responses of the multiple-choice item Optician Q1. Data are shown for two countries where almost the same percentage of students chose the right response B. However, preferences for the distractors C and D vary by about 20%.

               A       B       C       D
  Slovakia    3.1%   46.1%   17.5%   33.3%
  Sweden      3.1%   46.2%   37.0%   13.7%
Such a misunderstanding of the test format not only distorts the outcome of the item directly concerned. It also costs time: it is more effort to decide four or five times whether or not a proposed answer is correct than to choose only one alternative. Those who are familiar with the multiple-choice format sometimes do not even need to read all distractors.
3.8 Testing cultural background
If one wants to understand what a test actually measures, one has to study the manifold reasons why students give wrong responses (cf. Kohn [10, p. 11]). The few student solutions of open-ended items published by Luxembourg show how much information is lost when verbal or pictorial responses are digitally coded.
In contrast, in the digital coding of multiple-choice items most information is preserved; the codes for formally valid but incorrect responses indicate which of the three distractors was chosen. Table 1 shows the response percentages for one item and two countries. In this example distractor preferences vary by about 20% although the correct-response percentage is almost the same. This demonstrates quantitatively that the reasons that induce students to give a certain wrong answer can vary enormously from country to country.
It is fairly obvious that the choice of distractors also influences the correct-response percentage. Had distractor D been more in the spirit of C, it would have attracted additional responses in Sweden, whereas in Slovakia many students would have reoriented their choice towards B.
Between-country variance may be due, for instance, to school curricula, cultural background, test language, or to a combination of several factors. These factors are particularly influential in PISA because students have little time (about 2 minutes 20 seconds per item) and reading texts are too long. Sometimes the stimulus material even tricks students into following false cues [29]. In this situation, test-wise students try to solve items, including reading items, without actually reading the introductory texts. Such qualified guessing is of course highly dependent on extrinsic knowledge and therefore susceptible to cultural bias.
The released reading unit Flu from PISA 2000 provides a nice example. The stimulus material is an information sheet about a flu vaccination. One of the items asks how the vaccination compares to alternative or complementary means of protection. Of course students are not asked about their personal opinion; the answer is to be sought in the reading text. Nevertheless, the distractor preferences reflect French reliance on technology, and German belief in nature.
3.9 Language-related problems
The language influences the test in several ways:
Translations are prone to errors. In PISA, a complicated scheme with double translation from English and from French was foreseen to minimise such errors. However, in many cases, including the German-speaking countries, the French original was not taken seriously, and final versions were produced under extreme time pressure. There are clear-cut translation errors in the released sample items. In the unit Daylight the English word hemisphere was translated by the erudite Hemisphäre where German schoolbooks use the word Erdhälfte. In the unit Farms, attic floor was rendered as Dachboden, which just means attic. The fact that the Austrian version has the correct wording Boden des Dachgeschosses, though all German-speaking countries had shared the translation work, indicates that uncoordinated and unchecked last-minute modifications have been made.
Blum and Guérin-Pace [3, p. 113] report that changing a question (Quels taux ... ?) into a prompt (Énumérez tous les taux ...) can change the rate of right responses by 31%. This gives an idea of how much freedom translators have to help or to confuse (cf. Freudenthal [7, p. 172] and Olsen et al. [18]).
(3, 4) Under translation, texts tend to become longer, and some languages are more concise than others. In PISA 2000, the English and French versions of 60 stimulus texts were compared: the French texts contained on average 12% more words and 19% more letters [1, p. 64]. Mathematics items of PISA 2003 had 16% more characters in German than in English [24]. Of course reading time is not simply proportional to the number of words or letters. It seems nevertheless plausible that such a huge length difference induces an important bias.
3.10 Origin of test items
A majority of test items comes from English-speaking countries; the other items were translated into English before they were streamlined by professional item writers. If there is cultural bias, it is clearly in favour of the English-speaking countries. This makes it difficult to separate it from the translation bias, which acts in the same direction.
The quantitative importance of cultural and/or linguistic bias can be read off from the correlation of per-item correct-response-percentage vectors, as has been shown by Zabulionis [32, for TIMSS], Rocher [27], Olsen [20], and Wuttke [31]. Cluster analyses invariably show that student behaviour is most similar for countries that share both language and cultural heritage, like Australia and New Zealand (correlation coefficient 0.98). If the languages differ, correlations are at best about 0.96, as for the Czech and Slovak Republics. If the languages do not belong to the same family, correlations are hardly larger than 0.94. While some countries belong to large clusters, others like Japan and Korea are quite isolated (no correlation larger than 0.90). These results have immediate implications for the validity of inter-country comparisons: the lower the correlation of response patterns, the more a comparison depends on the arbitrary choice of items.
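The similarity analysis described here is easy to sketch. The snippet below uses a made-up data layout and hypothetical numbers, not the actual PISA files; it computes pairwise correlations of per-item correct-response-percentage vectors and a simple hierarchical clustering on them.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # pct[c] = correct-response percentage per item for country c
    # (hypothetical values; the real vectors have one entry per test item).
    pct = {
        "AUS": np.array([72.0, 55.0, 38.0, 61.0, 47.0]),
        "NZL": np.array([70.0, 57.0, 36.0, 63.0, 45.0]),
        "JPN": np.array([66.0, 48.0, 52.0, 40.0, 58.0]),
    }
    countries = list(pct)

    # Pairwise Pearson correlations of the response-percentage vectors.
    corr = np.corrcoef([pct[c] for c in countries])
    for i in range(len(countries)):
        for j in range(i + 1, len(countries)):
            print(countries[i], countries[j], f"{corr[i, j]:.2f}")

    # Agglomerative clustering with 1 - correlation as distance measure.
    dist = 1.0 - corr[np.triu_indices(len(countries), k=1)]
    print(fcluster(linkage(dist, method="average"), t=2, criterion="maxclust"))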
4 Interpreting cognitive test results
4.1 Proficiency levels
Verbal descriptions of proficiency levels are used to guide the interpretation of numeric results [16, pp. 46-56]. The boundaries of these levels are arbitrarily chosen; nevertheless they are communicated with absurd four-digit precision. Starting at a competence of 358.3, there are six proficiency levels. The width of levels 1 to 5 is about 62.1; level 6 starts at 668.7. Depending on how many students gave the right response, each item is assigned to one of these levels. Based on all items assigned to one level, a verbal synthesis is given of what students can typically do.
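For orientation, the quoted boundaries can be expressed as a simple lookup. This is a sketch of my own, based only on the numbers given in this paragraph (levels starting at 358.3, width about 62.1, level 6 from 668.7 upward); it is not the official assignment procedure.

    def proficiency_level(score):
        """Map a competence score to a proficiency level, using the boundaries
        quoted in the text (358.3 + k * 62.1, level 6 above 668.7)."""
        if score < 358.3:
            return 0          # below level 1
        return min(1 + int((score - 358.3) // 62.1), 6)

    for s in (340, 400, 503, 600, 700):
        print(s, "->", proficiency_level(s))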
By construction the OECD country average student competence distribution is approximately a Gaussian. The mean of 500 and the standard deviation of 100 are imposed by an explicit (though ill-documented) renormalisation. Therefore the percentages of students in the different proficiency levels are almost constant.
To illustrate this important point let us perform a Gedankenexperiment. If the percentage of right responses given by a single student grows by 6%, his competence value increases by about 30 points. Suppose now that the correct-response percentage grows by 6% for all students: the competence values assigned to the students will not increase, because any uniform change of student competences is immediately reverted by the renormalisation to the predefined Gaussian. Instead, the item difficulty values would be lowered by about 30 points, so that about every second item would be relegated to the next lower proficiency level. Theoretically, this should then lead to a rephrasing of the proficiency level descriptions.
However, these descriptions are highly systematic. They are so systematic that they could have been derived straight from Bloom's forty-year-old taxonomy. They are far too systematic to appear like a summary of empirical results: one would expect that not every single item fits equally well into such a scheme, but the level descriptions do not betray the least irritation. As Meyerhöfer [11] has pointed out, the very idea of proficiency levels is not consistent with the fact that test items can be solved in quite different ways, depending for instance on curricular premises, on test-wiseness and time pressure. Therefore, the most likely outcome of our Gedankenexperiment seems to be that the official level descriptions would not change at all, so that the overall increase in student achievement would pass unnoticed, as has the misfit of the Rasch model and the resulting bias and uncertainty of about ±30 difficulty points.
Another fundamental objection is the lack of transparency. The proficiency level descriptions cannot be discussed scientifically unless the consortium publishes the instruments on which they are based and the proceedings of the hermeneutic sessions in which the descriptions have been worked out.
In the German reports, students in and below proficiency level 1 are called the "risk group". This deviates from the international reports, which speak of "risk" only in connection with students below level 1. It has become an urban legend in Germany that nearly one quarter of all fifteen-year-olds are almost functionally illiterate, although the original report states that PISA does not bother to measure fluency of reading, which is taken for granted even on level 1 [15, pp. 47-48]. Furthermore, as has been stressed above, the percentage of students on or below level 1 is extremely sensitive to disparities in sampling and participation.
4.2 Is PISA an intelligence test?
PISA items from different domains are quite similar in style and sometimes even in content: reading items are based on non-textual stimulus material such as graphics or tables, and math or science items require a lot of reading. This is intentional insofar as it reflects a certain conception of literacy. It is therefore unsurprising that competence values from different domains are highly correlated. A majority of per-country inter-domain correlations is stronger than 80%.
In such a situation, the sensible thing to do is a principal component analysis. One finds that between 75% (Greece) and 92% (Netherlands) of the total variance of examinee competences can be attributed to just one component. However, no such analysis has been published by the consortium, and when Rindermann [26] published one, members of PISA Germany tried to dismiss and even to ridicule it. The ideological and strategic reasons for this opposition are obvious: once it is found that PISA mainly measures one general factor per examinee, it is hard not to make a connection to the g factor of intelligence research. To PISA members this must appear as a sacrilege and as a threat; they avoid the word intelligence throughout their writings. This word is taboo in much of the pedagogical mainstream, and no government would spend millions to be informed about the intelligence of students.
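The kind of analysis referred to above is straightforward once the four domain scores per student are available. The sketch below uses simulated scores with one strong common factor, since the real student-level data are not reproduced here; it only illustrates the mechanics.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated stand-in for the four domain scores (reading, mathematics,
    # science, problem solving) of 1000 examinees: common factor plus noise.
    g = rng.normal(500, 90, size=(1000, 1))
    scores = g + rng.normal(0, 40, size=(1000, 4))

    # Principal component analysis via the eigenvalues of the covariance matrix.
    eigvals = np.linalg.eigvalsh(np.cov(scores, rowvar=False))[::-1]
    print(f"share of variance in the first component: {eigvals[0] / eigvals.sum():.0%}")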
4.3 Uncontrolled variables
PISA aims at monitoring outcomes of education systems. However, the education system is just one of many variables that influence the outcome of the cognitive test. As we have seen, sampling, exclusions, response rates, test-taking habits, culture, and language are quantitatively important. Since all these variables are country-dependent, there is no way to separate them from the variable "education system".
But even in the hypothetical case of a technically and culturally fair test, it would not be clear that differences in test outcome are due to differences in education systems. There are certainly country-dependent educational influences that are not part of what is generally understood by "education system", such as the subtitled TV programs prevalent in small language communities. Furthermore, equating test achievement with the outcome of schooling is highly ideological in that it dismisses differences in genetic endowment, pre-school education, and extra-school environment.
The importance of extrinsic parameters becomes obvious when subpopulations are compared that share the same education system. An example is given by the two language communities in Finland. In the major domain of PISA 2000, reading, Finnish students achieve 548 points in Finnish-speaking schools, but only 513 in Swedish-speaking schools, slightly less than Sweden's national average of 516 [31, Sect. 4.8]. A national report [5] suggests that much of the difference between the two communities (which is somewhat smaller in 2003) can be explained by two factors: by the language spoken at home and by the social, economic, and cultural background.
If student-dependent background variables have such a huge impact in an otherwise comparatively homogeneous country like Finland, they can even more severely distort international comparisons.
As several authors have already noted, one of the most important background variables is the language spoken at home. Except in a few bilingual regions, a non-test language spoken at home is typically linked to immigration. The immigration status is accessible since the questionnaire asks for the country of birth of the student and his parents. Excluding first- and second-generation immigrant students from the national averages considerably alters the country league tables: at the top of the list in the 2003 major domain, mathematics, Finland is replaced by the Netherlands and Belgium, and it is closely followed by Switzerland. The superiority of the Finnish school system, one of the most publicised results of PISA, vanishes as soon as one single background variable is controlled.
5 Conclusions
One line of defense of PISA proponents reads: PISA is state of the art; at present nobody can do it better. This is probably true. If there were one outstanding source of bias, one could hope to improve PISA by fighting this specific problem. However, it rather appears that there is a plethora of inaccuracies of similar magnitude. Reducing a few of them will have very little effect on the overall uncertainty. Therefore, one has to live with the unsatisfactory state of the art and draw the right consequences.
Firstly, the outcome of PISA must be reassessed. The official significance criteria, based only on stochastic errors, are irrelevant and misleading. The accuracy of country rankings is largely overestimated. Statistics are particularly distorted if they depend on response rates among weak students; statements about "risk groups" are extremely unreliable.
Secondly, the large sample sizes of PISA are uneconomic. Since the accuracy of the study is determined by other factors, the effort currently invested in minimising stochastic errors is unjustified.
Thirdly, it is clear from the outset that little can be learned when something
as complex as a school system is characterised by something as simple as the
average number of solved test items.
References
[1] Adams, R. / Wu, M., eds. (2002): PISA 2000 Technical Report. Paris: OECD.

[2] Blanke, I. / Böhm, B. / Lanners, M. (2004): Beispielaufgaben und Schülerantworten. Le Gouvernement du Grand-Duché de Luxembourg, Ministère de l'Éducation nationale et de la Formation professionnelle.

[3] Blum, A. / Guérin-Pace, F. (2000): Des Lettres et des Chiffres. Des tests d'intelligence à l'évaluation du savoir lire, un siècle de polémiques. Paris: Fayard.

[4] Bottani, N. / Vrignaud, P. (2005): La France et les évaluations internationales. Rapport établi à la demande du Haut Conseil de l'évaluation de l'école. http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf.

[5] Brunell, V. (2004): Utmärkta PISA-resultat också i Svenskfinland. Pedagogiska Forskningsinstitutet, Jyväskylä Universitet.

[6] Fischer, G. H. / Molenaar, I. W. (1995): Rasch Models. Foundations, Recent Developments, and Applications. New York: Springer.

[7] Freudenthal, H. (1975): Pupils' achievements internationally compared: the IEA. In: Educ. Stud. Math. 6, 127-186.

[8] Hambleton, R. K. / Swaminathan, H. / Rogers, H. J. (1991): Fundamentals of Item Response Theory. Newbury Park: Sage.

[9] Hattie, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. In: Appl. Psych. Meas. 9 (2) 139-164.

[10] Kohn, A. (2000): The Case Against Standardized Testing. Raising the Scores, Ruining the Schools. Portsmouth NH: Heinemann.

[11] Meyerhöfer, W. (2004): Zum Problem des Ratens bei PISA. In: J. Math.-did. 25 (1) 62-69.

[12] Micceri, T. (1989): The Unicorn, the Normal Curve, and other Improbable Creatures. In: Psychol. Bull. 105 (1) 156-166.

[13] Mislevy, R. J. (1993): Foundations of a New Test Theory. In: Frederiksen, N. / Mislevy, R. J. / Bejar, I. I., eds.: Test Theory for a New Generation of Tests. Hillsdale: Lawrence Erlbaum.

[14] Neuwirth, E. / Ponocny, I. / Grossmann, W., eds. (2006): PISA 2000 und PISA 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

[15] OECD, ed. (2001): Knowledge and Skills for Life. First Results from the OECD Programme for International Student Assessment (PISA) 2000. Paris: OECD.

[16] OECD, ed. (2004): Learning for Tomorrow's World. First Results from PISA 2003. Paris: OECD.

[17] OECD, ed. (2005): PISA 2003 Technical Report. Paris: OECD.

[18] Olsen, R. V. / Turmo, A. / Lie, S. (2001): Learning about students' knowledge and thinking in science through large-scale quantitative studies. In: Eur. J. Psychol. Educ. 16 (3) 403-420.

[19] Olsen, R. V. (2005a): Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students' knowledge and thinking in science. Thesis, University of Oslo.

[20] Olsen, R. V. (2005b): An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? In: NorDiNa 1 (1) 81-94.

[21] Prais, S. J. (2003): Cautions on OECD's Recent Educational Survey (PISA). In: Oxford Rev. Educ. 29 (2) 139-163.

[22] Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2004): PISA 2003. Der Bildungsstand der Jugendlichen in Deutschland: Ergebnisse des zweiten internationalen Vergleichs. Münster: Waxmann.

[23] Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2005): PISA 2003. Der zweite Vergleich der Länder in Deutschland: Was wissen und können Jugendliche? Münster: Waxmann.

[24] Puchhammer, M. (2007): Language-Based Item Analysis: Problems in Intercultural Comparisons. In: S. T. Hopmann, G. Brinek, and M. Retzl, eds.: PISA zufolge PISA / PISA According to PISA. Hält PISA, was es verspricht? / Does PISA Keep What It Promises? Wien: Lit-Verlag.

[25] Putz, M. (2008): PISA oder: jedem das seine ... Wunschergebnis! Zweifel an PISA anhand der Fälle Österreich und Südtirol. http://www.messen-und-deuten.de/pisa/Putz08.pdf.

[26] Rindermann, H. (2006): Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychol. Rundsch. 57 (2) 69-86. See also comments and reply in vol. 58 (2).

[27] Rocher, T. (2003): La méthodologie des évaluations internationales de compétences. In: Psychologie et Psychométrie 24 (2-3) [Numéro spécial: Mesure et Éducation], 117-146.

[28] Rost, J. (2004): Lehrbuch Testtheorie - Testkonstruktion. 2nd edition. Bern: Hans Huber.

[29] Ruddock, G. / Clausen-May, T. / Purple, C. / Ager, R. (2006): Validation study of the PISA 2000, PISA 2003 and TIMSS-2003 International studies of pupil attainment. Nottingham: Department for Education and Skills. http://www.dfes.gov.uk/research/data/uploadfiles/RR772.pdf.

[30] Woods, C. M. / Thissen, D. (2006): Item Response Theory with Estimation of the Latent Population Distribution Using Spline-Based Densities. In: Psychometrika 71 (2) 281-301.

[31] Wuttke, J. (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, T. / Meyerhöfer, W., eds.: Pisa & Co. Kritik eines Programms. 2nd edition [note: my contribution to the 1st edition is outdated]. Hildesheim: Franzbecker. ISBN 978-3-88120-464-4.

[32] Zabulionis, A. (2001): Similarity of Mathematics and Science Achievement of Various Nations. In: Educ. Policy Analysis Arch. 9 (33).