Uncertainties and Bias in PISA

Joachim Wuttke

Copyright (C) Joachim Wuttke 2007. Revised 20jul08. Download locations: http://www.messen-und-deuten.de/pisa, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1159042.

Appeared in: PISA zufolge PISA / PISA According to PISA: Hält PISA, was es verspricht? Does PISA Keep What It Promises? Edited by S. T. Hopmann, G. Brinek, and M. Retzl. Reihe Schulpädagogik und Pädagogische Psychologie, Bd. 6. Wien: Lit-Verlag 2007, ISBN 978-3-8258-0946-1.

Abstract

This is a summary of a detailed report that has appeared in German [31]. It will be shown that the statistical significance criteria of OECD/PISA are misleading because several sources of systematic bias and uncertainty are quantitatively more important than the standard errors communicated in the official reports.

1 Introduction

1.1 A huge framework

PISA is a long-term project. Starting in 2000, assessments are carried out every three years. One and a half years are needed for data processing until the international report First Results [15; 16] is published, and it takes even longer until a Technical Report [1; 17] appears and the raw data are made available for independent analysis. Therefore, although the third assessment was carried out in spring 2006, at present (summer 2007) only PISA 2000 and 2003 can be evaluated. In the following, we will concentrate on data from PISA 2003.

PISA 2003 was carried out in 30 OECD countries and in some partner countries. As data from the latter were not used in the international calibration, they will be disregarded in the following. The United Kingdom (UK), having missed several participation criteria, was excluded from tables in the official report. However, data from the UK were fully used in calibrating the international data set and in calculating OECD averages, an inconsistency that is left unexplained [17, p. 128], [16, p. 31].

PISA rules required a minimum sample size of 4500 students per country, except in very small countries (Iceland, Luxembourg), where all fifteen-year-old students were recruited. In several countries (Australia, Belgium, Canada, Italy, Mexico, Spain, Switzerland, UK), considerably larger samples of up to nearly 30,000 students [17, p. 168] were drawn so that separate analyses for regions or language areas became possible. For the comparison of the sixteen German länder, an even larger sample of 44,580 students was tested [23, p. 392], of which, however, only 4660 were contributed to the international sample [17, p. 168]. The Kultusministerkonferenz, fearing unauthorized cross-länder comparisons of school types, has imposed deletion of länder codes from the public-use data files. Therefore, the inner-German comparison shall not be considered further.

The bulk of PISA data comes from a three-hour student testing session. Some more information is gathered from school principals. The testing session consists of a two-hour cognitive test and of a third hour devoted to questionnaires. The main questionnaire enquires about the students' social background, educational environment, and learning habits. The questionnaire responses certainly constitute a valuable resource for studying the living and learning conditions of fifteen-year-olds in large parts of the world, even though participation rate gradients introduce some bias.
Compared to the rich empirical material obtained from the questionnaires, the outcome of the cognitive test is meagre: the official data analysis reduces it to just four scores per student, interpreted as competences in specific subject domains (reading, mathematics, science, problem solving). Nevertheless, these results are at the origin of PISA's political impact; communicated as league tables of national mean values, they made PISA known to the general public, causing an outright shock in some countries. While controversy erupted about possible causes of results perceived as unsatisfactory, the three-digit precision of the underlying data has rarely been questioned. This is what shall be done in the present paper. The accuracy and validity of cognitive test results shall be reviewed from a statistical point of view.

1.2 A surprisingly simple measure of competence

As a first step of data reduction, student responses are digitally coded. The Technical Report discusses inter-coder and inter-country variance at length [17, pp. 218-232]; the conclusion that non-uniform coding is an important source of bias and uncertainty is left to the reader.

Some codes are kept secret because national authorities want to prevent certain analyses. In several multilingual countries the test language is kept secret. Except for such deletions, the international raw data set is available for download on the web site of OECD's main contractor ACER (Australian Council for Educational Research). On the lowest level of data aggregation, single item response statistics (percentages of right, wrong, and invalid responses to one cognitive test item) can be generated. In the international report not even one such statistic is shown.

PISA is decidedly not a study in Fachdidaktik (math education, science education, . . . ). PISA does not aim at gathering information about the understanding of scientific concepts or the mastery of specific mathematical techniques. The data provide almost no handle to understand why students give wrong responses. Only Luxembourg has scanned and published some student solutions to free-response items [2]; these examples show that students sometimes just misunderstood what the item writer meant to ask.

PISA is designed to be analysed on a much coarser level. As anticipated above, cognitive test results are aggregated into just four competence values per student. The determination of these values is technically complicated because not all students worked on the same item set: thirteen different booklets were used, and in some countries some items turned out to be invalid because of misprints, translation errors, or other problems. This makes it necessary to establish an item difficulty scale prior to the quantification of student competences. For this calibration an elementary version of item response theory is used. The importance of this theory tends to be overestimated by defenders and critics of PISA alike. Misunderstandings are also provoked by the poor documentation in the official reports. For a functional understanding of what PISA measures it is not important that different booklets were used, and it is plainly irrelevant that somewhere some items were deleted. Glossing over these technicalities, pretending that all students were assigned the same item set, and ignoring the probabilistic aspect of item response theory, it becomes apparent what the competence values actually measure: no more and no less than the number of right responses.
In the mathematics subtest of PISA 2003, a student with a competence of 500 (the OECD mean) has solved about 46% of the items assigned to him. A competence of 400 (one standard deviation below the mean) corresponds to a correct-response rate of 23%; 600 corresponds to 71% [31, Fig. 4]. Within this span the relationship between competence value and correct-response percentage is nearly linear. The slope is about 4 competence points per 1% of assigned items. This conversion gives the competence scale a much simpler meaning than the official reports allow one to suspect.

1.3 League tables and stochastic uncertainties

Any analysis of PISA data aims at statistical statements about populations. For instance, an elementary analysis of the cognitive test yields results like the following: German students have a mean mathematics competence of 503; the standard deviation is 103; the standard error of the mean is 3.3, and the standard error of the standard deviation is 1.8 [22, p. 70]. To make sense of such numbers they need to be put into context. The PISA reports provide two kinds of interpretation guidance: Verbal descriptions of proficiency levels give a rough idea of what competence differences of 60 or more points signify (see below), and comparisons between different populations insinuate that even differences of only a few points bear a message.

Since the assessment of competences within each of the four subject domains is strictly one-dimensional, any inter-population comparison implies a ranking. This explains the primordial role of league tables in PISA: They are not only a vehicle for gaining media attention; they are deeply rooted in the study's conception (cf. Bottani/Vrignaud [4]). In the official reports almost all statistics are communicated in the form of country league tables. The ranks in these tables, especially low ranks (and every country has low ranks in some tables), are then easily turned into political messages. In this way PISA results can be interpreted without any understanding of what has actually been measured.

Of course not all rank differences are statistically significant. This is duly noted in the official reports. For all statistics, standard errors are calculated. After processing these standard errors through a null-hypothesis testing machinery, some mean value differences are judged significant, others are not. Complicated tables [16, pp. 59, 71, 81, 88, 92, 281, 294] indicate which differences of competence means are significant. It turns out that in some cases 9 points are sufficient to say with confidence that the higher performance by sampled students in one country holds for the entire population of enrolled 15-year-olds [16, p. 93].

Figure 1: Two Gaussian distributions with mean values differing by 9% of their standard deviation. Such small differences between two populations are considered significant in PISA.

This accuracy is formidable when compared to the intra-country spread of test performances. The standard deviation of the competence distribution is 100 points in the OECD country average and not much smaller within single nations. This is an order of magnitude more than an inter-country difference of 9 points. Figure 1 visualises the situation. However, significant does not mean reliable, valid, or relevant. Statistical significance is achieved by nothing more than the law of large numbers. The standard errors on which the significance criteria rest account only for two specific sources of stochastic uncertainty: the student sampling and the item-response modelling of student behaviour.
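To see what these purely stochastic standard errors amount to, here is a minimal back-of-the-envelope sketch (Python; not part of the official methodology). It uses only figures quoted above: an intra-country standard deviation of about 100 points and a minimum national sample of 4500 students. The simple sigma/sqrt(n) formula ignores the design effect of the two-stage cluster sample, which is why the actually reported standard errors (e.g. 3.3 points for Germany) are roughly twice as large; the order of magnitude is unchanged.

```python
import math

sigma = 100.0   # intra-country SD of the competence distribution (see above)
n = 4500        # minimum national sample size

se_mean = sigma / math.sqrt(n)       # naive standard error of one national mean
se_diff = math.sqrt(2) * se_mean     # naive SE of a difference of two such means

print(f"SE of a national mean:   {se_mean:.1f} points")   # ~1.5
print(f"SE of a mean difference: {se_diff:.1f} points")   # ~2.1
print(f"9-point gap in SE units: {9 / se_diff:.1f}")      # ~4.3, far beyond 1.96
print(f"9-point gap in SD units: {9 / sigma:.2f}")        # 0.09, cf. Fig. 1
```

In other words, the significance of a 9-point gap is a direct consequence of the large sample sizes, not of any particular measurement accuracy; with the conversion of Sect. 1.2 (about 4 points per percent of assigned items), 9 points correspond to roughly 2% of the items.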
By testing more and more students on more and more items these uncertainties can be made arbitrarily small. At some point, however, this effort becomes inefficient because reliability and validity of the study remain limited by non-stochastic sources of bias and uncertainty, which do not decrease with increasing sample size. Before entering into details, the likelihood of non-stochastic bias can be made plausible by just considering what a mean value difference of 9 competence points actually means: According to the conversion introduced above, 9 points correspond to about 2% of responses. On average, a student is assigned 26 mathematics items. Hence a significant difference between two populations can be brought about by no more than half a right response per student. This suggests that little bias is needed to distort test results far beyond their nominal standard errors.

In the following, I will argue that PISA suffers indeed from severe non-stochastic limitations and that the large sample sizes are therefore uneconomic. The paper is structured as follows: Part 2 describes disparities in student sampling, Part 3 shows that the projection of cognitive test results onto a one-dimensional competence scale is neither technically convincing nor culturally fair, and Part 4 adds some objections on the conceptual level.

2 Sampling disparities

In some countries it is clear from the outset that PISA cannot be representative (Sect. 2.1). But even in countries where school is obligatory beyond the age of fifteen, low participation rates are likely to introduce some bias. Several imperfections and inconsistencies of the international sample are well documented in the Technical Report. Participation rate requirements were not strict enough to prevent significant bias, and violations of these predefined rules hardly led to any consequences.

2.1 Target population does not serve study objective

PISA claims to measure outcomes of education systems in terms of student achievements. This claim is not consistent with the choice of the target population, namely 15-year-olds enrolled full-time in educational institutions. In some countries (Mexico, Turkey, several partner countries), enrollment is less than 60%. Obviously, PISA says nothing about the outcome of the education systems of these countries. On the other hand, in many countries school is obligatory beyond the age of 15. At fifteen, the ability of abstract reasoning is still in full development. PISA therefore systematically underestimates the abilities students have near the end of compulsory schooling [16, pp. 3, 298; 17, p. 46].

2.2 Target population too loosely defined: unequal exclusions

Rules allowed countries to exclude up to 5% of the target population: up to 0.5% for organizational reasons and up to 4.5% for intellectual or functional disabilities or limited language proficiency. Exclusions for intellectual disability depended on the professional opinion of the school principal or other qualified staff, a completely uncontrollable source of uncertainty. From the small print in the Technical Report it appears that some countries defined additional criteria: Denmark, Finland, Ireland, Poland, and Spain excluded students with dyslexia; Denmark also students with dyscalculia; Luxembourg recently immigrated students [17, pp. 47, 65, 169, 183]. Actual student exclusion rates of OECD countries varied from 0.7% to 7.3%. Canada, Denmark, New Zealand, Spain, and the USA exceeded the 5% limit. Nevertheless, data from these countries were fully included in all analyses.

For a first-order estimate of the impact caused by the unequal use of student exclusions, let us approximate the competence distribution in every single country by a Gaussian with standard deviation 100, and let us assume that countries exclude with perfect precision the least competent students. Then, exclusion of the weakest 0.7% increases the country's mean by 2.0 points and reduces its standard deviation by 2.5 points, whereas exclusion of 7.3% increases the mean by 15.0 and reduces the standard deviation by 12.8. Of course, exclusion criteria are only correlates of potential test achievement, and they are never applied with perfect precision. When a probabilistic cut-off, spread over a range of ±100 points, is used to model soft exclusion criteria, the bias in the two countries' competence mean difference is reduced to about half of the initial 13 points.
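The first-order estimate above can be reproduced in a few lines; the following sketch (assuming SciPy is available) truncates a Gaussian of standard deviation 100 at the quoted exclusion rates and recovers the 2.0- and 15.0-point mean shifts and the 2.5- and 12.8-point reductions of the standard deviation.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def after_exclusion(p, mu=500.0, sigma=100.0):
    """Mean and SD of a Gaussian competence distribution after perfectly
    excluding the least competent fraction p (first-order estimate)."""
    a = norm.ppf(p)                                   # cut-off in standard units
    m, v = truncnorm.stats(a, np.inf, moments="mv")   # standardised truncated normal
    return mu + sigma * m, sigma * np.sqrt(v)

for p in (0.007, 0.073):                              # extreme OECD exclusion rates
    mean, sd = after_exclusion(p)
    print(f"exclude {p:.1%}: mean {mean:.1f} (+{mean - 500:.1f}), SD {sd:.1f}")
# exclude 0.7%: mean 502.0 (+2.0), SD 97.5
# exclude 7.3%: mean 515.0 (+15.0), SD 87.2
```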
In Germany much public attention has been drawn to the percentage of students in a so-called risk group, defined by test scores below an arbitrary threshold. International comparisons of such percentages are particularly unreliable, because they are extremely sensitive to non-uniform exclusion criteria.

2.3 On the fringe of the target population: unequal inclusion of learning-disabled students

The imprecision of exclusion criteria and the resulting bias are further illustrated by the unequal inclusion of students with learning disabilities. Seven countries cater for them in special schools. In these schools, the cognitive test was abridged to one hour, and a special booklet with a selection of easy items was used. In all other countries, student exclusions were decided case by case; but even in countries that used the special booklets, some learning-disabled students could be individually excluded (cf. [21, pp. 149, 158]).

The extent to which students were excluded from the test or given the short booklet varies widely between the seven countries. In Austria, 1.6% of the target population were completely excluded, and 0.9% of the participating students got the short test. In Hungary, 3.9% were excluded, and 6.1% did the short test. Given this discrepancy, it is barely surprising that Hungarian students who did the short test achieved nearly 200 points more than their Austrian counterparts.

For another rough estimate of the quantitative impact of unclear exclusion criteria, one can recalculate national means without short tests. If all short tests were excluded from the PISA sample, the mean reading score of Belgium, Denmark, and Germany would increase by more than 7 points; in doing so, Belgium (1.5% exclusions, 3.0% short tests) would even remain within the 5% limit [17, p. 169]. A bias of the order of 7 points is in perfect accord with the estimate from the previous section.

2.4 Sampling problems: inconsistent input

The sampling is technically difficult. Often, governments do not have consistent databases. Sometimes this leads to bewildering inconsistencies: In Sweden 102.5% of all 15-year-olds are reported to be enrolled in an educational institution; in the Italian region of Tuscany 107.7%; in the USA, in spite of a strong homeschooling movement, 100.000% [17, pp. 168, 183].

The sample is drawn in two stages: schools within strata (regions and/or school types), and students within schools. As a consequence of this stratification and of unequal participation rates, not all students are equally representative of the target population. To correct for this, students are assigned statistical weights, composed of several factors.
The recommended way to calculate these weights is so difficult that international rules foresee three replacement procedures. In Greece, none of the four procedures worked, so that a uniform student weight had to be used [17, p. 52].

2.5 Sampling problems: inconsistent output

In the Austrian sample of PISA 2000, students from vocational schools were underrepresented. In consequence, the country's means were overestimated and other data were distorted as well. The error was only searched for and found three years later, when the deceiving outcome of PISA 2003 induced the government (which had changed in the meantime) to order an investigation [14]. In South Tyrol, a change of government is not in sight, and therefore nobody seems interested in verifying accusations that the excellent PISA results of this region are largely due to the underrepresentation of students from vocational schools [25].

In South Korea, only 40.5% of PISA participants are girls. In the 1980s, due to selective abortion and in part possibly also to hepatitis B, the share of girls at birth in South Korea was down to 47%, perhaps even to 46%. Taking this into account, girls are still severely underrepresented in the PISA sample. According to the Technical Report, this cannot be explained by unequal enrollment or test compliance: The enrollment rate is 99.94%, the school participation rate 100%, the student participation rate 98.81%. Probably the sampling scheme was inappropriate. This conclusion is also supported by an anomalous distribution of birth months.

2.6 Insufficient response rates

Rules required a school response rate of 85%, within-school response rates of 25%, and a country-wide student response rate of 80% [17, pp. 48-50]. The United Kingdom breached more than one criterion, which led to its superficial disqualification. Canada profited from an illogical rule according to which initial response rates above 65% could be cured by replacement schools without the need of reaching 85%; the case was settled by negotiation [17, p. 238]. With 64.9%, the USA missed the required initial school response rate by a narrow margin, and the response from replacement schools was overwhelmingly negative, bringing the participation rate to no more than 68.1%. Nevertheless, US data were fully included in all analyses (note: the USA contributes 25% of OECD's budget).

Non-response can cause considerable bias because the propensity of school principals and students to partake in the testing is likely to be correlated with the potential outcome. Quantitative estimates are difficult because the international data base contains not the least information about those who refused the test. Nevertheless, there is ample indirect evidence that the correlation is quite high. To cite just one example: In Germany, schools with a student response of 100% had a mean math score of 553. Schools with participation below 90% achieved only 476 points. Even if the latter number is subject to some uncertainty (discussed at length in [31]), the strong correlation between student ability and test compliance is beyond any doubt.

In the official analysis, statistical weights provide a first-order correction for the between-school variation of response rates: When schools refuse to participate, the weight of other schools from the same stratum is increased accordingly. Similarly, in schools with low student response rates, the participating students are given higher weights. However, these corrections do not cure within-school correlations between students' latent abilities and their propensity to partake in the test. In the absence of data from absent students, the possible bias can only be roughly estimated: In some countries, the student response rate is more than 15% lower than in others. Assuming very conservatively that the latent ability of the missing students is only half a standard deviation below the true national average, one finds that the absence of these students increases the measured national average by 8.8 points.
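The 8.8-point figure follows from a simple weighting argument; here is a minimal sketch using only the assumptions stated above (15% of students missing, with a mean ability deficit of half a standard deviation, i.e. 50 points):

```python
f = 0.15        # share of the target population missing from the test
delta = 50.0    # assumed ability deficit of the missing students (0.5 SD)

# The true mean mu satisfies  mu = (1 - f) * m_tested + f * (mu - delta),
# so the tested students overstate the national mean by
bias = f * delta / (1 - f)
print(f"{bias:.1f} points")   # -> 8.8
```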
2.7 Gender-dependent response rates

In many other countries girls are overrepresented in the PISA sample. The discrepancy is largest in France, with 52.6% girls in PISA against an estimated 48.9% among 15-year-olds: compared to the age cohort, the PISA sample has more than 7% too many girls and more than 7% too few boys. Insofar as this is due to different enrollment, it reinforces the argument of Sect. 2.1. Otherwise, the most likely explanation is a gender-dependent propensity to participate in the testing.

2.8 Doubts about data transmission: missing missing responses

Normally, some students do not respond to all questions of the background questionnaire. Moreover, some students leave between the cognitive test and the questionnaire session. In Poland, however, such missing data are missing: There is not a single student who responded to fewer than 25 questionnaire items, and there are 7 items that not a single student left unanswered. Unless this anomaly is explained otherwise, one must suspect that booklets with missing data have been suppressed.

3 Ignored dimensions of the cognitive test

PISA's competence scale depends on the assumption that all items from one subject domain measure essentially one and the same latent ability. In reality, any test outcome is also influenced by factors that cannot be subsumed under a subject-specific competence. While there is no generally accepted way to indicate the degree of multi-dimensionality of a test [9], simple first-order estimates allow one to demonstrate its impact: these non-competence dimensions cause an amount of arbitrariness, uncertainty, and bias in PISA's competence measure that is by no means negligible when compared to the purely stochastic official standard errors.

3.1 Elimination of disturbing items

The evidence for multidimensionality to be presented in the following sections is all the more striking given that the cognitive items actually used in PISA have been preselected for unidimensionality: Submissions from participating countries were streamlined by professional item writers, reviewed by national subject matter experts, tested with students in think-aloud interviews, tested in a pre-pilot study in a few countries, tested in a field trial in most participating countries, rated by expert groups, and selected by the consortium [17, pp. 20-30]. Only one third of the items that had reached the field trial were finally used in the main test. Items that did not fit the idea that competence can be measured in a culturally neutral way on a one-dimensional scale were simply eliminated. Field trial results remain unpublished, although one could imagine an open-ended analysis providing valuable insight into the diversity of education outcomes. This adds to Olsen's observation [19, p. 5] that in PISA-like studies the major portion of information is thrown away.
However, the strong preselection did not prevent seriously flawed items from being used in the main test: In the analysis of PISA 2000, the item Continent Area Q1 had to be disqualified, in 2003 Room Numbers Q1. Furthermore, several items had to be disqualified in specific countries.

3.2 Unfounded models

In PISA, a probabilistic psychological model is used to calibrate item difficulties and to estimate student competences. This model, named after Georg Rasch, is the most elementary incarnation of item response theory. It assumes that the probability of a correct response depends only on the difference between the student's competence value and the item's difficulty value. Mislevy [13] calls this attempt to explain problem-solving ability in terms of a single, continuous variable a caricature, based in 19th century psychology. The model does not even admit the possibility that some items are easier in one subpopulation than in another. The reason for its usage in PISA is neither theoretical nor empirical, but pragmatic: Only one-dimensional models yield unambiguous rankings.

Taking the Rasch model literally, there is no way to estimate the competence of students who solved all items or none: for them, the test has been too easy or too difficult, respectively. In PISA, this problem is circumvented by enhancing the probability of medium competences through a Bayesian prior, arbitrarily assumed to be a Gaussian. As distributions of achievement and psychometric measures are never Gaussian [12], this inappropriate prior causes bias in the competence estimates (Molenaar in [6, p. 48]), especially at extreme values [30]. This further undermines statements about risk groups with particularly low competence values.

3.3 Failure of the Rasch model

Various mathematical criteria have been developed to assist in deciding whether or not the Rasch model reasonably approximates an empirical data set. It appears that only one of them has been used to check the outcome of the PISA main test: an unexplained item infit mean square [17, pp. 123, 278]. A much more sensitive way to test the goodness of fit is a visual inspection of appropriate plots [8, p. 66]. An item characteristic or score curve is a plot of correct-response percentages as a function of competence values, each data point representing a quantile of examinees. In the Technical Report [17, p. 127] one single item characteristic is shown, an atypical one that agrees rather well with the Rasch model.

According to the model, all item characteristics from one subject domain should have strictly the same shape; the only degree of freedom is a horizontal shift, driven by the model's only item parameter, the difficulty. This is clearly inconsistent with the variety of shapes exhibited by the four item characteristics in Fig. 2. Whereas Water Q3b discriminates quite well between more or less competent students, the other three items have deficiencies that cannot be described without additional parameters. The characteristic of Chair Lift Q1 has almost a plateau at low competence values. This is the typical signature of guessing. On the other hand, Freezer Q1 saturates at less than 35%. This indicates that many students did not find out the intention of the testers. Low discrimination strengths as in South Rainea Q2 may have several reasons: different difficulties in different subpopulations, different difficulties for different solution strategies (cf. Meyerhöfer [11]), qualified guessing, or weak correlation between the latent ability measured here and that measured by the majority of this domain's items.

Figure 2: Some item characteristics that show pronounced deviations from the Rasch model. Solid curves in (a) are fits with a 2-parameter model that accounts for different discrimination. The 4-parameter fit in (b) additionally models guessing and misunderstanding.
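For concreteness, the response functions under discussion can be written down in a few lines. The following sketch (Python; parameter values are purely illustrative, not fitted to PISA data) contrasts the Rasch model with an extended logistic model of the type used for the solid curves in Fig. 2: an additional discrimination parameter a, a guessing floor c, and a ceiling d < 1 for items that many students misunderstand. The latent scale is given in logits; PISA maps it linearly onto the reporting scale with mean 500 and standard deviation 100.

```python
import numpy as np

def rasch(theta, b):
    """Rasch (1PL): P(correct) depends only on competence minus difficulty."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def extended(theta, b, a=1.0, c=0.0, d=1.0):
    """2- to 4-parameter logistic: a = discrimination, c = guessing floor,
    d < 1 = ceiling reached when many students misunderstand the item."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3.0, 3.0, 13)   # latent scale in logits (13 quantile points)

p_rasch   = rasch(theta, b=0.0)                     # every Rasch curve has this shape
p_guess   = extended(theta, b=0.5, a=1.2, c=0.25)   # plateau at low competence (guessing)
p_ceiling = extended(theta, b=-0.5, d=0.35)         # saturation well below 100%
p_flat    = extended(theta, b=0.0, a=0.4)           # weak discrimination
```

Under the Rasch model, p_rasch shifted horizontally is the only shape an item characteristic can take; the other three curves reproduce the qualitative features of the deviating items in Fig. 2.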
The solid lines in Fig. 2 show that satisfactory fits of the empirical data are possible when the Rasch model is extended by parameters that allow for a variable discrimination strength, for guessing, and for misunderstanding. Such multi-parameter item-response models still contain a linear shift parameter that may be interpreted as the item difficulty. However, best-fit estimates of this parameter deviate by typically ±30 points from the official Rasch difficulties [31, Fig. 11]. This model dependence of item difficulty estimates is not compatible with a one-dimensional ranking of items, as is needed for the construction of proficiency levels (Sect. 4.1). Furthermore, as soon as one admits more than one item parameter, any student ranking becomes arbitrary because of the ad hoc anchoring of the difficulty and competence scales.

The first data point of the characteristics of South Rainea and Chair Lift clearly lies below the fit curves: the weakest 4% of participants perform worse than modelled. This may be due to a lack of cooperation: yet another dimension that is not contained in elementary item response theory. It may also be due to the inappropriateness of the Gaussian population model.

3.4 Between-booklet variance

The use of different test booklets makes it possible to employ a total of 165 different items, though every single student works on no more than 60 of them. This reduces the dependence of test results on the arbitrary choice of items. At the same time, it allows us to get an idea of how strong this dependence actually is. Calculating mathematics competence means for groups of students who have worked on the same booklet, inter-booklet standard deviations between 4 (Hungary) and 18 (Mexico) points are found. The largest difference occurs in the USA: students who worked on booklet 2 were estimated to have a math competence of 444, whereas those who worked on booklet 10 achieved 512 points. Eliminating either booklet 2 or booklet 10 would respectively increase or decrease the overall national mean by about three points.

This variance only reflects the arbitrariness in choosing items from a pool that is already quite homogeneous due to the procedures described above (Sect. 3.1). Cultural bias in the submission, selection, and adaptation of items may have a far stronger impact.
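A reader with access to the public-use student file can reproduce this kind of between-booklet comparison with a few lines of pandas. The sketch below is only indicative: the variable names (CNT for the country, BOOKID for the booklet, PV1MATH for the first mathematics plausible value) and the file name are assumptions about a local export of the international file, and a careful analysis would average over all five plausible values and apply the student weights.

```python
import pandas as pd

# Placeholder file name for a local export of the international student file.
df = pd.read_csv("pisa2003_students.csv", usecols=["CNT", "BOOKID", "PV1MATH"])

# Mean mathematics score per country and booklet, then the spread of the
# booklet means within each country:
booklet_means = df.groupby(["CNT", "BOOKID"])["PV1MATH"].mean().unstack("BOOKID")
print(booklet_means.std(axis=1).sort_values())   # inter-booklet SD per country
```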
3.5 Imputation with wrong normalisation

Each of the thirteen regular booklets consists of four blocks. Each item appears in four different blocks, in four different positions, in four different booklets. The major subject domain mathematics is covered by seven of the thirteen blocks; the other three subject domains are tested in two blocks each. While all thirteen booklets contain at least one mathematics block, each minor domain appears only in seven booklets. Nevertheless, in the scaled data all students are attributed competence values in all four domains. If a student has not been tested in a domain, the competence estimate is based on his questionnaire responses and on his school's average math achievement. Such an imputation, when done correctly, reduces the standard error of population means without introducing bias. In PISA, however, it is not done correctly. Bias is introduced because the imputation is anchored at only one of the seven booklets for which real data are available. This bias is plainly admitted in the Technical Report [17, p. 211], though it is quantified only for Canada. The case of Greece is more extreme: the official science competence mean of 481 is 16 points above the average achievement of those students who were actually tested in science [31, Sect. 3.10]; cf. Neuwirth in [14, p. 53]. This huge bias is certainly not justified by the benefits of imputation, which consist in a slight simplification of the secondary data structure and in a reduction of stochastic standard errors by probably no more than 10%.

3.6 Timing, tactics, fatigue

Since every item occurs in four different positions, one can easily investigate how response rates vary during the two-hour testing session: Per-block response rates, averaged across booklets over all items, can be directly compared to each other. One finds that the average rates of non-reached items, of missing responses, and of wrong responses systematically increase from block to block. The extent of this increase varies considerably between countries. The ratio of non-reached items in the fourth block is 1% in the Netherlands; in Mexico it is 25.3%. In the Netherlands, the ratio of items that were reached but not answered goes up from 2.5% in the first block to 4.0% in the fourth block; in Greece, from 11.1% to 24.4%. In Austria, the ratio of right to given responses decreases from 56.2% in the first block to 54.4% in the fourth block; in Iceland, from 58.5% to 53.1%.

All these data indicate that students are lacking time in the last of the four blocks. This alone is a strong argument against the applicability of one-dimensional item response theory [28, p. 43]. The ways students react to the lack of time vary considerably between countries:

• Dutch students try to answer almost every item. Towards the end of the test they become hasty and increasingly resort to guessing.

• Austrian and German students skip many items, and they do so from the first block on, which leaves them enough time to finish the test without much accelerating their pace.

• Greek students, in contrast, seem to be taken by surprise by the time pressure near the end. In the first block, their correct-response rate is better than in Portugal and not far from the USA and Italy. In the last block, however, non-reached items and missing responses add up to 35%, bringing Greece down to one of the last ranks.

Aside from such extreme cases, it is hardly possible to disentangle the effects of test-taking tactics and fatigue.

3.7 Multiple responses to multiple-choice items

In PISA 2003, 42 of the 165 items are in a simple multiple-choice format. For each of these items, four or five responses are proposed, of which exactly one is meant to be the right one. This essential rule is not clearly explained to the examinees. In some countries, for some items, a considerable number of multiple responses are given. They are denoted by a special code in the international data base, but they are subsequently counted as incorrect. In many countries, including Australia, Canada, Japan, Mexico, the Netherlands, New Zealand, and the USA, the quota of multiple responses is close to 0% (except for one particularly flawed item).

Table 1: Percentages for the four possible responses to the multiple-choice item Optician Q1. Data are shown for two countries where almost the same percentage of students chose the right response B. However, preferences for the distractors C and D vary by about 20%.

            A      B      C      D
Slovakia  3.1%  46.1%  17.5%  33.3%
Sweden    3.1%  46.2%  37.0%  13.7%
In Austria, Germany, and Luxembourg, on the other hand, the fraction of multiple responses surpasses 4% for at least eleven items, and it reaches up to 10% for one of them. Such a misunderstanding of the test format does not only distort the outcome of the directly concerned items. It also costs time: it is more effort to decide four or five times whether or not a proposed answer is correct than to choose only one alternative. Those who are familiar with the multiple-choice format sometimes do not even need to read all distractors.

3.8 Testing cultural background

If one wants to understand what a test actually measures, one has to study the manifold reasons why students give wrong responses (cf. Kohn [10, p. 11]). The few student solutions of open-ended items published by Luxembourg show how much information is lost when verbal or pictorial responses are digitally coded. In contrast, in the digital coding of multiple-choice items most information is preserved; the codes for formally valid but incorrect responses indicate which of the three distractors was chosen.

Table 1 shows the response percentages for one item and two countries. In this example, distractor preferences vary by about 20% although the correct-response percentage is almost the same. This demonstrates quantitatively that the reasons that induce students to give a certain wrong answer can vary enormously from country to country. It is fairly obvious that the choice of distractors also influences the correct-response percentage. Had distractor D been more in the spirit of C, it would have attracted additional responses in Sweden, whereas in Slovakia many students would have reoriented their choice towards B.

Between-country variance may be due, for instance, to school curricula, cultural background, test language, or to a combination of several factors. These factors are particularly influential in PISA because students have little time (about 2'20'' per item) and reading texts are too long. Sometimes the stimulus material even tricks students into miscues [29]. In this situation, test-wise students try to solve items, including reading items, without actually reading the introductory texts. Such qualified guessing is of course highly dependent on extrinsic knowledge and therefore susceptible to cultural bias.

The released reading unit Flu from PISA 2000 provides a nice example. The stimulus material is an information sheet about a flu vaccination. One of the items asks how the vaccination compares to alternative or complementary means of protection. Of course students are not asked about their personal opinion; the answer is to be sought in the reading text. Nevertheless, the distractor preferences reflect French reliance on technology and German belief in nature.

3.9 Language-related problems

The language influences the test in several ways: Translations are prone to errors. In PISA, a complicated scheme with double translation from English and from French was foreseen to minimise such errors. However, in many cases, including the German-speaking countries, the French original was not taken seriously, and final versions were produced under extreme time pressure. There are clear-cut translation errors in the released sample items. In the unit Daylight the English word hemisphere was translated by the erudite Hemisphäre where German schoolbooks use the word Erdhälfte.
In the unit Farms, attic floor was rendered as Dachboden, which just means attic. The fact that the Austrian version has the correct wording Boden des Dachgeschosses, although all German-speaking countries had shared the translation work, indicates that uncoordinated and unchecked last-minute modifications have been made. Blum and Guérin-Pace [3, p. 113] report that changing a question (Quels taux . . . ?) into a prompt (Énumérez tous les taux . . . ) can change the rate of right responses by 31%. This gives an idea of how much freedom translators have to help or to confuse (cf. Freudenthal [7, p. 172] and Olsen et al. [18]).

Under translation, texts tend to become longer, and some languages are more concise than others. In PISA 2000, the English and French versions of 60 stimulus texts were compared: the French texts contained on average 12% more words and 19% more letters [1, p. 64]. Mathematics items of PISA 2003 had 16% more characters in German than in English [24]. Of course reading time is not simply proportional to the number of words or letters. It seems nevertheless plausible that such a huge length difference induces an important bias.

3.10 Origin of test items

A majority of test items comes from English-speaking countries; the other items were translated into English before they were streamlined by professional item writers. If there is cultural bias, it is clearly in favour of the English-speaking countries. This makes it difficult to separate it from the translation bias, which acts in the same direction.

The quantitative importance of cultural and/or linguistic bias can be read off from the correlation of per-item correct-response-percentage vectors, as has been shown by Zabulionis [32, for TIMSS], Rocher [27], Olsen [20], and Wuttke [31]. Cluster analyses invariably show that student behaviour is most similar for countries that share both language and cultural heritage, like Australia and New Zealand (correlation coefficient 0.98). If the languages differ, correlations are at best about 0.96, as for the Czech and Slovak Republics. If the languages do not belong to the same family, correlations are hardly larger than 0.94. While some countries belong to large clusters, others like Japan and Korea are quite isolated (no correlation larger than 0.90). These results have immediate implications for the validity of inter-country comparisons: The lower the correlation of response patterns, the more a comparison depends on the arbitrary choice of items.

4 Interpreting cognitive test results

4.1 Proficiency levels

Verbal descriptions of proficiency levels are used to guide the interpretation of numeric results [16, pp. 46-56]. The boundaries of these levels are arbitrarily chosen; nevertheless they are communicated with absurd four-digit precision. Starting at a competence of 358.3, there are six proficiency levels. The width of levels 1 to 5 is about 62.1; level 6 starts at 668.7. Depending on how many students gave the right response, each item is assigned to one of these levels. Based on all items assigned to one level, a verbal synthesis is given of what students can typically do.

By construction, the OECD country average student competence distribution is approximately a Gaussian. The mean of 500 and the standard deviation of 100 are imposed by an explicit (though ill-documented) renormalisation. Therefore the percentages of students in the different proficiency levels are almost constant. To illustrate this important point let us perform a Gedanken experiment.
If the percentage of right responses given by a single student grows by 6%, his competence value increases by about 30 points. Suppose now that the correct-response percentage grows by 6% for all students: the competence values assigned to the students will not increase, because any uniform change of student competences is immediately reverted by the renormalisation to the predefined Gaussian. Instead, the item difficulty values would be lowered by about 30 points, so that about every second item would be relegated to the next lower proficiency level. Theoretically, this should then lead to a rephrasing of the proficiency level descriptions.

However, these descriptions are highly systematic. They are so systematic that they could have been derived straight from Bloom's forty-year-old taxonomy. They are far too systematic to appear like a summary of empirical results: One would expect that not every single item fits equally well into such a scheme, but the level descriptions do not reflect the least irritation. As Meyerhöfer [11] has pointed out, the very idea of proficiency levels is not consistent with the fact that test items can be solved in quite different ways, depending for instance on curricular premises, on testwiseness, and on time pressure. Therefore, the most likely outcome of our Gedanken experiment seems to be that the official level descriptions would not change at all, so that the overall increase in student achievement would pass unnoticed, as have the misfit of the Rasch model and the resulting bias and uncertainty of about ±30 difficulty points.

Another fundamental objection is the lack of transparency. The proficiency level descriptions are not scientifically discussible unless the consortium publishes the instruments on which they are based and the proceedings of the hermeneutic sessions in which the descriptions have been worked out.

In the German reports, students in and below proficiency level 1 are called the risk group. This deviates from the international reports, which speak of risk only in connection with students below level 1. It has become an urban legend in Germany that nearly one quarter of all fifteen-year-olds are almost functionally illiterate, although the original report states that PISA does not bother to measure fluency of reading, which is taken for granted even on level 1 [15, pp. 47-48]. Furthermore, as has been stressed above, the percentage of students on or below level 1 is extremely sensitive to disparities in sampling and participation.

4.2 Is PISA an intelligence test?

PISA items from different domains are quite similar in style and sometimes even in content: Reading items are based on non-textual stimulus material such as graphics or tables, and math or science items require a lot of reading. This is intentional insofar as it reflects a certain conception of literacy. It is therefore unsurprising that competence values from different domains are highly correlated. A majority of per-country inter-domain correlations is stronger than 80%. In such a situation, the sensible thing to do is a principal component analysis. One finds that between 75% (Greece) and 92% (Netherlands) of the total variance of examinee competences can be attributed to just one component. However, no such analysis has been published by the consortium, and when Rindermann [26] did publish one, members of PISA Germany tried to dismiss and even to ridicule it.
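A principal component analysis of this kind is straightforward. The sketch below works on a per-student array of the four domain scores and, for illustration, uses synthetic data with an inter-domain correlation of 0.8, roughly the magnitude quoted above; a real analysis would use the plausible values and student weights of the international data set.

```python
import numpy as np

def first_component_share(scores):
    """Fraction of the total variance captured by the first principal
    component; scores has one row per student and one column per domain
    (reading, mathematics, science, problem solving)."""
    eigvals = np.linalg.eigvalsh(np.cov(scores, rowvar=False))
    return eigvals[-1] / eigvals.sum()

# Synthetic illustration: four domain scores with pairwise correlation 0.8
rng = np.random.default_rng(0)
corr = 0.8 * np.ones((4, 4)) + 0.2 * np.eye(4)
scores = rng.multivariate_normal(np.full(4, 500.0), 100.0**2 * corr, size=5000)
print(f"{first_component_share(scores):.0%}")   # ~85%, within the 75-92% range quoted
```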
The ideological and strategic reasons for this opposition are obvious: Once it is found that PISA mainly measures one general factor per examinee, it is hard not to make a connection to the g factor of intelligence research. PISA members must perceive this as a sacrilege and as a threat; they avoid the word intelligence throughout their writings. This word is taboo in much of the pedagogical mainstream, and no government would spend millions to be informed about the intelligence of students.

4.3 Uncontrolled variables

PISA aims at monitoring outcomes of education systems. However, the education system is just one of many variables that influence the outcome of the cognitive test. As we have seen, sampling, exclusions, response rates, test-taking habits, culture, and language are quantitatively important. Since all these variables are country-dependent, there is no way to separate them from the variable "education system". But even in the hypothetical case of a technically and culturally fair test, it would not be clear that differences in test outcome are due to differences in education systems. There are certainly country-dependent educational influences that are not part of what is generally understood as the education system, such as the subtitled TV programs prevalent in small language communities. Furthermore, equating test achievement with the outcome of schooling is highly ideological in that it dismisses differences in genetic endowment, pre-school education, and out-of-school environment.

The importance of extrinsic parameters becomes obvious when subpopulations are compared that share the same education system. An example is given by the two language communities in Finland. In the major domain of PISA 2000, reading, Finnish students achieve 548 points in Finnish-speaking schools, but only 513 in Swedish-speaking schools, slightly less than Sweden's national average of 516 [31, Sect. 4.8]. A national report [5] suggests that much of the difference between the two communities (which is somewhat smaller in 2003) can be explained by two factors: by the language spoken at home and by the social, economic, and cultural background. If student-dependent background variables have such a huge impact in an otherwise comparatively homogeneous country like Finland, they can even more severely distort international comparisons.

As several authors have already noted, one of the most important background variables is the language spoken at home. Except in a few bilingual regions, a non-test language spoken at home is typically linked to immigration. The immigration status is accessible since the questionnaire asks for the country of birth of the student and his parents. Excluding first- and second-generation immigrant students from the national averages considerably alters the country league tables: At the top of the list in the 2003 major domain, mathematics, Finland is replaced by the Netherlands and Belgium, and it is closely followed by Switzerland. The superiority of the Finnish school system, one of the most publicised results of PISA, vanishes as soon as one single background variable is controlled.

5 Conclusions

One defense line of PISA proponents reads: PISA is state of the art; at present nobody can do it better. This is probably true. If there were one outstanding source of bias, one could hope to improve PISA by fighting this specific problem. However, it rather appears that there is a plethora of inaccuracies of similar magnitude. Reducing a few of them will have very little effect on the overall uncertainty.
Therefore, one has to live with the unsatisfactory state of the art and draw the right consequences. Firstly, the outcome of PISA must be reassessed. The official significance criteria, based only on stochastic errors, are irrelevant and misleading. The accuracy of country rankings is largely overestimated. Statistics are particularly distorted if they depend on response rates among weak students; statements about risk groups are extremely unreliable. Secondly, the large sample sizes of PISA are uneconomic. Since the accuracy of the study is determined by other factors, the effort currently invested in minimising stochastic errors is unjustified. Thirdly, it is clear from the outset that little can be learned when something as complex as a school system is characterised by something as simple as the average number of solved test items.

References

[1] Adams, R. / Wu, M., eds. (2002): PISA 2000 Technical Report. Paris: OECD.

[2] Blanke, I. / Böhm, B. / Lanners, M. (2004): Beispielaufgaben und Schülerantworten. Le Gouvernement du Grand-Duché de Luxembourg, Ministère de l'Éducation nationale et de la Formation professionnelle.

[3] Blum, A. / Guérin-Pace, F. (2000): Des Lettres et des Chiffres. Des tests d'intelligence à l'évaluation du savoir lire, un siècle de polémiques. Paris: Fayard.

[4] Bottani, N. / Vrignaud, P. (2005): La France et les évaluations internationales. Rapport établi à la demande du Haut Conseil de l'évaluation de l'école. http://lesrapports.ladocumentationfrancaise.fr/BRP/054000359/0000.pdf.

[5] Brunell, V. (2004): Utmärkta PISA-resultat också i Svenskfinland. Pedagogiska Forskningsinstitutet, Jyväskylä Universitet.

[6] Fischer, G. H. / Molenaar, I. W. (1995): Rasch Models. Foundations, Recent Developments, and Applications. New York: Springer.

[7] Freudenthal, H. (1975): Pupils' achievements internationally compared: the IEA. In: Educ. Stud. Math. 6, 127-186.

[8] Hambleton, R. K. / Swaminathan, H. / Rogers, H. J. (1991): Fundamentals of Item Response Theory. Newbury Park: Sage.

[9] Hattie, J. (1985): Methodology Review: Assessing Unidimensionality of Tests and Items. In: Appl. Psych. Meas. 9 (2) 139-164.

[10] Kohn, A. (2000): The Case Against Standardized Testing. Raising the Scores, Ruining the Schools. Portsmouth NH: Heinemann.

[11] Meyerhöfer, W. (2004): Zum Problem des Ratens bei PISA. In: J. Math.-Did. 25 (1) 62-69.

[12] Micceri, T. (1989): The Unicorn, the Normal Curve, and other Improbable Creatures. In: Psychol. Bull. 105 (1) 156-166.

[13] Mislevy, R. J. (1993): Foundations of a New Test Theory. In: Frederiksen, N. / Mislevy, R. J. / Bejar, I. I., eds.: Test Theory for a New Generation of Tests. Hillsdale: Lawrence Erlbaum.

[14] Neuwirth, E. / Ponocny, I. / Grossmann, W., eds. (2006): PISA 2000 und PISA 2003: Vertiefende Analysen und Beiträge zur Methodik. Graz: Leykam.

[15] OECD, ed. (2001): Knowledge and Skills for Life. First Results from the OECD Programme for International Student Assessment (PISA) 2000. Paris: OECD.

[16] OECD, ed. (2004): Learning for Tomorrow's World. First Results from PISA 2003. Paris: OECD.

[17] OECD, ed. (2005): PISA 2003 Technical Report. Paris: OECD.

[18] Olsen, R. V. / Turmo, A. / Lie, S. (2001): Learning about students' knowledge and thinking in science through large-scale quantitative studies. In: Eur. J. Psychol. Educ. 16 (3) 403-420.

[19] Olsen, R. V. (2005a): Achievement tests from an item perspective. An exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students' knowledge and thinking in science.
Thesis, University of Oslo.

[20] Olsen, R. V. (2005b): An exploration of cluster structure in scientific literacy in PISA: Evidence for a Nordic dimension? In: NorDiNa 1 (1) 81-94.

[21] Prais, S. J. (2003): Cautions on OECD's Recent Educational Survey (PISA). In: Oxford Rev. Educ. 29 (2) 139-163.

[22] Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2004): PISA 2003. Der Bildungsstand der Jugendlichen in Deutschland: Ergebnisse des zweiten internationalen Vergleichs. Münster: Waxmann.

[23] Prenzel, M. et al. [PISA-Konsortium Deutschland], eds. (2005): PISA 2003. Der zweite Vergleich der Länder in Deutschland: Was wissen und können Jugendliche? Münster: Waxmann.

[24] Puchhammer, M. (2007): Language-Based Item Analysis: Problems in Intercultural Comparisons. In: Hopmann, S. T. / Brinek, G. / Retzl, M., eds.: PISA zufolge PISA / PISA According to PISA: Hält PISA, was es verspricht? Does PISA Keep What It Promises? Wien: Lit-Verlag.

[25] Putz, M. (2008): PISA oder: jedem das seine . . . Wunschergebnis! Zweifel an PISA anhand der Fälle Österreich und Südtirol. http://www.messen-und-deuten.de/pisa/Putz08.pdf.

[26] Rindermann, H. (2006): Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? In: Psychol. Rundsch. 57 (2) 69-86. See also comments and reply in vol. 58 (2).

[27] Rocher, T. (2003): La méthodologie des évaluations internationales de compétences. In: Psychologie et Psychométrie 24 (2-3) [Numéro spécial: Mesure et Éducation], 117-146.

[28] Rost, J. (2004): Lehrbuch Testtheorie - Testkonstruktion. 2nd ed. Bern: Hans Huber.

[29] Ruddock, G. / Clausen-May, T. / Purple, C. / Ager, R. (2006): Validation study of the PISA 2000, PISA 2003 and TIMSS-2003 International Studies of Pupil Attainment. Nottingham: Department for Education and Skills. http://www.dfes.gov.uk/research/data/uploadfiles/RR772.pdf.

[30] Woods, C. M. / Thissen, D. (2006): Item Response Theory with Estimation of the Latent Population Distribution Using Spline-Based Densities. In: Psychometrika 71 (2) 281-301.

[31] Wuttke, J. (2007): Die Insignifikanz signifikanter Unterschiede: Der Genauigkeitsanspruch von PISA ist illusorisch. In: Jahnke, T. / Meyerhöfer, W., eds.: Pisa & Co. Kritik eines Programms. 2nd edition [note: my contribution to the 1st edition is outdated]. Hildesheim: Franzbecker. ISBN 978-3-88120-464-4.

[32] Zabulionis, A. (2001): Similarity of Mathematics and Science Achievement of Various Nations. In: Educ. Policy Analysis Arch. 9 (33).