International Studies in Education

EDUCATIONAL ACHIEVEMENTS OF THIRTEEN-YEAR-OLDS IN TWELVE COUNTRIES

Results of an international research project, 1959-61, reported by:

ARTHUR W. FOSHAY
ROBERT L. THORNDIKE
FERNAND HOTYAT
DOUGLAS A. PIDGEON
DAVID A. WALKER

1962
Unesco Institute for Education, Hamburg

CONTENTS

Foreword

Arthur W. Foshay
THE BACKGROUND AND THE PROCEDURES OF THE TWELVE-COUNTRY STUDY

Robert L. Thorndike
INTERNATIONAL COMPARISON OF THE ACHIEVEMENT OF 13-YEAR-OLDS

Fernand Hotyat
INTERPRETATIONS FROM BELGIAN NATIONAL AND INTERNATIONAL DATA

Douglas A. Pidgeon
A COMPARATIVE STUDY OF THE DISPERSIONS OF TEST SCORES

David A. Walker
AN ANALYSIS OF THE REACTIONS OF SCOTTISH TEACHERS AND PUPILS TO ITEMS IN THE GEOGRAPHY, MATHEMATICS AND SCIENCE TESTS

The national data pooled for purposes of this study derive from the work of various national centres for educational research and are used with their kind permission. The opinions expressed in the various sections of this report, which is sponsored by the Unesco Institute for Education, are those of their authors and do not necessarily represent the views of the Unesco Institute for Education, of Unesco, Paris, or of the research institutions to whose staffs the authors belong.

Foreword

The present study may well be described as an unusual addition to the literature of education. The results of the project here reported suggest that both empirical educational research and comparative education can gain new dimensions, the one by extending its range over various educational systems, the other by including empirical methods among its instruments.
In the minds of its authors the project had the double purpose of throwing light on the possibilities of such research, and of obtaining actual results which would not so much evaluate educational performances under different educational systems in absolute terms as discern patterns of intellectual functioning and attainment in certain basic subjects of the school curriculum under varying conditions. This would be a first step towards bringing into profile the relative merits of various learning processes and procedures.

If the results so far, because of limitations on their validity which the authors freely admit, are little more than suggestive, at least they offer real encouragement for believing that such researches can, in the future, lead to more significant results and begin to supply what Anderson has lamented as “the major missing link in comparative education”, which in his view is crippled especially by the scarcity of information about the outcomes or products of educational systems*.

It does not detract from the achievement of this exploratory study to say that, when international empirical research methods are further refined, it will be essential for the researcher also to explore ways of relating the data more closely to the specific educational principles and objectives which underlie the various national educational systems. Certainly the international group itself was sufficiently encouraged by the results of its first exploratory study to embark on a more ambitious one during which, at several key points in the secondary school cycle, as comparable samples of schoolchildren as can be obtained will be subjected to tests which bear close reference to curricula and educational aims in all the participating countries.

The project here reported could not have been accomplished without substantial contributions from twelve different national centres for educational research (listed on p.
8 and 9) which were responsible for the technical and scientific aspects of the study, whilst the Unesco Institute contributed from its experience in international administration and coordination of research. Each national centre was involved in the expense of providing tests and the local organisation and administration of field work, whilst the Unesco Institute substantially underwrote the expenses arising at the international level. Beyond this the project depended on the goodwill and cooperation of a large number of teachers in whose schools the tests were administered and the background information collected. I am very glad that the importance of this support, which for unavoidable reasons usually remains anonymous, is brought out clearly in Dr. Walker’s final section of this volume.

In conclusion, I should like to express the particular debt of thanks which is owed to Professor Foshay, who directed the project, to Professor Thorndike, who acted as chief test editor and was primarily responsible for analysing the international data, and to the authors who have contributed to this volume. It is hoped that subsequently it may prove possible to add further analyses of the international data to those contained in the present report. Our thanks are due no less to Mr. Cobb who, following his responsibility for day-to-day coordination of the project, has undertaken the compilation and arrangement of the present report.

Hamburg, August 1962
Saul B. Robinsohn

* C. Arnold Anderson: “Methodology of Comparative Education”, International Review of Education, Vol. VII, 1961, No. 1, p. 7 and 8.

Arthur W. Foshay

THE BACKGROUND AND THE PROCEDURES OF THE TWELVE-COUNTRY STUDY

In the general orientation with which this report opens, Professor Foshay describes the process of setting up an international project for achievement testing in 12 countries, the tests that were administered and the characteristics of the samples to which they were given.

If custom and law define what is educationally allowable within a nation, the educational systems beyond one’s national boundaries suggest what is educationally possible. The field of comparative education exists to examine these possibilities.

The present exploratory study, like all studies in comparative education, has its roots in the desire of each of us to know more about educational systems other than our own. It differs from others, however, in that it seeks to introduce prominently an empirical approach into the methodology of comparative education, a field that has in the main relied on cultural analysis as its chief mode of inquiry.

Beginning in June, 1959, and ending in June, 1961, research agencies from twelve countries cooperated in a pilot study of school achievement, under the general sponsorship of the Unesco Institute for Education in Hamburg. The purpose of the present report is to describe this effort and to indicate some of the results.

Background and need

The number of cross-country comparisons of school achievement is small, and the findings must be limited severely. Typically, such studies have involved two populations somewhere in the middle of their respective school careers. Achievement has been compared in one school subject, such as mathematics. Such studies, useful as they have been, have been limited in scope. In some cases, the comparisons have been inappropriate because of differences in curriculum, a difficulty compounded by the “mid-career” point at which comparison was made.

What has not heretofore been attempted, even on a limited basis, is a comparison that would take the school population near a terminal point, and involve many countries from the same general world culture. Such a large-scale effort would seem difficult to administer and exceedingly expensive to carry out.
Such an effort, however, would have advantages: the results could be examined with one’s mind on the fact that they arose from many apparently different conceptions of the nature and meaning of education; since the students were near the end of their formal education one might take the test responses as representing the outcome of the educational system as a whole, rather than catching a student in mid-career, before the curriculum had been completed. Since a large-scale attempt has not been made, one could find by trying it whether such a project was in fact feasible.

More important than these considerations, however, is the possibility that comparisons could be made more analytically than has so far been attempted. The function of the academic curriculum is to teach children to think in ways appropriate to the subject matter being learned. It would be very useful if short-answer tests, with their desirable attributes of definiteness and objectivity, could be used to discern patterns of intellectual functioning in several standard school subjects. Such patterns, if they exist, would shed light on an ancient pedagogical problem - the problem created by the fact that we know much more about the measurement of the results of educational effort than we know about the effort itself.

Purposes of the present study

The present exploratory study was intended to show whether such needs could be met. In general terms, the purposes of the study can be stated as follows:

1. To see whether some indications of the intellectual functioning behind responses to short-answer tests could be deduced from an examination of the patterning of such responses from many countries.

2. To discover the possibilities and the difficulties attending a large-scale international study.

In the papers that follow in this volume, several reports are presented of the results so far achieved. Here I shall describe the procedures we have used, the populations tested, and the tests we have employed.
Early planning: the participants

In 1958, the Governing Board of the Unesco Institute for Education accepted the present writer’s proposal that an “international study of intellectual functioning” be undertaken. The officers of the Institute invited several directors of educational research organizations to meet in Hamburg in June, 1959, to consider the proposal further, and to decide whether they wished to participate in such a study. As it happened, the second meeting of representatives of European Centres of Educational Research was scheduled to take place near London a week later, and the proposal was described there also, with the result that some centers not represented at the Hamburg meeting also joined in the project.

The Hamburg meeting of June, 1959 lasted for five days. During this time, the participants considered and adopted the proposal that a study be designed, and then proceeded to make the design, to prepare preliminary tests, and to arrange a schedule. Each of the participants was to bear the costs of test administration within his own country. The Unesco Institute paid the cost of travel and maintenance for the European participants at the three meetings finally held, and furnished extensive coordinative services.

The participants who finally took part in the study were the following:

Belgium: Fernand Hotyat, Directeur du Centre des Travaux, Institut de Pédagogie, Morlanwelz.

England: Dr. W. D. Wall, Director, and D. A. Pidgeon, Senior Research Officer, National Foundation for Educational Research in England and Wales, London.

Finland: Professor Martti Takala, Professor of Psychology and Director, Centre for Educational Research, Jyväskylä.

France: Professor Gaston Mialaret, Professor of Psychology and Pedagogy, University of Caen (President of the International Association of Experimental Education of the French-Speaking Countries).

Federal Republic of Germany: Professor Dr. Walter Schultze, Director, and Dr.
Rudolf Raasch, Research Assistant, Hochschule für Internationale Pädagogische Forschung, Frankfurt am Main.

Israel: Dr. Moshe Smilansky, Pedagogical Adviser and Director of Research, Ministry of Education and Culture; Director, Henrietta Szold Institute for Child Welfare, Jerusalem.

Poland: Professor Jan Konopnicki, University of Wroclaw.

Scotland: Dr. D. A. Walker, Director, Scottish Council for Research in Education, Edinburgh.

Sweden: Professor Torsten Husén, Research Professor of Educational Psychology, and L.-M. Björkquist, Research Assistant, Institute of Educational Research, Teachers College, and University of Stockholm.

Switzerland: Professor Dr. S. Roller, Institute of Sciences of Education, University of Geneva.

USA: Professors Arthur W. Foshay, A. H. Passow, and D. L. Super, Horace Mann-Lincoln Institute, and Robert L. Thorndike, Institute of Psychological Research, Teachers College, Columbia University, New York. Professors Benjamin S. Bloom and C. Arnold Anderson, Center for Comparative Education, University of Chicago.

Yugoslavia: Dr. Vladimir Mužić, Institute of Education, University of Zagreb.

In addition to Mr. D. J. Cobb, who served as the continuing coordinator of the project, the Unesco Institute for Education, under the direction of Dr. S. B. Robinsohn (and prior to his appointment the Acting Director, M. R. E. Hennion), put the services of its excellent staff at the disposal of the project.

Under the supervision of Professor Thorndike, Dr. and Mrs. Leonard Burgess were responsible for the tabulations of data.

The procedure as planned

When the group described here met in June, 1959, they reached a number of agreements about procedure, first having considered and accepted the general proposal. Before stating them, however, it is necessary to state the caveat that applies to this study.

This is an exploratory study.
The participants in this study were working with no extra funds, no extra allotment of time, and without the benefit of a previously developed set of procedures. It will therefore be apparent that both the tests and the sampling procedures do not meet the standards that might otherwise be required. For these shortcomings we are not apologetic; it was necessary to accept them, and hence to restrict the statements based on the data gathered, if the study was indeed to be undertaken. Since not all of the sample populations are comparable, we shall not report total scores here as if they could be compared. The most interesting analyses involve patterns of responses among items and sub-scores, not comparisons of total scores, and it is this kind of analysis that is reported here.

The following procedural agreements were made and acted on by the participants in the study:

1. The sample

a. The students to be tested would all be aged from 13 years to 13 years 11 months on the first day of the school year, whatever might be the school level (grade) at which they were found.

b. The sample population in each country would be between 600 and 1000 in number.

c. The sample to be tested would be all the children of both sexes residing in a community or communities selected to yield a population of the designated size.

d. The community or communities selected for testing would be as representative as possible of the total population of the country, according to whatever data were available to the participant in the study. If (as was true in some countries) no data were available to aid the participant in his selection, he was to use his own judgment.

2. Data about the children to be tested

a.
Background data on each of the children to be tested would be gathered, as follows:

1) birth date
2) sex
3) number of siblings
4) place in birth order
5) home language (if different from school language)
6) location of home (city of 20,000-100,000; 2,000-20,000; under 2,000 inhabitants)
7) years in school
8) kindergarten (attended, not attended)
9) size of class (by 10’s, from 10 or less to 61 or more)
10) father’s education
11) mother’s education
12) interest of parent (much, moderate, little or no)
13) father’s occupation
14) mother’s occupation
15) score on non-verbal intelligence test

3. The tests

a. Tests would be administered in the following fields: reading comprehension, mathematics, science, geography.

b. A non-verbal test would be administered to all of the children, the score to be added to the background information.

c. The working languages of the study would be French and English. Translation of the test items would be done by each participant into his home language. Copies of the translated tests would be deposited with the Unesco Institute for Education.

d. Trial forms of the tests (except the non-verbal, which had been developed by the National Foundation for Educational Research in England and Wales) would be developed by the participants working together at Hamburg. (The items for the tests as finally constructed were, in the main, taken from existing tests originally developed in England, France, Germany, Israel and the U.S.A.)

e. The trial forms of the tests would be pre-tested with a small number of children in each country, and criticisms and suggestions sent to a test editor for consideration.

f. The tests as finally approved would be duplicated and circulated to the participants by the Unesco Institute.

g. Alterations in the substance of items would be permissible, provided they were approved by the test editor. (A typical alteration involved the change in units of measure to conform with the custom of the country.)

h.
The tests would be held to approximately 30 items, in the hope that each of them could be completed in less than 45 minutes. (Pre-testing and the later administration of the tests confirmed this as an adequate length of time.)

i. No time limit would be imposed on the students.

j. A practice test would be constructed which included examples of all the kinds of items included in the tests to be scored. During the practice session students would be encouraged to ask any question that occurred to them about the practice test and about the project as a whole. Teachers were requested to answer all questions fully, including giving the answers to the practice test.

4. The schedule

The tests were to be administered in November, 1960, the data sent to New York for processing by February 1, 1961, and the results of the first data processing were to be made available to the participants by June, 1961. (Chiefly because of the excellent coordination by the Unesco Institute, this schedule was met virtually to the minute.)

5. Organization

The administrative center for the project was the Unesco Institute in Hamburg. The participants met there three times, each time for one week: in June, 1959, to plan the project and construct trial forms for the tests; in October, 1960, to take a final look at the project before testing; and in June, 1961, to examine the data and to plan for interpretation and publication.

Certain persons accepted special responsibilities for the conduct of the project, as follows:

Arthur W. Foshay (U.S.A.), project director; editor of the geography test,
Robert L. Thorndike (U.S.A.), test editor,
A. Harry Passow (U.S.A.), editor of the science test,
Gaston Mialaret (France), editor of the mathematics test,
Walter Schultze (Germany), editor of the reading test,
D. A. Pidgeon (England), editor of the non-verbal test,
D. J. Cobb (Unesco Institute), coordinator of the project.
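The age and sample-size rules agreed under "1. The sample" (pupils aged from 13 years to 13 years 11 months on the first day of the school year; 600 to 1000 pupils per country) can be expressed as a simple check. The following is only an illustrative sketch in Python, not a procedure used by the participants: the function names are hypothetical, and the school-year start date of 1 September 1960 is an assumption chosen for the example, since the actual opening dates differed from country to country.

```python
from datetime import date


def completed_months(birth: date, on: date) -> int:
    """Return the number of whole months of age completed on the date `on`."""
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1  # the current month of age is not yet complete
    return months


def is_eligible(birth: date, school_year_start: date) -> bool:
    """True if the pupil is aged 13 years 0 months to 13 years 11 months
    on the first day of the school year (rule 1a of the agreements)."""
    age_in_months = completed_months(birth, school_year_start)
    return 13 * 12 <= age_in_months <= 13 * 12 + 11


def sample_size_ok(n: int, low: int = 600, high: int = 1000) -> bool:
    """True if a national sample falls within the planned range (rule 1b)."""
    return low <= n <= high


# Hypothetical first day of the school year; an assumption for illustration.
START = date(1960, 9, 1)

# A pupil born on 15 March 1947 is 13 years 5 months old on that day.
print(is_eligible(date(1947, 3, 15), START))
```

Applied to the sample sizes reported for individual countries, `sample_size_ok` would simply flag those that fell outside the planned range.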
The procedure as executed

The procedure as described above was as uncomplicated and as realistic as the participants could make it. Future planners of such projects as this, however, will be interested in the variations from the plan that developed as it was actually carried out. There were several of these: the sampling procedure varied from the plan in a number of details; the times at which the tests were administered varied somewhat because of differences in the academic calendar; a few test items had to be changed, even after the pre-testing.

In order that the limits of the comparisons may be known explicitly, we shall present here descriptions of the population samples furnished by the participants, descriptions of the tests, and some comments on the translation of the tests.

The samples

The plan called for samples of from 600 to 1000 children between 13 and 14 years of age, these being all of the children in a representative community. The samples actually ranged from 300 (Switzerland) to 1,732 (Israel). The total number of children tested in all the countries was 9,918. In the section that follows, descriptions of each sample as provided by the participant are reproduced.

Belgium

The area over which the work of the Institut Supérieur de Pédagogie du Hainaut extends consists of localities with between 500 and 2,500 inhabitants (a small industrial city and its environs). Pupils at the post-primary level are scattered over a number of schools and not concentrated in any one town. Due to this fact, it was necessary to use statistical information relating to the whole of the French-speaking region of Belgium as the basis for forming the sample.

According to statistics for the school year 1956-59, the school population aged 13-14 (counting only pupils who were in the grade appropriate to their age or not retarded more than 1 year) was divided up as follows:

Boys:   general secondary schools                                        55%
        vocational schools or “quatrième degré”* of the primary school   45%
Girls:  general secondary schools                                        47%
        vocational schools or “quatrième degré” of the primary school    53%

* 4e degré de l’école primaire: two top classes (7th and 8th years) of the primary school providing a suitable terminal course generally with vocational bias for pupils who have not transferred to general or vocational secondary education at the end of the sixth year. (World Survey of Education, Vol. III, Unesco, Paris, p. 236)

In order to obtain representative groups, complete schools were taken, but these were later reduced by random sampling methods to conform to the percentages given in the official statistics. The procedure adopted is set out in the following table:

                                         BOYS                        GIRLS
                                 Tested   Retained in        Tested   Retained in
                                          samples for                 samples for
                                          analysis                    analysis
Vocational schools and 4e degré    145       145               205       170
General secondary schools          214       175               193       150

The sample finally submitted to analysis thus contained 640 subjects, 320 boys and 320 girls. A check was made to ensure that the eliminating process did not affect the mean scores.

No attempt was made to provide any representation of the Flemish-speaking part of Belgium. Children who were retarded two or more years in school were excluded. These are estimated (by M. Hotyat) to be about 10% of the total. One or two of the sections in the vocational schools included quite retarded children. In the vocational schools about 26% of the pupils were children of foreign workers (primarily miners), and the corresponding percentage in the general course was 6%.

England

The sample from England consisted of 1,181 pupils, 607 boys and 574 girls. The pupils were all the 13-year-olds attending school under one Local Education Authority in central England.
This particular area had been chosen since other evidence had shown that, on tests given at the age of 11, the authority was quite representative of the whole country both with respect to mean score (100) and standard deviation (15). Although the proportions of children from urban and rural administrative areas were also similar to those in the country as a whole, the authority was predominantly rural in character and contained no large industrial town. The number of schools and pupils were as follows:

                          Schools   Boys   Girls   Both
Grammar                      4       115    105    220 (18.6%)
Modern                       6       461    443    904 (76.5%)
Unreorganised all-age        3        31     26     57 ( 4.8%)

Three of the grammar schools, each containing boys and girls, were controlled by the local authority concerned and the fourth was a Direct Grant boys’ school which volunteered to assist in the enquiry. All six secondary modern schools were coeducational, as were the other three small unreorganised rural schools containing pupils of both primary and secondary school age.

Finland

The sample from Finland included 727 pupils, 386 boys and 361 girls. These came from about 50 classes in schools widely distributed over Finland. The choice of schools and numbers of pupils are such as to make the sample closely representative of Finland as a whole with respect to grade and type of school and with respect to the percentages of urban and rural pupils. The proportions for the country and for the sample were reported as follows:

Whole Sample country 62% 38% school (Grades IV-VIII) In secondary school (Grades I-V) In rural community (7- 15 yrs.) In urban community (7-15yrs.) Rural students in primary school Rural students in secondary school Urban students in primary school Urban students in secondary school In primary 68 x 32% 64% 36 % 78% 67 % 33% 76% 24 % 45% 55% 22 % 48 7; 54 %

No pupils were tested in Finland who were in classes below Grade VII of the primary school. This meant that in primary schools about 2%
of children in the age group, most of them retarded, were not included in the sample. In the secondary schools all 13-year-old children were included, irrespective of their grade level (actual proportions: 55% in Grade II, 20% in Grade I, and 25% in Grade III) and the secondary school sample was thus as representative as possible. No pupils were included from Swedish-speaking districts in Finland (about 8% of the population). Many of the classes in rural areas included mentally retarded children.

France

The sample from France was a relatively small one of 451 pupils, 181 of whom were boys and 270 girls. The small size was accounted for in part by bad weather, which reduced school attendance at the time of testing. The sample was drawn from one small city near Caen, together with the adjoining rural area. The sample corresponded approximately with the total French population with respect to percent of urban residence, occupation of father and size of family, as these were determined in an earlier extensive survey by Heuyer, Piéron and Sauvy*. The sample was chosen to represent the different types of schools in the following numbers:

                     Boys   Girls   Total
Primary school        117    142     259
Vocational school      20     20      40
Secondary school       52     98     150

Those pupils who were retarded more than one year in school were excluded. This is believed to be about 5% of the total age group.

Federal Republic of Germany

The sample consisted of 811 pupils, 403 boys and 408 girls, who were attending schools in the city of Darmstadt in Hessen, or in the adjoining rural districts. These three districts are believed to correspond well with the country as a whole in socio-economic structure, in distribution by types of occupation, and in education. Furthermore, previous experience in setting up test norms has shown this region to be close to the national average.
Within the region, the sample was chosen so that the proportions corresponded closely with the national average with respect to type of school attended, and within the Volksschule representativeness was sought with respect to grade level reached and, within the eighth grade, with respect to size of school. The proportions are shown below for both sexes combined.

                              Country as a whole (%)   Sample (%)
E.S.N.                                 2.0                 2.1
Mittelschule, 8th grade               10.7                11.0
Gymnasium, 8th grade                  18.1                18.5
Volksschule, total                    69.2                68.4
    6th grade                          1.7                 1.7
    7th grade                          7.3                 7.4
    8th grade                         60.2                59.3
        1-3 classes                   15.0                15.4
        4-7 classes                   18.1                16.2
        8-9 classes                   27.1                27.7

* Heuyer, G., Piéron, H., and Sauvy, A. Le Niveau Intellectuel des Enfants d’Âge Scolaire. Paris, Presses Universitaires de France, 1950.

No data were available concerning retardation in the Mittelschule or Gymnasium, and so no cases of this type were included.

As in all other countries, except Scotland, the tests were administered in Germany in November, 1960, to pupils aged 13.0 to 13.11 on the first day of the school year. As Germany was in the unique position of having a school year beginning in April, not September as in other countries, the German children were in fact aged between approximately 13.7 and 14.6 at the actual time of testing, as opposed to approximately 13.2 to 14.1 in the other countries. This difference may, however, have been partly offset by the long summer and other holidays which considerably reduced the German pupils’ effective period in school prior to testing.

Israel

Because of local interest in certain types of sub-groups, Israel tested a relatively large sample, almost 1,900. Data were analyzed for 1,873, 930 coded as girls and 942 coded as boys. These were from a number of different schools in different localities.
The basic classification is into town and city schools, schools in Moshavim (collective agricultural settlements), and schools in Kibbutzim (communities with communal housing, dining, and child-rearing). Schools in Moshavim were all located in well-established settlements of early immigrants. Schools in Kibbutzim were chosen in equal proportions to represent three ideological trends, but otherwise at random. Town and city schools were chosen so as to include essentially equal numbers of schools that had scored high, above average, and average on the national eighth grade survey.

The Israeli data are based on all eighth grade pupils in the schools which were tested. No 13-year-olds who were not in the eighth grade were included in the Israeli sample. The result is that 177, or 9.5% of the group, were 14 years of age or over at the beginning of the school year, while 514, or 27.4%, were less than 13 years of age. At the same time, all 13-year-olds who were in the 7th or earlier grades were excluded. These are estimated to be 10% of the total age group. Likewise about 4% of the 13-year-olds, who had progressed beyond the eighth grade, were excluded. Furthermore, the sample was limited to schools in which the pupils were primarily of the early group of immigrants to Israel, and thus largely of European origin. Thus, of the total sample only 193, or 10.3%, had fathers born in an African or Asian country, compared to about 30% in the eighth grade nationally. (It is to be noted that of the fathers 166, or 8.9%, are classified as professional or managerial, but Dr. Smilansky indicates this is not unusually high for the European section of the country’s population.)

(It is also to be noted that the date of testing in Israel was February-March rather than November, so that pupils had an additional three or four months of growth and schooling as compared with most countries.)

Poland

The total population tested in Poland was 1,000, consisting of 346 boys and 654 girls.
Children were selected from five different environments. The rural children were tested in large villages possessing at least the first classes of secondary school. Children from small towns were tested at Milicz, from mixed agricultural-industrial surroundings at Dzierzoniow, from industrial areas at Walbrzych (where there are coal mines and heavy industry), and children from a favored cultural environment in one of the districts of Wroclaw, 80% of whose inhabitants were reported by the Polish research agency as being highly educated people.

Scotland

The Scottish sample consisted of 991 pupils, 515 boys and 476 girls, drawn from two of the educational administrative units in Scotland, the city of Aberdeen and the county of Stirlingshire. These were chosen because each had been found in the past to be representative of educational achievement in Scotland as a whole. In Aberdeen, one-sixth of the 13-year-old pupils were tested, the sampling being based upon the day of the month on which they were born. In Stirlingshire, a sample was drawn from half of the schools in the county in such a way that each school course was represented in the same proportions as in the county as a whole. Thirteen schools were involved in Aberdeen and ten in Stirlingshire.

The testing in most of the countries took place in November of 1960, it having been agreed that this was a good time for those countries in which the school term begins early in October or during the month of September. The school year would have been well started by this time, and the children would have had a substantially similar number of days since the school year had begun. In Scotland, however, the testing was conducted in June of 1960 as certain reorganisations were due to take place in Scottish secondary schools during the autumn of 1960.
It was therefore necessary to give the test at that time, and to adjust the selection procedure so that the Scottish children were at the appropriate age when they took the tests.

Sweden

The sample in Sweden consisted of 567 pupils in all, 284 boys and 283 girls. These were drawn from about 30 classes in the middle part of the country. The classes were chosen from schools which in previous national surveys had given results with an average score and variability close to the national average. Testing was limited to seventh year classes in the various types of schools. Those pupils in the classes who were above or below the age limits for the study were excluded from the sample. No attempt was made to test those 13-year-olds who were in classes either above or below the seventh. The number of 13-year-old pupils in classes above the seventh is estimated to be about 10 % and the number in the sixth or lower classes is estimated to be about 5 %.

Switzerland

The Swiss sample consisted of only 314 pupils, 153 boys and 161 girls. These were drawn from the city of Geneva, no attempt being made to represent a wider geographical region. Dr. Roller points out that a national sample in Switzerland would have had to be drawn from each of the cantons in the country, there being no other unit that is appropriate to the cultural and population distribution in Switzerland. Since the total 13-year-old population of Geneva is about 2,000, the sample is considered adequate to represent them. Within Geneva, the sample was set up so as to include appropriate numbers in the different types of schools in grades 7 and 8. The total numbers tested were as follows:

7th Grade boys 8th Grade girls boys girls Ecole primaire Collège de Genève Ecole primaire Collège de Genève Collège moderne Ecole supérieure lat. Ecole supérieure mod. Ecole ménagère 101 72 203 72 94 52 77 76

Those pupils among this total group who were 13 years of age were included in the final sample.
No attempt was made to test 13-year-olds who fell below the seventh grade. These are estimated to comprise 5 % of the age group.

U.S.A.

In the United States the total group tested was 2,254, but in order to reduce the burden of statistical analysis, only every other pupil was included in the sample finally analyzed, which comprised 1,127 pupils, 568 boys and 559 girls. The United States sample consisted of all the 13-year-olds in the public school systems in three different educational administrative units. One was an industrial city that is part of the Boston, Massachusetts metropolitan area. A second was an industrial city of about 50,000 in southern Ohio, not far from Cincinnati. The third was a rural county in south central Illinois. The three units were chosen because they had been found to give results close to the national average when they were used in the nationwide standardization testing for the Metropolitan Achievement Test published by the then World Book Company.

Yugoslavia

The Yugoslav sample consisted of 685 children, distributed as indicated in the following table:

                              Boys    Girls    Combined
Urban                          202      206       408
Rural                          135      142       277
(Multiple class teaching)     (28)     (31)      (59)
Total                          337      348       685

685 subjects in 24 classes, of 13 schools in 10 localities.

Some previously determined localities were substituted by others with the same general situation (general cultural level of the population, distances from principal ways of communications, etc.). The localities where testing actually occurred were: Bukevje, Klara, Lomnica, Novo Čiče, Odra, Veleševec, Velika Gorica, Velika Mlaka, Vukovina, Zagreb. In the localities tested, all 13-year-olds attend school. Only handicapped children educated in special institutions for the handicapped were excluded from the sample. Testing was also carried out with pupils above and below the seventh grade. All classes are co-educational.

The tests

Four tests of academic achievement were given.
Since our general purpose was to gather data that could yield inferences about intellectual functioning, or reasoning, an attempt was made in each test to include items that called for reasoning, but did not require previous knowledge of the field. The ideal item was one that presented all the information required for a correct answer. The typical item was multiple-choice. Here are very brief descriptions of the tests.

Mathematics test:

5 items requiring simple computation.

5 “basic concept” items, e.g.:
“Which is the largest of the following numbers?
a) 94512
b) 19542
c) 95421
d) 59241”

7 verbal problems, e.g.:
1) “A train leaves Rome at 1:00 p.m., traveling at 60 miles an hour. A second train leaves along the same line 30 minutes later, traveling at 60 miles an hour. At what time does the second train overtake the first?”
2) “This table gives readings of maximum and minimum temperature in degrees Fahrenheit, of rainfall in inches, and of sunshine recorded for each month of the year.
a) In which month did the highest temperature occur?
b) In which month was the difference between the maximum and minimum temperatures greatest?
c) Which was the wettest month?”

9 problem sequences, e.g.:
“We know that the altitude of a triangle is a line drawn from one vertex perpendicular to the opposite side. We are given the triangle in Figure 2.
a) What is the altitude from vertex B?
b) What is the altitude from vertex C?”

FIGURE 2 (triangle with vertices A, B, C and points H, H’, H”)

Total: 26 items, some with subdivisions; 29 responses.

Science test:

16 items of the form:
“When you enter a movie theater on a sunny day, you do not see well at first because the
a) pupils of the eyes are still large
b) lenses in the eyes will focus the light in front of the retina
c) pupils of the eyes are still small
d) lenses in the eyes will focus the light behind the retina.”

5 items to be marked as either “definitely true, probably true, impossible to determine, probably false, or definitely false,” e.g.:
“One can tell the approximate age of a tree from the rings on a cross-section of its trunk.”

Total: 21 items.

Geography test:

12 multiple-choice items depending on information, e.g.:
“The Danube flows into the
a) Red Sea
b) Mediterranean Sea
c) Dead Sea
d) Black Sea.”

16 items involving drawing inferences from a set of hypothetical maps.

4 items requiring that generalizations be stated as supported or not supported by a bit of text.
“Statements
a) Many mountain stream beds are narrow, steep, and full of rapids.
b) People who live in rugged mountain areas tend to depend for their livelihood more upon animal products than upon the growing of crops.
c) Mountain dwellers are often fine craftsmen.
d) In mountainous areas, water power can be used to produce electricity for manufacturing.
Generalizations
1) Beautiful carved leather articles are made in the Himalayan area.
2) In mountainous areas, export trade is restricted to articles of high value and small bulk.
3) Mountain people do not build large boats.
4) Life in many mountainous areas has become considerably more comfortable during the present century.”

Total: 32 items.

Reading comprehension:

5 reading passages, each followed by 6 or 7 comprehension items, e.g.:
“According to the text, considerate driving means
a) greeting other people in a friendly way
b) giving help if there has been an accident
c) assisting if there has been a breakdown
d) watching out for old and infirm people and children.”

Total: 33 items.

Non-verbal test:

74 items, requiring either perception of analogies among abstract figures, completion of series, perception of differences, or perception of relationships.

Validity of the tests

The multiple-choice form of testing is more familiar to students in Scandinavia, the United Kingdom, and Germany than it is in the other participating countries.
A practice test which included examples of all the forms of test items was provided for this reason. There is no evidence that the scores actually achieved bore any relationship to previous experience with tests of this kind.

Reliability of the tests

The reliability of the tests is discussed by Professor Thorndike in the second chapter of this report. The general estimates of reliability of the tests are as follows:

Mathematics    .81
Reading        .81
Geography      .70
Science        .62

A note on translation

The tests were originally prepared in either English, French, or German. They had to be translated into eight languages: English, Finnish, French, German, Hebrew, Polish, Serbo-Croatian and Swedish. The problem of translation was, of course, of great concern. Since the participants did not mean to make this the main problem of the study (any more than they meant to make sampling or test construction the main problems), they agreed to leave the translation of the items into their own languages to each participant. They did not, for example, test the translation by having it translated back into its original language in order to compare the re-translation with the original.

This led to occasional differences in items. A striking example of this, as might be expected, was in the translation of a passage in the reading comprehension test, in which the literary quality of the passage in its original French was its main characteristic.

“Elle sort d’une touffe d’herbe qui l’avait cachée pendant la chaleur. Elle traverse l’allée de sable à grandes ondulations.”
“A caterpillar emerges from a tuft of grass where it has been concealing itself during the warm weather. It crosses the gravel path, moving in a series of large ripples.”

“Quelle belle chenille, grasse, velue, fourrée, brune, avec des points d’or et ses yeux noirs!”
“What a beautiful caterpillar - fat, hairy, furry and brown, with golden spots and black eyes!”

A different translation problem appeared at one point in the mathematics test. A question in the English original read: “How would omitting the decimal point in 18.52 change the number?” One of the answers from which the examinee could choose read: “Makes it 1/10 as large”. The French translation reads: “Il devient 10 fois plus petit” - an entirely different problem.

Such difficulties in translation apparently were so small in number and so scattered as to be insignificant. There is no evidence that they seriously influenced the national scores.

Taking this exploratory study as a whole, we think we have demonstrated certain matters of importance in the field of comparative education:
1. A large scale project, which depends on similarities in technical and philosophical assumptions in education and in measurement, can be done.
2. The data obtained, even under the restrictions that inevitably arise, can be analyzed fruitfully.
And, by extension, we think we have shown that it is possible to introduce a large empirical element into comparative education - an element only slightly present in the field until now.

Robert L. Thorndike

INTERNATIONAL COMPARISON OF THE ACHIEVEMENT OF 13-YEAR-OLDS

In the second section of this report Professor Thorndike presents and discusses certain aspects of the results. He explains the reason for deciding to restrict comparisons between country and country to patterns of achievement (national profiles), omitting comparisons of levels, and he presents findings on the reliability of the tests. The results are analysed in relationship to sex and certain background variables.
Further analyses of variations between countries in the relative difficulty experienced with selected test items, combined with a study of item content, lead the author to investigate a number of hypotheses with results which demonstrate the possibilities of international evaluations of achievement.

Because the present project was a pilot enterprise, carried out with limited resources, it was not practical to try to get a truly representative sample of the 13-year-old population in each country. Sampling procedures varied from country to country, as described by Foshay, but in most instances sampling was limited to one or a few communities or regions that were thought to be representative of the country as a whole. In a few countries (England, Scotland, Sweden) there had previously been fairly complete national testing surveys, and communities or regions could be chosen which had been found on these to correspond to the country as a whole. In some countries (Switzerland, Israel) the sample was intentionally restricted to a place or to a fraction of the population that were fairly clearly not representative of the country as a whole. In most of the countries an attempt was made to achieve representativeness, but the evidence upon which communities or schools were chosen was rather meager and impressionistic.

Because of these limitations on the representativeness of the national samples, there seems to be little value in comparing the absolute level of achievement in one country with that in other countries. For this reason, no country by country tables of mean scores are reported. We will turn our attention instead to an examination of the magnitude of the differences between countries, and to the differences in patterns of achievement from country to country.

Statistical characteristics of the tests

The test battery consisted of four short achievement tests and a non-verbal measure of scholastic aptitude.
Three of the achievement tests yielded separate part scores as well as a total score. The nature of the several tests and sub-tests is described by Foshay (pp. 16 to 18). At this point we shall merely supplement that description by a brief table (Table 1) showing for each test and sub-test (1) the number of items, (2) a general estimate of the mean, obtained by averaging the means for 11¹ national groups, (3) a general estimate of the standard deviation, obtained as an average of the 11 standard deviations within countries, and (4) a general estimate of reliability, obtained by Kuder-Richardson Formula No. 20 from the average standard deviation and the average item difficulty over 11 national groups.

As might be expected, the reliabilities of many of the sub-tests, consisting of from 4 to 10 items, are quite low. However, the total test reliabilities are fairly satisfactory, the estimated values ranging from a low of .62 for the 21-item science test through .70 for the geography test, .81 for the reading test and the mathematics test, to .89 for the considerably longer non-verbal test. The estimates are rough, since the assumptions underlying the Kuder-Richardson formulas are not completely met. However, the general order of magnitude is indicated. Though the tests would be of no use for the study of single individuals, they appear adequate for comparisons of groups of several hundred, and these are the comparisons with which this study is primarily concerned.

¹ Results from Yugoslavia became available too late for inclusion in these and certain other analyses.

TABLE 1
Statistical Parameters of Tests

                          No. of    Average    Average        K-R No. 20
                          items     Mean       Stand. Dev.    Reliability
Non-verbal Aptitude         75      33.66      12.19          .89*
Mathematics
  - Part 1                   5       3.63       1.14          .51
  - Part 2                   5       4.19       1.03          .58
  - Part 3                   7       3.91       1.60          .51
  - Part 4                   9       3.05       2.02          .73**
  - Total                   26      14.98       4.40          .81
Reading Comprehension       33      21.36       5.27          .81
Geography
  - Part 1                  12       7.01       2.16          .49
  - Part 2                  16       7.82       2.77          .57
  - Part 3                   4       1.59       1.07          .38
  - Total                   32      16.42       4.65          .70
Science
  - Part 1                  16       7.77       2.85          .59
  - Part 2                   5       1.99       1.17          .28
  - Total                   21       9.76       3.39          .62

* By K-R Formula No. 21, but spuriously high because of speed factor.
** Inflated, because items are not independent.

Within and between countries variance

From the raw scores on each test, raw score means and standard deviations were obtained. A crude average of the variances in the 11 separate countries provided an estimate of the average variability of performance of pupils within a single country - the “within-countries” variance. The mean of the means for 11 countries was used as a grand “total mean”. Variance of the 11 national means around this average value provided an estimate of variability from country to country - the “between-countries” variance. A comparison of the two variance estimates - the one for variability within a country and the other for variability from country to country - provides an index of the magnitude of international differences. Table 2 expresses the variance between countries as a percent of typical variance within a country. The results are shown for boys and girls separately, and for the total group of all pupils.
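As a rough illustration of the two computations described above, the following Python sketch implements Kuder-Richardson Formula No. 21 and the comparison of between-countries with within-countries variance. It is a minimal sketch under stated assumptions, not the procedure actually used for the report: the function names are hypothetical, the standard deviation is used as reported without any correction, and the variance of the national means is taken with an n - 1 denominator, a detail the text does not specify. The only figures taken from the report are the Table 1 entries for the non-verbal test; the three-country figures are invented.

```python
from statistics import mean, variance


def kr21(n_items, test_mean, test_sd):
    """Kuder-Richardson Formula No. 21 from item count, mean and SD."""
    k = n_items
    return (k / (k - 1)) * (1 - test_mean * (k - test_mean) / (k * test_sd ** 2))


def between_as_percent_of_within(national_means, national_variances):
    """Variance of national means as a percent of average within-country variance."""
    # "Within-countries" variance: crude average of the national variances.
    within = mean(national_variances)
    # "Between-countries" variance: spread of national means around their mean.
    # Assumption: n - 1 denominator (statistics.variance).
    between = variance(national_means)
    return 100 * between / within


# Non-verbal test as listed in Table 1: 75 items, mean 33.66, SD 12.19.
print(round(kr21(75, 33.66, 12.19), 2))  # 0.89

# Invented figures for three countries (not data from the study).
print(between_as_percent_of_within([10.0, 12.0, 14.0], [16.0, 16.0, 16.0]))  # 25.0
```

With the Table 1 figures for the non-verbal test, Formula No. 21 gives about .89, which agrees with the starred entry in that table. The same formula does not reproduce the other entries, which are stated to come from Formula No. 20.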
TABLE 2
Variance Between Countries Expressed as a Percent of Average Within-Country Variance

                          Boys and Girls    Boys Only    Girls Only
Non-verbal Aptitude           12.1 %          11.5 %       11.7 %
Mathematics
  - Part 1                     9.9            13.9          7.5
  - Part 2                     9.4            11.6          8.8
  - Part 3                    11.8            14.0         11.7
  - Part 4                    14.3            18.4         12.2
  - Total                     16.2            21.2         13.4
Reading Comprehension          6.2             8.1          5.6
Geography
  - Part 1                    35.8            39.3         37.4
  - Part 2                     7.1             7.9          7.8
  - Part 3                     5.6             4.4          7.1
  - Total                     15.4            17.7         15.9
Science
  - Part 1                     6.1             7.3          8.0
  - Part 2                     2.3             1.5          3.3
  - Total                      5.2             5.9          7.1

It is clear that the variation between national means is small in relation to the variability of scores within any one country. National differences represent a minor rather than a major component in these results, and the probability is that they are over-estimated rather than underestimated, because the countries that did relatively well on the tests were in several instances those that were known to have tested an up-graded sample of their populations. We suspect that with truly representative national samples, the differences would have been reduced. Of course, the participants in this survey were all countries with a basically European culture, and with well-developed educational systems. A greater heterogeneity in national cultures and educational levels would very probably increase the national differences, perhaps substantially.

A comparison of the different tests with respect to magnitude of international differences brings out some rather dramatic and surprising results. With these tests and samples, the tests that show the smallest variations from country to country are the tests of science and of reading comprehension. The presumably relatively culture-free non-verbal aptitude test shows about twice as much country-to-country variation as the reading and science tests, and the geography and mathematics tests about two-and-a-half times as much.
It must be remembered that all the tests had been translated into eight different national languages - English, Finnish, French, German, Hebrew, Polish, Serbo-Croatian, and Swedish. The fact that a reading test, which would appear to be especially susceptible to changes in difficulty with translation, remained so uniform is rather unexpected. The findings suggest that the nearest thing we have to a culture-fair test may be a carefully translated reading test, and that level of reading ability is the feature with respect to which different educational programs are most nearly uniform.

Several of the tests had sub-tests and a comparison of the variability between nations on these is of some interest. In the case of mathematics it is the verbal problems (Part 3) and the inductive series (Part 4) that showed the largest international differences. Geographical information (Part 1) showed by a large margin the widest international variation of any test, whereas in map reading (Part 2) the differences were much smaller and in drawing generalizations (Part 3) smaller still. Scientific judgment (Part 2) showed very little variation between countries, and scientific information somewhat more.

In some measure, the above results are an outcome of differences in reliability of the sub-tests. If the within-group variance is inflated by measurement errors, the between-group variance will necessarily look small in comparison. However, this accounts for only part of the results, and the major differences appear to arise from more genuine factors.

Generally speaking, the boys varied more from country to country than did the girls. However, the sex differences in this respect were neither large nor entirely consistent.

National profiles

Though variations in the sampling procedure from country to country make comparisons of level of achievement of questionable value, comparisons of patterns of achievement from country to country seem sound and of a good deal of interest.
By pattern of achievement we mean a country’s achievement on the specific tests and sub-tests, relative to its own over-all level of achievement. Patterns of achievement were arrived at through the following steps:
(1) For each test a crude average of the national means was computed, and also a crude average of the national standard deviations.
(2) On any one test, such as the test of reading comprehension, each country’s average score was converted into a “standard score”, by subtracting from it the average score for all countries and dividing the result by the average standard deviation for that test.
(3) The “average standard score” for the five tests (i.e., the total scores) was computed for each country.
(4) This “average standard score” was subtracted from the standard score on each test and sub-test. That is, each specific standard score was expressed as a deviation from the country’s “average standard score”.
In this way, each national group was reduced to a common and comparable base line. It is then possible to examine and compare directly the peaks and hollows of achievement in the different countries.

TABLE 3
National Patterns of Achievement
Expressed as standard score deviations from national average on all 5 tests

Belgium Non-verbal Aptitude Mathematics -Part1 -Part2 - Part 3 -Part4 -Total -16 -36 9 -18 -6 -17 -31 -16 -16 -19 21 -40 -7 c46 -38 28 32 23 -18 6 -24 -11 - ia 3 -18 Science -Part1 -Part2 -Total -16 5 - 14 7 4 28 43 30 -8 11 19 -58 -12 27 -33 ta 11 -7 4 3 23 7 15 25 16 -15 26 47 -a -29 20 13 23 16 16 4 9 - 29 13 -21 24 -8 24 -14 -24 -14 Aptitude -Part1 -Part2 -Part3 - Part 4 -Total -18 25 -9 2 43 30 Scotland Sweden Swik. -a 6 12 11 33 19 -16 3 0 0 10 -4 16 -25 -26 -27 -58 -23 -43 23 -45 12 20 92 -31 -65 16 -33 -12 33 -16 -20 12 ia -5 -9 9 0 24 7 12 20 12 -43 -39 -43 -44 -3 23 -1 -24 -Part1 - Part 2 -Total Yugosl. 25 Geography -Part1 -Part2 -Part3 -Total U.S.A.
- 28 -43 - 29 - 19 -39 Reading Comprehension Science 3 Israel Combined Poland Non-verbal Germany -51 -42 - 20 -9 -40 -Part1 - Part 2 -Part3 -Total Mathematics France 12 Geography and Girls Finland 25 40 34 42 44 Reading Comprehension c = Boys England -1 4 4 -3 7 -9 6 30 -54 15 35 -16 27 10 16 5 28 24 27 28 24 21

The complete set of national profiles is presented in Table 3. These show results for boys and girls combined. All entries in the table are expressed in hundredths of a standard deviation. That is, the entry 12 for Belgium on the non-verbal test means that Belgium’s standard score on that test was twelve hundredths of a standard deviation higher than Belgium’s average standard score on all five of the tests.

Thus, if we look at the results for Belgium, we see that the pupils in Belgium were most outstanding, relative to their over-all level of performance, in mathematics. Here they show a peak of almost half a standard deviation. They are slightly above their own over-all average on the non-verbal aptitude test, and they do relatively least well on the test of reading comprehension. The sub-test scores show only minor deviations from the total scores. England, by contrast, performs especially well on the non-verbal aptitude test and is especially weak in mathematics and geography. The geography sub-test dealing with geographical information is notably lower than the map-reading or inference tests.

A similar analysis could be made of the pattern for each country, pointing out points of relative strength or weakness in each. Or the results can be examined from the point of view of each test in turn. This has been done in Figure 1, in which the strength or weakness of each country (relative to its own over-all mean) has been plotted on a common scale. We see that on the non-verbal test the country that does especially well is England.
Since the test was English in origin, this result may possibly reflect some degree of previous familiarity with the test, and acceptance of the task as a reasonable and sensible one. Scotland also does well on the test, while Germany and Finland perform poorly on it. On the mathematics test, all the French-speaking countries are superior performers, with Belgium leading the way. Poland also shows up to advantage. The English-speaking countries are consistently poor. One wonders what part of this is contributed by their complex system of denominate numbers. Yugoslavia also has marked difficulty with this test. National differences in reading comprehension are relatively small. It is on this test that Yugoslavia shows up to best advantage, followed by Scotland and Finland, while Belgium and Poland do relatively poorly. On the geography test we find Germany, Israel and Poland leading the way, and their superiority is especially marked in that section of the test dealing with geographic information. The English-speaking countries do notably poorly on the geography test as a whole, and especially on the sub-test dealing with geographical facts and information. This is an area in which the different national curricula appear to have produced distinctly different results. Science is an area in which the French-speaking countries are relatively weak. Here the leaders in relative achievement are the United States and Germany, with Yugoslavia and England following in that order. Some countries show rather marked peaks and hollows in their profiles. Thus, England is very high on non-verbal aptitude and very low in mathematics and geography. Belgium is high on mathematics and quite low in reading. Others show a notably even pattern of performance. The best example is Sweden, which performs at almost the same level of excellence on all the tests. The patterns of relative strength and weakness provide a picture of achievement under the different educational systems. 
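The four steps by which the profiles in Table 3 were derived can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the computation actually carried out for the report: the function name and the input layout are hypothetical, only total test scores are handled (sub-tests are omitted), and the figures for countries "A" and "B" are invented.

```python
from statistics import mean


def national_profiles(means, sds):
    """Return each country's profile in hundredths of a standard deviation.

    means[country][test] and sds[country][test] hold the national mean and
    standard deviation on each total test (hypothetical input layout).
    """
    countries = list(means)
    tests = list(means[countries[0]])
    # Step 1: crude averages of the national means and standard deviations.
    avg_mean = {t: mean(means[c][t] for c in countries) for t in tests}
    avg_sd = {t: mean(sds[c][t] for c in countries) for t in tests}
    profiles = {}
    for c in countries:
        # Step 2: standard score of the national mean on each test.
        z = {t: (means[c][t] - avg_mean[t]) / avg_sd[t] for t in tests}
        # Step 3: the country's average standard score over the tests.
        level = mean(z.values())
        # Step 4: deviation of each standard score from that average,
        # expressed in hundredths of a standard deviation.
        profiles[c] = {t: round(100 * (z[t] - level)) for t in tests}
    return profiles


# Invented figures for two countries and two tests (not data from the study).
means = {"A": {"reading": 24.0, "maths": 16.0},
         "B": {"reading": 18.0, "maths": 14.0}}
sds = {"A": {"reading": 5.0, "maths": 5.0},
       "B": {"reading": 5.0, "maths": 5.0}}
print(national_profiles(means, sds))
```

In this invented example country A scores higher than country B on both tests, yet its profile shows only that it is relatively stronger in reading than in mathematics, since step 4 removes the over-all level. The rows of the result are the national patterns of relative strength and weakness.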
They provide no explanation of how the differences come into being. This must be contributed by the investigator who is intimately acquainted with the educational systems in the several countries. However, the data presented here must still be considered quite tentative. They are limited by (1) the local and only partially representative character of many of the national samples, (2) the brevity of the tests and especially the sub-tests, and (3) the limited opportunity to plan test content so as to assure the most balanced and appropriate representation of content and objectives.

The results reported so far are for boys and girls combined. It is of some interest to look at the results for the sexes taken separately, and this is done for the five total tests in Table 4. Scores for boys and girls are each expressed as deviations from the average of all five tests for that sex in that country. That is, the score of 11 for Belgian boys on the non-verbal test means that the Belgian boys were eleven-hundredths of a standard deviation higher on that test than they were on the average of the five tests.

FIGURE 1
Relative Achievement of National Groups on Tests (Non-verbal Aptitude, Mathematics, Reading Comp., Geography, Science)

TABLE 4
Sex Differences in Pattern and Level of Achievement

               Average   Non-Verbal    Mathematics   Reading Comp.   Geography     Science
               B-G         B     G       B     G       B     G        B     G       B     G
Belgium          32       11    13      42    46     -35   -12      -16   -21      -1   -25
England          22       42    50     -42   -37      -2    25      -30   -36      31     0
Finland          24      -47   -28      -2    13      11    27        7     0      31   -14
France           15       -7   -22      34    27     -27     6        5    23      -5   -32
Germany          32      -39   -33     -18   -14      -4    10       29    23      32    16
Israel           18        3    17     -10    -4     -17    -3       20    21       5   -32
Poland           37      -22   -17      22    34     -34   -18       15    16      20   -15
Scotland         14       16    33     -44   -34      14    33      -11   -23      22   -11
Sweden            6      -21     9      -4     6      -8     7        1   -11      33    -9
Switzerland      24        3    21      17    22      -8    16        6     9     -16   -68
U.S.A.          -11       12    11     -35   -18      -8    20      -14   -18      47     7
Yugoslavia       17      -12    -6     -56   -34      25    35        8     1      37     4
Average          19       -5     4      -8     1      -8    12        2    -1      20   -15

The first column of Table 4 shows a different kind of a finding. In this column, the average standard score on all five tests for boys and for girls is compared, country by country. Thus, in Belgium on the average of all five tests the boys fell 0.32 standard deviation units above the girls*. An examination of this column, headed “Average B-G”, shows the extent of male superiority in a pooled average performance, country by country. Thus, there is only one country in which the girls surpassed the boys in total performance - the United States. Differences between the two sexes were small in Sweden and Scotland. Largest differences were in Poland, Germany and Belgium. These results provide some clue as to the comparability of educational opportunity and motivation for the two sexes in different countries. On average, over all countries and tests, the boys fall about a fifth of a standard deviation above the girls.

An examination of results for the different tests shows that girls perform best, relatively speaking, on the reading test, and least well on the test of science. This pattern is a universal one, appearing in each one of the 12 countries. We appear to have here a universal and quite stable sex characteristic. There is also a small, but rather consistent tendency for the girls to do relatively better on the non-verbal test (10 countries) and the mathematics test (11 countries). Differences in the geography test were small and inconsistent.
All of these differences in relative performance on specific tests appear, of course, after adjusting for the 0.19 standard deviation difference in average performance of boys and girls.

* The basic unit is the average standard deviation of boys and girls combined, averaged for 11 countries.

Achievement in relation to parental education or occupation

An attempt was made to get information on parental education or father’s occupation for the children in each country. However, there was very real difficulty in getting comparable data for different countries. Pressures and sensitivities differ from country to country, so that in some it is possible to get information about education and in others it is possible to get information about occupation, but it is rarely possible to get both. Furthermore, the differences in educational structure in different countries make it difficult to establish classifications that will be comparable from country to country. However, the operation was carried out as well as could be done, and some comparisons are presented in Tables 5-10.

TABLE 5
Percent of Pupils with Fathers at Different Levels of Education or Occupation

B Level of father’s educ. Belgium Elementary only Some secondary Secondary completed Some college College England 45 32 14 * t completed 0 College 33 44 15 * * completed B Germany Father’s occupation Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional Prof. & managerial * 77 11 7 24 55 19 37 27 14 . 74 ,* 9 78 14 l l l * + l . l Too few to justify computation I R 21 60 14 * 22 34 30 11 l l * 0 10 43 27 9 10 Y Sweden L t 76 12 7 Israel 25 37 9 9 16 S Poland G Elementary only Some secondary Secondary completed Some college Y France 50 26 15 5 + Yugoslavia 5 58 21 9 4 9 S 4 l 78 55 23 11 c 78 l 10 16 * + G S Scotland U.S.A. SWit2. Germany 22 30 18 * 23 26 42 10 5 15 I Israel R 8 L S Scotland SVMZ. 8 45 27 49 30 14 21 35 17 10 l l 9 l 17 of means.
Table 5 shows the percent of cases with fathers at different levels of education or occupation. It is clear from this table that the different national samples were not comparable with respect to distribution of education or occupation of fathers. What is not clear is the extent to which these sample differences reflect similar differences in the total national population and the extent to which they reflect biases in the specific sample tested in that country. Thus the Scottish sample showed twice as many unskilled and semi-skilled workers as the German sample, two-and-a-half times as many as the Swiss, and five times as many as the sample from Israel. How shall we understand this? Examination of the sampling procedure brings out that the Israeli sample was limited to that segment of the population who were of European origin, that the Swiss sample was limited to Geneva, while several fee-paying schools were excluded from the Scottish sample. Thus, in part at least the differences between nationalities appear to reflect differences in sampling.

However, it is also probably true that differences reflect in part actual national differences, especially in the amount of schooling. Thus, the differences in educational level of parents in Sweden and in Yugoslavia are certainly at least in part a reflection of the past educational level prevailing in the two countries. And to come from a family in which the father has completed secondary education certainly signifies a less outstanding experience in the United States than in most European countries, where such a level of education is still the exception rather than a typical event. However, in education also the figures suggest that the sample may be non-representative of the total population in some countries. Thus, a Polish sample in which 40 % of fathers have completed secondary education hardly seems representative of the total Polish population in the age range 40-50.
Comparisons of national sub-groups in which the level of parental education or occupation is uniform from country to country are almost certainly more meaningful than comparisons of total national groups, though they may have some limitations, as pointed out in the preceding paragraphs. Table 6 shows comparative data for the non-verbal aptitude test. Average score is quite clearly related to average level of fathers’ education, and to a somewhat lesser degree to fathers’ occupation. Some fairly substantial differences between countries remain when comparisons are restricted to those of a common level of parental education or occupation. However, it should be noted that the countries that show up least favorably in such a comparison, as of those with some high school education, are countries like Sweden and the United States in which almost all parents had received at least that much education, i.e., where education had been most nearly universal and non-selective.

TABLE 6
Non-Verbal Test Averages by Level of Father’s Education or Occupation

B Level of father’s educ. Elementary only Some secondary Secondary completed Some college College completed France Poland -6 23 50 - 33 96 115 - 24 -6 - 0 23 23 40 -34 -14 9 - Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional Prof. & Managerial 0 I -60 -33 18 69 49 B Father’s occupation Y England G Elementary only Some secondary Secondary completed Some college College completed 0 Belgium Sweden -93 -59 -12 23 - 26 24 - S - 26 16 - 1 - -65 7 L Yugoslavia U.S.A. -52 - -25 -19 3 Y R S 12 S -76 - 14 24 - G Germany Israel Scotland SWltZ. Germany - 1 -8 15 24 59 - 3 36 48 74 79 -12 28 59 58 - 29 41 77 53 - 28 -34 -18 -13 39 -92 -70 -63 -25 I Israel 4 33 55 50 86 R L S Scotland 0 41 37 - SWlb. 24 60 39 53 29

TABLE 7
Mathematics Test Averages by Level of Father’s Education or Occupation

B Level of father’s educ.
Belgium England 23 56 88 - -58 38 43 - Elementary only Some secondary Secondary completed Some college College completed 0 France 11 52 - 5 10 55 - College - completed -64 -8 -48 occupation 0 Y Sweden I R L U.S.A. -37 IO - -107 -124 22 33 51 - -31 13 - 56 - S Israel Scotland Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional 15 11 26 61 -7 25 34 62 -74 -31 -7 15 Prof. & Managerial 87 71 - Switz. -148 -86 -43 -9 -70 -30 -143 - 79 - 59 - -43 8 - G Germany Yugoslavia S -12 19 32 - B Father’s S 46 70 69 71 G Elementary only Some secondary Secondary completed Some college Y Poland I 6 R L S Germany Israel Scotland 77 52 72 - -7 -14 30 3 -22 12 37 49 -71 -28 -23 - 44 61 38 - 88 42 55 - 36 Switz. TABLE 8 Reading Test Averages by Level of Father’s Education or Occupation B Level of father’s educ. Elementary only Some secondary Secondary completed Some college College completed Belgium England -47 -25 -12 - France -53 -25 - - -20 53 91 - College completed -3 39 20 -61 -39 -11 - S Sweden -13 12 17 18 I R L occupation Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional Prof. & Managerial 30 Germany 26 22 57 90 96 Israel -11 21 28 55 67 -65 4 38 -45 22 - 40 S - - - 12 - 4 BOYS Father’s Yugoslavia -96 - -37 -19 2 7 U.S.A. -38 -4 - -30 24 -48 - - Y Poland G Elementary only Some secondary Secondary completed Some college 0 -80 1 Switz. Germany -13 21 58 78 - 42 28 38 63 19 7 32 47 86 -8 51 6 - G Scotland -66 -16 I Israel -20 19 35 56 62 R 57 L Scotland -2 48 41 - S Swttz. 7 49 53 51 TABLE 9 Geography Test Averages by Level of Father’s Education or Occupation B Level of father’s educ. Elementary only Some secondary Secondary completed Some college College completed Belgium England -33 -2 18 - -44 31 52 - College completed -60 -58 - 21 - -68 -17 -23 - Father’s occupation Germany Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional Prof. 
& Managerial 0 -25 - - -14 - 50 15 56 71 97 95 - 2 12 42 - -51 -1 -113 31 - -40 -3 39 74 TABLE 32 -92 -53 -40 - -46 7 - Germany 52 50 69 53 - S G Switz. 5 0 - L -72 -34 -47 - R Yugoslavia -132 - S Scotland Israel 61 59 83 88 139 I U.S.A. Sweden 45 63 70 Y S 14 8 - -22 18 25 - B Y Poland G Elementary only Some secondary Secondary completed Some college 0 France I Israel 23 25 64 35 94 -18 R L S Scotland 2 41 54 75 76 Switz. -53 -22 -18 - 8 46 32 46 - 10 Science Test Averages by Level of Father’s Education or Occupation B Level of father’s educ. Elementary only Some secondary Secondary completed Some college College completed Belgium England -19 18 12 - 16 95 99 - - -70 -57 -14 - -35 -36 16 15 B Father’s occupation Unskilled, semi-skilled Skilled, farmer Clerical, sales Sub-professional Prof. & Managerial Germany 0 Israel S - - Yugoslavia 4 - 6 50 69 - 70 - - -41 - -71 I R L -38 -34 -16 8 15 9 - 70 -73 -80 -56 -15 - 19 5 - 22 - G Switz. -44 -7 23 - S S Scotland U.S.A. Sweden 56 66 60 - -66 -38 -44 Y Y Poland G Elementary only Some secondary Secondary completed Some college College completed 0 France I R L S Germany Israel Scotland Switz. -42 -5 -42 87 61 96 90 1 45 57 73 5 32 58 50 40 17 42 - 25 30 26 25 102 87 - 40 54 1 32 -6 - -62 -25 -35 - 9 - -41 2 31 Tables 7-10 show results for the mathematics, reading, geography and science tests respectively. The relative performance of different countries, and of boys and girls, differs on the different tests, reflecting the national and sex differences in patterns of achievement previously discussed in connection with Tables 3 and 4. However, the patterns of achievement in relation to education or occupation of father are much the same from test to test. 
A crude pooling of results from different tests and countries yields the following results:

                              Boys   Girls
Father’s education
  Elementary only              -36    -58
  Some secondary                -2    -27
  Secondary completed           36      1
  Some college                  36     12
  College completed             41     13
Father’s occupation
  Unskilled, semi-skilled       14    -10
  Skilled, farmer               28     18
  Clerical, sales               51     25
  Sub-professional              66     36
  Professional                  79     50

There is a difference of roughly three-fourths of a standard deviation in average score between the lowest educational category and the highest, and a range of about two-thirds of a standard deviation between the extreme occupational groupings. Though the gradient seems to be a little less steep for girls than for boys, the general pattern is quite similar for the two sexes. The gradient is common to most of the countries studied, but there are one or two exceptions. Poland shows relatively little difference associated with level of father’s education, Switzerland relatively little associated with father’s occupation. One can only speculate on whether this represents some peculiarity in the specific sample or whether school achievement in these countries is less related to parental education or occupation.

Additional tabulations were carried out by size of community in which the pupil resided. Communities were classified into those of under 2,000, those of 2,000 to 20,000, and those of over 20,000 inhabitants. However, there were certain ambiguities in this coding from country to country. It was not entirely clear whether place of residence meant the community in which the school was located, or the immediate community in which the pupil had his home. Thus, in some countries, farm children living in quite rural areas were apparently coded as coming from communities of two to twenty thousand because that is where they attended school. There was also no systematic attempt to have the rural areas in the sample be representative of all rural areas in the country and the urban areas be representative of all urban areas.
Thus, in the United States the primarily rural area was a rather prosperous mid-western farming area, whereas the two urban communities were primarily rather undistinguished industrial centers.

The average result over all tests is summarized in Table 11. In general, though with some exceptions, those in very small communities did slightly less well (by about one-fifth of a standard deviation, on the average) than those in larger communities, but no differences were found between the two categories of larger communities. A comparison of the different tests, as averaged over all countries, suggests that the differences associated with size of community are greatest in the case of mathematics and reading, least for the non-verbal test. Results for the separate tests are shown in Table 12.

TABLE 11
Average Score by Size of Community for Pooled Results on All Tests

B Under 2C0J 0 Y 2000 - 20 oco S G Over 20 000 England -15 Finland France Germany Poland Scotland Sweden U.S.A. Yugoslavia 31 21 -4 -44 - 34 - 94 -22 56 42 43 2 - 20 -37 - 4 -11 58 58 -8 -13 - 33 - 30 Crude Average - 20 8 4 Under 2000 I 2000 R L - 20 000 S Over 20 000 - -32 - 20 4 2 - 23 - 33 -48 -6 -99 -51 18 11 6 2 -18 -15 -50 -11 - -34 -13 -14 - 24 22 26 -9 - 29 - 26 -58

No data on size of community available for Belgium. Different system of categories used for Israel.

TABLE 12
Average Score by Size of Community for Each Test
Based on Pooled Data from All Countries

                         BOYS                              GIRLS
               Under   2000 -    Over          Under   2000 -    Over
               2000    20 000   20 000         2000    20 000   20 000
Non-Verbal      -17      -2      -17            -26     -16      -20
Mathematics     -44       0       -8            -54     -20      -13
Reading         -27      -3        4            -31       3        0
Geography       -22       5       -3            -28     -11      -14
Science          15      42       27            -31     -25      -15

Item difficulties: resemblance between countries

In addition to analyzing scores on tests and sub-tests, it was also possible to study responses to specific items country by country.
The original tabulations showed the frequency with which each wrong option to an item was selected as well as the frequency of correct response. However, most of the analyses to be reported here deal only with the correct responses. These are studied from two points of view. First, correlations are presented showing the degree of consistency of item difficulty from country to country. Secondly, certain special groups of items are examined to throw more light on certain elements of content that are especially easy or difficult in different countries.

Tables 13-16 show the correlations of item difficulty among eleven countries (excluding Yugoslavia). The correlations are over the population of items. A high correlation signifies that the same items are difficult and the same ones easy for the pair of countries in question.

The first thing that impresses one as one scans Tables 13-16 is the generally substantial correlations across countries. The average correlation is .87 for mathematics, .87 for reading, .68 for geography, and .72 for science. The high correlations for mathematics and reading are especially impressive. A difficult item is a difficult item in these two tests, regardless of the school system in which the pupil has been educated or the language in which his schooling has been couched. The reading test is particularly noteworthy, because the differences between countries are small both in level of average score and in the relative difficulty of different items.

TABLE 13
Mathematics Test
Correlations* of Item Difficulties Between Countries

                    1   2   3   4   5   6   7   8   9  10  11
 1. Belgium         -  86  92  95  92  90  84  84  90  97  80
 2. England        86   -  89  78  93  88  69  98  92  89  90
 3. Finland        92  89   -  90  95  96  76  89  90  95  92
 4. France         95  78  90   -  85  86  75  76  84  91  75
 5. Germany        92  93  95  85   -  93  78  94  95  95  91
 6. Israel         90  88  96  86  93   -  75  87  86  93  89
 7. Poland         84  69  76  75  78  75   -  71  76  83  60
 8. Scotland       84  98  89  76  94  87  71   -  93  88  92
 9. Sweden         90  92  90  84  95  86  76  93   -  92  91
10. Switzerland    97  89  95  91  95  93  83  88  92   -  86
11. U.S.A.         80  90  92  75  91  89  60  92  91  86   -

* Average of correlation for boys and for girls.

TABLE 14
Reading Test
Correlations* of Item Difficulties Between Countries

                    1   2   3   4   5   6   7   8   9  10  11
 1. Belgium         -  88  89  98  83  87  87  85  92  96  89
 2. England        88   -  88  86  86  91  82  98  92  85  96
 3. Finland        89  88   -  84  88  84  86  85  91  85  89
 4. France         98  86  84   -  80  84  85  83  91  94  86
 5. Germany        83  86  88  80   -  84  81  87  89  81  88
 6. Israel         87  91  84  84  84   -  88  90  85  83  90
 7. Poland         87  82  86  85  81  88   -  82  82  82  86
 8. Scotland       85  98  85  83  87  90  82   -  88  84  94
 9. Sweden         92  92  91  91  89  85  82  88   -  89  91
10. Switzerland    96  85  85  94  81  83  82  84  89   -  85
11. U.S.A.         89  96  89  86  88  90  86  94  91  85   -

* Average of correlation for boys and for girls.

TABLE 15
Geography Test
Correlations* of Item Difficulties Between Countries

                    1   2   3   4   5   6   7   8   9  10  11
 1. Belgium         -  63  69  93  67  62  56  70  67  90  68
 2. England        63   -  67  61  74  54  26  95  82  60  89
 3. Finland        69  67   -  67  84  77  55  74  82  77  72
 4. France         93  61  67   -  64  55  54  67  65  86  70
 5. Germany        67  74  84  64   -  84  54  77  80  76  74
 6. Israel         62  54  77  55  84   -  66  64  71  62  58
 7. Poland         56  26  55  54  54  66   -  35  40  49  31
 8. Scotland       70  95  74  67  77  64  35   -  85  65  88
 9. Sweden         67  82  82  65  80  71  40  85   -  70  84
10. Switzerland    90  60  77  86  76  62  49  65  70   -  66
11. U.S.A.         68  89  72  70  74  58  31  88  84  66   -

* Average of correlation for boys and for girls.

TABLE 16
Science Test
Correlations* of Item Difficulties Between Countries

                    1   2   3   4   5   6   7   8   9  10  11
 1. Belgium         -  73  73  94  89  84  46  68  70  90  72
 2. England        73   -  70  68  78  84  37  95  72  75  83
 3. Finland        73  70   -  70  78  81  64  65  79  77  75
 4. France         94  68  70   -  84  79  39  63  67  84  63
 5. Germany        89  78  78  84   -  88  48  72  78  88  80
 6. Israel         84  84  81  79  88   -  51  82  88  79  84
 7. Poland         46  37  64  39  48  51   -  37  57  47  53
 8. Scotland       68  95  65  63  72  82  37   -  71  68  80
 9. Sweden         70  72  79  67  78  88  57  71   -  64  79
10. Switzerland    90  75  77  84  88  79  47  68  64   -  78
11. U.S.A.         72  83  75  63  80  84  53  80  79  78   -

* Average of correlation for boys and for girls.

The correlations in Tables 13-16 appear to show a certain amount of clustering. In order to bring out the structure more clearly, a factor analysis of the correlation tables was carried out. The rotated factor loadings (by varimax rotation) are shown in Tables 17-20. Countries have been rearranged in the tables to bring out the clusters most clearly.

TABLE 17
Mathematics Test
Loadings of Rotated Factors

                         Factor
                   1      2      3      4      5
Belgium           96    -18     17     07    -11
France            90    -09     27     00    -14
Switzerland       98    -10     12     04    -01
Poland            80    -38     06     02     10
Finland           97     08     12    -11     06
Israel            95     05     08    -21     02
Germany           98     00    -08    -02     07
Sweden            96     03    -09     17     06
England           94     08    -26    -02    -12
Scotland          94     09    -29     04    -03
U.S.A.            91     37    -05     06     11
% of variance   87.8    3.2    2.8    0.9    0.7

TABLE 18
Reading Test
Loadings of Rotated Factors

                         Factor
                   1      2      3      4      5
Belgium           96    -25    -01     05     04
France            94    -31    -01    -02     08
Switzerland       93    -26     05    -03    -10
Poland            90     00    -26     10     05
Israel            93     09    -19    -10    -02
Germany           90     17     01     12    -12
Finland           93     07     00     23     05
Sweden            95    -01     16     11     04
U.S.A.            96     13     05    -06     09
England           96     16     12    -14     06
Scotland          95     18     05    -22    -07
% of variance   87.9    3.1    1.4    1.6    0.5

TABLE 19
Geography Test
Loadings of Rotated Factors

                         Factor
                   1      2      3      4      5
Belgium           81    -11    -53    -04    -05
France            79    -06    -53     02     09
Switzerland       83    -05    -38     27    -13
Poland            50    -60    -17    -03     04
Israel            78    -48     16     05    -06
Germany           89    -20     14     12    -18
Finland           88    -19     08     22     06
Scotland          92     16     10    -28    -02
England           89     29     13    -25    -03
U.S.A.            89     25     04    -08     15
% of variance   62.0    7.8    7.4    2.6    0.8

TABLE 20
Science Test
Loadings of Rotated Factors

                         Factor
                   1      2      3      4      5
Belgium           92     25    -13    -14    -14
France            87     29    -17    -09    -18
Switzerland       91     11     00    -32     06
Germany           94     13    -06    -03     11
Israel            95    -02    -02     21     04
Poland            57     07     48     05    -04
Finland           86     06     33     00     06
Sweden            86    -01     16     37     04
U.S.A.            87    -21     12     05     12
England           89    -40    -12    -06     01
Scotland          85    -44    -09     04    -09
% of variance   75.3    5.3    4.1    2.9    0.9

The most striking feature of Tables 17-20 is the large proportion of variance accounted for by the first factor. This can be thought of as a general factor of difficulty determined by the content of the item and independent of country. Loadings on later factors, corresponding to sub-groups of countries, are quite small and account for only a minor fraction of the variance.

The later factors seem to be bi-polar factors in most instances, discriminating one sub-group of countries from another. In mathematics, Factor 2 involves primarily the difference between Poland and the U.S.A., while Factor 3 involves the difference between the French-speaking and the English-speaking countries. The other factors are of no consequence. The only factor beyond the first that seems to amount to anything in the reading test is Factor 2, which again discriminates French-speaking from English-speaking countries. Factors 2, 3 and possibly 4 appear to amount to something in the case of the geography test. Factor 2 contrasts a cluster composed of Israel, Poland, Germany and Finland with the English-speaking countries. Factor 3 groups together the French-speaking countries. Factor 4 contrasts England and Scotland with Switzerland and Finland. On the science test, Factor 2 seems once again to be the French-speaking versus English-speaking factor. Factor 3 links together Poland and Finland, while Factor 4 separates Switzerland from Sweden.
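The dominance of a single general factor can be checked roughly from the correlations themselves. The sketch below is not the varimax-rotated factor analysis reported in Tables 17-20; as a simpler stand-in it assumes the first principal component of the mathematics correlation matrix (Table 13), found by power iteration, and reports the share of total variance carried by the largest eigenvalue. The function names are illustrative, not taken from the report.

```python
# Mathematics item-difficulty correlations (Table 13), in hundredths.
# Upper triangle only; row i holds the correlations of country i with
# every later country in the list below.
COUNTRIES = ["Belgium", "England", "Finland", "France", "Germany", "Israel",
             "Poland", "Scotland", "Sweden", "Switzerland", "U.S.A."]
UPPER = [
    [86, 92, 95, 92, 90, 84, 84, 90, 97, 80],
    [89, 78, 93, 88, 69, 98, 92, 89, 90],
    [90, 95, 96, 76, 89, 90, 95, 92],
    [85, 86, 75, 76, 84, 91, 75],
    [93, 78, 94, 95, 95, 91],
    [75, 87, 86, 93, 89],
    [71, 76, 83, 60],
    [93, 88, 92],
    [92, 91],
    [86],
]


def build_matrix(upper):
    """Expand the upper triangle into a full symmetric matrix with unit diagonal."""
    n = len(upper) + 1
    full = [[1.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for k, value in enumerate(row):
            j = i + 1 + k
            full[i][j] = full[j][i] = value / 100.0
    return full


def largest_eigenvalue(matrix, iterations=50):
    """Power iteration: return the dominant eigenvalue and its unit eigenvector."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in prod) ** 0.5
        vec = [x / norm for x in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * p for v, p in zip(vec, prod)), vec


R = build_matrix(UPPER)
lam, vec = largest_eigenvalue(R)
share = lam / len(R)  # proportion of total variance on the first component
print(f"first component: {100 * share:.1f}% of variance")
for name, weight in zip(COUNTRIES, vec):
    # Unrotated loading = eigenvector weight times sqrt(eigenvalue).
    print(f"{name:12s} {weight * lam ** 0.5:.2f}")
```

Because the rotation redistributes variance among factors, the unrotated share need not equal the 87.8 per cent shown for the rotated first factor, but it should be of the same order.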
The language groupings represented in these factors after the first make sense, at least in that the test remains completely uniform for all countries using the same language. It is also quite possible that educational patterns are more alike within language groups. A knowledgeable person may see some rationale underlying the other groupings and polarities that is not apparent to the author.

Item difficulty: special groups of items

Examination of item content suggests certain hypotheses concerning groups of items that can be tested by examining item difficulty in different countries. In this section, several of these hypotheses will be stated and examined. The basic data in terms of which these hypotheses are examined are item difficulty deviation values. The procedures for deriving these values are stated below.

1. The percent right on each item in each country was first transformed into a normal deviate, using tables of the normal curve.

2. The average scaled value was obtained for each item over all countries, and the scaled value in any one country was expressed as a deviation from this. This procedure brought all items to a common base line, so that the deviation value for a given country had comparable meaning from item to item.

3. The average deviation value over all items on a single test was computed for each country, and this average deviation value was subtracted from the single item deviation values. This procedure eliminated differences in average level of performance between countries, and left a residual deviation value, expressed in standard-score units, that indicated how much harder or easier that item was for that country than would be expected on the basis of average difficulty of the item and average performance level in the country.

Hypothesis 1

Decimal fractions will be relatively easy in the French-speaking countries, and common fractions in the English-speaking countries.

The mathematics test includes four items dealing directly with the manipulation or understanding of decimal fractions and three dealing with common fractions. The residual difficulty indices for these were averaged over the items and for boys and girls, and are shown below for eleven countries. We show the average residual for decimal fraction items, the average residual for common fraction items, and the difference between them.

                Decimals   Fractions   Dec.-Fract.
Belgium             18        -10           28
England            -46         10          -56
Finland             13        -10           23
France              30        -71          101
Germany             01         14          -13
Israel             -11         13          -24
Poland              08        -07           15
Scotland           -42         19          -61
Sweden              18        -03           21
Switzerland         10         03           07
U.S.A.              01         44          -43

From the tabulation, we see that for Belgian children the difference in residuals favors the decimal items to the extent of about 28/100 of a standard deviation on our normal deviate scale, on the average. By contrast, for English children, the fractions items are relatively easier by 56/100 of a standard deviation. In general, our hypothesis is supported because the large differences favoring fractions are all for the English-speaking countries, and all of the French-speaking countries show a difference favoring decimals. However, the difference between France on the one hand and Belgium and Switzerland on the other is also quite striking. French children of this age appear to be especially weak on items dealing with fractions.

The major differences which we find in these groups of items are explicable in terms of the systems of weights and measures in the countries involved, and a corresponding curricular emphasis. The English, Scots and Americans have so many measures that go by 3’s, 4’s, or 12’s that they must spend instructional time on denominate numbers and the fractions that go with them.
The continental countries, relying almost entirely on a decimal metric system, can concentrate on decimal fractions and give other types of fractions only limited emphasis. The results suggest that this is what has taken place.

Hypothesis 2

National differences will be greater in the relative difficulty of items dealing with the geography of specific places than in geographical information relating to the world as a totality.

It was possible to identify in the geography test five items dealing with facts about specific places and five others dealing with concepts of latitude, longitude, and time. Average residual values were computed for each of these sets, and are shown below.

                Specific     Latitude
                 Places     & Longitude
Belgium           -16            9
England           -31           -5
Finland             3           -4
France            -18            6
Germany            10           -5
Israel             50           -9
Poland             82           54
Scotland          -11          -10
Sweden            -13          -17
Switzerland       -12          -11
U.S.A.            -45           -7

In general, the hypothesis is supported by the data. The largest average residuals tend to relate to specific place geography. Israel and Poland perform notably well on these items, and England and the United States relatively poorly. The items on latitude and longitude show generally smaller residuals, only Poland performing on these items somewhat better than would be expected in the light of her performance on the total test. The results suggest fairly wide national differences in emphasis on the teaching of specific geographic facts; at least they are learned to different degrees.

Hypothesis 3

Reading test items will be easier for pupils of the country and language from which the passage and items originally come than for other countries, and especially those speaking a different language.

The reading test was composed of five passages, two of which had originally been in French (one from Belgium and one from France), two in German, and one in English (from the United States).
It seems plausible, at least, that these might be easier in the language in which they had been written and for the national culture for which they had been designed. Therefore, the residuals were computed separately for each passage for each of the English-speaking, French-speaking and German-speaking countries. (In Belgium and Switzerland testing was limited to French-speaking parts of the country.) Results are shown below.

               Passage 1   Passage 2   Passage 3   Passage 4   Passage 5
               (Belgium)   (France)    (Germany)   (U.S.A.)    (Germany)
Belgium            00          07          05         -07         -06
England           -06         -04          01         -03          13
France             12          13         -07         -10         -09
Germany           -08         -19          04          10          12
Scotland          -11         -13          04          07          11
Switzerland        35         -09          04         -15         -10
U.S.A.            -13         -06          05          11          02

Examination of the table shows that the hypothesis is in general supported. The passages of French language origin seem to be slightly easier for the French-speaking countries, and the passages of either English or German origin for the English- and German-speaking countries. Thus, the precaution of choosing original passages from different language sources appears to have been a sound one. However, even though the differences appear in a fairly consistent pattern, they are generally of very small size. The task difficulty seems to transcend language in considerable measure. Thus, once again, the universality of the reading task is affirmed.

Hypothesis 4

National groups show consistent differences in willingness to express certainty about a conclusion.

The second part of the science test consists of five statements, for each of which the pupils must pick one of five choices: Definitely True, Probably True, Impossible to Determine, Probably False, and Certainly False. The keying of each item was based on the pooled judgment of several faculty members at Teachers College, Columbia University, New York, where the test was constructed.
Our current interest is in the nature of the erroneous answers. Over the set of items, an answer can be wrong in any one of these ways:

(1) An examinee can be too sure. That is, he can mark an item definitely true or false when it is keyed only probably true or false, or he can choose one of the four other alternatives when he should mark the item indeterminate.

(2) An examinee can be too cautious. That is, he can mark an item as probably true or false or as indeterminate when he should have marked it definitely true or false, or he can mark the item indeterminate when he should have marked some other choice.

(3) He can be grossly in error. That is, he can mark an item on the true side when he should have marked it on the false side, or vice versa.

For the set of 5 items, there were 20 wrong response options. Of these, 6 were of Type 1, 6 of Type 2, and 8 of Type 3. We have examined the results for the different countries to see what proportion of the choices fell on each of these types of error each time the opportunity offered. That is, we have divided the total number of errors of a given type by the total number of opportunities to make that category of error. We have also determined the ratio of too sure errors to too cautious errors, providing an index of readiness or reluctance to jump to conclusions. The results are shown by sex for 10 national groups*.

                      % Too     % Too      % Gross    Index of
                      Sure      Cautious   Error      Sureness
Belgium     Boys      21.9      10.0       10.8       2.19
            Girls     23.7      14.2       10.6       1.68
England     Boys      13.8      15.0       11.3       0.92
            Girls     15.8      17.3       11.6       0.92
Finland     Boys      13.9      17.2       14.0       0.81
            Girls     15.6      20.5       14.3       0.76
France      Boys      22.3      11.1        9.5       2.00
            Girls     26.7      12.2        7.3       2.19
Germany     Boys      24.0       9.8        7.1       2.46
            Girls     21.1      12.0        9.7       1.75
Israel      Boys      16.9      14.7        9.8       1.15
            Girls     16.3      16.8        9.8       0.97
Poland      Boys      19.9      19.7       14.3       1.01
            Girls     20.6      22.4       14.1       0.92
Scotland    Boys      14.6      14.2       10.2       1.03
            Girls     14.6      18.4       10.9       0.80
Sweden      Boys      14.5      22.6        6.7       0.64
            Girls     13.4      25.9        7.5       0.52
U.S.A.      Boys      14.0      18.7       12.7       0.75
            Girls     12.7      21.5       12.6       0.59

* Frequencies of choice of the separate response options were not available for Switzerland and Yugoslavia.

Clearly, there are substantial differences between national groups in their tendency to be too assured or too cautious on these items. When Belgian, French or German children made errors, they were much more likely to be errors of over-sureness. When Finnish, Swedish or American pupils made errors, they were likely to be errors of over-caution. We are dealing, of course, with a very limited sample of items, and the differences we find may be peculiar to this item sample rather than a more general characteristic of the national educational system or cultural background. However, the differences are certainly suggestive, and open up a line of inquiry that might be pursued with profit using a larger and more varied sample of tasks.

Concluding statement

The preceding pages have shown some of the kinds of comparisons that are possible when academic achievement is examined at the same time in a number of countries with the same set of tests. These results are of some value in themselves for the international differences and similarities that were found and described. They are certainly also of interest as exhibiting a model of international cooperation and of empirical comparative education that may be further developed in the future.
Fernand Hotyat

INTERNATIONAL AND NATIONAL INTERPRETATIONS FROM BELGIAN DATA

Following Professor Thorndike’s overall view of the results, we publish two accounts in which the authors have used the data for the interpretation mainly of national performances. In the first of these, M. Hotyat concerns himself particularly with some striking analogies between patterns of achievement in French-speaking countries and contrasts these with patterns of achievement on the same groups of test items in English-speaking countries. M. Hotyat also uses the international results to throw light on certain aspects of the Belgian situation, and in the second half of his article analyses the Belgian results according to school type, sex, regularity of pupils’ promotion through the grades, and nationality of parents. In the course of his paper he suggests a number of interesting methods which can be used to interpret data of this kind.

The Belgian share in the research was carried out by a team from the Centre de Travaux de l’Institut Supérieur de Pédagogie du Hainaut (Mme. Delepine, MM. Hotyat, Lowyck, Rousseaux, Manouvrier).

In accordance with the decision taken by the international research group, the analyses made by the Belgian participants aim at profiles of achievement which provide fruitful opportunities for interpretation. In order to do this, we have constructed profiles in such a manner that in each country the average score over the five tests is reduced to zero. The following statistical approach was adopted with this in mind:

1. We calculated for each test the mean and standard deviation of all the test scores, and expressed the mean for each country as a standard score*.

2. We established for each country the mean of its 5 standard scores.

3. We subtracted the number thus obtained from each of the 5 standard scores so that, for each country, the mean of the adjusted scores equals zero.
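The three steps above can be sketched as follows. This is a minimal illustration, not the team's own computation: the function names are assumptions, the raw means passed to `standard_score` are hypothetical, and the five standard scores are those of the worked example that accompanies this procedure in the text.

```python
def standard_score(country_mean, overall_mean, overall_sd):
    """Step 1: a country's mean on one test, in hundredths of a standard deviation."""
    return 100.0 * (country_mean - overall_mean) / overall_sd


def centred_profile(standard_scores):
    """Steps 2 and 3: subtract the country's own mean standard score from each score."""
    own_mean = sum(standard_scores.values()) / len(standard_scores)
    return {test: score - own_mean for test, score in standard_scores.items()}


# Hypothetical raw figures for step 1 (not taken from the report).
example_step_1 = standard_score(country_mean=31.4, overall_mean=30.0, overall_sd=5.0)

# Standard scores of the worked example in the text.
scores = {
    "Non-verbal": 28,
    "Mathematics": -30,
    "Reading comprehension": 25,
    "Science": 0,
    "Geography": -8,
}
profile = centred_profile(scores)
for test, value in profile.items():
    print(f"{test:22s} {value:+.0f}")
# Non-verbal +25, Mathematics -33, Reading comprehension +22,
# Science -3, Geography -11
```

Centring each country on its own mean removes differences in overall level, so that only the shape of the profile across the five tests remains to be compared.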
Here is an example based on the standard scores from one of the countries:

Non-verbal               +28
Mathematics              -30
Reading Comprehension    +25
Science                    0
Geography                 -8

Their mean is:

$$\frac{28 - 30 + 25 + 0 - 8}{5} = +3$$

By deducting 3 from each of the standard scores we obtain scores with a mean of 0. We thus arrive at the following adjusted scores: non-verbal, +25; mathematics, -33; reading comprehension, +22; science, -3; geography, -11.

This process finally enables us to establish a profile of the scores obtained by each sample in relation to its own mean. For example, Table 1 and Figure 1 show the results from the groups in the three French-speaking countries which took part in the study. The samples were taken from Lisieux in France, from Geneva in Switzerland, and from a central eastern region of Hainaut in Belgium.

* The standard scores express the positive or negative value of a test score in hundredths of a standard deviation from the mean.

TABLE 1

Non-verbal France (Lisieux) Switzerland (Geneva) Belgium (Hainaut) Reading Comprehension Maths - 20 +31 -6 +17 +9 +21 14 +9 +8 +45 - 21 Geography Maths Science - 20 -42 -17 -12

FIGURE 1 (profiles of the Lisieux, Geneva and Hainaut samples on the non-verbal, mathematics, reading comprehension, science and geography tests)

These three profiles show some interesting analogies, in particular a peak in mathematics and troughs in reading comprehension and in science which are common to all three.

A detailed analysis of the profiles obtainable from the scores of each of the twelve participating countries would be beyond the scope of this article, and we shall therefore limit ourselves in Table 2 and Figure 2 to comparing the profile of the means of the three French-speaking samples with that obtained similarly from the three English-speaking groups in England, Scotland and the U.S.A. The way in which the two profiles appear to complement each other is striking.
(It should be remembered that the two language groups which are here being compared represent only half of the national samples which took part in the research.)

TABLE 2

                           Non-verbal   Maths   Reading         Geography   Science
                                                Comprehension
Fr.-speaking countries         -1        +32        -8             +3         -25
Engl.-speaking countries      +23        -34       +15            -21         +17

FIGURE 2 (profiles of the French-speaking and English-speaking samples on the non-verbal, mathematics, reading comprehension, geography and science tests)

Of course, even if the research had related to true national samples, it would still be rash to draw any conclusions from these findings about the quality of the respective school systems; other factors would have to be considered, such as curricula, time-tables, and the form of the tests. But we have now developed a method which will enable us to make comparisons based on real national samples when subsequent research has been carried out.

Relative difficulty of items

We asked ourselves, too, to what extent the order of difficulty of items in each of the subjects tested was the same from country to country when their results were compared.

1. In our preliminary analysis we applied Yule’s* formula to the percentage of success on each item obtained by each national group. The correlation coefficients are spread out in the following way (220 coefficients for 11 samples and 4 subjects): in mathematics, from 0.75 to 1, with a mean of 0.94; in reading comprehension, from 0.60 to 1, with a mean of 0.89; in geography, from 0 to 0.92, with a mean of 0.68; in science, from 0.05 to 0.95, with a mean of 0.54. Assuming that the order of difficulties within each test were equivalent, these data would mean that there existed a closer relationship between the ways in which mathematics was taught in the various countries than there was in the teaching of geography and science.

2. The table of percentages of items correctly answered enables us to make comparisons which are of definite educational interest.
Let us take, for example, on the one hand the mean percentages for three items involving fractions, and on the other for four items involving decimal numbers, and compare the results obtained in Belgium with those from the Anglo-Saxon countries where the system of weights and measures requires early and intensive teaching of fractions. The percentages are set out in Table 3.

TABLE 3
                            Belgium   Anglo-Saxon countries
Fractions (mean %)            76             61
Decimal numbers (mean %)      78.5           75.5

All other things being equal, it would appear that pupils in the Anglo-Saxon countries are at a disadvantage because of the need to start learning fractions at an early age.

* If a = the number of items for which the results are higher than the mean in the two samples E1 and E2; if b = the number of items superior to the mean in E1 and inferior in E2; if c = the number of items inferior to the mean in E1 and superior in E2; and if d = the number of items inferior to the mean in both samples, we have $Q = \frac{ad - bc}{ad + bc}$.

3. By converting the percentages of items passed into standard deviations from various national means, we are able to establish national profiles of the relative difficulty of items in each of the tests, leaving out of account the absolute levels of national achievement. These results, quantified in this way, are more precise and more flexible than those provided by the ranking of items according to the degree of success achieved on them, and they offer very interesting possibilities of analysis. Thus, a comparison of these indices for a particular country enables us to examine whether the order of results has a close correspondence with the hierarchy of aims set up by the authors of school curricula there, and, if this is not so, to study the teaching methods which are being employed with a view to making whatever improvements seem desirable.
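The coefficient defined in the footnote on Yule's formula, as applied in the preliminary analysis under point 1, can be sketched as follows. This is a minimal illustration and not the authors' own procedure: the function name `yule_q` and the item percentages are hypothetical, and treating an item that falls exactly on the mean as "inferior" is an assumption made here for definiteness.

```python
def yule_q(pct_e1, pct_e2):
    """Yule's coefficient from the item success percentages of two samples.

    a: items above the mean in both samples
    b: items above the mean in E1 and not above it in E2
    c: items not above the mean in E1 and above it in E2
    d: items not above the mean in either sample
    """
    if len(pct_e1) != len(pct_e2):
        raise ValueError("both samples must cover the same items")
    mean_e1 = sum(pct_e1) / len(pct_e1)
    mean_e2 = sum(pct_e2) / len(pct_e2)
    a = b = c = d = 0
    for p1, p2 in zip(pct_e1, pct_e2):
        above_1, above_2 = p1 > mean_e1, p2 > mean_e2
        if above_1 and above_2:
            a += 1
        elif above_1:
            b += 1
        elif above_2:
            c += 1
        else:
            d += 1
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("coefficient undefined: ad + bc is zero")
    return (a * d - b * c) / denominator


# Hypothetical percentages of success on six items in two national samples.
sample_e1 = [90, 80, 70, 40, 30, 20]
sample_e2 = [85, 75, 35, 60, 25, 15]
q = yule_q(sample_e1, sample_e2)
print(round(q, 2))
```

With these invented figures two items are easy in both samples, two are hard in both, and one falls on each side, so a = 2, b = 1, c = 1, d = 2. A value near 1 indicates that the two samples largely agree on which items are easier than average.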
Table 4, for example, shows the standard deviations relating to some of the mathematical items, obtained from the Belgian sample's scores:

TABLE 4
                                                                Boys      Girls
2 written calculations (whole numbers)                         -0.145    -0.16
2 written calculations (decimal numbers)                       +0.126    +0.02
4 calculations of areas and volumes                            +0.25     +0.355
1 problem regarding meeting point of moving objects            -0.21     -0.32
3 items requiring pupil to extract information from a table    -0.28     -0.20

Is there any close correspondence between these relative percentages of success (which are independent of the absolute scores in mathematics) and the order of priority in which we would place these topics? The authorities would be faced with the solution of an educational problem if they thought, for example, that accuracy in calculating with whole numbers had not been given a sufficiently prominent ranking in the list of aims set for the teaching of mathematics.

A comparison of these deviations or of the profiles with those deriving from results obtained in the other countries could also lead to interesting observations. Thus, the mean deviations (in standard deviation units) for Belgium and Country X on three parts of the geography test are shown in Table 5:

TABLE 5
                                             Belgium   Country X
12 items of information                       -1.2       +19
Interpretation of maps (16 items)             -2.1       -11.8
Relating facts to generalised statements      +15        -11.3

Taking into account the age of the pupils who are being tested, do these relative results correspond to the hierarchy of aims which the education authorities assign to geography teaching? If not, there are indications that we should study the way in which the curriculum, teaching methods and procedures have been conceived in a country where, it seems, the results achieved reveal a more satisfactory balance.

Results according to sex

If we first convert the scores of the boys to a standard score scale with a mean of 0, here for purposes of comparison is the profile of the mean scores of the group of Belgian girls given as standard scores:

Non-verbal               -0.115
Maths                    -0.23
Reading Comprehension    +0.015
Geography                -0.24
Science                  -0.54

The table of frequencies of mean plus and minus scores for girls from eleven countries (Table 6) confirms this profile when compared with the boys' results.

TABLE 6
Country        No. of negative results (out of 5 tests)
Belgium          5
England          4
Finland          5
France           3
Germany          5
Israel           5
Poland           5
Scotland         3
Sweden           2
Switzerland      4
U.S.A.           1
Total           42/55

Negative results by test: Non-verbal 8/11; Maths 9/11; Reading Comprehension 5/11; Geography 9/11; Science 11/11.

Over the whole range of subjects tested we are able to conclude that, with the exception of reading comprehension, the girls' results are clearly inferior to those of the boys and the gap is particularly marked in science. Various hypotheses could be advanced to explain this situation which seems all the more surprising considering that common programmes of instruction are followed by both boys and girls in most of the countries concerned. Is the relative weakness of the girls' results to be explained by educational factors - for example, teaching which is biased towards literary subjects - or rather by the way in which the tests themselves have been conceived, so that they call especially for the types of intellectual functioning which come more naturally to boys?
Since the profiles of scores vary very greatly from country to country, we are faced with the hypothesis (which needs to be verified experimentally) that there is possibly the influence of a combination of social and educational factors at the bottom of these differences*.

The most striking feature is the general inferiority of the girls' average scores in science. It would be very valuable if research could be done on this problem. The question has real social significance at a time when more room is being found for sciences in educational programmes, and when the technical aspects of life require that schoolchildren receive a basic training in which the experimental sciences play an important part.

* This hypothesis has already been found to have some foundation in experimental studies in the field of arithmetic. Thus Toivo Vahervuo studied the performance of 30 classes in the 4th school year in Helsinki, all using the same textbook, and discovered that the results of the boys' classes were significantly superior to those of the girls' classes at the 2% level; but he also observed a slight, although not significant, superiority of the girls in mixed classes.

Correlations between different types of test

The non-verbal test which we used contains three types of test item requiring the following kinds of reasoning:
- choosing a fourth figure which has the same relationship to the third as the first two have to each other;
- extending a series of numbers or of letters in accordance with a pre-set pattern;
- picking out from a group the odd figure which does not conform to the principle governing the others.

Comparisons have been made of the correlations between this test and those parts of the mathematics and geography tests which depend on reasoning rather than information for an answer. The correlations obtained (Bravais-Pearson formula) are as follows:

Mathematics:
Section with items involving information and calculation                0.26
Section with items involving arithmetical reasoning                     0.50
Section with items involving geometrical reasoning                      0.30
Geography:
Section with items involving information                                0.23
Section with items involving interpretation of maps                     0.17
Section with items requiring facts to be related to general statements  0.64

Comparison of these results does not confirm the existence of a close link between the types of reasoning brought into play in the non-verbal test and those required by the scholastic tests. By way of contrast, Table 7 shows the correlations between the three sections of the mathematics test.

TABLE 7
                                           Maths. II                  Maths. III
                                           (arithmetical reasoning)   (geometrical reasoning)
Maths. I (information and calculations)           0.72                       0.81
Maths. II (arithmetical reasoning)                 -                         0.72

To sum up, in a subject like mathematics the intellectual activity involved depends on a whole range of information and skills.

The dispersion of test scores

We have taken as the coefficient of variation the ratio of the standard deviation to the mean. Table 8 shows these coefficients in the form of percentages for each of the tests and each of the national samples.

TABLE 8 A B C D E F G H I J K Countries Means Tests Non-verbal 26 30 36 31 41 37 48 Mathematics Reading comprehension Geography Science 19 19 21 21 29 24 26 20 27 24 26 24 30 29 36 40 46 28.9 26 26 28 26 28 24.1 24 20 24 23 28 30 28 33 35 33 36 28 33 33 29 40 36 34 36 36 35 36 40 35.2 Means 23.7 23.7 26.5 27.5 28.2 28.5 30 31 33.5 33.7 37.5 43 35 34 36 36.9

Two striking observations emerge from this table:

1. The coefficients of variation vary quite noticeably according to subject.
In the scholastic tests, they are almost constantly lowest in reading comprehension and highest in science. Two chief hypotheses suggest themselves to explain these differences. The first to occur to us is that the tests are not all equally discriminating. However, it is also quite possible that the teaching methods at present in use lead to children producing less homogeneous results in natural sciences than in other subjects. Experimental research is needed to determine to what extent the second factor influences the scores.

2. These coefficients vary considerably from country to country (the difference between the extremes is over 50%). Assuming that the same care had been taken when drawing the samples in every case, this would mean - not taking into account the absolute value of the means - that the national systems produced far from homogeneous results.

Some questions of profound importance emerge from this: Is the dispersion of results satisfactory in relation to the general educational aims in a given country? If not, what sort of curricula and methods have led in other school systems to a variability which could be considered more desirable?

INTERPRETATIONS OF DIFFERENCES IN PERFORMANCE WITHIN THE BELGIAN SAMPLE

1. General situation

One should remind the reader (see description of Belgian sample, p. 10) that the following analysis does not concern all Belgian schools but is only a regional study. Table 9 gives the mean scores obtained by the Belgian sample (scores in hundredths of a standard deviation, after the general mean has been converted to 0).

TABLE 9
          Non-verbal   Maths   Reading Comprehension   Geography   Science
Total        +10        +45           -27                 -15        -25
Boys         +25        +58           -23                  +3         +2
Girls         -5        +32           -31                 -34        -53

These results are particularly high in mathematics and particularly low in reading comprehension. Going beyond this, we can compare the scores on the various sections of the tests.
In mathematics, the Belgian scores, which are barely satisfactory for arithmetical calculations, are particularly high for items which require reasoning and the use of concepts. At the other extreme, in reading comprehension they are especially low for literary, historical and economic texts. One wonders whether our present teaching methods demand sufficient individual participation from pupils when texts are being studied, especially in literature and history. It would be worthwhile to study this problem, since silent reading - which is true reading according to our official primary school syllabus - plays a highly important cultural role.

In geography, Belgian scores are average on questions which involve the direct use of verbal information, but they are rather weak on items calling for the interpretation of maps.

An evaluation of the results in science seems inappropriate, since curricula in this subject differ too much to permit valid comparisons to be made between the mean scores of different countries.

2. Differences according to types of school

Table 10 and Figure 4 below show pupils' achievements expressed in fractions of standard deviations according to types of school and sex. Since in this case we are concerned with comparisons within one country, the figures relate to the norms of the total Belgian sample.

TABLE 10
Type of school                           Non-verbal   Maths   Reading Comprehension   Geography   Science
A. Vocational schools - boys                -26        -32           -44                 -24         -3
B. Vocational schools - girls               -44        -55           -34                 -55        -48
C. General secondary schools - boys         +52        +55           +52                 +55        +57
D. General secondary schools - girls        +24        +34           +11                 +22         -6

FIGURE 4: Profiles on the five tests (non-verbal, maths, reading comprehension, geography, science) of C. the mean for general secondary schools, boys; D. the mean for general secondary schools, girls; A. the mean for vocational schools, boys; and B. the mean for vocational schools, girls, each shown against the mean of the total sample.

a.
Assuming that scores on the non-verbal test can be taken as a valid criterion of the pupils' capabilities, these data lead us to the following conclusions:
- in the sample of boys in vocational schools, the mean performance is rather low in mathematics, very low in reading comprehension, but high in science, when compared with the result of the non-verbal test;
- among girls in vocational schools the mean is higher than might have been expected in reading comprehension, but particularly low in mathematics and geography;
- among boys in general secondary schools, the mean is high in all subjects;
- among girls in general secondary schools, the mean is high in mathematics, a little low in reading comprehension and very low in science.

b. Both for boys and for girls the means in general secondary schools are superior to those for either sex in vocational schools. If the scores for both sexes are combined, the differences are, respectively: 71% of a standard deviation on the non-verbal test, 93% in mathematics, 72% in reading comprehension, 81% in geography and 57% in science.

It would, of course, be quite erroneous to judge the relative merits of these two types of schools merely by comparing these results. In the first place the study covered only subjects to which more importance is attached in general secondary schools than in vocational schools, and in addition to this, the gap between mean scores achieved on the non-verbal test entitles us to assume that the differences already existed at the time the pupils started their post-primary education*.

The means give us only a rough idea of the differences, but the following table (Table 11) of quartile distributions provides us with a more differentiated picture. (These distributions have been obtained by finding the score point at the quartiles for the total group and then calculating the percentage falling within the different quartile ranges in the general and vocational schools separately.)
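The procedure described in the parenthesis above (quartile points taken from the total group, then the share of each type of school falling in each quartile range) can be sketched as follows. The scores are invented for illustration, the function name `quartile_distribution` is hypothetical, and the choice of the "inclusive" interpolation rule for the quartile points is an assumption, since the report does not say how ties and interpolation were handled.

```python
from statistics import quantiles


def quartile_distribution(total_scores, group_scores):
    """Percentage of a group in each quartile range of the total group.

    Returns four percentages, from the lowest to the highest quartile range.
    """
    q1, q2, q3 = quantiles(total_scores, n=4, method="inclusive")
    counts = [0, 0, 0, 0]
    for score in group_scores:
        if score <= q1:
            counts[0] += 1
        elif score <= q2:
            counts[1] += 1
        elif score <= q3:
            counts[2] += 1
        else:
            counts[3] += 1
    return [100 * c / len(group_scores) for c in counts]


# Hypothetical raw scores on one test.
general = [14, 16, 17, 18, 19, 20, 21, 22]
vocational = [8, 10, 11, 12, 13, 14, 15, 17]
total = general + vocational

general_pct = quartile_distribution(total, general)
vocational_pct = quartile_distribution(total, vocational)
print("general:   ", general_pct)
print("vocational:", vocational_pct)
```

If the two types of school had identical distributions, each would show roughly 25% in every range; departures from that figure show where the high and low scorers are concentrated.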
TABLE 11: Percentages of pupils in general (Gen.) and vocational (Voc.) schools falling within the lowest, lower intermediate, upper intermediate and highest quartile ranges of the total group, for maths, reading comprehension, geography and science.

* Here, for example, are the mean percentages of success on the arithmetic test taken during entrance examinations to post-primary schools, based on samples of N=150+ in each type of school.

                                      Boys: General   Girls: General   Boys: Vocational   Girls: Vocational
                                      Secondary       Secondary        Schools            Schools
                                      Schools         Schools
Whole and decimal numbers               74.7 %          74.7 %           70 %               62 %
Mental arithmetic                       59              56.4             38.6               29.8
Fractions                               74              73.1             54.1               50.6
Metric system                           74.3            73.4             56.9               54.1
Calculation of areas and volumes        54.5            50.7             35.5               25.8
Geometrical figures                     71.3            70.5             58.5               44.7
Problems                                64.8            64               53.2               37.4

From: Epreuves Analytiques d'Arithmétique (Publications de l'Institut Supérieur de Pédagogie du Hainaut, 1961).

The differences proceed, in short, from an unequal distribution of superior and inferior scorers between the different types of schools, the superior ones being more numerous in general secondary schools, and inferior ones in vocational schools.

c. The dispersion of scores is wider in vocational schools. Below, in Table 12, we give the scores on the mathematics test in the two types of school:

TABLE 12
                              Mean    S. D.   Coefficient of variation
                                              (S. D. as a percentage of the mean)
Vocational schools            14.4    4.04    28
General Secondary Schools     18.4    3.59    19

The greater homogeneity of results in general secondary schools results from the fact that they are very selective, whereas in the large vocational schools, the spreading of pupils over different parallel courses makes it possible for weaker pupils to stay on in classes leading to leaving certificates at lower levels of achievement.

3.
Differences according to sex

According to Table 10 and Figure 4 the boys' means are higher than those of the girls in general secondary as well as in vocational schools, except in vocational schools where reading comprehension is concerned. Because of the small number of schools covered by our present enquiry, we are not entitled, however, to draw any general conclusions from the results obtained. It would be interesting to carry out research on a larger scale into this problem as it exists in Belgium in order to determine the causes for these differences, should they appear significant.

4. Differences according to regularity of promotion

The pupils included in the sample were distributed among the various grades as shown in Table 13:

TABLE 13
                                  Vocational:   Vocational:   General:   General:
                                  boys          girls         boys       girls
In grade appropriate to age           62            84          114         97
Repeated one year                     78            86           59         53

Vocational schools clearly contain a higher number of 13- and 14-year-old pupils who have repeated a grade, but this repetition has usually already taken place before entry into the vocational schools, as their intake includes numbers of pupils who have already doubled classes at primary school or who have been diverted to technical schools after failure in general secondary schools.

In order to study the extent to which repetition of a grade corresponds to lower levels of ability or educational performance, we present in Table 14 figures for the degree of significance of the difference between the means for each of the tests and for each type of school.

TABLE 14 VOCATIONAL: TEST SCHOOL MSWl Non-verbal Mathematics SOYS Regularly promoted Repeaters Regularly promoted Repeaters 32.24 29.65 15.69 14.37 s. D. 13.7 10.5 t GIRLS Regularly promoted Repeaters 17.80 16.97 Mean 30 s. D. 4.6 1.06 14.32 Science Regularly promoted Repeaters Regularly 26.93 10.5 14.77 3.5 BOYS 13.17 4.1 18.75 16.90 3.8 GENERAL: 12.80 GIRLS 4.1 19.94 9.37 2.7 8.42 3.05 10 2.9 3.8 8.29 2.5 16.84 2.6 23.07 4.3 2.15 37.72 11.3 31.67 9.6 18.82 14.77 16.32 13.17 17.80 4.4 19.44 3.8 16.97 4.05 18.47 4 17.74 3.8 12.98 3.5 9.11 2.8 7.86 2.6 4.17 2.05 4.73 15.44 3.9 11.50 2.7 7.8 4.78 9.30 2.95 t 3.47 5.67 4.83 6.55 S. D. 7.2 1 12.23 Meall 5.73 33.93 3.6 t 11.3 3.24 0.67 13.83 S. D. 2.75 2.63 Repeaters 8 promoted 4.4 Meall 43.57 3.7 Geography GENERAL: 1.88 1.85 4.8 4.3 t 10.8 1.22 3.6 Reading Comprehension VOCATIONAL: HISTORY 4.16

According to this table of pupils' t scores, the differences are important in general secondary schools at the 1% level for all tests, except for reading comprehension in the case of girls where the level is 5%; on the other hand, the t scores are particularly low for boys in vocational schools and the only significant one is that for science. The same trend is present, but to a less marked extent, in girls' vocational schools where the degree of significance of the difference is higher in reading comprehension and in science. These observations confirm that general secondary schools are more severely selective where the subjects examined by the international tests are concerned.

5. Differences according to nationality of parents

The sample of boys in vocational schools included a particularly high percentage - more than 35% - of children of foreign workers, mostly Italians. A recent study* has shown that these pupils are subject to a great deal of educational failure due to the absence of teaching methods which would help them to adapt progressively to our language and school system.
It seemed to us interesting to examine, on the one hand, whether or not the presence of this group had caused a lowering of mean scores for boys in vocational schools, and on the other hand, to study the extent to which children of foreign parents, who had overcome the initial obstacles facing them in elementary school, still remained handicapped in one direction or another.

To this end, we divided the results of the 145 subjects into two sub-groups: children with foreign parents (N=52), and children of Belgian parents (N=93). We then calculated the means and standard deviations separately for each group and estimated the degree of significance of the differences between the means. Table 15 gives the results obtained.

TABLE 15
                              Means                  Standard Deviations     Differences
                         Foreign    Belgian         Foreign    Belgian
                         parents    parents         parents    parents
Non-verbal                31.1       30.2            12.3       11.9         not significant
Mathematics               14.8       15               5.28       4.17        not significant
Reading Comprehension     16.7       18.6             4.82       4.62        significant at 5% level
Geography                 13.8       14.1             4.59       3.97        not significant
Science                    8.5        9.1             3.11       2.81        not significant

A study of Table 15 enables us to conclude:

a. that the mean obtained by children with foreign parents on the non-verbal test is slightly superior to that of Belgian children. There are probably two reasons which chiefly explain this. In the first place they have already overcome a more severe selection process than the Belgian children and it is therefore likely that their mental capacities are higher. In the second place, fewer of them have chosen to enter general secondary schools.

b. that in spite of this advantage, their mean scores on the scholastic tests are all inferior to those of Belgian children. These differences are not high, except in reading comprehension where they represent 40% of a standard deviation and are significant at the 5% level.
This higher difference is explained by the fact that in silent reading knowledge of the language is the essential element, while it is only one of the factors coming into play in the other tests.

c. that the standard deviations are all higher in the group of children with foreign parents. This group is altogether more heterogeneous: while the best pupils remain at a level close to that of the best Belgian-born children, the weaker ones appear to fall further behind their Belgian counterparts.

* De Caster, S., and Derume: 'Retard pédagogique et situation sociale dans la région du Centre et du Borinage' - Institut de Sociologie, U. L. B., 1961.

Figures 5 and 6 below illustrate this situation: they present in diagrammatic form the two groups' results in reading comprehension and mathematics in terms of 5-point normalised scales.

FIGURES 5 and 6: Distributions of scores in mathematics and in reading comprehension on 5-point normalised scales (class limits at -3/2 σ, -1/2 σ, +1/2 σ and +3/2 σ from the mean) for children with Belgian parents (B) and children with foreign parents (F).

D. A. Pidgeon

A COMPARATIVE STUDY OF THE DISPERSIONS OF TEST SCORES

In the second of the two articles which serve to demonstrate the usefulness of international comparative studies in shedding further light on problems occurring in particular educational systems, Mr. Pidgeon compares the standard deviation on the five tests in the twelve countries and draws conclusions about the effect which "streaming" (the form of class organisation extensively practised in England and Scotland) has on the dispersion of test scores. The author discusses the different approach to teaching which "streaming" may encourage and which could account for the phenomenon he has noted.

Introduction

In an earlier study (Pidgeon, 1958) in which the performances of eleven-year-old children from Queensland, Australia, and from England and Wales were compared on tests of non-verbal ability, reading and arithmetic, it was noted that in each of the tests used, the standard deviation of test scores obtained by the English and Welsh children was considerably greater than that of the Australian sample. It was thought that this finding might, in part, reflect the effects of "streaming" as practised in England - it being sometimes alleged that one of the results of streaming is that the brighter children make more rapid progress than they would otherwise achieve, and also that the duller children, when assigned to a "C" stream, become more backward, partly as a result of poor morale, and partly because there is a tendency in some schools for the more experienced teachers to be put in charge of the abler streams. It was noted, however, that while there were larger proportions of high scorers on all tests in the English and Welsh sample, there were differences between the tests with respect to the low scorers; on the two arithmetic tests there were considerably fewer children in this category from the Queensland sample, while on the reading and non-verbal tests there were slightly more. It was cautiously concluded, therefore, that the results might be due to differences in the methods and approach employed in teaching the different subjects and not to any overall differences in the organisation of classes.

Since that study, however, other evidence has been reported which suggests that perhaps the larger variance of test scores may after all be due in part to the system of class organisation employed in England. An investigation (Lloyd and Pidgeon,
1961) in which a non-verbal test, standardised in England, was given to groups of African, European and Indian children in Natal, South Africa, revealed considerably lower standard deviations in that country. It was observed that the system of class organisation employed in Natal meant that “no child can proceed from one class to another unless he can pass the examination set for that class. Such a system inevitably has repercussions on the methods of teaching employed and the concern of a teacher in Natal is to get as many through the examination as possible, since too many failures might reflect on his efficiency. This leads to mass methods of teaching and to a complete lack of recognition of individual differences”. It might be mentioned here that the very opposite occurs in England. Teachers are trained to recognise individual differences and to adjust their teaching accordingly, and indeed, the system of streaming is a further aid to this end. The system employed in Natal is used with all ethnic groups but “its effect is more pronounced with the African and Indian children in view of their larger school classes”. The results reported reflected this, in that the standard deviations tended to be smaller with these groups, particularly with the African children. 57 A further study, particularly relevant to the question of streaming in England, is that of Daniels (1961). Daniels compared the performances of children in two large primary schools that streamed by ability, with those of children in two similar schools in the same type of neighbourhood that did not stream. Daniels stressed that the “non-streaming” in the latter schools was a “consistently thought-out policy” or, in other words, the teachers in these schools did not believe in streaming and in fact “felt it was educationally wrong to do so”. 
His results revealed a consistent trend towards smaller dispersions of test scores in the un-streamed schools; the standard deviations in 22 of his 24 separate comparisons were smaller with un-streamed children although only five of these reached statistical significance. There would appear to be some evidence, then, that the dispersion of test scores in England tends to be rather larger than in other countries and that this larger dispersion may be associated with the method of class organisation employed. The number of studies providing this evidence is, however, fairly small, and it would need to be confirmed before any generalised conclusions could be drawn. The present study It seemed clear that, apart from its other main purpose, the international study sponsored by the Unesco Institute for Education, Hamburg, would also provide an excellent opportunity for making further comparisons of the dispersions of test scores in England and other countries. This article, therefore, is concerned with presenting such results as are relevant for this purpose and to discuss the implications of the results. A full description of the investigation has been given by Foshay in the opening chapter of this report. It is only necessary to note here that attempts were made to make the tests appropriate for each country, despite the necessity to translate into eight different languages. It can be reasonably claimed that this was successfully accomplished for, although a translated test is, of course, a different test, as Foshay points out there is no evidence that difficulties in translation “seriously influenced the national scores”. 
Since this was in the nature of a pilot study in this field, it had been agreed that it was unnecessary to procure a strictly random sample of the school population of the stated age in each country, but that attempts should be made to obtain a representative sample that was as typical as possible of the total population at least with respect to mean test score and standard deviation. The detailed descriptions of the sample tested in each country given by Foshay suggest that, with a few exceptions, this had been achieved. It is, of course, particularly important, if inferences are to be drawn about the dispersions of test scores in different countries, to ensure that samples fully representative of the countries concerned are obtained. Further comment, therefore, will be made on this point in the discussion of the results below. Results The obvious statistic for measuring the spread of scores on a test in any sample is the standard deviation. Using the raw S. D. for each country on a particular test, comparisons can legitimately be made between countries on that test provided it can be shown that there is no direct relationship existing between S. D. and mean. As an indication of how successful the tests were in each country, in no case was the mean score either too high or too low to be seriously influenced by a “ceiling” or “floor” effect, and in only one test (science) was the rank order correlation between S. D. and mean, positive. Hence the standard deviation has been used for making comparisons in this study. Since, however, five different tests were used, each with a different number of items, in order that comparisons could be made between tests, the raw S. D.s for each country have been converted into a standard score. In Table 1, which gives the relevant figures, a high positive value indicates a standard deviation well above the average for all countries on that test, and a negative value a standard deviation below average for all countries. 
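The conversion described above, in which each country's raw S. D. on a test is re-expressed as a standard score relative to the S. D.s of all countries on that test, and the results are then averaged over tests, can be sketched as follows. The raw figures and country labels P, Q, R are invented, the function names are hypothetical, and dividing by the number of countries (the population form of the standard deviation) is an assumption, as the report does not state which form was used.

```python
from statistics import mean, pstdev


def standardise(values):
    """Convert a list of values to standard scores (mean 0, S. D. 1)."""
    centre = mean(values)
    spread = pstdev(values)  # population form: divides by the number of values
    return [(v - centre) / spread for v in values]


def standardised_sds(raw_sd_by_test):
    """Map {test: {country: raw S. D.}} to {test: {country: standard score}}."""
    result = {}
    for test, by_country in raw_sd_by_test.items():
        countries = list(by_country)
        scores = standardise([by_country[c] for c in countries])
        result[test] = dict(zip(countries, scores))
    return result


# Hypothetical raw standard deviations on two tests of different length.
raw_sd = {
    "test_1": {"P": 4.0, "Q": 6.0, "R": 5.0},
    "test_2": {"P": 9.0, "Q": 15.0, "R": 12.0},
}

z = standardised_sds(raw_sd)
average = {c: mean(z[t][c] for t in z) for c in raw_sd["test_1"]}
# Adding a constant of 2.0 removes negative values for plotting.
plotted = {c: value + 2.0 for c, value in average.items()}
print(average)
print(plotted)
```

Because each test is standardised separately, a test with many items and large raw S. D.s carries no more weight in the average than a short one, which is what makes the comparison between tests possible.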
Table 1
Standard Deviations, expressed in Standard Score Form, on Five Tests from Twelve Countries

Country        Non-Verbal   Mathematics   Reading   Geography   Science   Average
Belgium           0.43        -0.26        -0.85      -0.19      -1.34     -0.44
England           1.47         2.23         1.44       1.23       2.61      1.80
Finland           0.29        -0.38         0.48      -0.69      -0.49     -0.16
France            0.78        -0.06        -0.55       0.18      -0.87     -0.10
Germany           0.66         0.14         0.84       0.53       0.27      0.49
Israel            0.23        -0.18        -1.16      -0.47      -0.06     -0.33
Poland           -1.74        -1.21        -1.39      -2.37       0.08     -1.33
Scotland          0.20         1.27         1.21       1.21       0.99      0.97
Sweden            0.81        -0.66         0.11       0.76       0.18      0.24
Switzerland      -1.72        -1.68        -1.49      -0.49      -0.77     -1.23
U.S.A.           -0.06         0.14         0.89       1.03       0.23      0.45
Yugoslavia       -1.34         0.65         0.46      -0.72      -0.82     -0.35

It will be seen from Table 1 that, consistently on every test, England has by far the largest dispersion of test scores. To illustrate these figures graphically, the average value for each country on all five tests is depicted in Figure 1. (To obviate negative values 2.0 has been added in each case.)

FIGURE 1: The Average Value of the Standard Deviation on Five Tests for each of Twelve Countries (B, E, Fi, Fr, G, I, P, Sco, Swe, Swi, U, Y).

Some comments on Table 1 and Figure 1 are clearly necessary. Firstly, it must be emphasised that no significant relationship exists between the standard deviations and the mean scores obtained in each country. Secondly, the fact that the pupils tested in each country were not strictly random samples must to some extent detract from the significance to be attached to these findings. Samples of schools selected by subjective judgment to be representative are more likely than not to yield smaller standard deviations than random samples, owing to the human tendency to under-represent the very bad. Also in some instances a restriction was acknowledged. The description provided by Switzerland of its sample clearly indicates that it was hardly representative of the whole country, since it was taken exclusively from a prosperous middle-class town.
In Israel the sample excluded recent immigrants from under-developed areas; it was also chosen from 8th grade pupils, that is, although the mean age was 13½, it contained some pupils younger than 13 and older than 14. Both these factors clearly affected the dispersion of test scores. In other countries, however, the sample was chosen by a method similar to that employed in England, namely, the testing of all pupils falling within the stated age range attending all types of secondary school in a selected area - the selection of this area being based on other test evidence, which suggested that it was reasonably typical of the whole country both as regards mean and spread of test scores. In Scotland, two areas were chosen - a city and a county - but the sample was deficient in children from the professional and skilled worker classes, probably resulting in some restriction of the dispersion of test scores.

With these reservations the data from Table 1 must be viewed with caution. Nevertheless, there would seem to be fairly clear indications that the dispersion of test scores in England is large compared with that in other countries, thus supporting the previous evidence cited. It would not seem an idle occupation, therefore, to speculate upon the reasons for this.

Discussion

All countries concerned in this study, apart from England and Scotland, employ some variant of what might be called the “grade placement” system. In such a system, children are assigned to grades initially according to age, but subsequently according to their ability to assimilate successfully the work covered in the grade in which they were the previous year. It is possible for some children to be “accelerated”, that is, to miss a grade, and for others to be “retarded”, that is, to repeat a grade.
Thus, at any given point in time, say after five years’ schooling, while the majority of pupils will be found in Grade 6, some will be in Grade 7, and others in Grade 5 or possibly, having “repeated” twice, in Grade 4. The numbers of pupils accelerated or retarded will depend upon the limits accepted as constituting a successful “pass” in the previous grade’s work, and this, in turn, upon the standard of work demanded in each grade.

In England and Scotland, however, yearly promotion is primarily based on age, and if numbers necessitate, pupils within one age group will be divided into separate classes. In both countries, it is the general practice in such circumstances to “stream” these separate classes by ability and attainment, even in the junior school, i.e. between age 7 and 11. Owing to the relatively larger number of small primary schools in Scotland, the proportion of children in streamed classes is somewhat less.

These descriptions do not do justice to the differences that exist in the various countries; they are probably sufficient, however, for the present purpose, since it is contended that it is not the difference between the two systems that is important so much as the general aims and beliefs of the teachers practising within the systems. It is argued that the major objective of the grade class teacher is to ensure that as many of his pupils as possible complete the work of the grade successfully. His class is, however, heterogeneous with regard to ability and it is not unreasonable to suppose that the brighter children will require much less effort from the teacher than the duller ones. Hence, while it is possible that the brighter children within such a group will not be unduly extended, the duller ones will presumably receive every encouragement to achieve the grade pass and the net result will be a tendency for achievement test scores, if not to cluster closely around the mean, at least to be relatively unrepresented at the extremes.
In England, however, there is no such general acceptance of a similar curriculum for all children in a particular class and, certainly in streamed schools, the work achieved by “A” stream children at any given age will be considerably more advanced than that expected of “C” stream children.

This introduces the notion of “expectancy” or the standard of work expected by a teacher of his pupils. What is expected will clearly be determined in the first place by any curriculum that is defined, but secondly, it will be influenced by the philosophical beliefs of the teacher. In many countries employing the grade placement system, the curriculum for any grade will be defined quite clearly even to the provision of “state” text books; in others a greater degree of fluidity within a school or class may be found. In England and also Scotland individual schools have a higher degree of autonomy and what is taught at any age will depend to a greater degree upon what the head teacher or even the class teacher thinks is right, although external examinations even in the primary school control this to a certain extent. But in all countries what a teacher expects from individual pupils in his class must also depend upon his own particular beliefs. Different patterns of achievement would be obtained, for example, by two grade class teachers, one of whom believed that all children in his class were perfectly capable of covering adequately the work of the grade, given sufficient effort on his part with the duller ones, and the other who believed that, since achievement was necessarily limited by innate ability, there must be some children in his class incapable of completing the year’s work successfully. Such difference in beliefs will also have an influence on the work of brighter children.
The teacher who strives to match attainment with ability will also be aware that children of high ability are capable of work more advanced than that demanded by the grade syllabus, and although in the grade system this matching will be achieved to some extent by jumping a grade at promotion time, it is clear it will also influence what the teacher expects from the brighter pupils within his own class.

When considering the effects of these different beliefs upon the spread of achievement at any given age, it would seem clear that the teacher who stresses innate individual differences to the extent of regarding innate ability as the major factor in determining achievement, will expect and hence tend to obtain a wider dispersion of attainment than the teacher who, although accepting the fact of individual ability differences, nevertheless believes that the environmental influences of the teaching situation play an equal if not more important part.

It is maintained that this belief in the relative importance of innate individual ability differences is predominantly held in England and indeed has led to the general acceptance of the practice of streaming. Burt (1959) has said “. . . it is plainly imperative that both teachers and local authorities should take full account of such differences in their efforts to provide an education which will (in the words of the Act)* be adapted to each child’s ‘ability and aptitude’”. To match attainment to innate ability presupposes, of course, that the ability can be measured. Some popular misunderstandings regarding attempts to do this have been described elsewhere (Pidgeon, 1961). Burt himself has stressed many times the difficulty of ascertaining a child’s “I.Q.” accurately (e.g., Burt, 1959), but nevertheless insists that, because of the wide dispersion of measured intelligence at age 10 or 11, “some kind of provisional selection is absolutely essential: in my view it should start much earlier, . . .
indeed, as soon as possible after a child has entered school”. The acceptance of this view led not only to the separation of children of greater and less ability into different types of secondary school (Board of Education, 1926) but also to the practice of streaming in junior schools (Board of Education, 1937). However, and Burt himself stresses this, if children are to be separated for differential instruction from an early age, it is essential that the child is “free to swim from one stream to another” as his “capacities develop or decline” (Burt, 1959) and also, it must be added, to allow for inaccuracies in the original measurement. But, as Daniels (1961a) has shown, there is far less fluidity in streaming than teachers themselves imagine, or, as Vernon (1955) has demonstrated, is necessary.

The concern here is not whether streaming is, in itself, good or bad, but with its effect on the dispersion of achievement. It is argued that the expectancy of “A” stream teachers for relatively high attainment helps in itself to lead to this result being obtained, just as the expectancy of “C” stream teachers for relatively low attainment helps to produce this result. Also, of course, the belief that attainment can and should be matched to ability, made easier perhaps in homogeneous ability classes, while it would tend to result in the stretching of brighter children, would not have this effect with duller ones, since, for many at least, the limit of their capacity would apparently have been reached. The effect this has on increasing the dispersion of achievement scores might, perhaps, be enhanced where teachers use tests of “intelligence” and attainment to help measure the success of their efforts, for ordinary regression effects will tend to make “dull” “C” stream children, subsequently tested for attainments, appear to be “working up to capacity” and bright “A” stream children appear to have room for further improvement.
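The regression effect referred to here can be illustrated with a short calculation. It is a sketch under stated assumptions, not part of the study: both the “intelligence” score and the attainment score are taken to be in standard-score units, and the correlation between them is set at 0.7, a figure chosen purely for illustration.

```python
def expected_attainment(ability_z, r):
    """Expected attainment (in S.D. units) for a given ability score when the
    two measures correlate at r; the prediction regresses towards the mean."""
    return r * ability_z


R_ASSUMED = 0.7  # hypothetical correlation between the two tests

c_stream_ability = -1.5  # a "dull" C stream child, 1.5 S.D. below the mean
a_stream_ability = +1.5  # a bright A stream child, 1.5 S.D. above the mean

c_gap = expected_attainment(c_stream_ability, R_ASSUMED) - c_stream_ability
a_gap = expected_attainment(a_stream_ability, R_ASSUMED) - a_stream_ability

# A positive gap means attainment stands higher than the ability score,
# a negative gap means it stands lower.
print(f"C stream child: attainment minus ability = {c_gap:+.2f} S.D.")
print(f"A stream child: attainment minus ability = {a_gap:+.2f} S.D.")
```

With any correlation below 1 the low scorer's expected attainment lies nearer the mean than his ability score, so he seems to be working up to or beyond capacity, while the high scorer seems to fall short, whatever the teaching has actually achieved.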
It should perhaps be added here that the more successfully children’s attainments are matched to their ability, the more successful will any initial streaming appear to be - the “self-fulfilling prophecy” described by Daniels (1959).

There would appear, therefore, to be a number of factors affecting the dispersion of achievement at any given age. In the first place, the general aim of the grade class teacher may tend to result in a relatively smaller dispersion. Perhaps exerting a greater influence, however, is the belief a teacher may have that innate ability is of paramount importance in determining the level of attainment to be expected from a child. Streaming by ability, which is viewed as an administrative device resulting from the acceptance of this belief, will merely tend to enhance its effects. When all these factors act in the same direction the effect will clearly be greatest and this is what happens in England. Here, it is claimed, the aims and, more especially, the beliefs of most teachers and educational administrators lead them to expect wide differences in performance, and this is what is therefore achieved. Where, on the other hand, the grade placement system operates and especially where, within such a system, teachers do not attempt to measure innate ability and therefore do not expect their pupils’ attainments to be matched to it, then the dispersion of achievement will be much less.

* The 1944 Education Act.

While possible explanations for the relatively wide dispersion of test scores in England can be offered, clearly a value judgment is involved in answering the question as to whether this is, educationally, good or bad.
Some further relevant evidence can be given from the present study, however, by examining the proportions of pupils obtaining relatively high and low scores. Obviously, the average levels of performance will influence these proportions, but the contrast between England and other countries is shown in Table 2, which gives the percentage of pupils scoring outside the limits of raw score approximating to plus and minus one standard deviation.

TABLE 2
Percentage of Pupils scoring beyond ± 1 S.D. on each of Five Tests

                    Below -1 S.D.                 Above +1 S.D.
Test           Average of 12                 Average of 12
               countries       England       countries       England
Non-Verbal        20.7           15.9           20.3           28.7
Mathematics       20.9           39.1           22.7           15.5
Reading           19.5           25.1           18.2           20.8
Geography         21.9           34.7           18.3           11.4
Science           18.4           24.2           16.0           18.1

It will be observed from Table 2 that, in three of the five tests (non-verbal, reading and science) England has a larger percentage than the average scoring above plus one standard deviation, but that in all four attainment tests, it also has a larger percentage than the average scoring below minus one standard deviation. Some concern might be felt for the 39.1 % of pupils obtaining low scores on the mathematics test.

Bibliographical references

Board of Education, 1926 - Report of the Consultative Committee on the Education of Adolescent Children, H.M.S.O., London.
Board of Education, 1937 - Handbook of Suggestions for Teachers, H.M.S.O., London.
Burt, C., 1959 - General Ability and Special Aptitudes, Educational Research, Vol. I, No. 2.
Daniels, J. C., 1959 - Some effects of sex segregation and streaming on the intellectual and scholastic development of junior school children. Unpublished thesis, Nottingham University.
Daniels, J. C., 1961a - The effects of streaming in the Primary School, I - What Teachers Believe, Brit. Jour. Educ. Psych. 31, 69-78.
Daniels, J. C., 1961b - The effects of streaming in the Primary School, II - A Comparison of Streamed and Unstreamed Schools, Brit. Jour. Educ. Psych. 31, 119-127.
Lloyd, F. and Pidgeon, D. A., 1961 - An Investigation into the Effects of Coaching on Non-Verbal Test Material with European, Indian and African Children, Brit. Jour. Educ. Psych. 31, 145-151.
Pidgeon, D. A., 1958 - A Comparative Study of Basic Attainments, Educational Research, Vol. I, No. 1.
Pidgeon, D. A., 1961 - The interpretation of test scores, Educational Research, Vol. IV, 33-43.

David A. Walker

AN ANALYSIS OF THE REACTIONS OF SCOTTISH TEACHERS AND PUPILS TO ITEMS IN THE GEOGRAPHY, MATHEMATICS AND SCIENCE TESTS

To establish the relevance of test items to pupils’ learning opportunities is important both from the point of view of measuring achievement and from that of maintaining the goodwill of teachers whose pupils undergo the tests. In the following article Dr. Walker shows how this need can be fulfilled and seeks further to establish the extent to which in- and out-of-school learning opportunities, ability and other factors appeared from the international tests to be determinants of success.

When the same test, or series of tests, is administered to pupils of different countries with different educational systems, it is unlikely that the items will be equally acceptable or equally useful in all of the countries concerned. The present inquiry was intended in the first place to assess the reactions of the teachers of the classes concerned to the items used in the tests of geography, mathematics and science, and secondly to estimate, if possible, the contributions made by stress on the topics in the curriculum and by the environment to the accuracy with which pupils answered the questions.

Rating the items

To obtain the opinion of the teachers concerned, copies of the test booklets were sent to each school taking part in the original investigation. The following instructions were issued with the booklets.
“Instructions for rating items

(1) The attached test in Geography (Mathematics, Science) was taken by a group of 13-year-old pupils in your school last year. You are asked to rate the items of the test in two ways:
(a) in relation to the curriculum used for these pupils;
(b) in relation to the environment of the pupils.

(2) Read each test item. Consider the degree to which the knowledge and skills required by the question are typically covered in the instruction of pupils in classes like yours. Give a rating on the following three point scale.
Rating 1 Stressed: well covered in class and in homework (if any).
Rating 2 Included but not stressed; touched on but not dealt with extensively, intensively or repeatedly.
Rating 3 Not included.
Enter the rating 1, 2 or 3 in the answer space provided in the test booklet.

(3) Consider next the extent to which pupils have opportunity in the home and in the community to use or encounter knowledge and skills such as those involved in the test question. Decide whether there is considerable, some, or little exposure to such experiences. Use the following rating scale.
Rating A Considerable exposure
Rating B Some exposure
Rating C Little or no exposure
Enter the rating A, B or C in the answer space provided in the test booklet.

(4) Each item will thus have a double rating, e.g., 1A, 2C. Indicate clearly on the front of the test booklet the school and course to which the ratings refer. At the time of the tests these courses were described as five-year, three-year with one foreign language, three-year with no foreign language and three-year modified. In some schools the courses for boys and girls differ and separate booklets will be required for each sex. The booklets should be labelled accordingly.

(5) Any comment which teachers may wish to make on the tests will be welcomed by the Council.”

All schools taking part in the original investigation co-operated in this supplementary inquiry.
A very small part of the data had to be rejected because some teachers did not give a definite rating, e.g., an item was rated “1 or 2” or “B/C”.

Of the 864 curriculum ratings given to the 32 items of the geography test 28 % were 1, 44 % were 2, and 28 % were 3. In the opinion of the Scottish teachers this test had only a moderate relation to the topics covered in school work. The environment ratings A, B and C occurred in the proportions 9 %, 34 % and 57 %, i.e., the teachers felt that only a minor amount of help came from the environment.

In mathematics the position was similar though in this subject there was a higher proportion given to rating 1. Of the 1,066 curriculum ratings for the 26 items, 42 % were 1, 34 % were 2 and 24 % were 3. The help expected from the environment was even less in this subject, the rating A occurring in only 6 % of the replies, B in 28 % and C in 65 %.

In science, the curriculum ratings were similar to those in the other subjects. Of 730 ratings on 21 items, 30 % were 1, 30 % were 2 and 40 % were 3. Greater help was thought to be available from the environment in this subject, the rating A being given in 20 % of the cases, B in 40 % and C in 40 %.

The ratings differed greatly from item to item in each test as is shown in the list given in the Appendix.

It must not be assumed from the figures quoted above or from the data in the Appendix that the topics with adverse curriculum ratings are not covered in the school courses. In many cases they occur at a later stage in the curriculum. The pupils tested were mostly in the second year of a course which is of three to six years’ duration and different schools have different schemes for covering the work.

Agreement among the ratings

It would have been possible to calculate a mean curriculum rating and standard error of the mean for all teachers, using the values 1, 2 and 3 for the three ratings. This might, however, have given a misleading picture.
For example, an item rated 1 by 20 teachers, 2 by none, and 3 by 20 teachers would then be given a mean rating of 2, which was not actually given by any teacher, and a standard error of about 0.16, which is relatively small because of the number of teachers involved. For this reason the table in the Appendix gives the most frequently occurring rating for each item and not the mean.

One possible factor causing disagreement among ratings is the variation in course to suit the ability and sex of the pupils. In any assessment of the extent of agreement among teachers in rating the items it is therefore advisable to deal separately with different types of course. The main types in Scotland are (a) the five-year course for the more gifted pupil, (b) the three-year course with one foreign language for the pupil a little above the average, (c) the three-year course with no foreign language for the average pupil and (d) the modified course for the pupil within the lowest 10 to 20 % of the ability range. In some schools it is also necessary to differentiate between courses for boys and those for girls.

Within each of these types of course we can assess the extent of agreement among the teachers’ ratings by calculating their variance. If the curriculum ratings are valued at 1, 2 and 3, as given by the teacher, the variance of the distribution will be zero when all teachers agree, 2/3 when the ratings are distributed evenly over all three values, and 1 when half of the teachers select rating 1 and the other half select rating 3, showing maximum disagreement.

These variances were calculated for all the items of the three tests, using the categories previously described, and the results are summarised in Table 1.

TABLE 1
Variances of Curriculum Rating Distributions for Different Items

                                                                Number of   Lowest    Average   Highest
Test         Course                                             teachers    variance  variance  variance
Geography    Five-year                                              6         0         0.30      1
             Three-year, one foreign language                      11         0         0.39      1
             Three-year, no foreign language                        6         0.15      0.36      0.63
             Three-year, modified                                   4         0         0.20      0.67
Mathematics  Five-year                                                        0         0.29      0.75
             Three-year, one foreign language                       6         0.14      0.39      0.80
             Three-year, no foreign language (boys)                11         0         0.25      0.52
             Three-year, no foreign language (girls)               12         0         0.25      0.67
Science      Five-year                                                        0         0.18      0.75
             Three-year, one foreign language                                 0         0.43      0.80
             Three-year, no foreign language (boys)                           0.13      0.53      0.98
             Three-year, no foreign language (girls)                          0         0.43      0.86
             Three-year, no foreign language (boys and girls)                 0         0.36      1

It will be observed that even within the main types of course the curriculum ratings of particular items were in perfect agreement for some items and in complete disagreement for others. A similar pattern was obtained from the environment ratings. These patterns indicate that the extent of agreement among the teachers, even within a course, was only moderate when averaged over all the items of each test. They throw little or no light on the reliability of each teacher’s ratings.

The relation between facility of item, teachers’ ratings and ability of class

Although the reliabilities of the teachers’ ratings were not established by the results of the preceding section, an effort was made to measure the extent to which the stressing of a topic in the curriculum, as assessed by the rating, or having help from the environment, similarly assessed, determined the pupils’ proficiencies in that topic. The proficiency of each group (which might comprise all pupils in a particular course in one school) was measured by the percentage of correct answers to the appropriate item. This percentage was converted to the corresponding probit (i.e., normal deviate plus five) for statistical reasons.
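The variance figures used in the preceding section to gauge agreement among teachers (0, 2/3 and 1), and the standard error of about 0.16 quoted for the item rated 1 by 20 teachers and 3 by 20 teachers, can be checked with a few lines. This is a sketch only; it assumes the population form of the variance (dividing by the number of teachers), which is the form that reproduces the three values given in the text. The group sizes other than the 20/20 split are invented for illustration.

```python
from math import sqrt
from statistics import mean, pvariance

# Curriculum ratings valued at 1, 2 and 3, one entry per teacher.
all_agree = [2] * 30                 # every teacher gives the same rating
evenly_spread = [1, 2, 3] * 10       # ratings spread evenly over 1, 2 and 3
split = [1] * 20 + [3] * 20          # half choose 1, half choose 3

for label, ratings in [("all agree", all_agree),
                       ("evenly spread", evenly_spread),
                       ("split 1/3", split)]:
    print(f"{label}: variance = {pvariance(ratings):.3f}")

# The 20/20 example: a mean of 2 that no teacher actually gave,
# with a deceptively small standard error of the mean.
split_mean = mean(split)
standard_error = sqrt(pvariance(split) / len(split))
print(f"mean = {split_mean}, standard error = {standard_error:.2f}")
```

The last two lines show why the most frequent rating, rather than the mean, is the safer summary when opinion is divided.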
The curriculum ratings 1, 2 and 3 were replaced by 1, 0 and -1 and the environment ratings A, B and C were changed to the numerical scale 1, 0 and -1 also.

It is likely that the proficiency of a group is affected by the general ability of the group, whatever the stress in the curriculum or the help provided by the environment. A simple and rough measure of this ability for each group was obtained by rating five-year courses as 2, three-year courses with one foreign language as 1, three-year courses with no foreign language as 0, and modified courses as -1. The objection may be raised that it is not the ability of the pupils that is here being assessed but their exposure to a given type of course. It would have been possible, by seeking further information from schools, to establish that the average abilities, as measured by tests of verbal reasoning, are in the descending order of the numbers given. Readers who prefer to do so may replace the word “ability” in the following discussion by the phrase “type of course followed”.

For the geography test there were then available 27 different groups, each containing at least seven pupils, and for each of these groups and for each item of the test there were available (a) the facility probit; (b) the curriculum rating and (c) the environment rating for the item; and (d) the ability level of the group. For example, the 49 pupils in the 5-year course in one school gave 35 correct answers to item 2 of the geography test, rated 1B by the teachers in that school. Thus the facility percentage for curriculum rating 1, environment rating 0 and ability level 2 was 71.4 % for this group, giving a facility probit of 5.57. The 27 groups then provided the data to set up for each item the regression equation

$$\text{facility probit} = b_1 \times \text{curriculum rating} + b_2 \times \text{environment rating} + b_3 \times \text{ability level}$$

As a first approximation each group was given equal weight, i.e., the differences in the numbers of pupils in the groups were ignored.
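The worked example above (35 correct answers from 49 pupils) can be reproduced with the standard normal distribution. The sketch below takes the probit to be the normal deviate corresponding to the proportion correct, plus five, as defined in the text; the function name is illustrative only.

```python
from statistics import NormalDist


def facility_probit(correct, total):
    """Probit of an item's facility: normal deviate of the proportion
    answering correctly, plus five."""
    proportion = correct / total
    # inv_cdf is undefined for proportions of exactly 0 or 1.
    return NormalDist().inv_cdf(proportion) + 5.0


percentage = 100 * 35 / 49
probit = facility_probit(35, 49)
print(f"facility = {percentage:.1f} %, probit = {probit:.2f}")
```

A group scoring 0 % or 100 % on an item has no finite probit, which is one practical reason why a regression on probits becomes awkward for an item on which many groups score zero.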
This technique was applied to three items in the geography test, three in the mathematics test and four in the science test. The items were chosen partly for their relevance to the present inquiry and partly because results from the main inquiry had suggested points of interest. A summary of the results is shown in Table 2 in which coefficients which are statistically significant are marked *.

It will be observed that for no item was the regression coefficient for the curriculum rating significantly different from zero, and only for one item was this true for the regression coefficient for the environment rating. On the other hand, for all items save one the regression coefficient for the ability rating was significantly greater than zero. In other words, the proficiency of a group in answering an item is directly related to the ability level of the group, but appears to have little relation to the amount of stress given by the teacher to the topic tested by the item. The fraction of the whole variance of the probits accounted for by the regression equation varied from a non-significant 12 % for Item 9 of the science test to 57 % for Item 22 of the mathematics test.

Table 2
Regression for Selected Items
Column headings: Test; Item; Regression coefficients (Curric., Envt., Ability); Standard errors (Curric., Envt., Ability); Percentage of variance accounted for.
Geography 2 -0.12 0.30 0.36* 0.19 0.18 0.11 50 0.06 -0.04 0.34* 0.11 0.14 0.08 46 0.26 0.31* 0.16 0.15 0.11 45 0.16 0.30* 0.21 0.21 0.15 24 0.54* 0.13 0.13 0.09 57 0.31* 0.19 0.10 0.17 0.11 52 0.11 0.08 43 0.14 0.17 0.13 0.14 0.10 0.11 23 5 12 22 Science 8 14 Mathematics 1 -0.11 0.22 see discussion 0.21 -0.11 -0.04 7 0.16 9 0.11 18 0.11 0.44* 0.07 -0.02 0.31* 0.17 0.12 0.26* 12

Item 2 of the geography test was of information type, asking what use was made of Tundra regions. Item 8 was purely factual, asking into which sea the River Danube flowed.
In either item it might have been thought that degree of stress in the curriculum would have markedly affected the proportion of correct answers. This was not so, the accuracy of the pupils’ answers being related only to their general ability and not to the stress given in the curriculum or to the help of the environment. The position was very similar in response to Item 14, which was an exercise in interpreting a map.

The first mathematics item to be examined was number 5, which was “multiply 9.04 by 0.4”. The performances of the groups varied from 0 % to 100 %, but the regression accounted for only 24 % of the variance. One reason may be that 31 of the 40 teachers gave the curriculum rating 1, showing that this type of calculation was stressed in the curriculum. Within that rating, degrees of stress would no doubt vary.

The second mathematics item examined was number 12, which was a problem involving the calculation of the area of a triangle with base 40 yards and altitude 37 yards. This question proved very difficult for Scottish pupils, twenty-three of the forty groups scoring zero, and the mean score of all groups being only 19 %. As there was so large a number of zero scores, the regression technique was not applied. It was, however, noted that the percentages of correct answers for the three curriculum ratings were 23 %, 18 % and 17 %, while those for the four ability groups were 49 %, 24 %, 12 % and 0 %.

Item 22 of the mathematics test was again a calculation of areas, but in this case an example was given. Scottish pupils fared better on this item; every group contained pupils giving correct answers and the mean score of all groups was 59 %. The percentage of the variance accounted for by the regression equation was 57, the highest of all the items examined.

Items 1, 7 and 9 of the science test were selected partly because sex differences were shown in the percentages of correct answers, boys being superior in all three.
The first referred to the force required to push an object up an inclined plane and the teachers, while not stating that this type of question was stressed more frequently with boys than girls, appeared to be of opinion that it was the kind of problem more likely to occur in a boy’s environment than a girl’s. This was the only item in which the environment rating was significant. Boys were also superior in their replies to Item 7, which dealt with the principle of flotation, but neither curriculum rating nor environment rating appeared to affect the regression equation.

The responses to Item 9, on the principle of the lever, provided some surprises. Not one of the regression coefficients was significant nor was the contribution of all three together, the percentage of the variance attributable to regression being only 12 %. The percentage correct over all groups was 36 % and the percentage for the various groups ranged from 9 to 75 %. As the item was a multiple choice one, with four possible answers, there is a suggestion here that a fair amount of guessing had occurred. This idea is supported by the fact that the curriculum rating for 26 of the 36 groups indicated that the topic had not been referred to by those teachers.

Finally, Item 18 of the science test, which referred to the usual method of estimating the age of a tree, produced a good response from the Scottish pupils, but once again the only factor associated with success was the ability rating, and the proportion of the total variance accounted for by all three factors was only 23 %.

The results of this analysis may be disappointing to teachers in that so little difference seems to be made to the proportion of correct answers by their stressing or not stressing particular topics. It must be borne in mind, however, that the measures used were comparatively coarse and that the analysis has been made as simple as possible.
With these reservations, it would appear that, at the age at which the tests were administered, ability is a greater determinant of success than stress by teacher or help from environment, and that other factors are, in most cases, having greater effect than all three together.

APPENDIX

Most frequently given ratings for each item
Column headings: Item; Geography; Mathematics; Science.
1 1B 1B 3C 2 2B 1C 3C 3 2C 1C 3C 4 1C 5 1C 2B 1C 1C 3B 6 2C 2B 3C 7 1C 1B 3C 8 2C 1B 1B 9 1C 1C 3C 10 2C 1C 3C 11 2C 1C 3C 3A 1B 12 2C 13 2C 2C 1B 14 2C 1A 15 2C 3C 2B 16 2C 2C 2C 17 2C 2C 2B 18 2C 3C 2B 19 2C 3C 2C 20 2C 3C 3B 21 2C 1B 1C 22 2C 1C 23 2C 3C 24 2C 3B 25 2C 3C 26 2C 3C 27 2C 28 2C 29 3C 30 3C 31 3C 32 3C 1C