Finding and Defining Pertinent and Practicable Ways of Assessing Language Learners' Productive Skills at 'A2' Level

How viable marking grids can be established for the competence-based assessment of pre-intermediate students' speaking and writing skills, and how they could be sensibly integrated into the Luxembourg ELT curriculum.

Michel Fandel
Candidat au Lycée Technique Michel Lucius
Luxembourg (2010)

Plagiarism statement

I hereby certify that all material contained in this travail de candidature is my own work. I have not plagiarised from any source, including printed material and the internet. This work has not previously been published or submitted for assessment at any institution. All direct quotation appears in inverted commas or indented paragraphs and all source material, whether directly or indirectly quoted, is clearly acknowledged in the references as well as in the bibliography.

Michel Fandel

Abstract

Due to their direct and ultimately decisive impact on final grades that decide whether or not individual students pass their school year (or successfully complete their learning cycle), summative tests inherently influence the learners' chances to progress through the various stages of our school system. It is thus of paramount importance that these instruments provide a theoretically sound and fundamentally reliable basis on which the teacher can reach adequately informed judgments about a student's true level of achievement and competence. Making sure that both the content of summative tests (i.e. what is checked and assessed) and their form (how they verify knowledge and skills) live up to these standards is just as challenging as it is crucial. Especially in a significantly changing national education system which is becoming ever more focused and reliant on competence-based methods of language teaching, it is clear that long-standing practices in the field of summative testing need to be reconsidered and newly adapted as well.

One of the first objectives of this thesis therefore consists in identifying common fallacies that have affected predominant testing and assessment schemes in the Luxembourg school system for many years. After outlining salient theoretical cornerstones that must be at the root of appropriate testing and assessment procedures, Chapter 1, in particular, analyses problematic elements in the 'traditional' ways of approaching speaking and writing in summative tests. In the search for more suitable alternatives, Chapter 2 chiefly focuses on the enormous potential offered by the Council of Europe's Common European Framework of Reference for Languages as a basis for a competence-oriented teaching and assessment scheme, though not without highlighting some of the contentious elements of this groundbreaking document in the process. The third and fourth chapters, through detailed descriptions and analyses of practical examples, then illustrate how competence-based tests and assessments can be implemented and integrated into daily teaching practice in relation to each of the two productive skills.
The concluding Chapter 5 not only stresses the resulting beneficial effects of these new ways of assessment, but also outlines the challenges that still lie ahead before the Luxembourg ELT curriculum has turned into a maximally effective and coherent framework for competence-based teaching, testing and assessment, uniting and spanning across all its different levels.

Table of contents

Plagiarism statement
Abstract
Table of contents
Chapter 1: Analysing 'traditional' ways of assessing productive skills at pre-intermediate level
    1.1. Tests and assessment: theoretical considerations
        1.1.1. 'Test' versus 'assessment'
        1.1.2. Different types of assessment
    1.2. Factors defining a 'good' test
        1.2.1. Validity
        1.2.2. Reliability
        1.2.3. Feasibility
        1.2.4. Authenticity
    1.3. Problem zones in 'traditional' ways of assessing productive skills at pre-intermediate level
Chapter 2: Towards a different, competence-based assessment method of speaking and writing
    2.1. The Common European Framework as foundation for competence-based assessment
        2.1.1. Why choose the CEFR as a basis for teaching and assessment?
        2.1.2. The need for caution: political and economic concerns
        2.1.3. The need for caution: pedagogic concerns
        2.1.4. Reasons for optimism: chances offered by the CEFR
    2.2. Challenges of introducing competence-based assessment
        2.2.1. The achievement versus proficiency conundrum in summative tests
        2.2.2. Contentious aspects of CEFR descriptors and scales
    2.3. The CEFR and the Luxembourg school system: possible uses and necessary adaptations
Chapter 3: Competence-based ways of assessing speaking at A2 level
    3.1. Central features of interest to the assessment of speaking
        3.1.1. Features shared by both productive skills
        3.1.2. Features specific to speaking
    3.2. Case study 1: speaking about people, family, likes and dislikes
        3.2.1. Class description
        3.2.2. Laying the groundwork: classroom activities leading up to the test
        3.2.3. Speaking test 1: description of test items and tasks
        3.2.4. Practical setup and test procedure
        3.2.5. Form, strategy and theoretical implications of assessment used
        3.2.6. Analysis of test outcomes
    3.3. Case study 2: comparing places and lifestyles / asking for and giving directions
        3.3.1. Speaking test 2: description of test items and tasks
        3.3.2. Form, strategy and theoretical implications of assessment used
        3.3.3. Analysis of test outcomes
Chapter 4: Competence-based ways of assessing writing at A2 level
    4.1. Reasons for change: a practical example
        4.1.1. Description of the implemented test task
        4.1.2. "Traditional" method of assessment used
    4.2. Key features of a 'good' free writing task
    4.3. Central features of interest to the assessment of writing
    4.4. Defining an appropriate assessment scheme for writing tasks
    4.5. Case study: using a marking grid to assess written productions in a summative test
        4.5.1. Description of test tasks
        4.5.2. Description of test conditions
        4.5.3. 'Horoscope' task: analysis and assessment of student performances
        4.5.4. 'Summer camp' task: analysis and assessment of student performances
        4.5.5. Outcomes of the applied assessment procedure and general comments
    4.6. Alternative types of writing tasks
        4.6.1. Informal letters and emails
        4.6.2. Story writing
Chapter 5: Conclusions and future outlook
    5.1. The impact of a competence-based approach on test design and assessment
        5.1.1. Effects on validity
        5.1.2. Effects on reliability
        5.1.3. Feasibility of the explored testing and assessment systems
    5.2. The Luxembourg ELT curriculum, the CEFR and competence-based assessment: perspectives
Bibliography
List of appendices

Chapter 1: Analysing 'traditional' ways of assessing productive skills at pre-intermediate level

Few elements in the field of evaluation can be as treacherous as the numerical marks that sum up a student's level of performance in an end-of-term report. By virtue of their apparent clarity and specificity, numerical values tend to assume a definitive authority that all too often remains unquestioned and absolute.
As a result, such numbers represent a student's learning achievements over the course of a term in a tidy and seemingly objective form which can easily tempt teachers, students and parents alike into drawing overly general conclusions about a particular learner's target language proficiency and overall progress. As any devoted teacher will testify, however, assessing a student's writing or speaking performances involves a much more complex and tricky process than a sole summative value can ever possibly represent, whether it is expressed by means of percentage points, broad categories (such as A-F or 1-6) or, as in the very peculiar case of the Luxembourg secondary school system, within the framework of a 60-mark scheme. Thus, as H. Douglas Brown puts it, '[g]rades and scores reduce a mountain of linguistic and cognitive performance data to an absurd molehill.' [1]

From the moment when a summative test is designed until a final grade or mark is awarded, every single decision can ultimately have a major impact on the teacher's interpretation and valuation of a student's product. Yet since a numerical mark invariably concludes the summative assessment process, it is all the more important that the reasoning which has led to that conclusion is based on theoretically and practically sound test items and tasks, assessment tools and strategies. In this context, it is certainly necessary to analyse to what extent the 'traditional' ways of assessing the productive skills may not always have been built on sufficiently solid foundations, and to identify the weaknesses and problem zones that have stubbornly persisted in them up to this point. Before this is possible, however, a number of key theoretical concepts and considerations that inevitably underpin the complex procedures of testing and assessment need to be highlighted and defined.

[1] H. Douglas Brown, Teaching by Principles, An Interactive Approach to Language Pedagogy (3rd ed.), Pearson Longman (New York: 2007), p.452.

1.1. Tests and assessment: theoretical considerations

1.1.1. 'Test' versus 'assessment'

The often interconnected notions of 'tests' and 'assessment' are unquestionably central components in most, if not all, language courses. However, they are also 'frequently misunderstood terms' [2] which can easily be confused with each other; as a result, it is essential to clarify their respective functions and scopes. According to Brown, a test is a 'method of measuring a person's ability [i.e. competences and/or skills] or knowledge in a given domain, with an emphasis on the concepts of method and measuring.' In that sense, tests constitute 'instruments that are (usually) carefully designed and that have identifiable scoring rubrics.' [3] Importantly, they are normally held at fairly regular intervals, particularly in the case of so-called achievement tests. Students are aware of their importance and implications, and they can usually prepare for them in advance. As a consequence, tests are prepared administrative procedures that occupy identifiable time periods in a curriculum when learners muster all their faculties to offer peak performance, knowing that their responses are being measured and evaluated. [4] In that sense, Brown argues, tests can be seen as 'subsets' of the much wider and extremely multifaceted concept of assessment.
This view is supported by the Council of Europe's Common European Framework of Reference for Languages (CEFR), which states that 'all language tests are a form of assessment, but there are also many forms of assessment … which would not be described as tests.' [5] In fact, this process affects almost all elements and activities in the language classroom. Virtually any spoken or written sample of language produced by a student prompts an implicit (or indeed explicit) judgment on the part of the teacher, who thus spontaneously gauges the demonstrated level of ability even in the absence of a genuine test situation. Ultimately, this implies that 'a good teacher never ceases to assess students, whether those assessments are incidental or intentional.' [6] The type of feedback provided to the student evidently changes accordingly. Apart from the limited number of occasions when a teacher's judgment of a language sample coincides with the correction and marking of a classroom test, the result certainly need not be a numerical score or grade at all (which, in Brown's terms, would constitute a type of formal assessment). To name but a few informal alternatives, an assessment may just as well lead to such varied teacher responses as verbal praise, probing questions to elicit further elaboration, or even a simple nod of the head to confirm a correct or otherwise useful answer. However, even in the comparatively 'high-stakes' domain of summative tests, a number of very diverse approaches to assessment can be adopted.

[2] Ibid., p.444.
[3] Ibid., p.445. Italics are the author's.
[4] Ibid., p.445.
[5] Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment, Cambridge University Press (Cambridge: 2001), p.177. All subsequent references to this text are to this edition. The abbreviation CEFR is used throughout the remainder of the text (except where the alternative abbreviation CEF is used in quotations by other authors).

1.1.2. Different types of assessment

According to the CEFR, summative assessment 'sums up attainment at the end of the course with a grade' (p.186). Its central purpose consists in allowing the teacher to 'evaluate an overall aspect of the learner's knowledge in order to summarize the situation.' [7] In very broad terms, summative assessment thus focuses on a particular product that illustrates the student's achievement of a specific set of learning objectives, even though it seems crucial to point out that the reference to 'the end of the course' in the CEFR definition is misleading. In fact, in the context of multiple fixed-point class tests ("devoirs en classe"), whose exact number is usually predetermined by an official syllabus in the Luxembourg school system, summative assessment actually occurs much more frequently. After all, such tests generally provide closure to particular learning sequences at various successive points (rather than merely at the end) of the school year.

In contrast, formative assessment is 'an ongoing process of gathering information on the extent of learning, on strengths and weaknesses, which the teacher can feed back into their course planning and the actual feedback they give learners.' (CEFR, p.186) It is thus centred on the student's learning process, which it aims to both analyse and support via constructive feedback rather than an isolated summative value.
As Penny Ur puts it, 'its main purpose is to 'form': to enhance, not conclude, a process', which, according to the same author, 'summative evaluation may contribute little or nothing to.' [8]

[6] Brown, op.cit., p.445.
[7] Penny Ur, A Course in Language Teaching: Practice and Theory, Cambridge University Press (Cambridge: 2006), p.244.
[8] Ibid., p.244.

Whether or not the feedback provided by summative assessments may contain formative elements and thus also support the student's future learning process is in fact a contentious issue which will be discussed in more detail in later chapters. However, even with its clear focus on a learner's 'attainments' at precise points in time, summative assessment certainly encompasses a variety of different approaches and types of tests. In this respect, one key distinction contrasts what the CEFR defines as 'the assessment of the achievement of specific objectives – assessment of what has been taught' with, on the other hand, 'proficiency assessment', which focuses on 'what someone can do [or] knows in relation to the application of the subject in the real world'. Achievement assessment thus 'relates to the week's/term's work, the course book, the syllabus' (p.183) whereas proficiency assessment is a broader form of judgment with the potential of covering a much wider spectrum of linguistic skills and competences. As a result, the CEFR states that the 'advantage of an achievement approach is that it is close to the learner's experience', especially in a school-based context. In contrast, one of the major strengths of a 'proficiency approach' resides in the fact that 'it helps everyone to see where they stand' because, in that case, 'results are transparent' (pp.183-184). This central difference has important repercussions on the form, purpose and respective benefits of the tests that are correspondingly administered to language learners.

Achievement (or progress) tests, for instance, are 'related directly to classroom lessons [or] units', thus 'limited to particular material covered … within a particular time frame' and deliberately 'offered after a course has covered the objectives in question' [9]. For understandable reasons, such tests have traditionally occupied a predominant place in language assessment in the Luxembourg school system. They provide a practical means for teachers to split up the material to be tackled over the course of an entire year into smaller, usually topic- or grammar-oriented chunks. As a result, both the size and scope of the corresponding tests can be significantly reduced while allowing teachers 'to determine acquisition of [very specific] course objectives at the end of a period of instruction' [10]. On the other hand, the possible content of achievement tests is for the same reasons also fairly limited. As Harmer points out, such tests

    only work if they contain item types which the students are familiar with. This does not mean that in a reading test, for example, we give them texts they have seen before, but it does mean providing them with similar texts and familiar task types. If students are faced with completely new material, the test will not measure the learning that has been taking place […]. [11]

[9] Brown, op.cit., p.454.
[10] Ibid., p.454.

Due to their explicit focus on previously covered language elements, achievement tests are thus a very useful tool to identify whether specific concepts have been internalised to a sufficient extent.
Brown rightly suggests that they can also, as a corollary, 'serve as indicators of features that a student needs to work on in the future', even if this is not their 'primary role' [12]. However, if the 'aim in a test is to tap global competence in a language' [13], then only proficiency tests can 'give a general picture of a student's knowledge and ability' [14]. As they aim to establish an overview of a student's overall strengths and weaknesses in the target language, these tests are generally 'not intended to be limited to any one course, curriculum, or single skill' [15]. Instead, the learners have to tackle a variety of tasks that usually require them – at different points of the test – to access each of the four basic language skills. As a result, proficiency tests paint a composite profile of a particular language learner whilst highlighting language skills and competences that have already been (or, in contrast, still need to be) developed. As famous examples of standardised proficiency tests, Brown and Harmer mention the TOEFL (Test of English as a Foreign Language) and IELTS (International English Language Testing System) examinations, which often involve especially high stakes for the language learner; the results in such tests may for example decide whether he or she can attend a particular university course. Given the current increasing focus on competence-based assessment in Luxembourg, proficiency assessment is quickly gaining more immediate and constant importance in the local and everyday context of our school system as well. [16]

[11] Jeremy Harmer, The Practice of English Language Teaching, Pearson Longman (Harlow, England: 2006).
[12] Brown, op.cit., p.454.
[13] Ibid., p.453.
[14] Harmer, op.cit., p.321. Emphasis added.
[15] Brown, op.cit., p.453.
[16] In addition to achievement and proficiency tests, there are two other test types in particular which are commonly used in educational contexts and described in test theory:
• diagnostic tests, which aim at pinpointing specific (remaining) learner difficulties in order to adapt subsequent learning objectives accordingly;
• placement tests, which seek to 'place a student into an appropriate level or section of a language curriculum at school' (Brown, op.cit., p.454).
As both of these test types are used in more exceptional circumstances (and pursue more specialised purposes) than the scope of a common summative test allows, they are not treated in further detail in this thesis.

A particular challenge that lies ahead in this respect is highlighted by a further crucial distinction presented in the CEFR: the difference between performance assessment and knowledge assessment. The former 'requires the learner to provide a sample of language in speech or writing in a direct test'; in the latter, students have to 'answer questions which can be of a range of different item types in order to provide evidence of their linguistic knowledge and control' (CEFR, p.187). In other words, a performance assessment asks the learner to directly produce entire stretches of language himself (for example in an oral interview), while a knowledge assessment would require more indirect proof of what the student knows about the language through the adequate selection of more discrete, separate items (for instance in a gap-filling or multiple-choice exercise). Both types of assessment may contribute to determining a student's overall proficiency in the target language.
However, one clear strength of performance assessment certainly consists in the more varied, extensive and direct language samples that it is based on in comparison to the insights gained from knowledge-based test tasks. Nevertheless, if one exclusively considers the results of one isolated test rather than a number of successive efforts over the course of a substantial period of time, it is crucial to bear in mind that a single learner performance can never be more than indicative of his actual language competences.

Whereas the above-mentioned aims and purposes of various assessment types already decisively affect the form and content of the tests that learners are confronted with, important differences also exist in the ways in which the respective results are most commonly interpreted. A first frequently used approach consists in norm-referencing, where the main objective is the 'placement of learners in rank order'; the test-takers are compared with each other, and 'their assessment and ranking in relation to their peers' is of central importance (CEFR, p.184). In such a context, the quality of a learner's performance in a particular test is deliberately viewed against other productions in that class. If strategies of differentiated learning are adopted, norm-referencing may also involve subjecting 'stronger' students to different, more complex test items or task types than 'weaker' pupils. In stark contrast, criterion-referencing is focused on the performance (and possibly the traceable development) of a single learner in reference to a specifically developed set of performance standards. Instead of comparing an individual student's efforts against those of his classmates, 'the learner is assessed purely in terms of his/her ability in the subject' (CEFR, p.184). This approach evidently presupposes that the criteria which the learner's performance is measured against are clearly defined as well as theoretically and empirically proven to be adequate for the learner's level. In terms of purely describing a student's proficiency, one may argue that the criterion-referenced approach permits a more precise and detailed characterisation of learner strengths and weaknesses than a norm-referenced one.

1.2. Factors defining a 'good' test

While the previous section focused on defining and contrasting a number of key concepts in the theory of testing and assessment, it is now time to analyse a few features which all tests must contain to be considered appropriate and theoretically sound measuring tools. Two particular features invariably emerge as central factors deciding whether a test can be considered as an adequate basis for assessment: validity and reliability. In addition, feasibility (or practicality) and authenticity constitute key elements in contemporary language tests. Exploring the most salient characteristics of each element will lead to further insight into the theoretical soundness – or issues – of traditional testing and assessment methods in the Luxembourg school system.

1.2.1. Validity

Validity is often regarded as the most important but also 'by far the most complex' [17] criterion when it comes to the theoretical legitimacy of a particular test.
In very general terms, it can be defined as 'the degree to which the test actually measures what it is intended to measure.' [18] However, this rather broad definition does not necessarily reflect the multifaceted nature and implications of this notion; for that reason, it is normally broken down into a number of more detailed components that allow for a more focused analysis. For the sake of conciseness, however, the following theoretical exploration will remain limited to three particular areas of test validity that are most frequently cited as salient in educational research. [19]

[17] Brown, op.cit., p.448.
[18] Ibid., p.448.
[19] For a more exhaustive list of validity aspects, see for example Louis Cohen, Lawrence Manion & Keith Morrison, Research Methods in Education, Routledge (London / New York: 2007), p.133.
[20] Cohen et al., op.cit., p.137. Emphasis added.

• Content validity, in general terms, requires any measuring instrument to 'show that it fairly and comprehensively covers the domain or items that it purports to cover.' [20] More specifically, both achievement and proficiency tests must thus 'actually sample the subject matter about which conclusions are to be drawn', which in turn implies that the assessment tool 'requires the test-taker to perform the behavio[u]r that is being measured' [21]. For instance, if the teacher's aim consists in assessing the students' writing skills in the target language, the corresponding test task cannot merely be a multiple-choice exercise, since the actually produced behaviour (i.e. ticking a box to confirm comprehension) would then not offer any evidence of actual writing skills. Correspondingly, no meaningful or legitimate inferences could be drawn about the students' true proficiency in writing because the task content would be invalid for that purpose.

• Face validity is closely connected to this first concept. However, it shifts the focus from the test content itself to the test-takers' interpretation of the tasks they are being asked to fulfil. In fact, face validity is granted if learners feel that what they are asked to do in the test is relevant, reasonable and a fair reflection of what they can actually (and justifiably) expect to be assessed on. By looking at the "face" of the test, they get the impression that it truly allows them to show what they are capable of in the target language according to their progress up to that point. Indeed, most learners would feel unfairly treated if they were suddenly subjected to a task that seemingly had nothing to do with the learning objectives, subject matter and task types previously encountered. The psychological link between these first two types of validity is summed up by Brown as follows:

    Face validity is almost always perceived in terms of content: If the test samples the actual content of what the learner has achieved or expects to achieve, then face validity will be perceived. [22]

• Construct validity, in contrast to the aforementioned aspects, is a concept that is even more firmly rooted in theory rather than practice. As Cohen, Manion and Morrison put it, 'a construct is an abstract; this separates it from the previous types of validity which dealt in actualities – defined content.' [23] Construct validity is fundamentally concerned with the conceptual purpose and relevance of tests and their constituent items, as well as the theoretical conclusions that they allow us to draw.
Brown describes it in the following terms:

    One way to look at construct validity is to ask the question "Does this test actually tap into the theoretical construct as it has been defined?" "Proficiency" is a construct. "Communicative competence" is a construct. […] Tests are, in a manner of speaking, operational definitions of such constructs in that they operationalize the entity that is being measured. [24]

[21] Brown, op.cit., p.449. Emphasis added.
[22] Ibid., p.449.
[23] Cohen et al., op.cit., p.137.
[24] Brown, op.cit., p.450.

For instance, if a given test has been billed as capable of measuring the test-taker's "proficiency" in (a given aspect of) the target language, several factors must be respected to grant the construct validity of that test. First of all, it must have been clearly defined – prior to the test – what the construct of "proficiency" stands for, as well as under what circumstances it can be adequately demonstrated by a particular learner performance. The test itself, then, needs to provide an adequate possibility to show that "proficiency" has indeed been reached to a satisfactory degree (and as seen above, this is far from a straightforward procedure in a single test!). As for the construct of "communicative competence", insufficient validity may for example be attributed to tests which are exclusively composed of gap-filling tasks. Indeed, if one defines this particular competence as the learner's ability to express himself fluently enough in the target language to communicate meaning successfully and independently, then such an indirect test task would certainly not constitute a valid means to prove it. In other words, such a test would not 'operationalize' the learner's 'communicative competence' in an adequate way and thus it would represent an invalid assessment tool for the construct it claimed to focus on.

However, while content and construct validity are often closely linked, a significant difference between both aspects is cited by Brown via an example from the TOEFL. Interestingly, Brown states that although this well-known proficiency test 'does not sample oral production', that practical choice 'is justified by positive correlations between oral production and the behavio[u]rs (listening, reading, grammaticality detection, and writing) actually sampled on the TOEFL.' [25] In pure terms of content, the component of oral production is thus completely absent from this particular test; there is no moment when the 'behaviour' of speaking is actually demanded from the student. As a construct, however, proficiency can still be inferred if the test-taker completes all other tasks in a satisfactory way. This strikingly exemplifies how 'it becomes very important for a teacher to be assured of construct validity' in cases 'when there is low, or questionable content validity in a test.' [26] Evidently, this is only possible if such correlations have been clearly and solidly demonstrated by educational research.

[25] Ibid., p.450.
[26] Ibid., p.450; italics added.

Whichever path is chosen to validate a given test, it is vital to point out that completely unassailable, absolute validity is practically unattainable for any assessment instrument. Yet if careful and systematic measures are taken in order to respect the above-mentioned criteria, then test validity can of course be crucially increased.
In fact, any teacher should effectively strive to eliminate any factors that potentially reduce test validity as far as possible.

1.2.2. Reliability

Similarly to validity, reliability is a complex notion that can be positively or negatively affected by a wide range of factors. The CEFR defines this 'technical term' as 'basically the extent to which the same rank order of candidates is replicated in two separate (real or simulated) administrations of the same assessment' (CEFR, p.177). Borrowing Harmer's considerably simpler terms, this essentially means that 'a good test should give consistent results' [27]. Evidently, in the context of a "one-off" classroom test which a teacher needs to design for and administer to a particular set of students at a specific point in time, this aspect of reliability is very difficult to verify in practice. Indeed, how often does a teacher, already pressed for time to ensure a reasonable level of progression, find the time to make his class (or a comparable one with sufficiently similar proportions of 'strong' and 'weak' learners) take the same test more than once? Even then, could one realistically expect to recreate exactly the same conditions as on the first occasion? Clearly, in everyday circumstances, the reliability of a particular test is nearly impossible to prove by empirical means. Yet it certainly seems desirable that any given test should consistently and unambiguously allow the assessor to separate 'good' performances from 'bad' ones. Crucially, there are a number of guidelines and precautions that one can try to respect so as to prevent potentially adverse effects on reliability. The different elements that need to be taken into account are generally associated with the following categories:

[27] Harmer, op.cit., p.322.
[28] Brown, op.cit., p.447.

• First of all, the reliability of the test itself depends on the items that it comprises and the general way in which it is constructed [28]. In this respect, Harmer specifies that reliability can be 'enhanced by making test instructions absolutely clear' (so that their wording does not induce errors of misinterpretation) or 'restricting the scope for variety in answers' [29]. Appropriate size, complexity and context of test tasks, as well as a sensible 'number and type of operations and stages' [30] that such tasks comprise, are further elements which can contribute to increasing test reliability. Additionally, overall length is important due to the fact that 'the test may be so long, in order to ensure coverage, that boredom and loss of concentration impair reliability' [31], even though some researchers argue that in general terms 'other things being equal, longer tests are more reliable than shorter tests' [32]. Interestingly, culture- and gender-related issues can also have an effect on the reliability of test results; as Cohen et al. point out, 'what is comprehensible in one culture is incomprehensible in another.' Furthermore, certain 'questions might favour boys more than girls or vice versa'; for example, according to the same authors,

    [e]ssay questions favour boys if they concern impersonal topics and girls if they concern personal and interpersonal topics. […] Boys perform better than girls on multiple choice questions and girls perform better than boys on essay-type questions […], and girls perform better in written work than boys. [33]
These statistically proven tendencies underline the teacher's need to be mindful of different learner types and learning styles when aiming to set up a test with a high degree of reliability.

• During the actual administration of a test, a multitude of variables related to the physical conditions in which it takes place can hamper its overall reliability as well. As Harmer points out, it is vital that 'test conditions remain constant' [34], yet unfortunately this is not always within the teacher's control. As an example, noise from outside (caused, for instance, by road or repair works) can suddenly intrude into the classroom and make the results of an otherwise immaculate test task unreliable as the students' concentration spans and levels will be affected. Similar 'situational factors' which Cohen et al. identify as potential obstacles are for example 'the time of day, the time of the school year [or] the temperature in the test room' [35]. While the teacher is essentially powerless to counteract most of these elements, he can still strive to ensure stability wherever possible; for instance, tests should, whenever possible, take place 'in familiar settings, preferably in [the students'] own classroom under normal school conditions' [36].

[29] Harmer, op.cit., p.322.
[30] Cohen et al., op.cit., p.161.
[31] Ibid., p.161.
[32] Ibid., p.159.
[33] Ibid., p.161.
[34] Harmer, op.cit., p.322.
[35] Cohen et al., op.cit., p.160.

• However, uncontrollable influences on test performance are not just limited to the afore-mentioned situational factors; other causes of unreliability reside within the huge diversity of the individuals who actually take the test (student-related reliability). Elements which can significantly differ from student to student include 'motivation, concentration, forgetfulness, health, carelessness, guessing [and] their related skills', but also the test-specific effects of 'the perceived importance of the test, the degree of formality of the test situation [and] "examination nerves"' [37]. While extrinsic motivation is usually ensured by the important consequences of summative tests on the students' overall chances of passing their year, intrinsic motivation to do well in a particular test is heavily linked to the ways in which learners accept the usefulness and reasons behind its constituent tasks. As Cohen et al. put it, 'motivation to participate' – and arguably to do well – 'in test-taking sessions is strongest when students have been helped to see its purpose' [38]. As mentioned above, the students' response to situational factors also varies and thus a calm, reassuring atmosphere in the classroom should be aimed for. The learners' confidence level can additionally be raised through simple and unambiguous instructions – if they understand what they have to do, they are more likely to do it well. Nevertheless, even if great care has been given to the clarity of instructions, one needs to bear in mind that the reliability of the corresponding results is still likely to be affected by the questions and items that have ultimately been chosen. According to Cohen et al., the students may vary from one question to another – a student may have performed better with a different set of questions which tested the same matter. [39] While the students' work may thus not always give a fair reflection of their actual knowledge in the test situation, the same applies to their overall skills (which is, of course, particularly important for the reliability of proficiency tests).
Indeed, unreliability may result from the fact that 'a student may be able to perform a specific skill in a test but not be able to select or perform it in the wider context of learning.' Vice-versa, 'some students can perform [a given] task in everyday life but not under test conditions' [40]. This intricate connection between test (item) reliability and student-related reliability underlines the need to exercise great caution in the interpretation of test results. Because of the large number of variables involved, wrongful (over)generalisations about learner skills based on an isolated performance must indeed be carefully avoided, however tempting they may be.

[36] Ibid., p.160.
[37] Ibid., p.159. In this respect, Cohen et al. also mention the so-called 'Hawthorne effect, wherein […] simply informing students that this is an assessment situation will be enough to disturb their performance – for the better or worse (either case not being a fair reflection of their usual abilities).' (p.160; emphases and italics added)
[38] Ibid., p.160.
[39] Ibid., p.160.

• If the different students in a classroom inherently constitute a potential source of unreliability due to the diversity that characterises them as human beings, it is not surprising that a similar situation presents itself at the other end of the assessment system. Indeed, the person (and personality) of the assessor can also affect the reliability of a given test in a lot of different ways; in test theory, this is usually referred to as scorer reliability. Especially insofar as the finally awarded grades or marks are concerned, divergences between the assessments of different scorers are virtually inevitable. 'Inter-rater reliability' is thus threatened because 'different markers giv[e] different marks for the same or similar pieces of work' [41] – unfortunately a reality in most schools which can understandably make some students feel unfairly treated (a simple, purely illustrative way of gauging how closely two markers agree is sketched just after this list). Indeed, the low mark awarded by their own teacher might (at least in some cases) have been a considerably higher mark in another teacher's class. While some may argue that scoring consistency could be increased by having all summative tests assessed by two or more markers (a method which the Luxembourgish 'double correction' system in 13e and 1ère classes at least partially tries to implement), this is obviously impossible to put into practice in all classes and levels in everyday teaching. In the usual situation where a teacher is the only marker of all the tests taken by his own students, becoming aware of – and counteracting – the various potential inconsistencies in one's own marking is, in Brown's view, 'an extremely important issue that every teacher has to contend with' [42]. Familiar examples include 'being harsh in the early stages of the marking and lenient in the later stages' [43], or overlooking similar mistakes in one student's paper but not in another's due to stress, fatigue, carelessness or time pressure. Teachers may also proceed in an overly subjective manner or use 'unclear scoring criteria'; in fact, even if the criteria are reasonably valid in every single way, their 'inconsistent application' may lead to unfair 'bias toward "good" or "bad" students' based on prior performances (also referred to as the "Halo" effect) [44].

[40] Ibid., pp.160-161.
[41] Ibid., p.159.
[42] Brown, op.cit., p.447.
[43] Cohen et al., op.cit., p.159.
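To make the notion of scorer (inter-rater) reliability slightly more concrete, the short sketch below works through a purely hypothetical example: the marks are invented and the code is merely illustrative, not a procedure used or proposed in this thesis. Following the CEFR's definition of reliability as the extent to which the same rank order of candidates is replicated, it estimates how closely two markers reproduce the same ranking of ten scripts by computing a Spearman rank correlation between their marks.

    # Purely illustrative: how similarly do two markers rank the same set of scripts?
    # All names and numbers below are invented for the example.

    def average_ranks(scores):
        """Return ranks (1 = lowest mark), giving tied marks their average rank."""
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        ranks = [0.0] * len(scores)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
                j += 1                      # extend the group of tied marks
            tied_rank = (i + j) / 2 + 1     # average position of the tied group
            for k in range(i, j + 1):
                ranks[order[k]] = tied_rank
            i = j + 1
        return ranks

    def spearman_rho(marks_a, marks_b):
        """Spearman rank correlation: 1.0 means both markers rank the scripts identically."""
        ra, rb = average_ranks(marks_a), average_ranks(marks_b)
        n = len(ra)
        mean_a, mean_b = sum(ra) / n, sum(rb) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
        norm_a = sum((x - mean_a) ** 2 for x in ra) ** 0.5
        norm_b = sum((y - mean_b) ** 2 for y in rb) ** 0.5
        return cov / (norm_a * norm_b)

    # Invented marks (out of 60) awarded by two teachers to the same ten scripts:
    marker_1 = [42, 51, 35, 48, 30, 55, 40, 45, 38, 52]
    marker_2 = [40, 53, 33, 50, 28, 54, 44, 43, 36, 49]
    print(round(spearman_rho(marker_1, marker_2), 2))   # 0.93: very similar rank orders

A value close to 1 means that the two markers place the scripts in almost the same order even if their absolute marks differ; a noticeably lower value would quantify exactly the kind of inter-rater divergence described above.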
Such scorer-related problems further underline the manifold challenges a teacher faces when trying to ensure reliability at all stages of the assessment process; not only external factors can have a negative impact, but one's own subjective tendencies, lack of rigour and discipline as well.

Interestingly, students and teachers also affect each other's performances over the course of an assessment procedure in different ways. While separate reliability issues are already inherent to both groups, their relationship in the classroom leads to another set of problematic factors. For instance, Cohen et al. stipulate that during a test situation 'students respond to such characteristics of the evaluator as the person's sex, age and personality'; they also 'respond to the tester in terms of what he/she expects of them' [45]. In other words, rather than providing evidence of their true range of abilities, students risk engaging in a "guessing game" instead, artificially altering their actual answer style and content in anticipation of perceived teacher expectations. On the other hand, teachers can also (albeit often unintentionally) be guided by the overall impression they have of their students, rather than focusing on their performances in a test alone. Aside from the already mentioned 'Halo' effect, 'marking practices are not always reliable' because 'markers may be being too generous, marking by effort and ability rather than performance.' [46]

Of course, as in the case of validity-affecting issues, it is certainly impossible to completely prevent all different types of reliability-threatening behaviour and factors at once. However, it will be one main aim of this thesis to explore important counteractive measures which teachers might take in this respect on a more regular basis.

[44] Brown, op.cit., p.448.
[45] Cohen et al., op.cit., p.160.
[46] Ibid., p.161.

1.2.3. Feasibility

As seen so far, the principles of validity and reliability have a significant and extremely multifaceted impact on the theoretical soundness of any 'good' test and assessment. Due to the numerous constraints encountered in daily teaching practice, however, teachers also need to make sure that the instrument they choose to implement respects the criterion of feasibility (also referred to as practicality). According to Brown, this means that a 'good test … is within the means of financial limitations, time constraints, ease of administration, and scoring and interpretation' [47]. Whereas financial aspects usually play a subordinate role in classroom testing, the other factors Brown mentions all constitute familiar and predominant concerns for most teachers. Keeping in mind that a normal school lesson in Luxembourg is limited to 50 minutes, timing is indeed a decisive element for practitioners when it comes to choosing the types and contents of tasks to include when setting up a test. As a result, concessions may for example have to be made in regard to 'open-ended' questions (requiring longer, more complex answers) to be incorporated in a 60-mark test, especially if the students have to complete it within a single lesson. Instead, multiple-choice or gap-filling exercises may be preferred to a certain extent because of their time-saving nature rather than their cognitive requirements and content validity. Indeed, less time is necessary for the students to complete such tasks and the time devoted to assessment is in turn reduced as well.
Restricting possible answers to a low, fixed number of discrete items also certainly increases the 'ease of administration'. Evidently, though, weighing up content validity and feasibility factors is often tricky, and sensible compromises have to be found in most cases. Yet doing so is an absolutely crucial matter, as ultimately neither element can be sacrificed excessively without affecting the test in a negative way.

Practicality is also important for the form(s) of assessment the teacher uses after the test has been administered. To cite but one example, the CEFR states that 'feasibility is particularly an issue with performance testing' because in that case 'assessors operate under time pressure. They are only seeing a limited sample of performance and there are definite limits to the type and number of categories they can handle as criteria' (CEFR, p.178). This example already clearly underlines the importance of selecting a "feasible" method of judging a particular performance; the teacher, as assessor, must avoid being overwhelmed by an excessive mass of input – and of marking criteria to assess it.

[47] Brown, op.cit., p.446.

Carefully selecting the most salient and practical factors in relation to the task at hand is vital in order to maximise the efficiency and relevance of the assessment method, particularly given the often considerable numbers of tests that teachers have to mark within a limited time span. I will return to more precise and detailed considerations about these aspects of criterion-referenced assessments in subsequent chapters.

1.2.4. Authenticity

According to Brown, another element which can make a test significantly 'better' is a high degree of authenticity. While he concedes that this is 'a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests' [48], it certainly affects the constituent tasks and items of any test. The issue here is that, particularly in the past, 'unconnected, boring, contrived items were accepted as a necessary by-product of testing'. Yet even nowadays 'many item types fail to simulate real-world tasks' because they are excessively 'contrived or artificial in their attempt to target a grammatical form or lexical item' [49], betraying a lingering influence of pedagogic principles from the grammar-translation approach. However, such a de-contextualised approach must be avoided as much as possible if tests are to mirror real-life contexts, allowing learners to truly demonstrate communicative competence and prove their use of language in "authentic" situations. Brown argues that both the language and context of test items should thus be 'natural' and connected rather than take the form of 'isolated' sentences or chunks which are neither linked to each other, nor to any specific real-world situation. Furthermore, 'topics and situations' should, as far as possible, be 'interesting, enjoyable' and perhaps even 'humorous'; in that way, the learner is more likely to engage with the tasks voluntarily (i.e. intrinsic motivation is raised). Further enhancement of authenticity can then be reached if 'some thematic organization is provided, such as through a story line or episode' and a task-based approach rather than a grammar-centred one is pursued. [50] Of course, feasibility issues might dictate to what extent a teacher is able to implement this in each single classroom test in all of his classes.
In comparison to the vast resources and possibilities of professional test designers and internationally renowned examination bodies, an individual practitioner will indeed find sources of "authentic" and varied material harder to come by. Nevertheless, striving to adapt test items and tasks more closely to real-life situations and contexts rather than opting for "the easy way out" with mere lists of isolated, de-contextualised sentences lies well within the reach of any teacher and can certainly be attempted on a much more regular basis.

[48] Ibid., p.450.
[49] Ibid., p.451.
[50] Ibid., p.451.

1.3. Problem zones in 'traditional' ways of assessing productive skills at pre-intermediate level

In light of the theoretical and practical considerations seen above, two major questions emerge: which contentious issues affect the methods and principles that have dominated summative testing and assessment procedures in Luxembourg for many years? And which alterations does a shift towards competence-based assessment imply?

One of the most striking elements of summative tests used by the majority of local teachers up to this point is their perceptible over-reliance on writing tasks and exercises. Even though the importance of developing all four skills across the various language levels has been highlighted by the English curriculum in both ES and EST systems for years, apparent reluctance exists when it comes to including, for instance, listening or speaking tasks in summative tests on a regular basis or even at all. [51] Of course, this does not imply that the remaining three skills have not been catered for (hence developed) in classroom activities during the term. However, a long-standing discrepancy in favour of writing tasks in regular summative tests cannot be denied. As seen above, feasibility issues might in fact play a large part in this; for example, implementing listening exercises in classroom tests tends to be fairly time-consuming and implies additional preparations and requirements of a technical nature (i.e. making sure that there is a CD or MP3 player available; time elapses while it is being set up…). Even bigger concerns certainly arise in relation to the systematic assessment of each student's speaking skills, considering that most classes consist of twenty or more pupils; as a result, many teachers tend to shy away from thorough, individual oral tests due to their obviously time-consuming nature. Nevertheless, in a competence-based assessment scheme seeking to clearly attest the progress made by individual students in relation to all four skills, an excessive focus on written samples (in all summative tests of a term or year) does not seem sustainable anymore.

[51] This attitude was not least exemplified by the fierce resistance that the proposed reconsideration of the weighting of the four skills encountered in both Commissions Nationales des Programmes in 2009-10. While the importance of writing was preliminarily reduced to 40% of a term's assessment (the 3 other skills receiving a weighting of 20% each), protests from many corners stubbornly persisted and repeatedly resurfaced in the sessions of both commissions, arguing that 40% was an insufficient valuation of writing skills which needed to be re-adjusted to a higher number again.
For listening, speaking and arguably even reading skills, most 'traditional' tests would indeed carry insufficient content, face and construct validity as they would fail to provide enough (if any) data for an adequate judgment of overall proficiency or skills development. Moreover, the current system clearly disadvantages students who may be good and confident speakers of English, yet have problems with the more technical aspects of writing such as orthographic accuracy; the one-dimensionality of 'traditional', exclusively written tests does not allow them to demonstrate their biggest strengths. A similar case can certainly be made for those students who have no major trouble understanding written or spoken input (i.e. who are good at reading and listening), but struggle to show comparable strengths in terms of language production. As a result, summative tests that only focus on writing ultimately disregard the existence of a wide variety of learner types and profiles in the classroom, favouring only one particular group of students instead.

However, the lack of focus on multiple skills is not the only recurring problem; the actual content (and corresponding validity) of a number of commonly used 'writing' tasks can be contentious as well. Undoubtedly as a remnant of the grammar-translation approach, many 'traditional' summative tests, particularly at 'elementary' and 'pre-intermediate' levels, tend to contain a disproportionate amount of grammar-focused exercises. To make matters worse, the latter frequently lack cohesive real-life contexts and thereby ignore the beneficial effects of (semi-)authenticity. A particularly popular example consists in 'gap-filling' exercises to verify the use of tenses or other discrete language items; similarly, vocabulary knowledge might simply be verified by way of simple word-for-word translations. In themselves, such tasks may be perfectly valid for restricted purposes: for instance, if one wants to check the students' usage of a discrete grammatical item, gap-filling exercises may make it possible to assess attainment of a specific (though admittedly narrow and purely grammar-based) learning objective. As a part of an achievement test that focuses on knowledge rather than performance, such tasks thus certainly make sense. Yet a problem arises if the same exercises are used to presumably attest a learner's overall proficiency in writing. If students are not asked to do more than fill in single words or slightly modify sample sentences throughout an entire test, one cannot consider the accordingly limited amount of produced language as an example of writing performance. Perhaps symptomatically, I found out at the beginning of the 2009-10 school year that none of my 10e students (47 learners supposedly entering 'intermediate' or 'B1' level) had ever been subjected to a genuine 'free writing' task in an English test before; throughout the first two years of learning the language, their success in summative tests (and consequently their school year) had thus mostly depended on their adequate completion of discrete-item tasks about grammar and vocabulary.

In most 'traditional' summative tests, the only exception often consists in comprehension questions about set texts treated in class prior to the test, which arguably lead to longer and hence more "productive" answers.
However, even the students’ answers to these questions cannot actually be interpreted as purely tapping into their writing skills, given how strongly such answers depend on the predefined content and lexis of the texts. In fact, one might even argue that such presumed ‘reading comprehension’ questions do not truly verify reading skills either, as the students’ responses rely more on their ability to memorise and recall specific details than on the direct application of reading skills.
The necessity to reconsider traditional methods of testing productive skills at ‘pre-intermediate’ level thus already becomes clear with respect to writing tasks. The case is arguably even more extreme when it comes to the systematic sampling of speaking performance. As previously suggested, speaking skills have often been completely neglected in summative testing due to feasibility issues. However, aside from the high amount of time needed for an in-depth assessment of each individual’s oral productions, some teachers may also have justified this choice by pointing to the students’ relatively limited speaking abilities at this early stage of their language learning. Not infrequently, this has led to the belief that low-level learners are simply incapable of providing sufficiently extensive performance samples on which a valid and reliable judgment of their speaking skills could be based. As a result, speaking tasks have simply been left out of summative tests entirely. It seems clear that in a competence-based learning environment, such attitudes need to change if both productive skills are to be assessed in more equal measure than has been the case so far. Sensible ways of regularly collecting and assessing spoken samples (in addition to written ones) must therefore be found as quickly as possible.
Whereas a number of problem zones thus exist in the ‘traditional’ selection of items and tasks for summative tests, some frequently used assessment procedures present several potential weaknesses as well. Up to this point, one of the major shortcomings in the evaluation of written and spoken productions has been the absence of absolutely clear and unified assessment criteria for both productive skills. As a result, the assessment of free writing often remains very subjective; holistic marking based on the teacher’s overall impression usually prevails, even though basic distinctions between form and content are normally used to break down the assessment into two broad scoring factors. One frequent practice, for instance, consists in basing half of the overall mark for a written sample on content (the presence, complexity and cohesion of key ideas and concepts), while the other half of the mark is determined by the linguistic correctness of the student’s answer (resulting from an overall impression of the student’s grammatical and lexical accuracy, vocabulary range, spelling…). An alternative weighting may place slightly more emphasis on content (e.g. two thirds of the final mark) than on form (one third). On the one hand, the results arrived at by this method may not always differ significantly from those of an approach that splits up and rates a given performance according to more diverse and specifically defined (sub)criteria. It can also be argued that a certain degree of subjectivity is not always negative in assessment, particularly as we should, after all, be able to put a reasonable amount of trust in the professional judgment of a trained and experienced teacher.
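To make the arithmetic behind such broad weighting schemes explicit, the following minimal sketch (in Python) shows how the same two impression-based factors yield different final marks depending on the chosen split. The function name, the sub-scores and the 60-mark total are assumptions made purely for illustration; they do not reproduce any official marking scheme.

```python
# Hypothetical illustration: combining two holistic factors into one final mark.
# Sub-scores (as fractions of 1) and the 60-mark total are assumed for the example only.

def final_mark(content, form, content_weight, total=60):
    """Weight two impression-based sub-scores and scale them to the test total."""
    weighted = content_weight * content + (1 - content_weight) * form
    return round(weighted * total, 1)

content_score = 0.70   # assessor's overall impression of ideas and cohesion
form_score = 0.50      # overall impression of accuracy, range, spelling...

print(final_mark(content_score, form_score, content_weight=1/2))  # 50/50 split  -> 36.0
print(final_mark(content_score, form_score, content_weight=2/3))  # 2/3 vs 1/3   -> 38.0
```

Whichever split is chosen, the decisive input remains the assessor’s global impression for each of the two broad factors.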
Nevertheless, with holistically attributed marks, scorer reliability issues will inevitably remain; one simply cannot get away from the fact that ‘hard’ and ‘soft’ markers may score the same sample of written performance in completely dissimilar ways. This can evidently have disadvantageous consequences for pupils: their final mark does not solely depend on the quality of their performance, but also on the person who teaches them in a given year. Additionally, as seen above, even the same marker may reach inconsistent results across various students’ performances in a particular class test, especially when assessing samples of writing without solid, unchanging guidelines to fall back on. In this context, a norm-referenced approach might help to establish a “qualitative hierarchy” between better and less convincing efforts. However, in some circumstances this might lead to a perfectly satisfactory answer being marked down only because it is perceived as inferior to an exceptionally good performance by another student who actually exceeds expectations for a given level. Rather than rewarding the positive elements of the first answer, such scoring behaviour would ultimately penalise remaining weaknesses that might be perfectly normal for a student at ‘pre-intermediate’ level. This type of reasoning highlights another typical flaw of ‘traditional’ assessment strategies: especially in free writing tasks, maximum marks are normally not awarded at all, or only in very exceptional cases of excellent and virtually mistake-free work. Yet is it fair to make the highest possible score available only to students who have actually already surpassed the level of proficiency that can realistically be expected? Does it not point to a rather negative assessment culture that is more focused on punishing mistakes than on reinforcing positive steps and perceived progress?
A final problem of ‘traditional’ assessment procedures fittingly affects the concluding step of each summative assessment: the feedback that teachers give to students (and/or their parents) based on their performance in a particular test. An assessor will inevitably find it more difficult to justify a given mark if they cannot underpin it by referring to precise and transparent assessment criteria. Yet the unavailability of such clearly defined criteria and universally expected standards of performance has been a constant problem up to this point; providing such a solid “backbone” to assessments could certainly reduce problems with scorer (especially inter-rater) reliability. At the same time, the assessment process could become more transparent for all parties involved. In that respect, criterion-referenced assessment can contribute to creating a “level playing field” for all students, which an exclusively norm-referenced approach would make impossible. As a corollary, however, a competence-based system also implies that summative classroom tests may not always stay confined to achievement assessment; particularly in relation to productive skills, establishing the proficiency levels reached by individual students is bound to take on increased significance in the proposed new assessment scheme. In the next chapters of this thesis, I will explore possible ways of designing and implementing such new competence-based forms of assessment.
However, as an in-depth analysis of all four skills would exceed the scope of this study, I will concentrate exclusively on the two productive skills of speaking and writing; it is in this domain that students produce the most immediate and extensive evidence of the progress they have made in their language learning. Given the aforementioned doubtful validity of writing tasks and the complete omission of speaking elements in ‘traditional’ summative testing, this thesis seeks to pave the way towards a different approach that allows for a more informed and accurate picture of the student’s various achievements in terms of target language production.
Furthermore, the exclusive focus on the ‘pre-intermediate’ level of proficiency is neither random nor simply due to reasons of conciseness. In fact, the study intends to question and contest the seemingly widespread assumption that productive skills (particularly “free” writing and speaking) cannot be extensively tested at this early stage of language learning (an assumption which commonly results in an over-insistence on drilling grammatical basics instead). Moreover, the thesis is being written at a time when the shift towards competence-based teaching is still at an early stage in the Luxembourg secondary school system. As it is essential to build on solid foundations, it is both logical and necessary to design and implement corresponding methods of assessment at the lowest levels of the proficiency scale first. This thesis aims to contribute actively to that development by finding and implementing such competence-based ways of testing and assessment in the classroom, as well as by exploring to what extent this may enhance the validity, reliability and feasibility of these procedures.

Chapter 2: Towards a different, competence-based assessment method of speaking and writing.

As a result of the present government’s large-scale move towards competence-based forms of assessment, groundbreaking changes to the ‘traditional’ ways of designing, implementing and marking classroom tests are inevitably in store for the teachers (and consequently for the requirements placed on learners) in the Luxembourg school system. The explicit focus on the students’ multiple and varied language skills implies that writing tasks can no longer be the only component of regular summative tests. To give a more complete account of their capacities, learners must also demonstrate that they can communicate orally as well as understand spoken and written input in the target language to a satisfactory degree. Even with respect to writing skills, test strategies and items need to be adapted so as to veritably tap into competences instead of merely demanding a mechanical, de-contextualised application of discrete grammatical structures and lexical items (for instance). Yet how can one reliably collect and interpret samples of performance that trace tangible achievements in terms of skills rather than knowledge? What precisely does this complex yet often only vaguely understood notion of ‘competence-based assessment’ actually encompass? And how can one truly certify that a student has developed a particular language ‘competence’ to a satisfactory degree? Tackling these crucial questions in a theoretically founded and convincing way is paramount if the ongoing drastic overhaul of the approach to assessment is to stand any chance of being legitimate and, ultimately, successful.
It comes as no surprise, then, that rather than engaging in the monumental and extremely risky task of devising an independent system “from scratch”, Luxembourg – like many other European countries – is currently seeking to align its education system with one of the most heralded breakthroughs in recent linguistic research: the CEFR. In a first step, it is thus necessary to analyse what exactly this Common European Framework represents and what benefits and dangers it contains as a foundation for a competence-based assessment scheme.

2.1. The Common European Framework as foundation for competence-based assessment.

2.1.1. Why choose the CEFR as a basis for teaching and assessment?

Ever since the first public release of the CEFR in 2001, several of its core elements have been adopted (and in many cases adapted) by various international coursebook writers, curriculum designers and examination bodies. This is most notably the case for the six main proficiency levels (A1-C2) that it defines, as well as its multitude of detailed and skills-specific descriptor scales. As a consequence of this rapid and widespread influence and success, numerous teachers (and indeed most people connected to language education in general) would undoubtedly agree with Julia Keddle’s claim that the CEFR is fast becoming ‘an essential tool for the 21st century’ 1 in the field of language learning and assessment. In short, as Alderson puts it, ‘nobody engaged in language education in Europe can ignore the existence of the CEFR.’ 2
The reasons for this are manifold. For instance, one may argue that a first major strength of the CEFR inherently consists in the highly ambitious goal it has set out to achieve:
In documenting the many competences which language users deploy in communication, and in defining different levels of performance in these competences, the authors of the Framework have made explicit the true complexity of the task that confronts learners – and teachers – of a language. 3
In doing so, the CEFR authors approach and define the notion of competence from various angles. While ‘competences’ are first of all broadly described as ‘the sum of knowledge, skills and characteristics that allow a person to perform actions’ (CEFR, p.9), an important distinction is then made between ‘general’ and ‘communicative language competences’. ‘General competences’, on the one hand, are ‘not specific to language, but…called upon for actions of all kinds, including language activities’; they are built on the four main pillars of ‘knowledge (savoir),…skills and know-how (savoir-faire),…existential competence (savoir-être)’ as well as the ‘ability to learn (savoir-apprendre)’ (pp.9-11). Of even more direct relevance to language courses, the concept of ‘communicative language competence’ is then defined in terms of ‘linguistic, sociolinguistic and pragmatic competences’, all of which need to be ‘activated in the performance of…various language activities’ (pp.13-14).
1 Julia Starr Keddle, ‘The CEF and the secondary school syllabus’ in Keith Morrow (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), p.43.
2 J. Charles Alderson, ‘The CEFR and the Need for More Research’ in The Modern Language Journal, 91, iv (2007), p.660.
3 Keith Morrow, ‘Background to the CEF’ in Morrow, op.cit., p.6.
In other words, the CEFR’s strongly action-oriented focus becomes clearly visible from the outset; the various levels of language proficiency are not just considered in terms of knowledge about the language, but via a description of how linguistic resources are actively and effectively put to use in different socio-cultural contexts and situations 4. In general, making the different processes and competences involved in language learning (and communication as a whole) more transparent is certainly a crucial requirement for any pertinent school-based language course as well. After all, as Morrow further states,
learners, teachers, and ministers of education want to know that decisions about language teaching are based on a full account of the competences that need to be developed. For the first time, [due to the CEFR,] such an account is now available. 5
In the specific context of Luxembourg, the pre-existence of such a detailed inquiry into the multiple facets of language learning is an undeniable asset. Since competence-based teaching constitutes a distinctly new and thus largely unexplored path in our education system, the inevitable task of defining a relevant, valid and workable framework seems a daunting one indeed. Encouragingly, the CEFR not only represents a possible, detailed alternative for this purpose; through its inherent status as a European project, it also intrinsically offers an opportunity to make the certification of our students’ achievements more easily comparable to international standards, and thus more widely recognisable (and more willingly recognised) beyond our local borders. The exclusively marks-based system thus far used in Luxembourg has recurrently presented more problems in that respect, not least due to its singular reliance on a 60-mark scheme that is not generally found (and therefore not always adequately interpreted) in other European countries.
In contrast, the six Common Reference Levels defined by the CEFR (reaching from ‘A1’ or ‘Breakthrough’ stage to the ‘C2’ or ‘Mastery’ level) present a number of very useful features. For example, as Heyworth points out, they are already ‘being used as the reference for setting objectives, for assessment and for certification in many contexts around Europe’ 6, facilitating comparisons of different national systems and their respective outcomes in the process. This is partly due to the fact that the traditionally used
words ‘intermediate’ and ‘advanced’ [for instance] are vague terms which tend to be interpreted very differently in different contexts. 7
The Common Reference Levels, on the other hand, have been painstakingly defined and extensively underpinned by nuanced and level-specific proficiency descriptors. As a result, the ‘scale from A1 to C2 makes level statements more transparent’; hence, it is ‘increasingly being used to describe and compare levels both within and across languages’ 8.
4 In this paragraph, emphasis has only been added to the words ‘action’ and ‘performance’. Otherwise, italics and emphases are the CEFR authors’.
5 Ibid., p.7.
6 Frank Heyworth, ‘Why the CEF is important’ in Morrow, op.cit., p.17.

2.1.2. The need for caution: political and economic concerns

To a certain extent, however, caution should be maintained in this context. After all, it would be negligent for any sensible analysis of the CEFR to overlook the underlying political dimension that partially explains its current, speedy propagation across the European continent.
As critics like Glenn Fulcher astutely point out, a significant reason why ‘language education in Europe is lurching toward a harmonised standard-based model’ resides in ‘the interests of so-called competitiveness in global markets’ 9. In vying to serve the ‘primary economic needs of a [European] superstate, competing for its place in the global economy’ 10, the harmonisation of education systems in individual member states clearly provides a crucial stepping-stone. A common denominator like the CEFR can most certainly play an integral role in this unifying effort, even if its authors did not produce it with that particular design in mind 11. Ultimately, it is important to realise that ‘aligning our teaching and assessment practices to the CEFR is not an ideologically or politically neutral act’ 12. The improvement of intercultural understanding and of the quality of local language education systems may very well be real and reasonable grounds for seeking such an alignment, and will hopefully emerge as some of its most positive consequences in the future. Nevertheless, it would evidently be naïve to suppose that the CEFR would have enjoyed the same instant success if, in addition to its cultural and educational benefits, it did not promise to be so useful in the wider context of European politics and economics.
7 Ibid., p.17.
8 Ibid., p.17.
9 Glenn Fulcher, ‘Testing times ahead?’ in Liaison Magazine, Issue 1 (July 2008), p.20. Accessible at http://www.llas.ac.uk/news/newsletter.html.
10 Ibid., p.23.
11 As Morrow (art.cit., p.3) crucially points out, the ‘Council of Europe has no connection with the European Union’ and ‘while its work is ‘political’ in the broadest sense – aiming to safeguard human rights, democracy, and the rule of law in member countries – an important part of what it does is essentially cultural in nature rather than narrowly political or economic.’
12 Fulcher (2008), p.23.
While Fulcher does recognise that the ‘adoption [of the Framework even] beyond Europe testifies to its usefulness in centralised language policy’, he also warns that the ensuing ‘pressure to adopt the CEFR’ can easily lead to undesirable consequences. Essentially, the CEFR risks becoming a ‘tool of proscription, […] made effective through institutional recognition of only those programmes that are “CEFR-aligned”’ 13. In theory, of course, requiring compliance with a universally accepted standard need not be an inherently negative prerequisite; as seen above, such harmonisation certainly has the potential of markedly increasing the comparability of results. In practice, on the other hand, it is precisely this intended development which can cause a number of problems, particularly if its implementation is rushed and forced:
For the users of language tests, the danger is that any test that does not report scores in terms of CEF levels will be seen as “invalid” and hence not “recognised”. […] For many producers of tests, the danger lies in the desire to claim a link between scores on their tests and what those scores mean in terms of CEF levels, simply to get “recognition” within Europe. 14
The main concern is that the establishment of a valid ‘link’ between particular tests and the CEFR levels is far from a straightforward procedure. While a preliminary Manual 15 has been published with guidelines for this exact purpose, the validation process remains a complex and difficult one.
As Alderson points out in this context, it does not help that ‘the Council of Europe has refused to set up an equivalent mechanism to validate or even inspect the claims made by examination providers or textbook developers’ (a decision which the same author emphatically slams as ‘a serious dereliction of professional duty’) 16. Unsurprisingly, if they remain largely left to themselves, test designers (and, in the same vein, coursebook writers and curriculum developers for entire national education systems) can easily become tempted to skip or abbreviate some of the meticulous steps necessary for validation. This is particularly the case if they find themselves under time pressure to produce “CEFR-compatible” material, be it for economic or political purposes. At its worst, Fulcher states, ‘the “linking” is’ in those cases ‘mostly intuitive’ 17. Contrary to the intended outcome, different tests superficially ‘claiming’ a link to the same CEFR level may thus actually provide a much more unreliable basis for comparison than their apparent commonalities may initially suggest.
13 Ibid., p.21.
14 Glenn Fulcher, ‘Are Europe’s tests being built on an ‘unsafe’ framework?’, The Guardian Weekly (18 March 2004), accessible at http://www.guardian.co.uk/education/2004/mar/18/tefl2. Emphasis and italics added.
15 Council of Europe, Manual for relating Language Examinations to the Common European Framework of Reference for Languages, January 2009. Accessible at http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp#TopOfPage.
16 Alderson, art.cit., pp.661-662.
17 Fulcher (2004)
Hence, if European states are vying to integrate the CEFR into the teaching and assessment strategies of their national education systems in order to increase the comparability of their results across borders, they need to be highly alert to the aforementioned dangers of rushing this adaptation. Clearly, the speed with which changes are executed should always remain subordinate to the importance of careful planning and thorough inquiry into their theoretical soundness. Considering the fundamental and all-encompassing way in which our national school system is currently being reshaped, it is clear that in Luxembourg, too, a truly valid alignment to the CEFR can only be arrived at if rash, “intuitive” decisions and adaptations are avoided in this crucial period of time.

2.1.3. The need for caution: pedagogic concerns

Whereas such considerations are important to keep in mind, questioning the political and economic purposes and ramifications of the CEFR’s success in further depth would evidently go beyond the scope of the present study. After all, this thesis is primarily interested in exploring the pedagogic value that the competency model and illustrative scales of the Framework offer for practitioners in the classroom. However, even in that respect, Fulcher voices some clear doubts regarding the theoretical foundation and validity of the ways in which the CEFR descriptors were established. While North points to the fact that the illustrative scales of the CEFR ‘form an item bank of empirically calibrated descriptors’ based on ‘extensive qualitative research’ 18, Fulcher counters by insisting on a number of problematic elements in the chosen approach:
There is a widespread, but mistaken, assumption that the scales have been constructed on a principled analysis of language use, or a theory of second language acquisition.
However, the descriptors were drawn from existing scales in many testing systems around the world, and placed within the CEFR scales because teacher judgments of their level could be scaled using multi-faceted Rasch. […] The selection of descriptors and scale assembly were psychometrically driven, or … based entirely on a theory of measurement. The scaling studies used intuitive teacher judgments as data rather than samples of performance. What we see in the CEFR scales is therefore ‘essentially a-theoretical’. 19 18 Brian North, ‘The CEFR Illustrative Descriptor Scales’ in The Modern Language Journal, 91, iv (2007), p.657. 19 Fulcher (2008), pp.21-22; emphasis added. Also see Fulcher (2004) for a more detailed explanation of the different steps that were used to compile the CEFR descriptors and scales. Chapter 2 35 These are very serious issues, considering that numerous language curricula and examinations all over Europe (including Luxembourg) are already setting (or aiming to set) the attainment of particular CEFR proficiency levels as targeted – and ultimately certified – learning outcomes. Evidently, the validity of the latter is very much at stake if the CEFR scales they are built on ‘have no basis in theory or SLA research’ 20 . Problems with construct validity inevitably arise if CEFR descriptors, which supposedly describe ‘learner proficiency’ in the target language, must in fact be more accurately defined as being merely ‘teachers’ perceptions of language proficiency’. North is aware of this possible caveat, but he crucially stresses that ‘at the time the CEFR was developed, SLA research was not in a position to provide’ the ‘validated descriptions of SLA processes’ implied in the illustrative descriptors 21 . This is echoed by Hulstijn, who underlines that ‘the CEFR authors, in the absence of fully developed and properly tested theories of language proficiency, had to go ahead by scaling the available descriptors’ 22 . One may even argue that since the CEFR descriptors were selected and compiled from various existing scales and testing systems, there is a strong chance that the latter were, at least in part, based on reasonably safe and systematic theoretical principles in the first place. In that sense, their blending and reorganisation into the CEFR scales might imply that a sound theoretical basis is still indirectly present even if no one scale or system may have been rigorously maintained. Nevertheless, it certainly seems sensible to heed Fulcher’s warnings against a premature ‘reification’ (which he defines as the ‘propensity to convert an abstract concept into a hard entity’) of the CEFR as a genuine tool for attesting proficiency if it is ‘not theoretically justified’ 23 at its core. 2.1.4. Reasons for optimism: chances offered by the CEFR Such considerations should rightly make us wary of adopting the Council of Europe’s Framework in an overly uncritical and unselective fashion. However, this does not mean that we ought to shy away from it completely; far from it. The key resides in looking at the CEFR as a flexible, malleable and useful foundation instead of a complete, dogmatic and ready-made solution. As North puts it, ‘the CEFR is a concertina-like 20 Fulcher (2008), p.22. North, art.cit., p.657. 22 Jan H. Hulstijn, ‘The Shaky Ground Beneath the CEFR: Quantitative and Qualitative Dimensions of Language Proficiency’ in The Modern Language Journal, 91, iv (2007), p.664. Italics added. 23 Fulcher (2008), p.22. 
21 36 Chapter 2 reference tool, not an instrument to be “applied”’ 24 . It is also in this sense that Fulcher ultimately sees a truly suitable purpose for it: It is precisely when the CEFR is merely seen as a heuristic model which may be used at the discretion of the practitioner that it may become a useful tool in the construction of tests or curricula. […] The purpose of a model is to act as a source of ideas for constructs that are useful in our own context. 25 Importantly, this same view is in fact pre-emptively voiced by the authors of the CEFR no later than in the introductory ‘notes for the user’ of their document. From the outset, they insist that their findings are by no means intended to be prescriptive; they expressly stress that they ‘have NOT set out to tell practitioners what to do, or how to do it’ (p. xi). Adaptation of the CEFR’s content to specific local or pedagogic contexts is consistently encouraged by the writers, which further underlines that the document ‘is not dogmatic about objectives, about syllabus design, or about classroom methodology’ 26 . At numerous points throughout the CEFR, the user is directly challenged to actively pick his own preferred strategies or approaches from a number of alternatives that are meticulously and factually described rather than dogmatically imposed 27 . Incidentally, as Heyworth notes, [a]n interesting implication of the presentation of options in this way is that it pre-supposes teachers who are responsible, autonomous individuals capable of making informed choices, and acting upon them. 28 On the surface, then, the authors have gone to great lengths to pursue and demonstrate a descriptive approach in their Framework, overtly promoting autonomous reflection and decision-making on the part of the user. Nonetheless, there are of course some foci that recurrently emerge as particularly prominent in the CEFR’s descriptive scheme. For instance, it is clearly perceptible that the emphasis throughout the CEF is on how languages are used and what learners/users can do with the language – on language being action-based, not knowledge-based. 29 24 North, art.cit., p.656. Fulcher (2008), p.22. 26 Morrow, art.cit., p.8. 27 The sections in question are clearly recognisable boxes in which ‘users of the Framework’ are explicitly encouraged ‘to consider and where appropriate state’ what their preferred approach to key issues is; biased or prescriptive suggestions are markedly omitted. An interesting example is provided in the ‘Errors and mistakes’ section, where different (traditional as well as innovative) attitudes and strategies to error correction are offered to the teacher, who then needs to decide what the best and most efficient alternative may be in order to achieve the most productive and lasting effect on his students’ language learning. (CEFR, pp.155-156) 28 Heyworth, art.cit., p.20. Emphasis added. 29 Heyworth, art.cit., p.14. Emphasis added. 25 Chapter 2 37 Such an approach, as Alderson attests, is ‘in large part … based on ideas from the AngloAmerican world of communicative language teaching and applied linguistics’ 30 , and it definitely carries some intrinsic benefits. For instance, Keddle points out that the ‘focus on learner language’ goes hand in hand with a welcome ‘move away from mechanical grammar work’. 
In fact, the ‘renewed focus [of the CEFR] on situational and functional language, and on the strategies students need in the four skills’ not only implies increased reliance on ‘communicative language practice’ 31 as well as on the nurturing of language (and learning) skills in the classroom. The insistence on strategy development also provides a clear indication that the CEFR regards the fostering of learner autonomy as a highly desirable process. This consistent tendency is most obvious in the CEFR’s recommendation of selfassessment through learner-centred ‘can do’ descriptors and the systematic use of the European Language Portfolio (ELP). Unfortunately, specific applications of this latter, particular tool (which, in its root form, predominantly focuses on an independent and arguably adult language learner rather than a school-based teenage one) cannot be discussed in further detail in this thesis as this would significantly transcend the framework of classroom-based summative testing 32 . Nevertheless, the fundamental structure and purpose of the ELP already exemplify two key characteristics that underline the CEFR’s potential to trigger a veritable paradigm shift in language teaching and testing. On the one hand, by recording his own performances and results in different parts of the portfolio, the learner himself is clearly in the centre of both the learning and assessment processes. Through the numerous and skills-specific ‘I can…’ descriptors in the ELP, the student acquaints himself in much more depth with the various competences needed to become a proficient foreign language user than he would be able to do in a more traditional, teacher-centred classroom environment. At the same time, the learner’s heightened metacognitive awareness contributes to making him trace and realize the progress he has gradually made (and continues to make) as a language learner in a much more conscious and engaged fashion. While the aforementioned promising effects arguably find their greatest development in the context of portfolio-based work (hence the recommended use of the ELP), they can also be attributed to the CEFR’s descriptive scales and their underlying 30 Alderson, art.cit., p.660. Keddle, art.cit., p.43. 32 For a particularly interesting use of how the ELP has been put to use, see for example Angela Hasselgreen et al., Bergen ‘Can Do’ project, Council of Europe (Strasbourg: 2003) 31 38 Chapter 2 concept in general. As suggested above, these scales allow us to break down the competences involved in language learning in a very nuanced way, making it easier to distinguish between particular areas where mastery has been achieved to a specific degree, and those which still need to be developed further. Simultaneously, progress is rigorously traced in exclusively positive terms. As rendered perfectly evident by their deliberate phrasing, the designed purpose of the CEFR descriptors consists in identifying what the learner can rather than cannot do. An interesting and symptomatic example is provided by the general descriptor for the ‘grammatical accuracy’ of a typical A2 learner: Uses some simple structures coherently, but still systematically makes mistakes – for example tends to mix up tenses and forget to mark agreement; nevertheless, it is usually clear what he/she is trying to say. 
(CEFR, p.114) In comparison to more traditional assessment methods which focus heavily on grammar correction even in free writing, one notices that the descriptor does not deny the presence of remaining imperfections in the learner’s performance. Yet instead of dismissing the entire performance as insufficiently proficient (due, for example, to the wrong use of tenses or a missing 3rd-person ‘–s’ in the present simple), it also recognises the overall, successful communicative effect of the learner’s performance (i.e. it is ‘clear what he/she is trying to say’ in spite of the still limited linguistic means at his/her disposal). It is precisely this approach which can pave the way towards a much more positive assessment culture overall: instead of penalising the students’ weaknesses, the CEFR, as a framework for teaching and assessment, allows us to single out the elements that the learners already do well. Keddle rightly points out that this points to significant changes for teachers, who all too often ‘still tend to measure student performance against a native-speaker, error-free absolute, even at beginner levels’, even if they are actually ‘aware of theories of language acquisition’ that prohibit precisely such expectations of “instant perfection”. In turn, this promises a fairer chance for students, ‘accustomed as they are to being marked down for ‘mistakes’’. Indeed, particularly at lower levels such as A2, the CEFR descriptors incite a radical change of approach in that they actually ‘allow an ‘imperfect’ performance to be appropriate for someone of that level, rather than being perceived as failure’ 33 . At the root of this more positive attitude to remaining shortcomings is an awareness of the central concept of the learner’s interlanguage. Brown defines this notion as ‘a 33 Keddle, art.cit., pp.45-46. One may add, of course, that even a ‘native-speaker’ performance might not always be ‘error-free’. Chapter 2 39 system that has a structurally intermediate status between the native and target languages’. As learners progressively construct an understanding of the foreign language, they develop ‘their own self-contained linguistic systems’ 34 consisting of approximations and, initially, flawed interpretations of the target language systems they are trying to acquire. In other words, one needs to be aware that second language learners (at lower levels in particular) do not simply amalgamate all the chunks of language that they have been exposed to and then correctly connect them right away (and continue to do so in a consistent, permanent manner afterwards). Instead, their progressively developing interlanguage represents ‘a system based upon the best attempt of learners to bring order and structure into the stimuli surrounding them’ 35 . Remaining errors are perfectly normal at this stage, and should thus be expected when learner performances in the target language are assessed. Admittedly, the CEFR still lacks a consistent, SLA-theory-based explanation about the exact steps that are necessary for the progression from a low level of proficiency to a higher one (e.g. A2 to B1), as well as a systematic inquiry into how and when such a progression is achieved by the learner – or facilitated by the teacher – in pedagogical terms. In this respect, Weir questions the CEFR’s ‘theory-based validity’, since it ‘does not currently offer a view of how language develops across these proficiency levels in terms of cognitive or metacognitive processing’ 36 . 
However, taken in isolation, the detailed descriptions for each of the six levels provide a remarkably thorough insight into the characteristics of the various stages of proficiency that a language learner can achieve (even if, of course, not every learner can or will always reach C2!). Within any particular level, the CEFR descriptors certainly have the merit of drawing attention to the types of shortcomings that still have to be expected. Yet imperfection is not represented as an automatically negative flaw to be extinguished immediately and unrelentingly; in the classroom, such an excessively accuracy-focused stance would understandably deter many students from engaging in a more spontaneous and adventurous experimentation with the target language as a result. Instead, occasional flaws are included as an intrinsically normal and natural feature of the complex process of language learning. 34 H. Douglas Brown, Principles of Language Learning and Teaching (5th ed.), Longman/Pearson (New York: 2007), p.256. 35 Ibid., p.256. 36 Cyril J. Weir, ‘Limitations of the Common European Framework for developing comparable examinations and tests’ in Language Testing, 22 (2005), p.282/p.285. Accessible at http://ltj.sagepub.com/cgi/content/abstract/22/3/281 40 Chapter 2 In this context, a further strength of the Framework is its explicit distinction between errors and mistakes, errors being examples of the learner’s interlanguage, which demonstrate her/his present level of competence, whereas mistakes occur when learners, like native speakers sometimes, do not bring their knowledge and competence into their performance – i.e. they know the correct version, but produce something which is wrong. 37 This crucial difference forces the assessor to be cautious and nuanced in his interpretation of incorrect elements in student productions. When it comes to determining the level of competence that a particular learner has attained, error analysis can, and should, play a vital role. Rather than simply counting the number of incorrect verb endings or wrongly used tenses to reach an overall verdict on a specific performance (which, in summative tests, is often synonymous with a final mark), careful consideration of the frequency and nature of particular “mishaps” is needed to paint a more composite and accurate picture of the student’s achievements. Keeping in mind the above-mentioned descriptor for ‘grammatical accuracy’, for instance, the assessor should verify whether a wrong verb ending (such as the notoriously forgotten ‘-s’ for the third person singular in the present simple) occurs repeatedly and systematically in a given student’s production (in which case it represents an ‘error’ in his/her interlanguage). On the other hand, if it is only an uncharacteristic ‘slip’ (i.e. a mistake which may appear merely once or twice) that stands out from other, consistently correct uses of the same grammatical item, then one can assume that the student has simply forgotten to ‘bring [his/her] knowledge and [his/her] competence into [his/her] performance’ on this one exceptional occasion. Especially in challenging productive tasks involving free speaking or writing, numerous and varied linguistic elements (e.g. grammar, spelling, syntax…) need to be focused on simultaneously. As a consequence, such ‘slips’ are most likely to happen to students in those particularly demanding task types. 
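To illustrate the frequency-based reasoning described above, the following minimal sketch (in Python) shows one possible way of operationalising the distinction between a systematic error and an occasional slip for a single grammatical feature such as the third-person ‘-s’. The function name, the tallies and the threshold are assumptions made purely for this illustration; they are not drawn from the CEFR or from any official marking scheme.

```python
# Hypothetical illustration of the error/mistake ('slip') distinction:
# a feature that is wrong in most of its occurrences points to a systematic
# error in the learner's interlanguage, whereas one or two lapses among many
# correct uses are better read as slips of performance.

def classify_feature(correct_uses, incorrect_uses, error_ratio=0.5):
    """Label a grammatical feature according to how often it goes wrong."""
    total = correct_uses + incorrect_uses
    if total == 0:
        return "no evidence"            # the feature was never attempted
    if incorrect_uses / total >= error_ratio:
        return "systematic error"       # part of the current interlanguage
    if incorrect_uses > 0:
        return "occasional slip"        # knowledge present, performance lapse
    return "consistently correct"

# e.g. third-person '-s' in a short written sample (invented tallies):
print(classify_feature(correct_uses=1, incorrect_uses=5))   # systematic error
print(classify_feature(correct_uses=7, incorrect_uses=1))   # occasional slip
```

In practice, such a tally is only a rough diagnostic aid; the qualitative judgment about what the learner was actually attempting to express remains the teacher’s.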
Distinguishing between errors and mistakes thus allows the teacher to get a clearer view of the level of competence reached by the learner in a particular domain at a certain point in time. It also represents a powerful diagnostic tool through which the assessor can provide more precise feedback (or, more accurately, ‘feed-forward’) concentrating on specific areas which the student still needs to work on in the future. At the same time, the CEFR descriptors remind us not to get too caught up in our “search” for errors, either: the overarching question of whether the overall communicative act has been successful or not 37 Heyworth, art.cit., p.19. Emphasis added. Chapter 2 41 should always remain in the back of the assessing teacher’s mind. After all, as Brown reiterates, The teacher’s task is to value learners, prize their attempts to communicate, and then provide optimal feedback for the [interlanguage] system to evolve in successive stages until learners are communicating meaningfully and unambiguously in the second language. 38 If the focus of assessment expressly shifts towards the overall communicative success of a learner’s performance, then it is of course essential that the tasks which the students have to carry out in summative tests offer them adequate opportunities to produce relevant and contextualised samples in the target language. In this respect, another strength of the CEFR becomes strikingly evident, as its ‘renewed focus on situational and functional language’ could significantly contribute to ‘bring[ing] the real world back into the classroom’ 39 . In other words, aspirations to make learning situations, contexts and test tasks as authentic as possible could receive a significant boost through the importance conferred to communicative language use in the CEFR. Even if, at its origin, the scope that the Framework covers is predominantly ‘out in the real world and distant from the unnatural habitat of the classroom’, Keddle argues that an appropriate application of the CEFR in a school-based context could go a long way towards increasing the degree of authenticity of language learning tasks: The CEF, with its references to text types (e.g. news summaries, messages, advertisements, instructions, questionnaires, and signs), provides teachers with a checklist they can use to incorporate genuine communicative skills/strategies work into their teaching. Authentic text-types can be adapted to suit the interests and age levels of students, and clear objectives can be set to fit their language level… 40 Indeed, the descriptive scheme of the CEFR includes a myriad of lists and examples of possible text types and linguistic tasks, demonstrating in meticulous detail where and when each one of the core language skills is called into action within the four main domains or ‘spheres of action’ of everyday life. Furthermore, as Heyworth stresses, [t]he concepts of domain – personal, public, occupational, educational – and the descriptive categories of location, institution, person, object, event, operation, and text, provide a framework for the design of needs analysis questionnaires, and for the definition of outcomes. 41 38 Brown, Principles of Language Learning and Teaching, p.281 Keddle, art.cit., p.43. 40 Ibid., p.45. 41 Heyworth, art.cit., p.18. 39 42 Chapter 2 Without a doubt, stimulating competence-based lessons and learning objectives can be derived from this extensive scheme as well. In addition, however, the impact on test foci and features can be just as significant. 
As Fulcher points out, Language testing rightly prioritises purpose as the driver of test design decisions. The context of language use is therefore critical, as it places limitations upon the legitimate inferences that we might draw from test scores, and restricts the range of decisions to which the score might be relevant. 42 In that sense, the functional/situational focus of the CEFR’s descriptor scales promises a step into the right direction, even if a careful and selective application to the school context evidently still needs to be ensured. Hence, the most essential communicative functions and situations must be chosen and defined in relation to immediate learner needs, and the corresponding instructional material adapted to the students’ competence level. However, if such an action-oriented, competence-based and authenticity-promoting framework as the CEFR is chosen as the foundation for school-based learning and assessment, it is glaringly obvious that tests composed only of de-contextualised discreteitem tasks become obsolete once and for all. 2.2. Challenges of introducing competencebased assessment The theoretical usefulness of the CEFR as foundation for a competence-based teaching and assessment scheme is thus perceptible in a variety of its most defining aspects. Yet how can this Framework be used in practice to arrive at a workable and sound assessment system? What are the difficulties and obstacles that have to be surmounted before we can truly affirm that our school system also develops and measures language skills rather than exclusively knowledge? 2.2.1. The achievement versus proficiency conundrum in summative tests One of the major challenges inherent to competence-based assessment is underlined by the following passage in the CEFR: Unfortunately, one can never test competences directly. All one ever has to go on is a range of performances, from which one seeks to generalise about proficiency. (CEFR, p.187; emphasis added) If the aim is to certify learning outcomes with a high degree of validity and reliability, this realisation certainly seems inconvenient at first. To what extent can we trust – or justify – the objectivity of ‘generalisations’ that an individual assessor makes about a 42 Fulcher (2008), p.22. Emphasis added. Chapter 2 43 given learner’s apparent level of competence if the judgment is inexorably founded on inference rather than a direct and straightforward connection between tangible evidence and its factual significance? Moreover, if a ‘range of performances’ is needed to reach a conclusion about the learner’s proficiency, what can a single summative test contribute to that effect? The complicated nature of the situation is underlined further by the following statement: Proficiency can be seen as competence put to use. In this sense, therefore, all tests assess only performance, though one may seek to draw inferences as to the underlying competences from this evidence. (p.187) As seen in the previous chapter, the traditional core function of regularly scheduled summative tests consists, by definition, in verifying achievement of fairly narrow, previously covered objectives rather than overall target language proficiency. Rating a particular, isolated student performance is thus their predominant focus and purpose. 
Yet if the entire school system becomes centred on the development and assessment of competences, it seems highly desirable that summative tests should simultaneously provide hints, at regular intervals, as to where a given student stands in relation to the ultimately targeted competence level at a given point in time. While no single test can illustrate the entire extent of progress that the learner has made in terms of his linguistic competences, each one of them nevertheless constitutes one piece of the gradually emerging “proficiency puzzle”. Considered together rather than in isolation, the student’s performances in summative tests thus do provide a significant contribution to the overall picture of his proficiency (and, by extension, help to assess the levels of competence reached in regard to the different skills). Furthermore, the general form and constituent tasks of individual summative tests can be altered so as to present a stronger indication about proficiency in their own right. Essentially, the more diverse the individual task types are, the more they tap into different aspects of specific competences; hence, through the inclusion of a range of varied tasks and requirements (even in relation to one same overarching skill such as spoken or written production), a ‘range of performances’ may in fact be collected even within the obvious limitations of a single test. However, it would evidently be a big mistake to simply equate the student’s actual proficiency in the foreign language exclusively with his performances in summative tests. Over the course of a school year, numerous other, formatively assessed activities in the classroom elicit language productions that complete the teacher’s overall impression of the student’s proficiency and thus the extent of 44 Chapter 2 competence development. In that sense, summative tests should never be considered as a sole and sufficient source for proficiency assessment. This realisation also bears crucial implications if one wants to use the CEFR descriptor scales as a general basis for assessment. Through their intrinsically competence-oriented nature, the individual descriptors are inherently characterised by a focus on the “bigger picture”: overall proficiency. Hence, their scope is far too extensive to allow for a direct and sufficiently nuanced rating of a specific, isolated student performance in a particular summative test task for instance. This is most evident in the phrasing of the global level descriptors which, as Morrow points out, ‘like any attempt to capture language performance in terms of language, … cannot be absolutely precise’ 43 . Indeed, it would be difficult for a single summative test to verify in depth whether a particular student aspiring to attain the ‘A2’ level ‘can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters’ (CEFR, p.24). Evidently, one can easily imagine a sample task which would check one particular aspect of this global competence; for example, in an oral interview, one could simply interrogate a student about his favourite hobbies or other daily habits. It is very conceivable that a student in 6e or 9e might answer all questions about such a topic in a satisfactory way following a few weeks of correspondingly topic-based language learning (thus proving the achievement of a precise objective). 
However, this one successful performance does not allow us to draw definite conclusions about the student’s overall level of competence – such an inference would be rash and excessive after one narrowly-focused task, and thus ultimately invalid. This also clearly underlines that, in their original form, the descriptors in the CEFR scales are neither intended nor ready-made to be simply ‘copy-pasted’ into marking grids and then have precise numerical values allotted to them so that they can be used in the summative assessment of individual test performances. The global insights which they can offer into the competence profile of a given language learner can only be safely attested after numerous, varied indications have been collected over a more substantial period of time. This explains why in the present Luxembourg education system, the most suitable place for such global CEFR descriptors is in the so-called “complément au bulletin” awarded at the end of an entire instruction “cycle”. For example, after complete learning cycles of one (6e) or two (8e/9e) school years, teachers have a much wider range 43 Morrow, art.cit., p.8. Chapter 2 45 of performances at their disposal to decide whether their students have attained the ‘A2’ level in English in terms of a number of salient CEFR descriptors. As already implied, this final and global assessment evidently needs to go beyond the students’ performances in the summative tests alone; for a truly meaningful certification of a student’s veritable proficiency level, all the additional classroom activities conducted over that period of time most certainly need to be taken into account as well. 2.2.2. Contentious aspects of CEFR descriptors and scales In a school-based context, a number of other aspects of the CEFR descriptors can appear problematic and have therefore been criticised in various quarters. For instance, if the six main proficiency levels are to be used as the foundation for an entire language curriculum, the phrasing of several descriptors can cause a number of concerns. A first major problem is often seen in the rather vague choice of terminology inherent to a number of descriptors. As far as assessment is concerned, this is particularly problematic in that ‘the wording for some of the descriptors is not consistent or transparent enough in places for the development of tests’ 44 . To cite but one example, a learner is adjudged to have attained a ‘low A2’ (or ‘A2-’) level in ‘vocabulary range’ if he ‘has a sufficient vocabulary for the expression of basic communicative needs’ and ‘for coping with simple survival needs’ (CEFR, p.112). What exact ‘communicative’ and ‘survival needs’ qualify as ‘basic’ or ‘simple’? Perhaps more importantly in a school context, what amount of ‘vocabulary’ is deemed to be ‘sufficient’ and thus has to be demonstrated by a student at that level? The nature of the problem is further clarified by Weir: The CEFR provides little assistance in identifying the breadth and depth of productive or receptive lexis that might be needed to operate at the various levels. Some general guidance is given on the learner’s lexical resources for productive language use but, as Huhta et al. 
(2002: 131) point out, ‘no examples of typical vocabulary or structures are included in the descriptors.’ 45 Going back to the A2 descriptor for ‘grammatical accuracy’, similar questions might easily be raised about the exact meaning of the phrase ‘some simple structures’: indeed, which grammatical elements are considered ‘simple’ and which ones are not? If grades and progress in our school system are ultimately connected to such assessments of 44 Weir, art.cit., p.282. Ibid., p.293. The reference and quote in this passage are to Ari Huhta et al., ‘A diagnostic language assessment system for adult learners’ in J. Charles Alderson (ed.), Common European Framework of Reference for Languages: learning, teaching, assessment: case studies, Council of Europe (Strasbourg: 2002), pp.130-146. 45 46 Chapter 2 competence, the salience of these considerations can certainly not be underestimated. It seems an unavoidable requirement for both curriculum and test writers to specify the precise functions, grammatical items and topical lexis areas to be acquired and mastered by the students. This definitely reinforces the claim that the CEFR has to be adapted to particular circumstances and needs. In this context, Weir underlines the fact that ‘the CEFR ‘is intended to be applicable to a wide range of different languages (Huhta et al., 2002)’’; sufficiently detailed language-specific guidelines and lists are impossible to provide in that instance. At the same time, however, ‘this offers little comfort to the test writer who has to select texts or activities uncertain as to the lexical breadth or knowledge required at a particular level within the CEFR’ – a problem which is further amplified by the fact that, in the CEFR, ‘activities are seldom related to the quality of actual performance expected to complete them (scoring validity)’ 46 . In other words, there are no benchmark samples to be found in the Framework which directly demonstrate the standards of performance to be expected at a given proficiency level. Once again, such an omission is evidently understandable in view of the plurilingual aims, purposes and overall applicability of the CEFR. Including suitable examples in a multitude of European languages for all the identified competences and levels would indeed have been an excruciatingly extensive (and, considering the inherent necessity of language-, culture- and country-related adaptations, most difficult) endeavour indeed. As a result, a clear need emerges for curriculum developers to illustrate the theoretical phrasing of the descriptor scales with language-specific samples of performance rooted in practice, or, at the very least, to provide guidelines as to which particular functions, topics as well as grammatical and lexical ranges correspond to the chosen standards. Ideally, of course, teachers might even expect to be offered both types of specifications from curriculum and syllabus writers. In this respect, grammar is arguably the traditional core element of many language courses which the CEFR most notably fails to scrutinise in extensive depth – that is, as far as direct usability in classroom tests is concerned. Indeed, due to the generalising nature of the grammar-related CEFR descriptors, they are largely unsuited to be put to immediate use for the summative assessment of grammatical micro-skills within a relatively small and thus limited sample of spoken or written production. 
In itself, this is not necessarily negative; as Keddle remarks, the approach chosen in the Framework does not aim to verify the mastery of any isolated grammatical items; instead, 'the CEF puts the emphasis on what you achieve with grammar' 47. Nevertheless, Keddle admits that the CEFR's rather patchy treatment of grammatical components does lead to problems when it comes to adapting the Framework to the requirements of school-based language teaching. In essence, the predominantly functional/situational focus of the CEFR clashes with the more form-focused, grammar-oriented layout and exigencies of numerous traditional school syllabi. Hence, while 'the global descriptors fit any successful language learning situation, including the school classroom,' Keddle rightly points out that presently 'the detailed descriptors often don't match with the grammar focus found at school level' as they are 'not sufficiently linked to concept areas to provide a basis for a teaching programme' 48. As an example that forcefully underlines the inconsistent approach to grammatical components in the CEFR, Keddle notes a surprising discrepancy in the global descriptors. While some elements such as 'speaking about the past' are overtly discussed,
[t]here are grammar-based concept areas that are not covered early enough, or at all, which could be included, e.g. 'talking about the future'. […] The future is in fact only indirectly referred to in the descriptor 'make arrangements to meet' (A2). As most teachers also knowingly cover the concept of the future it is a shame that it is absent from the descriptors. 49
Such realisations further underline the complex double nature of the adaptation process that is needed to unify CEFR contents and grammar-led school syllabi in a satisfactory way. Not only is it necessary for curriculum developers to select elements already present in the CEFR that are suitable for particular age and learner groups; at times, they may also have to fill in remaining gaps by supplementing new elements that were not initially treated in the descriptor scales at all.
46 Weir, art.cit., p.293/p.282.
47 Keddle, art.cit., p.47. Emphasis and italics added.
48 Ibid., p.44/p.49.
49 Ibid., p.49.
50 J. Charles Alderson et al., 'Analysing Tests of Reading and Listening in Relation to the Common European Framework of Reference: The Experience of the Dutch CEFR Construct Project' in Language Assessment Quarterly, 3, 1 (2006), pp.3-30.
Further problematic implications tied to the phrasing of the CEFR descriptors were systematically investigated and described by Alderson et al. in a project carried out for the Dutch Ministry of Education, Culture and Science 50. Although, in contrast to the present thesis, the main focus of that project was on the receptive skills of reading and listening, the insights offered by that research study can be applied in a general way to the vast majority of descriptor scales in the CEFR and thus also prove valuable in relation to the Framework's exploration of productive skills. Alderson et al. found four main types of problems when analysing and comparing different CEFR descriptors and levels:
1. First of all, a number of 'inconsistencies' appear in various descriptor scales. These may affect different proficiency levels, for instance 'where a feature might be mentioned at one level but not at another' 51. As an illustrative example, Alderson et al. cite concepts such as 'speed' or 'standard', which are not consistently mentioned at all levels.
Similarly, references to the operation of using a dictionary are mysteriously absent from the lower levels, but discussed at B2 level. 52 This latter point clearly indicates that omissions of certain terms at some levels (but not others) are not always logical; they cannot always, for example, simply be justified by their inapplicability to low-level learners on the grounds of such students' limited target language resources. If anything, the use of dictionaries might even be of more importance to A1 and A2 learners than to B2 students. Particularly in the interests of a school-based application that aims to develop specific skills over time, it would definitely be helpful to apply terms and concepts with more rigour and consistency across levels. Furthermore, even at one and the same level, problems of inconsistency may surface:
A feature may appear in one descriptor for a level, but not in another for the same level. For example, what is the difference between specific information (A2) and specific predictable information (A2)? 53
Less critical readers and users of the Framework might consider (and sometimes dismiss) the insistence on such nuances as mere "semantic nitpicking". Nevertheless, careful study of the various CEFR descriptor scales does reveal a number of slight variations which could arguably have been avoided in order to facilitate the systematic tracing of cognitive development and skills-related progress over time.
2. In close relation to the previous point, further criticism concerns the CEFR writers' use of a variety of verbs to describe similar cognitive operations (e.g. 'identify' and 'recognise'), as it frequently remains unclear whether true synonymy exists between two different expressions. On the one hand, Alderson et al. note that the CEFR authors might have had 'stylistic reasons' for this inconsistent use of terminology; a more likely explanation, however, is that the use of different verbs actually betrays that, as seen above, 'the can-do statements were originally derived from a wide range of taxonomies' 54. In either case, Alderson et al. chose to pursue a higher degree of standardisation for the terminology used in the remainder of their own project. With concerns of validity and reliability in mind, such a course of action seems commendable indeed.
3. The problems with descriptor phrasing are further compounded by a 'lack of definitions' for numerous expressions used to describe cognitive and linguistic acts as well as text or task types. Alderson et al. draw attention to the unclear meaning of the term 'simple' in numerous descriptors; as it is never clearly defined or illustrated (either by theoretical means such as lists of grammatical or lexical items, or empirically through sample productions), it is equally unclear what degree of quality a language learner's performance needs to reach in order to correspond to a specific CEFR level. In addition,
[t]he same definitional problem applies to many expressions used in the CEFR scales: for example, the most common, everyday, familiar, concrete, predictable, straightforward, factual, complex, short, long … and doubtless other expressions. These all need to be clarified, defined, and exemplified if items and tasks are to be assigned to specific CEFR levels. 55
51 Ibid., p.9.
52 Ibid., p.10.
53 Ibid., p.10. Italics are the authors'.
54 Ibid., p.10.
55 Ibid., p.12. Italics are the authors'.
The clarification and standardisation of these terms is crucial in several ways, particularly if CEFR descriptors are to play a central role in the assessment of language proficiency in an entire school system. Evidently, every individual assessor needs to be unmistakably aware of the exact criteria that a given student performance needs to fulfil to attain a particular competence level, so that high scorer reliability is guaranteed. In addition, to increase inter-scorer reliability as well, all the different teachers in that system need to share a common understanding of the exact meaning and levels of performance that correspond to such standardised terminology. Potential ways of attaining both of these crucial prerequisites in practice will be discussed in subsequent chapters of this thesis.
4. A final weakness that Alderson et al. identify in the CEFR consists in several remaining 'gaps' 56 in the descriptor scales. Given that their project focused on receptive skills, some of their justified concerns relate to the absence of 'a description of the operations that comprehension consists of and a theory of how comprehension develops', echoing the general lack of a rigorous SLA-theory-based background affecting the CEFR overall. They also argue that 'the text of the CEFR introduces many concepts that are not then incorporated in the scales or related to the six levels in any way', such as, for instance, 'competence, … activities, processes, domain, strategy and task […]' 57. While this criticism is not factually incorrect, it nevertheless appears to neglect the complementary structure of the CEFR's two major components (i.e. the descriptive scheme and the descriptor scales). Considering the already fairly extensive scope and phrasing of most descriptors, as well as of the explanations provided in the descriptive scheme (conciseness is definitely not always the CEFR's strongest suit), it is hard to fathom how all the concepts defined and discussed in the descriptive scheme could additionally be included in the scales without making the latter excessively bulky and overloaded. Elements such as activities or domains can surely be derived from – or related to – the descriptor scales even if they are not explicitly incorporated there. However, the regret voiced by Alderson et al. about a missing direct link between such notions as 'processes' or 'tasks' and the six levels appears more warranted, particularly with test writers and language teachers in mind. Especially for pedagogic purposes, it would indeed be helpful to have a clear overview of the precise types of processes and tasks that can be associated with – and thus expected from – learners evolving at a given level of proficiency; in a similar vein, such information would be of tremendous help when deriving fair and valid test specifications and constructs from the CEFR.
56 Ibid., p.12. The authors specify that they 'considered a feature missing if it was mentioned in general terms somewhere in the CEFR text but then was not distinguished according to the six CEFR levels or was not even specified at one level.'
57 Ibid., p.12.
2.3. The CEFR and the Luxembourg school system: possible uses and necessary adaptations
In this chapter, a close look at the Common European Framework has revealed it to be an instrument that is promising and challenging in almost equal measure.
On the one hand, it certainly represents a most useful and long overdue catalyst for change, capable of triggering drastic alterations to more traditional approaches to language learning, teaching and assessment. Through its detailed and thorough catalogue of skills and competences that any speaker of a foreign language needs to call upon, its fundamentally positive focus on what language learners can do, and its undeniable, ever-increasing weight on an international scale, the CEFR unites all the necessary ingredients to be considered a legitimate and stimulating basis for a competence-based system of teaching and assessment. On the other hand, the controversy surrounding a number of elements in the CEFR unmistakably shows that several adaptations are necessary before the Framework can be used as a workable and potent tool, particularly for assessment, in the Luxembourg school system. Its initial focus on independent, presumably adult learners must be acknowledged and reshaped for classroom purposes, with the particular needs and interests of adolescent pupils in mind.
As far as ESL courses are concerned, one also needs to consider that English is in fact the third foreign language that students learn in our school system. When deriving a subject-specific alignment to the CEFR for lower-level English courses in Luxembourg, one thus needs to take into account that students do not start developing all the corresponding competences from the very bottom all over again. This aspect of plurilingualism is, in fact, explicitly pointed out by the CEFR authors:
A given individual does not have a collection of distinct and separate competences to communicate depending on the languages he/she knows, but rather a plurilingual and pluricultural competence encompassing the full range of the languages available to him/her. (CEFR, p.168)
Those who have learnt one language also know a great deal about many other languages without necessarily realizing that they do. The learning of further languages generally facilitates the activation of this knowledge and increases awareness of it, which is a factor to be taken into account rather than proceeding as if it did not exist. (p.70)
Ignoring this extremely useful realisation would be robbing oneself of one of the most powerful means of encouraging smart and efficient learning: activating skills and knowledge that have already been developed to a certain extent. As Heyworth puts it, this could prove invaluable to help students 'develop… strategies and skills for 'learning to learn languages'':
Teachers sometimes assume that a beginner starts from scratch, but in fact most have experiences of other languages and skills and knowledge they can apply usefully to learning the new language. 58
58 Heyworth, art.cit., p.15.
At the same time, however, this creates a new set of problems when it comes to linking CEFR levels with course objectives at school. If students have (to a certain extent) already developed some skills and learning strategies (but not others) before beginning a particular course in a 'new' language, which targets do we set them for the successful achievement of their school year (or cycle) in that subject? In other words, if the overall
target level after the first learning cycle of the English curriculum is globally set at A2 (as is currently the case in Luxembourg), does that mean that we should "only" expect A2-level performances across all (sub-)sets of competences, or are there perhaps some elements where a minimal target requirement of B1 might be more appropriate due to the students' vast repertoire of prior language learning (in other foreign languages)?
Before such questions can be answered, and the attainment of particular CEFR levels attested, it is of course necessary to devise a valid and reliable CEFR-aligned assessment system in the first place. As seen above, this is far from a straightforward task. The problems of inconsistency and partial incompleteness characterising the terminology in the different descriptor scales, as documented by Alderson et al., certainly call for a great deal of care when it comes to adapting them for local education and assessment purposes. Moreover, their role in the overall assessment system can prove rather complex. Due to their inextricable link to the concept of language proficiency, the CEFR descriptors seem to be most appropriate for an end-of-cycle certification that takes into account multiple and varied performances over a substantial period of time. No such conclusions can be drawn if they are not underpinned by a sufficient number of regularly collected indications about a given student's level of competence. Hence, the purpose of our regular summative tests will increasingly have to be geared towards a mixture of both achievement and proficiency assessments as well – once again no simple undertaking. Since a single summative test cannot explore the learner's whole range of competences and skills (i.e. his veritable level of proficiency), the corresponding assessment system for each single test can, by definition, not rely solely on proficiency-based descriptors such as those from the CEFR scales. In other words, the original CEFR descriptors are neither intended nor particularly suited to simply be associated with a specific numerical mark in the interests of reaching a final score in a classroom test. This is not to say, however, that they cannot partially be drawn upon to assess such an isolated performance; for instance, if a marking grid is compiled for the explicit purpose of summative assessment, various skills-related CEFR descriptors might certainly help to characterise a range of aspects of even an isolated student production. However, in such a context it is highly likely that more precise, achievement-oriented elements must complement them if, aside from the learners' general ability to communicate meaningfully, their mastery of very specific language elements (e.g. previously treated topical lexis or grammatical forms) is to be verified as well.
In essence, then, the CEFR is likely to take on a dual role in our new competence-based education system. To confirm the satisfactory attainment of a targeted, global CEFR level at the end of a learning cycle (for instance by means of a complément au bulletin attesting the level of proficiency demonstrated by the student against target standards such as A2 or B1), the Framework's descriptors can be drawn upon in almost direct fashion, even if certain inconsistencies and gaps might still have to be addressed in the interests of validity and reliability.
CEFR descriptors can also intervene in the establishment of marking grids for regularly conducted summative tests, as they potentially provide a basis for a more reliable, criteria-based assessment of specific learner performances. However, due to their rather vague, generalising and proficiency-centred terminology, their inclusion in summative marking grids is not as straightforward in this second instance; further adaptations and specifications might have to be supplied in that case to increase the immediate relevance and suitability of the assessment.
The next two chapters of this thesis will now provide a detailed description and analysis of activities, test and task types implemented in the classroom to stimulate and develop the learners' productive skills, and the corresponding means and strategies that were used to assess the students' resulting performances. This will not only allow us to explore how such competence-based ways of approaching speaking and writing might be integrated into everyday teaching practice, but also to look for beneficial effects as well as potential remaining shortcomings of the applied methods.
Chapter 3: Competence-based ways of assessing speaking at A2 level.
As briefly outlined in chapter 1, the systematic assessment of speaking skills has traditionally played a strongly subordinate role in lower-level ESL classes in the Luxembourg school system. Instead, writing tasks have usually been favoured in classroom testing, often to the point where the summative assessment of the other productive skill has been completely ignored throughout the students' first two (or more) years of English language courses. Yet an important aim of our schools undoubtedly consists in producing independent individuals who are fundamentally capable of calling upon a foreign language to communicate with others in a variety of situations. Does it make sense, then, to neglect the most spontaneous, common and immediate form of communication in such a disproportionate manner? Certainly not, particularly within a competence-based teaching and assessment scheme largely inspired by the strongly communicative approach of the CEFR. Language teachers all around Luxembourg, by virtue of new official syllabi that oblige them to test all major skills over the course of the 'A2' English cycles of both the ES and EST systems, now suddenly find themselves in a position where they have to alter their own approach to testing speaking, often in fundamental ways.
As with all major changes to the traditional assessment system, some initial uncertainty evidently exists as to how this new curricular requirement can be suitably put into practice. Which precise skills and competences are to be called into action for an appropriate and sufficiently thorough assessment of speaking? What type of tasks can be used to achieve a satisfactory activation of those skills and competences? And what type of instrument(s) should teachers correspondingly rely on to arrive at a highly valid and reliable assessment of student performances? In this chapter, I will first outline a number of salient features of speaking performance that need to be taken into account when drawing up, implementing and assessing summative test tasks that focus on this complex productive skill in the classroom.
These theoretical considerations will then be illustrated through the description and analysis of a range of activities that I conducted in a 'pre-intermediate' class of the EST system, with the aim of integrating a competence-based approach to assessing speaking into everyday teaching practice.
3.1. Central features of interest to the assessment of speaking
3.1.1. Features shared by both productive skills
In terms of what is relevant to testing and assessment, the two productive skills of speaking and writing have several features in common. As Harmer points out, 'there are a number of language production processes which have to be gone through whichever medium we are working in' 1. Such similarities occur in various aspects of the language samples that our students produce.
As far as the form of a linguistic sample goes, the two concepts of fluency and accuracy are of key importance to both speaking and writing performances. Hence, the student is generally supposed to demonstrate adequate range as well as appropriate choice and use in terms of lexis and grammar in order to deal with the task he is given. In general, as Brown stresses, it is 'now very clear that fluency and accuracy are both important goals to pursue in CLT and/or TBLT'. In the context of 'pre-intermediate' classes, it is also interesting to note that 'fluency may in many communicative language courses be an initial goal in language teaching' 2 to ensure that learners can bring across a basic message; subsequently, accuracy often takes on an increased role as it becomes increasingly important to "fine-tune" the quality of the student's performances. However, in an isolated test, the teacher (as assessor) must decide to what extent fluency is to be prioritised over accuracy (or vice versa) in accordance with the specific cognitive demands of each constituent task. Hence, if the achievement of a narrow objective is to be verified (such as the student's appropriate use of discrete lexis items in response to controlled, closed questions), it is clear that accuracy plays a predominant role in the corresponding assessment. On the other hand, the focus tends to be put on the student's fluency if longer samples of fairly free writing or speaking are elicited, for example through more open questions. If the various components of a summative test aim to allow for a mixture of both achievement and proficiency assessments, it is clear that considerations about both accuracy and fluency must be taken into account and weighed against each other in regard to the respective task requirements.
1 Harmer, op.cit., p.246. Emphasis added.
2 Brown, Teaching By Principles, p.324. Italics added. (CLT = communicative language teaching; TBLT = task-based language teaching)
Convincing oral and written productions also have to meet some comparable requirements on a structural level. As Harmer summarises,
for communication to be successful we have to structure our discourse in such a way that it will be understood by our listeners or readers. 3
To achieve this, the two components of coherence and cohesion are of central importance. Coherence can broadly be defined as 'the way a text is internally linked through meaning' 4; hence, the ideas in a given text or speech are structured and sequenced in such a way that the listener or reader can follow the author's intended meaning and reasoning without confusion.
Cohesion, on the other hand, may be seen as 'the way a text is internally linked through grammar/lexis' 5 and stylistic elements in general, such as, for instance, linking words or phrases used to connect successive paragraphs with each other. Due to the greater amount of time and planning (as well as the possibility of re-drafting) generally available when approaching a writing task, it may be sensible to expect (and thus insist on) a higher degree of coherence and cohesion in written productions than in spoken ones (which often result from a much more impulsive and immediate engagement in communication). As Harmer points out, 'spontaneous speech may appear considerably more chaotic and disorganised than a lot of writing'; in general, however, 'speakers [still] employ a number of structuring devices' which may even include speech-specific features such as 'language designed to 'buy time' [or] turn-taking language' 6. In speaking performances of 'pre-intermediate' language learners, the latter two elements may of course be too challenging to be included, particularly if such conversational strategies have not been pre-taught. Yet even at that level, each assessment of speaking skills globally needs to take into account to what extent the learner is able to structure his discourse so as to increase the intelligibility of the intended message or line of argument.
3 Harmer, op.cit., p.246.
4 Martyn Clarke, PowerPoint notes from the seminar 'Creating writing tasks from the CEFR', held in Luxembourg City in October 2009, © Oxford University Press, p.3. Emphasis added.
5 Ibid., p.3. Emphasis added.
6 Harmer, op.cit., p.246.
Evidently, the overall content of a particular speaking or writing performance also needs to fulfil a number of conditions to be considered a successful effort on the part of the learner. Features such as the completeness and level of detail of the information provided in the student production, as well as the general relevance to the actual topic or instructions, usually combine to determine an adequate achievement of (or response to) the set task. However, one needs to be very cautious in the interpretation and evaluation of content-related criteria, especially as they are generally considered to be of overarching importance in performance assessment. Indeed, the complex interplay of form and content can lead to a number of problematic cases. For instance, an inadequate performance in terms of content can sometimes be explained by the student's insufficient mastery of form-related knowledge or skills (i.e. a student lacking the necessary lexis or grammar will not be able to express task-relevant ideas in appropriate ways). In that instance, a potential unsuitability of the task and its excessively demanding requirements might be at fault rather than the learner himself; for example, insufficient scaffolding and pre-teaching of key elements may in fact have been offered to the student in preparation for the test. On the other hand, a learner performance may be perfectly proficient in regard to purely language-related features (such as vocabulary range and grammatical structures); yet it can evidently still be seen as an inadequate effort – and rightly so – if it completely ignores the topical specifications of the test task.
Finally, to further enhance the relevance and adequacy of their spoken and written performances, learners need to be aware of appropriate styles and genres to use 7. Depending on the purpose (e.g.
information, inquiry, commentary,…), audience and setting of a given act of speaking or writing, competent language users must show that they can choose an appropriate format and register (e.g. formal / informal) for the required communicative act. As Harmer stresses, this also presupposes an awareness of 'sociocultural rules'; particularly in the context of foreign language learning, students indeed have to develop an understanding of conversational or textual conventions that 'guide [typical] behaviour in a number of well recognised speech events' within the target culture, 'such as invitation conversations, socialising moves, and typical negotiations', to name but a few 8. A student's sociolinguistic competence, 'concerned with the knowledge required to deal with the social dimension of language use' (CEFR, p.118), may not be an explicit and stand-alone focus when it comes to assessing a particular spoken or written language sample. Nevertheless, it is often crucial that learners develop it to a satisfactory degree so as to prevent their task response from becoming inappropriate in a range of other interlinked areas.
7 Ibid., p.247.
8 Ibid., p.247.
3.1.2. Features specific to speaking
The use of correct orthography is an aspect of accuracy that is generally confined to the assessment of written performances. In speaking, the students' pronunciation skill may be considered its logical counterpart. Such parallels are indicated in the respective CEFR scales: for orthographic control, the A2 descriptor expects 'reasonable phonetic accuracy (but not necessarily fully standard spelling)' (CEFR, p.118) from a language learner of that level. The assumption is that occasional wrong or missing letters can be automatically corrected by the reader so that a breakdown in communication is avoided. For a speaking performance, the CEFR similarly stipulates that an A2 learner's 'pronunciation is generally clear enough to be understood despite a noticeable foreign accent' (p.117); again, the listener may be expected to make some reasonable "compensations" for the imperfections in his interlocutor's speech. However, the importance of phonological control can easily be underestimated; in fact, one might wonder whether the corrective 'effort' made by a listener is always comparable to the one made by a reader. One could even argue that faulty pronunciation can have a more adverse impact on successful communication than occasional misspellings, at least as far as the English language is concerned. Thus, a single misplaced letter in writing may not cause unintelligibility very often; one wrong vowel sound in speaking, on the other hand, can radically alter or obfuscate the meaning of an utterance. In that respect, it is certainly advisable not to dismiss pronunciation prematurely as a mere sub-skill of relatively little importance when it comes to assessing a general speaking performance.
On a different level, the communicative process of interaction is one of the key areas where the applicable scope of the two productive skills clearly differs. In the broadest sense, writing can of course easily be used for interactive purposes (for example, a letter may be written in reaction to another one). However, a very different type of interaction certainly occurs in oral communication, where speakers can immediately react and adapt to each other's utterances.
In that respect, oral discussions can undeniably develop complex dynamics that would be difficult to emulate in written form (except perhaps via contemporary means of written interaction offered by online services such as instant messaging programs). When designing speaking tests, a clear distinction thus needs to be made between tasks which aim at activating skills of interaction and others which are centred on a longer, individual turn for a single student. As Heyworth notes, the 'CEF description' of speaking skills usefully caters for this crucial distinction: it 'allows us to distinguish between spoken production and spoken interaction' 9 and correspondingly provides different descriptor scales for both aspects rather than just approaching speaking as one large general skill.
In view of test design, recognising this fundamental importance of interaction in many speech acts opens up a number of stimulating possibilities. Teachers pursuing more traditional strategies often seek to collect evidence of students' speaking skills purely from rehearsed speeches such as formal presentations. Yet such an approach is often flawed since the corresponding oral productions tend to be prepared or even read out from notes; in that sense, they do not effect a "true" and direct activation of speaking skills. In contrast, truly interactive tasks require the students to access their repertoire of oral competences "in real time"; no extensive preparation (for example through word-for-word memorisation of precise sentences and formulations) and thus no such distorted or falsified picture of actual oral proficiency can occur in that instance. In addition, as Brown puts it,
learning to produce waves of language in a vacuum – without interlocutors – would rob speaking skill of its richest component: the creativity of conversational negotiation. 10
A number of suitable and varied activities are conceivable with the aim of creating such an exchange. Teacher-student (T-S) interaction, for example a short teacher-guided interview, may thus be included in a speaking test just as easily as different types of student-student (S-S) interaction. Hence students may for instance be asked to engage in opposing debates or, in contrast, to work collaboratively towards the achievement of a common goal. As an added benefit, different micro-skills may be verified in all of these types of interactive activities (T-S and S-S). Students may, for instance, be expected to show their ability to ask for repetition or clarification in the target language; similarly, appropriate use of turn-taking skills can take on an important role, particularly in S-S tasks 11. Naturally, however, the insistence on interactive competences does not prohibit the simultaneous inclusion of more 'production'-oriented activities (requiring a longer, exclusive turn for each individual student) in one and the same speaking test. In fact, as seen in chapter 2, the use of a wider range of considerably diverse task types (generating a 'range of performances' rather than a single, one-dimensional sample) is advisable if the assessment of general proficiency is the defined purpose of the speaking test.
9 Heyworth, art.cit., p.16. Emphasis added.
10 Brown, Teaching by Principles, p.327.
11 Incidentally, for all of these strategic micro-skills, the CEFR once again provides useful, separate descriptor scales: see the tables for 'Taking the floor', 'Co-operating' and 'Asking for Clarification' (CEFR, pp.86-87).
3.2. Case study 1: speaking about people, family, likes and dislikes
After identifying the main components that a thorough assessment of speaking skills should be founded on, the next step consists in finding a suitable way of integrating these concepts into daily teaching, testing and assessment practice. At this point, I will therefore venture into a detailed description and analysis of two different speaking tests that I implemented in the classroom with that aim in mind. The first one, detailed in this section, mainly focused on the thematic area of physical descriptions, as well as a general discussion of A2-typical areas of interest such as personal information, family, likes or dislikes, and free-time activities.
3.2.1. Class description
The practical experiments with competence-based tests and assessments presented in this thesis were conducted in one and the same 9TE class at the Lycée Technique Michel Lucius in the school year 2009-10. That class was initially composed of 22 students of different nationalities (12 girls and 10 boys; 12 pupils were Luxembourgish, 5 students had a Portuguese background and 5 others were of Italian descent). However, two students left the class and another pupil joined the group over the course of the school year 12. One slight peculiarity was that no fewer than seven students were retaking their year; hence, they arguably started out with a slightly higher initial level of proficiency in English than their peers. Correspondingly, student age generally ranged from 14 to 16 years at the beginning of the year, with the exception of one 17-year-old boy. No seriously disruptive behaviour affected the general climate in the classroom, and so the overall atmosphere among the students was mostly a co-operative and mutually supportive one. As a positive corollary, the absence of serious disciplinary problems was certainly conducive to the students' ability to concentrate on competence-based language activities in an efficient and attentive way.
12 One (Luxembourgish) girl left for a similar class in another school, while the other (Portuguese) student, already retaking her year, voluntarily joined a 9PO class after the first term. The new student, a 15-year-old Luxembourgish girl, transferred into the class from a school belonging to the ES system in the middle of the second term.
3.2.2. Laying the groundwork: classroom activities leading up to the test
As the summative speaking test described in this case study was scheduled for the penultimate week of the first term, ample time was available to gradually prepare the students during the weeks leading up to it. In a number of ways, it was of paramount importance to familiarise the students both with the subject-matter and with the types of tasks that would be the cornerstones of the eventual summative assessment. First of all, of course, the necessary lexical and grammatical groundwork had to be laid so that the students would have a sufficient knowledge base to master the test tasks in a satisfactory way. However, on a methodological level as well, it was particularly necessary to provide scaffolding and to model expected behaviour in different activity types, as the students had been completely unused to having their speaking skills assessed in summative tests up to this point.
Hence they needed to become accustomed to elements such as interactive pair work activities and the verbal description of visual prompts, so that student-related reliability could be increased in their test performance. While reliability-affecting factors such as 'exam nerves' can of course never be completely pre-empted, they would certainly have a less prominent impact in the eventual test if the students recognised the task requirements (i.e. the expected type of behaviour and performance) in the test situation due to previous encounters with similar tasks in "regular" classroom time. At the same time, the more the test components corresponded to the material and activities treated in class, the more firmly their content and face validity would be guaranteed.
Since one of the primary thematic areas to be incorporated in the test consisted in physical descriptions of people, a number of classroom activities revolved around that topic as well. To raise intrinsic motivation and familiarise students with interactive strategies at the same time, short games proved a particularly useful tool. In one instance, the students were asked to describe the looks of a specific person to their respective neighbours; each student was handed a cue card with the picture of a celebrity, and on the basis of this description, the partner had to guess who the described man or woman was (if necessary by asking further questions for clarification). Another activity, simultaneously recycling the use of personal pronouns and family vocabulary, asked the students to complete a family tree by drawing some missing facial features into various partially blank portraits on their handouts; each pair of students had received two different versions of the same family tree, so that they had to elicit missing information from their partner through targeted questioning and negotiating 13. In a playful manner, the students were thus encouraged not only to apply some newly acquired topical lexis, but also to actively develop their interactive speaking skills and become used to working with a partner in the process.
13 This activity was adapted from Mick Gammidge, Speaking Extra, Cambridge University Press (Cambridge: 2004), pp.16-17.
In terms of feasibility, one might add that it is by no means a difficult or particularly time-consuming endeavour to integrate such explicitly speaking-focused activities into one's general teaching routine; indeed, the time-span to be set aside for either of the described activities did not exceed twenty minutes. In fact, as all the students were simultaneously engaged in discussions with their neighbours (while the teacher's role was confined to monitoring the learners' efforts and occasionally helping them out), such games actually constituted a very time-efficient way of actively involving as many students as possible within a relatively short amount of time.
3.2.3. Speaking test 1: description of test items and tasks
The first summative speaking test was strategically placed at the end of term so that it could deliver relevant indications with a view to both achievement and proficiency assessments. On the one hand, its various components therefore needed to check that adequate learning had taken place in relation to specific objectives (i.e. reflecting the elements expressly focused on in the weeks immediately before the test, such as lexical accuracy in physical descriptions as well as the correct grammatical use of the present simple and past simple tenses). On the other hand, longer individual speaking turns could also offer some insight into the general progress (i.e.
competence development) made by individual students in the three months since the start of the school year. Hence, when compiling the items and tasks for this test it was not only necessary to ensure reasonable validity, reliability and a sensible degree of difficulty; if a genuine, comprehensive insight into speaking proficiency was to be reached, it was also important to include both individual turns (of spoken production) and elements targeting spoken interaction. In that sense, it quickly became clear that a range of activities (rather than a single one) needed to be included in the test to yield conclusive results. In turn, however, this gave rise to a new set of questions. Could a sample elicited from a language learner working within the A2 level of proficiency be extensive enough to provide sufficient indicators for all of these different foci? If so, how did the different tasks need to be structured and sequenced to support the learners in their efforts and thus ensure maximally effective results? In regard to such crucial questions, Harmer reminds us that
[i]n the first place, we need to match the tasks we ask students to perform with their language level. This means ensuring that they have the minimum language they would need to perform such a task. 14
Therefore, the different sections of the test needed to be firmly centred on the actual vocabulary items and linguistic structures that the students had already encountered and practised in the classroom; the occasional provision of structural or lexical clues within the individual test items would further help to support the learners in their efforts. Another particularly helpful form of scaffolding that Harmer points out concerns the thoughtful sequencing of tasks:
[t]eachers should not expect instant fluency and creativity; instead they should build up students' confidence 'bit by bit'…, giving them restricted tasks first before prompting them to be more and more spontaneous later. 15
14 Harmer, op.cit., p.251.
15 Ibid., pp.251-252.
Establishing a clear order among the tasks according to their increasing cognitive demand certainly makes sense in a summative test. As students will be aware that their performance has an immediate impact on their overall grade (and thus their general progress at school), they are likely to be particularly nervous at the start of the test situation. Making sure that comparatively simple, confidence-building tasks are incorporated at the beginning of a test can therefore be vital, as early positive experiences (leading to the feeling that they can do this) will help the students to quickly allay any potential test anxiety (at least to a certain extent). As a consequence, it is more likely that they will be able to access their full and genuine potential for the more demanding tasks to follow; in turn, the test will then provide more reliable indications as to the students' actual level of competence in speaking.
With that purpose in mind, I chose to use simple T-S interaction at the start of this speaking test: while the students took the test in pairs, each individual candidate was first asked a few simple, closed questions about his or her personal details (e.g. name, hometown…) and about daily routines or simple likes and dislikes (e.g. favourite music, free time activities…).
In addition, a simple spelling task was included in relation to one of the students' answers, providing a first precise indication of the students' phonological control. The two candidates were questioned in regular alternation (usually two to three questions per student at a time) so that no individual would be excluded from the conversation for too long (and thus be allowed to "switch off"). The questions used in this part of the test were also clearly defined and scripted beforehand 16, with the aim of guaranteeing similar answer lengths from each student while still leaving myself the possibility to vary the topics of inquiry slightly. In that way, later candidates would not know with absolute certainty what questions to expect even if they had received some information from their peers who had already taken the test; nor would a pair of candidates receive exactly the same questions, as this would have given an unfair advantage to the second speaker. Both the pattern and content of this opening task were inspired by the simple exchanges that typically characterise Speaking Part 1 of the Cambridge ESOL 'Key English Test' (KET) 17 – an examination specifically designed to assess and certify A2-level proficiency in English.
The second, more extensive section of the speaking test then consisted in two thematically interlinked parts: a longer individual turn, followed by a controlled S-S interaction task 18. At this stage, the pair of candidates were asked to choose one envelope (for the two of them) from a pack, which I then opened. Each envelope contained four complementary cue cards (two for each student). First, student A received one card with a picture of a famous person (not to be revealed to the partner) and some corresponding facts about that celebrity (e.g. date of birth, marital status…). The other card, given to student B, contained basic cues to ask for information about that celebrity 19. The first task, student A's long turn, consisted in describing the looks of the depicted celebrity in as much detail as possible (but without giving away the famous person's name), thus activating the competences built up through the speaking activities implemented in class. As the student was supposed to display his ability to complete this task independently (potentially using simple forms of paraphrasing whenever a precise lexical item eluded him), I chose not to intervene with any linguistic prompts here if the speaker "got stuck" (although strategic tips were occasionally given, such as 'try using another word or working with opposites if you can't find the precise expression').
16 See appendix 1.
17 A general description of this KET speaking activity can for example be found in University of Cambridge ESOL Examinations, Key English Test – Handbook for Teachers, UCLES (Cambridge: 2009), p.35.
18 The S-S interaction task was partly modelled on role play tasks that can be found in KET examinations as well (see Key English Test – Handbook for Teachers, p.37 for examples). The main difference resides in the fact that KET role plays are based on fictitious material (e.g. information about invented schools, libraries…), whereas the material used in this speaking test dealt with real people and accurately researched, authentic facts (with the aim of further increasing student interest).
19 See appendix 2 for samples.
Once the physical description was complete, I instructed student B to use the cues on his card to ask precise questions about the celebrity's life and career, which student A then had to answer by using the data provided underneath the picture on his card. After all the questions had been answered, student B was asked to venture a guess about the celebrity's identity, mirroring the similar games implemented in class (although, of course, failure to guess correctly would not entail a deduction of marks, as this could evidently not be construed as a legitimate achievement focus). The roles were then reversed; student B was given a cue card portraying another famous person, student A received the corresponding questioning cues, and the game started anew. At the end of this activity, both students were finally asked whether they liked their respective celebrities (as well as why/why not), and who their favourite famous singer (or actor / model / athlete…) was, so as to involve them in a less controlled discussion once more and elicit a final longer sample from each pupil.
Importantly, to prevent the students from simply copying their respective partner's exact formulations (in terms of both physical descriptions and questions), I had taken particular care to make the various "pairs" of celebrities as contrastive as possible. Hence any given envelope included one male and one female celebrity to elicit the corresponding personal pronouns and possessive adjectives. Similarly, I targeted significantly different physical descriptions by trying to vary elements such as eye and hair colours and shapes, complexions, build, and so on. The type of information to be asked or given about the celebrities during S-S interaction also varied slightly to trigger correspondingly diverse questions and answers.
3.2.4. Practical setup and test procedure
For the implementation of this speaking test (as well as the second one described in section 3.3. below), a number of factors were important to ensure reasonable feasibility. Hence, the decision to have students take the test in pairs was useful not only because of the possibility of integrating interactive tasks, but also with a view to making the administration more efficient in regard to time and required material. If the students had been called up one by one, more time would indeed have elapsed between individual performances, and a higher number of different test questions might have been needed to make sure that later candidates could not completely anticipate possible test content and thus gain an unfair advantage before even starting the test.
Of course, the decision to put students into pairs does not come without dangers of its own. Particularly if a summative mark is at stake, it would be negligent to overlook the risks of allowing potential room for lopsided pairs in terms of ability and language proficiency levels. If students need to feed off each other's performances to fuel their own successful reactions in interactive activities, one individual's performance can be dragged down by his peer's if the latter is unable to keep a dialogue going in an appropriate way (or even at all).
In this respect, Brown draws attention to what Nunan calls 'the interlocutor effect': 'one learner's performance is always colo[u]red by that of the person (interlocutor) he or she is talking with.' 20 Indeed, while helpful contributions from a so-called "strong" student could have a positive effect on the performance of a "weaker" one, the opposite scenario is just as likely. Thus, if a "strong" student is unable to fulfil part of a task satisfactorily because of the insufficient clues provided by his partner, this can easily have uncharacteristic, detrimental consequences for his final mark. Similarly, due to elements such as the "halo effect", the teacher may also be tempted (albeit unintentionally) to excessively mark down a "weaker" student if his efforts in the test have been overshadowed by a particularly good parallel performance from a much more proficient partner. In other words, the reliability of results (and of the scorer) can be at risk in multiple ways if insufficient care is devoted to the sensible formation of candidate pairs.
20 Brown, Teaching By Principles, p.325. Emphasis added.
However, the consideration of affective factors can help to compensate for this potential weakness to a certain extent. Thus, one justification for this methodological choice was that it provided a reassuring element for the students in the test situation: particularly as it was an entirely new type of summative test experience for them, they might have been more intimidated or overwhelmed if they had had to face the task (and the teacher) entirely on their own. Moreover, I left the choice of partner up to them so that they could team up with somebody they would work and cooperate comfortably with; in virtually all cases, this led to the same pairs that students had worked in during previous in-class activities. Therefore, they did not have to readjust to a different interlocutor during the test from the one they had practised with in the classroom – a reassuring situation with positive effects on student-related reliability. It needs to be added that there were no extreme cases of exceptionally "strong" or "weak" students in this class; as a result, I was satisfied with the pairs that the students chose because the respective partners were all fairly evenly matched in terms of ability. For that reason, I did not have to intervene and reassign students to different pairs in this case; however, as noted above, such measures may be necessary in other cases to prevent excessively unreliable results.
An empty classroom (adjacent to the students' usual one) was used as the location for the administration of the test, in order to minimise the risk of external disturbances hampering student performance. Each pair of candidates were successively called into the "test room" while, in the meantime, the rest of the class were doing language work in their own classroom, supervised by a member of the school staff 21. Evidently, it must be noted that such opportunity and flexibility to increase the reliability of test conditions was surely fortunate; in everyday teaching practice, it is highly likely that spare rooms and supervisors will frequently be unavailable.
21 At the LTML, full-time "surveillants" were handily available and, upon my request, kindly charged with that specific task by the deputy headmaster.
As four regular school lessons (of 50 minutes each) were ultimately necessary to test the entire class (approximately 15 minutes per pair of candidates, involving about 5 minutes of "pure" speaking time per student plus the time necessary for them to settle in and to receive test material and instructions), the conditions described in this instance must certainly be considered exceptional. In most cases, a solution will have to be found to administer the test in one's own classroom, which presupposes that the rest of the class must be kept silent and occupied for the entire duration of the actual implementation.
While the test-taking students were engaged in the different speaking activities, my own role was a rather challenging and complex one. On the one hand, I evidently needed to provide instructions to the students and drive the dialogues forward in the various parts of the test through questioning (in T-S interaction) and occasional prompting or probing (in S-S sections). Simultaneously, however, the quality of the students' performances had to be assessed; given the limited amount of time available for each pair of candidates, it was not straightforward to combine this task with the other, administrative role. Although a first brief assessment was possible during the test phase itself (see section 3.2.5. below), I also put an MP3 player on the teacher's desk to record the students' speaking performances. This allowed me to go back to individual efforts at a later point in time to verify some of the nuances and details that were impossible to capture all at once during the actual test. It also provided me with useful evidence of performance which could be retrospectively consulted in case a student were to challenge the marks he or she had received in the test.
3.2.5. Form, strategy and theoretical implications of assessment used
To avoid the plain conversion of an overly impulsive, subjective and holistic (and thus insufficiently justified) impression into the marks eventually awarded, it was clear from the outset that the summative assessment of the students' speaking performances needed to be founded on clear and well defined criteria. The most suitable assessment tool for the simultaneous inclusion of a range of different foci is a marking grid, where descriptors of expected standards can be juxtaposed to elements that define sub-par performances as well as particularly proficient ones, thus establishing what Goodrich calls 'gradations of quality' 22.
22 Heidi Goodrich Andrade, 'Using Rubrics to Promote Thinking and Learning' in Educational Leadership, 57, 5 (2000), p.13.
In order to ensure that such a marking grid is immediately usable for the assessment of a speaking performance in practice, it needs to fulfil a number of conditions. Hence it evidently has to do justice to the inherent complexity of speech acts, yet at the same time it must not be overloaded, so that the assessor does not get sidetracked by an excessive number of small details and nuances. In the same vein, the descriptors used for the different criteria need to be fairly short and concise to be of immediate use in practice; if too much (or excessively vague) information is included, it is difficult to maintain a clear overview of the core elements to focus on.
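Purely as an illustration of the mechanism at stake, the scoring logic behind such a grid can be reduced to a short sketch. The following lines are a hypothetical toy example, not the grid actually used in this study (presented in appendix 3): the three band labels anticipate those described further below, whereas the criteria names and point values are invented placeholders. The sketch merely shows how one concise band judgment per criterion can be converted into a single numerical mark once each band has been tied to a value.

```python
# Illustrative sketch only: a compact marking grid reduced to its scoring logic.
# Criteria names and point values are hypothetical placeholders and do not
# reproduce the grid in appendix 3.

BAND_POINTS = {"insufficient": 0, "basic standard": 1, "target standard": 2}

CRITERIA = ["criterion A", "criterion B", "criterion C", "criterion D"]

def total_mark(ratings):
    """Turn one band judgment per criterion into a single numerical score."""
    return sum(BAND_POINTS[ratings[criterion]] for criterion in CRITERIA)

# A performance judged 'target standard' on two criteria and 'basic standard'
# on the other two would thus receive 6 of the 8 available points.
example = {
    "criterion A": "target standard",
    "criterion B": "basic standard",
    "criterion C": "basic standard",
    "criterion D": "target standard",
}
print(total_mark(example))  # -> 6
```

Whatever values are eventually chosen, such a grid can only do its job if every cell is filled with a short, observable, task-specific descriptor rather than a broad statement of overall proficiency.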
In this respect, it quickly becomes evident why the general proficiency descriptors in the various CEFR scales cannot be copied directly into marking grids that are applied to the rather narrowly focused performances in a single summative test. Thus, the A2 descriptor for ‘overall oral production’ states that the student

    can give a simple description or presentation of people, living or working conditions, daily routines, likes/dislikes etc. as a short series of simple phrases and sentences linked into a list. (p.58)

This certainly gives us a general idea of the type of performance that might be required, but for the precise attribution of marks in relation to various features of a specific speech act, it does not give us sufficiently nuanced criteria to work with (an impression that is compounded by the fact that even CEFR descriptor scales which target a more precise linguistic competence, such as ‘grammatical accuracy’ or ‘vocabulary range’, are equally generalising in nature). In fact, in a 9TE (or 6eM) class, many different students might produce an oral performance that globally fits this description, yet a variety of nuanced qualitative differences will certainly exist between them. Hence they might all have surpassed the very basic A1 expectations but still obviously be below B1 level; however, even if they all thus broadly operate within the scope of the A2 level, their performances will evidently still not be of exactly the same level of quality in every single aspect. As the numerical marks eventually awarded in a summative test should reflect such differences, they need to be based on descriptors that are more closely linked to the requirements of the actual test tasks. Hence, while the CEFR descriptors might give us a general idea about the degree of difficulty to be expected (and set) in A2-level tests, their inherent function remains to define overall proficiency, which can only be inferred from (and thus subsequent to) a range of performances instead of a single, precise one. Correspondingly, marking grids for individual summative tests do need to be informed by the realistic proficiency expectations defined by the CEFR descriptors to make sure that their highest bands are achievable by learners of a particular level (in this case A2); however, the CEFR descriptors themselves are too global and generalising in nature to be directly used in such grids and equated to numerical values.

Given the wide range of key criteria identified in relation to speaking performances (see section 3.1.), it was first of all necessary to construct a grid that regrouped these various factors in a logical and efficient way. However, rather than engaging in the daunting task of devising a completely new assessment scheme (which would have been extremely risky in terms of validity and reliability given my own limited experience with systematic summative assessments of speaking up to that point), it was a more sensible option to draw on already established ones that were compatible with the central aims of the first implemented speaking test. Since the nature and purpose of a number of test items was influenced by various elements appearing in Cambridge KET examinations, it made sense to generally align my main assessment criteria with the ones used in those high-stakes tests as well.
The Cambridge ESOL approach to assessment described in the corresponding Handbook for Teachers includes a focus on four broad categories, which permitted the inclusion of all the central features of speaking that I intended to concentrate on in this speaking test 23:

1. ‘Grammar and Vocabulary’, where the expected degree of accuracy could be globally defined in relation to those two core elements;
2. ‘Pronunciation’, which was usefully included as an important feature of speaking to be considered in its own right;
3. ‘Interactive Communication’, which, due to the different interactive exchanges in the speaking test, was certainly an essential criterion to be included in the marking grid;
4. ‘Global Achievement’, which regrouped content-related elements such as completeness, level of detail and relevance of the answers provided; it also took into account the extent to which the students were willing to take risks and experiment with the target language to clarify and expand their answers (rather than just doing the bare minimum). For instance, the required physical description of a celebrity based on visual prompts provided a prime example of an activity where students could achieve high scores in relation to such criteria in this particular test. Whereas a ‘basic standard’ performance would be marked by very short utterances (e.g. ‘He is tall. He has blue eyes.’), more adventurous and complex attempts would allow the student to reach the ‘target standard’ (e.g. through longer sentences and deliberate use of adjectives or adverbs, as in ‘This famous person is a very handsome, tall man. He has beautiful blue eyes.’).

23 See Key English Test – Handbook for Teachers, p.35 for a more detailed description of the entire Cambridge ESOL assessment scheme in regard to speaking skills.

The descriptors in the final marking grid established on this basis (see appendix 3) were inspired by the expected requirements for a successful completion of the different KET speaking parts as described in the corresponding Handbook for Teachers. Through the certified alignment of the Cambridge ESOL exams with the CEFR levels, one could also generally assume that the descriptors defining acceptable performance in relation to the different key criteria in this grid reflected standards that could genuinely be expected from A2 language learners 24. Descriptors were established for three different bands: ‘insufficient’, ‘basic standard’ and ‘target standard’. In between those three levels of performance, further nuances would have been virtually impossible to define without creating excessive overlap and confusion; therefore, I chose to leave out additional, unspecified bands altogether and to stick to this three-fold ‘gradation of quality’ instead. A particular numerical value then needed to be assigned to the achievement of each band.

24 For the establishment of this marking grid, I am indebted to Tom Hengen (an English teacher and member of the ‘groupe de travail - Socles de compétences pour l’anglais’ in Luxembourg), who provided me with the core layout and descriptors. I made slight alterations to the initial descriptors myself (e.g. using identical rather than merely synonymous expressions wherever possible, and adding corresponding entries to the other two bands where an element was present in one band but not the others).
As my 9e students had never taken a summative speaking test before, I wanted to avoid raising the stakes excessively so as not to put the pupils under too much pressure in their first attempt to deal with one. At that point in time, the call to attribute 20% of the term’s mark to speaking tests was an official suggestion but not yet a binding requirement 25; therefore, it was possible, exceptionally, to allot only 12 marks (rather than a possible but arguably too weighty 36) to this first-term test. That way, if a student turned out to be overwhelmed by the new task, or uncharacteristically underperformed in it for one reason or another, this would not immediately have a highly significant impact on his final mark for the term (and thus potentially even diminish his overall chances of progress). However, with the novelty wearing off in subsequent terms, I aimed to build on my students’ increasing experience and familiarity with such tests to gradually increase their significance and weighting over the course of the school year.

25 In 2009-2010, when this project was implemented, this overall weighting was still only a suggestion (but not an absolute, legal requirement) in 9e classes.

For the time being, with four major sets of criteria considered to be of equal importance, and three bands of performance defined for each of them (giving a maximum total of 4 x 3 = 12 marks), a very simple ratio was available to attach numerical marks to the marking grid in this instance (i.e. 1 mark per band attained per category; for instance, a student reaching band 2 across all four categories would score 4 x 2 = 8/12 marks in total). In this respect, it is important to note that students did not have to reach the highest level of performance to achieve a ‘pass’ mark in the test; instead, attainment of the ‘basic standard’ (or band 2) would be sufficient. On the other hand, full marks would only be awarded for a performance that corresponded to the ‘band 3’ descriptor for a particular criterion. Yet to be in line with a CEFR approach that (as seen in chapter 2) grants the A2 learner some (reasonable) remaining weaknesses, even these established ‘target standards’ crucially needed to avoid expectations of complete faultlessness. In terms of ‘pronunciation’, for instance, the highest band of quality defined in the grid still allowed for ‘occasional L1 interference’ as a feature of student speech ‘to be expected’ and thus to be deemed ‘acceptable’ (reflecting the CEFR’s approval of ‘a noticeable foreign accent’ in phonological control noted in 3.1.2. above). A similar awareness of interlanguage-induced interference explains the presence of relativised statements in the descriptors for the best possible ‘Interactive Communication’ (e.g. ‘in general, meaning is conveyed successfully’ or ‘can react quite spontaneously’).

The only exception to the rule was provided by the ‘Global Achievement’ category: as it measured the completeness and relevance of student answers in relation to central task expectations, this was the only area that warranted the perfect fulfilment of absolute targets (i.e. ‘all parts of the task’ were to be ‘successfully dealt with’) to earn maximal marks.
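Purely to make the arithmetic of this band-to-mark conversion explicit, the following minimal sketch (written in Python for illustration only, and not part of the actual assessment materials) translates band attainments into a total mark out of 12; the category names are taken from the grid described above, while the possibility of half values anticipates the ‘mixed’ performances discussed later in this chapter.

```python
# Illustrative sketch only: the simple one-mark-per-band-per-category ratio
# described above, applied to the four criteria of the first marking grid.

CATEGORIES = [
    "Grammar and Vocabulary",
    "Pronunciation",
    "Interactive Communication",
    "Global Achievement",
]

def total_mark(bands: dict) -> float:
    """'bands' maps each category to the band attained (1, 2 or 3; a half
    value such as 2.5 stands for a 'mixed' performance between two bands)."""
    assert set(bands) == set(CATEGORIES)
    return sum(bands.values())  # 1 mark per band attained per category

# A student reaching the 'basic standard' (band 2) in every category scores 8/12:
print(total_mark({c: 2 for c in CATEGORIES}))  # -> 8
```

On the same ratio, a uniformly ‘band 1’ performance yields the clearly insufficient total of 3/12 discussed in the following paragraph.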
The ‘Global Achievement’ category was also the only set of criteria that could have a fundamentally overriding effect on the general assessment: a ‘band 1’ performance in relation to these content-related elements (most crucially the corresponding realisation that ‘most of the task’ had been ‘dealt with insufficiently’) would necessarily have to lead to an insufficient mark overall, as the student would not have addressed large parts of the test in a suitable manner. At the lower end of the assessment spectrum, even minimal contributions would often be sufficient to earn the students at least one mark out of a possible three per group of criteria; only in cases of ‘totally incomprehensible’ or ‘totally irrelevant’ language samples (or indeed the complete absence of a rateable sample in the first place) was a categorical ‘0’ mark to be awarded for the student’s entire performance. This also corresponded to a CEFR-compatible approach, as each sample of language would thus be analysed in a fundamentally positive manner; even if it had not reached a sufficient level of proficiency overall, it would still be considered as carrying some merit and adequate features (even if these were few and far between). This does not mean, however, that students did not have to work hard to earn their marks; an insufficient (‘band 1’) performance in all areas would, after all, still receive a clearly insufficient total (3/12). Yet if a student had shown a serious effort to work to the best of his abilities, it would surely be wrong to discard an entire performance as completely worthless (for example by “awarding” an outright ‘0’ mark in ‘Grammar and Vocabulary’ simply on the grounds of a comparably high number of grammatical mistakes committed).

In general terms, as noted in the previous chapter, one of the key challenges in competence-based assessment resides in striking a satisfactory balance between achievement and proficiency factors. In this respect, marking grids seem predominantly suited to assessing overall proficiency. After all, it seems highly unlikely that one can fully anticipate what exact structures a student may (or may not) use in an act of speech, particularly if the corresponding test tasks target free production rather than discrete-item responses. In that sense, even if a central aim of a summative test consists in verifying the achievement of various small objectives, it is virtually impossible to incorporate a multitude of correspondingly narrow and precise descriptors into a single marking grid without running the risk of overloading it (and yet still overlooking some elements that might later surface in the actual student productions). This also explains why the terminology in marking grids cannot usually be absolutely precise, particularly in relation to form-focused descriptors: indeed, a reference to ‘basic sentence structure and grammatical forms’ in the ‘Grammar and Vocabulary’ section of a marking grid is much more flexibly applicable to a long, varied language sample than the meticulous, individual identification of discrete forms and structures. Moreover, if a speaking test (as in this case) includes a number of different tasks with different foci, it would be very complicated and time-consuming to establish (and use) a separate, specific grid for each individual task.
It thus comes as no surprise that for speaking tests such as those used in KET, for example, the general marking guidelines specify that ‘assessment is based on performance in the whole [speaking] test, and is not related to performance in particular parts of the test’. 26 Such an approach is very probably the most feasible one for regular speaking tests in the classroom as well, given the complex demands placed on a teacher who finds himself in the challenging position of having to administer the test and simultaneously assess multiple aspects of oral skills. Tellingly, the KET procedure deliberately splits up the test administration and assessment tasks between two different individuals. Thus, the assessment responsibilities of a first examiner, acting as the students’ ‘interlocutor’ during the test, are strictly confined to the ‘award[ing] of a global achievement mark’ 27. The other, more specific criteria of the students’ performances are explicitly focused on by a second “assessor” who does not intervene in the actual test discussion at all and can thus devote his entire attention exclusively to assessment.

26 Key English Test – Handbook for Teachers, p.35.
27 Ibid., p.35.

In the everyday classroom, of course, few teachers enjoy the luxury of having such a second assessor at their side, and so the full burden of assessment falls upon a single person (who also partially has to act as interlocutor and test administrator). Hence it is very clear that a general assessment in relation to a number of different criteria is much more feasible during (and in the few minutes immediately after) the test than one that goes into very precise detail. To maximise the efficiency of such an assessment, it is evidently an almost unavoidable prerequisite that the teacher has thoroughly familiarised himself with the different criteria (and the corresponding descriptors) in the marking grid prior to the test, so that he virtually knows them by heart and needs to invest as little time and concentration as possible in (re-)acquainting himself with them in the test situation itself.

During the administration of the particular speaking test described here, I quickly filled out a basic assessment sheet for each student (by inserting crosses into blanked-out sections of the marking grid) 28 in the short periods of time between examinations of different pairs of candidates. As I had recorded all test performances on an MP3 player, I was then able to listen to the various conversations again later and, if necessary, slightly amend assessments in relation to criteria that I might not have been able to focus on in sufficient detail during my role as interlocutor. In a way, I thus took over the roles of both KET examiners at different stages of the assessment process; moreover, by being able to “revisit” each conversation in a less stressful and time-pressured context, I managed to focus on the various criteria more closely, which in turn reduced the risk of relying on “gut-feeling” assessments (hence increasing scorer-related reliability in the process). At the same time, however, it is of course important to resist the temptation of “overcorrection”; the intended aim is certainly not to spend hours and hours on a single student sample (and potentially become excessively focused on a particular aspect in the process).

28 See appendix 4.
The overall proficiency signalled by student performances across the various test tasks is thus a first indication to go on in the assessment of their speaking skills (hinting at the levels of competence reached). However, that does not mean that confirmation of the achievement of more precise objectives cannot additionally be sought on the same occasion. In that sense, the speaking test described here certainly provided numerous valuable clues as to whether my 9TE students had attained some of the intermediate objectives of the entire term, even if the latter were not explicitly mentioned in the marking grid. In this case, the language elements that had previously been covered in class included, for example, the following:

• Grammar: correct use of the present simple and past simple (including correct question formation and short answers in both tenses), personal (subject and object) pronouns and possessive adjectives;
• Vocabulary: discrete expressions used for physical descriptions (e.g. hair, eyes, build, general appearance), daily routines, family relationships and free-time activities;
• Pronunciation: correct vowel sounds, such as the distinction between ‘i’, ‘e’ and ‘a’ (and adequate use of English phonemes in general);
• Interactive Communication: asking for repetition or explanation of a particular utterance in English.

With the arguable exception of accurate physical descriptions, none of these elements were tested in clear and deliberate isolation over the course of the speaking test. On the contrary, most of them were flexibly combined and blended within the students’ overall speaking performances. For that reason the test was, by definition, not suitable for a pure achievement assessment in relation to the individual mastery of those various discrete items. Nevertheless, as a significant amount of time had been spent on elements such as the past and present simple tenses in the classroom during the weeks and months prior to the test, a reasonably high degree of accuracy could evidently still be expected in relation to such familiar items. In that sense, consistent failure to use them correctly (obviously indicating persisting errors in the students’ interlanguage) would of course constitute a more legitimate reason to mark a student’s performance down (in relation to ‘Grammar and Vocabulary’ criteria) than ambitious yet partially unsuccessful experimentation with grammatical or lexical items that had not previously been encountered (or much less practised) in the classroom. This forcefully underlines the complicated link that existed between achievement and proficiency considerations in this assessment: while the adequate mastery of precise structures and items was not directly demanded by the predominantly proficiency-based descriptors in the marking grid, a generally satisfactory performance in the test was evidently impossible if none of the grammatical and lexical elements encountered in class had been acquired and internalised (i.e. precise learning objectives achieved) to a sufficient degree. In other words, without the achievement of at least some specific objectives in the first place, there can ultimately be little or no proficiency as a result (especially at this early, ‘pre-intermediate’ level of language learning, where no steadfast foundation in terms of knowledge and skills has been laid yet).
In that sense, this summative speaking test was certainly capable of providing insights into both detailed achievements and the learners’ general proficiency; in different ways, both aspects could be derived from the descriptors used in the marking grid.

3.2.6. Analysis of test outcomes

All in all, this first summative speaking test yielded fairly satisfactory scores. While the average mark achieved was not particularly high (7.5/12, with marks ranging from 5.5 to 11.5 29), there were no disastrously low marks at all, indicating that none of the students were completely overwhelmed by the tasks they had to deal with in the test. In other words, the basic standards that had been set were well within the students’ reach, showing that the test had indeed ‘matched their language level’ to a satisfactory extent. Only three students ended up with a marginally insufficient final mark (5.5 in all three cases), although another four pupils just about passed by scoring exactly 6/12. On the other hand, a third of the class managed to score more than 8 (i.e. 2/3 of the maximum mark), including two particularly impressive performances (incidentally from two students who were neither native speakers nor re-taking the year) that were rewarded with final marks of 10.5 and 11.5 respectively.

29 See appendix 5 for a detailed breakdown of individually obtained marks.

It is perhaps not particularly surprising (yet definitely worth noting) that the target criteria which students found hardest to meet were to be found in the predominantly accuracy-focused ‘Grammar and Vocabulary’ section: in this area of the assessment scheme, the highest number of insufficient marks was given (eight students received 1/3) and the class average was correspondingly lower (1.64/3) than in the remaining three categories. However, as indicated by the general results above, in more than half of those cases the students managed to make up for those apparent weaknesses by meeting the ‘basic’ or ‘target’ standards in other sets of criteria (which were more closely linked to fluency). In fact, only a single insufficient mark was awarded in each of the remaining three categories; the highest class averages were reached in the ‘Interactive Communication’ (2.05/3) and ‘Pronunciation’ (1.98/3) sections. Hence a learner performance was not automatically doomed to receive an insufficient overall mark simply because it might have contained a fairly high number of mistakes in terms of accuracy.

Nevertheless, a few legitimate questions arise in regard to the disproportionate number of cases where the basic standard in ‘Grammar and Vocabulary’ had ostensibly not been met. Had the expected standards been set too high? Had the test tasks been too difficult in this area? Had I been particularly harsh as an assessor in relation to accuracy? To what extent might insufficient revision on the part of the students be to blame as well? In general, a combination of all or several of those factors is easily imaginable. In this case, however, I would argue that the vast majority of ‘basic standard’ descriptors in the marking grid did not seem exaggerated at all: based on the CEFR scales, it is certainly appropriate to expect the A2 student to ‘use a limited range of structures’ and ‘have sufficient control of relevant vocabulary’ for the tasks set.
The test tasks themselves appear reasonably valid as well, since they mainly recycled familiar and pre-taught grammatical structures and lexical items and could thus be safely assumed not to be excessively difficult; furthermore, their topical focus was, after all, of an ‘everyday’ nature that should be rooted firmly within the A2 level. Yet it may seem questionable whether more than a third of my students really showed such ‘little control of very few grammatical forms and little awareness of sentence structure’ that they only deserved a ‘band 1’ in ‘Grammar and Vocabulary’; therefore, my own approach to this part of the assessment needs to be examined more closely in this particular instance.

On the one hand, it must be noted that numerous students actually did show consistent problems with a number of grammatical foci that had been extensively practised over the course of the term. For instance, errors recurred in regard to question formation during the S-S interactive activity, where students had to construct suitable questions from cues, even though the very same type of activity (using similarly structured cues) had repeatedly been implemented in class. In those attempts, the most frequent mistakes consisted in the wrong use or complete omission of auxiliary verbs, as well as the confusion of past and present tenses (e.g. ‘*When is he born?’). If numerous mistakes noticeably occurred in relation to various pre-taught elements within a single student’s test performance, the mere mention of ‘some grammatical errors of the most basic type’ in the corresponding ‘band 2’ descriptor sometimes seemed too lenient to be attributed to that performance 30. Hence the ‘band 1’ description of ‘little control’ appeared more appropriate in such cases, especially as the more controlled test tasks (such as the S-S question-answer one) essentially targeted only ‘few grammatical forms’. In that sense, the assessment was firmly guided by a focus on achievement factors in this section of the test.

30 See appendix 3 for the entire marking grid.

However, one must not forget that the remaining sections of the speaking test also needed to be taken into consideration before it was legitimate to reach a final decision about the generally displayed level of grammatical and lexical proficiency. In other words, it is essential for the assessor not to let a “bad” performance in one test section taint his impression of the student’s efforts in other parts of the test as well; if one is excessively focused on the achievement of discrete objectives, the “bigger picture” may be forgotten in the process. At the same time, it is understandable that teachers may expect a fairly high degree of grammatical accuracy in regard to particular grammar points to which extended periods of regular class time have been devoted. This also explains why I tended to clamp down on grammatical mistakes (in relation to such pre-taught items) rather severely in my assessment strategy in this instance. As a more constructive corollary, however, the recurrence of similar problems across different student performances allowed me to identify discrete elements that still commonly posed problems to the learners and thus especially needed to be further reinforced (and would thus be re-integrated into objectives to be achieved) in future classroom activities.
On the other hand, various criteria and descriptors used in this marking grid also proved to be rather problematic during the actual implementation phase. One element which arguably further explains the students’ relatively low scores in ‘Grammar and Vocabulary’ can be seen in the phrasing of a particular ‘band 1’ descriptor in this category: namely, the stipulation that an insufficient performance would be characterised by ‘speech’ that ‘was often presented in isolated or memorised phrases’. In itself, this judgment might be appropriate for longer turns where students have to keep talking in a fairly spontaneous and uninterrupted fashion for an extended period of time (in the case of A2 learners, of course, this might only mean one or two minutes); if a student fails to utter more than a few disconnected parts of sentences in such an instance, then the performance can indeed be deemed ‘insufficient’. However, as the majority of the test tasks were designed in such a way that the students could usually respond by means of single sentences, the inclusion of such a descriptor in the lowest defined band became more controversial. Across large parts of the test, the use of ‘isolated’ and even partly ‘memorised’ phrases could in fact be judged to basically fulfil task requirements. Similarly, the choice to mention ‘minor hesitations’ as a feature of ‘basic’ rather than ‘target standard’ could (rightly) be considered fairly harsh; would it not be normal even for remarkably proficient A2 learners to be allowed to hesitate briefly in spontaneous speech acts? The first practical experiments with this marking grid thus certainly revealed a number of discrete formulations that needed to be reconsidered in the next assessment of speaking performances.

In a more general way, the decision to blend two elements as diverse as grammar and vocabulary into a single ‘Grammar and Vocabulary’ category also turned out, in retrospect, to be questionable in a number of ways. One particularly detrimental effect was that some students who clearly demonstrated that they had assimilated a suitable range of topical lexis could still fail to reach the highest band in this accuracy-based section of the grid because of their remaining problems in grammar (and vice versa). Of course, one may argue that extreme discrepancies are very unlikely to occur between the grammatical and lexical competence levels of a single learner; high proficiency in grammar is rarely accompanied by very severe deficiencies in vocabulary-related competences. A complete mismatch between the quality of a learner’s grammar and vocabulary may thus be a rare exception, and a fairly parallel development of competence might normally be expected to take place in relation to those two language components. However, while such considerations might help to explain the decision to combine them into a single group of criteria in a marking grid, the very general way in which both elements respectively had to be described in this instance did not necessarily do sufficient justice to the importance and intricacies of either of them. A clearer separation of ‘grammar’ and ‘vocabulary’ foci thus already seemed necessary for subsequent assessments of speaking performance, simply because the complexities inherent in each component could not be analysed in a sufficiently thorough and targeted way on this occasion.
In addition, one might question the exactly equal weighting that had been attributed to the two elements of ‘Grammar and Vocabulary’ and ‘Pronunciation’ in this assessment. Even though the latter category included global, fluency-centred descriptors (e.g. ‘speech is mostly fluent’) in addition to the precise analysis of phonological control (e.g. ‘L1 interference still to be expected’), the weight it carried in the marking scheme in comparison to grammar and vocabulary ultimately seemed somewhat exaggerated. Without denying that adequate pronunciation is (as discussed in 3.1. above) an important feature of any successful speech act, one may wonder whether it truly deserves the same value as the combination of such core linguistic components as grammar and vocabulary. Hence, the inclusion of pronunciation-related criteria in this marking grid may have been a perfectly legitimate decision; nevertheless, some fine-tuning in terms of sensible weighting seemed necessary in subsequent speaking tests.

Finally, with respect to the total scores ultimately awarded, the noticeable presence of “half” (.5) marks highlights an issue presented by the actual structure of the grid used in this instance: namely, the absence of midway bands between the three explicitly defined ones. Even though this had initially been (as described above) a deliberate choice, its effects repeatedly proved problematic in practice. At various times during the assessment procedure, I noticed that full marks in relation to one of the four main criteria could not always be justified in a very straightforward manner. The simple reason for this was that a particular student performance often did not simultaneously correspond to all the descriptors that were defined and grouped in relation to a particular criterion in one single band. Hence, one student’s speech might have been ‘occasionally strenuous’ (‘Interactive Communication’, band 2), yet the same student could also have shown an ability to ‘react quite spontaneously’ and ‘ask for clarifications’ (‘Interactive Communication’, band 3). Similar examples of “mixed” performance occurred in relation to each of the other three major criteria sets as well, underlining the fact that the quality of a given speech act does not always fit handily into a single band description (evidently, the more descriptors and criteria one assigns to a specific band in a single category, the likelier such an outcome becomes). As my marking sheets had not sufficiently anticipated such a need for flexibility, I had to improvise during the assessment by putting crosses on the line separating two different bands and correspondingly awarding half marks if no clear tendency towards either the higher or the lower band could reasonably be discerned. For future assessments, however, the provision of two extra, midway bands seemed necessary, even if they would evidently not require specific sets of descriptors (but cater for such ‘mixed’ performances instead).

All in all, the implementation and assessment of this first speaking test thus yielded various useful insights. On the one hand, it clearly showed that even at such an early stage as the first term of 9e, my students could deal with tests that genuinely called upon their speaking skills, provided that the corresponding tasks had been suitably prepared in class and adapted to the A2 language level.
On the other hand, the assessment tools and strategies used in this instance could still be more closely tailored to the students’ real needs and capacities. This would therefore constitute one of my main aims in the second summative assessment of my students’ speaking skills in the subsequent term.

3.3. Case study 2: comparing places and lifestyles / asking for and giving directions.

For this second speaking test in my 9TE class, I once again chose the end of term as the most suitable point in time for implementation, as this would give my students the chance to build up further lexical and grammatical resources in some depth over the course of almost three entire months. As the strategy I pursued in the build-up to the test mirrored the one from the first term (i.e. constructing knowledge of topic-based vocabulary and fundamental grammatical structures, practising pair work activities and task types to be included in similar form in the test), I will forgo a more detailed description of general classroom activities in this section of the chapter. Instead, I will focus exclusively on the design and outcomes of the specific test tasks and the modified assessment strategies used.

3.3.1. Speaking test 2: description of test items and tasks

Although I once again wanted the speaking test to start with a confidence-building activity, excessive repetition of (or overlap with) elements from the first speaking test – and thus undue predictability – needed to be carefully avoided this time around. Therefore, while the opening activity in this second test still essentially revolved around T-S interaction, it was set up in a significantly different way from the very direct question-answer scheme used at the beginning of the previous term’s test.

Although the students once again took the test in pairs, the starting task was an individual one. After the candidates had picked a sealed envelope from a pack and handed it to me, I opened it and gave one of them (student A) a handout with two contrasting visual prompts 31, which served a number of purposes. First of all, the pictures had been carefully selected so that they showed places, objects and (to a lesser extent) jobs or activities the students would be able to describe in some detail with the lexical knowledge they had built up in relation to various thematic areas over the course of the term (or even the year). Since topics such as different places, cultures and holidays abroad had all been covered in class, a correspondingly substantial number of test sheets included pictures of different types of locations (e.g. city and countryside, mountains and beaches…). As variations (so that students taking the test at later stages would not be able to prepare their answers in advance), pictures relating to other topics treated in class, such as different jobs and lifestyles (e.g. rock stars or ultra-rich people versus “normal” people with “regular” jobs) or various free-time activities (e.g. outdoor sports versus video gaming), were also included.

31 See appendix 6 for samples.

In a first stage, student A had to describe one set of pictures in as much detail as possible. This part of the activity evidently had a clear achievement aim: students were supposed to demonstrate that they had acquired appropriate vocabulary range and accuracy in relation to topics encountered in numerous classroom activities.
At the same time, the visual prompts functioned as an important support for the learners in their efforts; rather than being engaged in completely free discussions from the outset, students could gradually “warm up” by simply describing thoroughly familiar depicted items that should pose no particularly high cognitive challenge to them (evidently this was only possible if they had paid suitable attention in class and done adequate revision – two prerequisites which any summative test should be allowed to presuppose as well!). As this first long turn aimed to verify the student’s ability to deal with the task independently, my role as teacher would be to provide as few prompting questions or hints as possible; in that sense, a few hesitations on the part of the learner were absolutely acceptable in this instance. On the other hand, if the students got stuck immediately and risked not being able to achieve this fundamental task at all, I helped them with a few guiding questions to “nudge” them in the right direction. However, the more prompting I needed to offer, the more negative the impact would have to be on the corresponding marks in regard to core features like fluency and global achievement.

Once the student had finished describing both pictures in isolation, the activity expanded on the identified topic in a second stage. As the two visual prompts represented an inherent contrast, I now asked the students to compare them with each other and name the most striking differences. In that way, further achievement foci were added; this time, however, they were not just of a predominantly lexical nature. Instead, as the ability to compare various people, places and objects had constituted a core learning objective over the course of the term, this part of the speaking test aimed to activate a precise aspect of the students’ grammatical competence as well (by verifying their ability to use comparative and superlative structures in context). Occasional prompting was once again necessary, for example if the students failed to identify adequate contrasts which they could elaborate on. In such instances, pointing out a specific feature in the pictures would often suffice to guide the students in the right direction and trigger further verbal contributions.

Finally, the activity then opened up into a freer discussion: the student was asked to state a preference in relation to the two depicted elements (for example to choose the place where he would rather spend a holiday or live permanently) and to justify this choice in a few words or sentences. My decision to include such a section of fairly free discussion was primarily guided by the consideration that this least controlled part of the activity provided an indicator of general proficiency rather than of specific achievements. Indeed, it was less predictable what particular students would mention in this part of the task; instead, they would have to display their general level of fluency to elaborate spontaneously on the choice they had made.

Once student A had completed this set of individual activities, another handout was given to his or her partner (student B) and the same procedure was repeated. Once again, though, I had meticulously tried to avoid overly similar pictures on the two handouts; thus the second student would have to use a different set of topic-related expressions and could not just recycle or even copy his peer’s utterances.
In its entirety, this initial set of activities also allowed me to obtain important indications regarding wider aspects of the students’ general attitude to speaking. Thus, the length and level of detail of each candidate’s speech would not only illustrate overall fluency, but also in many cases provide hints about the individual’s actual willingness to engage in speaking and, correspondingly, to take a number of necessary “risks”. Hence, some students were reluctant to give more than monosyllabic descriptions and answers, while others clearly sought to elaborate in depth on the various pictures and questions instead, even if this sometimes meant that they had to improvise in terms of lexis and experiment with more complex structures they might not have previously encountered. While the first category of students reduced the possibility of making language mistakes in their utterances by keeping their answers very short (ostensibly limiting the negative impact on overall accuracy), they evidently risked providing incomplete answers and contributions in relation to the topic and the test expectations. In contrast, the potentially higher number of mistakes in the more extended speech of the higher risk-takers could easily be compensated for by the greater credit earned under content-related criteria. Considering that a faultless performance is not expected at A2 level anyway, it was certainly the second type of student behaviour and performance that the test ideally targeted and that offered the most logical and likely path towards high overall marks.

If the longer individual turns of the first half of the test had been mostly production-oriented, the second half of the speaking test was once again designed to trigger S-S interaction (and thus truly exploit the fact that students were taking the test with a partner). In this particular case, a short role play activity was used to that effect: in their pairs, the students had to show their respective abilities to ask for and give directions (an activity which the students were familiar with, as it had been practised via similar role plays in class prior to the test). As visual support, both students were therefore given a handout that contained the same fictional map and brief written instructions (detailing their respective roles and tasks) which they were allowed to read through quietly 32. In the context of the depicted city, student A played the role of “tourist” while his partner had to pretend to be a local inhabitant. At the start of the activity, student A had to tell student B where his starting position was, and then ask his partner for directions to a precise location from there (in this duly rehearsed type of question, the student was also supposed to show evidence of his sociolinguistic competence by choosing an appropriately formal register for this polite inquiry). Once student B had given a corresponding first set of instructions to the “tourist”, the entire operation was repeated (with the same roles but a different destination). Then the students were asked to turn over their test sheets; on the second page, they found a different city map which the pair of candidates had to use to repeat the exercise (twice) with reversed roles. By the end of the activity, both students would thus have had to show their ability to assume both roles in such a semi-authentic situation.

32 See appendix 7 for samples.
The decision to make each student give directions to two different locations in their respective cities had been made for two interlinked reasons. On the one hand, this extended the length and complexity of the produced language sample, so that a more solid foundation for assessment was available. On the other hand, the second set of directions given by the student could either confirm errors that he had already made in the first one (by committing them again) or reveal them to have been exceptional ‘slips’ if he got the same type of information right in the second attempt. In turn, this would increase the reliability of the performance sample as a basis for proficiency assessment, since it would allow for a more detailed and accurate error analysis. As a whole, the exercise also offered the opportunity to verify the students’ handling of interactive strategies such as asking for clarification (although they occasionally had to be reminded to use the target language for this).

For this test task, I had used a variety of maps to compile the student handouts. This was based on the idea that student B could then not copy his peer’s instructions too closely; neither would early test takers be able to give any precise pointers to later candidates. While some maps were used more than once (for instance by asking for different itineraries in different test sets), this might be considered a slight risk in terms of test reliability: given that not all students had to give directions in relation to the same map (or exactly the same itinerary on one map), the absolute, intrinsic comparability of their results might have suffered to a certain extent. However, as I had invariably taken and adapted all the maps from various ‘pre-intermediate’ coursebooks, their level of difficulty could still be considered reasonably similar. Moreover, the itineraries that I had deliberately incorporated into the instructions generally required a similar number of directions to reach the respective destinations. At the same time, of course, different itineraries were often possible, offering a certain amount of choice to the test candidate; in turn, this made the test task less rigidly controlled and thus allowed for somewhat more varied and spontaneous oral performances.

In general, one might argue that the items and tasks used in this second speaking test offered more frequent opportunities for genuinely free oral production when compared to the numerous closed questions and rather strictly guided S-S interaction (where precise cues had to be respected) in the first test. In that sense, one may reasonably argue that both content and construct validity were further raised in view of verifying overall oral proficiency in this instance, since the students had to fill in more information on their own and were free to expand on their answers to a greater extent. In other words, the various speaking tasks transcended the fairly narrow lexical-grammatical scope of the exercises that had still characterised parts of speaking test 1. Both in the free discussion at the end of the first set of activities and in the various possible details the students chose when giving directions in part two, they could thus use their oral production skills in a fairly free and authentic manner that may not have been possible to the same extent in the first speaking test.
On the other hand, of course, this progression was only natural in the sense that students had had another entire term to further develop their general proficiency level in the target language. Hence, the further the school year progressed, the more possibilities I would evidently get to “open up” the possible answer scope in individual test tasks as a result of my students’ steadily increasing stock of available language resources.

3.3.2. Form, strategy and theoretical implications of assessment used

As a number of issues had surfaced with respect to the marking grid used during the practical implementation of the first speaking test (see 3.2.6. above), various modifications had to be applied to my assessment strategy in this second attempt. Thus, my students’ competences in relation to the two vast concepts of vocabulary and grammar were to be appreciated separately this time in order to get a more nuanced view of their progress in terms of both fluency and accuracy. The arguably disproportionate weight initially attributed to pronunciation factors in the overall assessment needed to be addressed as well, and the appropriateness of individual band descriptors had to be verified even more closely in terms of their suitability for A2-level learners.

In my search for a differently organised assessment scheme, I found a particularly useful tool in a marking grid specifically developed (and published in the national syllabus for 8e and 9e TE 33) for such a purpose by a group of language teachers who had been officially charged with the mission of designing and adapting competence-based and CEFR-aligned ways of assessment for English courses in the Luxembourg school system 34. A look at the general layout of this grid 35 reveals the useful organisation of different speaking criteria into the following four categories:

1. ‘Content / task response’: while this section roughly corresponded to the ‘Global Achievement’ criteria used in the previous marking grid, it did not overtly include considerations about “attitude” or “risk-taking”. However, such factors could still be derived from the meaning-related descriptors and thus be indirectly incorporated into the assessment. Hence, the ‘band 3’ (or ‘basic standard’) descriptor stipulated that ‘basic meaning [was] conveyed’ but ‘some effort’ was necessary ‘on behalf of the listener’. In contrast, the more proficient ‘band 5’ descriptor attested that ‘communication [was] handled’ and ‘meaning conveyed successfully’ (while no ‘effort on behalf of the listener’ was implied). It is certainly impossible to communicate more than ‘basic meaning’ if one only provides monosyllabic or otherwise minimalistic answers.
Even if a willingness to expand on individual points in more depth does not automatically guarantee that more than ‘basic meaning’ is ‘successfully conveyed’ (as sufficient accuracy might, for example, still prove a problem), it is nevertheless a prerequisite in terms of attitude to make the achievement of a higher band possible in the first place.

2. ‘Pronunciation and discourse management’: although pronunciation still appeared in the title and (rightly) in the content of this group of criteria, a closer look at the various descriptors quickly revealed that actual phonological control and intonation skills constituted only one of several components here. In comparison to the marking grid used in the previous term, the ‘discourse management’ elements taken into consideration in this instance added more focus on the connected nature and overall fluency of the speaking performance as a whole. In that sense, elements such as coherence and cohesion, largely absent from the ‘pronunciation’ section in the first marking grid, were given more consideration this time.

3. ‘Lexis’: reflecting the desired separation from grammatical foci, this section was firmly focused on vocabulary range and accuracy, while an interesting addition consisted in the explicit and consistent reference to paraphrasing skills across all defined bands. For language learners, the ability (and deliberate strategy) to look for alternatives if a precise expression eludes them is most certainly an important skill to develop in order to keep an act of communication from breaking down completely. For that reason, the inclusion of ‘paraphrase’ in this section was certainly a sensible one. Of course, this criterion may not surface in every student performance; however, if attempts to paraphrase are made, this can further reveal the students’ willingness to take risks and become creative to work around temporary setbacks.

4. ‘Grammatical structures’: as a result of the separate focus on grammar, more nuanced and varied criteria were included in this section in comparison to the broader ‘Grammar and Vocabulary’ category in the previous marking grid I had used. Thus range and accuracy were more strictly separated into individual descriptors, while elements such as the use of ‘basic forms’ (band 3) or ‘both simple and complex forms’ (band 5) specified in a little more detail the qualitative differences that one could expect within the various levels of proficiency defined. Interestingly, accuracy was also partly expressed through a quantitative appreciation of grammatical errors (‘frequent’ in band 3 but only ‘occasional’ in band 5).

33 See the relevant syllabus for 9TE on http://programmes.myschool.lu/ for details.
34 The group in question is the ‘groupe de travail - Socles de compétences pour l’anglais’; one of their stated aims (according to their workspace on www.myschool.lu) consists in ‘establishing CEFR-based standards and benchmark levels for English to be reached by secondary students in Luxemburg as well as developing the corresponding descriptors’. While I joined that taskforce during the period of time when this thesis was written, the marking grids for speaking and writing described here had already been developed prior to that; I had thus played no part in establishing them.
35 See appendix 8 or p.18 of the 9TE syllabus (http://programmes.myschool.lu/) for the complete grid.

In terms of general organisation, this marking grid also fulfilled another requirement that had emerged from my first assessment of speaking skills in the previous term: intermediate bands had been inserted between the three defined ones of ‘insufficient’, ‘basic standard’ and ‘target standard’. These five ‘gradations of quality’ catered for the type of “mixed” performances that had repeatedly surfaced in regard to a specific group of criteria in the first speaking test. Thus, ‘band 2’ was defined as containing ‘features from band 1 and 3’, while ‘band 4’ took a similar midway position between bands 3 and 5.

A close alignment with the CEFR is visible throughout this grid. Thus, the ‘discourse management’ descriptor in band 3 states that a basically suitable learner performance in a given test is characterised by ‘very short basic utterances, which are juxtaposed rather than connected’.
This expectation is clearly within the A2 level of the CEFR, according to which one should expect that the student ‘can give a simple description ... as a short series of simple phrases and sentences linked into a list’ in ‘overall oral production’ (CEFR, p.58). However, it should be noted that in order to obtain the highest possible mark in relation to a specific criterion, the performance in a summative test needs to match ‘band 5’. In this area, the marking grid in question certainly respects the CEFR indication that even the best possible performance (in free production) is not devoid of remaining shortcomings. In the ‘lexis’ section of the grid, for instance, the ‘band 5’ descriptor simply attests to an ‘adequate’ vocabulary range ‘for the task’. This does not imply very complex and extensively developed lexical knowledge overall, but simply indicates that the learner has shown that he can handle appropriate topical lexis in ‘everyday’ contexts that are suitable for A2 learners (such as descriptions of people or places); evidently it also presupposes that the test task itself has been designed to stick firmly to such ‘everyday’ contexts and needs. Anything that goes beyond such familiar contexts causes problems, which is duly noted in ‘band 5’ in the sense that ‘attempt[s] to vary expressions’ occur ‘with inaccuracy’ and ‘paraphrase’ is only used ‘with mixed success’. Indeed, this mirrors the vocabulary-related CEFR descriptors which postulate that an A2 learner ‘has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics’ (‘Vocabulary Range’, CEFR p.112) and ‘can control a narrow repertoire dealing with concrete everyday needs’ (‘Vocabulary Control’, p.112).

Yet there are also some cases where the expectations in ‘band 5’ of this grid are slightly higher than what the CEFR defines as ‘A2-typical’. The most striking example occurs in the ‘Grammatical Accuracy’ section: the best possible performance is here described as showing a ‘good degree of control of simple grammatical forms and sentences’, which seems to clash with the CEFR descriptor for ‘grammatical accuracy’ at A2: ‘Uses some simple structures correctly, but still systematically makes basic mistakes’ (p.114). Indeed, this general expectation is more compatible with the ‘band 3’ grammar descriptors in the marking grid (‘limited control of basic forms and sentences / only a limited range of structures is used’). Does this, then, take us back to unfairly high expectations (in regard to the attainability of maximum marks) that students working within A2 should not be supposed to achieve? Not necessarily. In fact, this is one case where the non-school-based origin of the CEFR is strikingly highlighted. As already suggested in the previous chapter, the CEFR does not specify how knowledge and skills are developed over time; for that reason, it also fails to take into account how a reinforced focus on elements such as grammar and vocabulary in a school environment can raise the expectations in relation to student performance in those areas. Evidently, this holds particularly true for a summative test that is concerned with achievement assessment to a significant extent; it is normal that, in relation to pre-taught objectives, ‘systematic…basic mistakes’ should ideally be kept to a minimum. In that sense, they should certainly not appear in the best possible type of performance in such a test; for a ‘sufficient’ (i.e. ‘basic standard’) performance, however, they will still be acceptable as long as the overall communicative message clearly comes across.
‘basic standard’), however, they will still be acceptable as long as the overall communicative message clearly comes across.
3.3.3. Analysis of test outcomes
Following my intention to gradually increase the importance of speaking skills in view of the overall mark of the term 36, I chose to allot a total of 20 marks to this second summative oral test.
36 See section 3.2.6.
In doing so, I was able to use the simple multiplying factor of 1 again to arrive at a final mark from the applied grid (which included 4 categories of criteria and 5 qualitative bands and thus yielded a maximum total of 4 x 5 = 20 marks). As this marking system included the two intermediate bands 2 and 4, this also had the beneficial consequence that I did not have to use any “half” marks this time. Overall, the results of this speaking test showed some encouraging tendencies. For instance, the average mark of the class was 14.67/20, which corresponded to a significantly higher average (73.35% of the maximum obtainable) than the students had generally achieved in the first speaking test (62.5%). This was partly due to the fact that only one insufficient mark (7/20) and two marginal “pass” grades (10/20) had been awarded this time 37; in contrast, six very high scores had been achieved (including four perfect scores as well as one 19/20 and one 18/20).
However, in view of the vast number of variables involved, it would of course be a mistake to read too much into these numbers: evidently, different grammatical and lexical items were tested on the two occasions, which makes it virtually impossible to compare the actual degree of difficulty of both tests; during the time between the two test occasions, two fairly “weak” students had left the class and one “stronger” student (with a semi-native-speaker background) had joined; different types of test tasks and corresponding marking grids were used in these two cases; a whole range of student-related reliability factors (e.g. less test anxiety in the second instance, or even the simple fact of having “good” or “bad” days) could have played a part; and so forth. In purely statistical terms, then, the progress expressed by these two average marks should not be rashly overestimated. However, one conclusion that one may certainly draw from these results is that the students in my ‘pre-intermediate’ class had encountered no major problems in dealing with the various speaking tasks they were confronted with in this test. The qualitative expectations implied in the A2-based marking grid had thus been reached by almost all of them; in fact, the majority of students operated between the ‘basic’ and ‘target standards’ at this point in time (which was evidently a desirable development, as it was a general aim to push as many students as possible towards the ‘target standard’ by the time the A2 learning cycle was completed at the end of the subsequent term). Moreover, several parallels existed between the results in this test and the previous one. For instance, the two individuals who had scored the highest marks on the first occasion were also among the best performers in this test (accounting for two of the perfect scores); similarly, the two students who marginally passed the second test had already achieved 6/12 and 5.5/12 in the previous one.
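Purely as an illustration of the arithmetic described above, the following sketch (in Python) shows how a final mark out of 20 is derived from the four criterion bands with the multiplying factor of 1, and how averages from tests with different maximum marks can be compared as percentages. The function and variable names and the sample band profile are my own and hypothetical; only the four criteria, the five-band scale, the factor of 1 and the reported averages are taken from the description above.

# Illustrative sketch only (not part of the thesis or the official marking grid):
# it reproduces the mark arithmetic described above for the second speaking test.

CRITERIA = ["content", "pronunciation and discourse management",
            "lexis", "grammatical structures"]
FACTOR = 1  # multiplying factor used to turn the grid total into the test mark

def final_mark(bands):
    """Sum the four band scores (1-5 each) and apply the multiplying factor."""
    assert set(bands) == set(CRITERIA)
    assert all(1 <= b <= 5 for b in bands.values())
    return FACTOR * sum(bands.values())  # maximum: 4 criteria x band 5 = 20 marks

def percentage(mark, maximum):
    """Express a mark as a percentage so tests with different maxima can be compared."""
    return round(100 * mark / maximum, 2)

# Hypothetical student profile: two criteria at band 3, two at band 4.
example = {"content": 4, "pronunciation and discourse management": 3,
           "lexis": 3, "grammatical structures": 4}
print(final_mark(example))        # 14 (out of 20)
print(percentage(14.67, 20))      # 73.35 - the class average reported above
print(percentage(7.5, 12))        # 62.5  - implied average of the first test (out of 12)

Because every band is a whole number between 1 and 5, any combination of four bands already yields a whole mark out of 20, which is why the intermediate bands 2 and 4 removed the need for “half” marks.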
Apart from a very limited number of exceptions, the same students who had scored comparably “high” marks in the first instance had done so again in this second test; similar tendencies could be established for those with “average” and “low” marks, respectively 38. As established in chapter 1, such empirically indicated consistency of results contributes to confirming the general reliability of both tests (in terms of effectively separating “good” and “poor” performances).
37 It should be added that the insufficient mark was mainly due to a very negative attitude on the part of the corresponding student, who did not take the test seriously at all and did not even try to deal with the content requirements in an adequate manner. In essence, that learner thus did not give a true account of his real abilities.
38 The students’ results in both tests can be compared by means of the graphs in appendices 5 and 10.
In view of these overall results, it is also particularly worth mentioning that the highest-scoring students could actually be divided into two categories. On the one hand, there were those who usually did fairly well in all other types of tests as well (focusing on the other three major skills); as generally “strong(er)” learners, their good oral performances were thus perhaps fairly unsurprising. On the other hand, two students who had obtained high marks in both speaking tests (one of whom was the student who had achieved 11.5/12 in the first one) had in fact received fairly average marks in previous, predominantly written tests. In one case, this was largely due to weaknesses in grammar and orthographic control; in the other, the student had mainly had problems dealing with written test tasks in the allotted time and had thus repeatedly lost a lot of marks through unfinished work. The fact that both these students excelled in speaking activities very clearly underlines the general necessity to adapt our approach to testing much more closely to the undeniable variety of learner types in our classrooms in order to create a more level playing field for all of them.
The results of the class in relation to the four main groups of criteria in the marking grid showed a remarkable consistency. In three of them, the average mark achieved was exactly the same (3.48/5); the very marginally “weakest” feature in the students’ performances emerged in relation to pronunciation and discourse management (where the average mark achieved was a slightly lower 3.38/5). Only one insufficient mark (2/5) was awarded in each of the ‘content’ and, remarkably, ‘grammatical structures’ categories; three insufficient scores were reached in each of the two remaining categories (2/5 in five cases, plus one 1/5 in ‘pronunciation and discourse management’). In comparison to the previous speaking test, the students had thus seemingly managed to improve their performances in terms of grammatical and lexical accuracy. Again, however, great caution is advised in the interpretation of such results. Thus, virtually the only explicit grammatical achievement focus in this second test was placed on the correct use of comparative forms in the students’ first turn (predominantly based on free production), while the (S-S) interactive question-answer task in speaking test 1 had involved multiple grammar challenges (such as the correct use of auxiliaries, tenses and word order).
In fact, the latter task had generally been more mechanical and guided in nature, thus almost automatically inviting the assessor to put the sufficient mastery of grammatical structures under increased scrutiny.
At the same time, however, the rather punishing ‘band 1’ descriptors of lexical competence from the first marking grid had been revised this time. Instead of mentioning that ‘speech’ was ‘often presented in isolated or memorised phrases’ as characteristic of an insufficient performance, the grid used in this instance stipulated that ‘only isolated words or memorised utterances [were] produced’ at ‘band 1’. While this may seem like a mere semantic nuance on the surface, the revised wording in fact only applies to much more limited linguistic samples in practice. As a result, it was certainly less likely this time that a very low band could be awarded in regard to this particular criterion, since most students had tried to develop their oral contributions at least to the extent of producing short sentences. As a whole, this example once more stresses the importance of rigorously choosing sensible and appropriate terminology (in regard to the test tasks and the overall proficiency level) in every single descriptor of a chosen marking grid.
The separation of ‘grammar and vocabulary’ criteria in this assessment of speaking performance provided some similarly useful insights. After the test, a clear tendency was noticeable when comparing the scores that individual candidates had obtained in relation to the four main criteria: in all cases, the four marks that had been separately awarded were either absolutely identical or very close to each other (i.e. at most one mark higher or lower than in the other criteria). In fact, each student’s overall performance corresponded to a combination of at most two (adjacent) bands across the four main criteria 39. On the one hand, this seems to confirm the abovementioned suspicion that most student performances will not present major discrepancies as far as grammatical and lexical competences (for instance) are concerned. On the other hand, however, the usefulness of separating both criteria and judging each one individually is powerfully illustrated by the example of one student who ultimately received a marginal “pass” mark. This particular student’s performance was marked by several long pauses, particularly in the individual long turn at the beginning: as this learner struggled to find appropriate vocabulary items to describe the two visual prompts (and, since she failed to paraphrase instead, received a lot of prompting on my part), she ultimately received ‘band 2’ scores in the ‘lexis’ and ‘discourse management’ categories. However, throughout the test (and especially in the ‘giving directions’ task), she displayed a sufficient level of competence in terms of using grammatical structures: in that respect, her speaking performance thus reached the ‘basic standard’ (i.e. 3 marks), and this essentially swayed her final score towards the marginally sufficient mark she ultimately received.
39 See assessment grid for student 14 in appendix 9.
Had grammatical and lexical accuracy still been judged together, an ‘insufficient’ band could easily have been attributed in this case, as the negative impression left by the ‘vocabulary’ component could have tainted the overall assessment of the entire group of criteria.
This example also underlines the key necessity of avoiding ‘gut-feeling’, holistic impressions (which teachers are, unfortunately, all too often inclined to rely on when confronted with such problematic speaking performances). Instead of prematurely extrapolating the clearly perceptible problems in one particular area to the entire effort made by the student, it is important to consider to what extent the remaining components of the speaking performance might actually reach the basic standards that are to be reasonably expected at a given level. This is not to say that one should be excessively lenient in one’s general approach to assessment; far from it. A student will certainly not always be able to compensate for weaknesses in one area through sufficient strengths in others, and in many cases an insufficient performance in ‘lexis’ may in fact very well be accompanied by a correspondingly poor handling of grammatical structures (for instance). However, especially in cases where the decision over ‘pass’ or ‘fail’ is a very close and difficult call, meticulous consultation of the individual criteria in a marking grid will, as in this case, lead to a much more informed and justified decision than a possibly “prejudiced” holistic impression would. Evidently, this can only be the case if the marking grid itself constitutes an appropriate tool for a valid and reliable assessment of the student’s level of ability in relation to reasonable standards. Not least through the abovementioned general alignment of the ‘basic standards’ with the realistic performance criteria defined for the A2 level in the CEFR, that prerequisite was effectively respected in this instance.
Yet even with such useful tools at one’s disposal, tricky situations can (and most probably will) still occasionally arise over the course of the assessment process. In this particular speaking test, this was strikingly exemplified by a noticeably recurring mishap: in the second part of the test, a number of students consistently seemed to confuse the essential expressions ‘left’ and ‘right’ in their attempts to give directions. Should the assessor interpret this as evidence of limited control of topic-relevant lexis and thus correspondingly lower the mark awarded in relation to this criterion? At first, such a decision certainly seems reasonable. However, on closer inspection, this confusion of very basic lexical items was in fact often accompanied by other, correct details (such as appropriate lexis and prepositions of position used to describe the buildings near the expected destination). Clearly, then, the student had identified the correct itinerary on his map, but failed to express this in a completely adequate way by including decisive accuracy-based mistakes in his speech. Yet was this really the result of an error in the student’s interlanguage? After all, this was rather unlikely given the very reasonable assumption that most students in 9TE should be able to distinguish between ‘left’ and ‘right’! In fact, the error-inducing factor on these occasions was very probably not the students’ lexical competence, but rather their ability to read a map correctly – clearly a competence that was not language-related in the first place. For the assessor, this presents a certain dilemma: is it possible to award full marks in terms of ‘content’ or ‘lexis’ if the actual speaking performance led us to a wrong destination?
On the other hand, is it justifiable to deduct marks in this language-focused assessment on the basis of a competence that is not, in fact, linguistic in nature? In this instance, I chose to give my students the “benefit of the doubt” and thus refrained from clamping down on this clear mistake whenever all the other indications the student had given were in fact accurate. At the same time, however, this certainly underlines how unexpected difficulties can at times arise in designing adequate competence-based and contextualised test tasks. Furthermore, it also shows that even if all the criteria in the marking grid correspondingly used for assessment have been painstakingly and reasonably adapted to the linguistic requirements of the test, they may still not always make it possible to anticipate all the controversial aspects of speaking performance that can ultimately arise. In such cases, the assessor’s analysis of potential (linguistic and non-linguistic) error sources remains the most essential factor in avoiding an unfair impact on the final mark that is eventually awarded.
In this chapter, the theoretical implications and empirical data of two implemented speaking tests have shown that the competence-based assessment of students’ speaking skills is certainly feasible even at ‘pre-intermediate’ level. At the same time, however, the analyses of the corresponding two case studies have revealed numerous complex factors and prerequisites that one needs to take into account during the respective design, implementation and assessment phases of such tests to ensure that their outcomes are not only reasonably valid and reliable, but also sensibly and constructively interpreted.
Before more general conclusions are drawn about the best possible place and form for competence-based assessment in our national English curriculum, I will now turn to a similar study of possible ways to design tests and conduct assessments in relation to the productive skill of writing.
Chapter 4: Competence-based ways of assessing writing at A2 level.
Whereas completely new room currently needs to be made for the systematic testing and assessment of speaking skills at lower levels, a strong insistence on writing has traditionally constituted a central feature of the assessment culture in our national school system. In that sense, a different type of innovation is necessary in relation to this particular productive skill: in contrast to speaking, the question to be addressed is not whether it should be tested, but rather how this can (and should) best be done in a competence-oriented teaching and assessment system. After all, as outlined in chapter 1, exercises that merely require the insertion of a single word or even the construction of isolated sentences cannot be regarded as valid evidence of true writing skills. Key modifications are certainly needed to make this area of summative testing more compatible with the logic and requirements of a fundamentally competence-based approach. To highlight the essential factors that make such a shift actually desirable, the starting point of this chapter will be a close analysis of a more “traditional” type of written test task and the correspondingly used assessment methods in a specific practical example. In a subsequent stage, I will then trace the steps which are necessary to effect a change towards possible strategies of testing and assessment that truly tap into writing competences even at A2 level.
4.1. Reasons for change: a practical example
4.1.1. Description of the implemented test task
Over the course of a school term, timing can (and usually does) have a very significant impact on possible test content. This was no different in the case of the very first summative test that I implemented in my ‘pre-intermediate’ class 1 during the early stages of the 2009-10 school year. The fact that this test had to be implemented only a few weeks into the first term effectively limited my options in test design in a number of ways.
1 The class referred to here (a 9TE at the Lycée Technique Michel Lucius) was the same one that was described in more detail in section 3.2.1. in the previous chapter of this thesis.
After all, not much time had been available to extensively practise writing in the classroom and thus familiarise the students with more skills-oriented task types than the (rather heavily knowledge-based) ones they had been used to from their experiences in the previous school year. Moreover, their target language resources were evidently still rather limited at that early point in time as well. Nevertheless, I still wanted to include a writing task that gave the students a chance to produce a longer and more personalised language sample; this would provide me with evidence of their ability to express themselves in writing in a wider and more communicative context than was possible in the “English in use” section of the same test. Naturally, to ensure sufficient validity and a feasible degree of difficulty of such a test task, it needed to build upon the same thematic areas and grammatical structures which had constituted the central foci of most classroom activities. In this case, the realm of everyday, routine actions and habits, as well as the correspondingly necessary grammatical structures to describe them (i.e. the present simple tense and adverbs of frequency), constituted that fundamental core. Correspondingly, the free writing activity that I included at the end of this first summative test had the following instructions:
Describe your ‘holiday routine’. What do you always (or often) do when you’re on holiday? Write about 60 words. (8 marks)
In themselves, these instructions certainly represent a prime example of a ‘classic’ task for a ‘pre-intermediate’ summative test that essentially revolves around a central achievement focus: in this case, the students’ ability to incorporate and use the present simple tense in their own, personalised written productions. Very similar activities are invariably bound to have been dealt with in class prior to such a test; for instance, the students will almost inevitably have been asked to describe their everyday ‘term-time’ routines in class (or in a homework assignment) by writing down a number of sentences about their daily habits (either in a simple list or even in the form of a connected paragraph). In fact, in this particular instance, comparable activities had additionally involved a short group work game: in small groups, the students had been asked to imagine the daily routine of a famous person of their choice. Based on adequate hints and tips in those descriptions (i.e. suitably distinctive and singular actions inherent to a given person’s exceptional lifestyle), their classmates then had to guess which person each group had respectively described.
The corresponding test task thus built on prior classroom activities and simply gave a slightly different context for the necessary description of daily habits.
As such, a certain amount of content and face validity was ensured; after all, the students really would have to write meaningful sentences (thus producing the actual behaviour the task intended – and purported – to test) in relation to a topic that they had encountered in similar form before (which was evidently important for a summative test). However, a close look at the instructions shows that specifications about the exact qualitative requirements which a suitable answer would have to fulfil were rather scarce. In fact, apart from a vaguely defined topic (holiday routines) and an indicated word limit (leaving room for slight but unspecified deviations on either side of the target of 60 words), no further details were given. As a result, numerous questions were left unanswered, particularly in terms of the suitable structure of the text: was a list of isolated sentences appropriate, or did the text have to be connected, for example through linking words? How complex and precise did individual sentences have to be? What was the minimum number of different routine activities that had to be mentioned, since no target had been given in terms of sentences to use or various ideas to include? Hence, what exactly did the students have to respect to ensure they could achieve maximum marks in this exercise?
4.1.2. “Traditional” method of assessment used
The absence of such clearly indicated (or inferable) specifications ultimately meant that the corresponding assessment of individual student productions also stood on fairly shaky ground. After all, what could be the decisive and justifiable factors to determine a ‘good’ or ‘bad’ mark in view of such vaguely defined content requirements? Indeed, a student who had simply enumerated a sufficient number of very simple holiday activities could, in essence, be adjudged to have adequately fulfilled all content-based task requirements (provided that he had neither surpassed nor failed to reach the word limit). Given the limited range of structures used in such a basic and simple performance, the student would evidently have reduced the risk of committing numerous mistakes at the same time. However, what about those students who had attempted to connect their various ideas into a flowing, coherent text and perhaps even to use more precise and varied expressions in the process? The instinctive answer of many assessors will surely be that the better structure (and possibly more complex language) of such answers, and the correspondingly better overall impression they produce on the reader, should logically translate into a higher content-related mark; in fact, in a fundamentally norm-referenced assessment scheme, this is usually precisely what happens. This can be exemplified by looking at three actual student productions that these particular task instructions yielded in practice 2:
2 See appendix 11 for copies of the original student productions.
Student A: On holiday I get up at 11 a.m., then I go to the bathroom and have a shower. I brush my hair. Then I go to the kitchen and eat something. I go to the computer and speek with my friends. About 2 o’clock I go out with friends, we go on the beach and we have a good time. (60 words)
Student B: I never get up before 8 a.m. in my holiday. I wake up at 9 a.m. and I stay for a few minutes in my bed. Then I wash myself in the bathroom and I get dressed. I have breakfast at 10 a.m. After my breakfast, I go out with my dog or I play football with my brother. After my lunch, I like going to swim or playing playstation3.
I eat with my parents in a restaurant in the evening. At twelve o’clock I go in my bed. (90 words)
Student C: I wake up before the restaurant closed. I get dressed and I go eat. I take a juice and pancakes with chocalat. Than I go in my room and wash my teeth and I take my bicini. After I go to the piscine and stay there. In the evening I take my bath and get dressed for the supper. Than I go with my parents to the restaurant. We eat fish,... . (65 words)
In strictly content-based terms, one could argue that all three students basically fulfilled the topical requirements by describing things that they usually did on holiday (even if, in the case of production B, it is unclear whether the student was describing an experience at home or abroad). In terms of coherence, the general structure of their answers was also fairly similar – all three descriptions were more or less chronologically sequenced to take the reader through a whole, typical day (even though student A ended her description at an earlier point in time than the other two pupils). However, the cohesive devices linking these actions had been used with varying degrees of skill: whereas students A and B managed to indicate the order of actions rather clearly (using expressions such as ‘then’, ‘after’ and other precise time indications with relative ease), student C tended towards a more rudimentary enumeration of separate actions (while also misspelling the time word ‘then’). The different productions were rounded off in diverse ways as well: while student B logically concluded his text by explicitly referring to the end of the day, student A left her description more open-ended (though still providing closure to a certain extent through the general indication that she and her friends had ‘a good time’); in contrast, student C’s very short and rather random final sentence (‘We eat fish’) would definitely not provide a suitable conclusion to any flowing text. As a consequence, the general impression made by the overall structure of productions A and B was certainly a more positive one than was the case in the final example. In a holistic, norm-referenced assessment, the assessor might thus understandably tend toward higher ‘content’ marks for the first two productions. However, the theoretical justification of such a decision might prove problematic. Indeed, the instructions had not actually asked for the specific, structured description of an entire day but only for routine actions in a more general manner: would it be fair, then, to actually subtract marks for an awkward ending if no coherently structured format had been defined and required in the first place? Naturally, one might still argue that the statement ‘we eat fish’ did not fulfil task requirements in another respect: in the absence of adverbs of frequency, and considering the student’s generally discernible tendency to struggle with grammar, it is not absolutely clear whether she really intended to point to an actual holiday habit (or just to a general fact). Since student C had actually failed to refer explicitly to the ‘holiday’ context in her entire answer, a maximum mark for ‘content’ would most probably have been exaggerated in this instance.
On the other hand, as she had still mostly given information that suited the topic, her answer was much more relevant than some other texts that had been produced in the same test (for instance, one student had failed to read the instructions carefully enough and had therefore simply repeated the type of content that had appeared in the group work activity in class, i.e. she had described a normal day in the life of a celebrity – a clearly unacceptable disregard of instructions). In that sense, answer C broadly deserved a sufficient ‘content’ mark; even the comparably less cohesive structure of this particular production could not legitimately be drawn upon to justify a lower mark in comparison to those of students A and B, as this had not been clearly stated (or implied) in the instructions as a criterion to be respected.
In regard to purely language-related factors, it is clear that student C’s answer was most affected by problems of accuracy as well; misspelled words (‘than’, ‘chocalat’, ‘bicini’…), wrong vocabulary items (‘piscine’) and occasional grammar mistakes (e.g. ‘closed’ instead of ‘closes’) were more frequent in this production than in those of students A and B. Correspondingly, answer C unsurprisingly received the lowest score in this area when compared to the largely error-free performances of her peers. The differences were of a more nuanced nature in relation to the productions of students A and B; neither of them had demonstrated major problems with accuracy in their respective texts. In purely quantitative terms, student A’s production was characterised by the fewest ‘slips’; however, student B had gone into more extensive detail, which had evidently increased the chance of actually committing mistakes (on the other hand, of course, he had not respected the word count and could thus lose another mark in relation to that criterion). While these occasional grammatical or lexical mishaps did not impede successful communication in either of those two cases, their mere presence was often enough to lead to a slight deduction of marks (i.e. 3 or 3.5 would be awarded instead of 4).
The abovementioned distinction of two broad criteria reflects a rather traditional and widely used assessment method for free writing tasks, which consists in allotting part of the mark (normally 50%) to content and the remaining 50% to a holistic appreciation of language features such as grammatical and orthographic accuracy as well as vocabulary range and control. In both areas, the highest possible mark would often only be attributed to a virtually faultless performance. As this was also the approach that I adopted in this particular case, no student eventually reached maximum marks; instead, virtually all pupils scored between 3/8 and 7/8 overall. However, the limits of this assessment method quickly became evident: absolutely clear marking criteria had not been laid down in terms of either ‘content’ or ‘form’. Similarly, the task had not indicated specific content points that absolutely had to be included or a precise format and structure to respect. Therefore, qualitative distinctions that were made between individual productions were difficult to translate into marks according to a consistent and rigorously defined system; norm-referencing (i.e. the comparison of one student’s effort to those of others) often had to bridge that gap instead.
In a similar vein, the ‘form’ (or ‘language’) assessment was not backed up by clearly defined criteria that took into account realistic expectations for an A2-level performance; instead, holistic overall impressions ultimately often turned out to be decisive. On the other hand, the students’ various productions had shown some interesting tendencies: even though the instructions had not explicitly asked the learners to present their answers as a connected text, virtually all of them automatically linked their individual ideas within a cohesive overall structure. While this was done (as discussed above) with varying degrees of complexity, it clearly indicated that the corresponding assessment tool needed to become better suited to taking such efforts (and their relative success) into account. In turn, the actual task instructions also had to become more specific, not only to make the marking system more efficient and reliable, but also to provide more detailed guidelines to the students as to what was expected of them. To find practical ways of truly getting my students involved in more purposeful free writing, it was thus first of all necessary to identify the central features that were required for an appropriate design of suitable, competence-based test tasks.
4.2. Key features of a ‘good’ free writing task
As Harmer rightly reminds us, ‘the writing of sentences to practise a grammar point may be very useful for a number of reasons, but such exercises are not writing skill activities’ 3.
3 Harmer, op.cit., p.249. Italics added.
To a certain extent, then, asking for a few sentences about routine activities did not represent a truly valid writing task. In fact, if the students had simply followed the instructions handed to them in the aforementioned first test “to the letter”, the entire task could easily have resembled an extended grammatical/lexical exercise rather than one that aimed at communicative production, even if it had effectively left the students a certain amount of freedom to express their own ideas in a slightly less controlled way. In that sense, it did not fully respect one of the most salient features that such “free writing” activities should be founded on: language production means that students should use all and any language at their disposal to achieve a communicative purpose rather than be restricted to specific practice points. 4
4 Ibid., p.249. Emphasis added.
In the precise context of a summative test, of course, one might slightly revise this statement: since such a test is fundamentally concerned with verifying achievement of ‘specific practice points’, it is only natural that the latter should be targeted as cornerstones of a corresponding writing task as well (after all, you can only “test what you teach”). Hence, if the vast majority of classroom time has been spent on contexts and thematic areas that require the present simple tense, the main writing task in the ensuing summative test would not have a high degree of validity (and reliability) if it suddenly asked the learners to describe a past experience or to make predictions for the future. However, the central expressions in Harmer’s statement are evidently those of ‘communicative purpose’: only if the students see a clear aim which their written production should achieve, and thus understand why they are writing, is the task meaningful and goal-oriented. Learners will indeed approach a task in very different ways depending on what their written message is for: hence, they can for example be asked to describe,
analyse, complain or persuade 5; in each case, the purpose of their writing will inevitably have a significant and varying impact on the style and overall content of their answers.
5 This distinction of various possible communicative purposes in writing tasks is based in part on Clarke, ‘Creating writing tasks from the CEFR’, p.2.
To increase the overall sense of purpose of a given writing task, it evidently helps if the authenticity of the communicative act is heightened: if the task mirrors a context that is relevant in “real life”, then the students are more likely to recognise its usefulness and thus might be more willing to develop – or, in a test, activate – the necessary skills and knowledge to master it to a satisfactory degree. Evidently, most written productions at school will still almost inevitably constitute examples of what Brown calls ‘display writing’: the students will, to a certain extent, always remain aware that their writing does not have an actual effect on the “real world” (as, for instance, a letter to an external person or company would do), but usually stays confined to classroom (and often assessment) purposes instead. In that sense, it is certainly true that ‘writing is primarily for the display of a student’s knowledge’ 6 and skills in a school context.
6 Brown, Teaching by Principles, pp.395-396.
Nevertheless, this realisation should not keep us from trying to get as close as possible to authentic situations, contexts and text types: hence, while Harmer agrees that ‘many…writing tasks [at school] do not have an audience other than the teacher’, he also rightly insists that this ‘does not stop us and them working as if they did’ 7.
7 Harmer, op.cit., p.259. Italics added.
In a similar vein, Brown stresses that even ‘display’ writing can still be authentic in that the purposes for writing are clear to the students, the audience is specified overtly, and there is at least some intent to convey meaning. 8
8 Brown, Teaching by Principles, p.403.
This points to a further significant element that needs to be kept in mind when creating meaningful writing tasks: the implied audience of the written production. Indeed, very diverse requirements and characteristics will affect a student’s communicative message if its intended recipient is (for example) a close friend or family member or if, in contrast, the piece of writing is addressed to a teacher, a newspaper or even a potential employer. Closely linked to this notion is evidently a necessary awareness of different genres, which in turn results in a suitable choice of style for the written production 9.
9 Harmer, op.cit., p.259.
This selection is evidently dependent on the format of the written production that is required: for instance, informal letters, emails or postcards all come with different conventions from their formal counterparts; similarly, a story needs to be written in a style that varies significantly from the one used in a report or review. This also takes us back to the importance of sociolinguistic competence alluded to in the previous chapter: within a particular context for writing, the students should learn to respect appropriate conventions and thus, for instance, operate within a suitable register as well.
In all of these aspects, of course, it is highly important that a sensible selection is made by the teacher with his students’ level of proficiency in mind; in the case of ‘pre-intermediate’ learners, for instance, it is clear that fairly short, informal text types (such as emails or letters to friends) are more suitable than complex argumentative essays. A similar reflection imposes itself on a final element of key importance to a convincing writing task: the topic of the expected production. It goes without saying that the choice of subject matter needs to fit the learners’ language level if the writing task is to be valid and reliable. In A2 classes, for instance, an exaggerated degree of complexity must of course be avoided; immediate communicative needs are thus much more likely to characterise an appropriate choice of topic at that level. In this respect, the CEFR once again proves a particularly useful tool, as it implies suitable thematic areas (and the corresponding communicative acts) that learners can and should be expected to deal with at the various levels of proficiency. In the case of ‘A2 creative writing’, for instance, the Framework stipulates that the learner ‘can write about everyday aspects of his/her environment, e.g. people, places, a job or study experience…’ as well as come up with ‘short, basic descriptions of events, past activities and personal experiences’ (CEFR, p.60). Within such generally defined fields, however, one should not underestimate the importance of identifying topics that coincide with the learners’ interests. As Harmer points out:
[i]f students are not interested in the topics we are asking them to write…about, they are unlikely to invest their language production with the same amount of effort as they would if they were excited by the subject matter. 10
10 Ibid., p.252. In addition to this intrinsic motivation, one might of course point out that the marks awarded for good performances in summative tests would evidently also provide further (extrinsic) motivation to ‘invest [one’s] language production with [a high] amount of effort.’
In a class of twenty (or more) adolescent individuals, this is of course not always an easy task. Even if teachers try to adapt their course and test contents to their teenage students as far as possible, ‘there is no magical way of ensuring that [the pupils] will be engaged with the topics we offer them’; furthermore, the learners’ inherently different personalities oblige teachers ‘to vary the topics [they] offer them so that [they] cater for the variety of interests within the class’ 11. Syllabus requirements will have to be taken into consideration as well, since they will generally indicate a range of thematic areas to be covered over the course of the school year. Nevertheless, trying to present the respectively treated subject matter in interesting, creative and engaging ways (and conferring a communicative purpose on it) is certainly important not only to engage the learners in classroom activities in general, but also to generate the best possible productions in free writing tasks. In view of these multiple and simultaneous requirements, it quickly becomes clear that simple instructions such as ‘describe your holiday routine’ fail to take into account a number of important factors that are necessary for a truly meaningful writing task.
Hence, the only criterion that these instructions actually fulfilled was a vague indication of topic; neither the audience nor the format had been specified, although either could have given pointers about the expected style of the answer. Fortunately, however, it is not necessarily very difficult to transform such a basic set of instructions into a meaningful, more authentic and suitably contextualised task. In this case, for instance, a simple yet very useful variation would be the following formulation:
You are on holiday in a different country. Write a postcard to a friend (or to your family) at home. Tell him/them what you (usually) do there every day.
In that case, the topic would stay virtually the same; one might even deliberately split up the required content by asking for typical morning, afternoon and evening activities (although this would of course restrict the student’s freedom to personalise his answer). A clear communicative purpose would have been added: writing to describe different, daily habits and to inform friends or family about them. The presence of this specified audience would further confirm the informal style to adopt, whereas the format that the text should take would also have been explicitly stated; it could even be reinforced through the visual representation (on the test sheet) of an empty, “authentic” postcard to write on. In terms of necessary lexical and grammatical resources, this variation of the task would not significantly raise the level of difficulty, either; however, the students would have to provide some evidence of their sociolinguistic competence by expressing their message in a way that would suit the typical style of a postcard. Following similar guidelines in the design of writing tasks in subsequent summative tests was thus essential to ensure that more meaningful and contextualised written samples could be produced by my students.
11 Ibid., p.253.
At the same time, however, this also meant that the assessment scheme would have to be altered accordingly, so that a sufficiently nuanced appreciation of the students’ writing skills would be possible.
4.3. Central features of interest to the assessment of writing
All the main features which writing and speaking performances generally share (see section 3.1.1.) evidently need to play a fundamental role in the specific assessment of written productions as well. However, as briefly noted in the previous chapter, structural elements such as coherence and cohesion usually take on a bigger role in the appreciation of students’ writing efforts than is the case in the assessment of their speaking performances. Thus, a higher degree of coherence (i.e. the logical sequencing of ideas) is generally expected in writing tasks due to the greater amount of time that is available to the students for the thoughtful planning of their written productions (in comparison to the often more spontaneous communicative behaviour in speaking tasks). Cohesion, on the other hand, is ‘a more technical matter’ that concerns ‘the various linguistic ways of connecting ideas across phrases or sentences’ 12: the students have to show an appropriate use of discrete linking words (such as ‘and’, ‘because’, ‘so’, but also time words like ‘then’ or ‘before’) to ensure not only a logical but also a “mechanical” link between the various parts of their answers. As a result, this criterion rather blurs the line between ‘content’ and language-related (in this case lexical and syntactical) factors.
In addition, writers are more often expected ‘to remove redundancy’ and ‘to create syntactic and lexical variety’ than speakers 13. Therefore, even students at lower levels are usually required to avoid repeating the exact same expressions to some extent (for instance, to indicate the order of successive actions in time, learners should vary their linking words as far as they can; instead of simply starting each new sentence with ‘then’, they are thus expected to interject alternatives like ‘before’ and ‘after that’). This increased importance of both coherence and cohesion in writing is inherently due to the ‘distant’ nature of this medium in comparison to the more immediate context in which speaking takes place. In oral communication, an interlocutor may immediately ask for clarification if he has trouble following the logic of the speaker’s line of argument; when reading a written production (in the absence of its author), this is impossible.
12 Ibid., p.246.
13 Brown, Teaching by Principles, p.398.
Making one’s intended meaning clear not only through appropriate vocabulary range and control but also through accurate use of structural devices is therefore an important skill for language learners to develop especially in writing, where their final product has to speak for itself. The general structure of a text can be further underlined through the appropriate use of features such as punctuation and paragraphing. In this respect, it is interesting to note that the CEFR only includes both features in ‘orthographic control’ from level ‘B1’ onwards 14; before that, at A2, only the ‘reasonable phonetic accuracy’ of words is described (CEFR, p.118). This represents a clear example of the type of ‘inconsistency’ in the descriptor scales which Alderson et al. rightly criticised (see chapter 2). Indeed, if we want our students to produce meaningful, contextualised and “authentic” language samples, such features can certainly not be ignored completely. After all, using basic punctuation marks to indicate different sentence types (such as questions or exclamations) is surely not beyond the reach of A2 learners; similarly, some paragraph breaks might be reasonable to expect in test tasks such as letter writing, especially if the learners have repeatedly been confronted with a particular format (and have practised using it themselves) during regular classroom activities. At the same time, one must not forget the potentially useful effects of plurilingual competences: since English is not the first language that pupils learn in our school system, they will already have grown accustomed to punctuation and paragraphing conventions in other languages (at least to a certain extent). On the one hand, this might of course occasionally lead to interferences between the different languages they have learnt (especially, for instance, in regard to language-specific elements such as quotation marks) – in error analysis, this interlingual influence is an important factor to keep in mind. On the other hand, it also means that students who start learning English in the Luxembourg school system (as their third or even fourth foreign language) do not generally do so without any knowledge of how a written text may be logically and usefully structured (particularly given the heavy focus on writing activities in their other language courses). Certainly, it once again seems commendable to make use of those pre-existing abilities instead of pretending that they were simply non-existent.
Finally, a significant alteration seemed sensible in comparison to some writing tasks that appear in A2-level proficiency tests such as KET: the length of the samples that the students were supposed to write.
14 At level B1, ‘spelling, punctuation and layout are accurate enough to be followed most of the time’. (CEFR, p.118)
The imposed word limit in the most extensive KET writing activity is, in fact, set at a mere 25-35 words 15. However, if a student’s sample is to reveal most (or all) of the aforementioned features of writing, and the task has been defined so as to impose a ‘true’ communicative purpose, then this very limited word count offers very little opportunity for a free and fluent expression of ideas. Moreover, as the very first summative test had already shown, the students in my ‘pre-intermediate’ class had demonstrated a clear ability to meet much higher requirements. Therefore, it seemed more appropriate to offer the students a wider possible answer scope (on which, in turn, a detailed assessment of general proficiency could also be more safely founded). As a result, the length of written productions which I usually aimed for in free writing tasks in summative tests was often in the region of 80-100 words instead (for tasks that were worth about 8 to 10 marks).
15 See Key English Test for Schools – Handbook for Teachers, p.11 / p.19.
4.4. Defining an appropriate assessment scheme for writing tasks
Due to the multitude of factors to take into account when assessing a ‘true’ free writing performance, it was clear from the outset that a marking grid would again be the most appropriate tool to use after suitable test tasks had been designed and implemented. The corresponding criterion-referenced approach not only allowed me to focus on a variety of salient features of the students’ overall writing skills, but also to get away from the possible caveats of excessive norm-referencing: instead of merely comparing students’ performances with each other, each individual production could thus be gauged against equal and unvarying standards that were founded on realistic targets for ‘pre-intermediate’ language learners. To ensure a general and valid alignment with the CEFR A2 level in this process, I once again referred to an official marking grid that had been published for assessments of ‘writing’ in the 9TE syllabus and that constituted the direct counterpart to the ‘speaking’ grid referred to in the previous chapter 16. The syllabus itself states that ‘it seems logical to develop separate marking grids for writing and for speaking’ 17; given the divergences between both skills (and the correspondingly produced language samples) identified above, this certainly seems a sensible decision.
16 See appendix 12 or p.21 of the syllabus for 9TE (July 2010 version) published on http://programmes.myschool.lu/
17 Syllabus for 9TE (July 2010 version) on http://programmes.myschool.lu/, p.14.
However, before moving on to the direct application of this tool in precise practical examples, it is necessary to specify and analyse what differences actually existed between both grids, and in what ways these distinctions could be theoretically justified. In general, both the ‘speaking’ and ‘writing’ grids usefully respected a common layout that reflected and underlined the numerous common features of the two productive skills while simultaneously ensuring a certain degree of consistency among the applied assessment systems.
As a result, three of the four main criteria from the ‘speaking’ grid reappeared in the one for ‘writing’: namely, the general categories for ‘content’, ‘lexis’ and ‘grammatical structures’. For obvious reasons, however, the fourth main set of criteria from the oral assessment grid, ‘pronunciation and discourse management’, was not applicable to written productions; instead, its place was taken by ‘coherence and cohesion’ in the marking grid for writing. Although these two interlinked features had already been integrated into the ‘discourse management’ category of the ‘speaking’ grid, the choice to single them out as a main set of criteria in this case (instead of merely including them as subordinated components) handily reflected their increased importance in writing performances alluded to above. The overall five-band system (including the two useful intermediate bands 2 and 4) had also been retained in the ‘writing’ grid, further increasing the consistency between the marking systems for the two productive skills. A close analysis of individual criteria also revealed the following interesting features:
1. The ‘content’ criteria included both qualitative and quantitative factors. The former were evidently necessary to analyse the student’s ability to ‘clearly… communicate the message to the reader’ (band 5). However, an interesting nuance in comparison to the ‘speaking’ grid resided in the more precisely defined quantitative dimension of the ‘content’ criteria. Thus, an ‘insufficient’ performance would be characterised by the fact that ‘only about 1/3 of the content elements’ had been ‘dealt with’ in the student’s written production; two thirds would correspond to ‘basic standard’, whereas the ‘target standard’ (and thus maximum marks) could evidently only be reached if all ‘content elements’ had ‘in general’ been ‘addressed successfully’. At first sight, this quantitative gradation evidently seems rather arbitrary: indeed, while it is self-evident that 33% of the required answer content cannot possibly warrant a ‘sufficient’ mark, why set the expectations for the middle band at 66% of the necessary content rather than (for example) 50%? In fact, a generally implied message of these descriptors is of course that a ‘basic standard’ performance should still contain clear evidence of the student’s general understanding of (and ability to deal with) the task requirements. Even if the student has not addressed all the content points, the majority of his answer should still unmistakably carry sufficient relevance to the topic. As a result, even if the ‘message is only partly communicated to the reader and/or not all content elements’ have been ‘dealt with successfully’, the bulk of the given information still tends to be appropriate. If answering only half the content points were set as the expected ‘basic’ requirement, it would be much more difficult for the assessor to decide which half of such a “hit and miss” performance was to be seen as truly representative of the student’s ability (or lack thereof) to deal with the task. Therefore, it is certainly appropriate to set the bar a bit higher in this instance to make sure that fundamentally ambiguous assessments are avoided.
Another important inclusion in this area of the marking grid was the reference to the student’s ‘awareness of format’; indeed, if writing tasks firmly aimed to be as ‘authentic’ and contextualised as possible, the content of the students’ answers could not possibly be entirely adequate if they did not observe the respective conventions of the required text type. To a certain extent, this criterion could thus, for example, be drawn upon to verify sociolinguistic factors such as appropriate register in the written production.
2. As mentioned above, the two elements of coherence and cohesion play a particularly important role in writing because of the ‘physical…and temporal distance’ 18 between the student’s composition process and the assessor’s act of reading and interpretation: the learner does not get a second chance to clarify his line of argument after handing in his written production. Hence, if all the relevant information is present in a student’s answer yet the assessor has to make a considerable effort to “connect the dots” himself, the achieved content mark may still be fairly high; on the other hand, the student will have shown a lack of ability to ‘structure his discourse’ in a satisfactory manner and will thus receive a lower mark in terms of ‘coherence and cohesion’. Therefore, it certainly makes sense to separate this set of criteria from the abovementioned ‘content’ category. A particularly useful feature of the marking grid for ‘writing’ additionally consisted in the explicit specification of expressions and structures to look for in a given student production. Thus, the ‘band 5’ descriptor indicated that the best possible performance needed to include the ‘use of simple linking devices such as ‘and’, ‘or’, ‘so’, ‘but’ and ‘because’’, whereas the ‘basic standard’ corresponded to ‘short simple sentences which [were] simply listed rather than connected’. Both descriptions were undoubtedly detailed enough to be directly used in the analysis of a particular performance and thus – if applied with sufficient rigour and consistency – certainly helped to increase the reliability of the assessment as a whole.
18 Brown, Teaching by Principles, p.364.
3. In comparison to the ‘speaking’ grid, the ‘lexis’ section for writing evidently needed to include an added focus on orthographic control. In line with the previously mentioned A2 requirement of ‘reasonable phonetic accuracy’, the ‘band 3’ descriptors called for only ‘limited control of spelling’. In contrast, merely ‘few minor errors’ could mark a written production if the ‘target standard’ was to be reached. As the latter definition is more in line with the B1 level (which stipulates that ‘spelling [is] accurate enough to be followed most of the time’), this once again underlines that CEFR descriptors sometimes tend to be fairly lenient in terms of accuracy. In a school context, especially in summative tests that build on pre-taught and thoroughly practised vocabulary items, it is clear that slightly higher target standards occasionally need to be set (at least in relation to these familiar expressions) than what the originally defined descriptors for the A2 level call for.
4. The same remark could also still be applied to the ‘grammatical structures’ section in this marking grid, since its descriptors in this regard were very similar to the ones used for speaking assessments.
However, the descriptors in this grid also importantly drew attention to the fact that ‘faulty punctuation’ could ‘cause difficulty in terms of communication’ (band 3); as mentioned above, this was an important addition in regard to the CEFR scale of ‘orthographic control’, which does not mention this element at A2 level at all. In regard to both ‘lexis’ and ‘grammatical structures’, this marking grid also importantly pointed out that the presence of ‘few minor errors’ could be tolerated as long as they did ‘not reduce communication’. This clearly echoes the general spirit of the CEFR, which fundamentally approaches grammar not as an end in itself, but as a means to a wider, communicative end.

Of course, it should be noted that notions such as ‘few’ and ‘minor’ (and, in the same vein, ‘basic’ in the marking grid for speaking) may still mean different things to different people; thus, even if various teachers use the same tool, it does not mean that they will also use it in the same way 19. The lack of definitions that Alderson et al. criticised in relation to the terminology in the CEFR descriptors 20 can thus just as easily affect these marking grids as long as it has not been made explicit what exact language structures, functions and items are implied by such fairly vague expressions. Hence, some teachers might be forgiving enough to consider the occasional omission of a ‘third person –s’ ending in the present simple as a ‘minor’ mistake that happened in the “heat of the action”, while others may immediately see this as evidence of ‘basic errors’ in the student’s production and correspondingly veer towards a lower band in the ‘grammatical structures’ criterion than their colleagues. Similarly, quantitative factors such as ‘few errors’ could mean merely two or three wrong forms to one assessor, but five or six to another. 21

19 In my own experience, this was strikingly exemplified when I attended a teaching seminar about the implementation of marking grids (David Horner, ‘The logic behind marking grids’, held in Luxembourg in March 2010): for example, the question of whether the ‘past simple tense’ could be considered as a ‘simple’ structure led to very different interpretations among the attending experienced teachers.
20 See section 2.2.2.
21 Possible ways of limiting such ‘inter-rater’ discrepancies are discussed in more detail in chapter 5 (see section 5.1.2.).

In a fundamental way, however, the descriptors in this marking grid usefully underlined possible reasons which could lead to ‘few minor errors’ of accuracy even in ‘target standard’ (i.e. band 5) performances: namely, the two factors of ‘inattention’ and ‘risk taking’ in a student’s written production. This is particularly relevant in the context of free writing. Hence, even if a particular writing task in a summative test is understandably based on the student’s general application of ‘specific practice points’ in a wider context, one must be aware that the degree of accuracy is bound to suffer more in such exercises than if these ‘practice points’ were focused on in isolation. Compared to discrete-item tasks, the term ‘inattention’ must thus be put into a different context. In an exercise where the student only has to fill in isolated grammatical items and can thus focus his entire attention on using accurate forms, it is clear that ‘inattention’ is fairly inexcusable and will have a legitimately negative impact on the awarded mark. In free writing, however, there are multiple and complex demands and difficulties that the student must simultaneously deal with in the composing process. As seen above, fulfilling the ‘communicative purpose’ of such a task requires the activation of ‘all and any language at [his] disposal’ (implying an occasional need for ‘risk taking’ in the process).
For the teacher, this means that the learner’s overall proficiency comes into play and needs to be taken into account; the assessment focus can no longer solely remain on pure achievement factors. For the student, paying attention to grammatical or orthographic accuracy in relation to ‘specific practice points’ becomes a much bigger challenge while he tries to juggle numerous additional elements and processes, such as developing ideas that are relevant to the topic, structuring them in an adequate way and looking for appropriate lexis to express them in the first place 22. As a result, occasional ‘inattention’ in one (or several) of these domains is virtually unavoidable. This is particularly so if factors such as time pressure and stress affect the student’s performance (which is very likely in summative classroom tests, given that free writing constitutes only one of the various tasks that the learners have to deal with in a limited amount of time). Placed in such a context, occasional grammatical mistakes clearly become something to be expected; for that reason, expectations of virtual faultlessness can no longer be accepted as realistic requirements for maximum marks in form-related criteria.

22 This complex interplay of multiple cognitive demands can easily lead to what Astolfi calls a ‘surcharge cognitive’. See also Jean-Pierre Astolfi, L’erreur, un outil pour enseigner, ESF (Paris: 1997), p.86.

Once the necessary adaptations to task design and the corresponding assessment system had thus been identified, my summative tests could be systematically modified so as to offer a better insight into my students’ writing competences. In the following section, I will describe and analyse a range of different tasks that I implemented and assessed in various summative tests in my ‘pre-intermediate’ class with that aim in mind.

4.5. Case study: using a marking grid to assess written productions in a summative test

4.5.1. Description of test tasks

The two test tasks which will be focused on in this section were implemented in the first summative test of the second term. Throughout the weeks leading up to the test, numerous classroom activities had focused on future events and plans to make the students more accustomed to (and confident in their use of) the ‘will-future’ as well as the ‘going to-future’ 23. The two topics of horoscopes and upcoming summer activities were recurrently used to contextualise the use of both future forms; since the eventual summative test needed to build on the relevant topical lexis which had correspondingly been treated in the classroom, both of these elements reappeared in two “free writing” tasks (from which the students could choose one) 24.

23 Such activities included, for example, a group work task in which the students had to compile a ‘horoscope’ for the year 2010 in their teacher’s life; suitable, typical features of this type of text were then elaborated on the basis of these creative productions. As for summer plans, they were not only orally discussed in the plenary group, but also included in a creative group work activity where students had to organise a stay in a different country (with a range of correspondingly available activities) as a ‘competition prize’ (this activity was based on the starting pages of Unit 4 in Tom Hutchinson, Lifelines Pre-Intermediate Student’s Book, OUP (Oxford: 1997), pp.32-33).
24 See appendix 13.
In both cases, a clear communicative purpose was to be pursued: in the ‘horoscope’ productions, the students had to establish a number of predictions in the form of a magazine article; the ‘summer activities’ task, on the other hand, targeted a text resembling a publicity ‘brochure’ for a ‘summer camp’ (where the students – as ‘camp managers’ – had to tell potential customers what they would be able to do and see there). As a result, the correspondingly adopted writing styles would ultimately have to differ in each case; however, as these two types of texts had of course been encountered and practised in class prior to the summative test, the students were generally familiar with both of them.

In terms of language, both of these alternatives were linked through their basic reliance on a common ‘practice point’: predictions about (as well as plans for) the future. Though they were of course no actual grammar activities, it seemed logical to verify to what extent the students would be able to apply the encountered future forms in a wider, more communicative context as well – in that sense, achievement indications in relation to this common objective would be given in written productions dealing with either task. However, due to the different topical lexis areas that the two tasks targeted, no such overlaps could be expected in terms of the vocabulary items that the students would use in their respective written productions. In turn, this highlights that the writing performances as a whole would also allow me to draw more global inferences about the actual proficiency levels the students had reached up to that point.

To a certain extent, one might therefore argue that the overall reliability of the test results risked being intrinsically affected by the fact that such a choice was offered to the students. As not all the students would necessarily deal with exactly the same test tasks, the direct comparability of their results would indeed be partially compromised (as there was no “scientific” way of making sure that the two proposed writing tasks were of precisely the same difficulty level). However, the decision to offer such a choice had of course not been made lightly: in fact, it constituted an attempt to cater for varying student interests and thus to generate higher productivity by giving each student the choice of addressing the subject matter that he or she found more exciting. In the process, this autonomous decision would also confer more responsibility on the learners: in essence, it was up to them to pick a topic which they knew (or thought) they would be able to answer in a (more) satisfactory way 25.

25 In this respect, it is interesting to note that official proficiency examinations such as the Cambridge ESOL Preliminary English Test usually also offer such a choice to candidates in the free (or creative) writing section. These choices generally involve different text types and thus answer formats as well (for example, the student can often choose between a letter or a story to write); see for instance the sample test in University of Cambridge ESOL Examinations, Preliminary English Test for Schools – Handbook for teachers, UCLES (Cambridge: 2008), p.14.
To further increase the comparability of requirements in these two test tasks, both sets of instructions included an equal number of structural and content-related clues to guide and support the students in their efforts. This type of scaffolding indicated the text format that had to be respected, and it simultaneously established a number of content points that needed to be incorporated into the written productions. In both cases, three major elements had to constitute the core of the students’ answers (which of course had the added benefit that the number of successfully addressed points could easily be related to the “quantitative” content descriptors for bands 1, 3 and 5 in the marking grid).

Each of the two assignments also respected the other key features of meaningful writing tasks identified above: not only did they both specify a clear topic (that, in each case, had already been previously treated in class), but each of them also stated a clear communicative purpose. For the ‘horoscope’, the students thus had to describe some future events that were reasonably likely to occur (and generally emulated the rather vague type of information usually included in such texts). For the summer camp ‘brochure’, a persuasive factor came into play in addition to the general description of camp activities and facilities (which recycled elements from the ‘competition prize’ activity implemented in class): the students were encouraged to write their answers in a lively, direct and advertising style. In both cases, the implied audience of the students’ texts was not merely the teacher, but rather the imagined readership of the corresponding magazine or brochure. Due to all these different aspects, the two tasks certainly represented a big step away from much more basic and traditional “writing” instructions (such as ‘make five predictions about the future’).

4.5.2. Description of test conditions

As the ‘free writing’ sections which I included in summative tests were usually combined with other types of tasks (i.e. reading, listening or ‘English in use’ exercises), it was clear that a certain amount of time pressure would inevitably risk affecting the learners’ performances, while tiredness after dealing with earlier exercises could also play a part by affecting concentration in the final, free writing portion of the test. However, as a double lesson was usually available for the implementation of these tests, I could (and generally did) give the students a little more time than I would have been able to do in a single 50-minute lesson. Thus, an extra period of ten or fifteen minutes could be granted to the test candidates so as to alleviate the impact of stress and time pressure on their performances (which would otherwise have affected the reliability of their results to a certain extent).
The tests were also invariably written in the familiar environment of the students’ own classroom, which (as seen in chapter 1) had a further positive influence on overall test reliability.

In this first summative test of the second term, a slightly more complicated mathematical operation was exceptionally necessary to arrive at the final marks for “free writing”, since only a total of 9 marks had remained available for this section after the rest of the test had been set up. Thus, a mark out of 10 was first calculated from the marking grid (by using a multiplying factor of 0.5 to convert the bands reached into numerical values). This number was then converted into the finally awarded mark (out of the maximum of 9) through a simple rule of three (i.e. divided by 10, then multiplied by 9). To avoid excessive decimal nuances, the resulting scores were rounded to the closest .0 or .5 value, with borderline decimals of .25 and .75 rounded up rather than down (a score of 6.3 out of 9, for example, thus became 6.5, whereas 6.2 was rounded down to 6). As these rather complicated conversions suggest, though, a mark out of 10 (or 20) is of course to be strongly recommended for reasons of practicality whenever possible.

4.5.3. ‘Horoscope’ task: analysis and assessment of student performances

The vast majority of students (14 out of the 19 learners who had been present on the day of the summative test) ultimately opted for the ‘horoscope’ task as their favoured writing activity. A first interesting detail which was immediately noticeable was that not all of them had given the same type of ‘format’ to their respective texts. While most individuals presented their respective answers as one connected paragraph, others had addressed the various content elements in successive steps, including sub-headings for each section. These different approaches are clearly visible in the following two samples 26:

Student A:
Love and friendship: The week will start bad because you meet some people strange. They will lie. You are going to have a nice time whit your best friend. You will meet a very nice and interesting person and He will give you his heart. money and finances You will have a lot mony and you will win a new chance. work and studies You will have a lot of arguments because you never will listen to your partner. You will miss a lot of thinks. You will go to the doctor.

Student B:
Your week is going to start good. You will fall in love, and it will be the man/woman of your life. If you are bad in finances: don’t worry! This week you will win the Loto! If you work, your director will give you more money, but if you study, you should learn more, it’s the only solution! Your sence of humor will be nice! You should be more sportive and think more on yourself. It will be a nice week for you!

26 See appendix 14 for copies of the original student samples and the correspondingly used assessment sheets.

One may argue that “real” horoscopes can be encountered either in the form that student A chose or in one connected paragraph (as in student B’s production), depending on the magazine or newspaper one picks. In that sense, the decision to approach the different content points individually (and to address them under separate headings) did not have a negative impact on the “authenticity” (and thus the appropriateness) of student A’s sample in this case. Incidentally, since the instructions had in fact been presented in “bullet points” on the test sheet as well (see appendix 13), some students could also have interpreted this as a sign that their answers needed to replicate that precise structure.
In short, there were thus no reasonable grounds to consider one type of format as intrinsically more appropriate for this particular genre (and to value one of them more highly than the other in terms of the corresponding ‘content’ criterion in the marking grid). Furthermore, both answers superficially seemed to address all the required content points: sentences about all the major categories to be included could be found in each case. However, student A in particular tackled these issues with varying success. Especially in relation to ‘money and finances’, it is clear that the intended ‘message [was] only partly communicated to the reader’. Even if it was still fairly obvious what the student meant with ‘you will have a lot mony’ (i.e. ‘you will become rich’), the second part of the sentence (‘you will win a new chance’) did not lead to ‘successful’ communication. A similar remark could be made about the penultimate sentence in her text: while the spelling mistake in itself (‘thinks’ instead of ‘things’) did not impede communication, the actual sentence as a whole was excessively vague and did not really seem to fit into the category of ‘work and studies’. In combination with the final sentence, the suspicion arises that she might have intended to refer to an absence at work (or school) due to illness; however, this example certainly underlines that a considerable ‘effort on behalf of the reader’ was necessary to make sense of some parts of the student’s answer.

In comparison, student B’s ‘horoscope’ mostly stuck to the imposed content points a bit more firmly; while one may argue that the piece of information about a lottery win is too specific for such an article, the overall message in the ‘money and finances’ section certainly came across much more ‘clearly and fully’ in this instance. The only corrective effort that the reader really had to make in that precise case was to infer the intended meaning of the inaccurate expression ‘Loto’ – by no means a feature that would impede successful communication. While clear L1 interference occurred in various other parts of the answer as well, it mainly risked having an impact on the ‘lexis’ and ‘grammatical structures’ marks (i.e. ‘start good’ / ‘be more sportive’); the message in itself was not usually affected in a negative way. The only exception where ‘content’ truly became an issue in this answer was the sentence ‘Your sence of humor will be nice’: no obvious context, relevance or even intended meaning could be discerned for this prediction. The relevance of the following sentence (‘You should be more sportive…’) also remains in doubt, although the student’s reason for including it might be explained by the last statement in the actual task instructions (‘you can add other important elements that you can think of’).

All in all, the intermediate ‘band 4’ was thus awarded for ‘content’ in this case, as the student had ‘in general’ managed to ‘communicate the message fully and clearly to the reader’ (band 5), yet some interpretive ‘effort’ had been necessary towards the end of the written production (band 3). In contrast, student A’s performance was not only marked by the partly unsuccessful handling of one content point (i.e. only about ‘2/3 of the content elements’ had truly been dealt with adequately); as it also repeatedly required more ‘effort on behalf of the reader’, it ultimately corresponded rather directly to the descriptors in ‘band 3’ (which was thus logically awarded).

A similarly analytic approach was also helpful to reach an informed, criteria-based decision about the ‘coherence and cohesion’ factors in both written productions. Thus, student B’s answer was consistently marked by a lively, supportive tone (e.g. ‘don’t worry!’ / ‘It will be a nice week for you!’); efforts to provide an adequate opening and ending were also clearly evident in this instance and enhanced the impression of overall coherence and sensible organisation of the answer as a whole. While ‘simple linking devices’ were generally only used within individual sentences (rather than to connect two successive ones with each other), successful attempts to use more ‘complex sentence forms’ (i.e. ‘If you work, your director will give you more money, but if you study…’) further contributed to the positive effect on the reader. Yet due to the fact that the final two sentences in the main body of the text presented no obvious logical or linguistic link to the preceding content (and thus corresponded to the band 3 descriptor of ‘listed rather than connected’), ‘band 4’ was ultimately awarded in this set of criteria as well (instead of the highest possible one).

In student A’s performance, some organisational features were also present, as underlined by the three included sub-headings and corresponding section breaks. However, ‘listing’ rather than ‘linking’ of individual ideas not only appeared in this obvious three-part structure of the answer, but also in relation to the various sentences in each section (pointing to the ‘band 3’ descriptor of cohesion). Hence, simple linking devices such as ‘because’ or ‘and’ were consistently used correctly at sentence level, but no effort was made to use cohesive devices to connect sentences or paragraphs directly. In terms of coherence, the aforementioned inconsistencies in the last part of the text as well as the absence of a satisfactory final sentence further underlined the impression that the student tended to simply add ideas as she went along rather than having a complete, connected message in mind. Finally, the overall coherence of her text was slightly problematic as a lot of different events were packed into the predictions for a single week, yet no expressions of contrast (such as ‘but’) were used to move from positive to negative events or implications (e.g. ‘They will lie. You are going to have a nice time…’). The amalgamation of such inconsistencies, combined with the generally ‘listed’ answer style, ultimately led to a ‘band 3’ score for ‘coherence and cohesion’. Hence, even though one band 5 descriptor had been partly respected (‘use of simple linking devices such as ‘and’… and ‘because’’), the remaining features of the student’s answer clearly pointed towards ‘basic’ rather than ‘target’ standard.

In terms of linguistic aspects such as ‘lexis’ and ‘grammatical structures’, neither candidate managed to provide a completely faultless performance. For instance, both of them struggled with adverbs (e.g. ‘start bad’ and ‘start good’, respectively), and some structures and expressions that the students used clearly belied their need to improvise in order to bring their message across (student A: ‘You will win a new chance’ / student B: ‘if you are bad in finances’).
This “risk taking” was, as seen above, not always crowned with success. Nevertheless, particularly in student B’s production, most of the grammatical and lexical errors committed did ‘not reduce communication’, which once again underlines the necessity of setting the linguistic imperfections of a particular student performance against the more general consideration of whether or not the student has reached a sufficient level of proficiency to express herself clearly enough to fulfil the communicative purpose of the task.

On the other hand, the interplay of ‘form’ and ‘content’ also occasionally became evident in this assessment. Hence, student A’s sentence ‘you will win a new chance’ not only risked having an impact on her mark in relation to ‘lexis’ criteria; as it simultaneously reduced her ability to address the corresponding ‘content’ point in an adequate way (i.e. to ‘fully communicate the message’), it would also keep her from achieving maximum marks in the global ‘content’ section. At times, it was thus difficult to keep different criteria completely separate from each other (which, in turn, might also partly explain the fact that marks for different criteria within a single student’s assessment frequently tended to be rather close to each other). Nevertheless, as in the case of the marking grid for speaking, the decision to keep ‘lexis’ and ‘grammatical structures’ as separate criteria proved a useful one to analyse the precise areas where the students encountered problems (even if the same bands were fairly often reached by individual students in relation to both aspects).

In this instance, student A’s production was occasionally marked by ‘limited control of spelling’ (‘lexis’ band 3); however, some of these mistakes seemed to be due to inattention (e.g. ‘money’ spelled correctly once but misspelled in the subsequent line). Although it was ultimately because of her ‘limited range’ in this area that the second content point could not be entirely addressed in a correct way (and she had thus not completely reached the ‘target standard’ of an ‘adequate range’), the expressions she chose in the remainder of her text generally seemed ‘appropriate and adequate for the task’ (band 5). Moreover, most of her spelling mistakes could indeed be seen as ‘minor errors’ as their impact on communication was usually minimal. However, since ‘band 5’ had not been reached in every aspect of the ‘lexis’ criteria, the slightly lower band 4 was awarded instead.

In terms of grammar, the occasional use of fairly basic sentences (‘You will...’) initially pointed to a ‘limited range of structures’ (certainly in comparison to student B, who had included rather complex first conditional sentences). Moreover, slightly wrong word order (e.g. ‘some people strange’) also characterised student A’s answer at times. Yet a closer look revealed that none of these mistakes seriously impeded communication; in fact, important elements such as syntax and the use of verbs and tenses were correct enough throughout to avoid ‘difficulty in terms of communication’ (band 3). General achievement objectives for this summative test had additionally been met by the student, since she had used future forms correctly throughout her answer (apart from one instance in the first sentence which, as a clear exception, pointed to an uncharacteristic ‘slip’ resulting from ‘inattention’ rather than constituting a systematic ‘error’).
In that sense, if one focused purely on the ‘grammatical’ aspect of the student’s text, it became clear that there were only ‘minor errors’ that did not, in themselves, create misunderstanding or obscure meaning. All in all, since ‘features of bands 3 and 5’ were thus visible in her performance, she ultimately obtained the intermediate ‘band 4’ mark for ‘grammatical structures’ as well.

The thorough error analysis encouraged by this criteria-based assessment system thus ultimately led to a number of useful consequences. On the one hand, by separating the foci on lexis and grammar, I resisted the temptation of simply counting the total number of ‘form’-related mistakes and correspondingly reaching a holistic judgment that might have been fairly negative. Indeed, without distinguishing between spelling, vocabulary and grammar mistakes, one might simply have reached the conclusion that there was hardly a completely mistake-free sentence in student A’s performance, and a fairly low mark for ‘form’ (or ‘language’) might then have been awarded overall. In contrast, when analysing the text with reference to the various specific descriptors for ‘lexis’ and ‘grammar’, it was not only possible to understand where most of these different mistakes had come from; this approach also crucially helped to find out which errors truly hindered the successful achievement of a ‘communicative purpose’ (and which ones did not), and to adjust the various criteria-related marks accordingly. In turn, this not only translated into a more nuanced (and more soundly justified) assessment; it also provided a better opportunity to give targeted feedback to the student (by pointing out the area that had caused the most problems in the test performance and thus primarily needed to be worked on).

4.5.4. ‘Summer camp’ task: analysis and assessment of student performances

Although the second proposed writing task in this summative test also imposed three main content points to be included, it left the students a lot of room to become creative. As a result, the students who chose this option came up with very diverse answers, as exemplified in the following two productions 27:

Student C:
Hi! I’m Mr. Cruse the manager of the new sommer camp “Think about It”. My camp is in Knogge near the see. Next to my camp Is a Zoo with a lot of animals. The participants can do a lot of wateractivitys, like surfing, Water-gym… The Camp rooms are verry comfortible, they have two bed’s a bathroom, television and balcon. In the camp we have a restaurant with 20 diffrent nationalitys of food, Chinese food… We have also a welnessroom, with sauna, solarium, hairstyler… When it conviced you come here, it would be funny!

Student D:
Two weeks summer camp in France without parents and brothers or sisters. Groups with 4-5 person. We will climbing, do water ski, rally, and differents sports (football, tennis…) We will sleep in a old house. In a room there sleep 4-5 person (girls or boys). The last week we will sleep in Nature not in a building, and cooking four yourself. In the evening we will make a disco. You will have sometimes freetime. 10 females and 8 males will sitting you. We have place for 60 childrens (14-18 years) When you come with us in the summer camp you’re never lost the very funny time. You will have a lot of sports and will learn a lot of diffrent people.

27 See appendix 14 for copies of the original student samples and the correspondingly used assessment sheets.
Once again, both students generally addressed the required content points: there is evidence of available activities, buildings and rooms in each of the two productions (even though the rather luxurious sleeping facilities described in production C do not really correspond to a typical ‘camp’). In different ways, students C and D also respected the instruction to make their ‘brochures’ as lively as possible; both of them chose to do so by addressing the correct target audience (teenagers who should come to the camp) in a very direct way. However, whereas the first student chose to create an imaginary persona (‘Mr Cruse’) that consistently addressed the reader, the speaker in the second text rather randomly alternated between the subjects “we” and “you”; as a result, the overall coherence of production D suffered to a certain extent.

In terms of format, the ‘brochure’ context could potentially be invoked to defend the style of student D’s first two sentences: while, in general, incomplete constructions (i.e. sentences without verbs) should evidently be avoided in writing, one could indeed point out that advertisements or titles can occasionally take the form that the student had used to begin her answer with. The fact that the student consistently used verbs throughout the remainder of her answer further strengthened the impression that these first two “sentences” were intentionally contracted. Nevertheless, in regard to overall coherence and cohesion, student D’s answer certainly did not exceed the ‘basic standard’: there was no systematic use of ‘linking words’ or other cohesive devices (apart from ‘and’), and thus the individual points that she made were ‘simply listed’ throughout. On the other hand, the intentional use of paragraphing for new sets of ideas certainly indicated a form of ‘organisation’ that was clearly missing in student C’s text (but which, in turn, was compensated for by the fitting opening and concluding sentences that provided a general, coherent frame to that latter text).

Yet what makes these written productions particularly interesting samples for a more thorough analysis (in view of assessment) is the fact that both of them clearly contained a range of problems with grammatical and lexical accuracy. In student C’s production, for instance, a whole range of ‘lexis’-related mistakes appeared: inconsistent use of lower and upper cases, frequently inaccurate spelling, as well as a few wrongly improvised expressions based on L1 transfer (e.g. ‘welnessroom’, ‘hairstyler’). Some of these mistakes may certainly be attributed to ‘risk taking’, since an A2 learner is not necessarily familiar with the expressions ‘convinced’ and ‘comfortable’; spelling mistakes in those rather complex words are thus understandable. However, others cannot be excused in this way: the student was, after all, supposed to know the correct spelling of words like ‘sea’ and ‘very’. Of course, the single usage of each word does not allow us to verify whether these were mere ‘slips’ (due to ‘inattention’, which is arguably the case for the wrongly copied word ‘sommer’) or consistent errors in the student’s interlanguage. Nevertheless, they certainly contribute to the overall impression that this learner’s orthographic control was fairly poor in general.
Furthermore, some grammatical items were also used inaccurately (e.g. ‘bed’s’ instead of the correct plural ‘beds’); in regard to this criterion in the marking grid, an occasionally inappropriate use of punctuation marks affected the overall logic of some sentences as well (for example in ‘diffrent nationalitys of food, Chinese food…’, where the comma indicated an enumeration, yet a colon would have been more adequate).

In a ‘classic’ assessment scheme where all these mistakes would have been considered under a unified ‘form’ criterion, an insufficient mark could easily have been the consequence, especially in reference to supposedly ‘known’ items such as ‘summer’, ‘sea’ and the general rule in the English language that most nouns are spelled in lower case. To a certain extent, it might also be reasonable to expect that a ‘pre-intermediate’ student should get these elements right, particularly in a summative test which (partly) verifies adequate mastery of previously encountered material (which, in this case, would definitely apply to the words ‘sea’, ‘very’ and ‘different’). However, taking into account the actual complexity of the free writing process mentioned above, a slightly more lenient approach to accuracy might be more appropriate in this instance. More importantly, following the CEFR approach of analysing what the learner actually did right in this assignment, it would be utterly harsh to deny the general communicative success that her writing performance essentially achieved in spite of its lexical and orthographic shortcomings.

In fact, one might argue that only the final sentence may truly have caused problems in terms of communication; the learner’s choice of words was not only marked by the rather important confusion of ‘if’ and ‘when’, but the expression ‘it would be funny’ (instead of ‘you will have lots of fun’ or ‘that would be great’) ultimately led to a different meaning than the initially intended one. In contrast, spelling mistakes are often much more forgivable as they do not frequently cause a complete failure to bring the intended message across; the student’s orthographic control thus effectively corresponded to the relevant A2 descriptor, which simply calls for ‘reasonable phonetic accuracy’. As that descriptor functioned as the foundation for the ‘basic standard’ defined in the marking grid, a corresponding ‘band 3’ was ultimately awarded for the ‘lexis’ criterion of the student’s performance due to her clearly ‘limited control of spelling and word formation’.

Once the assessment focus was turned exclusively onto the grammatical proficiency that the learner displayed in this text, it became even clearer that there were not many elements which actually hindered communicative success. Apart from a few minor mistakes such as wrong word order (‘we have also’ instead of ‘we also have’) or slightly inaccurate constructions (‘next to my camp Is a Zoo’ instead of ‘there is a zoo’), the candidate generally managed to express her ideas rather fluently. Even the basic mistake of placing an incorrect apostrophe in the plural ‘bed’s’ certainly did not obscure the intended message (although it is an element that a student should be expected to get right even at that level). Once again, the final sentence caused the greatest problem: the wrongly constructed conditional sentence did not really work there.
However, one needs to bear in mind that the ‘Second Conditional’ does not belong to the grammatical structures that a language learner is expected to have mastered (or even encountered) at that early stage; as such, the inaccuracy resulted from ‘risk taking’ in this instance and should therefore not be overestimated.

In a different respect, though, the student’s text had actually failed to respect one of the instructions: the first content point to be included had been specified by way of a sentence in the ‘will-future’ (i.e. ‘describe the things that the participants will see and do’). In contrast, student C had consistently used the present simple tense in her answer instead. In that sense, she had not demonstrated that she could use one of the targeted grammatical structures in a wider context. To a certain extent, of course, her choice of tense was not totally inappropriate for the purpose of describing the available summer camp activities per se (since the present simple is, after all, perfectly fine for general descriptions). In turn, this underlines once more that such free writing tasks can only be drawn upon in a limited way to verify the achievement of a particular grammatical objective, since it is not always entirely predictable which structures the students are ultimately going to use. However, at least in relation to the first content point of the ‘summer camp’ task, the instructions had provided a clear hint that future forms were expected (even if this might not have been stressed quite as obviously as in the ‘horoscope’ activity discussed above). Even though student C’s text had not led to much obvious ‘difficulty in terms of communication’ with the linguistic choices she had independently made, her production was thus essentially not marked by a ‘sufficient range of structures for the task’ (but rather a ‘limited’ one, as she virtually stuck to the present simple tense throughout). Taking into account the various instances of ‘faulty punctuation’ in her production as well, I ultimately decided to award the corresponding ‘band 3’ for ‘grammatical structures’, which essentially reflected that communication had still largely been possible even though accuracy was recurrently a problem. The general fulfilment of the communicative goal was also visible in student C’s final mark (6/9), which was sufficient but still underlined a lingering tendency towards ‘basic standard’ and the occasional need for ‘some effort by the reader’.

An example of how limited accuracy can lead to much more noticeable problems of communication can be seen in student D’s text. Throughout this production, it is evident that numerous rather basic grammatical elements had not yet been mastered by this learner: frequently encountered plurals (‘childrens’ / ‘person’ instead of ‘people’), indefinite articles (‘a old house’) and verb forms (‘we will climbing’) constituted only a selection of items that were not used correctly in this text. As these items and structures had all repeatedly been encountered and rehearsed before, they certainly did not result from ‘risk taking’ but rather from an insufficient achievement of relevant objectives (which, in turn, had a negative impact on the learner’s overall proficiency in writing).
In relation to the ‘grammatical structures’ criterion, the student thus tended towards the ‘band 1’ descriptor that implied ‘essentially no control of structures’; however, even if the student’s handling of grammar certainly lacked consistency, there were also various constructions that did work (e.g. ‘You will sleep in a[n] old house’ / ‘We have place [i.e. room] for 60 children[…]’). Ultimately, communication was therefore still possible to a sufficient extent to provide some relevant information in regard to a number of content points. As punctuation was also generally handled adequately, ‘band 2’ was ultimately awarded in this category.

Numerous issues also affected the student’s lexical and orthographic competences: in this case, attempts to improvise in order to bring the general message across were unsuccessful on various occasions. While spelling mistakes such as ‘cooking four yourself’ did not impede communication, other linguistic “experiments” already seemed rather awkward (e.g. ‘10 females and 8 males’ / using the word ‘childrens’ to refer to 14- to 18-year-old teenagers). Finally, some constructions were only possible to understand if one was familiar with the student’s L1 (in this case Luxembourgish): sentences like ‘learn a lot of diffrent people’ (instead of ‘get to know’) or ‘you’re never lost the funny time’ (probably meaning ‘you will never be bored’) would certainly cause problems of interpretation for a native speaker of English. Similarly, the sentence stipulating that the camp employees ‘will sitting you’ did not lead to successful communication; while it is likely that the term had been derived from ‘babysitting’ and was supposed to mean ‘to watch over you’, this example clearly underlines that ‘risk taking’ did not always lead to positive results in this student’s production.

The suggested possible interpretations of the three abovementioned (parts of) sentences certainly illustrate that a ‘serious effort by the reader’ was necessary to make sense of student D’s text at times. In essence, there were thus too many problems with lexis and spelling to consider that the ‘basic standard’ had been reached here; ultimately, the vocabulary ‘range’ exhibited by the learner did not entirely fulfil the condition of being ‘minimally adequate for the task’ in all aspects. On the other hand, as her intended meaning could still be followed (particularly in the first half of her answer) or inferred (towards the end) most of the time, I considered that the second band in the marking grid was the most adequate one to attribute to this aspect of the student’s performance as well. The marks in both ‘form’-related categories thus reflected the insufficient accuracy in the student’s performance, but they also took into account that a complete breakdown in communication had in many cases been avoided. Once again, it seems rather likely that the student would have risked getting a very low mark overall if all her mistakes had simply been added up and seen as the main basis for her final score.

In terms of overall ‘content’ marks, these accuracy problems evidently had an effect as well. After all, some ideas that the student had had in mind did not totally come across due to the wrong expressions or constructions she had occasionally picked (especially in the final paragraph). Hence it was evident that the message had not been ‘fully communicated to the reader’ (i.e. band 5 had most certainly not been reached).
On the other hand, there were also many elements that one could still understand even if these pieces of information had not always been perfectly phrased. For example, the camp location, facilities, available activities as well as targeted customer groups were all described in intelligible (if not completely accurate) language 28. In relation to the marking grid, this certainly means that more than ‘1/3 of the content [had] been dealt with’; moreover, ‘excessive effort’ was not necessary to understand the bulk of the message. As a result, the performance markedly exceeded the insufficient ‘band 1’ in terms of overall content. In fact, the ‘band 3’ descriptor seemed a much more appropriate one to sum up the student performance: the ‘message [had] only [been] partly communicated’ and ‘some effort on behalf of the reader’ had been necessary. Since a similar basic adequacy was (as seen above) attested to the ‘coherence and cohesion’ factors of student D’s production, the two ‘basic standard’ scores in ‘content’-related criteria ultimately secured a marginally sufficient final mark (4.5/9) in this instance.

28 Indeed, one can hardly say that it is impossible to comprehend what the student wanted to say in a sentence such as the following: ‘The last week we will sleep in Nature not in a building, and cooking four yourself’. Particularly the expression ‘not in a building’ shows the student’s attempt to paraphrase: she was aware that her choice of expressions might not be completely accurate and thus used such strategies to further clarify her meaning.

All in all, the results of the assessment procedure thus reflected the two most crucial elements of the student’s performance: there was a clear need to improve the overall accuracy in both grammar and lexis; however, as a whole, the communicative goal had still in large parts been reached.

4.5.5. Outcomes of the applied assessment procedure and general comments

The detailed analysis of individual student samples above clearly reveals the systematic and nuanced assessment procedure which the use of a competence-based marking grid encourages. Of course, even this type of assessment does not completely eliminate the subjective component inherent to the personal judgments of an individual assessor; in fact, other teachers might have interpreted and applied the defined marking criteria in slightly different ways than was done in these four cases, and differing marks might thus still have been reached by different scorers. However, the overall principle of constantly keeping the student’s adequate fulfilment of a communicative purpose in mind certainly helps to guide the assessment in a direction that fundamentally considers the student’s writing skills rather than the mere display of his grammatical or lexical knowledge. Moreover, as the application to two different test tasks proves, the descriptors in this marking grid were sufficiently versatile and relevant to allow for a founded and balanced appreciation of various student productions as well as text types.

In this regard, one might add that no major discrepancy ultimately existed between the final scores respectively awarded to ‘horoscope’ and ‘brochure’ productions. Both sets of marks presented a fairly similar range (between 4.5/9 and 8/9 for the 14 productions that had dealt with the first task, and between 4.5/9 and 7/9 for the 5 ‘summer camp brochures’) 29.

29 See appendix 15 for a detailed breakdown of marks. It should also be stressed, though, that valid statistical comparisons can of course only be drawn to a limited extent based on such a small sample of productions.
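As a brief illustration of how these final marks follow from the awarded bands, the minimal sketch below (in Python, and purely illustrative: the function name is my own, and the sketch simply recapitulates the conversion described in section 4.5.2 rather than reproducing any official scoring tool) applies the factor of 0.5, the rule of three and the rounding convention to student D’s four band scores as discussed above.

    import math

    # Recap of the conversion described in section 4.5.2: the four criterion
    # bands are multiplied by 0.5 to give a mark out of 10, rescaled to the
    # 9 marks available for this section, and rounded to the closest .0 or .5
    # (with borderline decimals of .25 and .75 rounded up rather than down).
    def final_mark(bands, max_mark=9):
        raw_out_of_10 = sum(bands) * 0.5           # four criteria at band 5 give 10
        rescaled = raw_out_of_10 * max_mark / 10   # simple rule of three
        return math.floor(rescaled * 2 + 0.5) / 2  # round to half-point steps

    # Student D, as assessed above: content 3, coherence and cohesion 3,
    # lexis 2, grammatical structures 2
    print(final_mark([3, 3, 2, 2]))                # 4.5, matching the awarded 4.5/9

The arithmetic itself is of course trivial; writing it out merely underlines that the final marks discussed above follow mechanically from the criterion bands once these have been decided.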
It is of course hard to guess how individual students might have fared if they had chosen the alternative topic, even if it seems reasonable to assume that their results would not have differed to a huge extent (given that the overall level of proficiency they had reached was bound to lead to comparable levels of accuracy and fluency in other writing tasks as well). A higher overall reliability of test results could thus arguably have been reached if all the students had had to address the same instructions (and to use more closely similar vocabulary ranges and language structures as a result). Nevertheless, the comparable values which marked the two sets of results certainly indicated two important facts: on the one hand, both writing tasks were feasible for these A2-level students with the language resources they had built up over the course of the term (and year) up to that point. On the other hand, the two test tasks were also sufficiently reliable in their own right in the sense that each of them allowed the assessor to make a clear distinction between ‘good’ and ‘weaker’ performances.

A further interesting fact is revealed by a closer look at the various factors which had led to the final marks. Thus, a certain consistency could once again be seen across the bands which individual students had attained in relation to the four main criteria in the marking grid: as in the speaking tests (see chapter 3), the marks attributed to the various aspects of each learner’s performance invariably spanned only one or two different bands. In part, this was evidently due to the interplay of ‘form’ and ‘content’ alluded to above: if a student had, for example, exhibited a clearly limited vocabulary range, it was fairly logical that a ‘band 5’ achievement in relation to ‘content’ criteria (and the corresponding ‘full and clear communication of the message’) was very difficult to achieve for that individual. On the other hand, completely irrelevant answers (and thus large discrepancies between high language-based and low ‘content’-related marks) had also been avoided in this instance through the clear indications that each set of instructions had given about the respective elements to be addressed. To some extent, this certainly helps to explain why no insufficient ‘content’ marks had to be awarded in this test. As seen in chapter 1 30, slightly ‘restricting the scope of variety in answers’ in that manner contributed to increasing test-related reliability as well. In a similar way, the partially suggested structure in the instructions had also supported the learners in their efforts to imbue their productions with ‘coherence and cohesion’, which might have had a positive impact on their scores in that particular area as well.

30 See the comments of Cohen et al. regarding test reliability in section 1.2.2.

On the other hand, it also needs to be remarked that maximal marks were either very rarely reached in relation to specific criteria, or (in the cases of ‘lexis’ and ‘content’) not reached at all. Seeing that the ‘band 5’ descriptors had left some room for imperfections in the learners’ performances, this might certainly be considered a disappointing result.
Yet given the rather wide vocabulary range that was required to fulfil all parts of the test tasks ‘fully and clearly’ (band 5), it is not particularly surprising that the lowest average scores in the students’ performances were obtained in regard to the ‘lexis’ criteria (and this, as seen above, sometimes affected their ability to answer all content points in sufficient levels of detail and complexity). Since the overall focus of the assessment clearly lay on proficiency rather than specific achievements, a certain amount of unsuccessful experimentation with less practised words and structures was to be expected (and accepted); however, as this sometimes hampered the successful communication of the intended message in the students’ productions, the ‘target standard’ could then not be awarded. On the one hand, one may therefore argue that the test tasks had aimed at a vocabulary range that was perhaps slightly too complex in this instance; this was arguably an issue that could be adjusted more satisfactorily in subsequent tests. At other times in this summative assessment, though, too many mistakes had simply occurred in the usage of more familiar items, so that these ‘slips’ no longer really fulfilled the band 5 criteria of either being ‘few’ in number or of ‘minor’ impact.

However, it is also necessary to keep in mind that the ‘target standard’ (i.e. band 5) represents the level which students should ideally reach; whether the degree of difficulty of a particular summative test is appropriate is thus best measured through the number of students who manage to attain the ‘basic standard’ (band 3) for A2 learners. In this case, the vast majority of students who had taken the test had encouragingly done so in relation to ‘lexis’ and ‘grammatical structures’; moreover, those who had failed to reach sufficient marks due to a lack of accuracy could still reach a ‘pass’ mark if they had managed to communicate the intended message to a satisfactory extent. This once again underlines that free writing tasks should not be misused for excessively grammar- or lexis-based achievement assessments, as we then risk losing sight of the more crucial communicative purpose of the assignment as a whole.

4.6. Alternative types of writing tasks

It was pointed out earlier that the students’ ability to express themselves should be verified in relation to a wide range of topics and situational contexts (as well as through a variety of different text types). For that reason, I will now conclude this chapter with a brief exploration of other writing tasks that were designed for (and implemented in) summative tests at other points of the school year with such an aim in mind.

4.6.1. Informal letters and emails

One communicative purpose which has undoubted potential to meet the interests of adolescent language learners is the act of sharing information with friends, which provides an apt, A2-compatible framework for the description of everyday activities and past experiences. Although contemporary teenagers might tend to prefer other types of written communication such as text messaging or online chatting, letter writing certainly still represents a useful skill to develop and master for any language learner. As a result, both formal and informal letter types were studied and practised in classroom activities (and homework assignments) over a number of weeks in the third term 31.

31 Some of these activities were based on Hutchinson, Lifelines Pre-Intermediate Student’s Book, pp.58-59; for instance, the students were asked to write a homework assignment in which they imagined that they were writing to their old school friends twenty years from now, telling them about everything that had happened in the meantime.
Based on the practice that my students had thus gained, a matching writing task was then implemented in their first summative test of that term. The corresponding instructions indicated a number of features that were established above as essential for a ‘true’ writing task 32: a context for writing (i.e. a year abroad), a target audience (i.e. a friend at home), a number of content points to include and the type of format and writing style to use (i.e. an informal letter) 33. In this particular case, the students were not offered a choice of writing task (since it could be safely assumed that they would all find something to write about to a friend); as a result, the test would provide more reliable results in terms of allowing legitimate comparisons of displayed proficiency levels.

32 See section 4.2.
33 See appendix 16.

Even though the assignment partly revisited the topic from the very first summative test of the year, it certainly represented a much more communicative task this time; the learners were not only invited to provide a written production that included adequate information, but also to present it in an adequate format and to follow appropriate sociolinguistic conventions. Moreover, the compulsory content points had, in this case, been deliberately chosen to target the learners’ ability to ‘describe past experiences or events’; in turn, this would also supply some indications about the ways in which the students had learnt to apply a number of tenses they had encountered over the course of the term in a wider, coherent context (which would be interesting in terms of achievement assessment). However, a potential drawback of these instructions was the fairly artificial way in which they imposed the content points: by simply telling the students what they had to do, an ‘authentic’ feeling of truly engaging in written interaction with another person did evidently not arise. In other words, the fact that their contributions purely followed the purpose of ‘display writing’ was essentially not disguised in this instance.

An interesting variation which arguably tends to confer a higher degree of authenticity on this type of communicative context (and purpose) consists in confronting the learners with an actual written message that they first need to read and interpret, and only then reply to in a meaningful and relevant way. In an attempt to put such a strategy into practice, I deliberately implemented another test task where I did not present the writing instructions in a classic, impersonal form; instead, the test sheet simply showed an email message from an English-speaking exchange student who wanted some information about my pupils’ school (and its canteen) and addressed them in a distinctly informal writing style 34. Hence, during the summative test, the candidates first of all had to read the text they had been given, scan it for precise questions to answer and only then start “replying” to the exchange student by writing an appropriate, informal “email message” in return.

34 See appendix 16.
As this exercise activated not only productive but also receptive language skills, this writing task was strategically implemented in the final test of the year (when my students had presumably reached their highest level of proficiency and would not be overwhelmed by the more extensive amount of provided input). Nevertheless, it is clear that the combination of reception and production can be a double-edged sword. On the one hand, as Harmer points out, ‘reception’ is of course often a ‘part of production’:

in many situations production can only continue in combination with the practice of receptive skills. […] Letters are often written in reply to other letters, and email conversation proceeds much like spoken dialogues. […] The fact that reception and production are so bound up together suggests strongly that we should not have students practise skills in isolation even if such a thing were possible. 35

35 Harmer, op.cit., p.251.

It goes without saying that if ‘reception’ is thus used as a stepping-stone leading to more authentic ‘production’ in classroom activities, there is no reason why the same should not be applicable to summative tests. Yet if the aim of such tests is to gain insight into competence development explicitly in regard to productive skills, there are a number of issues that do in fact arise in such combined tasks. In the summative test described above, two problems in particular were noticeable in the students’ final written productions. Firstly, some students ultimately failed to address all the content points, presumably because they had not identified all the questions in the initial text (or had simply forgotten to address some of them); as a result, they did not give a sufficiently complete reply in their own emails. Secondly, in response to those elements that they had spotted, their answers contained numerous expressions they had simply copied from the text; as a result, the evidence that could be gathered about the students’ own lexis-related competences was ultimately blurred and unreliable. In other words: what the communicative context and purpose of the exercise might have gained in (semi-)authenticity was lost in terms of providing a safely founded (and “authentic”) assessment of the students’ actual lexical resources. Hence, a shorter email containing only a few short questions could be a useful alternative to limit the amount of modelled input and readily provided lexis, yet still maintain a communicative task layout that is more engaging than a few mechanical and direct writing instructions.

4.6.2. Story writing

A much less guided type of written production which allows the students to express themselves very freely and creatively can be reached through exercises that encourage story writing. However, in this instance as well, different types and amounts of scaffolding can be provided to support the learners in their efforts. Rather than only giving them a very basic type of instruction (e.g. ‘write a story about a fantastic day’), the topic that they should write about can thus be supplied in more engaging (but also sometimes more challenging) ways. A particularly useful variation consists for example in handing the students the beginning of a story and asking them to finish it in an adequate way.
In their resulting written productions, the students are thus not only required to include relevant ideas and subject matter; a truly convincing effort will also be marked by a suitably coherent and cohesive structure which remains consistent with the details that have been offered in the opening lines of the story. In comparison to the two types of test tasks analysed in the case study above, a more thorough analysis of the students’ use of such elements as paragraphing and cohesive devices becomes possible in such story writing. While this is another example of ‘reception leading to production’, the possibility of copying provided expressions and ideas has been significantly reduced in this instance. In contrast to the “email reply” mentioned above, the students have to develop a storyline and cannot directly repeat the information that the test sheet already contains.

Such an exercise was for example implemented in the last test of the second term 36 (a point in time when the students had already had the chance to significantly develop their linguistic resources since the beginning of the year). Evidently, one can only expect the students to master such a task in an adequate way if they are familiar with story writing strategies and have had the chance to experiment with this type of creative production in the classroom (and have, of course, received relevant feedback about their efforts) prior to the test. Similarly, they are most likely to have adequate command of topic-relevant vocabulary and grammatical structures if the test task closely respects thematic areas and grammar points that have been treated in some depth beforehand. All of these conditions were respected in this particular instance; not only had the students been asked to write a number of stories both in the classroom and in homework assignments, but the test task itself was also largely based on the subject matter (scary or embarrassing incidents) and narrative tenses (past simple and past continuous) that the course content had focused on for a number of weeks. In that sense, the complex intertwinement of achievement and proficiency assessments that needed to be applied to the resulting student productions was once again underlined: while the learners certainly needed to bring into play ‘all and any language at their disposal’ (i.e. their entire proficiency repertoire) to compose adequate stories, it was also clear that the achievement of recently treated objectives could not be entirely disregarded in the corresponding assessment.

36 See appendix 17.

It should also be pointed out that a choice of two different story beginnings was offered to the students in the ensuing summative test. However, since the type of text that had to be produced was essentially the same in both cases, I would argue that the act of supplying two slightly different options only had a very limited impact on the overall reliability of the test results in this case. In fact, the two options offered the students an additional chance to show what they were capable of in writing; especially since such a strongly creative effort was demanded by the task in this case, it seemed only fair to offer at least a variety of options that could suitably stimulate the learners’ imagination.
As further support, two very broad guidelines were included below the story extracts on the test sheet: by insisting that the students had to include both ‘what they did’ and ‘how they felt’ in the given situation, their room for creativity was not greatly reduced, yet a very basic answer content had been indicated which they could fall back on. Additionally, this inclusion of two core features to be respected also had a useful impact on the ensuing assessment procedure, as it made a more systematic and criteria-oriented analysis of the content of the students’ texts possible.

Nevertheless, especially if this type of exercise is implemented in summative tests on a more regular basis, it certainly seems legitimate to eventually leave out such clues and thus to encourage students even more to come up with independent and creative solutions in their aim to fulfil the general communicative goal. In fact, a more extreme way of limiting the amount of ‘reception’ as a basis for ‘production’ is exemplified in a proficiency-based test task that appears in Cambridge PET 37 examinations: in the most extensive writing tasks, the candidates can show their creative writing abilities by developing a suitable story simply based on a short opening sentence (such as ‘Jo looked at the map and decided to go left.’ 38). In this case, a very wide variety of student answers is evidently possible, and the students are certainly encouraged to activate the full range of their writing competences. However, great care needs to be taken in the corresponding content assessment since the relevance of various student productions may sometimes be problematic; if no points to be included have been indicated, it is much easier for students to veer off topic and thus to produce an answer that may be fairly proficient in terms of language, yet can still be largely devoid of suitable content.

37 While the PET examination actually targets the CEFR B1 level, the type of task described in this instance can certainly be adapted to A2-level purposes as well.
38 Example taken from Preliminary English Test for Schools – Handbook for teachers, p.21.

Interestingly, one element of a ‘good’ writing task is actually missing in the abovementioned examples of story writing: there is no implied audience (other than the teacher) which the written production should address. Of course, artificial contexts could be imagined and indicated to circumvent this, for example by stating in the instructions that the students had entered a ‘story writing contest’ and should therefore try to make their story as original and interesting as possible. However, the imposed nature of such indications does not seem to add very much to the task in the way of “true” authenticity. In fact, in contrast to situated writing activities such as letter writing, where an implied reader is of great importance to the adopted genre and style of the produced text, the absence of an ‘audience’ is not nearly as decisive in the context of story writing. As the students are aware that their written production should focus on the logical development and sequencing of events, it seems safe to assume that they will aim to respect the particular conventions of this type of creative writing even if no audience is explicitly mentioned.
The different types of tasks described in this chapter evidently only represent a very small portion of the possible ways in which “true” writing skills can be activated and verified in summative tests; many more are easily conceivable and realisable. In addition, it has certainly been underlined that the transformation from “traditional”, excessively grammar-based and mechanical writing tasks to more stimulating and competence-based ones does not need to be an intrinsically difficult and challenging proposition at all. In fact, a number of simple yet careful and reasoned modifications can often suffice to turn a basic writing exercise into a contextualised and meaningful writing task. If tasks with truly communicative purposes are implemented in tandem with such a multifaceted and universally applicable assessment tool as the marking grid described in this chapter, a more thorough and nuanced insight into the students’ writing competences can certainly be gained from their written productions in regular summative tests.

Chapter 5: Conclusions and future outlook.

At the beginning of this thesis, the three notions of validity, reliability and feasibility were established as indispensable cornerstones for truly pertinent forms of testing and assessment. The question was then raised as to what extent a competence-based approach to the productive skills of speaking and writing could be helpful to enhance the adequacy of our summative tests in relation to all three of these key concepts. After exploring the theoretical implications and usefulness of a framework such as the CEFR, as well as describing and analysing attempts to apply some of its guiding principles to test design and corresponding assessment strategies in practice, it is now possible to reach founded conclusions about the opportunities – but also challenges – that such a revamped approach implies.

5.1. The impact of a competence-based approach on test design and assessment

5.1.1. Effects on validity

The renewed focus which the CEFR-inspired, competence-based approach has put on the inherently communicative purposes of speaking and writing acts can only have a positive impact on the content and face validity of tests that correspondingly target the students’ productive skills. Indeed, if a given test task does not merely try to get an overview of the student’s knowledge of discrete language items, but rather asks the learner to activate a wide range of linguistic competences to reach a specific communicative goal, then we are truly testing the student’s ability to produce meaningful writing or speaking samples (and thus the test is ‘actually measuring what it intends to measure’ 1). As seen in the respective practical examples, important steps towards this aim can for example be taken by getting students to perform a range of productive and interactive tasks in oral tests, or by striving to contextualise writing tasks and to imbue them with a clear communicative purpose whenever possible.

1 See pp.13-14 in chapter 1.

In turn, the inferences which can be drawn from student performances in such tests in regard to their overall language proficiency and the levels of competence reached raise construct validity in relation to these two concepts; in excessively grammar-centred tasks, for instance, this would not be possible to nearly the same extent. Naturally, it is important to keep in mind that summative tests alone can never give a complete picture of a student’s veritable proficiency level.
While they can certainly contribute useful indications to that effect (especially if they observe such guidelines as including a range of different task types), one should not forget that competences can never be fully revealed by any single, isolated performance. Hence, even with test tasks that are marked by increased degrees of content and construct validity (in terms of truly allowing the assessor to verify speaking and writing skills), regular summative assessments still have to be complemented by a whole range of formative ones. Only if the latter are continuously and conscientiously gathered in the classroom can we hope to reach completely informed conclusions about the proficiency levels that our learners have attained. In relation to speaking, this means for example that a student’s five-minute performance in a term’s oral test must be seen in conjunction with his regular contributions in classroom activities; even if the test tasks sample a range of different aspects, they cannot possibly cover all the activities, situations and subject-matters that are dealt with in class over the course of a term or year. While a specific mark can of course be attributed to an isolated performance in a summative test, a broader proficiency assessment can thus only be reached after diverse language samples have been provided over the course of a more extensive period of time (and thus, in addition to summative tests, also in contexts where time and achievement pressures have less or no impact on student-related reliability).

As far as a comprehensive examination of writing competences is concerned, a further interesting feature that is inherent to this particular productive skill may additionally be pointed out. By their very nature, summative tests almost inevitably need to focus on product writing: only the final result (i.e. the student’s finished text) is considered as a basis for the eventual assessment. While the previous chapter has demonstrated that numerous key insights can of course be gained from such a complex sample, another unavoidable corollary is that summative assessments cannot possibly take into account the various features of the actual process that has led to this product. Yet using strategies such as planning, drafting, editing and re-drafting 2 in appropriate ways is undeniably part of the student’s writing competences as well. In this regard, portfolio work (including samples from different stages of the composition process) could for example provide a useful alternative to draw upon in order to complement the product focus of summative tests with a process one, and thus to lead to an even more complete picture of the student’s writing abilities.

2 See Harmer, op.cit., p.257.

Aside from the aforementioned positive impact on the content validity of test tasks, an alignment with the communicative approach of the CEFR can also raise the validity of the applied assessment principles by catering more suitably for the proficiency levels that the students are likely to have reached at precise stages of their language learning. Through a realistic, criteria-referenced system that allows for remaining imperfections in the students’ language productions, more appropriate and precisely defined standards can thus be set as targets that the students’ productions need to meet. Indeed, especially at lower levels (such as A2), the beneficial effects of moving away from an excessively form-focused assessment scheme have repeatedly been pointed out in this thesis.
By avoiding an unfair comparison of the students’ work to the unrealistic expectations of native-speaker flawlessness, more important (and valid) features such as the overall communicative success of a particular production sample can be placed at the centre of the assessment system, and gauged in relation to the actual linguistic means that students can be expected to have at their disposal at a given stage of their language learning process. At the same time, however, the practical experimentation with marking grids has also revealed a number of prerequisites that need to be respected when trying to derive a workable summative assessment system from the original CEFR scales. Hence, the need for adaptation to precise purposes and needs, firmly and explicitly encouraged by the Framework writers themselves, is reflected in the more concise and specific terminology that has to be used in marking grids for both speaking and writing tests. Of course, the CEFR descriptor scales can serve as an important basis when it comes to defining appropriate ‘basic’ and ‘target’ standards to be generally attained in the students’ corresponding performances. Yet their overall proficiency focus ultimately implies that any attempt to simply equate the original descriptor scales with numerical values (e.g. mark X for level A1, mark Y for level A2…) for an isolated summative assessment would be ill-advised. Since single performances can never completely reveal the underlying competences at work, it would indeed be contradictory to directly apply the CEFR descriptors in a way that would suggest that this was actually possible. Moreover, a marking system spanning several CEFR levels would evidently not be appropriate to identify the relatively small nuances between the individual performances in a class of school-based learners who mostly operate at the same level (in this case A2). Instead, it seems more sensible to use CEFR-derived descriptors that still reflect the overall competence levels to be reached in a given learning cycle at school yet can be applied to situate a specific, “one-off” performance in relation to both content- and language-related criteria with more precision.

In turn, this also leads back to the difficulty of uniting achievement and proficiency factors when assessing a student’s speaking or writing performance in a summative test. If verifying the achievement of particular learning objectives constitutes a fundamental purpose of any such test, the added influence of overall target language proficiency becomes evident if purposeful, communicative writing or speaking tasks are to be dealt with successfully. One possible way in which this dual challenge can be approached has been exemplified in the various sample tests described in this thesis: if the designed productive tasks are generally (but not exclusively) centred on grammatical structures and topical lexis encountered and developed in class prior to the summative test, the students’ ability to deal with the tasks in an appropriate way depends in no small part on an adequate mastery of those items (and thus their achievement of fairly narrow objectives). However, as they additionally need to call upon ‘all and any [supplementary] language at their disposal’ to unify the different parts of their answers into a coherent whole, a more general insight into overall speaking or writing proficiency may also be gained through such meaningful and communicative productions.
Rather than unsuccessful ‘experimental’ structures that the student may have included in his performance, recurrent errors in relation to familiar and rehearsed items might then be regarded as more legitimate reasons for lower form-related marks, since they arguably represent the learner’s failed achievement of intermediate objectives. If too many such errors mark the student’s production, an overall failure to correctly transmit the communicative message may of course be the consequence as well. However, if this is not the case, it is important that the overall assessment of a student performance takes into account that numerous additional elements (rather than just a few specific grammar and lexis items) come into play during a true “free” writing or speaking activity, and that the resulting end product should thus not be approached with the same grammar-focused achievement foci as would be the case for discrete-item tasks. Through the use of multifaceted marking grids that sensibly split up various content, coherence, lexis and grammar features, such a nuanced form of assessment is certainly encouraged.

Nevertheless, it has also become clear in this thesis that several CEFR components need to be further adjusted to a school-based context. Thus, the leniency with which some A2 descriptors handle accuracy (in terms of grammar but also spelling and punctuation, for instance) does not always make them particularly suited to application in summative assessments. This explains why, as in the cases explored in the previous two chapters, marking grids for speaking and writing performances of ‘A2’ learners may sometimes aim for ‘B1’-level accuracy. While still allowing for a number of mistakes to be included in a suitable performance (and thus respecting one of the most valuable characteristics of the lower CEFR levels), these descriptors also reflect that a systematic and in-depth exploration of specific language items has preceded the test performance in the classroom; this should, after all, not be neglected.

5.1.2. Effects on reliability

The use of suitable criteria-based marking grids leads to promising ramifications in regard to various aspects of reliability as well. As illustrated in chapters 3 and 4, student performances can be analysed in much more nuanced, targeted and unbiased ways if soundly researched and clearly defined assessment criteria rather than holistic and arguably impulsive impressions provide the basis for assessment. In that sense, individual scorer reliability can certainly be increased through the rigorous application of exactly the same criteria to all student productions. Of course, this does not mean that strategies such as norm-referencing need to be completely abandoned; for example, comparing two student productions in relation to the same criterion in the marking grid in order to see which one more fittingly corresponds to the phrasing of a particular descriptor certainly makes sense. However, it is important that each student’s work is not exclusively compared to his peers’, so that the attainment of a given mark is possible by reaching a fair(er) and reasonable standard. Ideally, criteria-referenced assessments can also pave the way towards higher inter-rater reliability. Indeed, if every single teacher consistently bases his assessments on exactly the same marking criteria as his colleagues, the chance that major discrepancies will affect the eventually awarded scores is bound to be reduced. However, this is not an automatic certainty.
As was pointed out in regard to the marking grids that were used to assess both speaking and writing performances, the descriptors which such grids contain cannot be absolutely precise, since they need to anticipate (and be applied to) a sufficiently wide range of linguistic structures and content elements that the students might use in their performances. Yet if that is the case, then not all the expressions in such descriptors will be interpreted in exactly the same way by all the teachers who refer to them. Indeed, while the syllabus for 9TE (for instance) provides a list of grammatical functions (as well as possible topic areas, notions and communicative tasks) to be covered over the course of the learning cycle 3, there is no explicit gradation in terms of ‘simple’ or ‘complex’ structures, for example. This is certainly an understandable omission in the light of the extremely complicated (and perhaps impossible) decision-making process which this would have involved; in fact, such a distinction of structures into different levels of complexity would be very difficult to justify in purely objective terms, and would almost inevitably have to rely on arbitrary judgments to a certain extent. As a result, however, various subjective interpretations can still be applied to the descriptors in the marking grids, with correspondingly divergent outcomes 4.

3 See syllabus for 9TE (2009-2010 version) as published on http://programmes.myschool.lu, pp.10-12.
4 See section 4.4. in the previous chapter.

To counteract the risks of such persisting discrepancies, it is clear that a consensus needs to be firmly established among teachers as to what exact requirements can be associated with the different quality descriptors in marking grids. A first and foremost prerequisite evidently consists in familiarising oneself with these assessment tools and with the competence-based framework that inspired them; only by actively consulting and applying them in practice can a true ‘feel’ for the logic behind these grids be acquired. To reach a consensus of interpretation, the necessity of cooperation with other teachers then also becomes apparent. If the risk of simply substituting one way of reaching overly subjective (and thus unreliable) assessments with another is to be avoided, it is clear that open-minded discussions and exchanges about potentially ambiguous descriptors or criteria must become a priority; only in that way can a satisfactory and unified approach to summative assessment ultimately be reached. At school level, the informed opinions that individual teachers have reached through practice (but also the challenges they have encountered) could be exchanged in departmental meetings. Regularly attending teacher training seminars which expressly deal with the practical application of marking grids to specific benchmark samples also constitutes a useful initiative, particularly if the insights gained are subsequently shared with other teachers at one’s own school. On a national scale, discussions and decisions about the applied assessment strategies or tools (for example in meetings of the different Commissions Nationales des Programmes) can further lead to consensus building, and thus an increase in inter-scorer reliability can be deliberately and systematically sought. All of these different steps certainly stress the importance of using a common framework to guarantee a greater harmonisation of testing and assessment strategies across the country.
5.1.3. Feasibility of the explored testing and assessment systems

The practical implementation of skills-oriented writing and speaking tasks in the classroom has led to useful insights about feasibility in regard to the three key test phases of design, administration and assessment. First of all, these experiments have illustrated that setting up test tasks which deliberately focus on communicative purposes is well within the reach of every practitioner. The systematic contextualisation and search for authenticity-enhancing factors may of course increase the amount of time necessary to compile convincing skills-based test tasks in comparison to more traditional, narrowly grammar-focused exercises. However, particularly in the case of writing, the required modifications in test design are often much smaller and easier to realise than one might initially have expected. As long as elements such as targeted format, audience, topic and purpose are respected, a writing task can quickly pursue a meaningful, communicative goal. In the case of speaking, the integration of both production and interaction elements may prove slightly more challenging; the search for suitable visual prompts may additionally require some time. Nevertheless, considering the numerous possibilities that are offered by multimedia and Internet resources nowadays, even this process has rapidly become much less time-consuming. In this respect, additional benefits of a more collective approach to teaching could easily emerge as well. If competence-based tasks and test layouts are mutually shared (or even cooperatively set up) by teachers in the same school, for instance, excessive amounts of time and effort spent on individual research and preparations can be replaced by a much more fruitful and efficient way of working as part of a resourceful ‘community of practice’. If the resulting tests are then implemented in various classes of the same level, increased reliability will additionally exist between the results of those different classes (and the respective students may also perceive increased fairness if the level of difficulty of the tests they take is not inextricably linked to the person who teaches them that year).

During the administration phase of the speaking and writing tests described in the previous two chapters, it was possible to gain some useful insights about their feasibility as well. The inclusion of more communicative writing tasks into summative class tests did not pose any major problems, as the elicitation of slightly more extensive writing samples could simply take the place of more traditional “writing” exercises that had essentially focused on grammar or had rather inaptly been labelled “reading comprehension” (i.e. questions about known texts essentially representing pure memorisation tasks). Nevertheless, as numerous students tended to address these writing tasks only towards the end of the summative tests, factors such as time pressure and concentration loss may occasionally have affected the quality of their performances; in turn, this slightly reduced the reliability of inferences that could consequently be drawn about their proficiency levels. As an alternative, one could for example explore the possible implementation of “pure” writing tests (instead of combining them with tasks that focus on other skills), which would allow the learners more time to plan and organise their answers and thus arguably offer them a fairer chance to access the full potential of their writing competences.
At the same time, however, it is of course paramount not to devote a disproportionate amount of classroom time purely to the testing of different skills; after all, sufficient time must remain for language learning in the classroom as well. This can only happen if students do not constantly have the impression that their main activity at school consists in an endless cycle of preparing for and writing tests.

A more complicated case emerges in regard to the feasibility of administering regular speaking tests in the classroom. As illustrated in chapter 3, the actual implementation of a single summative oral test will almost inevitably take a significant amount of time. If sufficiently extensive performance samples are to be gathered through a range of productive and interactive activities, a minimal test duration of 5 minutes per student (or 10 minutes per pair of candidates) is virtually inevitable even at A2 level. Yet with classes of often more than twenty students, this means that three or four entire lessons will then be necessary for a single test. Even though the rest of the class can of course be asked to deal with other assignments while their peers are being examined, and classroom time is thus not entirely “lost”, teachers may still feel uncomfortable devoting such a significant number of lessons purely to testing. Additionally, it has been suggested that solutions will have to be found to keep the majority of the class occupied during the test administration phase; as spare rooms and available staff to supervise the students are often at a premium in daily teaching practice, one may have to conduct speaking tests in the students’ own classroom (with their peers in the background). While such worries may be countered by pointing to the previously neglected and thus long overdue focus on speaking skills which must (and also legitimately should) be part of a balanced, competence-based teaching and assessment scheme, it is true that time and location problems may surface if the theoretical intentions of the curriculum are not adequately supported through suitable practical solutions 5.

5 These time problems are of course particularly relevant in EST classes, where a maximum of four English lessons is available per week in 8e and 9e TE (as opposed to six in 6eM classes of the ES system).

However, taking into consideration that skills-oriented tests do not need to focus as narrowly on a few individual objectives as used to be the case for traditional grammar-based tests, one may argue that the intervals between regular summative tests could in fact be further stretched without imposing excessive amounts of revision material on the learners. In that sense, one possible initiative to counteract time pressures would be to rethink the current three-term organisation of our secondary school system in favour of a semester-based one. Presuming that the number of tests to be administered would not simultaneously (or, if so, only marginally) be raised, the allocation of four lessons to speaking tests might then be easier to take on board, as more classroom time would still be available for extensive and systematic competence-based language learning. Alternatively, if the current course organisation is kept, the reintroduction of genuine “test periods” (e.g. one week to be completely set aside for testing at the end of a term) might be worthy of consideration, particularly taking into account that similar oral tests will have to be conducted on a regular basis in other foreign language courses as well.
A third imaginable solution would be to repeatedly conduct smaller summative assessments during regular speaking activities instead of using a single, bigger test for that purpose. In that scenario, the teacher would simply walk around the classroom during those activities while the students would all be simultaneously engaged in speaking acts; marking grids could then be filled out for a few pairs of learners by listening in on their respective conversations, without needing extra time for another formal test. However, there are several potential drawbacks to that strategy. For instance, not all students’ performances could be sampled during one such activity; different speaking activities would subsequently have to be implemented in other lessons and, as a consequence, not all the student performances would have dealt with the same types of topics or tasks, and the reliability of test results would inevitably suffer. The collection of data would also be problematic due to the high level of noise in the classroom and the impossibility of recording material (at least in usable quality). Finally, this would also take us back to the problem of essentially turning an excessive number of classroom activities into tests instead of more constructive learning opportunities.

However, even if the current system may not be ideal for a maximally efficient implementation of speaking tests just yet, the feasibility of integrating such tests into normal teaching routines is ultimately strongly linked to the practitioners’ willingness to find suitable solutions. In chapter 3, the described practical setup provided one example of how this can be achieved; other ways are surely imaginable as well. Considering that the massive importance of speaking skills (even at low levels) can finally be truly acknowledged in a competence-based teaching and assessment system, simply using time and location issues as an “easy way out” would lead to nothing more than a misguided perpetuation of unbalanced testing traditions – and, considering the disregard of explicit syllabus instructions, an actual ‘dereliction of duty’. Hence, even if conditions favouring the feasibility of speaking tests could certainly still be optimised in the present school system (for example through the more general adaptations suggested above), it would clearly be wrong to view such tests as impossible to accommodate in our daily teaching routines at this point in time.

As far as the assessment of student performances goes, one of the most important conclusions to be drawn from the results of the speaking and writing tests described in this thesis is certainly the realisation that the students were generally able to deal with all the constituent tasks in satisfactory ways (as reflected in their largely sufficient final marks). Thus, the proposed types of skills-based tasks were clearly feasible for “pre-intermediate” learners even with the limited resources that their low proficiency level implied. Hence, the widespread assumption that significant speaking samples cannot be collected in A2-level classes (and that speaking tests should therefore not be systematically implemented there), in particular, has been emphatically refuted.
In the same vein, the students’ various writing samples have shown that they are undeniably capable of producing coherent and communicative texts if the corresponding tasks (and assessment criteria) are appropriately tailored to their proficiency level; an over-insistence on more restrictive, discrete-item exercises in “writing” tests has therefore become obsolete. The applied marking grids themselves have also proven to be very useful once the theoretical soundness of their descriptors and a practical ‘gradation of quality’ (for example into five bands rather than merely three) have been established. The resulting criteria-referenced assessment system favours a more analytic and nuanced interpretation of the students’ speaking and writing performances than an overly holistic and impulsive judgment would allow. Even though this new approach might be slightly more time-consuming, it encourages a more thorough and relevant type of error analysis. In turn, this not only augments the validity and reliability of the assessment; it also facilitates the provision of pertinent and founded feedback to the student about precise areas that he or she needs to improve on. Besides, it conveys increased objectivity and fairness to the student. In that sense, the small amount of extra time arguably invested into the assessment procedure is undoubtedly worthwhile.

5.2. The Luxembourg ELT curriculum, the CEFR and competence-based assessment: perspectives

The conclusions reached above illustrate that competence-based forms of testing and assessment certainly have the potential of leading to numerous promising and vital changes in the overall approach to language teaching in our national school system. However, numerous challenges still lie ahead before the Luxembourg ELT curriculum has been fully adapted so as to maximally realise this potential; this final section will therefore explore some of the most salient developments which may still be necessary in the future. In that context, it is first of all important to stress the exact meaning and scope of the two key concepts of ‘curriculum’ and ‘syllabus’. As Jonnaert et al. point out, a ‘curriculum’ is a much wider concept providing the overall framework and direction for a national education system in terms of its guiding pedagogic principles and intended outcomes:

Un curriculum est un ensemble d’éléments à visée éducative qui, articulés entre eux, permettent l’orientation et l’opérationnalisation d’un système éducatif à travers des plans d’actions pédagogiques et administratifs. Il est ancré dans les réalités historiques, sociales, linguistiques, politiques, économiques, religieuses, géographiques et culturelles d’un pays, d’une région ou d’une localité. 6

6 Philippe Jonnaert, Moussadak Ettayebi, Rosette Defise, Curriculum et compétences – Un cadre opérationnel, De Boeck (Brussels: 2009), p.35.

While its predominant focus can lie on a number of different factors 7, any pertinent curriculum ensures that the corresponding school system provides its students with a relevant and efficient education by taking into account such crucial factors as sociopolitical, linguistic and economic contexts and necessities.

7 Jonnaert et al. draw attention to four different ‘curricular ideologies’ (p.30): 1. The ‘Scholar Academic Ideology’, which prioritises the transmission of knowledge; 2. The ‘Social Efficiency Ideology’, which focuses on social factors and aims to produce individuals who fit into their surrounding society and help to maintain it; 3. The ‘Learner Centered Ideology’, which targets the ‘social, intellectual, emotional and physical’ development of the individual learner; 4. The ‘Social Reconstruction Ideology’, which is ‘based on a vision of society’ and thus ‘considers education as a means to facilitate the construction of an equitable society’.

A ‘syllabus’ (or ‘programme d’études’), on the other hand, is in fact one of the practical means to realise the general curricular goals:

Si le curriculum oriente l’action éducative dans un système éducatif, les programmes d’études définissent les contenus des apprentissages et des formations.

In essence, Jonnaert et al.
therefore define the relation between ‘curriculum’ and ‘syllabus’ as one of ‘hierarchical inclusion’: if the curriculum postulates the overall direction and aims of an education system, the different syllabi correspondingly need to guide teachers as to how these can be pursued in practice through clear indications of course contents and teaching methods. To avoid contradictions or excessively dissimilar foci and orientations between the various syllabi, it is of crucial importance that the curriculum links them through a coherent and shared logic 8.

8 Ibid., p.31: ‘Le curriculum assure une cohérence inter-programmes d’études et évite que ces derniers ne s’isolent en autant de silos avec une logique et une terminologie qui leur est chaque fois spécifique.’

By describing, illustrating and analysing examples of how summative tests can be adapted to more competence-oriented ways of assessing the productive skills, this thesis has been centred on a core and decisive feature of the ongoing reform of a national school system which largely seeks an alignment with the six proficiency levels and overall communicative approach of the CEFR. At the same time, this focus on testing and assessment reflects the most common influence which the CEFR has had across Europe since its publication; as Little points out, ‘to date, [the CEFR’s] impact on language testing far outweighs its impact on curriculum design and pedagogy’ 9. However, it is clear that numerous elements other than testing and assessment alone need to be adapted if a truly coherent and competence-oriented curriculum is to guide the language teaching practices in our national education system.

9 David Little, ‘The Common European Framework of Reference for Languages: Perspectives on the Making of Supranational Language Education Policy’ in The Modern Language Journal, 91, iv (2007), p.648.

As Little further stresses,

[t]here are two ways in which the CEFR can influence official curricula and curriculum guidelines. On the one hand, desired learning outcomes can be related to the common reference levels… On the other hand, the CEFR’s descriptive scheme can be used to analyze learners’ needs and specify their target repertoire in terms that carry clear pedagogical implications. 10

10 Ibid., p.649.

In the current Luxembourg ELT curriculum, attempts to integrate both of these steps are clearly visible. The way in which the ‘learning outcomes’ of the various syllabi are being aligned with specific CEFR levels is for example illustrated through the A2-level proficiency that 9e and 6e students are currently expected to reach in order to successfully complete their respective learning cycles.
Correspondingly, the ‘target repertoire’ of linguistic competences (which has been adapted from the CEFR in view of the ‘learners’ needs’) is clearly stated in regard to the ‘four skills’ both in the official syllabi for those classes and in the end-of-cycle document (‘complément au bulletin’) which certifies their achievements accordingly. In this respect, one may argue that the overall alignment of a language curriculum with the CEFR potentially grants a certain coherence to the syllabi for the various constituent learning cycles, as the learner’s progression is consistently measured through the attainment of different proficiency levels that stem from the same unified framework. In terms of overall curricular aims, the competence-based approach of the CEFR also points to a promising, learner-centred pedagogy that abandons an obsolete, pure transmission of knowledge which may prove of little use to the learners once they have finished school; instead, the school system aims to produce independent and competent language users (rather than individuals who may know more about the technicalities of a language, but struggle to communicate effectively and with sufficient fluency). Given the Europe-wide impact of the CEFR on the education systems of numerous member states (including the entry requirements of their higher education institutions), the choice to pursue such an alignment also seems a sensible one if the comparability and pertinence of the results which students ultimately achieve in our school system are to be increased at an international level (even if, at the same time, Fulcher’s warnings about a premature ‘reification’ of the CEFR should not be ignored, and thus an overly precipitated, forced and potentially invalid alignment with the Framework must be carefully avoided).

However, the challenges which still lie ahead in the ongoing overhaul of the Luxembourg ELT curriculum also become evident. As has been pointed out, the CEFR neither presents an inherent focus on a school-based context nor (for that reason) indicates pedagogical steps and measures which facilitate the progression from one proficiency level to the next. What further complicates matters is that this development process is not a linear one; as Heyworth notes, the levels ‘are not designed to be split up into equal chunks of time in a syllabus, and it will take longer to move from B2 to C1 than from A1 to A2’ 11, for instance. This creates a range of problems in terms of syllabus design, both within each individual stage (or cycle) of the language learning course and when it comes to linking them with each other in a coherent and valid way. Indeed, how can we make sure that the expected CEFR level that we set for the end of a given learning cycle is a realistic learning outcome, and that the pace of progression which we define is a sensible and feasible one? While the ‘A2’ level which this thesis has focused on was a logical choice for the lowest classes of the English curriculum (given the extreme limitations of ‘A1’ and the clearly excessive requirements of ‘B1’ for first-time 6e, 8e and 9e students 12), school-based learners’ further rate of progression is more difficult to predict.

11 Heyworth, art.cit., p.17.
12 The adequacy of ‘A2’ for these classes was also partially underlined by the results of online placement tests designed by the University of Oxford (http://) and implemented to identify the students’ proficiency in terms of the CEFR levels. The majority of 6e and 9e students from the Luxembourg school system who took these tests in June 2010 were ultimately attested an ‘A2’ level. However, these tests were purely based on ‘English in use’ and ‘listening’ exercises; without a thorough assessment of more extensive writing and speaking samples (i.e. evidence of productive skills development), these results can only be of an indicative nature.
Preliminary progress estimates (in terms of CEFR levels to be gradually attained) have been established for both the ES and EST systems in the Luxembourg school system 13, replacing the previous (rather vaguely defined) gradations from ‘elementary’ to ‘advanced’ proficiency levels, but in essence they still need to be validated through practice. At the time when this thesis is being written, fitting competence-based syllabi for the attainment of higher CEFR levels (from ‘B1’ onwards) are still being developed and need to be fine-tuned to our students’ language needs; naturally, they also have to tie in seamlessly with the preceding A2 syllabi to guarantee a coherent and thoughtfully interlinked system.

13 See synopsis for syllabus for 6eM and 5eC (2009), p.4 and synopsis for syllabus for 9TE (2009), p.7 (accessible via www.myschool.lu).

A possible way to fulfil this latter requirement may for example consist in using the ‘target standards’ defined for ‘A2’ as ‘basic standards’ for ‘B1’; indeed, a student who has fully achieved the ‘target standard’ in the first English cycle has effectively started working within ‘B1’ (even more so since some of the ‘band 5’ descriptors in the corresponding marking grids are, as has been shown, more geared towards ‘B1’ or ‘A2+’ than a basic ‘A2’). Nevertheless, given that the CEFR does not offer any precise projections on how long the transition from one level to the next may take in a school-based context, this will to a certain extent have to be established (and possibly re-evaluated) through empirical means.

Another important consideration concerns the persisting reliance on an approach that revolves around the ‘four skills’ and is astutely touched upon by Keddle:

In many secondary schools the programme tends to focus on reading and writing skills, and makes more progress in these than in speaking and listening. When matching a standard classroom syllabus with the CEF, one has to ‘hold back’ progress in reading and writing in order to allow the speaking and listening areas to catch up. 14

14 Keddle, art.cit., p.48.

While Keddle consequently welcomes the increased valorisation of speaking and listening as a result of the CEFR’s impact on syllabus design, her quotation also reveals a deeper issue with the ‘four skills’ approach in view of certified achievements and defined learning outcomes. In fact, the question arises whether we should invariably insist on a single, unified CEFR level to be attained by our students at the end of a learning cycle (i.e. A2 in all four skills), or whether ‘mixed profiles’ involving different levels for different skills might sometimes make more sense. Where Keddle sets writing and reading against speaking and listening, one might similarly set the students’ progress in the receptive skills against the headway they make with the productive ones.
After all, a student whose writing corresponds to A2 proficiency may be perfectly able to read and understand a B1-level text. In that sense, it will be necessary to consider to what extent global or mixed-level requirements may be the most suitable solution for the definition of targeted learning outcomes at various stages of the ELT curriculum. In a similar vein, the question also arises as to which (unified or mixed) final levels our ES and EST systems should respectively aim for, and how these achievements can be validly certified – after all, if the students’ performances in both receptive and productive skills should determine the attainment of the targeted proficiency level, then the format of our final examinations (and the correspondingly applied assessment strategies) will have to change significantly as well.

All of the abovementioned issues evidently transcend the scope of the present thesis by a long way; however, they will most certainly have to be addressed in depth if a coherent and valid competence-based ELT curriculum is to guide our national education system at all levels.

A distinctly positive development which the move towards a competence-based teaching and assessment system implies (and which has been underlined in this thesis) is the move away from an excessively de-contextualised and grammar-focused (i.e. knowledge-centred) language curriculum. This does not mean that grammar will no longer remain an important component in our English courses; far from it. However, when ‘preparing…syllab[i] and developing activities that truly reflect both the CEF and the tried and tested grammar strands’ 15, a two-way adaptation process is necessary. On the one hand, the original CEFR descriptors and levels do not carry sufficiently precise indications about specific grammatical structures and forms that are needed to reach a given proficiency level; therefore, a more systematic grammar dimension must be added to them in a school-based context. On the other hand, if a curriculum is largely based on the distinct functional/situational approach of the CEFR, the excessive concentration on grammatical elements will be reduced and supplemented by more communicative aims. As a result, more purposeful and situated learning (where grammar is not studied in a de-contextualised way but rather integrated and used to fulfil a communicative aim) will come to dominate both classroom activities and test tasks.

15 Ibid., p.44.

As Jonnaert et al. also vitally stress, it is only in such a way that a competence-based curriculum can ultimately truly make sense. Not only do they point to the particularly fruitful effects of an appropriate contextualisation of learning situations 16; they also postulate that competences must in fact be constructed through action in – or in response to – precise situations. The assessment of whether or not a given competence has been developed to a sufficient extent is then based on the adequate treatment of a similar situation at a later point in time 17. In that sense, a fundamentally competence-oriented curriculum should specify the types of behaviour that a learner needs to demonstrate to be deemed ‘competent’, and (through the various syllabi) communicate pedagogic steps and strategies to teachers as to how the development of these
competences can be favoured through appropriate situations in the classroom 18.

16 Jonnaert et al., op.cit., p.70: ‘les stratégies orientées vers la contextualisation des contenus d’apprentissage et la construction de sens par les apprenants, ont statistiquement le plus d’effets positifs sur les résultats des apprentissages.’ Emphasis added.
17 Ibid., pp.68-71.
18 Ibid., p.71.

The alignment of curricular goals with the CEFR and the different competences it defines as necessary for the fulfilment of various communicative purposes certainly comes much closer to such a logic than an approach which simply defines a range of de-contextualised and grammar-based objectives to be achieved. Understandable anxieties as to whether this could lead to an excessive disregard of transmitting knowledge about the target language to the learners are countered by Jonnaert et al. in a way that is reminiscent of the CEFR approach focusing on ‘what the learners do with grammar’:

les savoirs disciplinaires ne sont ni exclus ni minimisés dans une approche située. Leur utilisation comme ressource en situation renforce au contraire la pertinence de leur construction par les étudiants. 19

19 Ibid., p.90. Emphasis added.

In a final perspective, a further interesting quality of the ‘situated approach’ which Jonnaert et al. thus advocate is the implication that the corresponding learning situations are bound to be complex and multi-disciplinary 20. Although the competence-oriented rewriting of individual syllabi for the various secondary school subjects is currently taking place in a parallel rather than interconnected way, exploiting their commonalities more deliberately and efficiently could certainly prove a valuable and rewarding direction to take in the future. Indeed, substituting the strict separation and juxtaposition of different disciplines with a more integrative and combined approach (reflecting and maximising for example the benefits of the plurilingual aspects of numerous learner competences alluded to in the CEFR) may constitute a potent (albeit challenging) way to further enhance the curricular coherence in our education system.

20 Ibid., p.72: ‘Ces situations sont par nature complexes et pluridisciplinaires’.

Yet wherever we ultimately go from here, it is certain that the successful shift towards a fully effective competence-based teaching and assessment scheme will only be possible if the practitioners do their utmost to find suitable and practicable applications of the curricular guidelines “in the field” 21. As this thesis has tried to exemplify, there are for instance many pertinent and feasible ways in which this can be done in regard to developing, testing and assessing our students’ productive skills even at lower levels. Given the thorough and nuanced insight which this gives us into their overall target language proficiency (thus making it possible to verify the success of our language teaching within a significantly wider scope), it highlights one of the numerous things that we can do to guide our language teaching in a direction that is not only much more relevant to “real world” contexts, but also produces results of higher validity, reliability and potentially even international comparability.

21 See ibid., p.52: ‘L’adhésion des enseignants au curriculum constitue une variable très importante de la réussite des innovations proposées à travers les réformes curriculaires.’
Bibliography

Books:
ASTOLFI, Jean-Pierre, L’erreur, un outil pour enseigner, ESF (Paris: 1997)
BROWN, H. Douglas, Principles of Language Learning and Teaching (5th ed.), Longman/Pearson (New York: 2007)
BROWN, H. Douglas, Teaching by Principles: An Interactive Approach to Language Pedagogy (3rd ed.), Pearson Longman (New York: 2007)
COHEN, Louis, MANION, Lawrence & MORRISON, Keith, Research Methods in Education, Routledge (London / New York: 2007)
COUNCIL OF EUROPE, Common European Framework of Reference for Languages: Learning, Teaching, Assessment, Cambridge University Press (Cambridge: 2001)
DENSCOMBE, Martin, The Good Research Guide (3rd ed.), Open University Press (New York: 2007)
HARMER, Jeremy, The Practice of English Language Teaching, Pearson Longman (Harlow, England: 2006)
HASSELGREEN, Angela et al., Bergen ‘Can Do’ project, Council of Europe (Strasbourg: 2003)
JONNAERT, Philippe, ETTAYEBI, Moussadak & DEFISE, Rosette, Curriculum et compétences – Un cadre opérationnel, De Boeck (Brussels: 2009)
UNIVERSITY OF CAMBRIDGE ESOL EXAMINATIONS, Key English Test – Handbook for Teachers, UCLES (Cambridge: 2009); no individual authors indicated.
UNIVERSITY OF CAMBRIDGE ESOL EXAMINATIONS, Preliminary English Test for Schools – Handbook for Teachers, UCLES (Cambridge: 2008); no individual authors indicated.
UR, Penny, A Course in Language Teaching: Practice and Theory, Cambridge University Press (Cambridge: 2006)

Articles and essays:
ALDERSON, J. Charles et al., ‘Analysing Tests of Reading and Listening in Relation to the Common European Framework of Reference: The Experience of the Dutch CEFR Construct Project’ in Language Assessment Quarterly, 3, 1 (2006), pp.3-30.
ALDERSON, J. Charles, ‘The CEFR and the Need for More Research’ in The Modern Language Journal, 91, iv (2007), pp.659-663.
BONNET, Gerard, ‘The CEFR and Education Policies in Europe’ in The Modern Language Journal, 91, iv (2007), pp.669-672.
FIGUERAS, Neus, ‘The CEFR, a Lever for the Improvement of Language Professionals in Europe’ in The Modern Language Journal, 91, iv (2007), pp.673-675.
FULCHER, Glenn, ‘Are Europe’s tests being built on an ‘unsafe’ framework?’ in The Guardian Weekly (18 March 2004), accessible at http://www.guardian.co.uk/education/2004/mar/18/tefl2.
FULCHER, Glenn, ‘Testing times ahead?’ in Liaison Magazine, Issue 1 (July 2008), pp.20-23, accessible at http://www.llas.ac.uk/news/newsletter.html.
GOODRICH, Heidi, ‘Understanding Rubrics’ in Educational Leadership, 54, 4 (January 1997), pp.14-17.
GOODRICH ANDRADE, Heidi, ‘Using Rubrics to Promote Thinking and Learning’ in Educational Leadership, 57, 5 (February 2000), pp.13-18.
HEYWORTH, Frank, ‘Why the CEF is important’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.12-21.
HUHTA, Ari et al., ‘A diagnostic language assessment system for adult learners’ in J. Charles Alderson (ed.), Common European Framework of Reference for Languages: learning, teaching, assessment: case studies, Council of Europe (Strasbourg: 2002), pp.130-146.
HULSTIJN, Jan H., ‘The Shaky Ground Beneath the CEFR: Quantitative and Qualitative Dimensions of Language Proficiency’ in The Modern Language Journal, 91, iv (2007), pp.663-667.
KEDDLE, Julia Starr, ‘The CEF and the secondary school syllabus’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.43-54.
KRUMM, Hans-Jürgen, ‘The CEFR and Its (Ab)Uses in the Context of Migration’ in The Modern Language Journal, 91, iv (2007), pp.667-669.
LENZ, Peter, ‘The European Language Portfolio’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.22-31.
LITTLE, David, ‘The Common European Framework of Reference for Languages: Perspectives on the Making of Supranational Language Education Policy’ in The Modern Language Journal, 91, iv (2007), pp.645-653.
MARIANI, Luciano, ‘Learning to learn with the CEF’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.32-42.
MORROW, Keith, ‘Background to the CEF’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.3-11.
NORTH, Brian, ‘Relating assessments, examinations, and courses to the CEF’ in Morrow, Keith (ed.), Insights from the Common European Framework, Oxford University Press (Oxford: 2004), pp.77-90.
NORTH, Brian, ‘The CEFR Illustrative Descriptor Scales’ in The Modern Language Journal, 91, iv (2007), pp.656-659.
POPHAM, W. James, ‘What’s Wrong – and What’s Right – with Rubrics’ in Educational Leadership, 55, 2 (October 1997), pp.72-75.
WEIR, Cyril J., ‘Limitations of the Common European Framework for developing comparable examinations and tests’ in Language Testing, 22 (2005), pp.281-299, accessible at http://ltj.sagepub.com/cgi/content/abstract/22/3/281.
WESTHOFF, Gerard, ‘Challenges and Opportunities of the CEFR for Reimagining Foreign Language Pedagogy’ in The Modern Language Journal, 91, iv (2007), pp.676-679.

Websites and online documents:
CAMBRIDGE ESOL: Teacher Resources – KET, accessible at http://www.cambridgeesol.org/resources/teacher/ket.html#schools
CAMBRIDGE ESOL: Teacher Resources – PET, accessible at http://www.cambridgeesol.org/resources/teacher/pet.html
COUNCIL OF EUROPE, Manual for relating Language Examinations to the Common European Framework of Reference for Languages, accessible at http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp#TopOfPage.
OXFORD ENGLISH TESTING: Oxford Online Placement Test, accessible at http://www.oxfordenglishtesting.com
mySchool! (for syllabi and syllabus-related documents for 6eM and 9TE), accessible at http://programmes.myschool.lu and www.myschool.lu

Seminar handouts and documents:
CLARKE, Martyn, PowerPoint notes from the seminar ‘Creating writing tasks from the CEFR’, held in Luxembourg City in October 2009, © Oxford University Press
CLARKE, Martyn, PowerPoint notes from the seminar ‘Creating speaking tasks from the CEFR’, held in Luxembourg City in February 2010, © Oxford University Press
HORNER, David, PowerPoint notes from the seminar ‘The logic behind marking grids’, held in Luxembourg City in March 2010.
MORROW, Keith, handouts from the seminar ‘Mapping and designing competence-based tests of speaking and writing’, held in Luxembourg City in October 2009.

Coursebooks and other sources of teaching material:
GAMMIDGE, Mick, Speaking Extra, Cambridge University Press (Cambridge: 2004)
HUTCHINSON, Tom, Lifelines Pre-Intermediate Student’s Book, Oxford University Press (Oxford: 1997)
HUTCHINSON, Tom & WARD, Ann, Lifelines Pre-Intermediate Teacher’s Book, Oxford University Press (Oxford: 1997)
Google Images, www.google.com (for pictures)
Wikipedia, en.wikipedia.org (for information about famous people and pictures)

List of appendices
A. Appendices linked to speaking tests
• Appendix 1: Scripted T-S questions used in Speaking Test 1 p.163
• Appendix 2: Cue cards and visual prompts used in Speaking Test 1 p.164
• Appendix 3: Marking grid used for Speaking Test 1 p.168
• Appendix 4: Sample assessment sheets used during Speaking Test 1 p.169
• Appendix 5: Final results of Speaking Test 1 p.172
• Appendix 6: Sample handouts with visual prompts used in Speaking Test 2 p.173
• Appendix 7: Sample handouts for S-S interaction in Speaking Test 2 p.179
• Appendix 8: Marking grid used for Speaking Test 2 p.181
• Appendix 9: Sample assessment sheets used during Speaking Test 2 p.182
• Appendix 10: Final results of Speaking Test 2 p.185

B. Appendices linked to writing tests
• Appendix 11: Sample student productions from test I,1 p.187
• Appendix 12: Marking grid used for writing tasks p.188
• Appendix 13: Free writing tasks for summative test II,1: instructions p.189
• Appendix 14: Sample student productions and assessment sheets from Test II,1 p.190
• Appendix 15: Final results of writing performances in summative test II,1 p.194
• Appendix 16: Alternative free writing tasks – informal letters and emails p.195
• Appendix 17: Alternative free writing tasks – story writing p.196

A) APPENDICES LINKED TO SPEAKING TESTS

Appendix 1: Scripted T-S questions used in Speaking Test 1¹
The questions were laid out in three columns (teacher questions to student A, teacher questions to student B, and backup prompts); related questions and their backup prompts are grouped below:
• Good morning / afternoon. What’s your first name? / And what’s your first name? / Can you tell me your first name, please?
• What’s your surname? Can you spell that, please? / How do you write your surname?
• Student B, where are you from? / And where are you from? / What is your nationality? What is your home country?
• Do you have any brothers and sisters? / And how many brothers and sisters do you have? What are their names? / How many people do you live with?
• What is your father’s job? / What does your father do? What does your mother do? / What job do you want to do?
• Where do you live? What is your address? / And where do you live? And what is your address? / Do you live in…?
• Do you like your hometown? Why? / What about you? Do you like your hometown? / Are there any problems where you live? Are there any problems in your hometown? What’s negative about living in your hometown?
• What do you do in your free time? / What don’t you like to do in your free time?
• What did you do last weekend? / What did you do on Friday and Saturday?
• Do you have an e-mail address? Can you spell it, please? / And what is your e-mail address? Can you spell it, please?
¹ Note: per pair of candidates, only a selection of questions was used (and not necessarily in identical order as presented on this page).
Appendix 2: Cue cards and visual prompts used in Speaking Test 1
Source of images: Google images (www.google.lu) / Wikipedia (en.wikipedia.org)
Each question card bore the instruction ‘Ask your partner the following questions about his/her celebrity:’ followed by the prompts below; the matching fact card (accompanied in the original by a picture of the celebrity) is given after each set of prompts.

Celebrity number 1
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ Which awards / win? (When?) ‐ married? ‐ have / children? (How many? Names?)
Fact card: Name: Angelina Jolie; Date of Birth: 04/06/1975; Place of Birth: Los Angeles; Awards: 1 Oscar (1999); Husband: Brad Pitt; Children: 6 (3 adopted, 3 biological)

Celebrity number 2
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ Job? ‐ married? ‐ have / children? (How many? Names?)
Fact card: Name: Paris Hilton; Date of Birth: 17/02/1981; Place of Birth: New York City; Work: model, singer, actress, businesswoman; Marital status: single (boyfriend: Doug Reinhardt); Children: /

Celebrity number 3
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ What / parents’ names? ‐ What / his last film? When / make it? ‐ have / famous friends? (Who?)
Fact card: Name: Brad Pitt; Date of Birth: 18/12/1963; Place of Birth: Oklahoma; Parents: William Alvin Pitt / Jane Etta; Last film: Inglourious Basterds (2009); Friends: George Clooney, Matt Damon

Celebrity number 4
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / from? ‐ When / start / career (= Karriere)? ‐ married? ‐ Where / live / now?
Fact card: Name: Robert Pattinson; Date of Birth: 13/05/1986; Nationality: English; Start of career: 2004; Marital status: single; Place of residence: Los Angeles

Celebrity number 5
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ What nationality? ‐ Title of last album? When / come out? ‐ Most famous song? ‐ married?
Fact card: Name: Amy Winehouse; Date of Birth: 14/09/1983; Nationality: English; Last album: Back to Black (2006); Most famous song: ‘Rehab’ (2006); Marital status: divorced (Blake Fielder‐Civil)

Celebrity number 6
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ When / start / career (= Karriere)? ‐ What / be / first hit? When / come out? ‐ married?
Fact card: Name: Katy Perry; Date of Birth: 25/08/1984; Place of Birth: Santa Barbara, California; Start of career: 2001; First hit: ‘I Kissed A Girl’ (May 2008); Marital status: single (boyfriend: Russell Brand)

Celebrity number 7
Prompts: ‐ What / your celebrity / look like? ‐ What / artist name? ‐ What / real name? ‐ What / job? ‐ Title of first hit? When / come out? ‐ have / family? (married? children?)
Fact card: Artist name: Eminem; Real name: Marshall Bruce Mathers III; Date of Birth: 17/08/1972; Work: rapper, actor; First hit: ‘The Real Slim Shady’ (2001); Family: ex‐wife Kimberley Anne Scott, daughters Alaina & Whitney

Celebrity number 8
Prompts: ‐ What / your celebrity / look like? ‐ What / artist name? ‐ What / real name? ‐ When / start / career (= Karriere)? ‐ Title of last album? When / come out? ‐ have / children? (How many? Names?)
Fact card: Artist name: 50Cent; Real name: Curtis James Jackson III; Date of Birth: 06/07/1975; Start of career: 1996; Last album: Before I Self‐Destruct (2009); Children: 1 boy (Marquise Jackson)

Celebrity number 9
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ What / biggest success? (When?) ‐ married? ‐ have / children? (How many? Names?)
Fact card: Name: Beyoncé Giselle Knowles; Date of Birth: 04/09/1981; Place of Birth: Houston, Texas; Biggest success: 5 Grammy Awards (2004); Marital status: married (rapper/producer Jay‐Z); Children: /

Celebrity number 10
Prompts: ‐ What / your celebrity / look like? ‐ What / artist name? ‐ What / real name? ‐ Where / from? ‐ Title of first album? When / come out? ‐ Where / live / now?
Fact card: Artist name: Shakira; Real name: Shakira Isabel Ripoll; Date of Birth: 02/02/1977; Nationality: Colombian; First English album: Laundry Service (2001); Place of residence: The Bahamas

Celebrity number 11
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Job? ‐ Where / born? ‐ Title of first hit? When / come out? ‐ girlfriend? (Name?)
Fact card: Name: Robbie Williams; Date of Birth: 13/02/1974; Work: singer‐songwriter; Place of Birth: Stoke‐on‐Trent, England; First solo hit: ‘Freedom’ (1996); Girlfriend: Ayda Field

Celebrity number 12
Prompts: ‐ What / your celebrity / look like? ‐ When / born? ‐ Where / born? ‐ Job? ‐ What / biggest success? (When?) ‐ have / children? (How many? Names?)
Fact card: Name: Thierry Henry; Date of Birth: 17/08/1977; Place of Birth: Les Ulis, Paris, France; Job: football player (Barcelona FC); Biggest successes: World Cup win (1998) and Champions League win (2009); Children: 1 daughter (Téa)

Appendix 3: Marking grid used for Speaking Test 1
Criteria: Grammar and Vocabulary; Pronunciation; Interactive Communication; Global Achievement (each marked on bands 0–3).

Band 3
• Grammar and Vocabulary: Uses basic sentence structures and grammatical forms correctly. Succeeds in using relevant vocabulary to communicate in everyday situations and carry out the tasks set.
• Pronunciation: Speech is intelligible throughout though mispronunciation may occur. Speech is mostly fluent, with little (self-)correction. Basic control of intonation. Occasional L1 interference still to be expected and acceptable.
• Interactive Communication: Communication is confidently handled in everyday situations. Can react quite spontaneously, ask for clarifications and give them when prompted. No serious effort and little or no prompting by listener required.
• Global Achievement: All parts of all tasks are successfully dealt with in the time allotted. In general, responses are relevant and meaning is conveyed successfully. Speech and attitude reflect willingness to engage in English. Student shows readiness to take measured risks in making him/herself clearly enough understood.

Band 2
• Grammar and Vocabulary: Uses a limited range of structures with some grammatical errors of most basic type. Sufficient control of relevant vocabulary with minor hesitations.
• Pronunciation: Speech is mostly intelligible, despite limited control of phonological features. Occasional hesitations and pauses; inconsistent handling of intonation. Pronunciation and intonation are still heavily influenced by L1. Some effort by the listener is required.
• Interactive Communication: Communication is occasionally strenuous, but everyday situations are still mostly dealt with. Can ask for clarification in English. Some efforts, prompting and assistance by listener required.
• Global Achievement: One part of the task is not (or not fully) dealt with. Responses tend to be evasive, but meaning is generally conveyed successfully. Speech tends to be minimalistic, with a reluctance to engage in English. Student generally avoids taking risks.

Band 1
• Grammar and Vocabulary: Has little awareness of sentence structure and little control of very few grammatical forms. Insufficient control of relevant vocabulary, speech often presented in isolated or memorised phrases.
• Pronunciation: Speech is repeatedly unintelligible, with frequent mispronunciation. Speech is monotonous; little awareness of intonation. Pronunciation and intonation are excessively aligned to L1. Considerable effort by the listener is required.
• Interactive Communication: Communication is erratic and repeatedly breaks down. Some long pauses may occur. May use L1 to ask for clarification. Considerable efforts, prompting and assistance by listener required.
• Global Achievement: Most of the tasks are dealt with insufficiently (or not at all). Inability to respond or response is largely irrelevant. Speech and attitude produce a negative impression (reflect unwillingness to engage in English). Unwillingness to take part in everyday-type conversations. No risk-taking.

Band 0
• No rateable language. Totally incomprehensible. Totally irrelevant.
Appendix 4: Sample assessment sheets used during Speaking Test 1¹
[Scanned assessment sheets for students 20, 9, 12, 10, 16 and 6.]
Note: the crossed-out, corrected marks in student 20’s assessment grid reflect the two-stage assessment process mentioned in section 3.2.5. of the main text. They exemplify the slight amendments that were occasionally made in the closer consideration of individual criteria during the second stage.
¹ The text boxes have been inserted to guarantee the learners’ anonymity. Student numbers are identical to those used in appendices 5 and 10.

Appendix 5: Final results of Speaking Test 1
[Bar charts: ‘Speaking Test 1: Final Marks’ – overall marks (out of 12) per student, and separate charts of the marks (0–3) awarded for Grammar and Vocabulary, Pronunciation, Interactive Communication and Global Achievement.]

Appendix 6: Sample handouts with visual prompts used in Speaking Test 2¹
Each handout (Speaking test II,3: Part ONE) shows two pictures and carries two tasks:
1. Describe the two pictures. What do you see in each one of them?
2. A choice question, which varies from handout to handout:
• You have the chance to go to ONE of these two places. Which one will you choose? Why?
• You have the chance to live in ONE of these two places. Which one will you choose? Why?
• You have the chance to get a holiday in ONE of these two places. Which one will you choose? Why?
• You have the chance to get ONE of these two houses. Which one will you choose? Why?
• You have the chance to get ONE of these two jobs. Which one will you choose? Why?
• You have the chance to do ONE of these activities. What will you do? Why?
¹ Source of images: Microsoft Word (Office 2007) Online Clipart Library / Google images (www.google.com)

Appendix 7: Sample handouts for S-S interaction in Speaking Test 2

Speaking test II,3: Part 2 (student A)¹
You are a tourist in this town, and you don’t know your way around. Ask your partner for directions to the following places:
1. You are at the bus station. You want to go to the Royal Hotel. Ask for directions politely.
2. Now you want to leave the Royal Hotel. You want to go to a bakery next. Ask for directions politely.
Student B instructions: You live in this town.
Your partner is a tourist here and will ask you for directions to two different places. Use the map to give your partner the correct directions. Be as precise as possible (use street names, the position of buildings…).
¹ Source of map: Tom Hutchinson & Ann Ward, Lifelines Pre-Intermediate Teacher’s Book, Oxford University Press (Oxford: 1997), p.128.

Speaking test II,3: Part 2 (student B)²
You are a tourist in the following town, and you don’t know your way around. Ask your partner for directions to the following places:
1. You are at the train station. You want to go to the hospital. Ask for directions politely.
2. Now you’re finished at the hospital. You want to go to the cinema next. Ask for directions politely.
Student A instructions: You live in this town. Your partner is a tourist here and will ask you for directions to two different places. Use the map to give your partner the correct directions. Be as precise as possible (use street names, the position of buildings…).
² Source of map: Tom Hutchinson, Lifelines Pre-Intermediate Student’s Book, Oxford University Press (Oxford: 1997), p.52.

Appendix 8: Marking grid used for Speaking Test 2
Source: official syllabus for 9TE, October 2009 version (http://programmes.myschool.lu), p.72.
Criteria: Content (task response); Lexis (appropriacy, range, accuracy, paraphrase); Grammatical structures (appropriacy, range, accuracy); Pronunciation and discourse management (pronunciation; fluency – speech rate and continuity; effort to link ideas and language so as to form a coherent, connected speech; prompting and support). Each criterion is marked on bands 0–5.

Band 5
• Content: In general, response is relevant. Communication handled in everyday situations. Meaning conveyed successfully.
• Lexis: In general, adequate range of vocabulary for the task. Attempt to vary expressions, but with inaccuracy. Paraphrase attempted, but with mixed success.
• Grammatical structures: In general, adequate range for the task. Good degree of control of simple grammatical forms and sentences. Both simple and complex forms are used. Minor and occasional errors.
• Pronunciation and discourse management: Intelligible throughout though mispronunciation may occasionally cause momentary strain for the listener. Produces simple speech fluently. Usually maintains flow of speech but uses repetition, self-correction and / or slow speech to keep going. May over-use certain connectives and discourse markers. Requires little prompting and support.

Band 4
• Features of bands 3 and 5. Little effort by listener required.

Band 3
• Content: Response is mostly relevant, despite some digressions. Basic meaning conveyed in very familiar everyday situations. Some effort on behalf of the listener required.
• Lexis: Limited range of vocabulary which is minimally adequate for the task and which may lead to repetition. Paraphrasing rarely attempted.
• Grammatical structures: Only a limited range of structures is used. Limited control of basic forms and sentences. Subordinate structures are rare and tend to lack accuracy. Frequent grammatical errors.
• Pronunciation and discourse management: Speech is intelligible, despite limited control of phonological features. Effort on behalf of the listener required. Noticeable pauses and slow speech, frequent repetition and self-correction. Very short basic utterances, which are juxtaposed rather than connected or linked through repetitious use of simple connectives. Requires prompting and support.

Band 2
• Features of bands 1 and 3. Serious effort by listener required.

Band 1
• Content: Most of the response is not relevant. Little communication possible.
• Lexis: Only isolated words or memorised utterances are produced. Little or no paraphrasing attempted.
• Grammatical structures: Limited control of very few grammatical forms. Cannot produce basic sentence forms.
• Pronunciation and discourse management: Speech is often unintelligible. Long pauses before most words. Responses limited to phrases or isolated words. Hardly any control of organisational features even at sentence level. Requires considerable prompting and support.

Band 0
• No rateable language. Totally incomprehensible. Totally irrelevant.
Appendix 9: Sample assessment sheets used during Speaking Test 2¹
[Scanned assessment sheets for students 13, 14, 12, 10, 5 and 21.]
¹ The text boxes have been inserted to guarantee the learners’ anonymity. Student numbers are identical to those used in appendices 5 and 10.

Appendix 10: Final results of Speaking Test 2¹
[Bar charts: ‘Speaking Test 2: Final Marks’ – overall marks (out of 20) per student, and separate charts of the marks (0–5) awarded for Content, Lexis, Pronunciation and discourse management, and Grammatical structures.]
¹ The omission of students 8+9 in these charts is based on the fact that these students had left the class after the first term. For a similar reason, student 23 (who joined the class in the second term) has been added. All other student numbers have been kept identical to the results of speaking test 1 to allow for a better comparability of performance.

B) APPENDICES LINKED TO WRITING TESTS

Appendix 11: Sample student productions from Test I,1¹
[Scanned productions of students A, B and C.]
¹ The text boxes have been inserted to guarantee the learners’ anonymity.
Appendix 12: Marking grid used for writing tasks
Source: official syllabus for 9TE, October 2009 version (http://programmes.myschool.lu), p.76.
Criteria: Content (task achievement; format, if required; effect on reader); Coherence and cohesion (logic; fluency; control of linking devices, referencing…; paragraphing); Lexis (appropriacy, range, accuracy); Grammatical structures (range, accuracy). Each criterion is marked on bands 0–5.

Band 5
• Content: In general, content elements addressed successfully: message clearly and fully communicated to the reader. Awareness of format.
• Coherence and cohesion: In general, coherent response though there may be some inconsistencies. Good control of simple sentences and use of simple linking devices such as ‘and’, ‘or’, ‘so’, ‘but’ and ‘because’. Complex sentence forms attempted, but they tend to be less accurate than simple sentences. Information presented with some organisation.
• Lexis: In general, appropriate and adequate range for the task. Few minor errors, which do not reduce communication and are mainly due to inattention or risk taking.
• Grammatical structures: Sufficient range of structures for the task. Few minor errors, which do not reduce communication and are mainly due to inattention or risk taking.

Band 4
• Features of bands 3 and 5. Little effort by reader required.

Band 3
• Content: Only 2/3 of the content elements dealt with: message only partly communicated to the reader; and / or not all content elements dealt with successfully: message requires some effort on behalf of the reader. Format only partly respected.
• Coherence and cohesion: Short simple sentences which are simply listed rather than connected and presented as a text.
• Lexis: Limited range which is minimally adequate for the task and which may lead to repetition. Limited control of spelling and / or word formation.
• Grammatical structures: Limited range of structures. Frequent grammatical errors and faulty punctuation, which cause difficulty in terms of communication.

Band 2
• Features of bands 1 and 3. Serious effort by reader required.

Band 1
• Content: Only about 1/3 of the content elements dealt with and / or hardly any content elements dealt with successfully: message hardly communicated; message requires excessive effort by the reader. No awareness of format.
• Coherence and cohesion: Response is seriously incoherent. Hardly any control of organisational features even at sentence level.
• Lexis: Extremely limited range. Hardly any control of spelling and word formation.
• Grammatical structures: Extremely limited range of structures. Essentially no control of structures and punctuation.

Band 0
• No rateable language. Totally incomprehensible. Totally irrelevant.

Appendix 13: Free writing tasks for summative test II,1: instructions

8. Free writing (9 marks)
Choose either A or B and write about 80-100 words (on your answer sheet).
A. ‘Next week’s horoscope’
You are a horoscope writer for a magazine. Your article for next week is almost finished, but you still have to write the last horoscope (for Pisces). In your text, make predictions and give advice (= Ratschläge / conseils) about:
• love and friendship
• money and finances
• work and/or studies
You can add any other important elements that you can think of.
OR:
B. ‘The greatest summer camp ever!’
You are the manager of a summer camp. You want more people to come to your camp, so you write a publicity brochure about it. In your text, describe:
• the things that the participants will see and do at the camp;
• the buildings, rooms and services that you offer;
• any other reasons why people should come to your camp.
Remember: You want as many people as possible to come to your camp. So make your text convincing (= überzeugend) and original!

Appendix 14: Sample student productions and assessment sheets from Test II,1¹
[Scanned productions and assessment sheets: ‘Horoscope’ task – Student A (student 18) and Student B (student 13); ‘Summer camp’ task – Student C (student 17) and Student D (student 20).]
¹ The text boxes have been inserted to guarantee the learners’ anonymity. Student numbers are identical to those used in appendices 5 and 10.

Appendix 15: Final results of writing performances in summative test II,1¹
[Bar charts: ‘Writing Test II,1: Final Marks’ – overall marks (out of 9) per student, and separate charts of the bands (0–5) reached for Content, Coherence and cohesion, Lexis and Grammatical structures.]
¹ Student numbers refer to the same individual students as in the general result graphs for both speaking tests. Students 8+9 had already left the class prior to this test. Student 23 had not yet joined, while student 21 was absent on the day the test was written. Therefore, these students have not been included in these graphs.

Appendix 16: Alternative free writing tasks – informal letters and emails

1. Informal letter writing
Ex.7: Free writing – a letter to a friend (10 marks)
You are spending a year in another country at the moment. Write a letter of 80-100 words to your best friend at home. Use a new page on your answer sheet. Say:
• where you are and why;
• what you have done there until now;
• your best/worst experiences (= Erfahrungen) in the foreign country. (What happened? When?)
Use the correct style for an informal letter.
Put your address (in the foreign country), your friend’s address and the date into the right places on the page.

2. Informal email writing
You have received the following email from an English-speaking exchange student. Reply with an informal email and answer all his questions in 100-120 words.

Hi there, my name’s Eddie and I’m a student from Miami, Florida. Next month I’m coming to Luxembourg as an exchange student. I’ll have to go to your school for three months, so that’s why I’m writing to you!
My cousin did the same exchange programme last year. She really liked your school, but she says that the food in your canteen was horrible when she was there. Do you agree? What kinds of food did they use to serve there? What was it like last year?
Now, my cousin has heard that your school canteen has been completely redone. So tell me, what has changed? Has the food become better?
Over here in the States, most students go to the canteen every day. Do many students go to your canteen, or do they go to any other places for lunch? What do you usually do at midday?
Please write to me soon. I’ll write again next week with more questions. Thanks for your time!
See you, Eddie.

Appendix 17: Alternative free writing tasks – story writing

5. Free writing: ‘An extraordinary night…’ (10 marks)
On your answer sheet, write an ending for ONE of the following stories.
A. Last summer, my parents were away on holiday. One night, some of my best friends came over to my house. During the first two hours, we all had great fun: we ordered pizzas, put on some music, and we talked a lot. But suddenly, just as …
B. I am a student in London. One time, I went to the cinema with a good friend from university. But when I returned, I saw that the front door of my flat was slightly open. I could see that a light was on inside …
In your story, describe:
• what you and other people (for example your friends) did (be creative!);
• how you felt.
Your story must have about 80-100 words and a clear, coherent structure (→ paragraphs, linking words…).