Finding and Defining Pertinent and Practicable Ways of Assessing Language Learners’ Productive Skills at ‘A2’ Level

How viable marking grids can be established for the competence-based assessment of pre-intermediate students’ speaking and writing skills, and how they could be sensibly integrated into the Luxembourg ELT curriculum.

Michel Fandel
Candidat au Lycée Technique Michel Lucius
Luxembourg (2010)

Plagiarism statement

I hereby certify that all material contained in this travail de candidature is my own work. I have not plagiarised from any source, including printed material and the internet. This work has not previously been published or submitted for assessment at any institution. All direct quotation appears in inverted commas or indented paragraphs and all source material, whether directly or indirectly quoted, is clearly acknowledged in the references as well as in the bibliography.

Michel Fandel
Abstract

Due to their direct and ultimately decisive impact on final grades that decide whether or
not individual students pass their school year (or successfully complete their learning
cycle), summative tests inherently influence the learners’ chances to progress through the
various stages of our school system. It is thus of paramount importance that these
instruments provide a theoretically sound and fundamentally reliable basis on which the
teacher can reach adequately informed judgments about a student’s true level of
achievement and competence. Making sure that both the content of summative tests (i.e.
what is checked and assessed) and their form (how they verify knowledge and skills) live
up to these standards is just as challenging as it is crucial. Especially in a profoundly
changing national education system that is becoming ever more focused on, and reliant
upon, competence-based methods of language teaching, it is clear that long-standing
practices in the field of summative testing need to be reconsidered and adapted as well.
One of the first objectives of this thesis therefore consists in identifying common
fallacies that have affected predominant testing and assessment schemes in the
Luxembourg school system for many years. After outlining salient theoretical
cornerstones that must be at the root of appropriate testing and assessment procedures,
Chapter 1, in particular, analyses problematic elements in the ‘traditional’ ways of
approaching speaking and writing in summative tests.
In the search for more suitable alternatives, Chapter 2 chiefly focuses on the
enormous potential offered by the Council of Europe’s Common European Framework of
Reference for Languages as a basis for a competence-oriented teaching and assessment
scheme, though not without highlighting some of the contentious elements of this
groundbreaking document in the process. The third and fourth chapters, through detailed
descriptions and analyses of practical examples, then illustrate how competence-based
tests and assessments can be implemented and integrated into daily teaching practice in
relation to each of the two productive skills.
The concluding Chapter 5 not only stresses the beneficial effects of these new
forms of assessment, but also outlines the challenges that still lie ahead before the
Luxembourg ELT curriculum becomes a maximally effective and coherent framework
for competence-based teaching, testing and assessment, one that spans and unites all its
different levels.
Table of contents

Plagiarism statement
Abstract
Table of contents

Chapter 1: Analysing ‘traditional’ ways of assessing productive skills at pre-intermediate level
  1.1. Tests and assessment: theoretical considerations
    1.1.1. ‘Test’ versus ‘assessment’
    1.1.2. Different types of assessment
  1.2. Factors defining a ‘good’ test
    1.2.1. Validity
    1.2.2. Reliability
    1.2.3. Feasibility
    1.2.4. Authenticity
  1.3. Problem zones in ‘traditional’ ways of assessing productive skills at pre-intermediate level

Chapter 2: Towards a different, competence-based assessment method of speaking and writing
  2.1. The Common European Framework as foundation for competence-based assessment
    2.1.1. Why choose the CEFR as a basis for teaching and assessment?
    2.1.2. The need for caution: political and economic concerns
    2.1.3. The need for caution: pedagogic concerns
    2.1.4. Reasons for optimism: chances offered by the CEFR
  2.2. Challenges of introducing competence-based assessment
    2.2.1. The achievement versus proficiency conundrum in summative tests
    2.2.2. Contentious aspects of CEFR descriptors and scales
  2.3. The CEFR and the Luxembourg school system: possible uses and necessary adaptations

Chapter 3: Competence-based ways of assessing speaking at A2 level
  3.1. Central features of interest to the assessment of speaking
    3.1.1. Features shared by both productive skills
    3.1.2. Features specific to speaking
  3.2. Case study 1: speaking about people, family, likes and dislikes
    3.2.1. Class description
    3.2.2. Laying the groundwork: classroom activities leading up to the test
    3.2.3. Speaking test 1: description of test items and tasks
    3.2.4. Practical setup and test procedure
    3.2.5. Form, strategy and theoretical implications of assessment used
    3.2.6. Analysis of test outcomes
  3.3. Case study 2: comparing places and lifestyles / asking for and giving directions
    3.3.1. Speaking test 2: description of test items and tasks
    3.3.2. Form, strategy and theoretical implications of assessment used
    3.3.3. Analysis of test outcomes

Chapter 4: Competence-based ways of assessing writing at A2 level
  4.1. Reasons for change: a practical example
    4.1.1. Description of the implemented test task
    4.1.2. “Traditional” method of assessment used
  4.2. Key features of a ‘good’ free writing task
  4.3. Central features of interest to the assessment of writing
  4.4. Defining an appropriate assessment scheme for writing tasks
  4.5. Case study: using a marking grid to assess written productions in a summative test
    4.5.1. Description of test tasks
    4.5.2. Description of test conditions
    4.5.3. ‘Horoscope’ task: analysis and assessment of student performances
    4.5.4. ‘Summer camp’ task: analysis and assessment of student performances
    4.5.5. Outcomes of the applied assessment procedure and general comments
  4.6. Alternative types of writing tasks
    4.6.1. Informal letters and emails
    4.6.2. Story writing

Chapter 5: Conclusions and future outlook
  5.1. The impact of a competence-based approach on test design and assessment
    5.1.1. Effects on validity
    5.1.2. Effects on reliability
    5.1.3. Feasibility of the explored testing and assessment systems
  5.2. The Luxembourg ELT curriculum, the CEFR and competence-based assessment: perspectives

Bibliography
List of appendices
Chapter 1: Analysing ‘traditional’ ways of assessing productive skills at pre-intermediate level

Few elements in the field of evaluation can be as treacherous as the numerical marks that
sum up a student’s level of performance in an end-of-term report. By virtue of their
apparent clarity and specificity, numerical values tend to assume a definitive authority
that all too often remains unquestioned and absolute. As a result, such numbers represent
a student’s learning achievements over the course of a term in a tidy and seemingly
objective form which can easily tempt teachers, students and parents alike into drawing
overly general conclusions about a particular learner’s target language proficiency and
overall progress.
As any devoted teacher will testify, however, assessing a student’s writing or
speaking performances involves a far more complex and delicate process than a single
summative value can ever represent, no matter whether it is expressed by means of
percentage points, broad categories (such as A-F or 1-6) or, as in the very peculiar case of
the Luxembourg secondary school system, within the framework of a 60-mark scheme.
Thus, as H. Douglas Brown puts it, ‘[g]rades and scores reduce a mountain of linguistic
and cognitive performance data to an absurd molehill.’ 1 From the moment when a
summative test is designed until a final grade or mark is awarded, every single decision
can ultimately have a major impact on the teacher’s interpretation and valuation of a
student’s product. Yet since a numerical mark invariably concludes the summative
assessment process, it is all the more important that the reasoning which has led to that
conclusion is based on theoretically and practically sound test items and tasks, assessment
tools and strategies.
In this context, it is certainly necessary to analyse to what extent the ‘traditional’
ways of assessing the productive skills may not always have been built on sufficiently
solid foundations, and to identify the weaknesses and problem zones that have stubbornly
persisted in them up to this point. Before this is possible, however, a number of key
theoretical concepts and considerations that inevitably underpin the complex procedures
of testing and assessment need to be highlighted and defined.

1 H. Douglas Brown, Teaching by Principles, An Interactive Approach to Language Pedagogy (3rd ed.), Pearson Longman (New York: 2007), p.452.
1.1. Tests and assessment: theoretical considerations

1.1.1. ‘Test’ versus ‘assessment’

The often interconnected notions of ‘tests’ and ‘assessment’ are unquestionably
central components in most, if not all, language courses. However, they are also
‘frequently misunderstood terms’ 2 which can easily be confused with each other; as a
result, it is essential to clarify their respective functions and scopes. According to Brown,
a test is a ‘method of measuring a person’s ability [i.e. competences and/or skills] or
knowledge in a given domain, with an emphasis on the concepts of method and
measuring.’ In that sense, tests constitute ‘instruments that are (usually) carefully
designed and that have identifiable scoring rubrics.’ 3 Importantly, they are normally held
at fairly regular intervals, particularly in the case of so-called achievement tests. Students
are aware of their importance and implications, and they can usually prepare for them in
advance. As a consequence,
tests are prepared administrative procedures that occupy identifiable time periods
in a curriculum when learners muster all their faculties to offer peak
performance, knowing that their responses are being measured and evaluated. 4
In that sense, Brown argues, tests can be seen as ‘subsets’ of the much wider and
extremely multifaceted concept of assessment. This view is supported by the Council of
Europe’s Common European Framework of Reference for Languages (CEFR), which states that ‘all
language tests are a form of assessment, but there are also many forms of assessment …
which would not be described as tests.’ 5 In fact, this process affects almost all elements
and activities in the language classroom. Virtually any spoken or written sample of
language produced by a student prompts an implicit (or indeed explicit) judgment on the
part of the teacher, who thus spontaneously gauges the demonstrated level of ability even
in the absence of a genuine test situation. Ultimately, this implies that ‘a good teacher
never ceases to assess students, whether those assessments are incidental or intentional.’ 6

2 Ibid., p.444.
3 Ibid., p.445. Italics are the author’s.
4 Ibid., p.445.
5 Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment, Cambridge University Press (Cambridge: 2001), p.177. All subsequent references to this text are to this edition. The abbreviation CEFR is used throughout the remainder of the text (except where the alternative abbreviation CEF is used in quotations by other authors).
The type of feedback provided to the student evidently changes accordingly. Apart
from the limited number of occasions when a teacher’s judgment of a language sample
coincides with the correction and marking of a classroom test, the result certainly need
not be a numerical score or grade at all (which, in Brown’s terms, would constitute a type
of formal assessment). To name but a few informal alternatives, an assessment may just
as well lead to such varied teacher responses as verbal praise, probing questions to elicit
further elaboration, or even a simple nod of the head to confirm a correct or otherwise
useful answer. However, even in the comparatively ‘high-stakes’ domain of summative
tests, a number of very diverse approaches to assessment can be adopted.
1.1.2. Different types of assessment

According to the CEFR, summative assessment ‘sums up attainment at the end of
the course with a grade’ (p.186). Its central purpose consists in allowing the teacher to
‘evaluate an overall aspect of the learner’s knowledge in order to summarize the
situation.’ 7 In very broad terms, summative assessment thus focuses on a particular
product that illustrates the student’s achievement of a specific set of learning objectives,
even though it seems crucial to point out that the reference to ‘the end of the course’ in
the CEFR definition is misleading. In fact, in the context of multiple fixed-point class
tests (“devoirs en classe”), whose exact number is usually predetermined by an official
syllabus in the Luxembourg school system, summative assessment actually occurs much
more frequently. After all, such tests generally provide closure to particular learning
sequences at various successive points (rather than merely at the end) of the school year.
In contrast, formative assessment is ‘an ongoing process of gathering information
on the extent of learning, on strengths and weaknesses, which the teacher can feed back
into their course planning and the actual feedback they give learners.’ (CEFR, p.186) It is
thus centred on the student’s learning process, which it aims to both analyse and support
via constructive feedback rather than an isolated summative value. As Penny Ur puts it,
‘its main purpose is to ‘form’: to enhance, not conclude, a process’, which, according to
the same author, ‘summative evaluation may contribute little or nothing to.’ 8
6 Brown, op.cit., p.445.
7 Penny Ur, A Course in Language Teaching: Practice and Theory, Cambridge University Press (Cambridge: 2006), p.244.
8 Ibid., p.244.
Whether or not the feedback provided by summative assessments may contain
formative elements and thus also support the student’s future learning process is in fact a
contentious issue which will be discussed in more detail in later chapters. However, even
within this clear focus on a learner’s ‘attainments’ at precise points in time, summative
assessment certainly encompasses a variety of different approaches and types of tests.
In this respect, one key distinction opposes what the CEFR defines as ‘the
assessment of the achievement of specific objectives – assessment of what has been
taught’ and, on the other hand, ‘proficiency assessment’, which focuses on ‘what
someone can do [or] knows in relation to the application of the subject in the real world’.
Achievement assessment thus ‘relates to the week’s/term’s work, the course book, the
syllabus’ (p.183) whereas proficiency assessment is a broader form of judgment with the
potential of covering a much wider spectrum of linguistic skills and competences. As a
result, the CEFR states that the ‘advantage of an achievement approach is that it is close
to the learner’s experience’, especially in a school-based context. In contrast, one of the
major strengths of a ‘proficiency approach’ resides in the fact that ‘it helps everyone to
see where they stand’ because, in that case, ‘results are transparent’ (pp.183-184).
This central difference has important repercussions on the form, purpose and
respective benefits of the tests that are correspondingly administered to language learners.
Achievement (or progress) tests, for instance, are ‘related directly to classroom lessons
[or] units’, thus ‘limited to particular material covered … within a particular time frame’
and deliberately ‘offered after a course has covered the objectives in question’ 9 . For
understandable reasons, such tests have traditionally occupied a predominant place in
language assessment in the Luxembourg school system. They provide a practical means
for teachers to split up the material to be tackled over the course of an entire year into
smaller, usually topic- or grammar-oriented chunks. As a result, both the size and scope
of the corresponding tests can be significantly reduced while allowing teachers ‘to
determine acquisition of [very specific] course objectives at the end of a period of
instruction’ 10 .
On the other hand, the possible content of achievement tests is for the same reasons
also fairly limited. As Harmer points out, such tests
only work if they contain item types which the students are familiar with. This
does not mean that in a reading test, for example, we give them texts they have
seen before, but it does mean providing them with similar texts and familiar task
types. If students are faced with completely new material, the test will not
measure the learning that has been taking place […]. 11

9 Brown, op.cit., p.454.
10 Ibid., p.454.
Due to their explicit focus on previously covered language elements, achievement tests
are thus a very useful tool to identify whether specific concepts have been internalised to
a sufficient extent. Brown rightly suggests that they can also, as a corollary, ‘serve as
indicators of features that a student needs to work on in the future’, even if this is not their
‘primary role’ 12 .
However, if the ‘aim in a test is to tap global competence in a language’ 13 , then
only proficiency tests can ‘give a general picture of a student’s knowledge and
ability’ 14 . As they aim to establish an overview of a student’s overall strengths and
weaknesses in the target language, these tests are generally ‘not intended to be limited to
any one course, curriculum, or single skill’ 15 . Instead, the learners have to tackle a variety
of tasks that usually require them – at different points of the test – to access each of the
four basic language skills. As a result, proficiency tests paint a composite profile of a
particular language learner whilst highlighting language skills and competences that have
already been (or, in contrast, still need to be) developed. As famous examples of
standardised proficiency tests, Brown and Harmer mention the TOEFL (Test of English
as a Foreign Language) and IELTS (International English Language Testing System)
examinations, which often involve especially high stakes for the language learner; the
results in such tests may for example decide whether he or she can attend a particular
university course. Given the current increasing focus on competence-based assessment in
Luxembourg, proficiency assessment is quickly gaining more immediate and constant
importance in the local and everyday context of our school system as well. 16
11 Jeremy Harmer, The Practice of English Language Teaching, Pearson Longman (Harlow, England: 2006).
12 Brown, op.cit., p.454.
13 Ibid., p.453.
14 Harmer, op.cit., p.321. Emphasis added.
15 Brown, op.cit., p.453.
16 In addition to achievement and proficiency tests, there are two other test types in particular which are commonly used in educational contexts and described in test theory:
• diagnostic tests, which aim at pinpointing specific (remaining) learner difficulties in order to adapt subsequent learning objectives accordingly;
• placement tests, which seek to ‘place a student into an appropriate level or section of a language curriculum at school’ (Brown, op.cit., p.454).
As both of these test types are used in more exceptional circumstances (and pursue more specialised purposes) than the scope of a common summative test allows, they are not treated in further detail in this thesis.

A particular challenge that lies ahead in this respect is highlighted by a further
crucial distinction presented in the CEFR: the difference between performance
assessment and knowledge assessment. The former ‘requires the learner to provide a
sample of language in speech or writing in a direct test’; in the latter, students have to
‘answer questions which can be of a range of different item types in order to provide
evidence of their linguistic knowledge and control’ (CEFR, p.187). In other words, a
performance assessment asks the learner to directly produce entire stretches of language
himself (for example in an oral interview), while a knowledge assessment would require
more indirect proof of what the student knows about the language through the adequate
selection of more discrete, separate items (for instance in a gap-filling or multiple-choice
exercise). Both types of assessment may contribute to determining a student’s overall
proficiency in the target language. However, one clear strength of performance
assessment certainly consists in the more varied, extensive and direct language samples
that it is based on in comparison to the insights gained from knowledge-based test tasks.
Nevertheless, if one exclusively considers the results of one isolated test rather than a number
of successive efforts over the course of a substantial period of time, it is crucial to bear in
mind that a single learner performance can never be more than indicative of his actual
language competences.
Whereas the above-mentioned aims and purposes of various assessment types
already decisively affect the form and content of the tests that learners are confronted
with, important differences also exist in the ways in which the respective results are most
commonly interpreted. A first frequently used approach consists in norm-referencing,
where the main objective is the ‘placement of learners in rank order’; the test-takers are
compared with each other: ‘their assessment and ranking in relation to their peers’ is of
central importance (CEFR, p.184). In such a context, the quality of a learner’s
performance in a particular test is deliberately viewed against other productions in that
class. If strategies of differentiated learning are adopted, norm-referencing may also
involve subjecting ‘stronger’ students to different, more complex test items or task types
than ‘weaker’ pupils.
In stark contrast, criterion-referencing is focused on the performance (and
possibly the traceable development) of a single learner in reference to a specifically
developed set of performance standards. Instead of comparing an individual student’s
efforts against those of his classmates, ‘the learner is assessed purely in terms of his/her
ability in the subject’ (CEFR, p.184). This approach evidently presupposes that the
criteria which the learner’s performance is measured against are clearly defined as well as
theoretically and empirically proven to be adequate for the learner’s level. In terms of
purely describing a student’s proficiency, one may argue that the criterion-referenced
approach permits more precise and intricate characterisation of learner strengths and
weaknesses than a norm-referenced one.
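To make this contrast more tangible, a minimal illustrative sketch (in Python) is given below. It is not taken from the thesis itself: the class scores, the rank ordering and the criterion thresholds are entirely hypothetical, and merely show how one and the same set of 60-mark results can be read in a norm-referenced and in a criterion-referenced way.

    # Hypothetical 60-mark results; names and numbers are invented for illustration.
    scores = {"Student A": 48, "Student B": 39, "Student C": 52, "Student D": 39, "Student E": 31}

    # Norm-referenced reading: learners are placed in rank order relative to their peers.
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for position, (name, score) in enumerate(ranking, start=1):
        print(f"{position}. {name} ({score}/60)")

    # Criterion-referenced reading: each learner is judged against fixed descriptors,
    # independently of how the rest of the class performed. The thresholds below are
    # invented stand-ins for properly validated performance standards.
    criteria = [
        (50, "meets all of the targeted descriptors"),
        (40, "meets most of the targeted descriptors"),
        (30, "meets some of the targeted descriptors"),
        (0, "does not yet meet the targeted descriptors"),
    ]

    def criterion_judgement(score):
        for threshold, descriptor in criteria:
            if score >= threshold:
                return descriptor

    for name, score in scores.items():
        print(f"{name}: {criterion_judgement(score)}")

Under norm-referencing, what matters is each learner’s position relative to classmates; under criterion-referencing, only the descriptor band into which an individual performance falls counts, regardless of how the rest of the class performed.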
1.2. Factors defining a ‘good’ test

While the previous section focused on defining and contrasting a number of key concepts
in the theory of testing and assessment, it is now time to analyse a few features which all
tests must exhibit in order to be considered appropriate and theoretically sound measuring tools.
Two particular features invariably emerge as central factors deciding whether a test can
be considered as an adequate basis for assessment: validity and reliability. In addition,
feasibility (or practicality) and authenticity constitute key elements in contemporary
language tests. Exploring the most salient characteristics of each element will lead to
further insight into the theoretical soundness – or issues – of traditional testing and
assessment methods in the Luxembourg school system.
1.2.1. Validity

Validity is often regarded as the most important but also ‘by far the most
complex’ 17 criterion when it comes to the theoretical legitimacy of a particular test. In
very general terms, it can be defined as ‘the degree to which the test actually measures
what it is intended to measure.’ 18 However, this rather broad definition does not
necessarily reflect the multifaceted nature and implications of this notion; for that reason,
it is normally broken down into a number of more detailed components that allow for a
more focused analysis. For the sake of conciseness, however, the following theoretical
exploration will remain limited to three particular areas of test validity that are most
frequently cited as salient in educational research 19 .
• Content validity, in general terms, requires any measuring instrument to ‘show that it
fairly and comprehensively covers the domain or items that it purports to cover.’ 20
More specifically, both achievement and proficiency tests must thus ‘actually sample
the subject matter about which conclusions are to be drawn’, which in turn implies
that the assessment tool ‘requires the test-taker to perform the behavio[u]r that is
being measured’ 21 . For instance, if the teacher’s aim consists in assessing the
students’ writing skills in the target language, the corresponding test task cannot
merely be a multiple-choice exercise since the actually produced behaviour (i.e.
ticking a box to confirm comprehension) would then not offer any evidence of actual
writing skills. Correspondingly, no meaningful or legitimate inferences could be
drawn about the students’ veritable proficiency in writing because the task content
would be invalid for that purpose.

17 Brown, op.cit., p.448.
18 Ibid., p.448.
19 For a more exhaustive list of validity aspects, see for example Louis Cohen, Lawrence Manion & Keith Morrison, Research Methods in Education, Routledge (London / New York: 2007), p.133.
20 Cohen et al., op.cit., p.137. Emphasis added.
• Face validity is closely connected to this first concept. However, it shifts the focus
from the test content itself to the test-takers’ interpretation of the tasks they are being
asked to fulfil. In fact, face validity is granted if learners feel that what they are asked
to do in the test is relevant, reasonable and a fair reflection of what they can actually
(and justifiably) expect to be assessed on. By looking at the “face” of the test, they get
the impression that it truly allows them to show what they are capable of in the target
language according to their progress up to that point. Indeed, most learners would feel
unfairly treated if they were suddenly subjected to a task that seemingly had nothing
to do with the learning objectives, subject matter and task types previously
encountered. The psychological link between these first two types of validity is
summed up by Brown as follows:
Face validity is almost always perceived in terms of content: If the test samples the
actual content of what the learner has achieved or expects to achieve, then face
validity will be perceived. 22
• Construct validity, in contrast to the aforementioned aspects, is a concept that is
even more firmly rooted in theory than in practice. As Cohen, Manion and
Morrison put it, ‘a construct is an abstract; this separates it from the previous types of
validity which dealt in actualities – defined content.’ 23 Construct validity is
fundamentally concerned with the conceptual purpose and relevance of tests and their
constituent items, as well as the theoretical conclusions that they permit to draw.
Brown describes it in the following terms:
One way to look at construct validity is to ask the question “Does this test actually
tap into the theoretical construct as it has been defined?” “Proficiency” is a
construct. “Communicative competence” is a construct. […] Tests are, in a manner
of speaking, operational definitions of such constructs in that they operationalize
the entity that is being measured. 24

21 Brown, op.cit., p.449. Emphasis added.
22 Ibid., p.449.
23 Cohen et al., op.cit., p.137.
For instance, if a given test has been billed as capable of measuring the test-taker’s
“proficiency” in (a given aspect of) the target language, several factors must be
respected to grant the construct validity of that test. First of all, it must have been
clearly defined – prior to the test – what the construct of “proficiency” stands for, as
well as under what circumstances it can be adequately demonstrated by a particular
learner performance. The test itself, then, needs to provide an adequate possibility to
show that “proficiency” has indeed been reached to a satisfactory degree (and as seen
above, this is far from a straightforward procedure in a single test!). As for the
construct of “communicative competence”, insufficient validity may for example be
attributed to tests which are exclusively composed of gap-filling tasks. Indeed, if one
defines this particular competence as the learner’s ability to express himself fluently
enough in the target language to communicate meaning successfully and
independently, then such an indirect test task would certainly not constitute a valid
means to prove it. In other words, such a test would not ‘operationalize’ the learner’s
‘communicative competence’ in an adequate way and thus it would represent an
invalid assessment tool for the construct it claimed to focus on.
However, while content and construct validity are often closely linked, a
significant difference between both aspects is cited by Brown via an example from the
TOEFL. Interestingly, Brown states that although this well-known proficiency test
‘does not sample oral production’, that practical choice ‘is justified by positive
correlations between oral production and the behavio[u]rs (listening, reading,
grammaticality detection, and writing) actually sampled on the TOEFL.’ 25 In pure
terms of content, the component of oral production is thus completely absent from this
particular test; there is no moment when the ‘behaviour’ of speaking is actually
demanded from the student. As a construct, however, proficiency can still be inferred
if the test-taker completes all other tasks in a satisfactory way. This strikingly
exemplifies how ‘it becomes very important for a teacher to be assured of construct
validity’ in cases ‘when there is low, or questionable content validity in a test.’ 26
Evidently, this is only possible if such correlations have been clearly and solidly
demonstrated by educational research.
24 Brown, op.cit., p.450.
25 Ibid., p.450.
26 Ibid., p.450; italics added.
Whichever path is chosen to validate a given test, it is vital to bear in mind that completely
unassailable, absolute validity is practically unattainable for any assessment instrument. Yet
if careful and systematic measures are taken in order to respect the above-mentioned criteria,
test validity can of course be considerably increased. In fact, any teacher should strive to
eliminate, as far as possible, any factors that potentially reduce test validity.
1.2.2. Reliability

Similarly to validity, reliability is a complex notion that can be positively or
negatively affected by a wide range of factors. The CEFR defines this ‘technical term’ as
‘basically the extent to which the same rank order of candidates is replicated in two
separate (real or simulated) administrations of the same assessment’ (CEFR, p.177).
Borrowing Harmer’s considerably simpler terms, this essentially means that ‘a good test
should give consistent results’ 27 . Evidently, in the context of a “one-off” classroom test
which a teacher needs to design for and administer to a particular set of students at a
specific point in time, this aspect of reliability is very difficult to verify in practice.
Indeed, how often does a teacher, already pressed for time to ensure a reasonable level of
progression, find the time to make his class (or a comparable one with sufficiently similar
proportions of ‘strong’ and ‘weak’ learners) take the same test more than once? Even
then, could one realistically expect to recreate exactly the same conditions as on the first
occasion? Clearly, in everyday circumstances, the reliability of a particular test is nearly
impossible to prove by empirical means. Yet it certainly seems desirable that any given
test should consistently and unambiguously allow the assessor to separate ‘good’
performances from ‘bad’ ones. Crucially, there are a number of guidelines and
precautions that one can try to respect so as to prevent potentially adverse effects on
reliability. The different elements that need to be taken into account are generally
associated with the following categories:
• First of all, the reliability of the test itself depends on the items that it comprises and
the general way in which it is constructed 28 . In this respect, Harmer specifies that
reliability can be ‘enhanced by making test instructions absolutely clear’ (so that their
wording does not induce errors of misinterpretation) or ‘restricting the scope for
variety in answers’ 29 . Appropriate size, complexity and context of test tasks, as well
as a sensible ‘number and type of operations and stages’ 30 that such tasks comprise,
are further elements which can contribute to increasing test reliability.

27 Harmer, op.cit., p.322.
28 Brown, op.cit., p.447.

Additionally,
overall length is important due to the fact that ‘the test may be so long, in order to
ensure coverage, that boredom and loss of concentration impair reliability’ 31 , even
though some researchers argue that in general terms ‘other things being equal, longer
tests are more reliable than shorter tests’ 32 . Interestingly, culture- and gender-related
issues can also have an effect on the reliability of test results; as Cohen et al. point
out, ‘what is comprehensible in one culture is incomprehensible in another.’
Furthermore, certain ‘questions might favour boys more than girls or vice versa’; for
example, according to the same authors,
[e]ssay questions favour boys if they concern impersonal topics and girls if they
concern personal and interpersonal topics. […] Boys perform better than girls on
multiple choice questions and girls perform better than boys on essay-type
questions […], and girls perform better in written work than boys. 33
These statistically proven tendencies underline the teacher’s need to be mindful of
different learner types and learning styles when aiming to set up a test with a high
degree of reliability.
• During the actual administration of a test, a multitude of variables related to the
physical conditions in which it takes place can hamper its overall reliability as well.
As Harmer points out, it is vital that ‘test conditions remain constant’ 34 , yet
unfortunately this is not always within the teacher’s control. As an example, noise
from outside (caused, for instance, by road or repair works) can suddenly intrude into
the classroom and make the results of an otherwise immaculate test task unreliable as
the students’ concentration spans and levels will be affected. Similar ‘situational
factors’ which Cohen et al. identify as potential obstacles are for example ‘the time of
day, the time of the school year [or] the temperature in the test room’ 35 . While the
teacher is essentially powerless to counteract most of these elements, he can still
strive to ensure stability wherever possible; for instance, tests should, whenever
possible, take place ‘in familiar settings, preferably in [the students’] own classroom
under normal school conditions’ 36 .

29 Harmer, op.cit., p.322.
30 Cohen et al., op.cit., p.161.
31 Ibid., p.161.
32 Ibid., p.159.
33 Ibid., p.161.
34 Harmer, op.cit., p.322.
35 Cohen et al., op.cit., p.160.
• However, uncontrollable influences on test performance are not just limited to the
afore-mentioned situational factors; other causes of unreliability reside within the
huge diversity of the individuals who actually take the test (student-related
reliability). Elements which can significantly differ from student to student include
‘motivation, concentration, forgetfulness, health, carelessness, guessing [and] their
related skills’, but also the test-specific effects of ‘the perceived importance of the
test, the degree of formality of the test situation [and] “examination nerves”’ 37 . While
extrinsic motivation is usually ensured by the important consequences of summative
tests on the students’ overall chances of passing their year, intrinsic motivation to do
well in a particular test is heavily linked to the ways in which learners accept the
usefulness and reasons behind its constituent tasks. As Cohen et al. put it, ‘motivation
to participate’ – and arguably to do well – ‘in test-taking sessions is strongest when
students have been helped to see its purpose’ 38 .
As mentioned above, the students’ response to situational factors also varies and
thus a calm, reassuring atmosphere in the classroom should be aimed for. The
learners’ confidence level can additionally be raised through simple and unambiguous
instructions – if they understand what they have to do, they are more likely to do it
well. Nevertheless, even if great care has been given to the clarity of instructions, one
needs to bear in mind that the reliability of the corresponding results is still likely to
be affected by the questions and items that have ultimately been chosen. According to
Cohen et al.,
the students may vary from one question to another – a student may have
performed better with a different set of questions which tested the same matter. 39
While the students’ work may thus not always give a fair reflection of their
actual knowledge in the test situation, the same applies to their overall skills (which is,
of course, particularly important for the reliability of proficiency tests). Indeed,
unreliability may result from the fact that ‘a student may be able to perform a specific
skill in a test but not be able to select or perform it in the wider context of learning.’
Vice-versa, ‘some students can perform [a given] task in everyday life but not under
test conditions’ 40 . This intricate connection between test (item) reliability and
student-related reliability underlines the need to exercise great caution in the
interpretation of test results. Because of the large number of variables involved,
wrongful (over)generalisations about learner skills based on an isolated performance
must indeed be carefully avoided, however tempting they may be.

36 Ibid., p.160.
37 Ibid., p.159. In this respect, Cohen et al. also mention the so-called ‘Hawthorne effect, wherein […] simply informing students that this is an assessment situation will be enough to disturb their performance – for the better or worse (either case not being a fair reflection of their usual abilities).’ (p.160; emphases and italics added)
38 Ibid., p.160.
39 Ibid., p.160.
• If the different students in a classroom inherently constitute a potential source of
unreliability due to the diversity that characterises them as human beings, it is not
surprising that a similar situation presents itself at the other end of the assessment
system. Indeed, the person (and personality) of the assessor can also affect the
reliability of a given test in a lot of different ways; in test theory, this is usually
referred to as scorer reliability.
Especially insofar as the finally awarded grades or marks are concerned,
divergences between the assessments of different scorers are virtually inevitable.
‘Inter-rater reliability’ is thus threatened because ‘different markers giv[e] different
marks for the same or similar pieces of work’ 41 – unfortunately a reality in most
schools which can understandably make some students feel unfairly treated. Indeed,
the low mark awarded by their own teacher might (at least in some cases) have been a
considerably higher mark in another teacher’s class. While some may argue that
scoring consistency could be increased by having all summative tests assessed by two
or more markers (a method which the Luxembourgish ‘double correction’ system in
13e and 1ère classes at least partially tries to implement), this is obviously impossible
to put into practice in all classes and levels in everyday teaching.
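As a purely illustrative aside (the thesis itself does not propose any statistical procedure at this point), the degree of agreement between two markers can at least be made visible with very little computation. The Python sketch below uses invented marks for six scripts and two simple, commonly used indicators: the average gap between the two markers and the correlation between their scores.

    from statistics import mean

    # Invented marks (out of 60) awarded by two hypothetical markers to the same six scripts.
    marker_a = [42, 35, 50, 28, 45, 38]
    marker_b = [45, 33, 52, 31, 40, 39]

    def pearson(x, y):
        """Pearson correlation: do the two markers rank the scripts in a similar order?"""
        mx, my = mean(x), mean(y)
        covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
        spread_x = sum((a - mx) ** 2 for a in x) ** 0.5
        spread_y = sum((b - my) ** 2 for b in y) ** 0.5
        return covariance / (spread_x * spread_y)

    # Average absolute disagreement, in marks, between the two assessors.
    mean_gap = mean(abs(a - b) for a, b in zip(marker_a, marker_b))

    print(f"Average disagreement: {mean_gap:.1f} marks out of 60")
    print(f"Correlation between the two markers: {pearson(marker_a, marker_b):.2f}")

A high correlation combined with a large average gap would, for instance, indicate that the two markers rank the scripts similarly but apply different degrees of severity; this is precisely the kind of inconsistency that shared, clearly defined marking criteria are meant to reduce.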
In the usual situation where a teacher is the only marker of all the tests taken by
his own students, becoming aware of – and counteracting – the various potential
inconsistencies in one’s own marking is, in Brown’s view, ‘an extremely important
issue that every teacher has to contend with’ 42 . Familiar examples include ‘being
harsh in the early stages of the marking and lenient in the later stages’ 43 , or overlooking
similar mistakes in one student’s paper but not in another’s due to stress, fatigue,
carelessness or time pressure. Teachers may also proceed in an overly subjective
manner or use ‘unclear scoring criteria’; in fact, even if the criteria are reasonably
valid in every single way, their ‘inconsistent application’ may lead to unfair ‘bias
toward “good” or “bad” students’ based on prior performances (also referred to as
the “Halo” effect) 44 . Such scorer-related problems further underline the manifold
challenges a teacher faces when trying to ensure reliability at all stages of the
assessment process; not only can external factors have a negative impact, but so can
one’s own subjective tendencies, lack of rigour and discipline.

40 Ibid., pp.160-161.
41 Ibid., p.159.
42 Brown, op.cit., p.447.
43 Cohen et al., op.cit., p.159.
Interestingly, students and teachers also affect each other’s performances over
the course of an assessment procedure in different ways. While separate reliability
issues are already inherent to both groups, their relationship in the classroom leads to
another set of problematic factors. For instance, Cohen et al. stipulate that during a
test situation ‘students respond to such characteristics of the evaluator as the person’s
sex, age and personality’; they also ‘respond to the tester in terms of what he/she
expects of them’ 45 . In other words, rather than providing evidence of their veritable
range of abilities, students risk engaging in a “guessing game” instead, artificially
altering their actual answer style and content in anticipation of perceived teacher
expectations. On the other hand, teachers can also (albeit often unintentionally) be
guided by the overall impression they have of their students, rather than focusing on
their performances in a test alone. Aside from the already mentioned ‘Halo’ effect,
‘marking practices are not always reliable’ because ‘markers may be being too
generous, marking by effort and ability rather than performance.’ 46
Of course, as in the case of validity-affecting issues, it is certainly impossible to
completely prevent all different types of reliability-threatening behaviour and factors
at once. However, one of the main aims of this thesis will be to explore important
counteractive measures which teachers might take in this respect on a more regular
basis.
44 Brown, op.cit., p.448.
45 Cohen et al., op.cit., p.160.
46 Ibid., p.161.
1.2.3. Feasibility

As seen so far, the principles of validity and reliability have a significant and
extremely multifaceted impact on the theoretical soundness of any ‘good’ test and
assessment. Due to the numerous constraints encountered in daily teaching practice,
however, teachers also need to ensure that the instrument they choose to implement
respects the criterion of feasibility (also referred to as practicality). According to Brown,
this means that a ‘good test … is within the means of financial limitations, time
constraints, ease of administration, and scoring and interpretation’ 47 .
Whereas financial aspects usually play a subordinate role in classroom testing, the
other factors Brown mentions all constitute familiar and predominant concerns for most
teachers. Keeping in mind that a normal school lesson in Luxembourg is limited to 50
minutes, timing is indeed a decisive element for practitioners when it comes to choosing
the types and contents of tasks to include when setting up a test. As a result, concessions
may for example have to be made with regard to the number of ‘open-ended’ questions (requiring
longer, more complex answers) that can be incorporated in a 60-mark test, especially if the students
have to complete it within a single lesson. Instead, multiple-choice or gap-filling
exercises may be preferred to a certain extent because of their time-saving nature rather
than their cognitive requirements and content validity. Indeed, less time is necessary for
the students to complete such tasks and the duration consecrated to assessment is in turn
reduced as well. Restricting possible answers to a low, fixed number of discrete items
also certainly increases the ‘ease of administration’. Evidently, though, weighing up
content validity and feasibility factors is often tricky, and sensible compromises have to
be found in most cases. Yet doing so is an absolutely crucial matter, as ultimately neither
element can be sacrificed excessively without affecting the test in a negative way.
Practicality is also important for the form(s) of assessment the teacher uses after the
test has been administered. To cite but one example, the CEFR states that ‘feasibility is
particularly an issue with performance testing’ because in that case ‘assessors operate
under time pressure. They are only seeing a limited sample of performance and there are
definite limits to the type and number of categories they can handle as criteria’ (CEFR,
p.178). This example already clearly underlines the importance of selecting a “feasible”
method of judging a particular performance; the teacher, as assessor, must avoid being
overwhelmed by an excessive mass of input – and of marking criteria to assess it.
47 Brown, op.cit., p.446.
Carefully selecting the most salient and practical factors in relation to the task at hand is
vital in order to maximise the efficiency and relevance of the assessment method,
particularly given the often considerable number of tests that teachers have to mark
within a limited time span. I will return to more precise and detailed considerations about
these aspects of criterion-referenced assessments in subsequent chapters.
1.2.4. Authenticity

According to Brown, another element which can make a test significantly ‘better’ is
a high degree of authenticity. While he concedes that this is ‘a concept that is a little
slippery to define, especially within the art and science of evaluating and designing
tests’ 48 , it certainly affects the constituent tasks and items of any test. The issue here is
that, particularly in the past, ‘unconnected, boring, contrived items were accepted as a
necessary by-product of testing’. Yet even nowadays ‘many item types fail to simulate
real-world tasks’ because they are excessively ‘contrived or artificial in their attempt to
target a grammatical form or lexical item’ 49 , betraying a lingering influence of pedagogic
principles from the grammar-translation approach. However, such a de-contextualised
approach must be avoided as much as possible if tests are to mirror real-life contexts,
allowing learners to truly demonstrate communicative competence and prove their use of
language in “authentic” situations.
Brown argues that both the language and context of test items should thus be
‘natural’ and connected rather than take the form of ‘isolated’ sentences or chunks which
are neither linked to each other, nor to any specific real-world situation. Furthermore,
‘topics and situations’ should, as far as possible, be ‘interesting, enjoyable’ and perhaps
even ‘humorous’; in that way, the learner is more likely to engage with the tasks
voluntarily (i.e. intrinsic motivation is raised). Further enhancement of authenticity can
then be reached if ‘some thematic organization is provided, such as through a story line or
episode’ and a task-based approach rather than a grammar-centred one is pursued. 50 Of
course, feasibility issues might dictate to what extent a teacher is able to implement this in
each single classroom test in all of his classes. In comparison to the vast resources and
possibilities of professional test designers and internationally renowned examination
bodies, an individual practitioner will indeed find sources of “authentic” and varied
material harder to come by. Nevertheless, striving to adapt test items and tasks more
closely to real-life situations and contexts rather than opting for “the easy way out” with
mere lists of isolated, de-contextualised sentences lies well within the reach of any
teacher, and can certainly be pursued on a much more regular basis.

48 Ibid., p.450.
49 Ibid., p.451.
50 Ibid., p.451.
1.3. Problem zones in ‘traditional’ ways of assessing productive skills at pre-intermediate level

In light of the theoretical and practical considerations seen above, two major questions
emerge: which contentious issues affect the methods and principles that have dominated
summative testing and assessment procedures in Luxembourg for many years? And which
alterations does a shift towards competence-based assessment imply?
One of the most striking elements of summative tests used by the majority of local
teachers up to this point is their perceptible over-reliance on writing tasks and exercises.
Even though the importance of developing all four skills across the various language
levels has been highlighted by the English curriculum in both ES and EST systems for
years, apparent reluctance exists when it comes to including, for instance, listening or
speaking tasks in summative tests on a regular basis, or even at all. 51 Of course, this
does not imply that the remaining three skills have not been catered for (hence developed)
in classroom activities during the term. However, a long-standing discrepancy in favour
of writing tasks in regular summative tests cannot be denied. As seen above, feasibility
issues might in fact play a large part in this; for example, implementing listening
exercises in classroom tests tends to be fairly time-consuming and implies additional
preparations and requirements of a technical nature (i.e. making sure that there is a CD or
MP3 player available; time elapses while it is being set up…). Even bigger concerns
certainly arise in relation to the systematic assessment of each student’s speaking skills,
considering that most classes consist of twenty or more pupils; as a result, many teachers
tend to shy away from thorough, individual oral tests due to their obviously time-consuming nature. Nevertheless, in a competence-based assessment scheme seeking to
clearly attest the progress made by individual students in relation to all four skills, an
excessive focus on written samples (in all summative tests of a term or year) does not
seem sustainable anymore. For listening, speaking and arguably even reading skills, most
‘traditional’ tests would indeed carry insufficient content, face and construct validity, as
they would fail to provide enough (if any) data for an adequate judgment of overall
proficiency or skills development.

51 This attitude was not least exemplified by the vivid resistance that the proposed reconsideration of the weighting of the four skills encountered in both Commissions Nationales des Programmes in 2009-10. While the weighting of writing was preliminarily reduced to 40% of a term’s assessment (the three other skills receiving a weighting of 20% each), protests from many corners stubbornly persisted and repeatedly resurfaced in the sessions of both commissions, arguing that 40% was an insufficient valuation of writing skills which needed to be re-adjusted upwards again.
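Purely by way of illustration, the re-weighting mentioned in footnote 51 (40% for writing, 20% each for listening, speaking and reading) could be applied to a learner’s term result on the 60-mark scale as in the short Python sketch below. The individual skill marks are invented, and the thesis does not prescribe this particular way of aggregating them.

    # Hypothetical skill marks for one learner, each expressed out of 60.
    skill_marks = {"writing": 42, "listening": 50, "speaking": 36, "reading": 45}

    # Weighting discussed in footnote 51: writing 40%, each of the other three skills 20%.
    weights = {"writing": 0.4, "listening": 0.2, "speaking": 0.2, "reading": 0.2}

    term_mark = sum(skill_marks[skill] * weights[skill] for skill in skill_marks)
    print(f"Weighted term mark: {term_mark:.1f} / 60")  # 42*0.4 + 50*0.2 + 36*0.2 + 45*0.2 = 43.0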
Moreover, the current system clearly disadvantages students who may be good and
confident speakers of English, yet have problems with the more technical aspects of
writing such as orthographic accuracy; the one-dimensionality of ‘traditional’, exclusively
written tests does not allow them to demonstrate their biggest strengths. A similar case
can certainly be made for those students who have no major trouble understanding
written or spoken input (i.e. who are good at reading and listening), but struggle to show
comparable strengths in terms of language production. As a result, summative tests that
only focus on writing ultimately disregard the existence of a wide variety of learner
types and profiles in the classroom, favouring only one particular group of students
instead.
However, it is not only the lack of focus on multiple skills that recurs as a problem; the
actual content (and corresponding validity) of a number of commonly used ‘writing’
tasks can be contentious as well. Undoubtedly as a remnant of the grammar-translation
approach, many ‘traditional’ summative tests, particularly at ‘elementary’ and ‘pre-intermediate’ levels, tend to contain a disproportionate amount of grammar-focused
exercises. To make matters worse, the latter frequently lack cohesive real-life contexts
and thereby ignore the beneficial effects of (semi-)authenticity. A particularly popular
example consists in ‘gap-filling’ exercises to verify the use of tenses or other discrete
language items; similarly, vocabulary knowledge might simply be verified by way of
simple word-for-word translations. In themselves, such tasks may be perfectly valid for
restricted purposes: for instance, if one wants to check the students’ usage of a discrete
grammatical item, gap-filling exercises may make it possible to assess attainment of a specific
(though admittedly narrow and purely grammar-based) learning objective. As a part of an
achievement test that focuses on knowledge rather than performance, such tasks thus
certainly make sense.
Yet a problem arises if the same exercises are used as presumed evidence of a learner’s
overall proficiency in writing. If students are not asked to do more than fill in single
words or slightly modify sample sentences throughout an entire test, one cannot consider
the accordingly limited amount of produced language as an example of writing
performance. Perhaps symptomatically, I found out at the beginning of the 2009-10
school year that none of my 10e students (47 learners supposedly entering ‘intermediate’
or ‘B1’ level) had ever been subjected to a genuine ‘free writing’ task in an English test
before; throughout the first two years of learning the language, their success in summative
tests (and consequently their school year) had thus mostly depended on their adequate
completion of discrete-item tasks about grammar and vocabulary. In most ‘traditional’
summative tests, the only exception often consists in comprehension questions about set
texts treated in class prior to the test, which arguably lead to longer and hence more
“productive” answers. However, even the students’ answers to these questions can
actually not be interpreted as purely tapping into their writing skills, given the strong
dependence of such answers on the predefined content and lexis items of the texts. In fact,
one might even argue that such supposed ‘reading comprehension’ questions do not
truly verify reading skills either, as the students’ responses are inspired more by their
ability to memorise and recall specific details than by the direct application of reading
skills.
The necessity to reconsider traditional methods of testing productive skills at ‘pre-intermediate’ level thus already becomes clear with respect to writing tasks. The case is
arguably even more extreme when it comes to the systematic sampling of speaking
performance. As previously suggested, speaking skills have often been completely
neglected in summative testing due to feasibility issues. However, aside from the high
amount of time needed for an in-depth assessment of each individual’s oral productions,
some teachers may also have justified this choice by pointing to the students’ relatively
limited speaking abilities at this early stage of their language learning. Not infrequently,
this has led to the belief that low-level learners are simply incapable of providing
sufficiently extensive performance samples for a valid and reliable judgment of their
speaking skills to be based on. As a result, speaking tasks have simply been left out of
summative tests entirely. It seems clear that in a competence-based learning environment,
such attitudes need to be changed if both productive skills are to be assessed in more
equal measure than has been the case so far. Sensible ways of regularly collecting and
assessing spoken samples (in addition to written ones) must therefore be found as quickly
as possible.
Whereas a number of problem areas thus exist in the ‘traditional’ selection of items
and tasks for summative tests, some frequently used assessment procedures present
several potential weaknesses as well. Up to this point, one of the major shortcomings in
the evaluation of written and spoken productions has been the absence of absolutely
clear and unified assessment criteria for both productive skills. As a result, the
assessment of free writing often remains very subjective; holistic marking based on the
teacher’s overall impression usually prevails, even though basic distinctions between
form and content are normally used to break down the assessment into two broad scoring
factors. For instance, one frequent practice consists in basing half of the overall mark
for a written sample on its content (i.e. the presence, complexity and cohesion of key
ideas and concepts) while the other half of the mark is determined by the linguistic
correctness of the student’s answer (resulting from an overall impression of the student’s
grammatical and lexical accuracy, vocabulary range, spelling…). Alternative weighting
may consist in a slight emphasis on content (e.g. 2/3 of the final mark) over form (e.g.
1/3). On the one hand, the results arrived at by this method may not always differ
significantly from those of an approach that splits up and rates a given performance
according to more diverse and specifically defined (sub)criteria. It can also be argued that
a certain degree of subjectivity is not always negative in assessment, particularly as we
should, after all, be able to put a reasonable amount of trust into the professional
judgment of a trained and experienced teacher.
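To make the arithmetic of such a two-factor weighting concrete, the following short sketch is offered purely as an illustration, not as an officially prescribed formula; the function name and the sample sub-scores are hypothetical, and the 60-mark maximum simply mirrors the marks-based scheme referred to later in this chapter.

def holistic_mark(content_score: float, form_score: float,
                  content_weight: float = 2/3, max_mark: int = 60) -> float:
    """Combine a content impression and a form impression (each a fraction
    between 0 and 1) into one final mark. The default 2/3 vs. 1/3 split mirrors
    the 'emphasis on content' variant described above; passing
    content_weight=0.5 yields the half-and-half scheme. The 60-mark maximum
    is an assumption used for illustration only."""
    weighted = content_weight * content_score + (1 - content_weight) * form_score
    return round(weighted * max_mark, 1)

# e.g. convincing ideas (80%) but weaker linguistic accuracy (50%):
print(holistic_mark(0.8, 0.5))  # 42.0 out of 60

Such a calculation merely makes the chosen weighting explicit; it obviously does nothing to remove the subjectivity involved in arriving at the two underlying impressions in the first place.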
Nevertheless, with holistically attributed marks, scorer reliability issues will
inevitably remain; one can simply not get away from the fact that ‘hard’ and ‘soft’
markers may score the same sample of written performance in completely dissimilar ways.
This can evidently have disadvantageous consequences for pupils: their final mark does
not solely depend on the quality of their performance, but also on the person who teaches
them in a given year. Additionally, as seen above, even one and the same marker may reach
unreliable results in relation to various students’ performances in a particular class test,
especially when assessing samples of writing without having solid, unchanging guidelines
to fall back on. In this context, a norm-referenced approach might help to establish a
“qualitative hierarchy” between better and less convincing efforts. However, in some
circumstances this might lead to a perfectly satisfactory answer being marked down only
due to its perceived inferiority to an exceptionally good performance by another student
who actually exceeds expectations for a given level. Rather than rewarding the positive
elements of the first answer, such scoring behaviour would ultimately penalise remaining
weaknesses that might be perfectly normal for a student at ‘pre-intermediate’ level.
This type of reasoning highlights another typical flaw of ‘traditional’ assessment
strategies: especially in free writing tasks, maximum marks are normally not awarded at
all or only in very exceptional cases of excellent and virtually mistake-free work. Yet is it
fair to make the highest possible score only available to students who have actually
already surpassed the level of proficiency that is realistically to be expected? Does it not
point to a rather negative assessment culture that is more focused on punishing mistakes
than on reinforcing positive steps and perceived progress?
A final problem of ‘traditional’ assessment procedures fittingly affects the
concluding step to each summative assessment: the feedback that teachers give to
students (and/or their parents) based on their performance in a particular test. An assessor
will inevitably find it more difficult to justify a given mark if they cannot underpin it by
referring to precise and transparent assessment criteria. Yet the unavailability of such
clearly defined criteria and universally expected standards of performance has been a
constant problem up to this point; providing such a solid “backbone” to assessments
could certainly reduce problems with scorer (especially inter-rater) reliability.
Simultaneously, the assessment process could become more transparent for all parties
involved. In that respect, criterion-referenced assessment can contribute to creating a
“level playing field” for all students, which an exclusively norm-referenced approach
would make impossible. As a corollary, however, a competence-based system also
implies that summative classroom tests may not always stay confined to achievement
assessment; particularly in relation to productive skills, establishing the proficiency levels
reached by individual students is bound to take on increased significance in the proposed
new assessment scheme.
In the next chapters of this thesis, I will explore possible ways of designing and
implementing such new competence-based forms of assessment. However, as an in-depth
analysis of all four skills would exceed the scope of this study, I will exclusively
concentrate on the two productive skills of speaking and writing; it is in this domain that
students produce the most immediate and extensive evidence of the progress they have
made in their language learning. Given the aforementioned doubtful validity of writing
tasks and complete omission of speaking elements in ‘traditional’ summative testing, this
thesis seeks to pave the way towards a different approach that allows for a more informed
and accurate picture of the student’s various achievements in terms of target language
production.
Furthermore, the exclusive focus on the ‘pre-intermediate’ level of proficiency is
neither random nor simply due to reasons of conciseness. In fact, the study intends to
question and contest the seemingly widespread assumption that productive skills
(particularly “free” writing and speaking) cannot be extensively tested at this early stage
of language learning (commonly resulting in an over-insistence on drilling grammatical
basics instead). Moreover, the thesis is being written at a time when the shift towards
competence-based teaching is still at an early stage in the Luxembourg secondary school
system. As it is essential to build on solid foundations, it is both logical and necessary to
design and implement corresponding methods of assessment at the lowest levels of the
proficiency scale first. This thesis aims to actively contribute to that development by
finding and implementing such competence-based ways of testing and assessment in the
classroom, as well as exploring to what extent this may enhance the respective validity,
reliability and feasibility of these procedures.
Chapter 2: Towards a different, competence-based assessment method of speaking and writing.
As a result of the present government’s large-scale move towards competence-based
forms of assessment, groundbreaking changes to the ‘traditional’ ways of designing,
implementing and marking classroom tests are inevitably in store for the teachers (and,
consequently, for the demands placed on learners) in the Luxembourg school system. The explicit
focus on the students’ multiple and varied language skills implies that writing tasks can
no longer be the only component of regular summative tests. To give a more complete
account of their capacities, learners must also demonstrate that they can communicate
orally as well as understand spoken and written input in the target language to a
satisfactory degree. Even with respect to writing skills, test strategies and items need to be
adapted so as to veritably tap into competences instead of merely demanding a
mechanical, de-contextualised application of discrete grammatical structures and lexical
items (for instance). Yet how can one reliably collect and interpret samples of
performance that trace tangible achievements in terms of skills rather than knowledge?
What precisely does this complex yet often only vaguely understood notion of
‘competence-based assessment’ actually encompass? And how can one truly certify
that a student has developed a particular language ‘competence’ to a satisfactory
degree?
Tackling these crucial questions in a theoretically founded and convincing way is
paramount if the ongoing drastic overhaul of the approach to assessment is to stand any
chance of being legitimate and, ultimately, successful. It comes as no surprise, then, that
rather than engaging in the monumental and extremely risky task of devising an
independent system “from scratch”, Luxembourg – like many other European countries –
is currently seeking to align its education system with one of the most heralded
breakthroughs in recent linguistic research: the CEFR. In a first step, it is thus necessary
to analyse what exactly this Common European Framework represents and what benefits
and dangers it contains as a foundation for a competence-based assessment scheme.
2.1. The Common European Framework as foundation for competence-based assessment.
2.1.1. Why choose the CEFR as a basis for teaching and assessment?
Ever since the first public release of the CEFR in 2001, several of its core elements
have been adopted (and in many cases adapted) by various international coursebook
writers, curriculum designers and examination bodies. This is most notably the case for
the six main proficiency levels (A1-C2) that it defines, as well as its multitude of detailed
and skills-specific descriptor scales. As a consequence of this rapid and widespread
influence and success, numerous teachers (and indeed most people connected to language
education in general) would undoubtedly agree with Julia Keddle’s claim that the CEFR
is fast becoming ‘an essential tool for the 21st century’ 1 in the field of language learning
and assessment. In short, as Alderson puts it, ‘nobody engaged in language education in
Europe can ignore the existence of the CEFR.’ 2
The reasons for this are manifold. For instance, one may argue that a first major
strength of the CEFR inherently consists in the highly ambitious goal it has set out to
achieve:
In documenting the many competences which language users deploy in
communication, and in defining different levels of performance in these
competences, the authors of the Framework have made explicit the true
complexity of the task that confronts learners – and teachers – of a language. 3
In doing so, the CEFR authors approach and define the notion of competence from
various angles. While ‘competences’ are first of all broadly described as ‘the sum of
knowledge, skills and characteristics that allow a person to perform actions’ (CEFR, p.9),
an important distinction is then made between ‘general’ and ‘communicative language
competences’. ‘General competences’, on the one hand, are ‘not specific to language,
but…called upon for actions of all kinds, including language activities’; they are built on
the four main pillars of ‘knowledge (savoir),…skills and know-how (savoir-faire),…
existential competence (savoir-être)’ as well as the ‘ability to learn (savoir-apprendre)’
(pp.9-11). Of even more direct relevance to language courses, the concept of
‘communicative language competence’ is then defined in terms of ‘linguistic,
sociolinguistic and pragmatic competences’, all of which need to be ‘activated in the
1
Julia Starr Keddle, ‘The CEF and the secondary school syllabus’ in Keith Morrow (ed.), Insights from the
Common European Framework, Oxford University Press (Oxford: 2004), p.43.
2
J. Charles Alderson, ‘The CEFR and the Need for More Research’ in The Modern Language Journal, 91,
iv (2007), p.660.
3
Keith Morrow, ‘Background to the CEF’ in Morrow, op.cit., p.6.
performance of…various language activities’ (pp.13-14). In other words, the CEFR’s
strongly action-oriented focus becomes clearly visible from the outset; the various levels
of language proficiency are not just considered in terms of knowledge about the language,
but via a description of how linguistic resources are actively and effectively put to use in
different socio-cultural contexts and situations 4 .
In general, making the different processes and competences involved in language
learning (and communication as a whole) more transparent is certainly a crucial
requirement for any pertinent school-based language course as well. After all, as Morrow
further states,
learners, teachers, and ministers of education want to know that decisions about
language teaching are based on a full account of the competences that need to be
developed. For the first time, [due to the CEFR,] such an account is now
available. 5
In the specific context of Luxembourg, the pre-existence of such a detailed inquiry into
the multiple facets of language learning is an undeniable asset. Since competence-based
teaching constitutes a distinctly new and thus largely unexplored path in our education
system, the inevitable task of defining a relevant, valid and workable framework seems a
daunting one indeed. Encouragingly, the CEFR not only represents a possible, detailed
alternative for this purpose; through its inherent status as a European project, it also
intrinsically offers an opportunity to make the certification of our students’ achievements
more easily comparable to international standards, and thus more widely recognisable
(and more willingly recognised) beyond our local borders. The exclusively marks-based
system thus far used in Luxembourg has recurrently presented more problems in that
respect, not least due to its singular reliance on a 60-mark scheme that is not generally
found (and therefore not always adequately interpreted) in other European countries.
In contrast, the six Common Reference Levels defined by the CEFR (reaching from
‘A1’ or ‘Breakthrough’ stage to the ‘C2’ or ‘Mastery’ level) present a number of very
useful features. For example, as Heyworth points out, they are already ‘being used as the
reference for setting objectives, for assessment and for certification in many contexts
around Europe’ 6 , facilitating comparisons of different national systems and their
respective outcomes in the process. This is partly due to the fact that the traditionally used
4
In this paragraph, emphasis has only been added to the words ‘action’ and ‘performance’. Otherwise,
italics and emphases are the CEFR authors’.
5
Ibid., p.7.
6
Frank Heyworth, ‘Why the CEF is important’ in Morrow, op.cit., p.17.
words ‘intermediate’ and ‘advanced’ [for instance] are vague terms which tend
to be interpreted very differently in different contexts. 7
The Common Reference Levels, on the other hand, have been painstakingly defined and
extensively underpinned by nuanced and level-specific proficiency descriptors. As a
result, the ‘scale from A1 to C2 makes level statements more transparent’; hence, it is
‘increasingly being used to describe and compare levels both within and across
languages’ 8 .
2.1.2. The need for caution: political and economic concerns
To a certain extent, however, caution should be maintained in this context. After all,
it would be negligent for any sensible analysis of the CEFR to overlook the underlying
political dimension that partially explains its current, speedy propagation across the
European continent. As critics like Glenn Fulcher astutely point out, a significant reason
why ‘language education in Europe is lurching toward a harmonised standard-based
model’ resides in ‘the interests of so-called competitiveness in global markets’ 9 . In vying
to serve the ‘primary economic needs of a [European] superstate, competing for its place
in the global economy’ 10 , the harmonisation of education systems in individual member
states clearly provides a crucial stepping-stone. A common denominator like the CEFR
can most certainly play an integral role in this unifying effort, even if its authors did not
produce it with that particular design in mind 11 . Ultimately, it is important to realise that
‘aligning our teaching and assessment practices to the CEFR is not an ideologically or
politically neutral act’ 12 . The improvement of intercultural understanding and of the
quality of local language education systems may very well be real and reasonable grounds
for seeking such an alignment, and will hopefully emerge as some of its most positive
consequences in the future. Nevertheless, it would evidently be naïve to suppose that the
CEFR would have enjoyed the same instant success if, in addition to its cultural and
educational benefits, it did not promise to be so useful in the wider context of European
politics and economics.
7
Ibid., p.17.
8
Ibid., p.17.
9
Glenn Fulcher, ‘Testing times ahead?’ in Liaison Magazine, Issue 1 (July 2008), p.20. Accessible at
http://www.llas.ac.uk/news/newsletter.html.
10
Ibid., p.23.
11
As Morrow (art.cit., p.3) crucially points out, the ‘Council of Europe has no connection with the
European Union’ and ‘while its work is ‘political’ in the broadest sense – aiming to safeguard human rights,
democracy, and the rule of law in member countries – an important part of what it does is essentially
cultural in nature rather than narrowly political or economic.’
12
Fulcher (2008), p.23.
While Fulcher does recognise that the ‘adoption [of the Framework even] beyond
Europe testifies to its usefulness in centralised language policy’, he also warns that the
ensuing ‘pressure to adopt the CEFR’ can easily lead to undesirable consequences.
Essentially, the CEFR risks becoming a ‘tool of proscription, […] made effective through
institutional recognition of only those programmes that are “CEFR-aligned”’ 13 . In theory,
of course, requiring compliance with a universally accepted standard need not be an
inherently negative prerequisite; as seen above, such harmonisation certainly has the
potential of markedly increasing the comparability of results. In practice, on the other
hand, it is precisely this intended development which can cause a number of problems,
particularly if its implementation is rushed and forced:
For the users of language tests, the danger is that any test that does not report
scores in terms of CEF levels will be seen as “invalid” and hence not
“recognised”. […] For many producers of tests, the danger lies in the desire to
claim a link between scores on their tests and what those scores mean in terms of
CEF levels, simply to get “recognition” within Europe. 14
The main concern is that the establishment of a valid ‘link’ between particular tests and
the CEFR levels is far from a straightforward procedure. While a preliminary Manual 15
has been published with guidelines for this exact purpose, the validation process remains
a complex and difficult one. As Alderson points out in this context, it does not help that
‘the Council of Europe has refused to set up an equivalent mechanism to validate or even
inspect the claims made by examination providers or textbook developers’ (a decision
which the same author emphatically slams as ‘a serious dereliction of professional
duty’) 16 . Unsurprisingly, if they remain largely left to themselves, test designers (and, in
the same vein, coursebook writers and curriculum developers for entire national education
systems) can easily become tempted to skip or abbreviate some of the necessary
meticulous steps for validation. This is particularly the case if they find themselves under
time pressure to produce “CEFR-compatible” material, be it for economic or political
purposes. At its worst, Fulcher states, ‘the “linking” is’ in those cases ‘mostly intuitive’ 17 .
Contrary to the intended outcome, different tests superficially ‘claiming’ a link to a same
13
Ibid., p.21.
14
Glenn Fulcher, ‘Are Europe’s tests being built on an ‘unsafe’ framework?’, The Guardian Weekly (18
March 2004), accessible at http://www.guardian.co.uk/education/2004/mar/18/tefl2. Emphasis and italics
added.
15
Council of Europe, Manual for relating Language Examinations to the Common European Framework of
Reference for Languages, January 2009. Accessible at
http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp#TopOfPage.
16
Alderson, art.cit., pp.661-662.
17
Fulcher (2004).
CEFR level may thus actually provide a much more unreliable basis for comparison than
their apparent commonalities may initially suggest.
Hence, if European states are vying to integrate the CEFR into the teaching and
assessment strategies of their national education systems to increase the comparability of
their results across borders, they need to be highly alert to the aforementioned dangers of
excessively precipitating this adaptation. Clearly, the speed with which changes are
executed should always remain subordinated to the importance of careful planning and
thorough inquiry into their theoretical soundness. Considering the fundamental and all-encompassing way in which our national school system is currently being reshaped, it is
clear that in Luxembourg, too, a truly valid alignment to the CEFR can only be arrived at
if rash, “intuitive” decisions and adaptations are avoided in this crucial period of time.
2.1.3. The need for caution: pedagogic concerns
Whereas such considerations are important to keep in mind, questioning the
political and economic purposes and ramifications of the CEFR’s success in further depth
would evidently go beyond the scope of the present study. After all, this thesis is
primarily interested in exploring the pedagogic value that the competency model and
illustrative scales of the Framework offer for practitioners in the classroom. However,
even in that respect, Fulcher voices some clear doubts regarding the theoretical
foundation and validity of the ways in which the CEFR descriptors were established.
While North points to the fact that the illustrative scales of the CEFR ‘form an item bank
of empirically calibrated descriptors’ based on ‘extensive qualitative research’ 18 , Fulcher
counters by insisting on a number of problematic elements in the chosen approach:
There is a widespread, but mistaken, assumption that the scales have been
constructed on a principled analysis of language use, or a theory of second
language acquisition. However, the descriptors were drawn from existing scales
in many testing systems around the world, and placed within the CEFR scales
because teacher judgments of their level could be scaled using multi-faceted
Rasch. […]
The selection of descriptors and scale assembly were psychometrically driven, or
… based entirely on a theory of measurement. The scaling studies used intuitive
teacher judgments as data rather than samples of performance. What we see in
the CEFR scales is therefore ‘essentially a-theoretical’. 19
18
Brian North, ‘The CEFR Illustrative Descriptor Scales’ in The Modern Language Journal, 91, iv (2007),
p.657.
19
Fulcher (2008), pp.21-22; emphasis added. Also see Fulcher (2004) for a more detailed explanation of the
different steps that were used to compile the CEFR descriptors and scales.
These are very serious issues, considering that numerous language curricula and
examinations all over Europe (including Luxembourg) are already setting (or aiming to
set) the attainment of particular CEFR proficiency levels as targeted – and ultimately
certified – learning outcomes. Evidently, the validity of the latter is very much at stake if
the CEFR scales they are built on ‘have no basis in theory or SLA research’ 20 . Problems
with construct validity inevitably arise if CEFR descriptors, which supposedly describe
‘learner proficiency’ in the target language, must in fact be more accurately defined as
being merely ‘teachers’ perceptions of language proficiency’.
North is aware of this possible caveat, but he crucially stresses that ‘at the time the
CEFR was developed, SLA research was not in a position to provide’ the ‘validated
descriptions of SLA processes’ implied in the illustrative descriptors 21 . This is echoed by
Hulstijn, who underlines that ‘the CEFR authors, in the absence of fully developed and
properly tested theories of language proficiency, had to go ahead by scaling the available
descriptors’ 22 . One may even argue that since the CEFR descriptors were selected and
compiled from various existing scales and testing systems, there is a strong chance that
the latter were, at least in part, based on reasonably safe and systematic theoretical
principles in the first place. In that sense, their blending and reorganisation into the CEFR
scales might imply that a sound theoretical basis is still indirectly present even if no one
scale or system may have been rigorously maintained. Nevertheless, it certainly seems
sensible to heed Fulcher’s warnings against a premature ‘reification’ (which he defines as
the ‘propensity to convert an abstract concept into a hard entity’) of the CEFR as a
genuine tool for attesting proficiency if it is ‘not theoretically justified’ 23 at its core.
2.1.4. Reasons for optimism: chances offered by the CEFR
Such considerations should rightly make us wary of adopting the Council of
Europe’s Framework in an overly uncritical and unselective fashion. However, this does
not mean that we ought to shy away from it completely; far from it. The key resides in
looking at the CEFR as a flexible, malleable and useful foundation instead of a complete,
dogmatic and ready-made solution. As North puts it, ‘the CEFR is a concertina-like
20
Fulcher (2008), p.22.
21
North, art.cit., p.657.
22
Jan H. Hulstijn, ‘The Shaky Ground Beneath the CEFR: Quantitative and Qualitative Dimensions of
Language Proficiency’ in The Modern Language Journal, 91, iv (2007), p.664. Italics added.
23
Fulcher (2008), p.22.
reference tool, not an instrument to be “applied”’ 24 . It is also in this sense that Fulcher
ultimately sees a truly suitable purpose for it:
It is precisely when the CEFR is merely seen as a heuristic model which may be
used at the discretion of the practitioner that it may become a useful tool in
the construction of tests or curricula. […] The purpose of a model is to act as a
source of ideas for constructs that are useful in our own context. 25
Importantly, this same view is in fact pre-emptively voiced by the authors of the
CEFR as early as in the introductory ‘notes for the user’ of their document. From the
outset, they insist that their findings are by no means intended to be prescriptive; they
expressly stress that they ‘have NOT set out to tell practitioners what to do, or how to do
it’ (p. xi). Adaptation of the CEFR’s content to specific local or pedagogic contexts is
consistently encouraged by the writers, which further underlines that the document ‘is not
dogmatic about objectives, about syllabus design, or about classroom methodology’ 26 . At
numerous points throughout the CEFR, the user is directly challenged to actively pick his
own preferred strategies or approaches from a number of alternatives that are
meticulously and factually described rather than dogmatically imposed 27 . Incidentally, as
Heyworth notes,
[a]n interesting implication of the presentation of options in this way is that it
pre-supposes teachers who are responsible, autonomous individuals capable
of making informed choices, and acting upon them. 28
On the surface, then, the authors have gone to great lengths to pursue and
demonstrate a descriptive approach in their Framework, overtly promoting autonomous
reflection and decision-making on the part of the user. Nonetheless, there are of course
some foci that recurrently emerge as particularly prominent in the CEFR’s descriptive
scheme. For instance, it is clearly perceptible that
the emphasis throughout the CEF is on how languages are used and what
learners/users can do with the language – on language being action-based, not
knowledge-based. 29
24
North, art.cit., p.656.
25
Fulcher (2008), p.22.
26
Morrow, art.cit., p.8.
27
The sections in question are clearly recognisable boxes in which ‘users of the Framework’ are explicitly
encouraged ‘to consider and where appropriate state’ what their preferred approach to key issues is; biased
or prescriptive suggestions are markedly omitted. An interesting example is provided in the ‘Errors and
mistakes’ section, where different (traditional as well as innovative) attitudes and strategies to error
correction are offered to the teacher, who then needs to decide what the best and most efficient alternative
may be in order to achieve the most productive and lasting effect on his students’ language learning.
(CEFR, pp.155-156)
28
Heyworth, art.cit., p.20. Emphasis added.
29
Heyworth, art.cit., p.14. Emphasis added.
Such an approach, as Alderson attests, is ‘in large part … based on ideas from the Anglo-American world of communicative language teaching and applied linguistics’ 30, and it
definitely carries some intrinsic benefits. For instance, Keddle points out that the ‘focus
on learner language’ goes hand in hand with a welcome ‘move away from mechanical
grammar work’. In fact, the ‘renewed focus [of the CEFR] on situational and functional
language, and on the strategies students need in the four skills’ not only implies increased
reliance on ‘communicative language practice’ 31 as well as on the nurturing of language
(and learning) skills in the classroom. The insistence on strategy development also
provides a clear indication that the CEFR regards the fostering of learner autonomy as a
highly desirable process.
This consistent tendency is most obvious in the CEFR’s recommendation of self-assessment through learner-centred ‘can do’ descriptors and the systematic use of the
European Language Portfolio (ELP). Unfortunately, specific applications of this particular
tool (which, in its root form, predominantly focuses on an independent and
arguably adult language learner rather than a school-based teenage one) cannot be
discussed in further detail in this thesis as this would significantly transcend the
framework of classroom-based summative testing 32 . Nevertheless, the fundamental
structure and purpose of the ELP already exemplify two key characteristics that underline
the CEFR’s potential to trigger a veritable paradigm shift in language teaching and
testing. On the one hand, by recording his own performances and results in different parts
of the portfolio, the learner himself is clearly in the centre of both the learning and
assessment processes. Through the numerous and skills-specific ‘I can…’ descriptors in
the ELP, the student acquaints himself in much more depth with the various competences
needed to become a proficient foreign language user than he would be able to do in a
more traditional, teacher-centred classroom environment. At the same time, the learner’s
heightened metacognitive awareness contributes to making him trace and realize the
progress he has gradually made (and continues to make) as a language learner in a much
more conscious and engaged fashion.
While the aforementioned promising effects arguably find their greatest
development in the context of portfolio-based work (hence the recommended use of the
ELP), they can also be attributed to the CEFR’s descriptive scales and their underlying
30
Alderson, art.cit., p.660.
31
Keddle, art.cit., p.43.
32
For a particularly interesting account of how the ELP has been put to use, see for example Angela
Hasselgreen et al., Bergen ‘Can Do’ project, Council of Europe (Strasbourg: 2003).
concept in general. As suggested above, these scales allow us to break down the
competences involved in language learning in a very nuanced way, making it easier to
distinguish between particular areas where mastery has been achieved to a specific
degree, and those which still need to be developed further. Simultaneously, progress is
rigorously traced in exclusively positive terms. As rendered perfectly evident by their
deliberate phrasing, the designed purpose of the CEFR descriptors consists in identifying
what the learner can rather than cannot do. An interesting and symptomatic example is
provided by the general descriptor for the ‘grammatical accuracy’ of a typical A2 learner:
Uses some simple structures coherently, but still systematically makes mistakes
– for example tends to mix up tenses and forget to mark agreement; nevertheless,
it is usually clear what he/she is trying to say. (CEFR, p.114)
In comparison to more traditional assessment methods which focus heavily on
grammar correction even in free writing, one notices that the descriptor does not deny the
presence of remaining imperfections in the learner’s performance. Yet instead of
dismissing the entire performance as insufficiently proficient (due, for example, to the
wrong use of tenses or a missing 3rd-person ‘–s’ in the present simple), it also recognises
the overall, successful communicative effect of the learner’s performance (i.e. it is ‘clear
what he/she is trying to say’ in spite of the still limited linguistic means at his/her
disposal). It is precisely this approach which can pave the way towards a much more
positive assessment culture overall: instead of penalising the students’ weaknesses, the
CEFR, as a framework for teaching and assessment, allows us to single out the elements
that the learners already do well. Keddle rightly notes that this points to significant
changes for teachers, who all too often ‘still tend to measure student performance against
a native-speaker, error-free absolute, even at beginner levels’, even if they are actually
‘aware of theories of language acquisition’ that prohibit precisely such expectations of
“instant perfection”. In turn, this promises a fairer chance for students, ‘accustomed as
they are to being marked down for ‘mistakes’’. Indeed, particularly at lower levels such
as A2, the CEFR descriptors incite a radical change of approach in that they actually
‘allow an ‘imperfect’ performance to be appropriate for someone of that level, rather than
being perceived as failure’ 33 .
At the root of this more positive attitude to remaining shortcomings is an awareness
of the central concept of the learner’s interlanguage. Brown defines this notion as ‘a
33
Keddle, art.cit., pp.45-46. One may add, of course, that even a ‘native-speaker’ performance might not
always be ‘error-free’.
system that has a structurally intermediate status between the native and target
languages’. As learners progressively construct an understanding of the foreign language,
they develop ‘their own self-contained linguistic systems’ 34 consisting of approximations
and, initially, flawed interpretations of the target language systems they are trying to
acquire. In other words, one needs to be aware that second language learners (at lower
levels in particular) do not simply amalgamate all the chunks of language that they have
been exposed to and then correctly connect them right away (and continue to do so in a
consistent, permanent manner afterwards). Instead, their progressively developing
interlanguage represents ‘a system based upon the best attempt of learners to bring order
and structure into the stimuli surrounding them’ 35 . Remaining errors are perfectly normal
at this stage, and should thus be expected when learner performances in the target
language are assessed.
Admittedly, the CEFR still lacks a consistent, SLA-theory-based explanation about
the exact steps that are necessary for the progression from a low level of proficiency to a
higher one (e.g. A2 to B1), as well as a systematic inquiry into how and when such a
progression is achieved by the learner – or facilitated by the teacher – in pedagogical
terms. In this respect, Weir questions the CEFR’s ‘theory-based validity’, since it ‘does
not currently offer a view of how language develops across these proficiency levels in
terms of cognitive or metacognitive processing’ 36 . However, taken in isolation, the
detailed descriptions for each of the six levels provide a remarkably thorough insight into
the characteristics of the various stages of proficiency that a language learner can achieve
(even if, of course, not every learner can or will always reach C2!). Within any particular
level, the CEFR descriptors certainly have the merit of drawing attention to the types of
shortcomings that still have to be expected. Yet imperfection is not represented as an
automatically negative flaw to be extinguished immediately and unrelentingly; in the
classroom, such an excessively accuracy-focused stance would understandably deter
many students from engaging in a more spontaneous and adventurous experimentation
with the target language as a result. Instead, occasional flaws are included as an
intrinsically normal and natural feature of the complex process of language learning.
34
H. Douglas Brown, Principles of Language Learning and Teaching (5th ed.), Longman/Pearson (New
York: 2007), p.256.
35
Ibid., p.256.
36
Cyril J. Weir, ‘Limitations of the Common European Framework for developing comparable
examinations and tests’ in Language Testing, 22 (2005), p.282/p.285. Accessible at
http://ltj.sagepub.com/cgi/content/abstract/22/3/281
In this context, a further strength of the Framework is its explicit distinction
between errors and mistakes,
errors being examples of the learner’s interlanguage, which demonstrate her/his
present level of competence, whereas mistakes occur when learners, like native
speakers sometimes, do not bring their knowledge and competence into their
performance – i.e. they know the correct version, but produce something which
is wrong. 37
This crucial difference forces the assessor to be cautious and nuanced in his interpretation
of incorrect elements in student productions. When it comes to determining the level of
competence that a particular learner has attained, error analysis can, and should, play a
vital role. Rather than simply counting the number of incorrect verb endings or wrongly
used tenses to reach an overall verdict on a specific performance (which, in summative
tests, is often synonymous with a final mark), careful consideration of the frequency and
nature of particular “mishaps” is needed to paint a more composite and accurate picture of
the student’s achievements. Keeping in mind the above-mentioned descriptor for
‘grammatical accuracy’, for instance, the assessor should verify whether a wrong verb
ending (such as the notoriously forgotten ‘-s’ for the third person singular in the present
simple) occurs repeatedly and systematically in a given student’s production (in which
case it represents an ‘error’ in his/her interlanguage). On the other hand, if it is only an
uncharacteristic ‘slip’ (i.e. a mistake which may appear merely once or twice) that stands
out from other, consistently correct uses of the same grammatical item, then one can
assume that the student has simply forgotten to ‘bring [his/her] knowledge and [his/her]
competence into [his/her] performance’ on this one exceptional occasion. Especially in
challenging productive tasks involving free speaking or writing, numerous and varied
linguistic elements (e.g. grammar, spelling, syntax…) need to be focused on
simultaneously. As a consequence, such ‘slips’ are most likely to happen to students in
those particularly demanding task types.
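Purely to illustrate this frequency-based line of reasoning (it is not a procedure proposed by the CEFR or by the curriculum; the verb list, the pronoun pattern and the 50% threshold below are arbitrary assumptions), one could imagine tallying a single grammatical feature in a written sample roughly as follows:

import re

def third_person_s_verdict(text: str, verbs: list[str], threshold: float = 0.5) -> str:
    """Count correct ('he walks') versus bare ('he walk') third-person forms after
    he/she/it, and decide whether a missing '-s' looks systematic (an 'error') or
    occasional (a 'slip'). All parameters here are illustrative assumptions."""
    correct = incorrect = 0
    for verb in verbs:
        correct += len(re.findall(rf"\b(?:he|she|it)\s+{verb}s\b", text, re.IGNORECASE))
        incorrect += len(re.findall(rf"\b(?:he|she|it)\s+{verb}\b", text, re.IGNORECASE))
    total = correct + incorrect
    if total == 0:
        return "no evidence in this sample"
    return "systematic error" if incorrect / total >= threshold else "occasional slip"

sample = "He like football. He plays every day, but he like his team most of all."
print(third_person_s_verdict(sample, ["like", "play"]))  # systematic error (2 of 3 contexts)

In real marking such judgments are, of course, made by the teacher while reading the script; the point of the sketch is merely that the decisive criterion is the proportion and consistency of the incorrect forms, not their mere presence.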
Distinguishing between errors and mistakes thus allows the teacher to get a clearer
view of the level of competence reached by the learner in a particular domain at a certain
point in time. It also represents a powerful diagnostic tool through which the assessor can
provide more precise feedback (or, more accurately, ‘feed-forward’) concentrating on
specific areas which the student still needs to work on in the future. At the same time, the
CEFR descriptors remind us not to get too caught up in our “search” for errors, either: the
overarching question of whether the overall communicative act has been successful or not
37
Heyworth, art.cit., p.19. Emphasis added.
should always remain in the back of the assessing teacher’s mind. After all, as Brown
reiterates,
The teacher’s task is to value learners, prize their attempts to communicate, and
then provide optimal feedback for the [interlanguage] system to evolve in
successive stages until learners are communicating meaningfully and
unambiguously in the second language. 38
If the focus of assessment expressly shifts towards the overall communicative
success of a learner’s performance, then it is of course essential that the tasks which the
students have to carry out in summative tests offer them adequate opportunities to
produce relevant and contextualised samples in the target language. In this respect,
another strength of the CEFR becomes strikingly evident, as its ‘renewed focus on
situational and functional language’ could significantly contribute to ‘bring[ing] the real
world back into the classroom’ 39 . In other words, aspirations to make learning situations,
contexts and test tasks as authentic as possible could receive a significant boost through
the importance conferred to communicative language use in the CEFR. Even if, at its
origin, the scope that the Framework covers is predominantly ‘out in the real world and
distant from the unnatural habitat of the classroom’, Keddle argues that an appropriate
application of the CEFR in a school-based context could go a long way towards
increasing the degree of authenticity of language learning tasks:
The CEF, with its references to text types (e.g. news summaries, messages,
advertisements, instructions, questionnaires, and signs), provides teachers with a
checklist they can use to incorporate genuine communicative skills/strategies
work into their teaching. Authentic text-types can be adapted to suit the interests
and age levels of students, and clear objectives can be set to fit their language
level… 40
Indeed, the descriptive scheme of the CEFR includes a myriad of lists and examples of
possible text types and linguistic tasks, demonstrating in meticulous detail where and
when each one of the core language skills is called into action within the four main
domains or ‘spheres of action’ of everyday life. Furthermore, as Heyworth stresses,
[t]he concepts of domain – personal, public, occupational, educational – and the
descriptive categories of location, institution, person, object, event, operation,
and text, provide a framework for the design of needs analysis questionnaires,
and for the definition of outcomes. 41
38
Brown, Principles of Language Learning and Teaching, p.281.
39
Keddle, art.cit., p.43.
40
Ibid., p.45.
41
Heyworth, art.cit., p.18.
Without a doubt, stimulating competence-based lessons and learning objectives can be
derived from this extensive scheme as well. In addition, however, the impact on test foci
and features can be just as significant. As Fulcher points out,
Language testing rightly prioritises purpose as the driver of test design
decisions. The context of language use is therefore critical, as it places
limitations upon the legitimate inferences that we might draw from test scores,
and restricts the range of decisions to which the score might be relevant. 42
In that sense, the functional/situational focus of the CEFR’s descriptor scales promises a
step in the right direction, even if a careful and selective application to the school
context evidently still needs to be ensured. Hence, the most essential communicative
functions and situations must be chosen and defined in relation to immediate learner
needs, and the corresponding instructional material adapted to the students’ competence
level. However, if such an action-oriented, competence-based and authenticity-promoting
framework as the CEFR is chosen as the foundation for school-based learning and
assessment, it is glaringly obvious that tests composed only of de-contextualised discrete-item tasks become obsolete once and for all.
2.2. Challenges of introducing competence-based assessment
The theoretical usefulness of the CEFR as foundation for a competence-based
teaching and assessment scheme is thus perceptible in a variety of its most defining
aspects. Yet how can this Framework be used in practice to arrive at a workable and
sound assessment system? What are the difficulties and obstacles that have to be
surmounted before we can truly affirm that our school system also develops and measures
language skills rather than exclusively knowledge?
2.2.1. The achievement versus proficiency conundrum in summative tests
One of the major challenges inherent to competence-based assessment is underlined
by the following passage in the CEFR:
Unfortunately, one can never test competences directly. All one ever has to go
on is a range of performances, from which one seeks to generalise about
proficiency. (CEFR, p.187; emphasis added)
If the aim is to certify learning outcomes with a high degree of validity and reliability,
this realisation certainly seems inconvenient at first. To what extent can we trust – or
justify – the objectivity of ‘generalisations’ that an individual assessor makes about a
42
Fulcher (2008), p.22. Emphasis added.
given learner’s apparent level of competence if the judgment is inexorably founded on
inference rather than a direct and straightforward connection between tangible evidence
and its factual significance? Moreover, if a ‘range of performances’ is needed to reach a
conclusion about the learner’s proficiency, what can a single summative test contribute to
that effect? The complicated nature of the situation is underlined further by the following
statement:
Proficiency can be seen as competence put to use. In this sense, therefore, all
tests assess only performance, though one may seek to draw inferences as to the
underlying competences from this evidence. (p.187)
As seen in the previous chapter, the traditional core function of regularly scheduled
summative tests consists, by definition, in verifying achievement of fairly narrow,
previously covered objectives rather than overall target language proficiency. Rating a
particular, isolated student performance is thus their predominant focus and purpose. Yet
if the entire school system becomes centred on the development and assessment of
competences, it seems highly desirable that summative tests should simultaneously
provide hints, at regular intervals, as to where a given student stands in relation to the
ultimately targeted competence level at a given point in time. While no single test can
illustrate the entire extent of progress that the learner has made in terms of his linguistic
competences, each one of them nevertheless constitutes one piece of the gradually
emerging “proficiency puzzle”. Considered together rather than in isolation, the student’s
performances in summative tests thus do provide a significant contribution to the overall
picture of his proficiency (and, by extension, help to assess the levels of competence
reached in regard to the different skills).
Furthermore, the general form and constituent tasks of individual summative tests
can be altered so as to present a stronger indication about proficiency in their own right.
Essentially, the more diverse the individual task types are, the more they tap into different
aspects of specific competences; hence, through the inclusion of a range of varied tasks
and requirements (even in relation to one same overarching skill such as spoken or
written production), a ‘range of performances’ may in fact be collected even within the
obvious limitations of a single test. However, it would evidently be a big mistake to
simply equate the student’s actual proficiency in the foreign language exclusively with his
performances in summative tests. Over the course of a school year, numerous other,
formatively assessed activities in the classroom elicit language productions that complete
the teacher’s overall impression of the student’s proficiency and thus the extent of
competence development. In that sense, summative tests should never be considered as a
sole and sufficient source for proficiency assessment.
This realisation also bears crucial implications if one wants to use the CEFR
descriptor scales as a general basis for assessment. Through their intrinsically
competence-oriented nature, the individual descriptors are inherently characterised by a
focus on the “bigger picture”: overall proficiency. Hence, their scope is far too extensive
to allow for a direct and sufficiently nuanced rating of a specific, isolated student
performance in a particular summative test task for instance. This is most evident in the
phrasing of the global level descriptors which, as Morrow points out, ‘like any attempt to
capture language performance in terms of language, … cannot be absolutely precise’ 43 .
Indeed, it would be difficult for a single summative test to verify in depth whether a
particular student aspiring to attain the ‘A2’ level ‘can communicate in simple and routine
tasks requiring a simple and direct exchange of information on familiar and routine
matters’ (CEFR, p.24). Evidently, one can easily imagine a sample task which would
check one particular aspect of this global competence; for example, in an oral interview,
one could simply question a student about his favourite hobbies or other daily habits. It
is very conceivable that a student in 6e or 9e might answer all questions about such a
topic in a satisfactory way following a few weeks of correspondingly topic-based
language learning (thus proving the achievement of a precise objective). However, this
one successful performance does not allow us to draw definite conclusions about the
student’s overall level of competence – such an inference would be rash and excessive
after one narrowly-focused task, and thus ultimately invalid.
This also clearly underlines that, in their original form, the descriptors in the CEFR
scales are neither intended nor ready-made to be simply ‘copy-pasted’ into marking grids
and then have precise numerical values allotted to them so that they can be used in the
summative assessment of individual test performances. The global insights which they
can offer into the competence profile of a given language learner can only be safely
attested after numerous, varied indications have been collected over a more substantial
period of time. This explains why in the present Luxembourg education system, the most
suitable place for such global CEFR descriptors is in the so-called “complément au
bulletin” awarded at the end of an entire instruction “cycle”. For example, after complete
learning cycles of one (6e) or two (8e/9e) school years, teachers have a much wider range
43
Morrow, art.cit., p.8.
of performances at their disposal to decide whether their students have attained the ‘A2’
level in English in terms of a number of salient CEFR descriptors. As already implied,
this final and global assessment evidently needs to go beyond the students’ performances
in the summative tests alone; for a truly meaningful certification of a student’s veritable
proficiency level, all the additional classroom activities conducted over that period of
time most certainly need to be taken into account as well.
2.2.2. Contentious aspects of CEFR descriptors and scales
In a school-based context, a number of other aspects of the CEFR descriptors can appear
problematic and have therefore been criticised in various quarters. For instance, if the six
main proficiency levels are to be used as the foundation for an entire language
curriculum, the phrasing of several descriptors can cause a number of concerns.
A first major problem is often seen in the rather vague choice of terminology
inherent to a number of descriptors. As far as assessment is concerned, this is particularly
problematic in that ‘the wording for some of the descriptors is not consistent or
transparent enough in places for the development of tests’ 44 . To cite but one example, a
learner is adjudged to have attained a ‘low A2’ (or ‘A2-’) level in ‘vocabulary range’ if he
‘has a sufficient vocabulary for the expression of basic communicative needs’ and ‘for
coping with simple survival needs’ (CEFR, p.112). What exact ‘communicative’ and
‘survival needs’ qualify as ‘basic’ or ‘simple’? Perhaps more importantly in a school
context, what amount of ‘vocabulary’ is deemed to be ‘sufficient’ and thus has to be
demonstrated by a student at that level? The nature of the problem is further clarified by
Weir:
The CEFR provides little assistance in identifying the breadth and depth of
productive or receptive lexis that might be needed to operate at the various
levels. Some general guidance is given on the learner’s lexical resources for
productive language use but, as Huhta et al. (2002: 131) point out, ‘no examples
of typical vocabulary or structures are included in the descriptors.’ 45
Going back to the A2 descriptor for ‘grammatical accuracy’, similar questions might
easily be raised about the exact meaning of the phrase ‘some simple structures’: indeed,
which grammatical elements are considered ‘simple’ and which ones are not? If grades
and progress in our school system are ultimately connected to such assessments of
44
Weir, art.cit., p.282.
45
Ibid., p.293. The reference and quote in this passage are to Ari Huhta et al., ‘A diagnostic language
assessment system for adult learners’ in J. Charles Alderson (ed.), Common European Framework of
Reference for Languages: learning, teaching, assessment: case studies, Council of Europe (Strasbourg:
2002), pp.130-146.
competence, the salience of these considerations can certainly not be underestimated. It
seems an unavoidable requirement for both curriculum and test writers to specify the
precise functions, grammatical items and topical lexis areas to be acquired and mastered
by the students. This definitely reinforces the claim that the CEFR has to be adapted to
particular circumstances and needs. In this context, Weir underlines the fact that the
CEFR ‘is intended to be applicable to a wide range of different languages (Huhta et al.,
2002)’; sufficiently detailed language-specific guidelines and lists are impossible to
provide in that instance. At the same time, however, ‘this offers little comfort to the test
writer who has to select texts or activities uncertain as to the lexical breadth or knowledge
required at a particular level within the CEFR’ – a problem which is further amplified by
the fact that, in the CEFR, ‘activities are seldom related to the quality of actual
performance expected to complete them (scoring validity)’ 46 .
In other words, there are no benchmark samples to be found in the Framework
which directly demonstrate the standards of performance to be expected at a given
proficiency level. Once again, such an omission is evidently understandable in view of
the plurilingual aims, purposes and overall applicability of the CEFR. Including suitable
examples in a multitude of European languages for all the identified competences and
levels would indeed have been an excruciatingly extensive (and, considering the inherent
necessity of language-, culture- and country-related adaptations, most difficult) endeavour.
As a result, a clear need emerges for curriculum developers to illustrate the
theoretical phrasing of the descriptor scales with language-specific samples of
performance rooted in practice, or, at the very least, to provide guidelines as to which
particular functions, topics as well as grammatical and lexical ranges correspond to the
chosen standards. Ideally, of course, teachers might even expect to be offered both types
of specifications from curriculum and syllabus writers.
In this respect, grammar is arguably the traditional core element of many language
courses which the CEFR most notably fails to scrutinise in extensive depth – that is, as far
as direct usability in classroom tests is concerned. Indeed, due to the generalising nature
of the grammar-related CEFR descriptors, they are largely unsuited to be put to
immediate use for the summative assessment of grammatical micro-skills within a
relatively small and thus limited sample of spoken or written production. In itself, this is
not necessarily negative; as Keddle remarks, the approach chosen in the Framework does
46 Weir, art.cit., p.293/p.282.
not aim to verify the mastery of any isolated grammatical items; instead, ‘the CEF puts
the emphasis on what you achieve with grammar’ 47 . Nevertheless, Keddle admits that
the CEFR’s rather patchy treatment of grammatical components does lead to problems
when it comes to adapting the Framework to the requirements of school-based language
teaching. In essence, the predominantly functional/situational focus of the CEFR clashes
with the more form-focused, grammar-oriented layout and exigencies of numerous
traditional school syllabi. Hence, while ‘the global descriptors fit any successful language
learning situation, including the school classroom,’ Keddle rightly points out that
presently ‘the detailed descriptors often don’t match with the grammar focus found at
school level’ as they are ‘not sufficiently linked to concept areas to provide a basis for a
teaching programme’ 48 . As an example that forcefully underlines the inconsistent
approach to grammatical components in the CEFR, Keddle notes a surprising discrepancy
in the global descriptors. While some elements such as ‘speaking about the past’ are
overtly discussed,
[t]here are grammar-based concept areas that are not covered early enough, or at
all, which could be included, e.g. ‘talking about the future’. […] The future is in
fact only indirectly referred to in the descriptor ‘make arrangements to meet’
(A2). As most teachers also knowingly cover the concept of the future it is a
shame that it is absent from the descriptors. 49
Such realisations further underline the complex double nature of the adaptation process
which is needed to unify CEFR contents and grammar-led school syllabi in a satisfactory
way. Not only is it necessary for curriculum developers to select elements already present
in the CEFR that are suitable for particular age and learner groups; at times, they may
also have to fill in remaining gaps by adding new elements that were not initially
treated in the descriptor scales at all.
Further problematic implications tied to the phrasing of the CEFR descriptors were
systematically investigated and described by Alderson et al. in a project carried out for
the Dutch Ministry of Education, Culture and Science 50 . Although, in contrast to the
present thesis, the main aim of that project concerned the receptive skills of reading and
listening, the insights offered by that research study can be applied in a general way to the
vast majority of descriptor scales in the CEFR and thus also prove valuable in relation to
47 Keddle, art.cit., p.47. Emphasis and italics added.
48 Ibid., p.44/p.49.
49 Ibid., p.49.
50 J. Charles Alderson et al., ‘Analysing Tests of Reading and Listening in Relation to the Common European Framework of Reference: The Experience of the Dutch CEFR Construct Project’ in Language Assessment Quarterly, 3, 1 (2006), pp.3-30.
the Framework’s exploration of productive skills. Alderson et al. found four main types
of problems when analysing and comparing different CEFR descriptors and levels:
1. First of all, a number of ‘inconsistencies’ appear in various descriptor scales. These
may affect different proficiency levels, for instance ‘where a feature might be
mentioned at one level but not at another’ 51 . As an illustrative example, Alderson et
al. cite concepts such as ‘speed’ or ‘standard’ which are not consistently mentioned at
all levels. Similarly, references to the operation of using a dictionary are mysteriously
absent from the lower levels, but discussed at B2 level. 52 This latter point clearly
indicates that omissions of certain terms at some levels (but not others) are not always
logical; they cannot always, for example, simply be justified by the inapplicability to
low-level learners on the grounds of such students’ limited target language resources.
If anything, the use of dictionaries might even be of more importance to A1 and A2
learners than to B2 students. Particularly in the interests of a school-based application
that aims to develop specific skills over time, it would definitely be helpful to
maintain a stricter consistency of terms and concepts across levels.
Furthermore, even at one and the same level, problems of inconsistency may
surface:
A feature may appear in one descriptor for a level, but not in another for the
same level. For example, what is the difference between specific information
(A2) and specific predictable information (A2)? 53
Less critical readers and users of the Framework might consider (and sometimes
dismiss) the insistence on such nuances as mere “semantic nitpicking”. Nevertheless,
the careful study of the various CEFR descriptor scales does reveal a number of slight
variations which could arguably have been avoided in order to facilitate the
systematic tracing of cognitive development and skills-related progress over time.
2. In close relation to the previous point, further criticism affects the CEFR writers’ use
of a variety of verbs to describe similar cognitive operations (e.g. ‘identify’ and
‘recognise’), as it frequently remains unclear whether true synonymy exists
between two different expressions. On the one hand, Alderson et al. note that the
CEFR authors might have had ‘stylistic reasons’ for this inconsistent use of
terminology; however, a more likely explanation is that the use of different verbs
actually betrays that, as seen above, ‘the can-do statements were originally derived
51 Ibid., p.9.
52 Ibid., p.10.
53 Ibid., p.10. Italics are the authors’.
from a wide range of taxonomies’ 54 . In either case, Alderson et al. chose to pursue a
higher degree of standardisation for the terminology used in the remainder of their
own project. With concerns of validity and reliability in mind, such a course of action
seems commendable indeed.
3. The problems with descriptor phrasing are further compounded by a ‘lack of
definitions’ for numerous expressions used to describe cognitive and linguistic acts as
well as text or task types. Alderson et al. draw attention to the unclear meaning of the
term ‘simple’ in numerous descriptors; as it is never clearly defined or illustrated
(either by theoretical means such as lists of grammatical or lexical items, or
empirically through sample productions), it is equally unclear what degree of quality a
language learner’s performance needs to reach to correspond to a specific CEFR
level. In addition,
[t]he same definitional problem applies to many expressions used in the CEFR
scales: for example, the most common, everyday, familiar, concrete, predictable,
straightforward, factual, complex, short, long … and doubtless other
expressions. These all need to be clarified, defined, and exemplified if items and
tasks are to be assigned to specific CEFR levels. 55
The clarification and standardisation of these terms is crucial in several ways,
particularly if CEFR descriptors are to play a central role in the assessment of
language proficiency in an entire school system. Evidently, every individual assessor
needs to be unmistakably aware of the exact criteria that a given student performance
needs to fulfil to attain a particular competence level, so that high scorer reliability is
ensured. In addition, to increase inter-scorer reliability, all the different
teachers in that system need to share a common understanding about the exact
meaning and levels of performance that correspond to such standardised terminology.
Potential ways of attaining both of these crucial prerequisites in practice will be
discussed in subsequent chapters of this thesis.
4. A final weakness that Alderson et al. identify in the CEFR lies in several
remaining ‘gaps’ 56 in the descriptor scales. Given that their project focused on
receptive skills, some of their justified concerns affect the absence of ‘a description of
the operations that comprehension consists of and a theory of how comprehension
54 Ibid., p.10.
55 Ibid., p.12. Italics are the authors’.
56 Ibid., p.12. The authors specify that they ‘considered a feature missing if it was mentioned in general terms somewhere in the CEFR text but then was not distinguished according to the six CEFR levels or was not even specified at one level.’
develops’, echoing the general lack of a rigorous SLA-theory-based background
affecting the CEFR overall. They also argue that ‘the text of the CEFR introduces
many concepts that are not then incorporated in the scales or related to the six levels
in any way’, such as, for instance, ‘competence, … activities, processes, domain,
strategy and task […]’ 57 . While this criticism is not factually incorrect, it nevertheless
appears to neglect the complementary structure of the CEFR’s two major components
(i.e. the descriptive scheme and the descriptor scales). Considering the already fairly
extensive scope and phrasing of most descriptors, as well as of the explanations
provided in the descriptive scheme (conciseness is definitely not always the CEFR’s
strongest suit), it is hard to fathom how all the concepts defined and discussed in the
descriptive scheme could additionally be applied and included in the scales as well
without making the latter excessively bulky and overloaded. Precisely such elements
as activities or domains can surely be derived from – or related to – the descriptor
scales even if they are not explicitly incorporated there. However, the regret voiced by
Alderson et al. about a missing direct link between such notions as ‘processes’ or
‘tasks’ and the six levels appears more warranted, particularly with test writers and
language teachers in mind. Especially for pedagogic purposes, it would indeed be helpful
to have a clear overview of precise types of processes and tasks that can be associated
with – and thus expected from – learners working at a given level of proficiency; in a
similar vein, such information would be of tremendous help when deriving fair
and valid test specifications and constructs from the CEFR.
2.3. The CEFR and the Luxembourg school system: possible uses and necessary adaptations
In this chapter, a close look at the Common European Framework has revealed it to be an
instrument that is promising and challenging in almost equal measure. On the one hand, it
certainly represents a most useful and long overdue catalyst for change, capable of
triggering drastic alterations to more traditional approaches to language learning, teaching
and assessment. Through its detailed and thorough catalogue of skills and competences
that any speaker of a foreign language needs to call upon, its fundamentally positive focus
on what language learners can do, and its undeniable, ever-increasing weight on an
international scale, the CEFR unites all the necessary ingredients to be considered a
57 Ibid., p.12.
legitimate and stimulating basis for a competence-based system of teaching and
assessment.
On the other hand, the controversy surrounding a number of elements in the CEFR
unmistakably shows that several adaptations are necessary before the Framework can be
used as a workable and potent tool, particularly for assessment, in the Luxembourg school
system. Its initial focus on independent, presumably adult learners must be acknowledged
and reshaped for classroom purposes, with the particular needs and interests of adolescent
pupils in mind. As far as ESL courses are concerned, one also needs to consider that
English is in fact the third foreign language that students learn in our school system.
When deriving a subject-specific alignment to the CEFR for lower-level English courses
in Luxembourg, one thus needs to take into account that students do not start developing
all the corresponding competences from the very bottom all over again. This aspect of
plurilingualism is, in fact, explicitly pointed out by the CEFR authors:
A given individual does not have a collection of distinct and separate
competences to communicate depending on the languages he/she knows, but
rather a plurilingual and pluricultural competence encompassing the full range of
the languages available to him/her. (CEFR, p.168)
Those who have learnt one language also know a great deal about many other
languages without necessarily realizing that they do. The learning of further
languages generally facilitates the activation of this knowledge and increases
awareness of it, which is a factor to be taken into account rather than proceeding
as if it did not exist. (p.70)
Ignoring this extremely useful realisation would be robbing oneself of one of the most
powerful means of encouraging smart and efficient learning: activating skills and
knowledge that have already been developed to a certain extent. As Heyworth puts it, this
could prove invaluable to help students ‘develop… strategies and skills for ‘learning to
learn languages’’:
Teachers sometimes assume that a beginner starts from scratch, but in fact most
have experiences of other languages and skills and knowledge they can apply
usefully to learning the new language. 58
At the same time, however, this creates a new set of problems when it comes to linking
CEFR levels with course objectives at school. If students have (to a certain extent)
already developed some skills and learning strategies (but not others) before beginning a
particular course in a ‘new’ language, which targets do we set them for the successful
achievement of their school year (or cycle) in that subject? In other words, if the overall
58 Heyworth, art.cit., p.15.
target level after the first learning cycle of the English curriculum is globally set at A2 (as
is currently the case in Luxembourg), does that mean that we should “only” expect A2-level performances across all (sub-)sets of competences, or are there perhaps some
elements where a minimal target requirement of B1 might be more appropriate due to the
students’ vast repertoire of prior language learning (in other foreign languages)?
Before such questions can be answered, and the attainment of particular CEFR
levels attested, it is of course necessary to devise a valid and reliable CEFR-aligned
assessment system in the first place. As seen above, this is far from a straightforward task.
The problems of inconsistency and partial incompleteness characterising the terminology
in the different descriptor scales, as documented by Alderson et al., certainly call for a
great deal of care when it comes to adapting them for local education and assessment
purposes. Moreover, their role in the overall assessment system can prove rather complex.
Due to their inextricable link to the concept of language proficiency, the CEFR
descriptors seem to be most appropriate for an end-of-cycle certification that takes into
account multiple and varied performances over a substantial period of time. No such
conclusions can be drawn if they are not underpinned by a sufficient number of regularly
collected indications about a given student’s level of competence. Hence, the purpose of
our regular summative tests will increasingly have to be geared towards a mixture of both
achievement and proficiency assessments as well – once again no simple undertaking.
Since a single summative test cannot explore the learner’s whole range of competences
and skills (i.e. his veritable level of proficiency), the corresponding assessment system for
each single test can, by definition, not solely rely on proficiency-based descriptors such as
those from the CEFR scales. In other words, the original CEFR descriptors are neither
intended nor particularly suited to simply be associated with a specific numerical mark in
the interests of reaching a final score in a classroom test. This is not to say, however, that
they cannot partially be drawn upon to assess such an isolated performance; for instance,
if a marking grid is compiled for the explicit purpose of summative assessment, various
skills-related CEFR descriptors might certainly help to characterise a range of aspects of
even an isolated student production. However, in such a context it is highly likely that
more precise, achievement-oriented elements must complement them if, aside from the
learners’ general ability to communicate meaningfully, their mastery of very specific
language elements (e.g. previously treated topical lexis or grammatical forms) is to be
verified as well.
In essence, then, the CEFR is likely to take on a dual role in our new competence-based education system. To confirm the satisfactory attainment of a targeted, global
CEFR level at the end of a learning cycle (for instance by means of a complément au
bulletin attesting the level of proficiency demonstrated by the student against target
standards such as A2 or B1), the Framework’s descriptors can be drawn upon in almost
direct fashion, even if certain inconsistencies and gaps might still have to be addressed in
the interests of validity and reliability. CEFR descriptors can also intervene in the
establishment of marking grids for regularly conducted summative tests, as they
potentially provide a basis for a more reliable, criteria-based assessment of specific
learner performances. However, due to their rather vague, generalising and proficiencycentred terminology, their inclusion into summative marking grids is not as
straightforward in this second instance; further adaptations and specifications might have
to be supplied in that case to increase the immediate relevance and aptness of the
assessment.
The next two chapters of this thesis will now provide a detailed description and
analysis of activities, test and task types implemented in the classroom to stimulate and
develop the learners’ productive skills, and the corresponding means and strategies that
were used to assess the students’ resulting performances. This will not only allow us to
explore how such competence-based ways of approaching speaking and writing might be
incorporated into everyday teaching practice, but also to look for beneficial effects as well as
potential remaining shortcomings of the applied methods.
Chapter 3: Competence-based ways of assessing speaking at A2 level
As briefly outlined in chapter 1, the systematic assessment of speaking skills has
traditionally played a strongly subordinate role in lower-level ESL classes in the
Luxembourg school system. Instead, writing tasks have usually been favoured in
classroom testing, often to the point where the summative assessment of the other
productive skill has been completely ignored throughout the students’ first two (or more)
years of English language courses. Yet an important aim of our schools undoubtedly
consists in producing independent individuals who are fundamentally capable of calling
upon a foreign language to communicate with others in a variety of situations. Does it
make sense, then, to neglect the most spontaneous, common and immediate form of
communication in such a disproportionate manner? Certainly not, particularly within a
competence-based teaching and assessment scheme largely inspired by the strongly
communicative approach of the CEFR. Language teachers all around Luxembourg, by
virtue of new official syllabi that oblige them to test all major skills over the course of the
‘A2’ English cycles of both the ES and EST systems, now suddenly find themselves in a
position where they have to alter their own approach to testing speaking, often in
fundamental ways. As with all major changes to the traditional assessment system, some
initial uncertainty evidently exists as to how this new curricular requirement can be
suitably put into practice. Which precise skills and competences are to be called into
action for an appropriate and sufficiently thorough assessment of speaking? What type of
tasks can be used to achieve a satisfactory activation of those skills and competences?
And what type of instrument(s) should teachers correspondingly rely on to arrive at a
highly valid and reliable assessment of student performances?
In this chapter, I will first outline a number of salient features of speaking
performance that need to be taken into account when drawing up, implementing and
assessing summative test tasks that focus on this complex productive skill in the
classroom. These theoretical considerations will then be illustrated through the
description and analysis of a range of activities that I conducted in a ‘pre-intermediate’
class of the EST system, with the aim of integrating a competence-based approach to
assessing speaking into everyday teaching practice.
3.1. Central features of interest to the assessment of speaking
3.1.1. Features shared by both productive skills
In terms of what is relevant to testing and assessment, the two productive skills of
speaking and writing have several features in common. As Harmer points out, ‘there are a
number of language production processes which have to be gone through whichever
medium we are working in’ 1 . Such similarities occur in various aspects of the language
samples that our students produce.
As far as the form of a linguistic sample goes, the two concepts of fluency and
accuracy are of key importance to both speaking and writing performances. Hence, the
student is generally supposed to demonstrate adequate range as well as appropriate
choice and use in terms of lexis and grammar in order to deal with the task he is given.
In general, as Brown stresses, it is ‘now very clear that fluency and accuracy are both
important goals to pursue in CLT and/or TBLT’. In the context of ‘pre-intermediate’
classes, it is also interesting to note that ‘fluency may in many communicative language
courses be an initial goal in language teaching’ 2 to ensure that learners can bring across a
basic message; subsequently, accuracy often takes on an increased role as it becomes more
and more important to “fine-tune” the quality of the student’s performances. However, in
an isolated test, the teacher (as assessor) must decide to what extent fluency is to be
prioritised over accuracy (or vice-versa) in accordance with the specific cognitive
demands of each constituent task. Hence, if the achievement of a narrow objective is to be
verified (such as the student’s appropriate use of discrete lexis items in response to
controlled, closed questions), it is clear that accuracy plays a predominant role in the
corresponding assessment. On the other hand, the focus tends to be put on the student’s
fluency if longer samples of fairly free writing or speaking are elicited, for example
through more open questions. If the various components of a summative test aim to allow
for a mixture of both achievement and proficiency assessments, it is clear that
1 Harmer, op.cit., p.246. Emphasis added.
2 Brown, Teaching By Principles, p.324. Italics added. (CLT = communicative language teaching; TBLT = task-based language teaching)
considerations about both accuracy and fluency must be taken into account and weighed
against each other in regard to the respective task requirements.
Convincing oral and written productions also have to meet some comparable
requirements on a structural level. As Harmer summarises,
for communication to be successful we have to structure our discourse in such
a way that it will be understood by our listeners or readers. 3
To achieve this, the two components of coherence and cohesion are of central
importance. Coherence can broadly be defined as ‘the way a text is internally linked
through meaning’ 4 ; hence, the ideas in a given text or speech are structured and
sequenced in such a way that the listener or reader can follow the author’s intended
meaning and reasoning without confusion. Cohesion, on the other hand, may be seen as
‘the way a text is internally linked through grammar/lexis’ 5 and stylistic elements in
general such as, for instance, linking words or phrases used to connect successive
paragraphs with each other. Due to the higher amount of time and planning (as well as the
possibility of re-drafting) generally available when approaching a writing task, it may be
sensible to expect (and thus insist on) a higher degree of coherence and cohesion in
written productions than in spoken ones (which often result from a much more impulsive
and immediate engagement in communication). As Harmer points out, ‘spontaneous
speech may appear considerably more chaotic and disorganised than a lot of writing’; in
general, however, ‘speakers [still] employ a number of structuring devices’ which may
even include speech-specific features such as ‘language designed to ‘buy time’ [or] turn-taking language’ 6 . In speaking performances of ‘pre-intermediate’ language learners, the
latter two elements may of course be too challenging to be included, particularly if such
conversational strategies have not been pre-taught. Yet even at that level, each assessment
of speaking skills globally needs to take into account to what extent the learner is able to
structure his discourse so as to increase the intelligibility of the intended message or line
of argument.
Evidently, the overall content of a particular speaking or writing performance also
needs to fulfil a number of conditions to be considered a successful effort on the part of
the learner. Features such as completeness and level of detail of the information provided
3 Harmer, op.cit., p.246.
4 Martyn Clarke, PowerPoint notes from the seminar ‘Creating writing tasks from the CEFR’, held in Luxembourg City in October 2009, © Oxford University Press, p.3. Emphasis added.
5 Ibid., p.3. Emphasis added.
6 Harmer, op.cit., p.246.
in the student production, as well as the general relevance to the actual topic or
instructions, usually combine to determine an adequate achievement of (or response to)
the set task. However, one needs to be very cautious in the interpretation and valuation of
content-related criteria, especially as they are generally considered to be of overarching
importance in performance assessment. Indeed, the complex interplay of form and
content can lead to a number of problematic cases. For instance, an inadequate
performance in terms of content can sometimes be explained by the student’s insufficient
mastery of form-related knowledge or skills (i.e. a student lacking the necessary lexis or
grammar will not be able to express task-relevant ideas in appropriate ways). In that
instance, a potential unsuitability of the task and its excessively demanding requirements
might be at fault rather than the learner himself; for example, insufficient scaffolding and
pre-teaching of key elements may in fact have been offered to the student in preparation
for the test. On the other hand, a learner performance may be perfectly proficient in
regard to purely language-related features (such as vocabulary range and grammatical
structures); yet it can evidently still be seen as an inadequate effort – and rightly so – if it
completely ignores the topical specifications of the test task.
Finally, to further enhance the relevance and adequacy of their spoken and written
performances, learners need to be aware of appropriate styles and genres to use 7 .
Depending on the purpose (e.g. information, inquiry, commentary,…), audience and
setting of a given act of speaking or writing, competent language users must show that
they can choose an appropriate format and register (e.g. formal / informal) for the
required communicative act. As Harmer stresses, this also presupposes an awareness of
‘sociocultural rules’; particularly in the context of foreign language learning, students
indeed have to develop an understanding of conversational or textual conventions that
‘guide [typical] behaviour in a number of well recognised speech events’ within the target
culture, ‘such as invitation conversations, socialising moves, and typical negotiations’, to
name but a few 8 . A student’s sociolinguistic competence, ‘concerned with the
knowledge required to deal with the social dimension of language use’ (CEFR, p.118),
may not be an explicit and stand-alone focus when it comes to assessing a particular
spoken or written language sample. Nevertheless, it is often crucial that learners develop
it to a satisfactory degree so as to prevent their task response from becoming
inappropriate in a range of other interlinked areas.
7 Ibid., p.247.
8 Ibid., p.247.
3.1.2. Features specific to speaking
The use of correct orthography is an aspect of accuracy that is generally confined to
the assessment of written performances. In speaking, the students’ pronunciation skill
may be considered to be its logical counterpart. Such parallels are indicated in the
respective CEFR scales: for orthographic control, the A2 descriptor expects ‘reasonable
phonetic accuracy (but not necessarily fully standard spelling)’ (CEFR, p.118) from a
language learner of that level. The assumption goes that occasional wrong or missing
letters can be automatically corrected by the reader so that a breakdown in
communication is avoided. For a speaking performance, the CEFR similarly stipulates
that an A2 learner’s ‘pronunciation is generally clear enough to be understood despite a
noticeable foreign accent’ (p.117); again, the listener may be expected to make some
reasonable “compensations” for the imperfections in his interlocutor’s speech. However,
the importance of phonological control can easily be underestimated; in fact, one might
wonder whether the corrective ‘effort’ made by a listener is always comparable to the one
made by a reader. One could even argue that faulty pronunciation can have a more
adverse impact on successful communication than occasional misspellings, at least as far
as the English language is concerned. Thus, a single misplaced letter in writing may not
cause unintelligibility very often; one wrong vowel sound in speaking, on the other hand,
can radically alter or obfuscate the meaning of an utterance. In that respect, it is certainly
advisable not to prematurely dismiss pronunciation as a mere sub-skill of relatively little
importance when it comes to assessing a general speaking performance.
In a different respect, the communicative process of interaction is one of the key
areas where the applicable scope of the two productive skills clearly differs. In the
broadest sense, writing can of course easily be used for interactive purposes (for example,
a letter may be written in reaction to another one). However, a very different type of
interaction certainly occurs in oral communication, where speakers can immediately react
and adapt to each other’s utterances. In that respect, oral discussions can undeniably
develop complex dynamics that would be difficult to emulate in written form (except
perhaps via contemporary means of written interaction offered by online services such as
instant messaging programs). When designing speaking tests, a clear distinction thus
needs to be made between tasks which aim at activating skills of interaction and others
which are centred on a longer, individual turn for a single student. As Heyworth notes, the
‘CEF description’ of speaking skills usefully caters for this crucial distinction: it ‘allows
us to distinguish between spoken production and spoken interaction’ 9 and
correspondingly provides different descriptor scales for both aspects rather than just
approaching speaking as one large general skill.
In view of test design, recognising this fundamental importance of interaction in
many speech acts opens up a number of stimulating possibilities. Teachers pursuing more
traditional strategies often seek to collect evidence of students’ speaking skills purely
from rehearsed speeches such as formal presentations. Yet such an approach is often
flawed since the corresponding oral productions tend to be prepared or even read out from
notes; in that sense, they do not effect a “true” and direct activation of speaking skills. In
contrast, truly interactive tasks require the students to access their repertoire of oral
competences “in real time”; no extensive preparation (for example through word-for-word memorisation of precise sentences and formulations) and thus no such distorted or
falsified picture of actual oral proficiency can occur in that instance. In addition, as
Brown puts it,
learning to produce waves of language in a vacuum – without interlocutors –
would rob speaking skill of its richest component: the creativity of
conversational negotiation. 10
A number of suitable and varied activities are conceivable in the aim of creating such an
exchange. Teacher-student (T-S) interaction, for example a short teacher-guided
interview, may thus be included in a speaking test just as easily as different types of
student-student (S-S) interaction. Hence students may for instance be asked to engage in
opposing debates or, in contrast, to work collaboratively towards the achievement of a
common goal. As an added benefit, different micro-skills may be verified in all of these
types of interactive activities (T-S and S-S). Students may, for instance, be expected to
show their ability to ask for repetition or clarification in the target language;
similarly, appropriate use of turn-taking skills can take on an important role particularly
in S-S tasks 11 . Naturally, however, the insistence on interactive competences does not
prohibit the simultaneous inclusion of more ‘production’-oriented activities (requiring a
longer, exclusive turn for each individual student) in one and the same speaking test. In fact, as
seen in chapter 2, the use of a wider range of considerably diverse task types (generating
9 Heyworth, art.cit., p.16. Emphasis added.
10 Brown, Teaching by Principles, p.327.
11 Incidentally, for all of these strategic micro-skills, the CEFR once again provides useful, separate descriptor scales: see tables for ‘Taking the floor’, ‘Co-operating’ and ‘Asking for Clarification’ (CEFR, pp.86-87).
a ‘range of performances’ rather than a single, one-dimensional sample) is advisable if the
assessment of general proficiency is the defined purpose of the speaking test.
3.2. Case study 1: speaking about people, family, likes and dislikes
After identifying the main components that a thorough assessment of speaking skills
should be founded on, the next step consists in finding a suitable way of integrating these
concepts into daily teaching, testing and assessment practice. At this point, I will
therefore venture into a detailed description and analysis of two different speaking tests
that I implemented in the classroom with that aim in mind. The first one, detailed in this
section, mainly focused on the thematic area of physical descriptions, as well as a general
discussion of A2-typical areas of interest such as personal information, family, likes or
dislikes, and free-time activities.
3.2.1. Class description
The practical experiments with competence-based tests and assessments
presented in this thesis were conducted in one and the same 9TE class at the Lycée Technique
Michel Lucius in the school year 2009-10. That class was initially composed of 22
students of different nationalities (12 girls and 10 boys; 12 pupils were Luxembourgish, 5
students had a Portuguese background and 5 others were of Italian descent). However,
two students left the class and another pupil joined the group over the course of the school
year 12 . A slight peculiarity consisted in the fact that no fewer than seven students were retaking their year; hence, they arguably started out with a slightly higher initial level of
proficiency in English than their peers. Correspondingly, student age generally ranged
from 14 to 16 years at the beginning of the year, with the exception of one 17-year-old
boy. No seriously disruptive behaviour affected the general climate in the classroom, and
so the overall atmosphere among the students was mostly a co-operative and mutually
supportive one. As a positive corollary, the absence of serious disciplinary problems was
certainly conducive to the students’ ability to concentrate on competence-based language
activities in an efficient and attentive way.
12 One (Luxembourgish) girl left for a similar class in another school, while the other (Portuguese) student, already re-taking her year, voluntarily joined a 9PO class after the first term. The new student, a 15-year-old Luxembourgish girl, transferred into the class from a school belonging to the ES system in the middle of the second term.
3.2.2. Laying the groundwork: classroom activities leading up to the test
As the summative speaking test described in this case study was scheduled for the
penultimate week of the first term, ample time was available to gradually prepare the
students during the weeks leading up to it. In a number of ways, it was of paramount
importance to familiarise the students both with the subject-matter and the types of tasks
that would be the cornerstones of the eventual summative assessment. First of all, of
course, the necessary lexical and grammatical groundwork had to be laid so that the
students would have a sufficient knowledge base to master the test tasks in a satisfactory
way. However, on a methodological level as well, it was particularly necessary to provide
scaffolding and to model expected behaviour in different activity types as the students
had been completely unused to having their speaking skills assessed in summative tests
up to this point. Hence they needed to become accustomed to elements such as interactive
pair work activities and the verbal description of visual prompts so that student-related
reliability could be increased in their test performance. While reliability-affecting factors
such as ‘exam nerves’ can of course never be completely pre-empted, they would
certainly have a less prominent impact in the eventual test if the students recognised the
task requirements (i.e. the expected type of behaviour and performance) in the test
situation due to previous encounters with similar tasks in “regular” classroom time. At the
same time, the more the test components corresponded to the material and activities
treated in class, the better their content and face validity would be ensured.
Since one of the primary thematic areas to be incorporated in the test consisted in
physical descriptions of people, a number of classroom activities evolved around that
topic as well. To raise intrinsic motivation and familiarise students with interactive
strategies at the same time, short games proved a particularly useful tool. In one instance,
the students were thus asked to describe the looks of a specific person to their respective
neighbours; each student was handed a cue card with the picture of a celebrity, and
according to his or her descriptions, their partner had to guess who the described man or
woman was (if necessary by asking further questions for clarification). Another activity,
simultaneously recycling the use of personal pronouns and family vocabulary, asked the
students to complete a family tree by drawing some missing facial features into various
partially blank portraits on their handouts; each pair of students had received two
different versions of the same family tree so that they had to elicit missing information
from their partner through targeted questioning and negotiating 13 . In a playful manner,
the students were thus encouraged not only to apply some newly acquired topical lexis,
but also to actively develop their interactive speaking skills and become used to working
with a partner in the process.
In terms of feasibility, one might add that it is by no means a difficult or
particularly time-consuming endeavour to integrate such explicitly speaking-focused
activities into one’s general teaching routine; indeed, the necessary time-span to be set
aside for either of the described activities did not exceed twenty minutes. In fact, as all the
students were simultaneously engaged in discussions with their neighbours (while the
teacher’s role was confined to monitoring the learners’ efforts and occasionally helping
them out), such games actually constituted a very time-efficient way of actively involving
as many students as possible within a relatively short amount of time.
3.2.3. Speaking test 1: description of test items and tasks
The first summative speaking test was strategically placed at the end of term so that
it could deliver relevant indications in view of both achievement and proficiency
assessments. On the one hand, its various components therefore needed to check that
adequate learning had taken place in relation to specific objectives (i.e. reflecting the
elements expressly focused on in the weeks immediately before the test, such as lexical
accuracy in physical descriptions as well as the correct grammatical use of the present
simple and past simple tenses). On the other hand, longer individual speaking turns could
also offer some insight into the general progress (i.e. competence development) made by
individual students in the three months since the start of the school year. Hence, when
compiling the items and tasks for this test it was not only necessary to ensure reasonable
validity, reliability and a sensible degree of difficulty; if a genuine, comprehensive insight
into speaking proficiency was to be reached, it was also important to include both
individual turns (of spoken production) as well as elements targeting spoken interaction.
In that sense, it quickly became clear that a range of activities (rather than a single
one) needed to be included in the test to yield conclusive results. In turn, however, this
gave rise to a new set of questions. Could one expect a sufficiently extensive sample from
a language learner working within the A2 level of proficiency to provide adequate
indicators for all of these different foci? If so, how did the different tasks need to be
13 This activity was adapted from Mick Gammidge, Speaking Extra, Cambridge University Press (Cambridge: 2004), pp.16-17.
structured and sequenced to support the learners in their efforts and thus ensure
maximally effective results?
In regard to such crucial questions, Harmer reminds us that
[i]n the first place, we need to match the tasks we ask students to perform with
their language level. This means ensuring that they have the minimum language
they would need to perform such a task. 14
Therefore, the different sections of the test needed to be firmly centred on the actual
vocabulary items and linguistic structures that the students had already encountered and
practised in the classroom; the occasional provision of structural or lexical clues within
the individual test items would further help to support the learners in their efforts.
Another particularly helpful form of scaffolding that Harmer points out concerns
the thoughtful sequencing of tasks:
[t]eachers should not expect instant fluency and creativity; instead they should
build up students’ confidence ‘bit by bit’…, giving them restricted tasks first
before prompting them to be more and more spontaneous later. 15
Establishing a clear order among the tasks according to their increasing cognitive demand
certainly makes sense in a summative test. As students will be aware that their
performance has an immediate impact on their overall grade (and thus their general
progress at school), they are likely to be particularly nervous at the start of the test
situation. Making sure that comparatively simple, confidence-building tasks are
incorporated at the beginning of a test can therefore be vital, as early positive experiences
(leading to the feeling that they can do this) will help the students to quickly allay any
potential test anxiety (at least to a certain extent). As a consequence, it is more likely that
they will be able to access their full and genuine potential for the more demanding tasks
to follow; in turn, the test will then provide more reliable indications as to the students’
actual level of competence in speaking.
With that purpose in mind, I chose to use simple T-S interaction at the start of this
speaking test: while the students took the test in pairs, each individual candidate was first
asked a few simple, closed questions about his or her personal details (e.g. name,
hometown…) and about daily routines or simple likes and dislikes (e.g. favourite music,
free time activities…). In addition, a simple spelling task was included in relation to one
of the students’ answers, providing a first precise indication of the students’
phonological control. The two candidates were questioned in regular alternation (usually
14 Harmer, op.cit., p.251.
15 Ibid., pp.251-252.
two to three questions per student at a time) so that no individual would become excluded
from the conversation for too long (and thus be allowed to “switch off”). The questions
used in this part of the test were also clearly defined and scripted beforehand 16 , with the aim
of guaranteeing similar answer lengths from each student while still leaving myself the possibility
to slightly vary the topics of inquiry. In that way, later candidates would not know with
absolute certainty what questions to expect even if they had received some information
from their peers who had already taken the test; nor would a pair of candidates receive
exactly the same questions, as this would have given an unfair advantage to the second
speaker. Both the pattern and content of this opening task were inspired by the simple
exchanges that typically characterise Speaking Part 1 of the Cambridge ESOL ‘Key
English Test’ (KET) 17 – an examination specifically designed to assess and certify A2-level proficiency in English.
The second, more extensive section of the speaking test then consisted in two
thematically interlinked parts: a longer individual turn, followed by a controlled S-S
interaction task 18 . At this stage, the pair of candidates were asked to choose one envelope
(for the two of them) from a pack, which I then opened. Each envelope contained four
complementary cue cards (two for each student). First, student A received one card with a
picture of a famous person (not to be revealed to the partner) and some corresponding
facts about that celebrity (e.g. date of birth, marital status…). The other card, given to
student B, contained basic cues to ask for information about that celebrity19 .
The first task, student A’s long turn, consisted in describing the looks of the
depicted celebrity in as much detail as possible (but without giving away the famous
person’s name), thus activating the competences built up through the speaking activities
implemented in class. As the student was supposed to display his ability to complete this
task independently (potentially using simple forms of paraphrasing whenever a precise
lexical item eluded him), I chose not to intervene with any linguistic prompts here if the
speaker “got stuck” (although strategic tips were occasionally given, such as ‘try using
another word or working with opposites if you can’t find the precise expression’).
16 See appendix 1.
17 A general description of this KET speaking activity can for example be found in University of Cambridge ESOL Examinations, Key English Test – Handbook for Teachers, UCLES (Cambridge: 2009), p.35.
18 The S-S interaction task was partly modelled on role play tasks that can be found in KET examinations as well (see Key English Test – Handbook for Teachers, p.37 for examples). The main difference resides in the fact that KET role plays are based on fictitious material (e.g. information about invented schools, libraries…) whereas the material used in this speaking test dealt with real people and accurately researched, authentic facts (in an aim to further increase student interest).
19 See appendix 2 for samples.
Once the physical description was complete, I instructed student B to use the cues
on his card to ask precise questions about the celebrity’s life and career, which student A
then had to answer by using the data provided underneath the picture on his card. After all
the questions had been answered, student B was asked to venture a guess about the
celebrity’s identity, mirroring the similar games implemented in class (although, of
course, failure to guess correctly would not entail a deduction of marks as this
could evidently not be construed as a legitimate achievement focus). The roles were then
reversed; student B was given a cue card portraying another famous person, student A
received the corresponding questioning cues, and the game started anew. At the end of
this activity, both students were finally asked whether they liked their respective
celebrities (as well as why/why not), and who their favourite famous singer (or actor /
model / athlete…) was, so as to involve them into a less controlled discussion once more
and elicit a final longer sample from each pupil.
Importantly, to prevent the students from simply copying their respective partner’s
exact formulations (in terms of both physical descriptions and questions), I had taken
particular care to make the various “pairs” of celebrities as contrastive as possible. Hence
any given envelope included one male and one female celebrity to elicit the respectively
accurate personal pronouns and possessive adjectives. Similarly, I targeted significantly
different physical descriptions by trying to vary elements such as eye and hair colours and
shapes, complexions, build, and so on. The type of information to be asked or given about
the celebrities during S-S interaction also slightly varied to trigger correspondingly
diverse questions and answers.
3.2.4. Practical setup and test procedure
For the implementation of this speaking test (as well as the second one described in
section 3.3. below), a number of factors were important to ensure reasonable feasibility.
Hence, the decision to have students take the test in pairs was useful not only because of
the possibility of integrating interactive tasks, but also because it made the
administration more efficient in regard to time and required material. If the students had
been called up one by one, more time would indeed have elapsed between individual
performances, and a higher number of different test questions might have been needed to
make sure that later candidates could not completely anticipate possible test content and
thus gain an unfair advantage before even starting the test.
Of course, the decision to put students into pairs does not come without dangers of
its own. Particularly if a summative mark is at stake, it would be negligent to overlook the
risks of allowing potential room for lopsided pairs in terms of ability and language
proficiency levels. If students need to feed off each other’s performances to fuel their own
successful reactions in interactive activities, one individual’s performance can be dragged
down by his peer’s if the latter is unable to keep a dialogue going in an appropriate way
(or even at all). In this respect, Brown draws attention to what Nunan calls ‘the
interlocutor effect’: ‘one learner’s performance is always colo[u]red by that of the
person (interlocutor) he or she is talking with.’ 20 Indeed, while helpful contributions from
a so-called “strong” student could have a positive effect on the performance of a
“weaker” one, the opposite scenario is just as likely. Thus, if a “strong” student is unable
to fulfil part of a task satisfactorily because of the insufficient clues provided by his
partner, this can easily have uncharacteristic, detrimental consequences on his final mark.
Similarly, due to elements such as the “halo effect”, the teacher may also be tempted
(albeit unintentionally) to excessively mark down a “weaker” student if his efforts in the
test have been overshadowed by a particularly good parallel performance of a much more
proficient partner. In other words, the reliability of results (and of the scorer) can be at
risk in multiple ways if insufficient care is devoted to the sensible formation of
candidate pairs.
However, the consideration of affective factors can help to compensate for this
potential weakness to a certain extent. Thus, one reason to validate this methodological
choice consisted in providing a reassuring element for the students in the test situation:
particularly as it was an entirely new type of summative test experience for them, they
might have been more intimidated or overwhelmed if they had had to face the task (and
the teacher) entirely on their own. Moreover, I left the choice of partner up to them so that
they could team up with somebody that they would work and cooperate comfortably with;
in virtually all cases, this led to the same pairs that students had worked in during
previous in-class activities. Therefore, they did not have to readjust to a different
interlocutor during the test than the one they had practised with in the classroom – a
reassuring situation with positive effects on student-related reliability. It needs to be
added that there were no extreme cases of exceptionally “strong” or “weak” students in
this class; as a result, I was satisfied with the pairs that the students chose because the
20 Brown, Teaching By Principles, p.325. Emphasis added.
respective partners were all fairly evenly matched in terms of ability. For that reason, I
did not have to intervene and reassign students to different pairs in this case; however, as
noted above, such measures may be necessary in other cases to prevent excessively
unreliable results.
As location, an empty classroom (adjacent to the students’ usual one) was used for
the administration of the test in order to minimise the risk of external disturbances
hampering student performance. Each pair of candidates were successively called into the
“test room” while in the meantime the rest of the class were doing language work in their
own classroom, supervised by a member of the school staff 21 . Evidently, it must be noted
that such opportunity and flexibility to increase the reliability of test conditions was
surely fortunate; in everyday teaching practice, it is highly likely that spare rooms and
supervisors will frequently be unavailable. As four regular school lessons (of 50 minutes
each) were ultimately necessary to test the entire class (approximately 15 minutes per pair
of candidates, involving about 5 minutes of “pure” speaking time per student plus the
time necessary for them to get installed and to receive test material and instructions), the
conditions described in this instance must certainly be considered exceptional. In most
cases, a solution will have to be found to administer the test in one’s own classroom,
which presupposes that the rest of the class must be kept silent and occupied for the entire
duration of the actual implementation.
While test-taking students were engaged in the different speaking activities, my
own role was a rather challenging and complex one. On the one hand, I evidently needed
to provide instructions to the students and drive the dialogues forward in the various parts of
the test through questioning (in T-S interaction) and occasional prompting or probing (in
S-S sections). Simultaneously, however, the quality of the students’ performances
evidently had to be assessed; given the limited amount of time available for each pair of
candidates, it was not straightforward to combine this task with the other, administrative
role. Although a first brief assessment was possible during the test phase itself (see
section 3.2.5. below), I also put an MP3 player on the teacher’s desk to record the
students’ speaking performances. This allowed me to go back to individual efforts at a
later point in time to verify some of the nuances and details that were impossible to
capture all at once during the actual test. It also provided me with useful evidence of
21 At the LTML, full-time "surveillants" were handily available and, upon my request, kindly charged with that specific task by the deputy headmaster.
performance which could be retrospectively consulted in case a student were to challenge
the marks he or she had received in the test.
3.2.5. Form, strategy and theoretical implications of assessment used
To avoid the plain conversion of an overly impulsive, subjective and holistic (and
thus insufficiently justified) impression into eventually awarded marks, it was clear from
the outset that the summative assessment of the students’ speaking performances needed
to be founded on clear and well defined criteria. The most suitable assessment tool for
the simultaneous inclusion of a range of different foci is a marking grid, where
descriptors of expected standards can be juxtaposed to elements that define sub-par
performances as well as particularly proficient ones, thus establishing what Goodrich
calls ‘gradations of quality’ 22 .
In order to ensure that such a marking grid is immediately usable for the assessment
of a speaking performance in practice, it needs to fulfil a number of conditions. Hence it
evidently has to do justice to the inherent complexity of speech acts, yet at the same time
it must not be overloaded so that the assessor does not get sidetracked by an excessive
number of small details and nuances. In the same vein, the descriptors used for the
different criteria need to be fairly short and concise to be of immediate use in practice; if
too much (or excessively vague) information is included, it is difficult to maintain a clear
overview of the core elements to focus on. In this respect, it quickly becomes evident why
the general proficiency descriptors in the various CEFR scales cannot be copied directly
into marking grids that are applied to the rather narrowly focused performances in a
single summative test. Thus, the A2 descriptor for ‘overall oral production’ states that the
student
can give a simple description or presentation of people, living or working
conditions, daily routines, likes/dislikes etc. as a short series of simple phrases
and sentences linked into a list. (p.58)
This certainly gives us a general idea of the type of performance that might be required,
but for the precise attribution of marks in relation to various features of a specific speech
act, it does not give us sufficiently nuanced criteria to work with (an impression that is
compounded by the fact that even CEFR descriptor scales which target a more precise
linguistic competence, such as ‘grammatical accuracy’ or ‘vocabulary range’, are equally
generalising in nature). In fact, in a 9TE (or 6eM) class, many different students might
22 Heidi Goodrich Andrade, ‘Using Rubrics to Promote Thinking and Learning’ in Educational Leadership, 57, 5 (2000), p.13.
produce an oral performance that globally fits this description, yet a variety of nuanced
qualitative differences will certainly exist between them. Hence they might all have
surpassed the very basic A1 expectations but still obviously be below B1 level; however,
even if they all thus broadly operate within the scope of the A2 level, their performances
will evidently still not be of exactly the same level of quality in every single aspect. As
the numerical marks eventually awarded in a summative test should reflect such
differences, they need to be based on descriptors that are more closely linked to the
requirements of the actual test tasks. Hence, while the CEFR descriptors might give us a
general idea about the degree of difficulty to be expected (and set) in A2-level tests, their
inherent function remains to define overall proficiency, which can only be inferred from
(and thus subsequent to) a range of performances instead of a single, precise one.
Correspondingly, marking grids for individual summative tests do need to be informed by
the realistic proficiency expectations defined by the CEFR descriptors to make sure that
their highest bands are achievable by learners of a particular level (in this case A2);
however, the CEFR descriptors themselves are too global and generalising in nature to be
directly used in such grids and equated to numerical values.
Given the wide range of key criteria identified in relation to speaking performances
(see section 3.1.), it was first of all necessary to construct a grid that regrouped these
various factors in a logical and efficient way. However, rather than engaging in the
daunting task of devising a completely new assessment scheme (which would have been
extremely risky in terms of validity and reliability given my own limited experience with
systematic summative assessments of speaking up to that point), it was a more sensible
option to draw on already established ones that were compatible with the central aims of
the first implemented speaking test. Since the nature and purpose of a number of test
items was influenced by various elements appearing in Cambridge KET examinations, it
made sense to generally align my main assessment criteria with the ones used in those
high-stakes tests as well. The Cambridge ESOL approach to assessment described in the
corresponding Handbook for Teachers includes an interesting focus on four central
categories which permitted the inclusion of all the central features of speaking that I
intended to concentrate on in this speaking test 23 :
1. ‘Grammar and Vocabulary’, where the expected degree of accuracy could be
globally defined in relation to those two core elements;
23 See Key English Test – Handbook for Teachers, p.35 for a more detailed description of the entire Cambridge ESOL assessment scheme in regard to speaking skills.
2. ‘Pronunciation’, which was usefully included as an important feature of speaking to
be considered in its own right;
3. ‘Interactive Communication’, which, due to the different interactive exchanges in
the speaking test, was certainly an essential criterion to be included in the marking
grid;
4. ‘Global Achievement’, which regrouped content-related elements such as
completeness, level of detail and relevance of the answers provided; it also took into
account to what extent the students were willing to take risks to experiment with the
target language to clarify and expand their answers (rather than just doing the mere
minimum). For instance, the required physical description of a celebrity based on
visual prompts provided a prime example of an activity where students could achieve
high scores in relation to such criteria in this particular test. Whereas a ‘basic
standard’ performance would be marked by very short utterances (e.g. ‘He is tall. He
has blue eyes.’), more adventurous and complex attempts would allow the student to
reach the ‘target standard’ (e.g. through longer sentences and deliberate use of
adjectives or adverbs, as in ‘This famous person is a very handsome, tall man. He has
beautiful blue eyes.’).
The descriptors in the correspondingly established final marking grid (see appendix 3)
were inspired by the expected requirements for a successful completion of the different
KET speaking parts as described in the corresponding Handbook for Teachers. Through
the certified alignment of the Cambridge ESOL exams with the CEFR levels, one could
also generally assume that the descriptors defining acceptable performance in relation to
the different key criteria in this grid reflected standards that could genuinely be expected
from A2 language learners 24 . Descriptors were established for three different bands:
‘insufficient’, ‘basic standard’ and ‘target standard’. In between those three levels of
performance, further nuances would have been virtually impossible to define without
creating excessive overlap and confusion; therefore, I chose to leave out unspecified
additional bands altogether and to stick to this three-fold ‘gradation of quality’ instead.
A particular numerical value then needed to be assigned to the achievement of each
band. As my 9e students had never taken a summative speaking test before, I wanted to
24 For the establishment of this marking grid, I am indebted to Tom Hengen (an English teacher and member of the ‘groupe de travail - Socles de compétences pour l’anglais’ in Luxembourg), who provided me with the core layout and descriptors. Slight alterations were made to the initial descriptors by myself (e.g. exactly the same expressions instead of synonymous ones were used as far as possible; where some elements were present in one band but not others, corresponding entries in relation to those criteria were added in the other two bands).
avoid raising the stakes excessively so as not to put the pupils under too much pressure in
their first attempts to deal with one. At that point in time, the call to attribute 20% of the
term’s mark to speaking tests was an official suggestion but not a binding requirement
yet 25 ; therefore, it was possible to exceptionally allot only 12 marks (rather than a
possible but arguably too weighty 36) to this first-term test. That way, if a student turned
out to be overwhelmed by the new task, or uncharacteristically underperformed in it for
one reason or another, this would not immediately have a highly significant impact on his
final mark for the term (and thus potentially even diminish his overall chances of
progress). However, with the novelty wearing off in subsequent terms, I aimed to build on
my students’ increasing experience and familiarity with such tests to gradually increase
their corresponding significance and weighting over the course of the school year. For the
time being, with four major sets of criteria considered to be of equal importance, and
three bands of performance defined for each of them (allowing for a maximum total of 4x3 = 12 marks), a very simple ratio was available to associate numerical marks with
the marking grid in this instance (i.e. 1 mark per band attained per category; for instance,
a student reaching band 2 across all four categories would score 4x2 = 8/12 marks in
total).
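Expressed schematically, this conversion can be summarised in a brief illustrative sketch (written here in Python purely for convenience; the function and the exact category labels are mine and do not appear in the original marking documents):

# Minimal sketch of the 1-mark-per-band conversion used with this first grid:
# four categories, three bands each, hence a maximum total of 4 x 3 = 12 marks.

BANDS = {1: "insufficient", 2: "basic standard", 3: "target standard"}  # band labels, for reference

def total_mark(band_per_category):
    """Each band attained counts as the corresponding number of marks in its category."""
    return sum(band_per_category.values())

# Example: a student reaching band 2 ('basic standard') in every category
student = {
    "Grammar and Vocabulary": 2,
    "Pronunciation": 2,
    "Interactive Communication": 2,
    "Global Achievement": 2,
}
print(total_mark(student))  # -> 8 (out of 12)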
In this respect, it is important to note that students did not have to reach the highest
level of performance to achieve a ‘pass’ mark in the test; instead, attainment of the ‘basic
standard’ (or band 2) would be sufficient. On the other hand, full marks would only be
awarded for a performance that corresponded to the ‘band 3’ descriptor for a particular
criterion. Yet to be in line with a CEFR approach that (as seen in chapter 2) grants the A2
learner some (reasonable) remaining weaknesses, even these established ‘target
standards’ crucially needed to avoid expectations of complete faultlessness. In terms of
‘pronunciation’, for instance, the highest band of quality defined in the grid still allowed
for ‘occasional L1 interference’ as a feature of student speech ‘to be expected’ and thus to
be deemed ‘acceptable’ (reflecting the CEFR’s approval of ‘a noticeable foreign accent’
in phonological control noted in 3.1.2. above). A similar awareness of interlanguage-induced interferences explains the presence of relativised statements in the descriptors for
best possible ‘Interactive Communication’ (e.g. ‘in general, meaning is conveyed
successfully’ or ‘can react quite spontaneously’).
25 In 2009-2010, when this project was implemented, this overall weighting was still only a suggestion (but not an absolute, legal requirement) in 9e classes.
The only exception to the rule was provided by the ‘Global Achievement’ category:
as it measured the completeness and relevance of student answers in relation to central
task expectations, this was the only area that warranted the perfect fulfilment of absolute
targets (i.e. ‘all parts of the task’ were to be ‘successfully dealt with’) to earn maximal
marks. It was also the only category of criteria that could have a fundamentally overriding
effect on the general assessment: a ‘band 1’ performance in relation to these content-related elements (most crucially the corresponding realisation that ‘most of the task’ had
been ‘dealt with insufficiently’) would necessarily have to lead to an insufficient mark
overall as the student would not have addressed large parts of the test in a suitable
manner.
At the lower end of the assessment spectrum, even minimal contributions would
often be sufficient to earn the students at least one mark out of a possible three per group
of criteria; only in cases of ‘totally incomprehensible’ or ‘totally irrelevant’ language
samples (or indeed the complete absence of a rateable sample in the first place), a
categorical ‘0’ mark was to be awarded for the student’s entire performance. This also
corresponded to a CEFR-compatible approach, as each sample of language would thus be
analysed in a fundamentally positive manner; even if it had not reached a sufficient level
of proficiency overall, it would still be considered as carrying some merit and adequate
features (even if they were few and far between). This does not mean, however, that
students would not have to work hard enough to earn their marks; an insufficient (‘band
1’) performance in all areas would, after all, still receive a clearly insufficient mark
(4/12, i.e. one mark in each of the four categories). Yet if a student had shown a serious effort to work to the best of his abilities, it
would surely be wrong to discard an entire performance as completely worthless (for
example by “awarding” an outright ‘0’ mark in ‘Grammar and Vocabulary’ simply on the
grounds of a comparably high number of grammatical mistakes committed).
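Taken together, these decision rules can be summarised in the following sketch (again in Python, for illustration only); note that the pass threshold of 6/12 is inferred from the marks discussed in section 3.2.6. below, and that the exact mechanism for enforcing the ‘Global Achievement’ override (capping the total just below a pass) is an assumption made for the sake of illustration rather than a feature of the original scheme:

# Illustrative sketch of the award rules described above (not the original scheme):
# - an unrateable (totally incomprehensible, totally irrelevant or absent) sample scores 0,
# - a band-1 'Global Achievement' performance overrides an otherwise passing total,
# - band 1 across the board still earns some marks rather than an outright zero.

PASS_MARK = 6  # out of 12 (cf. the marks discussed in section 3.2.6.)

def award(band_per_category, rateable=True):
    if not rateable:
        return 0  # no rateable language sample at all
    total = sum(band_per_category.values())
    if band_per_category.get("Global Achievement") == 1:
        # assumed implementation of the content-related override
        total = min(total, PASS_MARK - 1)
    return total

# Example: good marks elsewhere cannot rescue a performance in which
# most of the task has been dealt with insufficiently.
print(award({"Grammar and Vocabulary": 2, "Pronunciation": 2,
             "Interactive Communication": 2, "Global Achievement": 1}))  # -> 5, insufficient despite a raw total of 7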
In general terms, as noted in the previous chapter, one of the key challenges in
competence-based assessment resides in striking a satisfactory balance between
achievement and proficiency factors. In this respect, marking grids predominantly seem
suited to assess overall proficiency. After all, it seems highly unlikely that one can fully
anticipate what exact structures a student may (or may not) use in an act of speech,
particularly if the corresponding test tasks target free production rather than discrete-item
responses. In that sense, even if a central aim of a summative test consists in verifying the
achievement of various small objectives, it is virtually impossible to incorporate a
multitude of correspondingly narrow and precise descriptors into a single marking grid
without running the risk of overloading it (and yet still overlooking some elements that
might later surface in the actual student productions). This also explains why the
terminology in marking grids cannot usually be absolutely precise, particularly in relation
to form-focused descriptors: indeed, a reference to ‘basic sentence structure and
grammatical forms’ in the ‘Grammar and Vocabulary’ section of a marking grid is much
more flexibly applicable to a long, varied language sample than the meticulous, individual
identification of discrete forms and structures.
Moreover, if a speaking test (as in this case) includes a number of different tasks
with different foci, it would be very complicated and time-consuming to establish (and
use) a separate, specific grid for each individual task. It thus comes as no surprise that for
speaking tests such as those used in KET, for example, the general marking guidelines
specify that
assessment is based on performance in the whole [speaking] test, and is not
related to performance in particular parts of the test. 26
Such an approach is very probably the most feasible one for regular speaking tests in the
classroom as well, given the complex demands posed to a teacher who finds himself in
the challenging position of having to test and simultaneously assess multiple aspects of
oral skills. Tellingly, the KET procedure deliberately splits up the test administration and
assessment tasks between two different individuals. Thus, the assessment responsibilities
of a first examiner, acting as the students’ ‘interlocutor’ during the test, are strictly
confined to the ‘award[ing] of a global achievement mark’ 27 . The other, more specific
criteria of the students’ performances are explicitly focused on by a second “assessor”
who does not intervene in the actual test discussion at all and can thus direct his entire
attention exclusively on assessment. In the everyday classroom, of course, very few
teachers generally enjoy the luxury of having such a second assessor at their side, and so
the full burden of assessment falls upon a single person (who also partially has to act as
interlocutor and test administrator). Hence it is very clear that a general assessment in
relation to a number of different criteria is much more feasible during (and in the few
minutes immediately after) the test than one that goes into very precise detail. To
maximise the efficiency of such an assessment, it is evidently an almost unavoidable
prerequisite that the teacher has thoroughly familiarised himself with the different criteria
(and the corresponding descriptors) in the marking grid prior to the test, so that he
26 Key English Test – Handbook for Teachers, p.35.
27 Ibid., p.35.
virtually knows them by heart and needs to invest as little time and concentration as
possible on (re-)acquainting himself with them at the time of the test situation itself.
During the administration of the particular speaking test described here, I quickly
filled out a basic assessment sheet for each student (by inserting crosses into blanked out
sections of the marking grid) 28 in the short periods of time between examinations of
different pairs of candidates. As I had recorded all test performances on an MP3 player, I
was then able to listen to the various conversations again later and, if necessary, slightly
amend assessments in relation to criteria that I might not have been able to focus on in
sufficient detail during my role as interlocutor. In a way, I thus took over the roles of both
KET examiners at different stages of the assessment process; moreover, by being able to
“revisit” each conversation in a less stressful and time-pressured context, I managed to
focus on the various criteria more closely, which in turn reduced the risk of relying on
“gut-feeling” assessments (hence increasing scorer-related reliability in the process). At
the same time, however, it is of course important to resist the temptation of “over-correction”; the intended aim is certainly not to spend hours and hours on a single student sample (and potentially become excessively focused on a particular aspect in the
process).
The overall proficiency signalled by student performances across the various test
tasks is thus a first indication to go on in the assessment of their speaking skills (hinting at
the levels of competence reached). However, that does not mean that confirmation of the
achievement of more precise objectives cannot additionally be sought on the same
occasion. In that sense, the speaking test described here certainly provided numerous
valuable clues as to whether my 9TE students had attained some of the intermediate
objectives of the entire term, even if the latter were not explicitly mentioned in the
marking grid. In this case, some of the language elements that had been previously
covered in class included for example the following:
• Grammar: correct use of the present simple and past simple (including correct question formation and short answers in both tenses), personal (subject and object) pronouns and possessive adjectives;
• Vocabulary: discrete expressions used for physical descriptions (e.g. hair, eyes, build, general appearance), daily routines, family relationships and free time activities;
• Pronunciation: correct vowel sounds, such as the distinction between ‘i’, ‘e’ and ‘a’ (and adequate use of English phonemes in general);
• Interactive Communication: asking for repetition or explanation of a particular utterance in English.
28 See appendix 4.
With the arguable exception of accurate physical descriptions, none of these elements
were tested in clear and deliberate isolation over the course of the speaking test. On the
contrary, most of them were flexibly combined and blended within the students’ overall
speaking performances. For that reason the test was, by definition, not suitable for a pure
achievement assessment in relation to the individual mastery of those various discrete
items. Nevertheless, as a significant amount of time had been spent on elements such as
(for example) past and present simple tenses in the classroom during the weeks and
months prior to the test, a reasonably high degree of accuracy could evidently still be
expected in relation to such familiar items. In that sense, consistent failure to use them
correctly (obviously indicating persisting errors in the students’ interlanguage) would of
course constitute a more legitimate reason to mark a student’s performance down (in
relation to ‘Grammar and Vocabulary’ criteria) than an ambitious yet partially
unsuccessful experimentation with grammatical or lexical items that had not been
previously encountered (or much less practised) in the classroom.
This forcefully underlines the complicated link that existed between achievement
and proficiency considerations in this assessment: while the adequate mastery of precise
structures and items was not directly demanded by the predominantly proficiency-based
descriptors in the marking grid, a generally satisfactory performance in the test was
evidently impossible if none of the grammatical and lexical elements encountered in class
had been acquired and internalised (i.e. precise learning objectives achieved) to a
sufficient degree. In other words, without achievement of at least some specific objectives
in the first place, there can ultimately only be little or no proficiency as a result
(especially at this early, ‘pre-intermediate’ level of language learning where no steadfast
foundation in terms of knowledge and skills has been laid yet). In that sense, this
summative speaking test was certainly capable of providing insights into both detailed
achievements and the learners’ general proficiency; in different ways, both aspects could
be derived from the descriptors used in the marking grid.
3.2.6. Analysis of test outcomes
All in all, this first summative speaking test yielded fairly satisfactory scores. While
the average mark achieved was not particularly high (7.5/12, with marks ranging from 5.5
to 11.5 29 ), there were no disastrously low marks at all, indicating that none of the students
were completely overwhelmed by the tasks they had to deal with in the test. In other
words, the basic standards that had been set were well within the students’ reach, showing
that the test had indeed ‘matched their language level’ to a satisfactory extent. Only three
students ended up with a marginally insufficient final mark (5.5 in all three cases),
although another four pupils just about passed by scoring exactly 6/12. On the other hand,
a third of the class managed to score more than 8 (i.e. 2/3 of the maximum mark),
including two particularly impressive performances (incidentally from two students who
were neither native speakers nor re-taking the year) that were rewarded with final marks
of 10.5 and 11.5 respectively.
It is perhaps not particularly surprising (yet definitely worth noting) that the target
criteria which students found hardest to meet were to be found in the predominantly
accuracy-focused ‘Grammar and Vocabulary’ section: in this area of the assessment
scheme, the highest number of insufficient marks was given (eight students received 1/3)
and the class average was correspondingly lower (1.64/3) than in the remaining three
categories. However, as indicated by the general results above, in more than half of those
cases the students managed to make up for those apparent weaknesses by meeting the basic or target standards in other sets of criteria (which were more closely linked to fluency). In
fact, only a single insufficient mark was scored in each of the remaining
three categories; the highest class averages were reached in the ‘Interactive
Communication’ (2.05/3) and ‘Pronunciation’ (1.98/3) sections. Hence a learner
performance was not automatically doomed to receive an insufficient overall mark simply
because it might have contained a fairly high number of mistakes in terms of accuracy.
Nevertheless, a few legitimate questions arise in regard to the disproportionate
number of cases where the basic standard in ‘Grammar and Vocabulary’ had ostensibly
not been met. Had the expected standards aimed too high? Had the test tasks been too
difficult in this area? Had I been particularly harsh as an assessor in relation to accuracy?
To what extent might an insufficient effort of revision on the part of the students be to
blame as well? In general, a combination of all or several of those factors is easily
29 See appendix 5 for a detailed breakdown of individually obtained marks.
imaginable. In this case, however, I would argue that the vast majority of ‘basic standard’
descriptors in the marking grid did not seem exaggerated at all: based on the CEFR
scales, it is certainly appropriate to expect the A2 student to ‘use a limited range of
structures’ and ‘have sufficient control of relevant vocabulary’ for the tasks set. The test
tasks themselves appear reasonably valid as well, since they mainly recycled familiar and
pre-taught grammatical structures and lexical items and could thus be safely assumed not
to be excessively difficult; furthermore, their topical focus was, after all, of an ‘everyday’
nature that should be rooted firmly within the A2 level. Yet it may seem questionable
whether more than a third of my students really had such ‘little control of very few
grammatical forms and little awareness of sentence structure’ as to deserve only a ‘band 1’ in ‘Grammar and Vocabulary’; therefore, my own approach to this part of the
assessment needs to be examined more closely in this particular instance.
On the one hand, it must be noted that numerous students actually did show
consistent problems with a number of grammatical foci that had been extensively
practised over the course of the term. For instance, numerous errors were recurrently
committed in regard to question formation during the S-S interactive activity where
students had to construct suitable questions from cues even though the very same type of
activity (using similarly structured cues) had repeatedly been implemented in class. In
those attempts, the most frequent mistakes consisted in the wrong use or complete
omission of auxiliary verbs, as well as the confusion of past and present tenses (e.g.
‘*When is he born?’). If numerous mistakes noticeably occurred in relation to various
pre-taught elements within a single student’s test performance, the mere mention of ‘some
grammatical errors of the most basic type’ in the corresponding ‘band 2’ descriptor
sometimes seemed too lenient to be attributed to that performance 30 . Hence the ‘band 1’
description of ‘little control’ appeared more appropriate in such cases, especially as the
more controlled test tasks (such as the S-S question-answer one) essentially targeted only
‘few grammatical forms’. In that sense, the assessment was firmly guided by a focus on
achievement factors in this section of the test.
However, one must not forget that the remaining sections of the speaking test also
needed to be taken into consideration before it was legitimate to reach a final decision
about the generally displayed level of grammatical and lexical proficiency. In other
words, it is essential for the assessor not to let a “bad” performance in one test section
30 See appendix 3 for the entire marking grid.
taint his impression of the student’s efforts in other parts of the test as well; if one is
excessively focused on achievement of discrete objectives, the “bigger picture” may be
forgotten in the process. At the same time, it is understandable that teachers may expect a
fairly high degree of grammatical accuracy in regard to particular grammar points which
extended periods of time have been devoted to in regular class hours. This also explains why I tended to clamp down on grammatical mistakes (in relation to such pre-taught items) rather severely in my assessment strategy in this instance. As a more
constructive corollary, however, the recurrence of similar problems across different
student performances allowed me to identify discrete elements that still commonly posed
problems to the learners and thus especially needed to be further reinforced (and would therefore be re-integrated into objectives to be achieved) in classroom activities in the future.
On the other hand, various criteria and descriptors used in this marking grid also
proved to be rather problematic during the actual implementation phase. Hence one
element which arguably further explains the students’ relatively low scores in ‘Grammar
and Vocabulary’ can be seen in the actual phrasing of a particular ‘band 1’ descriptor in
this category: namely, the stipulation that an insufficient performance would be
characterised by ‘speech’ that ‘was often presented in isolated or memorised phrases’. In
itself, this judgment might be appropriate for longer turns where students have to keep
talking in a fairly spontaneous and uninterrupted fashion for an extended period of time
(in the case of A2 learners, of course, this might only mean one or two minutes); if a
student fails to utter more than a few disconnected parts of sentences in such an instance,
then the performance can indeed be deemed ‘insufficient’. However, as the majority of
the test tasks were designed in such a way that the students could usually respond by
means of single sentences, the inclusion of such a descriptor in the lowest defined band
became more controversial. Across large parts of the test, the use of ‘isolated’ and even
partly ‘memorised’ phrases could in fact be judged to basically fulfil task requirements.
Similarly, the choice to mention ‘minor hesitations’ as a feature of ‘basic’ rather than
‘target standard’ could (rightly) be considered to be fairly harsh; would it not be normal
even for remarkably proficient A2 learners to still be allowed to briefly hesitate in
spontaneous speech acts? The first practical experimentations with this marking grid thus
certainly revealed a number of discrete formulations that needed to be reconsidered in the
next assessment of speaking performances.
In a more general way, the decision to blend two such widely diverse elements as
‘Grammar and Vocabulary’ into a single category also turned out, in retrospect, to be
questionable in a number of ways. One particularly detrimental effect was that some
students who clearly demonstrated they had assimilated a suitable range of topical lexis
could still fail to reach the highest band in this accuracy-based section of the grid because
of their remaining problems in grammar (and vice-versa). Of course, one may argue that
extreme discrepancies are very unlikely to occur between the grammatical and lexical
competence levels of a single learner; hence, high proficiency in grammar is rarely
accompanied by very severe deficiencies in vocabulary-related competences. A complete
dissimilarity marking the qualitative performances in grammar and vocabulary may thus
be a crass exception, and a fairly parallel competence development might normally be
expected to take place in relation to those two language components. However, while
such considerations might help to explain the decision to combine them into a single
group of criteria in a marking grid, the very general way in which both elements
respectively had to be described in this instance did not necessarily do sufficient justice
to the importance and intricacies of either of them.
The clearer separation of ‘grammar’ and ‘vocabulary’ foci thus already seemed to
impose itself for subsequent assessments of speaking performance purely on the basis of
the respective complexities inherent to both components which could not be analysed in a
sufficiently thorough and targeted way on this occasion. In addition, one might question
the exactly equal weighting that had been attributed to the two elements of ‘Grammar and
Vocabulary’ and ‘Pronunciation’ in this assessment. Even though the latter category
included global, fluency-centred descriptors (e.g. ‘speech is mostly fluent’) in addition to
the precise analysis of phonological control (e.g. ‘L1 interference still to be expected’),
the place that it occupied in the marking scheme in comparison to grammar and
vocabulary ultimately seemed a bit exaggerated. Without denying that adequate
pronunciation is (as discussed in 3.1. above) an important feature of any successful
speech act, one may wonder whether it truly deserves to be equally valued as the
combination of such core linguistic components as grammar and vocabulary. Hence, the
inclusion of pronunciation-related criteria into this marking grid may have been a
perfectly legitimate decision; nevertheless, some fine-tuning in terms of sensible
weighting seemed to be necessary in subsequent speaking tests.
Finally, in respect to the ultimately awarded total scores, the noticeable presence of
“half” (.5) marks highlights an issue presented by the actual structure of the grid used in
this instance: namely, the absence of midway bands between the three explicitly defined
ones. Even though this had initially been (as described above) a deliberate choice, its
effects repeatedly proved problematic in practice. At various times during the assessment
procedure, I noticed that full marks in relation to one of the four main criteria could not
always be justified in a very straightforward manner. The simple reason behind this was
that a particular student performance often did not simultaneously correspond to all
descriptors that were defined and grouped in relation to a particular criterion in one single
band. Hence, one student’s speech might have been ‘occasionally strenuous’ (‘Interactive
Communication’, band 2) yet the same student could also have shown an ability to ‘react
quite spontaneously’ and ‘ask for clarifications’ (‘Interactive Communication’, band 3).
Similar examples of “mixed” performance occurred in relation to each of the other three
major criteria sets as well, underlining the fact that the quality of a given speech act does
not always handily fit into a single band description (evidently, the more different
descriptors and criteria one assigns to a specific band in a single category, the likelier
such an outcome becomes). As my marking sheets had not sufficiently anticipated such a
need for flexibility, I needed to improvise during the assessment by putting crosses on the
line separating two different bands and correspondingly awarding half marks if no clear
tendency to either the higher or lower band could be reasonably discerned. For future
assessments, however, the provision of two extra, midway bands seemed necessary, even
if they would evidently not require specific sets of descriptors (but cater for such ‘mixed’
performances instead).
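One simple way of recording such ‘mixed’ performances (assuming nothing beyond the improvised procedure just described) is to treat the midpoint between two adjacent bands as a legitimate per-category score; the short sketch below, in Python and for illustration only, merely makes that convention explicit:

# Sketch of the improvised half-mark convention for "mixed" performances:
# a cross placed on the line between two adjacent bands is recorded as their
# midpoint (e.g. 2.5 between 'basic standard' and 'target standard').

def category_score(band, mixed_with_next=False):
    """Return the per-category score: the band itself, or the midpoint if the
    performance shows features of this band and the next higher one with no
    clear tendency either way."""
    return band + 0.5 if mixed_with_next else float(band)

# e.g. 'Interactive Communication': speech 'occasionally strenuous' (band 2),
# yet the student can 'react quite spontaneously' and 'ask for clarifications'
# (band 3) -> recorded as 2.5
print(category_score(2, mixed_with_next=True))  # -> 2.5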
All in all, the implementation and assessment of this first speaking test thus yielded
various useful realisations. On the one hand, it clearly showed that even at such an early
stage as the first term of 9e, my students could deal with tests that veritably called upon
their speaking skills, provided that the corresponding tasks had been suitably prepared in
class and adapted to the A2 language level. On the other hand, the assessment tools and
strategies used in this instance could still be more closely tailored to the students’ real
needs and capacities. As a result, this would constitute one of my main aims in the second
summative assessment that targeted my students’ speaking skills in the subsequent term.
3.3. Case study 2: comparing places and lifestyles / asking for and giving directions
For this second speaking test in my 9TE class, I once again chose the end of term as the
most suitable point in time for implementation as this would give my students the chance
to build up further lexical and grammatical resources in some depth over the course of
almost three entire months. As the strategy I pursued in the build-up to the test mirrored
the one from the first term (i.e. constructing knowledge of topic-based vocabulary and
fundamental grammatical structures, practising pair work activities and task types to be
included in similar form in the test), I will avoid a more detailed description of general
classroom activities in this section of the chapter. Instead, I will focus exclusively on the
design and outcomes of the specific test tasks and modified assessment strategies used.
3.3.1. Speaking test 2: description of test items and tasks
Although I once again wanted the speaking test to start with a confidence-building
activity, excessive repetition of (or overlap with) elements from the first speaking test –
and thus undue predictability – needed to be carefully avoided this time around.
Therefore, while the opening activity in this second test still essentially revolved around
T-S interaction, it was set up in significantly different ways from the very direct question-answer scheme used at the beginning of the previous term’s test. Although the students
once again took the test in pairs, the starting task was an individual one. After the
candidates had picked a sealed envelope from a pack and handed it to me, I opened it and
gave one of them (student A) a handout with two contrasting visual prompts 31 which
served a number of purposes. First of all, the pictures had been carefully selected so that
they showed places, objects and (to a lesser extent) jobs or activities the students would
be able to describe in some detail with the lexical knowledge they had built up in relation
to various thematic areas over the course of term (or even the year). Since topics such as
different places, cultures and holidays abroad had all been covered in class, a
correspondingly substantial number of test sheets included pictures of different types of
locations (e.g. city and countryside, mountains and beaches…). As variations (so that
students taking the test at later stages would not be able to prepare their answers in
advance), pictures relating to other treated topics such as different jobs and lifestyles (e.g.
rock stars or ultra-rich people versus “normal” people with “regular” jobs) or various
free-time activities (e.g. outdoor sports versus video gaming) were also included.
31 See appendix 6 for samples.
In a first stage, student A had to describe one set of pictures in as much detail as
possible. This part of the activity evidently had a clear achievement aim: students were
supposed to demonstrate that they had acquired appropriate vocabulary range and
accuracy in relation to topics encountered in numerous classroom activities. At the same
time, the visual prompts functioned as an important support for the learners in their
efforts; rather than being engaged in completely free discussions from the outset, students
could gradually “warm up” by simply describing thoroughly familiar depicted items that
should pose no particularly high cognitive challenge to them (evidently this was only
possible if they had suitably paid attention in class and done adequate revision – two preconditions which any summative test may legitimately presuppose as well!). As this
first long turn aimed to verify the student’s ability to deal with the task independently, my
role as teacher would be to provide as few prompting questions or hints as possible; in
that sense, a few hesitations on the part of the learner were absolutely acceptable in this
instance. On the other hand, if the students got stuck immediately and risked not being
able to achieve this fundamental task at all, I helped them with a few guiding questions to
“nudge” them into the right direction. However, the more prompting I needed to offer, the
more negative the impact would have to be on correspondingly awarded marks in regard
to core features like fluency and global achievement.
Once the student had finished describing both pictures in isolation, the activity
further expanded the identified topic in a second stage. As the two visual prompts
represented an inherent contrast, I now asked the students to compare them with each
other and name the most striking differences. In that way, further achievement foci were
now added; this time, however, they were not just of a predominantly lexical nature.
Instead, as the ability to compare various people, places and objects had constituted a core
learning objective over the course of the term, this part of the speaking test aimed to
activate a precise aspect of the students’ grammatical competence as well (by verifying
their ability to use comparative and superlative structures in context). Occasional
prompting was once again necessary at times, for example if the students failed to
identify adequate contrasts which they could elaborate on. In such instances, the
indication of a specific feature in the pictures would often suffice to guide the students
into the right direction and trigger further verbal contributions.
Finally, the activity then opened up into a freer discussion: the student was asked to
state a preference in relation to the two depicted elements (for example choose the place
where he would rather spend a holiday or live permanently) and to justify this choice in a
few words or sentences. My decision to include such a section of fairly free discussion
was primarily guided by the consideration that this least controlled part of the activity
provided an indicator of general proficiency rather than specific achievements. Indeed, it
was less predictable what particular students would mention in this part of the task;
instead, they would have to display their general level of fluency to spontaneously
elaborate on the choice they had made. Once student A had completed this set of
individual activities, another handout was given to his or her partner (student B) and the
same procedure was repeated. Once again, though, I had meticulously tried to avoid
overly similar pictures on both handouts; thus the second student would have to use a
different set of topic-related expressions and could not just recycle or even copy his
peer’s utterances.
In its entirety, this initial set of activities also allowed me to get important
indications in regard to wider aspects of the student’s general attitude to speaking. Thus,
the length and level of detail of each candidate’s speech would not only illustrate overall
fluency, but also in many cases provide hints about the individual’s actual willingness to
engage in speaking and to correspondingly take a number of necessary “risks”. Hence,
some students were reluctant to give more than monosyllabic descriptions and answers,
while others clearly sought to elaborate in depth on the various pictures and questions
instead, even if this sometimes meant that they had to improvise in terms of lexis and
experiment with more complex structures they might not have previously encountered.
While the first category of students reduced the possibility of making language mistakes
in their utterances by keeping their answers very short (ostensibly limiting negative
impact on overall accuracy), they evidently risked providing incomplete answers and
contributions in relation to the topic and test expectations. In contrast, the potentially
higher number of mistakes in the more extended speeches of higher risk-takers could
easily be compensated for by the greater credit earned in relation to content-related criteria.
Considering that a faultless performance is not expected at A2 level anyway, it was
certainly the second type of student behaviour and performance that the test ideally
targeted and that would offer the most logical and likely path towards high overall marks.
If the longer individual turns of the first half of the test had been mostly production-oriented, the second half of the speaking test was once again designed to trigger S-S
interaction (and thus truly exploit the fact that students were taking the test with a
partner). In this particular case, a short role play activity was used to that effect: in their
pairs, the students had to show their respective abilities of asking for and giving
directions (an activity which the students were familiar with as it had been practised via
similar role plays in class prior to the test). As a visual support, both students were
therefore given a handout that contained the same fictional map and brief written
instructions (detailing their respective roles and tasks) which they were allowed to read
through quietly 32 . In the context of the depicted city, student A played the role of
“tourist” while his partner had to pretend to be a local inhabitant. At the start of the
activity, student A had to tell student B where his starting position was, and then ask his
partner for directions to a precise location from there (in this duly rehearsed type of
question, the student was also supposed to show evidence of his sociolinguistic
competence by choosing an appropriately formal register for this polite inquiry). Once
student B had given a corresponding first set of instructions to the “tourist”, the entire
operation was repeated once again (with the same roles but a different destination). Then
the students were asked to turn over their test sheets; on the second page, they found a
different city map which the pair of candidates had to use to repeat the exercise (twice)
with reversed roles. By the end of the activity, both students would thus have had to show
their ability to assume both roles in such a semi-authentic situation.
The decision to make each student give directions to two different locations in their
respective cities had been made for two interlinked reasons: on the one hand, this
extended the length and complexity of the produced language sample, so that a more solid
foundation for assessment was available. In the process, the second set of directions given
by the student could either confirm errors that he had already made in the first one (by
committing them again), or reveal them to have been exceptional ‘slips’ if he got the
same type of information right in the second attempt. In turn, this would increase the
reliability of the performance sample as a basis for proficiency assessment, since it would
allow for a more detailed and accurate error analysis. As a whole, the exercise also
offered the opportunity to verify the students’ handling of interactive strategies such as
asking for clarification (although they occasionally had to be reminded to use the target
language for this).
In this test task, I had used a variety of maps to compile the student handouts. This
was based on the idea that student B could then not copy his peer’s instructions too
closely; neither would early test takers be able to give any precise pointers to later
candidates. While some maps were used more than once (for instance by asking for different
32 See appendix 7 for samples.
itineraries in different test sets), this might be considered a slight risk in terms of test
reliability: given that not all students had to give directions in relation to the same map (or
exactly the same itinerary on one map), the absolute, intrinsic comparability of their
results might have suffered to a certain extent. However, as I had invariably taken and
adapted all the maps from various ‘pre-intermediate’ coursebooks, their level of difficulty
could still be considered to be reasonably similar. Moreover, the itineraries that I had
deliberately incorporated into the instructions generally aimed at a similar amount of
necessary directions to reach the respective destinations. At the same time, of course,
different itineraries were often possible, offering a certain amount of choice to the test
candidate; in turn, this made the test task less rigidly controlled and allowed for somewhat more varied and spontaneous oral performances.
In general, one might argue that the items and tasks used in this second speaking
test offered more frequent opportunities for genuinely free oral production when
compared to the numerous closed questions and rather strictly guided S-S interaction
(where precise cues had to be respected) in the first test. In that sense, one may
reasonably argue that both content and construct validity were further raised in view of
verifying overall oral proficiency in this instance, since the students had to fill in more
information on their own and were free to expand on their answers to a higher extent. In
other words, the various speaking tasks transcended the fairly narrow lexical-grammatical
scope of the exercises that had still characterised parts of speaking test 1. Both in the free
discussion at the end of the first set of activities and in the various possible details the
students chose for giving directions in part two, they could thus use their oral production
skills in a fairly free and authentic manner that may not have been possible to the same
extent in the first speaking test. On the other hand, of course, this progression was only
natural in the sense that students had had another entire term to further develop their
general proficiency level in the target language. Hence the more the school year
progressed, the more possibilities I would evidently get to “open up” the possible answer
scope in individual test tasks as a result of my students’ steadily increasing amount of
available language resources.
3.3.2. Form, strategy and theoretical implications of assessment used
As a number of issues had surfaced with respect to the marking grid used during the
practical implementation of the first speaking test (see 3.2.6. above), various
modifications had to be applied to my assessment strategy in this second attempt. Thus,
my students’ competences in relation to the two vast concepts of vocabulary and grammar
were to be separately appreciated this time in order to get a more nuanced view of their
progress in terms of both fluency and accuracy. The arguably disproportionate weight
initially attributed to pronunciation factors in the overall assessment needed to be
addressed as well, and the appropriateness of individual band descriptors had to be
verified even more closely in terms of their suitability for A2-level learners.
In my search for a differently organised assessment scheme, I found a particularly
useful tool in a marking grid specifically developed (and published in the national
syllabus for 8e and 9e TE 33 ) for such a purpose by a group of language teachers who had
been officially charged with the mission of designing and adapting competence-based and
CEFR-aligned ways of assessment for English courses in the Luxembourg school
system 34 . A look at the general layout of this grid 35 reveals the useful organisation of
different speaking criteria into the following four categories:
1. ‘Content / task response’: while this section roughly corresponded to the ‘Global
Achievement’ criteria used in the previous marking grid, it did not overtly include
considerations about “attitude” or “risk-taking”. However, such factors could still be
derived from the meaning-related descriptors and thus be indirectly incorporated into
the assessment. Hence, the ‘band 3’ (or ‘basic standard’) descriptor stipulated that
‘basic meaning [was] conveyed’ but ‘some effort’ was necessary ‘on behalf of the
listener’. In contrast, the more proficient ‘band 5’ descriptor attested that
‘communication [was] handled’ and ‘meaning conveyed successfully’ (while no
‘effort on behalf of the listener’ was implied). It is certainly impossible to
communicate more than ‘basic meaning’ if one only provides monosyllabic or
otherwise minimalistic answers. Even if a willingness to expand on individual points
33 See the relevant syllabus for 9TE on http://programmes.myschool.lu/ for details.
34 The group in question is the ‘groupe de travail - Socles de compétences pour l’anglais’; one of their stated aims (according to their workspace on www.myschool.lu) consists in ‘establishing CEFR-based standards and benchmark levels for English to be reached by secondary students in Luxemburg as well as developing the corresponding descriptors’. While I joined that taskforce during the period of time when this thesis was written, the marking grids for speaking and writing described here had already been developed prior to that; I had thus played no part in establishing them.
35 See appendix 8 or p.18 of the 9TE syllabus (http://programmes.myschool.lu/) for the complete grid.
in more depth does not automatically guarantee that more than ‘basic meaning’ is
‘successfully conveyed’ (as sufficient accuracy might for example still prove a
problem), it is nevertheless a prerequisite in terms of attitude to make the achievement
of a higher band possible in the first place.
2. ‘Pronunciation and discourse management’: although pronunciation still appeared
in the title and (rightly) in the content of this group of criteria, a closer look at the
various descriptors quickly revealed that actual phonological control and intonation
skills constituted only one of various components here. In comparison to the marking
grid used in the previous term, ‘discourse management’ elements taken into
consideration in this instance added more focus on the connected nature and overall
fluency of the speaking performance as a whole. In that sense elements such as
coherence and cohesion, largely absent from the ‘pronunciation’ section in the first
marking grid, were taken into more consideration this time.
3. ‘Lexis’: reflecting the sought separation from grammatical foci, this section was
firmly focused on vocabulary range and accuracy, while an interesting addition
consisted in the explicit and consistent reference to paraphrasing skills across all
defined bands. For language learners, the ability (and deliberate strategy) to look for
alternatives if a precise expression eludes them is most certainly an important skill to
develop in an aim to keep any act of communication from breaking down completely.
For that reason, the inclusion of ‘paraphrase’ in this section was certainly a sensible
one. Of course, this criterion may not always appear in all student performances;
however, if attempts to paraphrase are made, this can further reveal the students’
willingness to take risks and become creative to work around temporary setbacks.
4. ‘Grammatical structures’: as a result of the separate focus on grammar, more
nuanced and varied criteria were included in this section in comparison to the broader
‘Grammar and Vocabulary’ category in the previous marking grid I had used. Thus
range and accuracy were more strictly separated into individual descriptors, while
elements such as use of ‘basic forms’ (band 3) or ‘both simple and complex forms’
(band 5) specified in a little more detail the qualitative differences that one could
expect within the various levels of proficiency defined. Interestingly, accuracy was
also partly expressed through a quantitative appreciation of grammatical errors
(‘frequent’ in band 3 but only ‘occasional’ in band 5).
In terms of general organisation, this marking grid also fulfilled another requirement that
had emerged from my first assessment of speaking skills in the previous term:
intermediate bands had been inserted between the three defined ones of ‘insufficient’,
‘basic standard’ and ‘target standard’. These five ‘gradations of quality’ catered for the
type of “mixed” performances that had repeatedly surfaced in regard to a specific group
of criteria in the first speaking test. Thus, ‘band 2’ was defined as containing ‘features
from band 1 and 3’ while ‘band 4’ took a similar midway position between bands 3 and 5.
A close alignment to the CEFR is visible throughout this grid. Thus, the ‘discourse
management’ descriptor in band 3 states that a basically suitable learner performance in a
given test is characterised by ‘very short basic utterances, which are juxtaposed rather
than connected’. This expectation is clearly within the A2 level of the CEFR, according to
which one should expect that the student ‘can give a simple description ... as a short
series of simple phrases and sentences linked into a list’ in ‘overall oral production’
(CEFR, p.58). However, it should be noted that in order to obtain the highest possible
mark in relation to a specific criterion, the necessary performance in a summative test
needs to match ‘band 5’. In this area, the marking grid in question certainly respects the
CEFR indication that even the best possible performance (in free production) is not
devoid of remaining shortcomings. In the ‘lexis’ section of the grid, for instance, the
‘band 5’ descriptor simply attests an ‘adequate’ vocabulary range ‘for the task’. This does
not imply very complex and extensively developed lexical knowledge overall, but simply
indicates that the learner has shown that he can handle appropriate topical lexis in
‘everyday’ contexts that are suitable for A2 learners (such as descriptions of people or
places); evidently it also presupposes that the test task itself has been designed to firmly
stick to such ‘everyday’ contexts and needs. Anything that goes beyond such familiar
contexts causes problems, which is duly noted in ‘band 5’ in the sense that ‘attempt[s] to
vary expressions’ occur ‘with inaccuracy’ and ‘paraphrase’ is only used ‘with mixed
success’. Indeed, this mirrors the vocabulary-related CEFR descriptors which postulate
that an A2 learner ‘has sufficient vocabulary to conduct routine, everyday transactions
involving familiar situations and topics’ (‘Vocabulary Range’, CEFR p.112) and ‘can
control a narrow repertoire dealing with concrete everyday needs’ (‘Vocabulary
Control’, p.112).
Yet there are also some cases where the expectations in ‘band 5’ of this grid are
slightly higher than what the CEFR defines as ‘A2-typical’. The most striking example of
this occurs in the ‘Grammatical Accuracy’ section: the best possible performance is here
described as showing a ‘good degree of control of simple grammatical forms and
sentences’, which seems to clash with the CEFR descriptor for ‘grammatical accuracy’ at
A2: ‘Uses some simple structures correctly, but still systematically makes basic mistakes’
(p.114). Indeed, this general expectation is more compatible with the ‘band 3’ grammar
descriptors in the marking grid (‘limited control of basic forms and sentences / only a
limited range of structures is used’). Does this, then, take us back to unfairly high
expectations (in regard to the attainability of maximum marks) that students working
within A2 should not be expected to achieve? Not necessarily. In fact, this is one case where the non-school-based origin of the CEFR is strikingly highlighted. As already
suggested in the previous chapter, the CEFR does not specify how knowledge and skills
are developed over time; for that reason, it also fails to take into account how a reinforced
focus on elements such as grammar and vocabulary in a school environment can raise the
expectations in relation to student performance in those areas. Evidently, this holds
particularly true for a summative test that is concerned with achievement assessment to a
significant extent; it is normal that in relation to pre-taught objectives, ‘systematic…basic
mistakes’ should ideally be kept to a minimum. In that sense, they should certainly not
appear in the best possible type of performance in such a test; for a ‘sufficient’
performance (i.e. ‘basic standard’), however, they will still be appropriate as long as the
overall communicative message still clearly comes across.
3.3.3. Analysis of test outcomes
Following my intention to gradually increase the weight of speaking skills in the overall term mark 36, I chose to allot a total of 20 marks to this second summative oral test. In doing so, I was able to use the simple multiplying factor of 1 again to arrive at a final mark from the applied grid (which included 4 categories of criteria and 5 qualitative bands and thus allowed for a maximum total of 4 x 5 = 20 marks). As
this marking system included the two intermediate bands 2 and 4, this also had the
beneficial consequence that I did not have to use any “half” marks this time.
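To make the underlying arithmetic explicit, the following short sketch (in Python, purely for illustration; it is not part of the official marking scheme, and the sample band profile is an assumption loosely modelled on the marginal “pass” case discussed later in this section) shows how four band scores are converted into a final mark out of 20:

# Illustrative sketch: with a multiplying factor of 1, the final mark is
# simply the sum of the four band scores (each between 1 and 5),
# giving a maximum of 4 x 5 = 20 marks and no "half" marks.
CRITERIA = [
    "content",
    "pronunciation and discourse management",
    "lexis",
    "grammatical structures",
]

def final_mark(bands, factor=1):
    """bands: dict mapping each criterion to a band score from 1 to 5."""
    assert set(bands) == set(CRITERIA), "one band score per criterion"
    assert all(1 <= b <= 5 for b in bands.values()), "bands range from 1 to 5"
    return factor * sum(bands.values())

# Assumed profile resembling the marginal "pass" discussed below:
# band 2 in 'lexis' and in 'pronunciation and discourse management',
# band 3 ('basic standard') in the two remaining categories.
example = {
    "content": 3,
    "pronunciation and discourse management": 2,
    "lexis": 2,
    "grammatical structures": 3,
}
print(final_mark(example))  # -> 10, i.e. a marginal pass of 10/20

With a factor of 1 the conversion is a simple sum; a larger factor would only become necessary if the test were meant to carry more weight in the overall term mark.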
Overall, the results of this speaking test showed some encouraging tendencies. For
instance, the average mark of the class was 14.67/20, which corresponded to a
significantly higher average (73.35% of the maximum obtainable) than the students had
generally achieved in the first speaking test (62.5%). This was partly due to the fact that
only one insufficient mark (7/20) and two marginal “pass” grades (10/20) had been
36
See section 3.2.6.
Chapter 3
91
awarded this time 37 ; in contrast, six very high scores had been achieved (including four
perfect scores as well as one 19/20 and one 18/20). However, in view of the vast number of variables involved, it would of course be a mistake to
read too much into these numbers: evidently, different grammatical and lexical items
were tested on the two occasions, which makes it virtually impossible to compare the
actual degree of difficulty of both tests; during the time between the two test occasions,
two fairly “weak” students had left the class and one “stronger” student (with a semi-native-speaker background) had joined; different types of test tasks and corresponding
marking grids were used in these two cases; a whole range of student-related reliability
factors (e.g. less test anxiety in the second instance, or even the simple fact of having
“good” or “bad” days) could have played a part; and so forth. In purely statistical terms,
then, the progress expressed by these two average marks should not be rashly
overestimated.
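Purely as an illustration of the percentages quoted above (the maximum of 12 marks for the first speaking test is inferred from individual scores such as 11.5/12 mentioned elsewhere in this chapter), the comparison can be reproduced as follows:

# Illustrative check of the quoted averages as percentages of the maximum.
def percent_of_max(average, maximum):
    return round(100 * average / maximum, 2)

print(percent_of_max(14.67, 20))  # -> 73.35 (second speaking test)
# An average of 62.5% on a test marked out of 12 corresponds to 7.5/12:
print(0.625 * 12)                 # -> 7.5 (first speaking test, inferred)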
However, one conclusion that one may certainly draw from these results is that the
students in my ‘pre-intermediate’ class had encountered no major problems in dealing
with the various speaking tasks they were confronted with in this test. The qualitative
expectations implied in the A2-based marking grid had thus been reached by almost all of
them; in fact, the majority of students operated in between the ‘basic’ and ‘target
standards’ at this point in time (which was evidently a desirable development, as it was a
general aim to push as many students as possible towards the ‘target standard’ by the time the A2 learning cycle was completed at the end of the subsequent term). Moreover, several
parallels existed between the results in this test and the previous one. For instance, the
two individuals who had scored the highest marks on the first occasion were also among
the best performers in this test (accounting for two of the perfect scores); similarly, the
two students who marginally passed the second test had already achieved 6/12 and 5.5/12
in the previous one. Apart from a very limited number of exceptions, the same students
who had scored comparably “high” marks in the first instance had done so again in this
second test; similar tendencies could be established for those with “average” and “low”
marks, respectively 38 . As established in chapter 1, such empirically indicated consistency
37 It should be added that the insufficient mark was mainly due to a very negative attitude on the part of the corresponding student, who did not take the test seriously at all and did not even try to deal with the content requirements in an adequate manner. In essence, that learner thus did not give a true account of his real abilities.
38 The students’ results in both tests can be compared by means of the graphs in appendices 5 + 10.
of results contributes to confirming the general reliability of both tests (in terms of
effectively separating “good” and “poor” performances).
In view of these overall results, it is also particularly worth mentioning that the
highest-scoring students could actually be divided into two categories. On the one hand,
there were those who usually did fairly well in all other types of tests as well (focusing on
the other three major skills); as generally “strong(er)” learners, their good oral
performances were thus perhaps fairly unsurprising. On the other hand, two students who
had obtained high marks in both speaking tests (one of whom was the student who had
achieved 11.5/12 in the first one) had in fact received fairly average marks in previous,
predominantly written tests. In one case, this was largely based on weaknesses in
grammar and orthographic control; in the other, the student had mainly had problems dealing with written test tasks in the allotted amount of time and had thus repeatedly forfeited a
lot of marks through unfinished work. The fact that both these students excelled in
speaking activities very clearly underlines the general necessity to adapt our approach to
testing much more closely to the undeniable variety of learner types in our classrooms, so as to create a
more level playing field for all of them.
The results of the class in relation to the four main groups of criteria in the marking
grid showed a remarkable consistency. In three of them, the average mark achieved was
exactly the same (3.48/5); the very marginally “weakest” feature in the students’
performances emerged in relation to pronunciation and discourse management (where the
average mark achieved was a slightly lower 3.38/5). In ‘content’ and, remarkably, ‘grammatical structures’, only one insufficient mark (2/5) was awarded in each case; three insufficient scores were reached in each of the two remaining categories (2/5 in five cases overall, plus a single 1/5 in ‘pronunciation and discourse management’).
In comparison to the previous speaking test, the students had thus seemingly
managed to improve their performances in terms of grammatical and lexical accuracy.
Again, however, great caution is advised in the interpretation of such results. Thus
virtually the only explicit grammatical achievement focus in this second test was put on
the correct use of comparative forms in the students’ first turn (predominantly based on
free production), while the (S-S) interactive question-answer task in speaking test 1 had
involved multiple grammar challenges (such as the correct use of auxiliaries, tenses and
word order). In fact, the latter task had generally been more mechanical and guided in
nature, thus almost automatically inviting the assessor to put the sufficient mastery of
grammatical structures under increased scrutiny.
At the same time, however, the rather punishing ‘band 1’ descriptors of lexical
competence from the first marking grid had been revised this time. Instead of mentioning
that ‘speech’ was ‘often presented in isolated or memorised phrases’ as characteristic for
an insufficient performance, the grid used in this instance stipulated that ‘only isolated
words or memorised utterances [were] produced’ at ‘band 1’. While this may seem like a mere semantic nuance on the surface, it actually turns out to be a description that only applies to much more limited linguistic samples in practice. As a result, it was certainly less
likely this time that a very low band could be awarded in regard to this particular criterion
since most students had tried to develop their oral contributions at least to the extent of
producing short sentences. As a whole, this example once more stresses the importance of choosing sensible and meticulously appropriate terminology (in regard to the
test tasks and the overall proficiency level) in every single descriptor in a chosen marking
grid.
The separation of ‘grammar and vocabulary’ criteria in this assessment of speaking
performance provided some similarly useful insights. After the test, a clear tendency was
noticeable when comparing the scores that individual candidates had obtained in relation
to the four main criteria: in all cases, the four marks that had been separately awarded
were either absolutely identical or very close to each other (i.e. one mark higher or lower
than in other criteria). In fact, each student’s overall performance corresponded to a
combination of at most two (adjacent) bands across the four main criteria 39 .
On the one hand, this seems to confirm the abovementioned suspicion that most student
performances will not present major discrepancies as far as grammatical and lexical
competences are concerned (for instance). On the other hand, however, the usefulness of
separating both criteria and judging each one individually is powerfully illustrated by the
example of one student who ultimately received a marginal “pass” mark. This particular
student’s performance was marked by several long pauses, particularly in the individual
long turn at the beginning: as this learner struggled to find appropriate vocabulary items
to describe the two visual prompts (and, as she failed to paraphrase instead, needed a lot
of prompting on my part), she ultimately received ‘band 2’ scores in the ‘lexis’ and
‘discourse management’ categories. However, throughout the test (and especially in the
‘giving directions’ task), she displayed a sufficient level of competence in terms of using
grammatical structures: in that respect, her speaking performance thus reached the ‘basic
39 See assessment grid for student 14 in appendix 9.
standard’ (i.e. 3 marks), and this essentially swayed her final score towards the marginally
sufficient mark she ultimately received. Had grammatical and lexical accuracy still been
judged together, an ‘insufficient’ band could easily have been attributed in this case as the
negative impression in regard to ‘vocabulary’ could have tainted the overall assessment in
regard to the entire group of criteria.
This example also underlines the key necessity to avoid going with ‘gut-feeling’,
holistic impressions (which teachers are, unfortunately, all too often inclined to do when
confronted with such problematic speaking performances). Instead of prematurely
extrapolating the clearly perceptible problems in one particular area to the entire effort
made by the student, it is important to consider to what extent remaining components of
the speaking performance might actually reach the basic standards that are to be
reasonably expected at a given level. This is not to say that one should be excessively
lenient in one’s general approach to assessment; far from it. A student will certainly not always be able to compensate for weaknesses in one area through sufficient strengths in
others, and in many cases, an insufficient performance in ‘lexis’ may in fact very well be
accompanied by a correspondingly poor handling of grammatical structures (for
instance). However, especially in cases where the decision over ‘pass’ or ‘fail’ is a very
close and difficult call, meticulous consultation of the individual criteria in a marking grid
will, as in this case, help to lead to a much more informed and justified decision than a
possibly “prejudiced” holistic impression. Evidently, this can only be the case if the
marking grid itself constitutes an appropriate tool for a valid and reliable assessment of
the student’s level of ability in relation to reasonable standards. Not least through the
abovementioned, general alignment of the ‘basic standards’ with the realistic performance
criteria defined for the A2 level in the CEFR, that prerequisite was effectively respected
in this instance.
Yet even with such useful tools at one’s disposal, tricky situations can (and most
probably will) still occasionally arise over the course of the assessment process. In this
particular speaking test, this was strikingly exemplified by a noticeably recurring mishap:
in the second part of the test, a number of students consistently seemed to confuse the
essential expressions ‘left’ and ‘right’ in their attempts to give directions. Should the
assessor interpret this as evidence of limited control of topic-relevant lexis and thus
correspondingly lower the mark awarded in relation to this criterion? At first, such a
decision certainly seems reasonable. However, on closer inspection, this confusion of very basic
lexical items was in fact often accompanied by other, correct details (such as appropriate
lexis and prepositions of position used to describe the buildings near the expected
destination). Clearly, then, the student had identified the correct itinerary on his map, but
failed to express this in a completely adequate way by including decisive accuracy-based mistakes in his speech. Yet was this really the result of an error in the student’s
interlanguage? After all, this was rather unlikely given the very reasonable assumption
that most students in 9TE should be able to distinguish between ‘left’ and ‘right’! In fact,
the error-inducing factor on these occasions was very probably not the students’ lexical
competence, but rather pointed to an issue with their ability to read a map correctly –
clearly a competence that was not language-related in the first place. For the assessor, this
presents a certain dilemma: is it possible to award full marks in terms of ‘content’ or
‘lexis’ if the actual speaking performance led us to a wrong destination? On the other
hand, is it justifiable to deduct marks in this language-focused assessment based on a
competence that is not, in fact, linguistic in nature?
In this instance, I chose to give my students the “benefit of the doubt” and thus
refrained from clamping down on this clear mistake as long as all other indications the student had
given were in fact accurate. At the same time, however, this certainly underlines how
unexpected difficulties can arise in designing adequate competence-based and
contextualised test tasks at times. Furthermore, it also shows that even if all the criteria in
the marking grid correspondingly used for assessment have been painstakingly and
reasonably adapted to the linguistic requirements of the test, they may still not always
make it possible to anticipate all the controversial aspects of speaking performance that
can ultimately arise. In such cases, the assessor’s analysis of potential (linguistic and non-linguistic) error sources remains the most essential factor in avoiding an unfair impact on the final mark that is eventually awarded.
In this chapter, the theoretical implications and empirical data of two implemented
speaking tests have shown that the competence-based assessment of students’ speaking
skills is certainly feasible even at ‘pre-intermediate’ level. At the same time, however, the
analyses of the corresponding two case studies have revealed numerous complex factors
and prerequisites that one needs to take into account during the respective design,
implementation and assessment phases of such tests to ensure that their outcomes are not
only reasonably valid and reliable, but also sensibly and constructively interpreted.
Before more general conclusions are drawn about the best possible place and form for
competence-based assessment in our national English curriculum, I will now turn to a
similar study of possible ways to design tests and conduct assessments in relation to the
productive skill of writing.
Chapter 4: Competence-based ways of assessing writing at A2 level
Whereas completely new room currently needs to be made for the systematic testing and assessment of speaking skills at lower levels, a strong insistence on writing has traditionally constituted a central feature of the assessment culture in our national school system. In that sense, a different type of innovation is necessary in relation to this particular productive skill: in contrast to speaking, it is not whether it is tested that needs to be addressed, but rather how this can (and should) best be done in a competence-oriented teaching and assessment system. After all, as outlined in chapter 1, exercises that
merely require the insertion of a single word or even the construction of isolated
sentences cannot be regarded as valid evidence of true writing skills. Key modifications
are certainly needed to make this area of summative testing more compatible with the
logic and requirements of a fundamentally competence-based approach. To highlight the
essential factors that make such a shift actually desirable, the starting point of this chapter
will be a close analysis of a more “traditional” type of written test task and the
correspondingly used assessment methods in a specific practical example. In a subsequent
stage, I will then trace the steps which are necessary to effect a change towards possible
strategies of testing and assessment that truly tap into writing competences even at A2
level.
4.1. Reasons for change: a practical example
4.1.1. Description of the implemented test task
Over the course of a school term, timing can (and usually does) have a very significant
impact on possible test content. This was no different in the case of the very first
summative test that I implemented in my ‘pre-intermediate’ class1 during the early stages
of the 2009-10 school year. The fact that this test had to be implemented only a few
weeks into the first term effectively limited my options in test design in a number of
1 The class referred to here (a 9TE at the Lycée Technique Michel Lucius) was the same one that was described in more detail in section 3.2.1. in the previous chapter of this thesis.
ways. After all, not much time had been available to extensively practise writing in the
classroom and thus familiarise the students with more skills-oriented task types than the
(rather heavily knowledge-based) ones they had been used to from their experiences in
the previous school year. Moreover, their target language resources were evidently still
rather limited at that early point in time as well. Nevertheless, I still wanted to include a
writing task that gave the students a chance to produce a longer and more personalised
language sample; this would provide me with evidence of their ability to express
themselves in writing in a wider and more communicative context than was possible in
the “English in use” section of the same test. Naturally, to ensure sufficient validity and a
feasible degree of difficulty of such a test task, it needed to build upon the same thematic
areas and grammatical structures which had constituted the central foci of most classroom
activities. In this case, the realm of everyday, routine actions and habits, as well as the
correspondingly necessary grammatical structures to describe them (i.e. the present
simple tense and adverbs of frequency), constituted that fundamental core.
Correspondingly, the free writing activity that I included at the end of this first summative
test had the following instructions:
Describe your ‘holiday routine’. What do you always (or often) do when you’re
on holiday? Write about 60 words. (8 marks)
In themselves, these instructions certainly represent a prime example of a ‘classic’ task
for a ‘pre-intermediate’ summative test that essentially revolves around a central
achievement focus: in this case, the students’ ability to incorporate and use the present
simple tense in their own, personalised written productions. Very similar activities are
invariably bound to have been dealt with in class prior to such a test; for instance, the
students will almost inevitably have been asked to describe their everyday ‘term-time’
routines in class (or in a homework assignment) by writing down a number of sentences
about their daily habits (either in a simple list or even in the form of a connected
paragraph). In fact, in this particular instance, comparable activities had additionally
involved a short group work game: in small groups, the students had been asked to
imagine the daily routine of a famous person of their choice. Based on adequate hints and
tips in those descriptions (i.e. suitably symptomatic and singular actions inherent to a
given person’s exceptional lifestyle), their classmates then had to guess which person
each group had respectively described.
The corresponding test task thus built on prior classroom activities and simply gave
a slightly different context for the necessary description of daily habits. As such, a certain
amount of content and face validity was thereby ensured; after all, the students really would have
to write meaningful sentences (thus producing the actual behaviour the task intended –
and pretended – to test) in relation to a topic that they had encountered in similar form
before (which was evidently important for a summative test). However, a close look at the
instructions shows that specifications about the exact qualitative requirements which a
suitable answer would have to fulfil were rather scarce. In fact, apart from a vaguely
defined topic (holiday routines) and an indicated word limit (leaving room for slight but
unspecified deviations either side of the targeted number of 60), no further details were
given. As a result, numerous questions were left unanswered, particularly in terms of
suitable structure of the text: was a list of isolated sentences appropriate, or did the text
have to be connected, for example through linking words? How complex and precise did
individual sentences have to be? What was the minimal number of different routine
activities that had to be mentioned, since no target had been given in terms of sentences to
use or various ideas to include? Hence, what exactly did the students have to respect to
ensure they could achieve maximum marks in this exercise?
4.1.2. “Traditional” method of assessment used
The absence of such clearly indicated (or inferable) specifications ultimately meant that the corresponding assessment of individual student productions also stood on
fairly shaky ground. After all, what could be the decisive and justifiable factors to
determine a ‘good’ or ‘bad’ mark in view of such vaguely defined content requirements?
Indeed, a student who had simply enumerated a sufficient number of very simple holiday
activities could, in essence, be adjudged to have adequately fulfilled all content-based
task requirements (provided that he had neither surpassed nor failed to reach the word
limit). Given the limited range of structures used in such a basic and simple
performance, the student would evidently have reduced the risk of committing numerous
mistakes at the same time. However, what about those students who had attempted to
connect their various ideas into a flowing, coherent text and perhaps even to use more
precise and varied expressions in the process? The instinctive answer of many assessors
will surely be that the better structure (and possibly more complex language) of such
answers, and the correspondingly better overall impression they produce on the reader,
should logically translate into a higher content-related mark; in fact, in a fundamentally
norm-referenced assessment scheme, this is usually precisely what happens. This can be
exemplified by looking at three actual student productions that these particular task
instructions yielded in practice 2 :
Student A:
On holiday I get up at 11 a.m., then I go to the bathroom and have a shower. I brush my
hair. Then I go to the kitchen and eat something. I go to the computer and speek with my
friends. About 2 o’clock I go out with friends, we go on the beach and we have a good
time. (60 words)
Student B:
I never get up before 8 a.m. in my holiday. I wake up at 9 a.m. and I stay for a few minutes
in my bed. Then I wash myself in the bathroom and I get dressed. I have breakfast at 10
a.m. After my breakfast, I go out with my dog or I play football with my brother. After my
lunch, I like going to swim or playing playstation3. I eat with my parents in a restaurant in
the evening. At twelve o’clock I go in my bed. (90 words)
Student C:
I wake up before the restaurant closed. I get dressed and I go eat. I take a juice and
pancakes with chocalat. Than I go in my room and wash my teeth and I take my bicini.
After I go to the piscine and stay there. In the evening I take my bath and get dressed for
the supper. Than I go with my parents to the restaurant. We eat fish,... . (65 words)
In strictly content-based terms, one could argue that all three students basically fulfilled
the topical requirements by describing things that they usually did on holiday (even if in
case of production B, it is unclear whether the student was describing an experience at
home or abroad). In terms of coherence, the general structure of their answers was also
fairly similar – all three descriptions were more or less chronologically sequenced to take
the reader through a whole, typical day (even though student A ended her description at
an earlier point in time than the other two pupils). However, the cohesive devices used to
link these actions had been used with varying degrees of skill: whereas students A and B
managed to indicate the order of actions rather clearly (using expressions such as ‘then’,
‘after’ and other precise time indications with relative ease), student C tended towards a
more rudimentary enumeration of separate actions (while also misspelling the time word
‘then’).
The different productions were rounded off in diverse ways as well: while student B
logically concluded his text by explicitly referring to the end of the day, student A left her
description more open-ended (but still providing closure to a certain extent through the
2 See appendix 11 for copies of the original student productions.
general indication that she and her friends had ‘a good time’); in contrast, student C’s
very short and rather random final sentence (‘We eat fish’) would definitely not provide a
suitable conclusion to any flowing text. As a consequence, the general impression made
by the overall structure of productions A and B was certainly a more positive one than in the final example. In a holistic, norm-referenced assessment, the assessor
might thus understandably tend toward higher ‘content’ marks for the first two
productions. However, the theoretical justification of such a decision might prove
problematic. Indeed, the instructions had not actually asked for the specific, structured
description of an entire day but only for routine actions in a more general manner: would
it be fair, then, to actually subtract marks for an awkward ending if no coherently
structured format had been defined and required in the first place?
Naturally, one might still argue that the statement ‘we eat fish’ did not fulfil task
requirements in another respect: in the absence of adverbs of frequency, and considering
the student’s generally discernible tendency to struggle with grammar, it is not absolutely
clear whether she really intended to point to an actual holiday habit (or just to a general
fact). Since student C had actually failed to explicitly refer to the ‘holiday’ context in her
entire answer, a maximum mark for ‘content’ would most probably have been
exaggerated in this instance. On the other hand, as she had still mostly given information
that suited the topic, her answer was much more relevant than some other texts that had
been produced in the same test (for instance, one student had failed to read the
instructions carefully enough and had therefore simply repeated the type of content that
had appeared in the group work activity in class, i.e. she had described a normal day in
the life of a celebrity – a clearly unacceptable disregard of instructions). In that sense,
answer C broadly deserved a sufficient ‘content’ mark; even the comparably less cohesive
structure of this particular production could not legitimately be drawn upon to justify a
lower mark in comparison to those of students A and B, as this had not been clearly stated
(or implied) in the instructions as a criterion to be respected.
In regard to purely language-related factors, it is clear that student C’s answer was
most affected by problems of accuracy as well; misspelled words (‘than’, ‘chocalat’,
‘bicini’…), wrong vocabulary items (‘piscine’) and occasional grammar mistakes (e.g.
‘closed’ instead of ‘closes’) were more frequent in this production than in those of
students A and B. Correspondingly, answer C unsurprisingly received the lowest score in
this area when compared to the largely accurate performances of her peers. The differences
were of a more nuanced nature in relation to the productions of students A and B; neither
one of them had demonstrated major problems with accuracy in their respective texts. In
purely quantitative terms, student A’s production was characterised by the fewest ‘slips’;
however, student B had gone into more extensive detail, which had evidently increased
the chance of actually committing mistakes (on the other hand, of course, he had not
respected the word count and could thus lose another mark in relation to that criterion).
While these occasional grammatical or lexical mishaps did not impede successful
communication in either of those two cases, their mere presence was often enough to lead
to a slight deduction of marks (i.e. 3 or 3.5 would be awarded instead of 4).
The abovementioned distinction of two broad criteria reflects a rather traditional
and widely used assessment method for free writing tasks, which consists in allotting part
of the mark (normally 50%) to content and the other part of it (the remaining 50%) to a
holistic appreciation of language features such as grammatical and orthographic accuracy
as well as vocabulary range and control. In both areas, the highest possible mark would
often only be attributed to a virtually faultless performance. As this was also the approach
that I adopted in this particular case, no student eventually reached maximum marks;
instead, virtually all pupils scored between 3/8 and 7/8 overall.
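For illustration only, the arithmetic of this traditional 50/50 scheme can be sketched as follows (the sample marks are invented for the example and do not correspond to any particular student production discussed above):

# Illustrative sketch of the traditional 50/50 scheme for an 8-mark task:
# half the marks for 'content', half for a holistic 'language' impression.
def traditional_writing_mark(content, language, maximum=8):
    """content and language are each marked out of half the task maximum."""
    half = maximum / 2
    assert 0 <= content <= half and 0 <= language <= half
    return content + language

# Assumed example: full content marks, but occasional 'slips' in language
# lead to a small holistic deduction (3.5 instead of 4).
print(traditional_writing_mark(4, 3.5))  # -> 7.5 out of 8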
However, the limits of this assessment method quickly became evident: absolutely
clear marking criteria had not been laid down either in terms of ‘content’ or ‘form’.
Similarly, the task had not indicated specific content points to be absolutely included or a
precise format and structure to respect. Therefore, qualitative distinctions that were made
between individual productions were difficult to translate into marks according to a
consistent and rigorously defined system; norm-referencing (i.e. the comparison of one
student’s effort to those of others) often had to bridge that gap instead. In a similar vein,
the ‘form’ (or ‘language’) assessment was not backed up by clearly defined criteria that
took into account realistic expectations for an A2-level performance; instead, holistic
overall impressions ultimately often turned out to be decisive.
On the other hand, the students’ various productions had shown some interesting
tendencies: even though the instructions had not explicitly asked the learners to present
their answers as a connected text, virtually all of them automatically linked their
individual ideas within a cohesive overall structure. While this was done (as discussed
above) with varying degrees of complexity, it clearly indicated that the corresponding
assessment tool needed to become better suited to taking such efforts (and their respective
success) into account. In turn, the actual task instructions also had to become more
specific, not only to make the marking system more efficient and reliable, but also to
provide more detailed guidelines to the students as to what was expected of them. To find
practical ways of truly getting my students involved in more purposeful free writing, it
was thus first of all necessary to identify the central features that were required for an
appropriate design of suitable, competence-based test tasks.
4.2. Key features of a ‘good’ free writing task
As Harmer vitally reminds us, ‘the writing of sentences to practise a grammar point
may be very useful for a number of reasons, but such exercises are not writing skill
activities’ 3 . To a certain extent, then, asking for a few sentences about routine activities
did not represent a truly valid writing task. In fact, if the students had simply followed the
instructions handed to them in the aforementioned first test “to the letter”, the entire task
could easily have resembled an extended grammatical/lexical exercise rather than one that
aimed at communicative production, even if it had effectively left the students a certain
amount of freedom to express their own ideas in a slightly less controlled way. In that
sense, it did not fully respect one of the most salient features that such “free writing”
activities should be founded on:
language production means that students should use all and any language at their
disposal to achieve a communicative purpose rather than be restricted to
specific practice points. 4
In the precise context of a summative test, of course, one might slightly revise this
statement: since such a test is fundamentally concerned with verifying achievement of
‘specific practice points’, it is only natural that the latter should be targeted as
cornerstones of a corresponding writing task as well (after all, you can only “test what
you teach”). Hence if the vast majority of classroom time has been spent on contexts and
thematic areas that require the present simple tense, the main writing task in the ensuing
summative test would not have a high degree of validity (and reliability) if it suddenly
asked the learners to describe a past experience or to make predictions for the future.
However, the central expression in Harmer’s statement is evidently that of ‘communicative purpose’: only if the students see a clear aim which their written production should achieve, and thus understand why they are writing, is the task meaningful
and goal-oriented. Learners will indeed approach a task in very different ways depending
on what their written message is for: hence, they can for example be asked to describe,
3 Harmer, op.cit., p.249. Italics added.
4 Ibid., p.249. Emphasis added.
analyse, complain or persuade 5 ; in each case, the purpose of their writing will inevitably
have a big and varying impact on the style and overall content of their answers. To
lend a greater sense of purpose to a given writing task, it evidently helps if the
corresponding sense of authenticity of the communicative act is augmented: if the task
mirrors a context that is relevant in “real life”, then the students are more likely to
recognise its usefulness and thus they might be more willing to develop – or, in a test,
activate – the necessary skills and knowledge to master it to a satisfactory degree.
Evidently most written productions at school will still almost inevitably constitute
examples of what Brown calls ‘display writing’: the students will, to a certain extent,
always remain aware that their writing does not have an actual effect on the “real world”
(as, for instance, a letter to an external person or company would do), but usually stays
confined to classroom (and often assessment) purposes instead. In that sense, it is
certainly true that ‘writing is primarily for the display of a student’s knowledge’ 6 and
skills in a school context. Nevertheless, this realisation should not keep us from trying to
get as close as possible to authentic situations, contexts and text types: hence, while
Harmer agrees that ‘many…writing tasks [at school] do not have an audience other than
the teacher’, he also rightly insists that this ‘does not stop us and them working as if they
did’ 7 . In a similar vein, Brown stresses that even ‘display’ writing
can still be authentic in that the purposes for writing are clear to the students, the
audience is specified overtly, and there is at least some intent to convey
meaning. 8
This points to a further significant element that needs to be kept in mind when
creating meaningful writing tasks: the implied audience of the written production.
Indeed, very diverse requirements and characteristics will affect a student’s
communicative message if its intended receiver is (for example) a close friend or family
member or, in contrast, the piece of writing is addressed to a teacher, a newspaper or even
a potential employer. Closely linked to this notion is evidently a necessary awareness of
different genres, which in turn results in a suitable choice of style for the written
production 9 . This selection is evidently dependent on the format of the written
production that is required: for instance, informal letters, emails or postcards all come
5 This distinction of various possible communicative purposes in writing tasks is based in part on Clarke, ‘Creating writing tasks from the CEFR’, p.2.
6 Brown, Teaching by Principles, pp.395-396.
7 Harmer, op.cit., p.259. Italics added.
8 Brown, Teaching by Principles, p.403.
9 Harmer, op.cit., p.259.
with different conventions than their formal counterparts; similarly, a story needs to be
written in a style that significantly varies from the one used in a report or review. This
also takes us back to the importance of sociolinguistic competence alluded to in the
previous chapter: within a particular context for writing, the students should learn to
respect appropriate conventions and thus, for instance, operate within a suitable register
as well. In all of these aspects, of course, it is highly important that a sensible selection is
made by the teacher with his students’ level of proficiency in mind; in the case of ‘pre-intermediate’ learners, for instance, it is clear that fairly short, informal text types (such as
emails or letters to friends) are more suitable than complex argumentative essays.
A similar consideration applies to a final element of key importance to a
convincing writing task: the topic of the expected production. It goes without saying that
the choice of subject matter needs to fit the learners’ language level if the writing task is
to be valid and reliable. In A2 classes, for instance, an exaggerated degree of complexity
must of course be avoided; immediate communicative needs are thus much more likely to
characterise an appropriate choice of topic at that level. In this respect, the CEFR once
again proves a particularly useful tool, as it implies suitable thematic areas (and the
corresponding communicative acts) that learners can and should be expected to deal with
at the various levels of proficiency. In the case of ‘A2 creative writing’, for instance, the
Framework stipulates that the learner ‘can write about everyday aspects of his/her
environment, e.g. people, places, a job or study experience…’ as well as come up with
‘short, basic descriptions of events, past activities and personal experiences’ (CEFR,
p.60).
Within such generally defined fields, however, one should not underestimate the
importance of identifying topics that coincide with the learners’ interests. As Harmer
points out:
[i]f students are not interested in the topics we are asking them to write…about,
they are unlikely to invest their language production with the same amount of
effort as they would if they were excited by the subject matter. 10
In a class of twenty (or more) adolescent individuals, this is of course not always an easy
task. Even if teachers try to adapt their course and test contents to their teenage students
as far as possible, ‘there is no magical way of ensuring that [the pupils] will be engaged
with the topics we offer them’; furthermore, the learners’ inherently different
10 Ibid., p.252. In addition to this intrinsic motivation, one might of course point out that the marks awarded for good performances in summative tests would evidently also provide further (extrinsic) motivation to ‘invest [one’s] language production with [a high] amount of effort.’
personalities oblige teachers ‘to vary the topics [they] offer them so that [they] cater for
the variety of interests within the class’ 11 . Syllabus requirements will have to be taken
into consideration as well, since they will generally indicate a range of thematic areas to
be covered over the course of the school year. Nevertheless, trying to present the
respectively treated subject matter in interesting, creative and engaging ways (and
conferring a communicative purpose to them) is certainly important not only to engage
the learners in classroom activities in general, but also to generate the best possible
productions in free writing tasks.
In view of these multiple and simultaneous requirements, it quickly becomes clear
that simple instructions such as ‘describe your holiday routine’ fail to take into account a
number of important factors that are necessary for a truly meaningful writing task. Hence,
the only criterion that these instructions actually fulfilled was a vague indication of topic;
however, neither the audience nor the format had been specified, so no pointers were given about the expected answer style. Fortunately, it is not necessarily very difficult
to transform such a basic set of instructions into a meaningful, more authentic and
suitably contextualised task. In this case, for instance, a simple yet very useful variation
would be the following formulation:
You are on holiday in a different country. Write a postcard to a friend (or to
your family) at home. Tell him/them what you (usually) do there every day.
In that case, the topic would virtually stay the same; one might even deliberately split up
the required content by asking for typical morning, afternoon and evening activities
(although this would of course restrict the student’s freedom to personalise his answer). A
clear communicative purpose would have been added: writing to describe different daily
habits and to inform friends or family about them. The presence of this specified
audience would further confirm the informal style to adopt, whereas the format
that the text should take would also have been explicitly stated; it could even be
reinforced through the visual representation (on the test sheet) of an empty, “authentic”
postcard to write on. In terms of necessary lexical and grammatical resources, this
variation of the task would not significantly raise the level of difficulty, either; however,
the students would have to provide some evidence of their sociolinguistic competence by
expressing their message in a way that would suit the typical style of a postcard.
Following similar guidelines in the design of writing tasks in subsequent summative
tests was thus essential to ensure that more meaningful and contextualised written
11 Ibid., p.253.
samples could be produced by my students. At the same time, however, this also meant
that the assessment scheme would have to be altered accordingly, so that a sufficiently
nuanced appreciation of the students’ writing skills would be possible.
4.3. Central features of interest to the assessment of writing
All the main features which writing and speaking performances generally share (see
section 3.1.1.) evidently need to play a fundamental role in the specific assessment of
written productions as well. However, as briefly noted in the previous chapter, structural
elements such as coherence and cohesion usually take on a bigger role in the
appreciation of students’ writing efforts than it is the case in the assessment of their
speaking performances. Thus, a higher degree of coherence (i.e. the logical sequencing of
ideas) is generally expected in writing tasks due to the higher amount of time that is
available to the students for the thoughtful planning of their written productions (in
comparison to the often more spontaneous communicative behaviour in speaking tasks).
Cohesion, on the other hand, is ‘a more technical matter’ that concerns ‘the various
linguistic ways of connecting ideas across phrases or sentences’ 12 : the students have to
show an appropriate use of discrete linking words (such as ‘and’, ‘because’, ‘so’, but also
time words like ‘then’ or ‘before’) to ensure not only a logical but also a “mechanical”
link between the various parts of their answers. As a result, this criterion rather blurs the
line between ‘content’ and language-related (in this case lexical and syntactical) factors.
In addition, writers are more often expected ‘to remove redundancy’ and ‘to create
syntactic and lexical variety’ than speakers 13 . Therefore, even students at lower levels are
usually required to avoid repeating the exact same expressions to some extent (for
instance, to indicate the order of successive actions in time, learners should vary their
linking words as far as they can; instead of simply starting each new sentence with ‘then’,
they are thus expected to interject alternatives like ‘before’ and ‘after that’).
This increased importance of both coherence and cohesion in writing is inherently
due to the ‘distant’ nature of this medium in comparison to the more immediate context in
which speaking takes place. In oral communication, an interlocutor may immediately ask
for clarification if he has trouble following the logic of the speaker’s line of argument;
when reading a written production (in the absence of its author), this is impossible.
12 Ibid., p.246.
13 Brown, Teaching by Principles, p.398.
Making one’s intended meaning clear not only through appropriate vocabulary range and
control but also through accurate use of structural devices is therefore an important skill
for language learners to develop especially in writing, where their final product has to
speak for itself.
The general structure of a text can be further underlined through the appropriate use
of features such as punctuation and paragraphing. In this respect, it is interesting to
note that the CEFR only includes both features in ‘orthographic control’ from level
‘B1’ onwards 14 ; before that, at A2, only the ‘reasonable phonetic accuracy’ of words is
described (CEFR, p.118). This represents a clear example of the type of ‘inconsistency’ in
the descriptor scales which Alderson et al. rightly criticised (see chapter 2). Indeed, if we
want our students to produce meaningful, contextualised and “authentic” language
samples, such features can certainly not be ignored completely. After all, using basic
punctuation marks to indicate different sentence types (such as questions or exclamations)
is surely not beyond the reach of A2 learners; similarly, some paragraph breaks might be
reasonable to expect in test tasks such as letter writing, especially if the learners have
repeatedly been confronted with a particular format (and have practised using it
themselves) during regular classroom activities.
At the same time, one must not forget the potentially useful effects of plurilingual
competences: since English is not the first language that pupils learn in our school
system, they will already have grown accustomed to punctuation and paragraphing
conventions in other languages (at least to a certain extent). On the one hand, this might
of course occasionally lead to interferences between the different languages they have
learnt (especially, for instance, in regard to language-specific elements such as quotation
marks) – in error analysis, this interlingual influence is an important factor to keep in
mind. On the other hand, it also means that students who start learning English in the
Luxembourg school system (as their third or even fourth foreign language) do not
generally do so without any knowledge of how a written text may be logically and
usefully structured (particularly given the heavy focus on writing activities in their other
language courses). Certainly, it once again seems commendable to make use of those pre-existing abilities instead of pretending that they were simply non-existent.
Finally, a significant alteration seemed sensible in comparison to some writing tasks
that appear in A2-level proficiency tests such as KET: the length of the samples that the
14 At level B1, ‘spelling, punctuation and layout are accurate enough to be followed most of the time’. (CEFR, p.118)
students were supposed to write. The imposed word limit in the most extensive KET
writing activity is, in fact, set at a mere 25-35 words 15 . However, if a student’s sample is
to reveal most (or all) of the aforementioned features of writing, and the task has been
defined so as to impose a ‘true’ communicative purpose, then this very limited word
count offers very little opportunity for a free and fluent expression of ideas. Moreover, as
the very first summative test had already shown, the students in my ‘pre-intermediate’
class had demonstrated a clear ability to meet much higher requirements. Therefore, it
seemed more appropriate to offer the students a wider possible answer scope (on which, in turn, a detailed assessment of general proficiency could also be more safely founded).
As a result, the length of written productions which I usually aimed for in free writing
tasks in summative tests was often in the region of 80-100 words instead (for
tasks that were worth about 8 to 10 marks).
4.4. Defining an appropriate assessment scheme for writing tasks
Due to the multitude of factors to take into account when assessing a ‘true’ free
writing performance, it was clear from the outset that a marking grid would again be the
most appropriate tool to use after suitable test tasks had been designed and implemented.
The corresponding criterion-referenced approach not only allowed me to focus on a
variety of salient features of the students’ overall writing skills, but also to get away from
the possible caveats of excessive norm-referencing: instead of merely comparing
students’ performances with each other, each individual production could thus be gauged
against equal and unvarying standards that were founded on realistic targets for ‘pre-intermediate’ language learners. To ensure a general and valid alignment to the CEFR A2
level in this process, I once again referred to an official marking grid that had been
published for assessments of ‘writing’ in the 9TE syllabus and that constituted the direct
counterpart to the ‘speaking’ grid referred to in the previous chapter 16 . The syllabus itself
states that ‘it seems logical to develop separate marking grids for writing and for
speaking’ 17 ; given the divergences between both skills (and the correspondingly produced
language samples) identified above, this certainly seems a sensible decision. However,
before moving on to the direct application of this tool in precise practical examples, it is
15 See Key English Test for Schools – Handbook for Teachers, p.11 / p.19.
16 See appendix 12 or p.21 of the syllabus for 9TE (July 2010 version) published on http://programmes.myschool.lu/
17 Syllabus for 9TE (July 2010 version) on http://programmes.myschool.lu/, p.14.
necessary to specify and analyse what differences actually existed between both grids,
and in what ways these distinctions could be theoretically justified.
In general, both ‘speaking’ and ‘writing’ grids usefully respected a common layout
that reflected and underlined the numerous common features of the two productive skills
while simultaneously ensuring a certain degree of consistency among the applied
assessment systems. As a result, three of the four main criteria from the ‘speaking’ grid
reappeared in the one for ‘writing’: namely, the general categories for ‘content’, ‘lexis’
and ‘grammatical structures’. For obvious reasons, however, the fourth main set of
criteria from the oral assessment grid, ‘pronunciation and discourse management’, was
not applicable to written productions; instead, its place was taken by ‘coherence and
cohesion’ in the marking grid for writing. Although these two interlinked features had
already been integrated into the ‘discourse management’ category of the ‘speaking’ grid,
the choice to single them out as a main set of criteria in this case (instead of merely
including them as subordinated components) handily reflected their increased importance
in writing performances alluded to above. The overall five-band system (including the
two useful intermediate bands 2 and 4) had also been retained in the ‘writing’ grid, further
increasing the consistency in the marking systems for the two productive skills.
A close analysis of individual criteria also revealed the following interesting
features:
1. The ‘content’ criteria included both qualitative and quantitative factors. The former
were evidently necessary to analyse the student’s ability to ‘clearly… communicate
the message to the reader’ (band 5). However, an interesting nuance in comparison to
the ‘speaking’ grid resided in the more precisely defined quantitative dimension of
‘content’ criteria. Thus, an ‘insufficient’ performance would be characterised by the
fact that ‘only about 1/3 of the content elements’ had been ‘dealt with’ in the student’s
written production; two thirds would correspond to ‘basic standard’, whereas the
‘target standard’ (and thus maximal marks) could evidently just be reached if all
‘content elements’ had ‘in general’ been ‘addressed successfully’. At first sight, this
quantitative gradation evidently seems rather arbitrary: indeed, while it is self-evident
that 33% of the required answer content cannot possibly warrant a ‘sufficient’ mark,
why set the expectations for the middle band at 66% of the necessary content rather
than (for example) 50%?
In fact, a generally implied message of these descriptors is of course that a
‘basic standard’ performance should still contain clear evidence of the student’s
general understanding of (and ability to deal with) the task requirements. Even if the
student has not addressed all the content points, the majority of his answer should still
unmistakably carry sufficient relevance to the topic. As a result, even if the ‘message
is only partly communicated to the reader and/or not all content elements’ have been
‘dealt with successfully’, the bulk of the given information still tends to be
appropriate. If answering only half of the content points were set as the expected ‘basic’
requirement, it would be much more difficult for the assessor to decide which half of
such a “hit and miss” performance should be seen as truly representative of the
student’s ability (or lack thereof) to deal with the task. Therefore, it is certainly
appropriate to set the bar a bit higher in this instance to make sure that fundamentally
ambiguous assessments are avoided.
Another important inclusion in this area of the marking grid was the reference to
the student’s ‘awareness of format’; indeed, if writing tasks firmly aimed to be as
‘authentic’ and contextualised as possible, the content of the students’ answers could
not possibly be entirely adequate if they did not observe the respective conventions of
the required text type. To a certain extent, this criterion could thus for example be
drawn upon to verify sociolinguistic factors such as appropriate register in the written
production.
2. As mentioned above, the two elements of coherence and cohesion play a particularly
important role in writing because of the ‘physical…and temporal distance’ 18 between
the student’s composition process and the assessor’s act of reading and interpretation:
the learner does not get a second chance to clarify his line of argument after handing
in his written production. Hence, if all the relevant information is present in a
student’s answer yet the assessor has to make a considerable effort to “connect the
dots” himself, the achieved content mark may still be fairly high; on the other hand,
the student will have shown a lack of ability to ‘structure his discourse’ in a
satisfactory manner and thus receive a lower mark in terms of ‘coherence and
cohesion’. Therefore, it certainly makes sense to separate this set of criteria from the
abovementioned ‘content’ category.
A particularly useful feature of the marking grid for ‘writing’ additionally
consisted in the explicit specification of expressions and structures to look for in a
given student production. Thus, the ‘band 5’ descriptor indicated that the best possible
18 Brown, Teaching by Principles, p.364.
performance needed to include the ‘use of simple linking devices such as ‘and’, ‘or’,
‘so’, ‘but’ and ‘because’’, whereas the ‘basic standard’ corresponded to ‘short simple
sentences which [were] simply listed rather than connected’. Both descriptions were
undoubtedly detailed enough to be directly used in the analysis of a particular
performance and thus – if applied with sufficient rigour and consistency – certainly
helped to increase the reliability of the assessment as a whole.
3. In comparison to the ‘speaking’ grid, the ‘lexis’ section for writing evidently needed
to include an added focus on orthographic control. In line with the previously
mentioned A2 requirements of ‘reasonable phonetic accuracy’, the ‘band 3’
descriptors called for only ‘limited control of spelling’. In contrast, a written production
could contain no more than ‘few minor errors’ if the ‘target standard’ was to be reached. As
the latter definition is more in line with the B1 level (which stipulates that ‘spelling
[is] accurate enough to be followed most of the time’), this once again underlines that
CEFR descriptors sometimes tend to be fairly lenient in terms of accuracy. In a school
context, especially in summative tests that build on pre-taught and thoroughly
practised vocabulary items, it is clear that slightly higher target standards occasionally
need to be set (at least in relation to these familiar expressions) than what the
originally defined descriptors for the A2 level call for.
4. The same remark could also be applied to the ‘grammatical structures’ section
in this marking grid, since its descriptors were very similar to the ones
used for speaking assessments. However, the descriptors in this grid also importantly
drew attention to the fact that ‘faulty punctuation’ could ‘cause difficulty in terms of
communication’ (band 3); as mentioned above, this was an important addition in
regard to the CEFR scale of ‘orthographic control’, which does not mention this
element at A2 level at all.
Both in regard to ‘lexis’ and ‘grammatical structures’, this marking grid also importantly
pointed out that the presence of ‘few minor errors’ could be tolerated as long as they did
‘not reduce communication’. This clearly echoes the general spirit of the CEFR, which
fundamentally approaches grammar not as an end in itself, but as a means to a wider,
communicative end.
Of course, it should be noted that notions such as ‘few’, ‘minor’ (and, in the same
vein, ‘basic’ in the marking grid for speaking) may all still mean different things to
different people; thus, even if various teachers use the same tool, it does not mean that
they will also use it in the same way 19 . The lack of definitions that Alderson et al.
criticised in relation to the terminology in the CEFR descriptors 20 can thus just as easily
affect these marking grids as long as it has not been made explicit what exact language
structures, functions and items are implied by such fairly vague expressions. Hence, some
teachers might be forgiving enough to consider the occasional omission of a ‘third person
–s’ ending in the present simple as a ‘minor’ mistake that happened in the “heat of the
action”, while others may immediately see this as evidence of ‘basic errors’ in the
student’s production and correspondingly veer towards a lower band in the ‘grammatical
structures’ criterion than their colleagues. Similarly, quantitative factors such as ‘few
errors’ could mean merely two or three wrong forms to one assessor, but five or six to
another. 21
In a fundamental way, however, the descriptors in this marking grid usefully
underlined possible reasons which could lead to ‘few minor errors’ of accuracy even in
‘target standard’ (i.e. band 5) performances: namely, the two factors of ‘inattention’ and
‘risk taking’ in a student’s written production. This is particularly relevant in the context
of free writing. Hence, even if a particular writing task in a summative test is
understandably based on the student’s general application of ‘specific practice points’ in a
wider context, one must be aware that the degree of accuracy is bound to suffer more in
such exercises than if these ‘practice points’ were focused on in isolation. Compared to
discrete item tasks, the term ‘inattention’ must thus be put into a different context. In an
exercise where the student only has to fill in isolated grammatical items and can thus
focus his entire attention on using accurate forms, it is clear that ‘inattention’ is fairly
inexcusable and will have a legitimately negative impact on the awarded mark.
In free writing, however, there are multiple and complex demands and difficulties
that the student must simultaneously deal with in the composing process. As seen above,
fulfilling the ‘communicative purpose’ of such a task requires the activation of ‘all and
any language at [his] disposal’ (implying an occasional need for ‘risk taking’ in the
process). For the teacher, this means that the learner’s overall proficiency comes into play
and needs to be taken into account; the assessment focus can no longer solely remain on
19 In my own experience, this was strikingly exemplified when I attended a teaching seminar about the implementation of marking grids (David Horner, ‘The logic behind marking grids’, held in Luxembourg in March 2010): for example, the question of whether the ‘past simple tense’ could be considered as a ‘simple’ structure led to very different interpretations among the attending experienced teachers.
20 See section 2.2.2.
21 Possible ways of limiting such ‘inter-rater’ discrepancies are discussed in more detail in chapter 5 (see section 5.1.2.).
pure achievement factors. For the student, paying attention to grammatical or
orthographic accuracy in relation to ‘specific practice points’ becomes a much bigger
challenge while he tries to juggle numerous additional elements and processes, such as
developing ideas that are relevant to the topic, structuring them in an adequate way and
looking for appropriate lexis to express them in the first place 22 . As a result, occasional
‘inattention’ in one (or several) of these domains is virtually unavoidable. This is
particularly so if factors such as time pressure and stress affect the student’s performance
(which is very likely in summative classroom tests, given the fact that free writing
constitutes only one of the various tasks that the learners have to deal with in a limited
amount of time). Placed in such a context, occasional grammatical mistakes clearly
become something to be expected; for that reason, expectations of virtual faultlessness
can certainly no longer be accepted as realistic requirements for maximum marks in form-related criteria.
Once the necessary adaptations to task design and the corresponding assessment
system had thus been identified, my summative tests could be systematically modified so
as to offer a better insight into my students’ writing competences. In the following
section, I will now describe and analyse a range of different tasks that I implemented and
assessed in various summative tests in my ‘pre-intermediate’ class with that aim in mind.
22 This complex interplay of multiple cognitive demands can easily lead to what Astolfi calls a ‘surcharge cognitive’ (i.e. cognitive overload). See also Jean-Pierre Astolfi, L’erreur, un outil pour enseigner, ESF (Paris: 1997), p.86.
4.5. Case study: using a marking grid to assess written productions in a summative test
4.5.1. Description of test tasks
The two test tasks which will be focused on in this section were implemented in the
first summative test of the second term. Throughout the weeks leading up to the test,
numerous classroom activities had focused on future events and plans to make the
students more accustomed to (and confident in their use of) the ‘will-future’ as well as the
‘going to-future’ 23 . The two topics of horoscopes and upcoming summer activities were
recurrently used to contextualise the use of both future forms; since the eventual
summative test needed to build on the relevant topical lexis which had correspondingly
been treated in the classroom, both of these elements reappeared in two “free writing”
tasks (from which the students could choose one) 24 . In both cases, a clear communicative
purpose was to be pursued: in the ‘horoscope’ productions, the students had to establish a
number of predictions in the form of a magazine article; the ‘summer activities’ task, on
the other hand, targeted a text resembling a publicity ‘brochure’ for a ‘summer camp’
(where the students – as ‘camp managers’ – had to tell potential customers what they
would be able to do and see there). As a result, the correspondingly adopted writing styles
would ultimately have to differ in each case; however, as these two types of texts had of
course been encountered and practised in class prior to the summative test, the students
were generally familiar with both of them.
In terms of language, both of these alternatives were linked through their basic
reliance on a common ‘practice point’: predictions about (as well as plans for) the future.
Though they were of course not actual grammar activities, it seemed logical to verify to
what extent the students would be able to apply the encountered future forms in a wider,
more communicative context as well – in that sense, achievement indications in relation
to this common objective would be given in written productions dealing with either task.
However, due to the different topical lexis areas that the two tasks targeted, no such
overlaps could be expected in terms of vocabulary items that the students would use in
23 Such activities included, for example, a group work task in which the students had to compile a ‘horoscope’ for the year 2010 in their teacher’s life; suitable, typical features of this type of text were then elaborated on the basis of these creative productions. As for summer plans, they were not only orally discussed in the plenary group, but also included in a creative group work activity where students had to organise a stay in a different country (with a range of correspondingly available activities) as a ‘competition prize’ (this activity was based on the starting pages of Unit 4 in Tom Hutchinson, Lifelines Pre-Intermediate Student’s Book, OUP (Oxford: 1997), pp.32-33).
24 See appendix 13.
their respective written productions. In turn, this highlights that the writing performances
as a whole would also allow me to draw more global inferences about the actual
proficiency levels the students had reached up to that point.
To a certain extent, one might therefore argue that the overall reliability of the test
results risked being intrinsically affected by the fact that such a choice was offered to the
students. As not all the students would necessarily deal with exactly the same test tasks,
the direct comparability of their results would indeed be partially compromised (as there
was no “scientific” way of making sure that the two proposed writing tasks were of
precisely the same difficulty level). However, the decision to offer such a choice had of
course not been made lightly: in fact, it constituted an attempt to cater for varying student
interests and thus to generate higher productivity by giving each student the choice of
addressing the subject matter that he or she found more exciting. In the process, this
autonomous decision would also confer more responsibility to the learners: in essence, it
was up to them to pick a topic which they knew (or thought) they would be able to answer
in a (more) satisfactory way 25 .
To further increase the comparability of requirements in these two test tasks, both
sets of instructions included an equal number of structural and content-related clues to
guide and support the students in their efforts. This type of scaffolding indicated the text
format that had to be respected, and it simultaneously established a number of content
points that needed to be incorporated into the written productions. In both cases, three
major elements had to constitute the core of the students’ answers (which of course had
the added benefit that the number of successfully addressed points could easily be related
to the “quantitative” content descriptors for bands 1, 3 and 5 in the marking grid). Each of
the two assignments also respected the other key features of meaningful writing tasks
identified above: not only did they both specify a clear topic (that, in each case, had
already been previously treated in class), but each of them also stated a clear
communicative purpose. For the ‘horoscope’, the students thus had to describe some
future events that were reasonably likely to occur (and generally emulated the rather
vague type of information usually included in such texts). For the summer camp
‘brochure’, a persuasive factor came into play in addition to the general description of
25 In this respect, it is interesting to note that official proficiency examinations such as the Cambridge ESOL Preliminary English Test usually also offer such a choice to candidates in the free (or creative) writing section. These choices generally involve different text types and thus answer formats as well (for example, the student can often choose between a letter or a story to write); see for instance the sample test in University of Cambridge ESOL Examinations, Preliminary English Test for Schools – Handbook for teachers, UCLES (Cambridge: 2008), p.14.
camp activities and facilities (which recycled elements from the ‘competition prize’
activity implemented in class): the students were encouraged to write their answers in a
lively, direct and advertising style. In both cases, the implied audience of the students’
texts was not merely the teacher, but rather the imagined readership of the corresponding
magazine or brochure. Due to all these different aspects, the two tasks certainly
represented a big step away from much more basic and traditional “writing” instructions
(such as ‘make five predictions about the future’).
4.5.2. Description of test conditions
As the ‘free writing’ sections which I included in summative tests were usually
combined with other types of tasks (i.e. reading, listening or ‘English in use’ exercises), it
was clear that a certain amount of time pressure would inevitably risk affecting the
learners’ performances, while tiredness after dealing with earlier exercises could also play
a part by affecting concentration in the final, free writing portion of the test. However, as
a double lesson was usually available for the implementation of these tests, I could (and
generally did) give the students a little more time than I would have been able to do in a
single 50-minute lesson. Thus, an extra period of ten or fifteen minutes could be granted
to the test candidates so as to alleviate the impacts of stress and time pressure on their
performances (which would have affected the reliability of their results to a certain
extent). The tests were also invariably written in the familiar environment of the students’
own classroom, which (as seen in chapter 1) had a further positive influence on overall
test reliability.
In this first summative test of the second term, a slightly more complicated
mathematical operation was exceptionally necessary to arrive at the final marks for “free
writing”, since only a total of 9 marks had remained available for this section after the
rest of the test had been set up. Thus, a mark out of 10 was first calculated from the
marking grid (by using a multiplying factor of 0.5 to arrive at numerical values from
the bands reached). This number was then converted into the finally awarded mark (out of
the maximum of 9) through a simple rule of three (i.e. divided by 10, then multiplied by 9). To
avoid excessive decimal nuances, the resulting scores were rounded up to the next .5 (or .0) if the
decimal value was greater than or equal to .25 (or .75); otherwise, they were rounded down to
the closest .0 or .5 value respectively. As these rather complicated conversions suggest,
though, a mark out of 10 (or 20) is of course to be strongly recommended for reasons of
practicality whenever possible.
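To make the arithmetic behind this conversion explicit, the following short sketch (written in Python purely for illustration; the function name, the sample band scores and the assumption that the four criterion bands are summed before the 0.5 factor is applied are mine rather than part of the official grid) reproduces the sequence of operations described above:

import math

def writing_mark_out_of_9(bands, factor=0.5, max_mark=9):
    # Hypothetical helper: 'bands' holds the band scores (1-5) reached for
    # the four criteria of the marking grid.
    mark_out_of_10 = sum(bands) * factor     # e.g. 4 + 4 + 3 + 4 = 15 -> 7.5
    raw = mark_out_of_10 / 10 * max_mark     # rule of three: divide by 10, multiply by 9
    # Round to the closest .0 or .5, with decimal values of .25 / .75 rounding up,
    # as described in the text above.
    return math.floor(raw * 2 + 0.5) / 2

print(writing_mark_out_of_9([4, 4, 3, 4]))   # 7.5/10 -> 6.75/9 -> rounded up to 7.0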
4.5.3. ‘Horoscope’ task: analysis and assessment of student performances
The vast majority of students (14 out of 19 learners who had been present on the
day of the summative test) ultimately opted for the ‘horoscope’ task as their favoured
writing activity. A first interesting detail which was immediately noticeable was that not
all of them had given the same type of ‘format’ to their respective texts. While most
individuals presented their respective answers as one connected paragraph, others had
addressed the various content elements in successive steps, including sub-headings for
each section. These different approaches are clearly visible in the following two
samples 26 :
Student A:
Love and friendship:
The week will start bad because you meet
some people strange. They will lie. You are
going to have a nice time whit your best
friend. You will meet a very nice and
interesting person and He will give you his
heart.
money and finances
You will have a lot mony and you will win a
new chance.
work and studies
You will have a lot of arguments because
you never will listen to your partner.
You will miss a lot of thinks.
You will go to the doctor.
Student B:
Your week is going to start good.
You will fall in love, and it will be
the man/woman of your life. If you
are bad in finances: don’t worry!
This week you will win the Loto!
If you work, your director will give
you more money, but if you study,
you should learn more, it’s the
only solution! Your sence of
humor will be nice! You should be
more sportive and think more on
yourself.
It will be a nice week for you!
One may argue that “real” horoscopes can be encountered either in the form that student
A chose or in one connected paragraph (as in student B’s production), depending on the
magazine or newspaper one picks. In that sense, the decision to approach the different
content points individually (and by addressing them under separate headings) did not
have a negative impact on the “authenticity” (and thus the appropriateness) of student A’s
sample in this case. Incidentally, since the instructions had in fact been presented in
“bullet points” on the test sheet as well (see appendix 13), some students could also have
interpreted this as a sign that their answers needed to replicate that precise structure. In
short, there were thus no reasonable grounds to consider one type of format as
intrinsically more appropriate for this particular genre (and to value one of them more
26 See appendix 14 for copies of the original student samples and the correspondingly used assessment sheets.
highly than the other in terms of the corresponding ‘content’ criterion in the marking
grid).
Furthermore, both answers superficially seemed to address all the required content
points: sentences about all major categories to be included could be found in each case.
However, particularly student A tackled these issues with varying success. Especially in
relation to ‘money and finances’, it is clear that the intended ‘message [was] only partly
communicated to the reader’. Even if it was still fairly obvious what the student meant
by ‘you will have a lot mony’ (i.e. ‘you will become rich’), the second part of the
sentence (‘you will win a new chance’) did not lead to ‘successful’ communication. A
similar remark could be made about the penultimate sentence in her text: while the
spelling mistake in itself (‘thinks’ instead of ‘things’) did not impede communication, the
actual sentence as a whole was excessively vague and did not really seem to fit into the
category of ‘work and studies’. In combination with the final sentence, the suspicion
arises that she might have intended to refer to an absence at work (or school) due to
illness; however, this example certainly underlines that a considerable ‘effort on behalf of
the reader’ was necessary to make sense of some parts of the student’s answer.
In comparison, student B’s ‘horoscope’ mostly stuck to the imposed content points
a bit more firmly; while one may argue that the piece of information about a lottery win is
too specific for such an article, the overall message in the ‘money and finances’ section
certainly came across much more ‘clearly and fully’ in this instance. The only corrective
effort that the reader really had to make in that precise case was to infer the intended
meaning of the inaccurate expression ‘Loto’ – by no means a feature that would impede
successful communication. While clear L1 interference occurred in various other parts of
the answer as well, it mainly risked having an impact on the ‘lexis’ and ‘grammatical
structures’ marks (i.e. ‘start good’ / ‘be more sportive’); the message in itself was not
usually affected in a negative way. The only exception where ‘content’ truly became an
issue in this answer was in the sentence ‘Your sence of humor will be nice’: no obvious
context, relevance or even intended meaning could be discerned for this prediction. The
relevance of the following sentence (‘You should be more sportive…’) also remains in
doubt, although the student’s reason for including it might be explained by the last
statement in the actual task instructions (‘you can add other important elements that you
can think of’). All in all, the intermediate ‘band 4’ was thus awarded for ‘content’ in this
case as the student had ‘in general’ managed to ‘communicate the message fully and
clearly to the reader’ (band 5), yet some interpretive ‘effort’ had been necessary towards
the end of the written production (band 3). In contrast, student A’s performance was not
only marked by the partly unsuccessful handling of one content point (i.e. only about ‘2/3
of the content elements’ had truly been dealt with adequately); as it also repeatedly
caused more ‘effort on behalf of the reader’, it ultimately corresponded rather directly to
the descriptors in ‘band 3’ (which was thus logically awarded).
A similarly analytic approach was also helpful to reach an informed, criteria-based
decision about the ‘coherence and cohesion’ factors in both written productions. Thus,
student B’s answer was consistently marked by a lively, supportive tone (e.g. ‘don’t
worry!’ / ‘It will be a nice week for you!’); efforts to provide an adequate opening and
ending were also clearly evident in this instance and enhanced the impression of overall
coherence and sensible organisation of the answer as a whole. While ‘simple linking
devices’ were generally only used within individual sentences (rather than to connect two
successive ones with each other), successful attempts to use more ‘complex sentence
forms’ (i.e. ‘If you work, your director will give you more money, but if you study…’)
further contributed to the positive effect on the reader. Yet due to the fact that the final
two sentences in the main body of the text presented no obvious logical or linguistic link
to the preceding content (and thus corresponded to the band 3 descriptor of ‘listed rather
than connected’), ‘band 4’ was ultimately awarded in this set of criteria as well (instead of
the highest possible one).
In student A’s performance, some organisational features were also present, as
underlined by the three included sub-headings and corresponding section breaks.
However, ‘listing’ rather than ‘linking’ of individual ideas not only appeared in this
obvious three-part structure of the answer, but also in relation to the various sentences in
each section (pointing to the ‘band 3’ descriptor of cohesion). Hence, simple linking
devices such as ‘because’ or ‘and’ were consistently used correctly at sentence level, but
no effort was made to use cohesive devices to connect sentences or paragraphs directly.
In terms of coherence, the aforementioned inconsistencies in the last part of the text as
well as the absence of a satisfactory final sentence further underlined the impression that
the student tended to simply add ideas as she went along rather than having a complete,
connected message in mind. Finally, the overall coherence of her text was slightly
problematic as a lot of different events were packed into the predictions for a single week,
yet no expressions of contrast (such as ‘but’) were used to move from positive to negative
events or implications (e.g. ‘They will lie. You are going to have a nice time…’). The
amalgamation of such inconsistencies, combined with the generally ‘listed’ answer style,
ultimately led to a ‘band 3’ score for ‘coherence and cohesion’. Hence, even though one
band 5 descriptor had been partly respected (‘use of simple linking devices such as
‘and’… and ‘because’), the remaining features of the student’s answer clearly pointed
towards ‘basic’ rather than ‘target’ standard.
In terms of linguistic aspects such as ‘lexis’ and ‘grammatical structures’, obviously
neither candidate managed to provide a completely faultless performance. For
instance, both of them struggled with adverbs (e.g. ‘start bad’ and ‘start good’,
respectively), and some structures and expressions that the students used clearly betrayed
their need to improvise in order to bring their message across (student A: ‘You will win a
new chance’ / student B: ‘if you are bad in finances’). This “risk taking” was, as seen
above, not always crowned with success. Nevertheless, particularly in student B’s
production, most of the grammatical and lexical errors committed ‘did not reduce
communication’, which once again underlines the necessity of setting the linguistic
imperfections of a particular student performance against the more general consideration
of whether or not the student has reached a sufficient level of proficiency to express
herself clearly enough to fulfil the communicative purpose of the task.
On the other hand, the interplay of ‘form’ and ‘content’ also occasionally became
evident in this assessment. Hence, student A’s sentence ‘you will win a new chance’ not
only risked having an impact on her mark in relation to ‘lexis’ criteria, but as it
simultaneously reduced her ability to address the corresponding ‘content’ point in an
adequate way (i.e. to ‘fully communicate the message’), it would keep her from achieving
maximum marks in the global ‘content’ section as well. At times, it was thus difficult to
keep different criteria completely separate from each other (which, in turn, might also
partly explain the fact that marks for different criteria within a single student’s assessment
frequently tended to be rather close to each other).
Nevertheless, as in the case of the marking grid for speaking, the decision to keep
‘lexis’ and ‘grammatical structures’ as separate criteria proved a useful one to analyse the
precise areas where the students encountered problems (even if the same bands were
fairly often reached by individual students in relation to both aspects). In this instance,
student A’s production was occasionally marked by ‘limited control of spelling’ (‘lexis’
band 3); however, some of these mistakes seemed to be due to inattention (e.g. ‘money’
spelled correctly once but misspelled in the subsequent line). Although it was ultimately
because of her ‘limited range’ in this area that the second content point could not be
entirely addressed in a correct way (and she had thus not completely reached the ‘target
standard’ of an ‘adequate range’), the expressions she chose in the remainder of her text
generally seemed ‘appropriate and adequate for the task’ (band 5). Moreover, most of her
spelling mistakes could indeed be seen as ‘minor errors’ as their impact on
communication was usually minimal. However, since ‘band 5’ had not been reached in
every aspect of the ‘lexis’ criteria, the slightly lower band 4 was thus awarded instead.
In terms of grammar, the occasional use of fairly basic sentences (‘You will...’)
initially pointed to a ‘limited range of structures’ (certainly in comparison to student B,
who had included rather complex first conditional sentences). Moreover, slightly wrong
word order (e.g. ‘some people strange’) also characterised student A’s answer at times.
Yet a closer look revealed that none of these mistakes seriously impeded communication;
in fact, important elements such as syntax and use of verbs and tenses were correct
enough throughout to avoid ‘difficulty in terms of communication’ (band 3). General
achievement objectives for this summative test had additionally been met by the student,
since she had used future forms correctly throughout her answer (apart from one instance
in the first sentence which, as it was a clear exception, pointed to an uncharacteristic
‘slip’ resulting from ‘inattention’ rather than constituting a systematic ‘error’). In that
sense, if one focused purely on the ‘grammatical’ aspect of the student’s text, it became
clear that there were only ‘minor errors’ that did not, in themselves, create
misunderstanding or obscure meaning. All in all, since ‘features of bands 3 and 5’ were
thus visible in her performance, she ultimately obtained the intermediate ‘band 4’ mark
for ‘grammatical structures’ as well.
The thorough error analysis encouraged by this criteria-based assessment system
thus ultimately led to a number of useful consequences. On the one hand, by separating
the foci on lexis and grammar, I resisted the temptation of simply counting the total number
of ‘form’-related mistakes and correspondingly reaching a holistic judgment that might
have been fairly negative. Indeed, without distinguishing between spelling, vocabulary
and grammar mistakes, one might simply have reached the conclusion that there was
hardly a completely mistake-free sentence in student A’s performance, and
correspondingly a fairly low mark for ‘form’ (or ‘language’) might then have been
awarded overall. In contrast, when analysing the text with reference to the various
specific descriptors for ‘lexis’ and ‘grammar’, it was not only possible to understand
where most of these different mistakes had come from; this approach also crucially
helped to find out which errors truly hindered the successful achievement of a
‘communicative purpose’ (and which ones did not), and to adjust the various criteria-
related marks accordingly. In turn, this not only translated into a more nuanced (and more
soundly justified) assessment; it also provided a better opportunity to give targeted
feedback to the student (by pointing out the area that had caused the most problems in the
test performance and thus primarily needed to be worked on).
4.5.4. ‘Summer camp’ task: analysis and assessment of student performances
Although the second proposed writing task in this summative test generally
imposed three main content points to be included as well, it also left the students a lot of
room to become creative. As a result, the students who chose this option came up with
very diverse answers, as exemplified in the following two productions 27 :
Student C:
Hi! I’m Mr. Cruse the manager of the
new sommer camp “Think about It”. My
camp is in Knogge near the see. Next to
my camp Is a Zoo with a lot of animals.
The participants can do a lot of wateractivitys, like surfing, Water-gym…
The Camp rooms are verry comfortible,
they have two bed’s a bathroom,
television and balcon. In the camp we
have a restaurant with 20 diffrent
nationalitys of food, Chinese food… We
have also a welnessroom, with sauna,
solarium, hairstyler… When it conviced
you come here, it would be funny!
Student D:
Two weeks summer camp in France
without parents and brothers or sisters.
Groups with 4-5 person.
We will climbing, do water ski, rally,
and differents sports (football, tennis…)
We will sleep in a old house. In a room
there sleep 4-5 person (girls or boys).
The last week we will sleep in Nature not
in a building, and cooking four yourself.
In the evening we will make a disco.
You will have sometimes freetime.
10 females and 8 males will sitting you.
We have place for 60 childrens (14-18
years)
When you come with us in the summer
camp you’re never lost the very funny
time. You will have a lot of sports and
will learn a lot of diffrent people.
Once again, both students generally addressed the required content points: there is
evidence of available activities, buildings and rooms in each of the two productions (even
though the rather luxurious sleeping facilities described in production C do not really
correspond to a typical ‘camp’). In different ways, students C and D also respected the
instruction to make their ‘brochures’ as lively as possible; both of them chose to do so by
27 See appendix 14 for copies of the original student samples and the correspondingly used assessment sheets.
addressing the correct target audience (teenagers who should come to the camp) in a very
direct way. However, whereas the first student chose to create an imaginary persona (‘Mr
Cruse’) that consistently addressed the reader, the speaker in the second text rather
randomly alternated between the subjects “we” and “you”; as a result, the overall
coherence of production D suffered to a certain extent.
In terms of format, the ‘brochure’ context could potentially be invoked to defend
the style of student D’s first two sentences: while in general, incomplete constructions
(i.e. sentences without verbs) should evidently be avoided in writing, one could indeed
point out that advertisements or titles can occasionally take the form that the student had
used to begin her answer with. The fact that the student consistently used verbs
throughout the remainder of her answer further reinforced the impression that these
first two “sentences” were intentionally contracted. Nevertheless, in regard to overall
coherence and cohesion, student D’s answer certainly did not exceed the ‘basic standard’:
there was no systematic use of ‘linking words’ or other cohesive devices (apart from
‘and’), and thus the individual points that she made were ‘simply listed’ throughout. On
the other hand, the intentional use of paragraphing for new sets of ideas certainly
indicated a form of ‘organisation’ that was clearly missing in student C’s text (but which,
in turn, was compensated for by the fitting opening and concluding sentences that
provided a general, coherent frame to that latter text).
Yet what makes these written productions particularly interesting samples for a
more thorough analysis (in view of assessment) is the fact that both of them clearly
contained a range of problems with grammatical and lexical accuracy. In student C’s
production, for instance, a whole range of ‘lexis’-related mistakes appeared: inconsistent
use of lower and upper cases, frequently inaccurate spelling, as well as a few wrongly
improvised expressions based on L1 transfer (e.g. ‘welnessroom, hairstyler’). Some of
these mistakes may certainly be attributed to ‘risk taking’, since an A2 learner is not
necessarily familiar with the expressions ‘convinced’ and ‘comfortable’; spelling
mistakes in those rather complex words are thus understandable. However, others cannot
be excused in this way: the student was, after all, supposed to know the correct spelling of
words like ‘sea’ and ‘very’. Of course, the single usage of each word does not allow us to
verify whether these were mere ‘slips’ (due to ‘inattention’, which is arguably the case for
the wrongly copied word ‘sommer’) or consistent errors in the student’s interlanguage.
Nevertheless, they certainly contribute to the overall impression that this learner’s
orthographic control was fairly poor in general. Furthermore, some grammatical items
were also used inaccurately (e.g. ‘bed’s’ instead of the correct plural ‘beds’); in regard to
this criterion in the marking grid, an occasionally inappropriate use of punctuation marks
affected the overall logic of some sentences as well (for example in ‘diffrent nationalitys
of food, Chinese food…’, where the comma indicated an enumeration, yet a colon would
have been more adequate).
In a ‘classic’ assessment scheme where all these mistakes would have been
considered in a unified ‘form’ criterion, an insufficient mark could easily have been the
consequence, especially in reference to supposedly ‘known’ items such as ‘summer’,
‘sea’ and the general rule in the English language that most nouns are spelled in lower
case. To a certain extent, it might also be reasonable to expect that a ‘pre-intermediate’
student should get these elements right, particularly in a summative test which (partly)
verifies adequate mastery of previously encountered material (which, in this case, would
definitely apply to the words ‘sea’, ‘very’ and ‘different’). However, taking into account
the actual complexity of the free writing process mentioned above, a slightly more lenient
approach to accuracy might be more appropriate in this instance. More importantly,
following the CEFR approach of analysing what the learner actually did right in this
assignment, it would be unduly harsh to deny the general communicative success that her
writing performance essentially achieved in spite of its lexical and orthographic
shortcomings. In fact, one might argue that only the final sentence may truly have caused
problems in terms of communication; the learner’s choice of words was not only marked
by the rather important confusion of ‘if’ and ‘when’, but the expression ‘it would be
funny’ (instead of ‘you will have lots of fun’ or ‘that would be great’) ultimately led to a
different meaning than the initially intended one. In contrast, spelling mistakes are often
much more forgivable as they do not frequently cause a complete failure to bring the
intended message across; the student’s orthographic control thus effectively corresponded
to the relevant A2 descriptor which simply calls for ‘reasonable phonetic accuracy’. As
that descriptor functioned as the foundation for the ‘basic standard’ defined in the
marking grid, a corresponding ‘band 3’ was ultimately awarded to the ‘lexis’ criterion of
the student’s performance due to her clearly ‘limited control of spelling and word
formation’.
Once the assessment focus was turned exclusively onto the grammatical proficiency
that the learner displayed in this text, it became even clearer that there were not many
elements which actually hindered communicative success. Apart from a few minor
mistakes such as wrong word order (‘we have also’ instead of ‘we also have’) or slightly
inaccurate constructions (‘next to my camp Is a Zoo’ instead of ‘there is a zoo’), the
candidate generally managed to express her ideas rather fluently. Even the basic mistake
of placing an incorrect apostrophe in the plural ‘bed’s’ certainly did not obscure the
intended message (although it is an element that a student could be expected to
get right even at that level). Once again, the final sentence caused the greatest problem:
the wrongly constructed conditional sentence did not really work there. However, one
needs to bear in mind that the ‘Second Conditional’ does not belong to the grammatical
structures that a language learner is expected to have mastered (or even encountered) at
that early stage; as such, the inaccuracy resulted from ‘risk taking’ in this instance and
should therefore not be overestimated.
In a different respect, though, the student’s overall text had actually failed to respect
one of the instructions: the first content point to be included had been specified by way of
a sentence in the ‘will-future’ (i.e. ‘describe the things that the participants will see and
do’). In contrast, student C had consistently used the present simple tense in her answer
instead. In that sense, she had not demonstrated that she could use one of the targeted
grammatical structures in a wider context. To a certain extent, of course, her choice of
tense was not totally inappropriate for the purpose of describing the available summer
camp activities per se (since the present simple is, after all, perfectly fine for general
descriptions). In turn, this underlines once more that such free writing tasks can only be
drawn upon in a limited way to verify the achievement of a particular grammatical
objective, since it is not always entirely predictable which structures the students are
ultimately going to use. However, at least in relation to the first content point of the
‘summer camp’ task, the instructions had provided a clear hint that future forms were
expected (even if this might not have been stressed quite as obviously as in the
‘horoscope’ activity discussed above). Even though student C’s text had not led to much
obvious ‘difficulty in terms of communication’ with the linguistic choices she had
independently made, her production was thus essentially not marked by a ‘sufficient range
of structures for the task’ (but rather a ‘limited’ one as she virtually stuck to the present
simple tense throughout). Taking into account the various instances of ‘faulty
punctuation’ in her production as well, I ultimately decided to award the corresponding
‘band 3’ for ‘grammatical structures’, which essentially reflected that communication had
still largely been possible even though accuracy was recurrently a problem. The general
fulfilment of the communicative goal was also visible in student C’s final mark (6/9),
which was sufficient but still underlined a lingering tendency towards ‘basic standard’
and the occasional need for ‘some effort by the reader’.
An example of how limited accuracy can lead to much more noticeable problems of
communication can be seen in student D’s text. Throughout this production, it is evident
that numerous rather basic grammatical elements had not been mastered by this learner
yet: frequently encountered plurals (‘childrens’ / ‘person’ instead of ‘people’), indefinite
articles (‘a old house’) and verb forms (‘we will climbing’) constituted only a selection of
items that were not used correctly in this text. As these items and structures had all
repeatedly been previously encountered and rehearsed, they certainly did not result from
‘risk taking’ but rather from an insufficient achievement of relevant objectives (which, in
turn, had a negative impact on the learner’s overall proficiency in writing). In relation to
the ‘grammatical structures’ criterion, the student thus tended towards the ‘band 1’
descriptor that implied ‘essentially no control of structures’; however, even if the
student’s handling of grammar certainly lacked consistency, there were also various
constructions that did work (e.g. ‘We will sleep in a[n] old house’ / ‘We have place [i.e.
room] for 60 children[…]’). Ultimately, communication was therefore still possible to a
sufficient extent to provide some relevant information in regard to a number of content
points. As punctuation was also generally handled adequately, ‘band 2’ was ultimately
awarded in this category.
Numerous issues also affected the student’s lexical and orthographic competences:
in this case, attempts to improvise in order to bring the general message across were
unsuccessful on various occasions. While spelling mistakes such as ‘cooking four
yourself’ did not impede communication, other linguistic “experiments” already seemed
rather awkward (e.g. ‘10 females and 8 males’ / using the word ‘childrens’ to refer to 14- to 18-year-old teenagers). Finally, some constructions could only be understood if
one was familiar with the student’s L1 (in this case Luxembourgish): sentences like ‘learn
a lot of diffrent people’ (instead of ‘get to know’) or ‘you’re never lost the funny time’
(probably meaning ‘you will never be bored’) would certainly cause problems of
interpretation for a native speaker of English. Similarly, the sentence stipulating that the
camp employees ‘will sitting you’ did not lead to successful communication; while it is
likely that the term had been derived from ‘babysitting’ and was supposed to mean ‘to
watch over you’, this example clearly underlines that ‘risk taking’ did not always lead to
positive results in this student’s production.
The suggested possible interpretations of the three abovementioned (parts of)
sentences certainly illustrate that a ‘serious effort by the reader’ was necessary to make
sense of student D’s text at times. In essence, there were thus too many problems with
lexis and spelling to consider that the ‘basic standard’ had been reached here; ultimately,
the vocabulary ‘range’ exhibited by the learner did not entirely fulfil the condition of
being ‘minimally adequate for the task’ in all aspects. On the other hand, as her intended
meaning could still be followed (particularly in the first half of her answer) or inferred
(towards the end) most of the time, I considered that the second band in the marking grid
was the most adequate one to attribute to this aspect of the student’s performance as well.
The marks in both ‘form’-related categories thus reflected the insufficient accuracy in the
student’s performance, but they also took into account that a complete breakdown in
communication had in many cases been avoided. Once again, it seems rather likely that
the student would have risked getting a very low mark overall if all her mistakes had
simply been added up and seen as the main basis for her final score.
In terms of overall ‘content’ marks, these accuracy problems evidently had an effect
as well. After all, some ideas that the student had had in mind did not totally come across
due to the wrong expressions or constructions she had occasionally picked (especially in
the final paragraph). Hence it was evident that the message had not been ‘fully
communicated to the reader’ (i.e. band 5 had most certainly not been reached). On the
other hand, there were also many elements that one could still understand even if these
pieces of information had not always been perfectly phrased. For example, the camp
location, facilities, available activities as well as targeted customer groups were all
described in intelligible (if not completely accurate) language 28 . In relation to the marking
grid, this certainly means that more than ‘1/3 of the content [had] been dealt with’;
moreover, ‘excessive effort’ was not necessary to understand the bulk of the message. As
a result, the performance markedly exceeded the insufficient ‘band 1’ in terms of overall
content. In fact, the ‘band 3’ descriptor seemed a much more appropriate one to sum up
the student performance: the ‘message [had] only [been] partly communicated’ and ‘some
effort on behalf of the reader’ had been necessary. Since a similar basic adequacy was (as
seen above) attested to the ‘coherence and cohesion’ factors of student D’s production,
28 Indeed, one can hardly say that it is impossible to comprehend what the student wanted to say in such a sentence as the following: ‘The last week we will sleep in Nature not in a building, and cooking four yourself’. The expression ‘not in a building’ in particular shows the student’s attempt to paraphrase: she was aware that her choice of expressions might not be completely accurate and thus used such strategies to further clarify her meaning.
the two ‘basic standard’ scores in ‘content’-related criteria ultimately secured a marginally
sufficient final mark (4.5/9) in this instance. All in all, the results of the assessment
procedure thus reflected the two most crucial elements of the student’s performance: there
was a clear need to improve the overall accuracy in both grammar and lexis; however, as
a whole, the communicative goal had still in large parts been reached.
4.5.5. Outcomes of the applied assessment procedure and general comments
The detailed analysis of individual student samples above clearly reveals the
systematic and nuanced assessment procedure which the use of a competence-based
marking grid encourages. Of course, even this type of assessment does not completely
eliminate a subjective component inherent to the personal judgments of an individual
assessor; in fact, other teachers might have interpreted and applied the defined marking
criteria in slightly different ways than was done in these four cases, and different scorers
might thus still have arrived at diverging marks. However, the overall
principle of constantly keeping the student’s adequate fulfilment of a communicative
purpose in mind certainly helps to guide the assessment into a direction that
fundamentally considers the student’s writing skills rather than the mere display of his
grammatical or lexical knowledge. Moreover, as the application to two different test tasks
proves, the descriptors in this marking grid were sufficiently versatile and consistently
relevant to allow for a well-founded and balanced appreciation of various student
productions as well as text types.
In this regard, one might add that no major discrepancy ultimately existed between
the final scores respectively awarded to ‘horoscope’ and ‘brochure’ productions. Both
sets of marks presented a fairly similar range (between 4.5/9 and 8/9 for the 14
productions that had dealt with the first task, and between 4.5/9 and 7/9 for the 5 ‘summer
camp brochures’) 29 . It is of course hard to guess how individual students might have fared
if they had chosen the alternative topic, even if it seems reasonable to assume that their
results would not have differed to a huge extent (given that the overall level of
proficiency they had reached was bound to lead to comparable levels of accuracy and
fluency in other writing tasks as well). A higher overall reliability of test results could
thus arguably have been reached if all the students had had to address the same
29 See appendix 15 for a detailed breakdown of marks. It should also be stressed, though, that valid statistical comparisons can of course only be drawn to a limited extent based on such a small sample of productions.
instructions (and to use more closely similar vocabulary ranges and language structures as
a result). Nevertheless, the comparable values which marked the two sets of results
certainly indicated two important facts: on the one hand, both writing tasks were feasible
for these A2-level students with the language resources they had built up over the course
of the term (and year) up to that point. On the other hand, the two test tasks were also
sufficiently reliable in their own right in the sense that each of them allowed the assessor
to make a clear distinction between ‘good’ and ‘weaker’ performances.
A further interesting fact is revealed by a closer look at the various factors which
had led to the final marks. Thus, a certain consistency could once again be seen across the
bands which individual students had attained in relation to the four main criteria in the
marking grid: as in the speaking tests (see chapter 3), the marks attributed to the various
aspects of each learner’s performance invariably spanned only one or two different
bands. In part, this was evidently due to the interplay of ‘form’ and ‘content’ alluded to
above: if a student had for example exhibited a clearly limited vocabulary range, it was
fairly logical that a ‘band 5’ achievement in relation to ‘content’ criteria (and the
corresponding ‘full and clear communication of the message’) was very difficult to
achieve for that individual. On the other hand, completely irrelevant answers (and thus
large discrepancies between high language-based and low ‘content’-related marks) had
also been avoided in this instance through the clear indications that each set of
instructions had given about respective elements to be addressed. To some extent, this
certainly helps to explain why no insufficient ‘content’ marks had to be distributed in this
test. As seen in chapter 1 30 , slightly ‘restricting the scope of variety in answers’ in that
manner contributed to increasing test-related reliability as well. In a similar way, the
partially suggested structure in the instructions had also supported the learners in their
efforts to imbue their productions with ‘coherence and cohesion’, which might have had a
positive impact on their scores in that particular area as well.
On the other hand, it also needs to be remarked that maximum marks were either
very rarely reached in relation to specific criteria or (in the cases of ‘lexis’ and ‘content’)
not reached at all. Seeing that the ‘band 5’ descriptors had left some room for imperfections
in the learners’ performances, this might certainly be considered as a disappointing result.
Yet given the rather wide vocabulary range that was required to fulfil all parts of the test
tasks ‘fully and clearly’ (band 5), it is not particularly surprising that the lowest average
30 See the comments of Cohen et al. regarding test reliability in section 1.2.2.
scores in the students’ performances were obtained in regard to ‘lexis’ criteria (and this,
as seen above, sometimes affected their ability to answer all content points in sufficient
levels of detail and complexity). Since the overall focus of the assessment clearly lay on
proficiency rather than specific achievements, a certain amount of unsuccessful
experimentation with less practised words and structures was to be expected (and
accepted); however, as this sometimes hampered the successful communication of the
intended message in the students’ productions, the ‘target standard’ could then not be
awarded. On the one hand, one may therefore argue that the test tasks had aimed at a
vocabulary range that was perhaps slightly too complex in this instance; this was arguably
an issue that could be more satisfactorily adjusted in subsequent tests. At other times in
this summative assessment, though, too many mistakes had simply occurred in the usage
of more familiar items, so that these ‘slips’ no longer really fulfilled the band 5 criteria of
either being ‘few’ in number or of ‘minor’ impact.
However, it is also necessary to keep in mind that the ‘target standard’ (i.e. band 5)
represents the level which students should ideally reach; whether the degree of difficulty
of a particular summative test is appropriate is thus best measured through the number of
students who manage to attain the ‘basic standard’ (band 3) for A2 learners. In this case,
the vast majority of students who had taken the test had encouragingly done so in relation
to ‘lexis’ and ‘grammatical structures’; moreover, those who had failed to reach sufficient
marks due to a lack in accuracy could still reach a ‘pass’ mark if they had managed to
communicate the intended message to a satisfactory extent. This once again underlines
that free writing tasks should not be misused for excessively grammar- or lexis-based
achievement assessments, as we then risk losing sight of the more crucial communicative
purpose of the assignment as a whole.
4.6. Alternative types of writing tasks
It was earlier pointed out that the students’ ability to express themselves should be
verified in relation to a wide range of topics and situational contexts (as well as through a
variety of different text types). For that reason, I will now conclude this chapter with a
brief exploration of other writing tasks that were designed for (and implemented in)
summative tests at other points of the school year with such an aim in mind.
4.6.1. Informal letters and emails
One communicative purpose which has undoubted potential to meet the interests of
adolescent language learners is the act of sharing information with friends, which
provides an apt, A2-compatible framework for the description of everyday activities and
past experiences. Although contemporary teenagers might tend to prefer other types of
written communication such as text messaging or online chatting, letter writing certainly
still represents a useful skill to develop and master for any language learner. As a result,
both formal and informal letter types were studied and practised in classroom activities
(and homework assignments) over a number of weeks in the third term 31 . Based on the
practice that my students had thus gained, a matching writing task was then implemented
in their first summative test of that term. The corresponding instructions indicated a
number of features that were established as essential for a ‘true’ writing task above 32 : a
context for writing (i.e. a year abroad), a target audience (i.e. a friend at home), a number
of content points to include and the type of format and writing style to use (i.e. an
informal letter) 33 . In this particular case, the students were not offered a choice of writing
task (since it could be safely assumed that they would all find something to write about to
a friend); consequently, the test would provide more reliable results in terms of allowing
legitimate comparisons of displayed proficiency levels.
Even though the assignment partly revisited the topic from the very first summative
test of the year, it certainly represented a much more communicative task this time; the
learners were not only invited to provide a written production that included adequate
information, but also to present it in an adequate format and to follow appropriate
sociolinguistic conventions. Moreover, the compulsory content points had, in this case,
31 Some of these activities were based on Hutchinson, Lifelines Pre-Intermediate Student’s Book, pp.58-59;
for instance, the students were asked to write a homework assignment in which they imagined that they
were writing to their old school friends twenty years from now, telling them about everything that had
happened in the meantime.
32 See section 4.2.
33 See appendix 16.
been deliberately chosen to target the learners’ ability to ‘describe past experiences or
events’; in turn, this would also supply some indications about the ways in which the
students had learnt to apply a number of tenses they had encountered over the course of
the term in a wider, coherent context (which would be interesting in terms of achievement
assessment). However, a potential drawback of these instructions was the fairly artificial
way in which they imposed the content points: by simply telling the students what they
had to do, an ‘authentic’ feeling of truly engaging in written interaction with another
person evidently did not arise. In other words, the fact that their contributions purely
followed the purpose of ‘display writing’ was essentially not disguised in this instance.
An interesting variation which arguably tends to confer a higher degree of
authenticity to this type of communicative context (and purpose) consists in confronting
the learners with an actual written message that they first need to read and interpret, and
only then reply to in a meaningful and relevant way. In an attempt to put such a strategy
into practice, I deliberately implemented another test task where I did not present the
writing instructions in a classic, impersonal form; instead, the test sheet simply showed an
email message from an English-speaking exchange student who wanted some information
about my pupils’ school (and its canteen) by addressing them in a distinctly informal
writing style 34 . Hence, during the summative test, the candidates first of all had to read
the text they had been given, scan it for precise questions to answer and only then start
“replying” to the exchange student by writing an appropriate, informal “email message”
in return. As this exercise activated not only productive but also receptive language skills,
this writing task was strategically implemented in the final test of the year (when my
students had ostensibly reached their highest level of proficiency and would not be
overwhelmed by the more extensive amount of provided input). Nevertheless, it is clear
that the combination of reception and production can be a double-edged sword. On the
one hand, as Harmer points out, ‘reception’ is of course often a ‘part of production’:
in many situations production can only continue in combination with the practice
of receptive skills. […] Letters are often written in reply to other letters, and email conversation proceeds much like spoken dialogues. […] The fact that
reception and production are so bound up together suggests strongly that we
should not have students practise skills in isolation even if such a thing were
possible. 35
34 See appendix 16.
35 Harmer, op.cit., p.251.
It goes without saying that if ‘reception’ is thus used as a stepping-stone leading to more
authentic ‘production’ in classroom activities, there is no reason why the same should not
be applicable to summative tests.
Yet if the aim of such tests is to gain insight into competence development
explicitly in regard to productive skills, there are a number of issues that do in fact arise
in such combined tasks. In the summative test described above, two problems in
particular were noticeable in the students’ final written productions. Firstly, some
students ultimately failed to address all the content points, presumably because they had
not identified all the questions in the initial text (or simply forgotten to address some of
them); as a result, they did not give a sufficiently complete reply in their own emails.
Secondly, in response to those elements that they had spotted, their answers contained
numerous expressions they had simply copied from the text; as a result, the evidence that
could be gathered about the students’ own lexis-related competences was ultimately
blurred and unreliable. In other words: what the communicative context and purpose of
the exercise might have gained in (semi-)authenticity was lost in terms of providing a
safely founded (and “authentic”) assessment of the students’ actual lexical resources.
Hence, a shorter email containing only a few short questions could be a useful alternative
to limit the amount of modelled input and readily provided lexis, yet still maintain a
communicative task layout that is more engaging than a few mechanical and direct
writing instructions.
4.6.2. Story writing
A much less guided type of written production which allows the students to express
themselves very freely and creatively can be reached through exercises that encourage
story writing. However, in this instance as well, different types and amounts of
scaffolding can be provided to support the learners in their efforts. Rather than only
giving them a very basic type of instruction (e.g. ‘write a story about a fantastic day’), the
topic that they should write about can thus be supplied in more engaging (but also
sometimes more challenging) ways. A particularly useful variation consists for example
in handing the students the beginning of a story and asking them to finish it in an
adequate way. In their resulting written productions, the students are thus not only
required to include relevant ideas and subject matter; a truly convincing effort will also be
marked by a suitably coherent and cohesive structure which does not disregard the
details that have been offered in the opening lines of the story. In comparison to the two
types of test tasks analysed in the case study above, a more thorough analysis of the
students’ use of such elements as paragraphing and cohesive devices becomes possible in
the case of such story writing. While this is another example of ‘reception leading to
production’, the possibility of copying provided expressions and ideas has been
significantly reduced in this instance. In contrast to the “email reply” mentioned above,
the students have to develop a storyline and cannot directly repeat the information that the
test sheet already contains.
Such an exercise was for example implemented in the last test of the second term 36
(a point in time when the students had already had the chance to significantly develop
their linguistic resources since the beginning of the year). Evidently, one can only expect
the students to master such a task in an adequate way if they are familiar with story
writing strategies and have had the chance to experiment with this type of creative
production in the classroom (and have, of course, received relevant feedback about their
efforts) prior to the test. Similarly, they are most likely to have adequate command of
topic-relevant vocabulary and grammatical structures if the test task closely respects
thematic areas and grammar points that have been treated in some depth beforehand. All
of these conditions were respected in this particular instance; not only had the students
been asked to write a number of stories both in the classroom and in homework
assignments, but the test task itself was also largely based on the subject matter (scary or
embarrassing incidents) and narrative tenses (past simple and past continuous) that the
course content had focused on for a number of weeks. In that sense, the complex
intertwinement of achievement and proficiency assessments that needed to be applied to
the resulting student productions was once again underlined: while the learners certainly
needed to bring into play ‘all and any language at their disposal’ (i.e. their entire
proficiency repertoire) to compose adequate stories, it was also clear that the achievement
of recently treated objectives could not be entirely disregarded in the corresponding
assessment.
It should also be pointed out that a choice of two different story beginnings was
offered to the students in the ensuing summative test. However, since the type of text that
had to be produced was essentially the same in both cases, I would argue that the act of
supplying two slightly different options only had a very limited impact on the overall
reliability of the test results in this case. In fact, they offered the students an additional
36 See appendix 17.
chance to show what they were capable of in writing; especially if such a strongly
creative effort was demanded by a task as in this case, it seemed only fair to offer at least
a variety of options that could suitably stimulate the learners’ imagination. As further
support, two very broad guidelines were included below the story extracts on the test
sheet: by insisting that the students had to include both ‘what they did’ and ‘how they
felt’ in the given situation, their room for creativity was not majorly reduced, yet a very
basic answer content had been indicated which they could fall back on. Additionally, this
inclusion of two core features to be respected also had a useful impact on the ensuing
assessment procedure, as it made a more systematic and criteria-oriented analysis of the
content in the students’ texts possible. Nevertheless, especially if this type of exercise is
implemented in summative tests on a more regular basis, it certainly seems legitimate to
eventually leave out such clues and thus to encourage students even more to come up
with independent and creative solutions in their aim to fulfil the general communicative
goal.
In fact, a more extreme way of limiting the amount of ‘reception’ as a basis for
‘production’ is exemplified in a proficiency-based test task that appears in Cambridge
PET 37 examinations: in the most extensive writing tasks, the candidates can show their
creative writing abilities by developing a suitable story simply based on a short opening
sentence (such as ‘Jo looked at the map and decided to go left.’ 38 ). In this case, a very
wide variety of student answers is evidently possible, and the students are certainly
encouraged to activate the full range of their writing competences. However, great care
needs to be taken in the corresponding content assessment since the relevance of various
student productions may sometimes be problematic; if no points to be included have been
indicated, it is much easier for students to veer off topic and thus to produce an answer
that may be fairly proficient in terms of language, yet can still be largely devoid of
suitable content.
Interestingly, one element of a ‘good’ writing task is actually missing in the
abovementioned examples of story writing: there is no implied audience (other than the
teacher) which the written production should address. Of course, artificial contexts could
be imagined and indicated to circumvent this, for example by stating in the instructions
that the students had entered a ‘story writing contest’ and should therefore try to make
37 While the PET examination actually targets the CEFR B1 level, the type of task described in this instance
can certainly be adapted to A2-level purposes as well.
38 Example taken from Preliminary English Test for Schools – Handbook for teachers, p.21.
their story as original and interesting as possible. However, the imposed nature of such
indications does not seem to add very much to the task in the way of “true” authenticity.
In fact, in contrast to situated writing activities such as letter writing, where an implied
reader is of great importance to the adopted genre and style of the produced text, the
absence of an ‘audience’ is not nearly as decisive in the context of story writing. As the
students are aware that their written production should focus on the logical development
and sequencing of events, it seems safe to assume that they will aim to respect the
particular conventions of this type of creative writing even if no audience is explicitly
mentioned.
The different types of tasks described in this chapter evidently only represent a very
small portion of possible ways in which “true” writing skills can be activated and verified
in summative tests; many more are easily conceivable and realisable. In addition, it has
certainly been underlined that the transformation from “traditional”, excessively
grammar-based and mechanical writing tasks to more stimulating and competence-based
ones does not need to be an intrinsically difficult and challenging proposition at all. In
fact, a number of simple yet careful and reasoned modifications can often suffice to turn a
basic writing exercise into a contextualised and meaningful writing task. If tasks with
truly communicative purposes are implemented in tandem with such a multifaceted and
universally applicable assessment tool as the marking grid described in this chapter, a
more thorough and nuanced insight into the students’ writing competences can certainly
be gained from their written productions in regular summative tests.
Chapter 5: Conclusions and future outlook
At the beginning of this thesis, the three notions of validity, reliability and feasibility
were established as indispensable cornerstones for truly pertinent forms of testing and
assessment. The question was then raised to what extent a competence-based approach to
the productive skills of speaking and writing could be helpful to enhance the adequacy of
our summative tests in relation to all three of these key concepts. After exploring the
theoretical implications and usefulness of a framework such as the CEFR, as well as
describing and analysing attempts to apply some of its guiding principles to test design
and corresponding assessment strategies in practice, it is now possible to reach founded
conclusions about the opportunities – but also challenges – that such a revamped
approach implies.
5.1. The impact of a competence-based approach on test design and assessment
5.1.1. Effects on validity
The renewed focus which the CEFR-inspired, competence-based approach has put
on inherently communicative purposes of speaking and writing acts can only have a
positive impact on the content and face validity of tests that correspondingly target the
students’ productive skills. Indeed, if a given test task does not merely try to get an
overview of the student’s knowledge of discrete language items, but rather asks the
learner to activate a wide range of linguistic competences to reach a specific
communicative goal, then we are truly testing the student’s ability to produce meaningful
writing or speaking samples (and thus the test is ‘actually measuring what it intends to
measure’ 1 ). As seen in the respective practical examples, important steps to this aim can
for example be taken by getting students to perform a range of productive and interactive
tasks in oral tests, or by striving to contextualise writing tasks and imbuing them with a
clear communicative purpose whenever possible. In turn, the inferences which can be
1 See pp.13-14 in chapter 1.
drawn from student performances in such tests in regard to their overall language
proficiency and the levels of competence they have reached raise construct validity in relation to these
two concepts; in excessively grammar-centred tasks, for instance, this would not be
possible to nearly the same extent.
Naturally, it is important to keep in mind that summative tests alone can never give
a complete picture of a student’s veritable proficiency level. While they can certainly
contribute useful indications to that effect (especially if they observe such guidelines as
including a range of different task types), one should not forget that competences can
never be fully revealed by any single, isolated performance. Hence, even with test tasks
that are marked by increased degrees of content and construct validity (in terms of truly
allowing the assessor to verify speaking and writing skills), regular summative
assessments still have to be complemented by a whole range of formative ones. Only if
the latter are continuously and conscientiously gathered in the classroom can we hope to
reach completely informed conclusions about the proficiency levels that our learners have
attained. In relation to speaking, this means for example that a student’s five-minute
performance in a term’s oral test must be seen in conjunction with his regular
contributions in classroom activities; even if the test tasks sample a range of different
aspects, they cannot possibly cover all the activities, situations and subject-matters that
are dealt with in class over the course of a term or year. While a specific mark can of
course be attributed to an isolated performance in a summative test, a broader proficiency
assessment can thus only be reached after diverse language samples have been provided
over the course of a more extensive period of time (and thus, in addition to summative
tests, also in contexts where time and achievement pressures have less or no impact on
student-related reliability).
As far as a comprehensive examination of writing competences is concerned, a
further interesting feature that is inherent to this particular productive skill may
additionally be pointed out. By their very nature, summative tests almost inevitably need
to focus on product writing: only the final result (i.e. the student’s finished text) is
considered as a basis for the eventual assessment. While the previous chapter has
demonstrated that numerous key insights can of course be gained from such a complex
sample, another unavoidable corollary is that summative assessments cannot possibly
take into account the various features of the actual process that have led to this product.
Yet using strategies such as planning, drafting, editing and re-drafting 2 in appropriate
ways is undeniably part of the student’s writing competences as well. In this regard,
portfolio work (including samples from different stages of the composition process) could
for example provide a useful alternative to draw upon in order to complement the product
focus of summative tests with a process one, and thus to lead to an even more complete
picture of the student’s writing abilities.
Aside from the aforementioned positive impact on the content validity of test tasks,
an alignment to the communicative approach of the CEFR can also raise the validity of
the applied assessment principles by catering more suitably for the proficiency levels that
the students are likely to have reached at precise stages of their language learning.
Through a realistic, criteria-referenced system that allows for remaining imperfections in
the students’ language productions, more appropriate and precisely defined standards can
thus be set as targets that the students’ productions need to meet. Indeed, especially at
lower levels (such as A2), the beneficial effects of moving away from an excessively
form-focused assessment scheme have repeatedly been pointed out in this thesis. By
avoiding an unfair comparison of the students’ work to the unrealistic expectations of
native-speaker flawlessness, more important (and valid) features such as the overall
communicative success of a particular production sample can be put into the centre of the
assessment system, and gauged in relation to the actual linguistic means that students can
be expected to have at their disposal at a given stage of their language learning process.
At the same time, however, the practical experimentation with marking grids has
also revealed a number of prerequisites that need to be respected when trying to derive a
workable summative assessment system from the original CEFR scales. Hence, the need
for adaptation to precise purposes and needs, firmly and explicitly encouraged by the
Framework writers themselves, is reflected in the more concise and specific terminology
that has to be used in marking grids for both speaking and writing tests. Of course, the
CEFR descriptor scales can serve as an important basis when it comes to defining
appropriate ‘basic’ and ‘target’ standards to be generally attained in the students’
corresponding performances. Yet their overall proficiency focus ultimately implies that
any attempt to simply equate the original descriptor scales with numerical values (e.g.
mark X for level A1, mark Y for level A2…) for an isolated summative assessment would
be ill-advised. Since single performances can never completely reveal the underlying
2 See Harmer, op.cit., p.257.
competences at work, it would indeed be contradictory to directly apply the CEFR
descriptors in a way that would suggest that this was actually possible. Moreover, a
marking system spanning several CEFR levels would evidently not be appropriate to
identify the relatively small nuances between the individual performances in a class of
school-based learners who mostly operate at the same level (in this case A2). Instead, it
seems more sensible to use CEFR-derived descriptors that still reflect the overall
competence levels to be reached in a given learning cycle at school yet can be applied to
situate a specific, “one-off” performance in relation to both content- and language-related
criteria with more precision.
In turn, this also leads back to the difficulty of uniting achievement and proficiency
factors when assessing a student’s speaking or writing performance in a summative test.
If verifying the achievement of particular learning objectives constitutes a fundamental
purpose of any such test, the added influence of overall target language proficiency
becomes evident if purposeful, communicative writing or speaking tasks are to be dealt
with successfully. One possible way in which this dual challenge can be approached has
been exemplified in the various sample tests described in this thesis: if the designed
productive tasks are generally (but not exclusively) centred on grammatical structures and
topical lexis encountered and developed in class prior to the summative test, the students’
ability to deal with the tasks in an appropriate way depends in no small part on an
adequate mastery of those items (and thus their achievement of fairly narrow objectives).
However, as they additionally need to call upon ‘all and any [supplementary] language at
their disposal’ to unify the different parts of their answers into a coherent whole, a more
general insight into overall speaking or writing proficiency may also be gained through
such meaningful and communicative productions. Rather than unsuccessful
‘experimental’ structures that the student may have included in his performance, recurrent
errors in relation to familiar and rehearsed items might then be regarded as more
legitimate reasons for lower form-related marks since they arguably represent the
learner’s failed achievement of intermediate objectives. If too many such errors mark the
student’s production, an overall failure to correctly transmit the communicative message
may of course be the consequence as well. However, if this is not the case, it is important
that the overall assessment of a student performance takes into account that numerous
additional elements (rather than just a few specific grammar and lexis items) come into
play during a true “free” writing or speaking activity, and that the resulting end product
should thus not be approached with the same grammar-focused achievement foci as
would be the case for discrete-item tasks. Through the use of multifaceted marking grids
that sensibly split up various content, coherence, lexis and grammar features, such a
nuanced form of assessment is certainly encouraged.
Nevertheless, it has also become clear in this thesis that several CEFR components
need to be further adjusted to a school-based context. Thus, the leniency with which some
A2 descriptors handle accuracy (in terms of grammar but also spelling and punctuation,
for instance) does not always make them particularly well suited to application in
summative assessments. This explains why, as in the cases explored in the previous two
chapters, marking grids for speaking and writing performances of ‘A2’ learners may
sometimes aim for ‘B1’-level accuracy. While still allowing for a number of mistakes to
be included in a suitable performance (and thus respecting one of the most valuable
characteristics of the lower CEFR levels), these descriptors also reflect that systematic
and in-depth exploration of specific language items has preceded the test performance in
the classroom; this should, after all, not be neglected.
5.1.2. Effects on reliability
The use of suitable criteria-based marking grids leads to promising ramifications in
regard to various aspects of reliability as well. As illustrated in chapters 3 and 4, student
performances can be analysed in much more nuanced, targeted and unbiased ways if
soundly researched and clearly defined assessment criteria rather than holistic and
arguably impulsive impressions provide the basis for assessment. In that sense,
individual scorer reliability can certainly be increased through the rigorous application
of the same exact criteria to all student productions. Of course, this does not mean that
strategies such as norm-referencing need to be completely abandoned; for example,
comparing two student productions in relation to the same criterion in the marking grid in
order to see which one more fittingly corresponds to the phrasing of a particular
descriptor certainly makes sense. However, it is important that each student’s work is not
exclusively compared to his peers’, so that the attainment of a given mark is possible by
reaching a fair(er) and reasonable standard.
Ideally, criteria-referenced assessments can also pave the way towards higher inter-rater reliability. Indeed, if every single teacher consistently bases his assessments on
exactly the same marking criteria as his colleagues, the chance that major discrepancies
will affect the eventually awarded scores is bound to be reduced. However, this is not an
automatic certainty. As was pointed out in regard to the marking grids that were used to
assess both speaking and writing performances, the descriptors which such grids contain
cannot be absolutely precise since they need to anticipate (and be applied to) a
sufficiently wide range of linguistic structures and content elements that the students
might use in their performances. Yet if that is the case, then not all the expressions in
such descriptors will be interpreted in exactly the same way by all the teachers who refer
to them.
Indeed, while the syllabus for 9TE (for instance) provides a list of grammatical
functions (as well as possible topic areas, notions and communicative tasks) to be covered
over the course of the learning cycle 3 , there is no explicit gradation in terms of ‘simple’
or ‘complex’ structures, for example. This is certainly an understandable omission in the
light of the extremely complicated (and perhaps impossible) decision-making process
which this would have involved; in fact, such a distinction of structures into different
levels of complexity would be very difficult to justify in purely objective terms, and
would almost inevitably have to rely on arbitrary judgments to a certain extent. However,
as a result, various subjective interpretations can still be applied to the
descriptors in the marking grids, with correspondingly divergent outcomes 4 .
To counteract the risks of such persisting discrepancies, it is clear that a consensus
needs to be firmly established among teachers as to what exact requirements can be
associated with the different quality descriptors in marking grids. A first and foremost
prerequisite evidently consists in familiarising oneself with these assessment tools and
with the competence-based framework that inspired them; only by actively consulting and
applying them in practice can a true ‘feel’ for the logic behind these grids be acquired. To
reach a consensus of interpretation, the necessity of cooperation with other teachers
then also becomes apparent. If the risk of simply substituting one way of reaching overly
subjective (and thus unreliable) assessments with another is to be avoided, it is clear that
open-minded discussions and exchanges about potentially ambiguous descriptors or
criteria must become a priority; only in that way can a satisfactory and unified approach
to summative assessment ultimately be reached. At school level, informed opinions that
individual teachers have reached through practice (but also challenges they have encountered)
could be exchanged in departmental meetings. Regularly attending teacher training
seminars which expressly deal with the practical application of marking grids to specific
benchmark samples also constitutes a useful initiative, particularly if the insights gained
3 See syllabus for 9TE (2009-2010 version) as published on http://programmes.myschool.lu, pp.10-12.
4 See section 4.4. in the previous chapter.
are subsequently shared with other teachers at one’s own school. On a national scale,
discussions and decisions about the applied assessment strategies or tools (for example in
meetings of the different Commissions Nationales des Programmes) can further contribute to
consensus building, so that an increase in inter-scorer reliability can be deliberately and
systematically pursued. All of these different steps certainly stress the importance of
using a common framework to guarantee a greater harmonisation of testing and
assessment strategies across the country.
5.1.3. Feasibility of the explored testing and assessment systems
The practical implementation of skills-oriented writing and speaking tasks in the
classroom has led to useful insights about feasibility in regard to the three key test phases
of design, administration and assessment. First of all, these experiments have illustrated
that setting up test tasks which deliberately focus on communicative purposes is well
within the reach of every practitioner. The systematic contextualisation and search for
authenticity-enhancing factors may of course increase the necessary amount of time to
compile convincing skills-based test tasks in comparison to more traditional, narrowly
grammar-focused exercises. However, particularly in the case of writing, the required
modifications in test design are often much smaller and easier to realise than one might
initially have expected. As long as elements such as targeted format, audience, topic and
purpose are respected, a writing task can quickly pursue a meaningful, communicative
goal. In the case of speaking, the integration of both production and interaction elements
may prove slightly more challenging; the search for suitable visual prompts may
additionally require some time. Nevertheless, considering the numerous possibilities that
are offered by multimedia and Internet resources nowadays, even this process has rapidly
become much less time-consuming.
In this respect, additional benefits of a more collective approach to teaching could
easily emerge as well. If competence-based tasks and test layouts are mutually shared (or
even cooperatively set up) by teachers in the same school, for instance, excessive
amounts of time and effort spent on individual research and preparations can be replaced
by a much more fruitful and efficient way of working as part of a resourceful ‘community
of practice’. If the resulting tests are then implemented in various classes of the same
level, increased reliability will additionally exist between the results of those different
classes (and the respective students may also perceive increased fairness if the level of
difficulty of the tests they take is not inextricably linked to the person who teaches them
that year).
During the administration phase of the speaking and writing tests described in the
previous two chapters, it was possible to gain some useful insights about their feasibility
as well. The inclusion of more communicative writing tasks into summative class tests
did not pose any major problems, as the elicitation of slightly more extensive writing
samples could simply take the place of more traditional “writing” exercises that had
essentially focused on grammar or had rather inaptly been referred to as "reading comprehension"
(i.e. questions about known texts essentially representing pure memorisation tasks).
Nevertheless, as numerous students tended to address these writing tasks only towards the
end of the summative tests, factors such as time pressure and concentration loss may
occasionally have affected the quality of their performances; in turn, this slightly reduced
the reliability of inferences that could consequently be drawn about their proficiency
levels. As an alternative, one could for example explore the possible implementation of
“pure” writing tests (instead of combining them with tasks that focus on other skills),
which would allow the learners more time to plan and organise their answers and thus
arguably offer them a fairer chance to access the full potential of their writing
competences. At the same time, however, it is of course paramount not to devote a
disproportionate amount of classroom time purely to the testing of different skills; after
all, sufficient time must remain for language learning in the classroom as well. This can
only happen if students do not constantly have the impression that their main activity at
school consists in an endless cycle of preparing for and writing tests.
A more complicated case emerges in regard to the feasibility of administering
regular speaking tests in the classroom. As illustrated in chapter 3, the actual
implementation of a unique summative oral test will almost inevitably take a significant
amount of time. If sufficiently extensive performance samples are to be gathered through
a range of productive and interactive activities, a minimal test duration of 5 minutes per
student (or 10 minutes per pair of candidates) is virtually inevitable even at A2 level. Yet
with classes of often more than twenty students, this means that three or four entire
lessons will then be necessary for a single test. Even though the rest of the class can of
course be asked to deal with other assignments while their peers are being examined, and
classroom time is thus not entirely "lost", teachers may still feel uncomfortable devoting
such a significant number of lessons purely to testing. Additionally, it has been suggested
that solutions will have to be found to keep the majority of the class occupied during the
test administration phase; as spare rooms and available staff to supervise the students are
often at a premium in daily teaching practice, one may have to conduct speaking tests in
the students’ own classroom (with their peers in the background). While such worries
may be countered by pointing at the previously neglected and thus long overdue focus on
speaking skills which must (and also legitimately should) be part of a balanced,
competence-based teaching and assessment scheme, it is true that time and location
problems may surface if the theoretical intentions of the curriculum are not adequately
supported through suitable practical solutions 5 .
However, taking into consideration that skills-oriented tests do not need to focus as
narrowly on a few individual objectives as used to be the case for traditional grammar-based tests, one may argue that the intervals between regular summative tests could in
fact be further stretched without imposing excessive amounts of revision material on the
learners. In that sense, one possible initiative to counteract time pressures would be to
rethink the current three-term organisation of our secondary school system in favour of a
semester-based one. Presuming that the number of tests to be administered would not
simultaneously (or, if so, only marginally) be raised, the allocation of four lessons to
speaking tests might then be easier to take on board, as more classroom time would still
be available for extensive and systematic competence-based language learning.
Alternatively, if the current course organisation is kept, the reintroduction of genuine “test
periods” (e.g. one week to be completely set aside for testing at the end of a term) might
be worthy of consideration, particularly taking into account that similar oral tests will
have to be conducted on a regular basis in other foreign language courses as well.
A third imaginable solution would be to repeatedly conduct smaller summative
assessments during regular speaking activities instead of using a single, bigger test for
that purpose. In that scenario, the teacher would simply walk around the classroom during
those activities while the students would all be simultaneously engaged in speaking acts;
marking grids could then be filled out for a few pairs of learners by listening in on their
respective conversations without needing extra time for another formal test. However,
there are several potential drawbacks to that strategy. For instance, not all students’
performances could be sampled during one such activity; different speaking activities
would subsequently have to be implemented in other lessons and, as a consequence, not
all the student performances would have dealt with the same types of topics or tasks, and
5 These time problems are of course particularly relevant in EST classes, where a maximum of four English
lessons is available per week in 8e and 9e TE (as opposed to six in 6eM classes of the ES system).
the reliability of test results would inevitably suffer. The collection of data would also be
problematic due to the high level of noise in the classroom and the impossibility of
recording material (at least in usable quality). Finally, this would also take us back to the
problem of essentially turning an excessive number of classroom activities into tests
instead of more constructive learning opportunities.
However, even if the current system may not be ideal for a maximally efficient
implementation of speaking tests just yet, the feasibility of integrating such tests into
normal teaching routines is ultimately strongly linked to the practitioners’ willingness to
find suitable solutions. In chapter 3, the described practical setup provided one example
of how this can be achieved; other ways are surely imaginable as well. Considering that
the massive importance of speaking skills (even at low levels) can finally be truly
acknowledged in a competence-based teaching and assessment system, simply using time
and location issues as an “easy way out” would lead to nothing more than a misguided
perpetuation of unbalanced testing traditions – and, considering the disregard of explicit
syllabus instructions, an actual ‘dereliction of duty’. Hence, even if conditions favouring
the feasibility of speaking tests could certainly still be optimised in the present school
system (for example through the more general adaptations suggested above), it would
clearly be wrong to view such tests as impossible to accommodate in our daily teaching
routines at this point in time.
As far as the assessment of student performances goes, one of the most important
conclusions to be drawn from the results of the speaking and writing tests described in
this thesis is certainly the realisation that the students were generally able to deal with all
the constituent tasks in satisfactory ways (as reflected in their largely sufficient final
marks). Thus, the proposed types of skills-based tasks were clearly feasible for "pre-intermediate" learners even with the limited resources that their low proficiency level
implied. Hence, especially the widespread assumption that significant speaking samples
cannot be collected in A2-level classes (and that speaking tests should therefore not be
systematically implemented there) has been emphatically refuted. In the same vein, the
students’ various writing samples have shown that they are undeniably capable of
producing coherent and communicative texts if the corresponding tasks (and assessment
criteria) are appropriately tailored to their proficiency level; an over-insistence on more
restrictive, discrete-item exercises in “writing” tests has therefore become obsolete.
The applied marking grids themselves have also proven to be very useful once the
theoretical soundness of their descriptors and a practical ‘gradation of quality’ (for
example into five bands rather than merely three) have been established. The resulting
criteria-referenced assessment system favours a more analytic and nuanced interpretation
of the students’ speaking and writing performances than an overly holistic and impulsive
judgment would allow. Even though this new approach might be slightly more time-consuming, it encourages a more thorough and relevant type of error analysis. In turn, this
not only augments the validity and reliability of the assessment; it also facilitates the
provision of pertinent and founded feedback to the student about precise areas that he or
she needs to improve on. Besides, it conveys increased objectivity and fairness to the
student. In that sense, the small amount of extra time arguably invested into the
assessment procedure is undoubtedly worthwhile.
5.2. The Luxembourg ELT curriculum, the CEFR and competence-based assessment: perspectives
The conclusions reached above illustrate that competence-based forms of testing and
assessment certainly have the potential of leading to numerous promising and vital
changes in the overall approach to language teaching in our national school system.
However, numerous challenges still lie ahead before the Luxembourg ELT curriculum
has been fully adapted so as to maximally realise this potential; this final section will
therefore explore some of the most salient developments which may still be necessary in
the future.
In that context, it is first of all important to stress the exact meaning and scope of
the two key concepts of ‘curriculum’ and ‘syllabus’. As Jonnaert et al. point out, a
‘curriculum’ is a much wider concept providing the overall framework and direction for a
national education system in terms of its guiding pedagogic principles and intended
outcomes:
Un curriculum est un ensemble d’éléments à visée éducative qui, articulés entre
eux, permettent l’orientation et l’opérationnalisation d’un système éducatif à
travers des plans d’actions pédagogiques et administratifs. Il est ancré dans les
réalités historiques, sociales, linguistiques, politiques, économiques, religieuses,
géographiques et culturelles d’un pays, d’une région ou d’une localité. 6
6 Philippe Jonnaert, Moussadak Ettayebi, Rosette Defise, Curriculum et compétences – Un cadre
opérationnel, De Boeck (Brussels: 2009), p.35.
While its predominant focus can lie on a number of different factors 7 , any pertinent
curriculum ensures that the corresponding school system provides its students with a
relevant and efficient education by taking into account such crucial factors as socio-political, linguistic and economic contexts and necessities. A ‘syllabus’ (or ‘programme
d’études’), on the other hand, is in fact one of the practical means to realise the general
curricular goals:
Si le curriculum oriente l’action éducative dans un système éducatif, les
programmes d’études définissent les contenus des apprentissages et des
formations.
In essence, Jonnaert et al. therefore define the relation between ‘curriculum’ and
‘syllabus’ as one of ‘hierarchical inclusion’: if the curriculum postulates the overall
direction and aims of an education system, the different syllabi correspondingly need to
guide teachers as to how these can be pursued in practice through clear indications of
course contents and teaching methods. To avoid contradictions or excessively dissimilar
foci and orientations between the various syllabi, it is of crucial importance that the
curriculum links them through a coherent and shared logic 8 .
By describing, illustrating and analysing examples of how summative tests can be
adapted to more competence-oriented ways of assessing the productive skills, this thesis
has been centred on a core and decisive feature of the ongoing reformation of a national
school system which largely seeks an alignment with the six proficiency levels and
overall communicative approach of the CEFR. At the same time, this focus on testing and
assessment reflects the most common influence which the CEFR has had across Europe
since its publication; as Little points out, ‘to date, [the CEFR’s] impact on language
testing far outweighs its impact on curriculum design and pedagogy’ 9 . However, it is
clear that numerous other elements than only testing and assessment need to be adapted if
7 Jonnaert et al. draw attention to four different ‘curricular ideologies’ (p.30):
1. The ‘Scholar Academic Ideology’, which prioritises the transmission of knowledge;
2. The ‘Social Efficiency Ideology’, which focuses on social factors and aims to produce individuals
who fit into their surrounding society and help to maintain it;
3. The ‘Learner Centered Ideology’, which targets the ‘social, intellectual, emotional and physical’
development of the individual learner;
4. The ‘Social Reconstruction Ideology’, which is ‘based on a vision of society’ and thus ‘considers
education as a means to facilitate the construction of an equitable society’.
8 Ibid., p.31: ‘Le curriculum assure une cohérence inter-programmes d’études et évite que ces derniers ne
s’isolent en autant de silos avec une logique et une terminologie qui leur est chaque fois spécifique.’
9 David Little, ‘The Common European Framework of Reference for Languages: Perspectives on the
Making of Supranational Language Education Policy’ in The Modern Language Journal, 91, iv (2007),
p.648.
a truly coherent and competence-oriented curriculum is to guide the language teaching
practices in our national education system. As Little further stresses,
[t]here are two ways in which the CEFR can influence official curricula and
curriculum guidelines. On the one hand, desired learning outcomes can be
related to the common reference levels… On the other hand, the CEFR’s
descriptive scheme can be used to analyze learners’ needs and specify their
target repertoire in terms that carry clear pedagogical implications. 10
In the current Luxembourg ELT curriculum, attempts to integrate both of these steps are
clearly visible. The way in which ‘learning outcomes’ of the various syllabi are being
aligned with specific CEFR levels is for example illustrated through the A2-level
proficiency that 9e and 6e students are currently expected to reach in order to successfully
complete their respective learning cycles. Correspondingly, the ‘target repertoire’ of
linguistic competences (which has been adapted in view of the ‘learners’ needs’ from the
CEFR) is clearly stated in regard to the ‘four skills’ both in the official syllabi for those
classes and in the end-of-cycle document (‘complément au bulletin’) which certifies their
achievements accordingly.
In this respect, one may argue that the overall alignment of a language curriculum
with the CEFR potentially grants a certain coherence to the syllabi for the various
constituent learning cycles, as the learner’s progression is consistently measured through
the attainment of different proficiency levels that stem from one same, unified
framework. In terms of overall curricular aims, the competence-based approach of the
CEFR also points to a promising, learner-centred pedagogy that abandons an obsolete,
pure transmission of knowledge which may prove of little use to the learners once they
have finished school; instead, the school system aims to produce independent and
competent language users (rather than individuals who may know more about the
technicalities of a language, but struggle to communicate effectively and with sufficient
fluency). Given the Europe-wide impact of the CEFR on the education systems of
numerous member states (including the entry requirements to their higher education
institutions), the choice to pursue such an alignment also seems a sensible one if the
comparability and pertinence of results which students ultimately achieve in our school
system should be increased on an international level (even if, at the same time, Fulcher’s
warnings about a premature ‘reification’ of the CEFR should not be ignored, and thus an
overly hasty, forced and potentially invalid alignment to the Framework must be
carefully avoided).
10 Ibid., p.649.
However, the challenges which still lie ahead in the ongoing overhaul of the
Luxembourg ELT curriculum also become evident. As has been pointed out, the CEFR
neither presents an inherent focus on a school-based context nor (for that reason)
indicates pedagogical steps and measures which facilitate the progression from one
proficiency level to the next. What further complicates matters is that this development
process is not a linear one; as Heyworth notes, the levels ‘are not designed to be split up
into equal chunks of time in a syllabus, and it will take longer to move from B2 to C1
than from A1 to A2’ 11 , for instance. This creates a range of problems in terms of syllabus
design, both within each individual stage (or cycle) of the language learning course and
when it comes to linking them with each other in a coherent and valid way. Indeed, how
can we make sure that the expected CEFR level that we set for the end of a given learning
cycle is a realistic learning outcome, and that the pace of progression which we define is a
sensible and feasible one?
While the ‘A2’ level which this thesis has focused on was a logical choice for the
lowest classes of the English curriculum (given the extreme limitations of ‘A1’ and the
clearly excessive requirements of ‘B1’ for first-time 6e, 8e and 9e students 12 ), school-based learners’ further rate of progression is more difficult to predict. Preliminary
progress estimates (in terms of CEFR levels to be gradually attained) have been
established for both ES and EST systems in the Luxembourg school system 13 , replacing
the previous (rather vaguely defined) gradations from ‘elementary’ to ‘advanced’
proficiency levels, but in essence they still need to be validated through practice. At the
time when this thesis is being written, fitting competence-based syllabi for the attainment
of higher CEFR levels (from ‘B1’ onwards) are still being developed and need to be fine-tuned to our students’ language needs; naturally, they also have to tie in seamlessly with
the preceding A2 syllabi to guarantee a coherent and thoughtfully interlinked system. A
possible way to fulfil this latter requirement may for example consist in using the ‘target
standards’ defined for ‘A2’ as ‘basic standards’ for ‘B1’; indeed, a student who has fully
11 Heyworth, art.cit., p.17.
12 The adequacy of ‘A2’ for these classes was also partially underlined by the results of online placement
tests designed by the University of Oxford (http://) and implemented to identify the students’ proficiency in
terms of the CEFR levels. The majority of 6e and 9e students from the Luxembourg school system who
took these tests in June 2010 were ultimately attested an ‘A2’ level. However, these tests were purely based
on ‘English in use’ and ‘listening’ exercises; without a thorough assessment of more extensive writing and
speaking samples (i.e. evidence of productive skills development), these results can only be of an indicative
nature.
13 See synopsis for syllabus for 6eM and 5eC (2009), p.4 and synopsis for syllabus for 9TE (2009), p.7
(accessible via www.myschool.lu).
Chapter 5
153
achieved the ‘target standard’ in the first English cycle has effectively started working
within ‘B1’ (even more so since some of the ‘band 5’ descriptors in the corresponding
marking grids are, as has been shown, more geared towards ‘B1’ or ‘A2+’ than a basic
‘A2’). Nevertheless, given that the CEFR does not offer any precise projections on how
long the transition from one level to the next may take in a school-based context, this will
to a certain extent have to be established (and possibly re-evaluated) through empirical
means.
Another important consideration affects the persisting reliance on an approach that
revolves around the ‘four skills’ and is astutely touched upon by Keddle:
In many secondary schools the programme tends to focus on reading and writing
skills, and makes more progress in these than in speaking and listening. When
matching a standard classroom syllabus with the CEF, one has to ‘hold back’
progress in reading and writing in order to allow the speaking and listening areas
to catch up. 14
While Keddle consequently welcomes the increased valorisation of speaking and listening
as a result of the CEFR's impact on syllabus design, her quotation also reveals a deeper-seated issue with the 'four skills' approach in view of certified achievements and defined
learning outcomes. In fact, the question arises whether we should invariably insist on a
single, unified CEFR level to be attained by our students at the end of a learning cycle
(i.e. A2 in all four skills), or whether ‘mixed profiles’ involving different levels for
different skills might sometimes make more sense. If Keddle opposes writing and reading
to speaking and listening, one might similarly set the students’ progress in the receptive
skills against the headway they make with the productive ones. After all, a student whose
writing corresponds to A2 proficiency may be perfectly able to read and understand a B1-level text. In that sense, it will be necessary to consider to what extent global or mixed-level requirements may be the most suitable solution for the definition of targeted
learning outcomes at various stages of the ELT curriculum.
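Purely as an illustration of what such 'mixed profiles' amount to in practice, the short Python sketch below records a learner's attained CEFR level per skill and checks it against either a unified target or a mixed-level target. The learner data, target definitions and function names are invented for this example and are not drawn from the CEFR or from any syllabus document.

# Illustrative sketch only: per-skill CEFR profiles checked against unified or mixed targets.
# The level ordering follows the CEFR's global scale (A1 lowest, C2 highest).
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def at_least(attained, required):
    """True if the attained CEFR level meets or exceeds the required one."""
    return CEFR_ORDER.index(attained) >= CEFR_ORDER.index(required)

def meets_target(profile, target):
    """Check a per-skill profile against a per-skill (unified or mixed) target."""
    return all(at_least(profile[skill], required) for skill, required in target.items())

# Invented learner whose receptive skills are ahead of the productive ones.
learner = {"listening": "B1", "reading": "B1", "speaking": "A2", "writing": "A2"}

unified_target = {skill: "B1" for skill in learner}   # 'B1 in all four skills'
mixed_target = {"listening": "B1", "reading": "B1",
                "speaking": "A2", "writing": "A2"}    # mixed-level requirement

print(meets_target(learner, unified_target))  # False: the productive skills are still at A2
print(meets_target(learner, mixed_target))    # True: the mixed profile is satisfied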
In a similar vein, the question also arises which (unified or mixed) final levels our
ES and EST systems should respectively aim for, and how these achievements can be
validly certified – after all, if the students’ performances in both receptive and productive
skills should determine the attainment of the targeted proficiency level, then the format of
our final examinations (and the correspondingly applied assessment strategies) will have
to significantly change as well.
14 Keddle, art.cit., p.48.
All of the abovementioned issues evidently transcend the scope of the present thesis
by far; however, they will most certainly have to be addressed in depth if a
coherent and valid competence-based ELT curriculum is to guide our national education
system at all levels.
A distinctly positive development which the move towards a competence-based
teaching and assessment system implies (and which has been underlined in this thesis) is
the move away from an excessively de-contextualised and grammar-focused (i.e.
knowledge-centred) language curriculum. This does not mean that grammar will cease
to be an important component in our English courses; far from it. However,
when ‘preparing…syllab[i] and developing activities that truly reflect both the CEF and
the tried and tested grammar strands’ 15 , a two-way adaptation process is necessary. On
the one hand, the original CEFR descriptors and levels do not carry sufficiently precise
indications about specific grammatical structures and forms that are needed to reach a
given proficiency level; therefore, a more systematic grammar dimension must be added
to them in a school-based context. On the other hand, if a curriculum is largely based on
the distinct functional/situational approach of the CEFR, the excessive concentration on
grammatical elements will be reduced and supplemented by more communicative
aims. As a result, more purposeful and situated learning (where grammar is not studied in
a de-contextualised way but rather integrated and used to fulfil a communicative aim)
will come to dominate both classroom activities and test tasks.
As Jonnaert et al. also crucially stress, it is only in such a way that a competence-based curriculum can ultimately truly make sense. Not only do they point to the
particularly fruitful effects of an appropriate contextualisation of learning situations 16 ;
they also postulate that competences must in fact be constructed through action in – or in
response to – precise situations. The assessment of whether or not a given competence
has been developed to a sufficient extent is then based on the adequate treatment of a
similar situation at a later point in time 17 . In that sense, a fundamentally competenceoriented curriculum should inform about the types of behaviour that a learner needs to
demonstrate to be deemed ‘competent’, and (through the various syllabi) communicate
pedagogic steps and strategies to teachers as to how the development of these
15 Ibid., p.44.
16 Jonnaert et al., op.cit., p.70: 'les stratégies orientées vers la contextualisation des contenus d'apprentissage et la construction de sens par les apprenants, ont statistiquement le plus d'effets positifs sur les résultats des apprentissages.' ['the strategies geared towards the contextualisation of learning content and the construction of meaning by the learners have, statistically, the most positive effects on learning outcomes.'] Emphasis added.
17 Ibid., pp.68-71.
competences can be fostered through appropriate situations in the classroom 18. The
alignment of curricular goals with the CEFR and the different competences it defines as
necessary for the fulfilment of various communicative purposes certainly comes much
closer to such a logic than an approach which simply defines a range of de-contextualised
and grammar-based objectives to be achieved. Understandable anxieties that this
could lead to an excessive neglect of the transmission of knowledge about the target language
to the learners are countered by Jonnaert et al. in a way that is reminiscent of the CEFR
approach focusing on ‘what the learners do with grammar’:
les savoirs disciplinaires ne sont ni exclus ni minimisés dans une approche
située. Leur utilisation comme ressource en situation renforce au contraire la
pertinence de leur construction par les étudiants. 19
['Disciplinary knowledge is neither excluded nor minimised in a situated approach. On the contrary, its use as a resource in situation reinforces the pertinence of its construction by the students.']
As a final consideration, a further interesting quality of the 'situated approach' which
Jonnaert et al. thus advocate is the implication that the corresponding learning situations
are bound to be complex and multi-disciplinary 20 . Although the competence-oriented
rewriting of individual syllabi for the various secondary school subjects is currently
taking place in a parallel rather than interconnected way, exploiting their commonalities
more deliberately and efficiently could certainly prove a valuable and rewarding direction
to take in the future. Indeed, substituting the strict separation and juxtaposition of
different disciplines with a more integrative and combined approach (reflecting and
maximising for example the benefits of the plurilingual aspects of numerous learner
competences alluded to in the CEFR) may constitute a potent (albeit challenging) way to
further enhance the curricular coherence in our education system.
Yet wherever we ultimately go from here, it is certain that the successful shift
towards a fully effective competence-based teaching and assessment scheme will only be
possible if the practitioners do their utmost to find suitable and practicable applications of
the curricular guidelines “in the field” 21 . As this thesis has tried to exemplify, there are
for instance many pertinent and feasible ways in which this can be done in regard to
developing, testing and assessing our students’ productive skills even at lower levels.
Given the thorough and nuanced insight which this gives us into their overall target
18 Ibid., p.71.
19 Ibid., p.90. Emphasis added.
20 Ibid., p.72: 'Ces situations sont par nature complexes et pluridisciplinaires.' ['These situations are by nature complex and multidisciplinary.']
21 See ibid., p.52: 'L'adhésion des enseignants au curriculum constitue une variable très importante de la réussite des innovations proposées à travers les réformes curriculaires.' ['The teachers' adherence to the curriculum constitutes a very important variable in the success of the innovations proposed through curricular reforms.']
language proficiency (and thus makes it possible to verify the success of our language
teaching in a significantly wider scope), it highlights one of the numerous things that we
can do to guide our language teaching in a direction that is not only much more relevant
to “real world” contexts, but also produces results of higher validity, reliability and
potentially even international comparability.
Bibliography

Books:
ASTOLFI, Jean-Pierre, L’erreur, un outil pour enseigner, ESF (Paris: 1997)
BROWN, H. Douglas, Principles of Language Learning and Teaching (5th ed.),
Longman/Pearson (New York: 2007)
BROWN, H. Douglas, Teaching by Principles, An Interactive Approach to Language
Pedagogy (3rd ed.), Pearson Longman (New York: 2007)
COHEN, Louis, MANION, Lawrence & MORRISON, Keith, Research Methods in
Education, Routledge (London / New York: 2007)
COUNCIL OF EUROPE, Common European Framework of Reference for Languages:
Learning, Teaching, Assessment, Cambridge University Press (Cambridge: 2001)
DENSCOMBE, Martin, The Good Research Guide (3rd ed.), Open University Press (New
York: 2007)
HARMER, Jeremy, The Practice of English Language Teaching, Pearson Longman
(Harlow, England: 2006)
HASSELGREEN, Angela et al., Bergen ‘Can Do’ project, Council of Europe
(Strasbourg: 2003)
JONNAERT, Philippe, ETTAYEBI, Moussadak & DEFISE, Rosette, Curriculum et
compétences – Un cadre opérationnel, De Boeck (Brussels: 2009)
UNIVERSITY OF CAMBRIDGE ESOL EXAMINATIONS, Key English Test –
Handbook for Teachers, UCLES (Cambridge: 2009); no individual authors indicated.
UNIVERSITY OF CAMBRIDGE ESOL EXAMINATIONS, Preliminary English Test
for Schools – Handbook for teachers, UCLES (Cambridge: 2008); no individual
authors indicated.
UR, Penny, A Course in Language Teaching: Practice and Theory, Cambridge University
Press (Cambridge: 2006)
Articles and essays:
ALDERSON, J. Charles et al., ‘Analysing Tests of Reading and Listening in Relation to
the Common European Framework of Reference: The Experience of the Dutch CEFR
Construct Project’ in Language Assessment Quarterly, 3, 1 (2006), pp.3-30.
ALDERSON, J. Charles, ‘The CEFR and the Need for More Research’ in The Modern
Language Journal, 91, iv (2007), pp.659-663.
BONNET, Gerard, ‘The CEFR and Education Policies in Europe’ in The Modern
Language Journal, 91, iv (2007), pp.669-672.
FIGUERAS, Neus, ‘The CEFR, a Lever for the Improvement of Language Professionals
in Europe’ in The Modern Language Journal, 91, iv (2007), pp.673-675.
FULCHER, Glenn, ‘Are Europe’s tests being built on an ‘unsafe’ framework?’ in The
Guardian Weekly (18 March 2004), accessible at
http://www.guardian.co.uk/education/2004/mar/18/tefl2.
FULCHER, Glenn, ‘Testing times ahead?’ in Liaison Magazine, Issue 1 (July 2008),
pp.20-23, accessible at http://www.llas.ac.uk/news/newsletter.html.
GOODRICH, Heidi, ‘Understanding Rubrics’ in Educational Leadership, 54, 4 (January
1997), pp.14-17.
GOODRICH ANDRADE, Heidi, ‘Using Rubrics to Promote Thinking and Learning’ in
Educational Leadership, 57, 5 (February 2000), pp.13-18.
HEYWORTH, Frank, ‘Why the CEF is important’ in Morrow, Keith (ed.), Insights from
the Common European Framework, Oxford University Press (Oxford: 2004), pp.12-21.
HUHTA, Ari et al., ‘A diagnostic language assessment system for adult learners’ in J.
Charles Alderson (ed.), Common European Framework of Reference for Languages:
learning, teaching, assessment: case studies, Council of Europe (Strasbourg: 2002),
pp.130-146
HULSTIJN, Jan H., ‘The Shaky Ground Beneath the CEFR: Quantitative and Qualitative
Dimensions of Language Proficiency’ in The Modern Language Journal, 91, iv
(2007), pp.663-667.
LENZ, Peter, ‘The European Language Portfolio’ in Morrow, Keith (ed.), Insights from
the Common European Framework, Oxford University Press (Oxford: 2004), pp.22-31.
LITTLE, David, ‘The Common European Framework of Reference for Languages:
Perspectives on the Making of Supranational Language Education Policy’, The
Modern Language Journal, 91, iv (2007), pp.645-653.
KEDDLE, Julia Starr, 'The CEF and the secondary school syllabus' in Morrow, Keith
(ed.), Insights from the Common European Framework, Oxford University Press
(Oxford: 2004), pp.43-54.
KRUMM, Hans-Jürgen, ‘The CEFR and Its (Ab)Uses in the Context of Migration’, in
The Modern Language Journal, 91, iv (2007), pp.667-669.
MARIANI, Luciano, ‘Learning to learn with the CEF’ in Morrow, Keith (ed.), Insights
from the Common European Framework, Oxford University Press (Oxford: 2004),
pp.32-42.
MORROW, Keith, ‘Background to the CEF’ in Morrow, Keith (ed.), Insights from the
Common European Framework, Oxford University Press (Oxford: 2004), pp.3-11.
NORTH, Brian, ‘Relating assessments, examinations, and courses to the CEF’ in
Morrow, Keith (ed.), Insights from the Common European Framework, Oxford
University Press (Oxford: 2004), pp.77-90.
NORTH, Brian, ‘The CEFR Illustrative Descriptor Scales’ in The Modern Language
Journal, 91, iv (2007), pp.656-659.
POPHAM, W. James, ‘What’s Wrong – and What’s Right – with Rubrics’ in Educational
Leadership, 55, 2 (October 1997), pp.72-75.
WEIR, Cyril J., ‘Limitations of the Common European Framework for developing
comparable examinations and tests’ in Language Testing, 22 (2005), pp.281-299.
Accessible at http://ltj.sagepub.com/cgi/content/abstract/22/3/281
WESTHOFF, Gerard, ‘Challenges and Opportunities of the CEFR for Reimagining
Foreign Language Pedagogy' in The Modern Language Journal, 91, iv (2007), pp.676-679.
Websites and online documents:
CAMBRIDGE ESOL: Teacher Resources – KET, accessible at
http://www.cambridgeesol.org/resources/teacher/ket.html#schools
CAMBRIDGE ESOL: Teacher Resources – PET, accessible at
http://www.cambridgeesol.org/resources/teacher/pet.html
COUNCIL OF EUROPE, Manual for relating Language Examinations to the Common
European Framework of Reference for Languages, accessible at
http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp#TopOfPage.
OXFORD ENGLISH TESTING: Oxford Online Placement Test, accessible at
http:// www.oxfordenglishtesting.com
mySchool! (for syllabi and syllabus-related documents for 6eM and 9TE), accessible at
http://programmes.myschool.lu and www.myschool.lu
Seminar handouts and documents:
CLARKE, Martyn, PowerPoint notes from the seminar ‘Creating writing tasks from the
CEFR’, held in Luxembourg City in October 2009, © Oxford University Press
CLARKE, Martyn, PowerPoint notes from the seminar ‘Creating speaking tasks from the
CEFR’, held in Luxembourg City in February 2010, © Oxford University Press
HORNER, David, PowerPoint notes from the seminar ‘The logic behind marking grids’,
held in Luxembourg City in March 2010.
MORROW, Keith, handouts from the seminar 'Mapping and designing competence-based tests of speaking and writing', held in Luxembourg City in October 2009.
Coursebooks and other sources of teaching material:
GAMMIDGE, Mick, Speaking Extra, Cambridge University Press (Cambridge: 2004)
HUTCHINSON, Tom, Lifelines Pre-Intermediate Student’s Book, Oxford University
Press (Oxford: 1997)
HUTCHINSON, Tom & WARD, Ann, Lifelines Pre-Intermediate Teacher’s Book,
Oxford University Press (Oxford: 1997)
Google Images, www.google.com (for pictures)
Wikipedia, en.wikipedia.org (for information about famous people and pictures)
List of appendices

A. Appendices linked to speaking tests
• Appendix 1: Scripted T-S questions used in Speaking Test 1 (p.163)
• Appendix 2: Cue cards and visual prompts used in Speaking Test 1 (p.164)
• Appendix 3: Marking grid used for Speaking Test 1 (p.168)
• Appendix 4: Sample assessment sheets used during Speaking Test 1 (p.169)
• Appendix 5: Final results of Speaking Test 1 (p.172)
• Appendix 6: Sample handouts with visual prompts used in Speaking Test 2 (p.173)
• Appendix 7: Sample handouts for S-S interaction in Speaking Test 2 (p.179)
• Appendix 8: Marking grid used for Speaking Test 2 (p.181)
• Appendix 9: Sample assessment sheets used during Speaking Test 2 (p.182)
• Appendix 10: Final results of Speaking Test 2 (p.185)

B. Appendices linked to writing tests
• Appendix 11: Sample student productions from Test I,1 (p.187)
• Appendix 12: Marking grid used for writing tasks (p.188)
• Appendix 13: Free writing tasks for summative test II,1: instructions (p.189)
• Appendix 14: Sample student productions and assessment sheets from Test II,1 (p.190)
• Appendix 15: Final results of writing performances in summative test II,1 (p.194)
• Appendix 16: Alternative free writing tasks – informal letters and emails (p.195)
• Appendix 17: Alternative free writing tasks – story writing (p.196)
A) APPENDICES LINKED TO SPEAKING TESTS
Appendix 1: Scripted T-S questions used in Speaking Test 1 1

1. To student A: Good morning / afternoon. What's your first name?
   To student B: And what's your first name? Can you spell that, please?
   Backup prompt: Can you tell me your first name, please?
2. To student A: What's your surname? Can you spell that, please?
   Backup prompt: How do you write your surname?
3. To student B: Student B, where are you from?
   To student A: And where are you from?
   Backup prompts: What is your nationality? What is your home country?
4. To student A: Do you have any brothers and sisters?
   To student B: And how many brothers and sisters do you have? What are their names?
   Backup prompt: How many people do you live with?
5. To student A: What is your father's job?
   To student B: What does your mother do?
   Backup prompts: What does your father do? What job do you want to do?
6. To student A: Where do you live? What is your address?
   To student B: And what is your address? And where do you live?
   Backup prompt: Do you live in…?
7. To student A: Do you like your hometown? Why?
   To student B: What about you? Do you like your hometown?
8. To student A: Are there any problems where you live?
   To student B: Are there any problems in your hometown?
   Backup prompt: What's negative about living in your hometown?
9. To student A: What do you do in your free time?
   To student B: What don't you like to do in your free time?
10. To student A: What did you do last weekend?
    To student B: What did you do on Friday and Saturday?
11. To student A: Do you have an e-mail address? Can you spell it, please?
    To student B: And what is your e-mail address? Can you spell it, please?

1 Note: per pair of candidates, only a selection of questions was used (and not necessarily in the identical order presented on this page).
Appendix 2: Cue cards and visual prompts used in Speaking Test 1

Source of images: Google images (www.google.lu) / Wikipedia (en.wikipedia.org)

Celebrity fact-file cards (each card also showed a photograph of the celebrity in question):
• Paris Hilton – Date of Birth: 17/02/1981; Place of Birth: New York City; Work: model, singer, actress, businesswoman; Marital status: single (boyfriend: Doug Reinhardt); Children: /
• Angelina Jolie – Date of Birth: 04/06/1975; Place of Birth: Los Angeles; Awards: 1 Oscar (1999); Husband: Brad Pitt; Children: 6 (3 adopted, 3 biological)
• Brad Pitt – Date of Birth: 18/12/1963; Place of Birth: Oklahoma; Parents: William Alvin Pitt / Jane Etta; Last film: Inglourious Basterds (2009); Friends: George Clooney, Matt Damon
• Robert Pattinson – Date of Birth: 13/05/1986; Nationality: English; Start of career: 2004; Marital status: single; Place of residence: Los Angeles
• Amy Winehouse – Date of Birth: 14/09/1983; Nationality: English; Last album: Back to Black (2006); Most famous song: 'Rehab' (2006); Marital status: divorced (Blake Fielder-Civil)
• Katy Perry – Date of Birth: 25/08/1984; Place of Birth: Santa Barbara, California; Start of career: 2001; First hit: 'I Kissed A Girl' (May 2008); Marital status: single (boyfriend: Russell Brand)
• Eminem – Real name: Marshall Bruce Mathers III; Date of Birth: 17/08/1972; Work: rapper, actor; First hit: 'The Real Slim Shady' (2001); Family: ex-wife Kimberley Anne Scott, daughters Alaina & Whitney
• 50 Cent – Real name: Curtis James Jackson III; Date of Birth: 06/07/1975; Start of career: 1996; Last album: Before I Self-Destruct (2009); Children: 1 boy (Marquise Jackson)
• Beyoncé Giselle Knowles – Date of Birth: 04/09/1981; Place of Birth: Houston, Texas; Biggest success: 5 Grammy Awards (2004); Marital status: married (rapper/producer Jay-Z); Children: /
• Shakira – Real name: Shakira Isabel Ripoll; Date of Birth: 02/02/1977; Nationality: Colombian; First English album: Laundry Service (2001); Place of residence: The Bahamas
• Robbie Williams – Date of Birth: 13/02/1974; Work: singer-songwriter; Place of Birth: Stoke-on-Trent, England; First solo hit: 'Freedom' (1996); Girlfriend: Ayda Field
• Thierry Henry – Date of Birth: 17/08/1977; Place of Birth: Les Ulis, Paris, France; Job: football player (Barcelona FC); Biggest successes: World Cup win (1998) and Champions League win (2009); Children: 1 daughter (Téa)

Question-prompt cards ('Celebrity number: 1' to 'Celebrity number: 12'), each headed 'Ask your partner the following questions about his/her celebrity':
• What / your celebrity / look like? – When / born? – Where / born? – Which awards / win? (When?) – married? – have / children? (How many? Names?)
• What / your celebrity / look like? – When / born? – Where / born? – Job? – married? – have / children? (How many? Names?)
• What / your celebrity / look like? – When / born? – Where / born? – What / parents' names? – What / his last film? When / make it? – have / famous friends? (Who?)
• What / your celebrity / look like? – When / born? – Where / from? – When / start / career (= Karriere)? – married? – Where / live / now?
• What / your celebrity / look like? – When / born? – What nationality? – Title of last album? When / come out? – Most famous song? – married?
• What / your celebrity / look like? – When / born? – Where / born? – When / start / career (= Karriere)? – What / be / first hit? When / come out? – married?
• What / your celebrity / look like? – What / artist name? – What / real name? – What / job? – Title of first hit? When / come out? – have / family? (married? children?)
• What / your celebrity / look like? – What / artist name? – What / real name? – When / start / career (= Karriere)? – Title of last album? When / come out? – have / children? (How many? Names?)
• What / your celebrity / look like? – When / born? – Where / born? – What / biggest success? (When?) – married? – have / children? (How many? Names?)
• What / your celebrity / look like? – What / artist name? – What / real name? – Where / from? – Title of first album? When / come out? – Where / live / now?
• What / your celebrity / look like? – When / born? – Job? – Where / born? – Title of first hit? When / come out? – girlfriend? (Name?)
• What / your celebrity / look like? – When / born? – Where / born? – Job? – What / biggest success? (When?) – have / children? (How many? Names?)
Appendix 3: Marking grid used for Speaking Test 1

Band 3
• Grammar and Vocabulary: Uses basic sentence structures and grammatical forms correctly. Succeeds in using relevant vocabulary to communicate in everyday situations and carry out the tasks set.
• Pronunciation: Speech is intelligible throughout though mispronunciation may occur. Speech is mostly fluent, with little (self-)correction. Basic control of intonation. Occasional L1 interference still to be expected and acceptable. No serious effort and little or no prompting by listener required.
• Interactive Communication: Communication is confidently handled in everyday situations. In general, responses are relevant and meaning is conveyed successfully. Can react quite spontaneously, ask for clarifications and give them when prompted.
• Global Achievement: All parts of all tasks are successfully dealt with in the time allotted. Speech and attitude reflect willingness to engage in English. Student shows readiness to take measured risks in making him/herself clearly enough understood.

Band 2
• Grammar and Vocabulary: Uses a limited range of structures with some grammatical errors of most basic type. Sufficient control of relevant vocabulary with minor hesitations.
• Pronunciation: Speech is mostly intelligible, despite limited control of phonological features. Occasional hesitations and pauses; inconsistent handling of intonation. Some effort by the listener is required. Pronunciation and intonation are still heavily influenced by L1. Some efforts, prompting and assistance by listener required.
• Interactive Communication: Communication is occasionally strenuous, but everyday situations are still mostly dealt with. Responses tend to be evasive, but meaning is generally conveyed successfully. Can ask for clarification in English.
• Global Achievement: One part of the task is not (or not fully) dealt with. Speech tends to be minimalistic, with a reluctance to engage in English. Student generally avoids taking risks.

Band 1
• Grammar and Vocabulary: Has little awareness of sentence structure and little control of very few grammatical forms. Insufficient control of relevant vocabulary, speech often presented in isolated or memorised phrases.
• Pronunciation: Speech is repeatedly unintelligible, with frequent mispronunciation. Speech is monotonous; little awareness of intonation. Considerable effort by the listener is required. Pronunciation and intonation are excessively aligned to L1. Considerable efforts, prompting and assistance by listener required.
• Interactive Communication: Communication is erratic and repeatedly breaks down. Some long pauses may occur. Inability to respond or response is largely irrelevant. May use L1 to ask for clarification.
• Global Achievement: Most of the tasks are dealt with insufficiently (or not at all). Speech and attitude produce a negative impression (reflect unwillingness to engage in English). Unwillingness to take part in everyday-type conversations. No risk-taking.

Band 0 (all criteria): No rateable language. Totally incomprehensible. Totally irrelevant.
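Purely as an illustration of how band decisions of the kind defined in this grid can be recorded and turned into an overall score such as the final marks reported in Appendices 4 and 5, the short Python sketch below stores one band per criterion and totals them. The assumption that the final mark is simply the sum of the four band scores (a maximum of 4 x 3 = 12) is made for the sake of the example only; it is not taken from the marking grid or from the official syllabus documents.

# Illustrative sketch only: one band (0-3) per criterion of the Speaking Test 1 grid,
# aggregated into a total under the assumption (made for this example) that the
# final mark is the simple sum of the four band scores.
CRITERIA = ["Grammar and Vocabulary", "Pronunciation",
            "Interactive Communication", "Global Achievement"]
MAX_BAND = 3  # bands in this grid run from 0 to 3

def final_mark(bands):
    """Check the recorded bands and return their sum (maximum 4 x 3 = 12)."""
    for criterion in CRITERIA:
        if not 0 <= bands[criterion] <= MAX_BAND:
            raise ValueError(f"Band for '{criterion}' must lie between 0 and {MAX_BAND}.")
    return sum(bands[criterion] for criterion in CRITERIA)

# Invented example: one candidate's bands for the four criteria.
candidate = {"Grammar and Vocabulary": 2, "Pronunciation": 3,
             "Interactive Communication": 2, "Global Achievement": 3}

print(final_mark(candidate))  # prints 10 (out of a maximum of 12)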
Appendix 4: Sample assessment sheets used during Speaking Test 1 1

[Scanned assessment sheets for Students 20, 9, 12, 10, 16 and 6. Note (regarding Student 20's sheet): the crossed-out, corrected marks in this assessment grid reflect the two-stage assessment process mentioned in section 3.2.5. of the main text. They exemplify the slight amendments that were occasionally made in the closer consideration of individual criteria during the second stage.]

1 The text boxes have been inserted to guarantee the learners' anonymity. Student numbers are identical to those used in appendices 5 and 10.
Appendix 5: Final results of Speaking Test 1

[Bar charts: 'Speaking Test 1: Final Marks', together with separate charts of the 'Grammar and Vocabulary marks', 'Pronunciation marks', 'Interactive Communication marks' and 'Global Achievement marks' awarded to each student.]
Appendix 6: Sample handouts with visual prompts used in Speaking Test 2 1

Speaking test II,3: Part ONE (each handout presented two pictures):
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to go to ONE of these two places. Which one will you choose? Why?
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to live in ONE of these two places. Which one will you choose? Why?
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to get a holiday in ONE of these two places. Which one will you choose? Why?
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to get ONE of these two houses. Which one will you choose? Why?
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to get ONE of these two jobs. Which one will you choose? Why?
• 1. Describe the two pictures. What do you see in each one of them? 2. You have the chance to do ONE of these activities. What will you do? Why?

1 Source of images: Microsoft Word (Office 2007) Online Clipart Library / Google images (www.google.com)
Appendix 7: Sample handouts for S-S interaction in Speaking Test 2

Speaking test II,3: Part 2 (student A) 1
You are a tourist in this town, and you don’t know your way around. Ask your partner for
directions to the following places:
1. You are at the bus station. You want to go to the Royal Hotel. Ask for directions
politely.
2. Now you want to leave Royal Hotel. You want to go to a bakery next. Ask for
directions politely.
Student B instructions:
You live in this town. Your partner is a tourist here and will ask you for directions to two
different places. Use the map to give your partner the correct directions. Be as precise as
possible (use street names, the position of buildings…).
1 Source of map: Tom Hutchinson & Ann Ward, Lifelines Pre-Intermediate Teacher's Book, Oxford University Press (Oxford: 1997), p.128.
Speaking test II,3: Part 2 (student B) 2
You are a tourist in the following town, and you don’t know your way around. Ask your
partner for directions to the following places:
1. You are at the train station. You want to go to the hospital. Ask for directions
politely.
2. Now you’re finished at the hospital. You want to go to the cinema next. Ask for
directions politely.
Student A instructions:
You live in this town. Your partner is a tourist here and will ask you for directions to two
different places. Use the map to give your partner the correct directions. Be as precise as
possible (use street names, the position of buildings…).
2 Source of map: Tom Hutchinson, Lifelines Pre-Intermediate Student's Book, Oxford University Press (Oxford: 1997), p.52.
Appendix 8: Marking grid used for Speaking Test 2
Source: official syllabus for 9TE, October 2009 version (http://programmes.myschool.lu), p.72.

Assessment criteria: CONTENT (task response); PRONUNCIATION AND DISCOURSE MANAGEMENT (pronunciation; fluency – speech rate and continuity; effort to link ideas and language so as to form a coherent, connected speech; prompting and support); LEXIS (appropriacy; range; accuracy; paraphrase); GRAMMATICAL STRUCTURES (appropriacy; range; accuracy).

BAND 5
• Content: In general, response is relevant. Meaning conveyed successfully. Communication handled in everyday situations.
• Pronunciation and discourse management: Intelligible throughout though mispronunciation may occasionally cause momentary strain for the listener. Produces simple speech fluently. Usually maintains flow of speech but uses repetition, self-correction and/or slow speech to keep going. May over-use certain connectives and discourse markers. Requires little prompting and support.
• Lexis: In general, adequate range of vocabulary for the task. Attempt to vary expressions, but with inaccuracy. Paraphrase attempted, but with mixed success.
• Grammatical structures: In general, adequate range for the task. Good degree of control of simple grammatical forms and sentences. Both simple and complex forms are used. Minor and occasional errors.

BAND 4: Features of bands 3 and 5. Little effort by listener required.

BAND 3
• Content: Response is mostly relevant, with some digressions. Basic meaning conveyed in very familiar everyday situations. Some effort on behalf of the listener required.
• Pronunciation and discourse management: Speech is intelligible, despite limited control of phonological features. Effort on behalf of the listener required. Noticeable pauses and slow speech, frequent repetition and self-correction. Very short basic utterances, which are juxtaposed rather than connected or linked through repetitious use of simple connectives. Requires prompting and support.
• Lexis: Limited range of vocabulary which is minimally adequate for the task and which may lead to repetition. Paraphrasing rarely attempted.
• Grammatical structures: Only a limited range of structures is used. Limited control of basic forms and sentences. Subordinate structures are rare and tend to lack accuracy. Frequent grammatical errors.

BAND 2: Features of bands 1 and 3. Serious effort by listener required.

BAND 1
• Content: Most of the response is not relevant. Little communication possible.
• Pronunciation and discourse management: Speech is often unintelligible. Long pauses before most words. Responses limited to short phrases or isolated words. Hardly any control of organisational features even at sentence level. Requires considerable prompting and support.
• Lexis: Only isolated words or memorised utterances are produced. Little or no paraphrasing attempted.
• Grammatical structures: Cannot produce basic sentence forms. Limited control of very few grammatical forms.

BAND 0: No rateable language. Totally incomprehensible. Totally irrelevant.
Appendix 9: Sample assessment sheets used during Speaking Test 2 1

[Scanned assessment sheets for Students 13, 14, 12, 10, 5 and 21.]

1 The text boxes have been inserted to guarantee the learners' anonymity. Student numbers are identical to those used in appendices 5 and 10.
Appendix 10: Final results of Speaking Test 2 1

[Bar charts: 'Speaking Test 2: Final Marks', together with separate charts of the 'Content marks', 'Pronunciation and discourse management marks', 'Lexis marks' and 'Grammatical structures marks' awarded to each student.]

1 The omission of students 8+9 in these charts is based on the fact that these students had left the class after the first term. For a similar reason, student 23 (who joined the class in the second term) has been added. All other student numbers have been kept identical to the results of speaking test 1 to allow for a better comparability of performance.
B) APPENDICES LINKED TO WRITING TESTS
Appendix 11: Sample student productions from Test I,1 1

[Scanned handwritten productions by Students A, B and C.]

1 The text boxes have been inserted to guarantee the learners' anonymity.
Appendix 12: Marking grid used for writing tasks
Source: official syllabus for 9TE, October 2009 version (http://programmes.myschool.lu), p.76.

Assessment criteria: CONTENT (task achievement; format (if required); effect on reader); COHERENCE AND COHESION (logic; fluency; control of linking devices, referencing...; paragraphing); LEXIS (appropriacy; range; accuracy); GRAMMATICAL STRUCTURES (range; accuracy).

BAND 5
• Content: In general, content elements addressed successfully: message clearly and fully communicated to the reader. Awareness of format.
• Coherence and cohesion: In general, coherent response though there may be some inconsistencies. Good control of simple sentences and use of simple linking devices such as 'and', 'or', 'so', 'but' and 'because'.
• Lexis: In general, appropriate and adequate range for the task. Few minor errors, which do not reduce communication and are mainly due to inattention or risk taking.
• Grammatical structures: Sufficient range of structures for the task. Complex sentence forms attempted, but they tend to be less accurate than simple sentences. Few minor errors, which do not reduce communication and are mainly due to inattention or risk taking.

BAND 4: Features of bands 3 and 5. Little effort by reader required.

BAND 3
• Content: Only 2/3 of the content elements dealt with: message only partly communicated to the reader; and/or not all content elements dealt with successfully: message requires some effort on behalf of the reader. Format only partly respected.
• Coherence and cohesion: Short simple sentences which are simply listed rather than connected and presented as a text. Information presented with some organisation.
• Lexis: Limited range which is minimally adequate for the task and which may lead to repetition. Limited control of spelling and/or word formation.
• Grammatical structures: Limited range of structures. Frequent grammatical errors and faulty punctuation, which cause difficulty in terms of communication.

BAND 2: Features of bands 1 and 3. Serious effort by reader required.

BAND 1
• Content: Only about 1/3 of the content elements dealt with and/or hardly any content elements dealt with successfully: message hardly communicated; message requires excessive effort by the reader. No awareness of format.
• Coherence and cohesion: Response is seriously incoherent. Hardly any control of organisational features even at sentence level.
• Lexis: Extremely limited range. Hardly any control of spelling and word formation.
• Grammatical structures: Extremely limited range of structures. Essentially no control of structures and punctuation.

BAND 0: No rateable language. Totally incomprehensible. Totally irrelevant.
Appendix 13: Free writing tasks for summative test II,1: instructions
8. Free writing (9 marks)
Choose either A or B and write about 80-100 words (on your answer sheet).
A. ‘Next week’s horoscope’
You are a horoscope writer for a magazine. Your article for next week is almost
finished, but you still have to write the last horoscope (for Pisces). In your text,
make predictions and give advice (= Ratschläge / conseils) about:
• love and friendship
• money and finances
• work and/or studies
You can add any other important elements that you can think of.
OR:
B. ‘The greatest summer camp ever!’
You are the manager of a summer camp. You want more people to come to your
camp, so you write a publicity brochure about it. In your text, describe:
• the things that the participants will see and do at the camp;
• the buildings, rooms and services that you offer;
• any other reasons why people should come to your camp.
Remember: You want as many people as possible to come to your camp. So make
your text convincing (= überzeugend) and original!
Appendix 14: Sample student productions and assessment sheets from Test II,1 1

1. 'Horoscope' task
[Scanned productions and assessment sheets of Student A (student 18) and Student B (student 13).]

2. 'Summer camp' task
[Scanned productions and assessment sheets of Student C (student 17) and Student D (student 20).]

1 The text boxes have been inserted to guarantee the learners' anonymity. Student numbers are identical to those used in appendices 5 and 10.
Appendix 15: Final results of writing performances in summative test II,1 1

[Bar charts: 'Writing Test II,1: Final Marks', together with separate charts of the 'Content bands reached', 'Coherence and cohesion bands reached', 'Lexis bands reached' and 'Grammatical structures bands reached' by each student.]

1 Student numbers refer to the same individual students as in the general result graphs for both speaking tests. Students 8+9 had already left the class prior to this test. Student 23 had not yet joined, while student 21 was absent on the day the test was written. Therefore, these students have not been included in these graphs.
Appendix 16: Alternative free writing tasks – informal letters and emails
1. Informal letter writing
Ex.7: Free writing – a letter to a friend (10 marks)
You are spending a year in another country at the moment. Write a letter of 80-100 words
to your best friend at home. Use a new page on your answer sheet. Say:
• where you are and why;
• what you have done there until now;
• your best/worst experiences (= Erfahrungen) in the foreign country. (What
happened? When?)
Use the correct style for an informal letter. Put your address (in the foreign country), your
friend’s address and the date into the right places on the page.
2. Informal email writing
You have received the following email from an English-speaking exchange student.
Reply with an informal email and answer all his questions in 100-120 words.
Hi there,
my name’s Eddie and I’m a student from Miami, Florida. Next month I’m
coming to Luxembourg as an exchange student. I’ll have to go to your school
for three months, so that’s why I’m writing to you!
My cousin did the same exchange programme last year. She really liked your
school, but she says that the food in your canteen was horrible when she was
there. Do you agree? What kinds of food did they use to serve there? What
was it like last year?
Now, my cousin has heard that your school canteen has been completely
redone. So tell me, what has changed? Has the food become better?
Over here in the States, most students go to the canteen every day. Do many
students go to your canteen, or do they go to any other places for lunch?
What do you usually do at midday?
Please write to me soon. I’ll write again next week with more questions.
Thanks for your time!
See you,
Eddie.
Appendix 17: Alternative free writing tasks – story writing
5. Free writing: ‘An extraordinary night…’ (10 marks)
On your answer sheet, write an ending for ONE of the following stories.
A. Last summer, my parents were away on holiday. One night, some of
my best friends came over to my house. During the first two hours, we
all had great fun: we ordered pizzas, put on some music, and we
talked a lot. But suddenly, just as …
B. I am a student in London. One time, I went to the cinema with a good
friend from university. But when I returned, I saw that the front door
of my flat was slightly open. I could see that a light was on inside …
In your story, describe:
• what you and other people (for example your friends) did (be creative!);
• how you felt.
Your story must have about 80-100 words and a clear, coherent structure (→ paragraphs,
linking words…).